
Fast, High-Fidelity LLM Decoding with Regex Constraints

Constraining the text generated by an LLM is essential for developing robust LLM applications. For example, a developer may require LLM responses to adhere to a specific JSON or YAML schema so that they are comprehensive and reliably parseable.

Powerful tools and methods (Beurer-Kellner et al., 2022; Lundberg & Ribeiro, 2023; Willard & Louf, 2023; Zheng et al., 2023) have been published in the past year to address this need. Notably, the open source library `Outlines`1 implements an efficient method (Willard & Louf, 2023) that ensures compliance with an arbitrary regex.

In this blog post, I propose two novel, fast, high-fidelity alternatives to this method, `DirectMerge` and `CartesianMerge`. Specifically, I will:

  1. briefly present how `Outlines` enforces regex constraints with a token-level DFA,
  2. describe three challenges caused by the improper token sequences it allows,
  3. introduce `DirectMerge`, which guarantees both compliance with the regex and proper tokenization,
  4. introduce `CartesianMerge`, which scales better with the complexity of the regex, and discuss its computational cost.

This blog post comes with a technical appendix formalizing and proving the algorithms. Additionally, a Python notebook with implementations of `DirectMerge` and `CartesianMerge`, along with additional empirical results, is available.

Deterministic Finite Automata and Constrained Decoding

I now briefly explain the principle of `Outlines`. Please read the original paper (Willard & Louf, 2023) or this blog post for a more comprehensive overview.

As a preprocessing step, `Outlines` converts a character-level deterministic finite automaton (DFA) testing whether a string matches a regex (cf. Figure 1a) into a token-level DFA testing whether a token sequence is decoded into a string matching the regex (cf. Figure 1b):

`"a*b*"` DFA

Figure 1. Top left: character-level DFA for \(\texttt{a*b*}\), top right: token-level DFA returned by \(\texttt{Outlines}\) for the same regex, bottom: token masking at decoding time.

At decoding time, the DFA is used to determine, at each step, which tokens are allowed. Starting from the initial state of the DFA, we read the allowed tokens off the outgoing transitions of the current state, apply the corresponding mask to the next-token probabilities and renormalize them. We can then sample a new token and update the state of the DFA.

For example, let’s imagine that a few tokens have already been generated. Starting from the initial state, each of these tokens moved the DFA along the corresponding transition, so the DFA is now in some state. The tokens allowed next are simply those labeling an outgoing transition from this state. We then keep the next-token probabilities of these tokens and renormalize them so that they sum to 1.
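
To make this loop concrete, here is a minimal sketch of DFA-guided sampling (not the `Outlines` implementation), with a toy token-level DFA for `a*b*`, a three-token vocabulary and a placeholder `lm_next_token_logits` function standing in for a real model call.

```python
import math, random

# Toy token-level DFA for "a*b*" with a 3-token vocabulary: 0 = "a", 1 = "b", 2 = EOS.
# transitions[state][token_id] -> next state; a missing entry means the token is forbidden.
transitions = {0: {0: 0, 1: 1}, 1: {1: 1}}
accept_states = {0, 1}
EOS = 2

def lm_next_token_logits(token_ids):
    """Placeholder for a real language model call returning one logit per vocabulary entry."""
    return [0.5, 1.0, 0.2]

def constrained_sample(max_len=10):
    state, output = 0, []
    for _ in range(max_len):
        logits = lm_next_token_logits(output)
        allowed = set(transitions.get(state, {}))
        if state in accept_states:
            allowed.add(EOS)  # stopping is allowed on an accept state
        # Mask forbidden tokens, then renormalize the remaining probabilities.
        weights = [math.exp(l) if t in allowed else 0.0 for t, l in enumerate(logits)]
        total = sum(weights)
        token = random.choices(range(len(weights)), weights=[w / total for w in weights])[0]
        if token == EOS:
            break
        output.append(token)
        state = transitions[state][token]  # follow the DFA transition
    return output

print(constrained_sample())  # e.g. [0, 0, 1], which decodes into "aab"
```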

Three Challenges due to Improper Token Sequences

The token sequences potentially generated with `Outlines` are exactly those that are decoded into a string matching the regex. This includes token sequences that do not result from the tokenization of any string, i.e. sequences that the tokenizer itself would never produce. We call them improper token sequences below.

Consider the regex `"boolean: ((true)|(false))"`. While there are only two strings matching this regex, the corresponding DFA returned by `Outlines` has 16 states and 48 transitions (Figure 2). For example, the first token can be chosen among several tokens, each corresponding to a prefix of the string `boolean: `.

DFA built by Outlines for `"boolean: ((true)|(false))"`

Figure 2. DFA built by \(\texttt{Outlines}\) for the \(\texttt{"boolean: ((true)|(false))"}\) regex

In contrast, as we will see in the following sections, the DFA built by `DirectMerge` and `CartesianMerge` only accepts the token sequences resulting from the tokenization of compliant strings and it has only 2 states and 4 transitions (Figure 3).

DFA built by DirectMerge and CartesianMerge for `"boolean: ((true)|(false))"`

Figure 3. DFA built by \(\texttt{DirectMerge}\) and \(\texttt{CartesianMerge}\) for the \(\texttt{"boolean: ((true)|(false))"}\) regex

Challenge #1: Skewed Probability Distribution

The permissiveness of `Outlines` with improper token sequences may significantly distort the probability distribution of the LLM, beyond what is strictly necessary to enforce the regex constraint.

To illustrate this, let’s imagine that we generate a continuation of the prompt `Question: What is the first name of the US president who was a member of the Sigma Alpha Epsilon fraternity?\nAnswer:` with a regex constraint only allowing the two completions ` Theodore` and ` William` (the true answer is William, as in William McKinley, Theodore Roosevelt’s predecessor).

When the text generation is unconstrained, we can sum the probabilities of all token sequences leading to one of the two possible answers. If we do so with a temperature equal to 1, we notice that almost all the probability mass for these two answers comes from proper token sequences (99.8586% for one answer and 99.9997% for the other).

If we constrain the text generation with `Outlines` and multinomial sampling as a decoding strategy, any prefix of ` William` or ` Theodore` can be accepted as the first token and, unless it is just the blank space, this first token determines the answer. For example, the token ` The` can be selected as the first token and, in this case, the answer will necessarily be Theodore. As a result, the probability of generating Theodore is the sum of the probabilities of generating ` T`, ` The`, etc. as the first token, even though these first tokens, apart from the one starting the proper tokenization, almost never lead to Theodore. The answer Theodore is thus significantly boosted by the sole virtue of having a very common token, ` The`, as a prefix, and this is indeed what we observe in this case.

With `DirectMerge` and `CartesianMerge`, only the tokens corresponding to the proper tokenization of the possible answers are taken into account, and the other prefix tokens are ignored. The resulting conditional probability is very close to the one in the unconstrained case, which makes sense given that the LLM spontaneously generates proper token sequences a large majority of the time.
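
The sketch below illustrates the difference with made-up numbers (they are not measured on a real model): under `Outlines`-style masking, every prefix token contributes its probability mass to the corresponding answer, whereas `DirectMerge` and `CartesianMerge` only credit the proper first token. The sketch also assumes, for simplicity, that the continuation is forced once the first token is chosen and that the proper tokenization starts with the full-word token.

```python
# Hypothetical next-token probabilities after the prompt (illustrative numbers only);
# only tokens relevant to the two answers are shown.
first_token_probs = {
    " The": 0.20, " Theo": 0.001, " Theodore": 0.002,
    " W": 0.01, " Will": 0.03, " William": 0.05,
}

def answer_shares(allowed_first_tokens):
    """Share of the renormalized allowed mass assigned to each answer, assuming
    the continuation is forced once the first token has been chosen."""
    mass = {
        answer: sum(first_token_probs[t] for t in tokens)
        for answer, tokens in allowed_first_tokens.items()
    }
    total = sum(mass.values())
    return {answer: round(m / total, 3) for answer, m in mass.items()}

# Outlines-style masking: any prefix of an allowed answer is accepted as first token.
print(answer_shares({
    "Theodore": [" The", " Theo", " Theodore"],
    "William": [" W", " Will", " William"],
}))  # Theodore captures most of the mass thanks to the very common token " The"

# DirectMerge/CartesianMerge: only the first token of a proper tokenization is accepted
# (assumed here to be the full-word token).
print(answer_shares({"Theodore": [" Theodore"], "William": [" William"]}))
```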

Challenge #2: Self-Intoxication

Another concern is that the generated tokens are fed back to the LLM to select the following ones. During pretraining, and depending on whether subword regularization techniques like BPE-Dropout (Provilkov et al., 2020) were used, an LLM was never or only rarely exposed to improper token sequences, so it may misinterpret the newly added tokens, which can degrade the quality of its output.

Challenge #3: Missed Speedup Opportunities

As suggested by two recent blog posts, constrained decoding also offers opportunities to accelerate text generation.

Let’s go back to the DFA provided by `DirectMerge` and `CartesianMerge` in Figure 3. Some of its states have a single outgoing transition, which means that we do not even need to compute the next-token probabilities when we reach them since only one token is possible. For this regex, the whole text can then be generated in just one LLM call. Meanwhile, the DFA built with `Outlines` for the same regex in Figure 2 has 16 states but only 3 states with a single outgoing transition (states 6, 7 and 13). There are then far fewer opportunities to skip computationally expensive LLM calls. Moreover, many potential paths are much longer than three transitions, which can also slow down decoding.
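
Here is a minimal sketch of this speedup, assuming the token-level DFA is given as nested dictionaries and using a hypothetical `sample_next_token` callback standing for one masked LLM sampling step (an illustration, not the notebook's implementation).

```python
def generate_with_skips(transitions, accept_states, sample_next_token,
                        start_state=0, max_len=64):
    """Decode under a token-level DFA, skipping the LLM call whenever the current
    state has a single outgoing transition and cannot stop (the token is forced).
    `sample_next_token(prefix, allowed)` stands for one masked LLM sampling step
    and returns None when the model chooses to stop on an accept state."""
    state, output = start_state, []
    while len(output) < max_len:
        outgoing = transitions.get(state, {})
        if not outgoing:
            break  # no outgoing transition: generation must stop here
        if len(outgoing) == 1 and state not in accept_states:
            (token, state), = outgoing.items()  # forced token: no LLM call needed
        else:
            token = sample_next_token(output, set(outgoing))
            if token is None:
                break
            state = outgoing[token]
        output.append(token)
    return output
```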

Ensuring Both Compliance and Proper Tokenization

We can now focus on `DirectMerge`. Even though its aims are slightly different, `DirectMerge` shares the same overall approach as `Outlines`: it derives a token-level DFA from a character-level DFA recognizing the regex, and this token-level DFA is used to identify the allowed tokens at decoding time.

`DirectMerge` applies to tokenizers based on a merge table, like Byte Pair Encoding tokenizers (Sennrich et al., 2016). Encoding a string with such a tokenizer consists of converting the string into a sequence of single-letter tokens and then performing an ordered list of merge operations. Each merge operation is defined by two tokens (e.g. `a` and `b`) and merges all the corresponding pairs of consecutive tokens, scanning from the left of the current token sequence.

| Step | Token sequence |
|------|----------------|
| Initial string | `aaaabaacac` |
| Conversion to a sequence of single-letter tokens | `a, a, a, a, b, a, a, c, a, c` |
| Token sequence after `(a, b)` | `a, a, a, ab, a, a, c, a, c` |
| Token sequence after `(a, a)` | `aa, a, ab, aa, c, a, c` |
| Final token sequence after `(a, c)` | `aa, a, ab, aa, c, ac` |
Table 1. Tokenization of the string \(\texttt{aaaabaacac}\) with the merge table \([(\texttt{a}, \texttt{b}), (\texttt{a}, \texttt{a}), (\texttt{a}, \texttt{c})]\). The order of the merge operations and the order of the letters in the string both affect the final token sequence.
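
As a sanity check, the tokenization in Table 1 can be reproduced with a short sketch that applies each merge operation in order, scanning left to right. This is a simplified model of a merge-table tokenizer, not the implementation of any specific library.

```python
def merge_table_tokenize(text, merge_table):
    """Tokenize by applying each merge operation in order, scanning left to right."""
    tokens = list(text)  # start from single-letter tokens
    for left, right in merge_table:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # merge this pair of consecutive tokens
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(merge_table_tokenize("aaaabaacac", [("a", "b"), ("a", "a"), ("a", "c")]))
# ['aa', 'a', 'ab', 'aa', 'c', 'ac']
```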

With `DirectMerge`, we start with the character-level DFA and apply a series of simple transformations for each merge operation. Let us first consider the case of an `(a, b)` merge operation with `a` different from `b`. `DirectMerge` affects each state with one or several incoming `a` transitions and an outgoing `b` transition, and its impact depends on two criteria:

  1. Does the state have non-`a` incoming transitions, or is it the start state?
  2. Does the state have non-`b` outgoing transitions, or is it an accept state?

More precisely, if the outgoing `b` transition of the affected state leads to another state, the transformations to apply depend on the answers to these two questions and are illustrated in Figure 4:

DirectMerge when two different tokens are merged

Figure 4. Examples of the transformations to apply in case of an \((a, b)\) merge operation with \(a \neq b\). Left: situation before \(\texttt{DirectMerge}\), right: situation after \(\texttt{DirectMerge}\).

The case of merge operations with two identical tokens is more challenging because not all pairs of matching tokens should be merged. For example, on the fourth row of Table 1 above, the second and third `a` are not merged because the merge operation is applied left to right and the second `a` is already merged with the first `a`. You can find the description of `DirectMerge` in the `a = b` case and the correctness proof for both cases in the technical appendix.

Scaling with the Complexity of the Regex

`DirectMerge` guarantees both compliance with the regex constraint and proper tokenization. However, it removes or adds states and transitions and, in some unfavorable circumstances, the number of states and transitions may become overwhelmingly large.

Fortunately, with `CartesianMerge`, we can efficiently simulate the DFA created by `DirectMerge` without explicitly building it. We can indeed notice that the target language is the intersection of the set of all token sequences decoded into strings matching the regex and the set of all proper token sequences. Both of these sets are regular languages and we already know how to build the corresponding DFAs: with `Outlines` for the first one and with `DirectMerge` applied to the trivial regex `.*` for the second one. Furthermore, as opposed to the DFA returned by `DirectMerge` for the full regex, they will not become uncontrollably large as the regex becomes more complex: `Outlines` does not add any state to the character-level DFA, and the DFA returned by `DirectMerge` for `.*` happens to have interesting properties that make it easy to compute and compress (cf. the technical appendix for more details).

Cross product operation with a DFA

Figure 5. The target language is the intersection of two regular languages and we already know how to build the DFAs for both of them.

Once we have built these two DFAs, we can keep track of their respective states at decoding time; the allowed tokens are those in the intersection of their respective sets of allowed tokens.

However, knowing which tokens are valid transitions in both DFAs is not enough: we also need to check that the generated tokens can eventually lead to a token sequence decoded into a string matching the regex. This was not an issue before because both `Outlines` and `DirectMerge` generate token-level DFAs whose states are all relevant (i.e. on a path between the start state and an accept state), as long as the initial character-level DFA only has relevant states. The relevant pairs of states can be computed with a breadth-first exploration starting from the pair of start states, followed by a backward exploration from the pairs of accept states using reversed transitions.
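
The following sketch shows one way to compute these relevant pairs and the resulting token mask, assuming each DFA is represented as a dictionary mapping a state to its token-labeled outgoing transitions (a minimal illustration, not the notebook's implementation).

```python
from collections import deque

def relevant_pairs(trans_a, trans_b, start_a, start_b, accept_a, accept_b):
    """State pairs of the product automaton lying on a path from the pair of start
    states to a pair of accept states. Each DFA is a dict: state -> {token: next}."""
    start = (start_a, start_b)
    forward, queue, predecessors = {start}, deque([start]), {}
    while queue:  # forward breadth-first exploration of the product automaton
        a, b = queue.popleft()
        for token, na in trans_a.get(a, {}).items():
            nb = trans_b.get(b, {}).get(token)
            if nb is None:
                continue  # token only allowed in one of the two DFAs
            predecessors.setdefault((na, nb), set()).add((a, b))
            if (na, nb) not in forward:
                forward.add((na, nb))
                queue.append((na, nb))
    accepting = {(a, b) for a, b in forward if a in accept_a and b in accept_b}
    relevant, queue = set(accepting), deque(accepting)
    while queue:  # backward exploration from the accepting pairs
        pair = queue.popleft()
        for pred in predecessors.get(pair, ()):
            if pred not in relevant:
                relevant.add(pred)
                queue.append(pred)
    return relevant

def allowed_tokens(state_a, state_b, trans_a, trans_b, relevant):
    """Tokens valid in both DFAs that lead to a relevant pair of states."""
    allowed = set()
    for token, na in trans_a.get(state_a, {}).items():
        nb = trans_b.get(state_b, {}).get(token)
        if nb is not None and (na, nb) in relevant:
            allowed.add(token)
    return allowed
```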

Is it Fast Enough?

Table 2 reports the execution time of the various steps of `CartesianMerge` for a sample of JSON strings from the `glaive-function-calling-v2` dataset. The delay of the first step is negligible because this activity only needs to be executed once for a given tokenizer and its outcome is reusable for all future regexes. The second and fourth steps also add little overhead, while the third step is the performance bottleneck.

| Step | Activity | Mean duration (seconds) | Comment |
|------|----------|-------------------------|---------|
| 1 | Compute the DFA recognizing proper token sequences with `DirectMerge` (preprocessing) | 20.9 | Only once for a given tokenizer |
| 2 | Compute the DFA recognizing the token sequences decoded into strings matching the regex with `Outlines` (preprocessing) | 0.687 | Only once for a given regex |
| 3 | Identify the relevant pairs of states (preprocessing) | 27.3 | Only once for a given regex |
| 4 | Build the token mask (decoding) | 0.00153 | At each decoding step |
Table 2. Execution time of the various steps of \(\texttt{CartesianMerge}\) on a sample of 1000 JSON strings from the \(\texttt{glaive-function-calling-v2}\) dataset.

The latency of the third step remains manageable if the regex constraint is reused a large number of times, but it may become an issue when a new regex is only provided at decoding time. This would typically be the case if an LLM provider offered a feature similar to OpenAI’s function calling. However, when the real goal is to adhere to a JSON or YAML schema, we can significantly shorten the preprocessing.

Oftentimes in practical situations, there will be parts of the regex surrounded by two delimiters with which they cannot merge. This is particularly interesting for us because we then know that the portion of the DFA representing these parts is unaffected by all other parts of the regex. Let’s call these situations Vegas configurations (because what happens between the delimiters stays between the delimiters).

For example, with the Mistral-7B tokenizer, any string starting with a blank space and delimited by a colon and a line break is a Vegas configuration, which is of course very useful for YAML schemas.

Vegas configurations can be leveraged in two ways, as illustrated by Figure 6: in particular, the DFA corresponding to the part between the delimiters can be computed and stored only once and then reused wherever this part appears.

Vegas configuration

Figure 6. The DFA corresponding to Vegas configurations, i.e. repeated parts surrounded by two delimiters with which they cannot merge, can be computed and stored only once. Top: naive implementation of \(\texttt{CartesianMerge}\), bottom: implementation of \(\texttt{CartesianMerge}\) taking advantage of Vegas configurations.

Conclusion

In this blog post, I introduced two alternatives to `Outlines`. `DirectMerge` and `CartesianMerge` also guarantee compliance with a regex constraint while accelerating decoding and mitigating the risk of distortions2 of the language model distribution.

It is unclear how often such distortions happen. However, it is easy to find examples of them and, should they happen in practice, they would probably go unnoticed. Given this uncertainty and the decoding speedup, `CartesianMerge` should be a no-regret option for most constrained decoding needs.

Beyond constrained decoding, it would also be interesting to assess whether the fact that models sometimes generate improper token sequences affects their performance. If it were the case, the DFA recognizing proper token sequences (obtained with `DirectMerge`) could be used to enforce proper tokenization.

Thanks to the authors of the papers mentioned below, in particular to Luca Beurer-Kellner for a helpful discussion, and to the contributors of the various open source resources used.

If you want to cite this blog post, you are welcome to use the following BibTeX entry:

@misc{Tran-Thien_2024,
title={Fast, High-Fidelity LLM Decoding with Regex Constraints},
url={https://vivien000.github.io/blog/journal/llm-decoding-with-regex-constraints.html},
journal={Unsupervised Thoughts (blog)},
author={Tran-Thien, Vivien},
year={2024}
}

References

  1. Beurer-Kellner, L., Fischer, M., & Vechev, M. (2022). Prompting Is Programming: A Query Language For Large Language Models. PLDI ’23.
  2. Lundberg, S., & Ribeiro, M. T. C. et al. (2023). Guidanceai/guidance: A guidance language for controlling large language models. https://github.com/guidance-ai/guidance
  3. Willard, B. T., & Louf, R. (2023). Efficient Guided Generation for Large Language Models.
  4. Zheng, L., Yin, L., Xie, Z., Huang, J., Sun, C., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., & Sheng, Y. (2023). Efficiently Programming Large Language Models using SGLang.
  5. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. In K. Erk & N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1715–1725). Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1162
  6. Provilkov, I., Emelianenko, D., & Voita, E. (2020). BPE-Dropout: Simple and Effective Subword Regularization. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 1882–1892). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.170
  1. In this blog post, I abusively use `Outlines` to denote the underlying method (Willard & Louf, 2023) rather than the open source library. 

  2. As illustrated in the second section, distortion here means a discrepancy between the probability distribution obtained with constrained decoding, \(P_{\text{constrained}}(s)\), and the probability distribution obtained with unconstrained decoding conditioned on the satisfaction of the constraint, \(P(s \mid s \in \mathcal{C})\), where \(\mathcal{C}\) is the set of the compliant strings. 
