See also: notes
TL/DR:
- Structured decoding allows precise control over LLM output formats
- vLLM now supports both outlines and XGrammar backends for structured decoding
- Recent XGrammar integration brings up to 5x improvement in time per output token (TPOT) under load
- Upcoming v1 release focuses on enhanced performance and schedule-level mask broadcasting for mixed-requests batch support
vLLM is the high-throughput and efficient inference engine for running large-language models (LLM). In this post, we will explore the annotated history of language models, describe the current state of structured decoding in vLLM, as well as the recent integration with XGrammar, and share a tentative roadmap for vLLM’s v1 improvement for structured decoding.
We would also invite users to tackle this blog post from a philosophical perspective, and in the process trying to posit that structured decoding represents a fundamental shift in how we think about LLM outputs. It also plays an important role in building complex agentic system.
For more information about vLLM, please check out our documentation.
language model, a brief historical context
If you have read about the history of the field before, feel free to skip this part to reason for structured decoding
The inception of AI might well be traced back to the origin of logics, where men put emphasis on reducing reasoning to some specific sets of calculations (a reductionist approach). As such, Plato generalised the belief in total formalisation of knowledge, where knowledge must be universally applicable with explicit definitions 1.
In 1950, Alan Turing posited that a high-speed digital computer, programmed with rules, would exhibit emergent behaviour of intelligence (Turing, 1950).
A paradigm quickly emerged among researchers in the 1950s, where expert systems were designed to replicate the decision-making capabilities of a human specialist 2, (or symbolic reasoning system), referred to by Haugland as Good Old-Fashioned AI (GOFAI) (Haugeland, 1997). However, it quickly ran into funding problems due to its semantic representation not being able to scaled up to generalized tasks (Also known as the “AI Winter” (Hendler, 2008)).
Concurrently, Donald Norman’s Parallel Distributed Processing (Rumelhart et al., 1986) group investigated variations of Rosenblatt’s perception (Rosenblatt, 1958), where they proposed hidden layers within the network alongside with inputs and outputs to extrapolate appropriate responses based on what it had learned during training process. These connectionist networks, often built on top of statistical methods3, are often referred as New-Fangled AI (NFAI) (Haugeland, 1997). Given the abundance of data and Moore’s Law4 resulting in an unprecedented amount of compute available, we see the complete dominance of connectionist networks in both research and production use-cases, with variants of decoder-only transformers5 for text generations tasks.
on the term "next-token prediction"
The term “next-token prediction” is also used sparsely when describing text-generation tasks, but it rather describes the “autoregressive” objectives of text generations. These models are trained to calculate a probability distributions for next tokens based on the causality of existing tokens within a sentence 6
You can think of intuitively as “the models try to pick from a set of words what possible options are applicable to be used given a sentence”. The model then “chooses” this words (or for more correct term tokens) based on a probability distribution and such iteratively repeats this process until it is “being told” to stop (in this case either mechanistically via configurations or out-of-memory (OOM)).
For more information on how these auto-regressive transformers work, I highly recommend checking out this awesome visualisation by Brendan Bycroft or 3Blue1Brown’s Deep Learning series.
For now, whenever we mentions LLMs, we assume that they are auto-regressive decoder-only transformers models.
In retrospect, GOFAI are deterministic in a sense that intentionality is injected within symbolic tokens through explicit programming. Connectionist networks, on the other hand, are often considered as black-box models, given their hidden nature of intermediate representations of perceptron. Unlike GOFAI, its internal representation is determined by the state of the entire network states rather than one singular unit. Although these models exhibit emergent behaviour of intelligence, one should be aware that this is not artificial general intelligence yet, largely due to researchers’ observer-expectancy effect.
In summary:
- GOFAI are deterministic and rule-based, given its intentionality is injected through explicit programming
- NFAI are often considered as “black-box” models (in: input - out: some output), data-driven given the networked complexity nature of its internal representations
why do we need structured decoding?
LLMs excel at the following heuristic: given a blob of text, the model will generate a contiguous piece of text that it predicts as the most probable tokens. For example, if you give it a Wikipedia article, the model should produce text consistent with the remainder of said article.
These models works well given the following assumption: the inputs prompt must be coherent and well-structured surrounding a given problem the users want to achieve. In other words, generations are often non-deterministic when you need output in specific formats. Think of asking a model to generate JSON - without guidance, it might produce valid text that breaks JSON specification7.
This arises for the need of applying explicit rules and grammar8 (an addition of GOFAI system) that allows users to have control over certain aspect of the generations format while keeping the non-deterministic nature of the overall system.
OpenAI then introduced JSON mode to constrain 9 the output format from these models. If you have built with these functionality before (such as function calling, coding assistant), chances are you are using structured decoding under the hood.
Guided decoding is to LLMs what validation is to APIs - it acts as a guarantee that what comes out matches what you expect. Guided decoding ensures structure integrity that allows developers to integrate LLMs into their application with ease!
structured decoding and vLLM.
In layman terms, structured decoding gives LLMs a “template” to follow. Users provide a schema that “influences” the model’s output, ensuring compliance with the desired structure:
graph LR subgraph Structured automaton A[Schema] -->|Defines| B[Logit Bias] end subgraph Decoding C[Language Model] -->|Token distribution| D[Constrained Sampling] end B --> D D --> E[Structured Output] style A fill:#e1f3f8,stroke:#333,stroke-width:2px style B fill:#e1f3f8,stroke:#333,stroke-width:2px style C fill:#f8e1e1,stroke:#333,stroke-width:2px style D fill:#f8e1e1,stroke:#333,stroke-width:2px style E fill:#e1f8e1,stroke:#333,stroke-width:2px classDef default fill:#fff,stroke:#333,stroke-width:2px;
From a technical perspective, an inference engine can modify the probability distribution for next-tokens by applying bias (often via logit masks) for all tokens from any given schemas. To apply these biases, outlines was proposed to construct a finite-state machine 10 (FSM) from any given schemas to then guide the generations (Willard & Louf, 2023). This allows us to track the current state during decoding and filter out invalid tokens by applying logit bias to the output.
courtesy of LMSys, 2024
in vLLM, you can use this by passing a JSON schema to the sampling params (either through Python SDK or HTTP requests)
Note
In some cases, it can even improve the native decoding performance for LLM!
previous limitation in vLLM.
There are few limitations with current vLLM’s support of the Outlines backend:
- Slow decoding: FSM has to be constructed at a token-level, meaning it can only transition the state one token per step. Therefore, it can only decode one token at a time, resulting in slow decoding.
- Batch processing bottlenecks: Implementation in vLLM relies heavily on logit processor11. As such, this is on the critical path of the sampling process 12. In batching use-case, compiling FSM per requests as well as computing the mask synchronous means that all requests in any given batches will get blocked, resulting in high time-to-first-tokens (TTFT) and lower throughput.
- We found that compiling FSM is proven to be a relatively expensive task, making it a significant contributor to the increased TTFT.
- Performance issues with CFG mode: With outlines integrations, while JSON mode is relatively fast, the CFG mode runs significantly slower, and can occasionally crashes the engine.
- Limited advanced feature support: Techniques like jump-forward decoding are currently not possible with logit-processor approach. It requires prefilling a set of k-next tokens, whereas for logit processors we can only deal with the next-token.
integrations with XGrammar
XGrammar introduces a new technique that batch constrained decoding with pushdown automaton (PDA). You can think of a PDA as a “collection of FSMs, and each FSM represents a context-free grammar (CFG).” One significant advantage of PDA is its recursive nature, allowing us to execute multiple state transitions. They also include additional optimisation (for those who are interested) to reduce grammar compilation overhead.
This is great because it addresses limitation (1) by moving grammar compilation out of Python into C, utilising pthread
. Additionally, XGrammar lays the groundwork for addressing limitation (4) in future releases. Below are performance comparisons between the XGrammar and Outlines backends:
In vLLM’s v0 architecture, we’ve implemented XGrammar as a logit processor, optimizing it with caching for tokenizer data. While the performance improvements are encouraging, we believe there’s still significant room for optimisation.
integrations with v0
Important
vLLM now has a basic support for XGrammar by default. In case where we know XGrammar is insufficient to serve the request, we fall back to outlines.
Note that vLLM also includes support for lm-format-enforcer. However, from our testing we found that in some long context test cases,
lm-format-enforcer
fails to enforce correct outputs, and not up to par with Outlines in terms of performance.
tentative plans for v1
With the release of v1 on the horizon, the following includes a tentative plan for structured decoding:
- Moving guided decoding towards scheduler-level:
- Reason: We have more context regarding which requests that use structured decoding at a scheduler-level, therefore it shouldn’t block other requests within the batch (tentatively addressing limitation (2)). In a sense, this moves guided decoding outside of the critical path.
- This would allow for more natural vertical integration with jump-forward decoding (address limitation (4))
- Allowing bitmask calculation in one process instead of each GPU workers
- Reason: We can then broadcast this bitmask to each GPU worker instead of repeating this process per GPU worker.
- We will look to carefully analyze the bandwidth implication of broadcasting masks for every sample per requests doing guided decoding.
- Good baseline for speculative decoding and tool-use
- Reason: XGrammar includes plans to support tool-use, such that we can move away from Python’s tool parser
- Tree scoring in speculative decoding can then use the same API as jump-forward decoding. (which depends on the integration of guided decoding at the scheduler-level)
NOTE: if you have any more suggestions we are more than happy to take it into consideration. Consider joining vLLM slack via #feat-structured-output
acknowledgement
I want to thank the vLLM team, XGrammar team, Michael Groin (Red Hat), Chendi Xue (Intel), and Russell Bryant (Red Hat) for their valuable feedback and collaboration on bringing XGrammar to vLLM and the continuous effort to improve structured decoding in vLLM.
Bibliographie
- Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473 [arXiv]
- Haugeland, J. (1997). Mind Design II: Philosophy, Psychology, and Artificial Intelligence. The MIT Press. https://doi.org/10.7551/mitpress/4626.001.0001
- Hendler, J. (2008). Avoiding Another AI Winter. IEEE Intelligent Systems, 23(2), 2–4. https://doi.org/10.1109/MIS.2008.20
- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation.
- Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361 [arXiv]
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 [arXiv]
- Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. https://doi.org/10.1037/h0042519
- Rumelhart, D. E., McClelland, J. L., & Group, P. R. (1986). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations. The MIT Press. https://doi.org/10.7551/mitpress/5236.001.0001
- Shortliffe, E. H. (1974). MYCIN: A Rule-Based Computer Program for Advising Physicians Regarding Antimicrobial Therapy Selection (Technical Report STAN-CS-74-465). Stanford University.
- Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv preprint arXiv:1712.01815 [arXiv]
- Statistical Machine Translation. (n.d.). IBM Models. Statistical Machine Translation Survey. http://www2.statmt.org/survey/Topic/IBMModels
- Turing, A. M. (1950). i.—Computing Machinery And Intelligence. Mind, LIX(236), 433–460. https://doi.org/10.1093/mind/LIX.236.433
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2023). Attention Is All You Need. arXiv preprint arXiv:1706.03762 [arXiv]
- Willard, B. T., & Louf, R. (2023). Efficient Guided Generation for Large Language Models. arXiv preprint arXiv:2307.09702 [arXiv]
Remarque
-
In other words, intuition, feeling would not constitute as the definition of knowing. For Plato, cooks, who proceed by taste and intuition does not involve understanding because they have no knowledge. Intuition is considered as a mere belief. ↩
-
Allen Newell and Herbert Simon’s work at RAND initially showed that computers can simulate important aspects of intelligence.
Another notable application was found in the medical domain (Haugeland, 1997). MYCIN, developed at Stanford University in the 1970s, diagnosed and recommended treatments for blood infections (Shortliffe, 1974). MYCIN’s developers recognized the importance of justifying recommendations, implementing what were known as “rule traces” to explain the system’s reasoning in human-understandable terms. ↩
-
In the 1990s, IBM released a sequence of complex statistical models that is trained to perform machine translations tasks (Statistical Machine Translation, n.d.) (see also: this lecture from Cornell).
In 2001, Bag of words (BoW)-variants model was trained on 0.3B tokens and was considered SOTA at the time (Mikolov et al., 2013). These earlier works proved to the research community that statistical modelling triumphs over symbolic counterpart for language processing given it can capture the general patterns for large corpuses of text. ↩
-
In 2017, The landmark paper “Attention is all You Need” introduced Transformers architecture (Vaswani et al., 2023) for neural machine translations tasks, which is based on the attention mechanism first proposed by (Bahdanau et al., 2016).
OpenAI then introduced the scaling law for neural language models (Kaplan et al., 2020), which sets off the race towards building these systems based on foundational language models. ↩
-
Prior to Attention-based transformers, seq-to-seq models uses RNNs given its ability for longer context length and better memory. However, they are more susceptible to vanishing/exploding gradients comparing to feed-forward network, and thus LSTM (Hochreiter & Schmidhuber, 1997) was used to solve this problem.
Yet, one of the main problems with LSTM is that they still tend to have poor memory recall with data they have seen many steps ago.
The Attention paper addresses this problem by encoding additional positional data into the inputs. The paper also additionally proposed a encoder-decoder architecture for translation tasks, however, most of text-generation models nowadays are decoder-only, given its superior performance over zero-shot tasks.
One of the many reasons why attention-based transformers works better than LSTM is because transformers are very scalable and hardware-aware (you can’t just arbitrary add more LSTM block and hope for better long-term retention). For more information, please refer back to the original paper. ↩
-
Machine learning models needs to process arbitrary numbers rather than text, as such we have to convert a given word to its corresponding encoded tokens (often known as tokenization).
For example, the model will see the phrase
"The quick brown fox jumps over the lazy dog"
as['<|begin_of_text|>', 'The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog']
↩ -
One might argue that we can reliably achieve these through few-shot promptings, i.e “Give me a JSON that yields the address of users. Example output can be …”. However, there is no guarantee that the generated outputs is a valid JSON. This is because these models are probabilistic systems, as they are “sampling” the next results based on the distribution of data that it was trained on.
One might also argue that one should use specific fine-tuned models for JSON outputs to perform such cases. However, fine-tuning often requires extensive training and a lot more labor to curate data, monitor progress, and perform evaluation, which is a huge resources not everyone can afford to do. ↩
-
Most recent notable example of GOFAI-NFAI hybrid system is AlphaZero. AlphaZero is a connectionist network-based Go playing systems, that uses a deep neural networks to assess new positions (a NFAI algorithm) and Monte-Carlo Tree Search (a GOFAI algorithm) to determine its next move (Silver et al., 2017). DeepMind then applies these techniques to build AlphaFold, a system that predicts a protein’s 3D structure from its amino acid sequence. ↩
-
Note that the phrase “[structured/constrained/guided] decoding” are used interchangeably, but they all refer to the same mechanism of “using a format for the model to structurally sampling outputs.” ↩
-
We define a finite-state machine, given by where character comprising the strings in are drawn from , i.e: :
- is a finite set of states
- is a finite alphabet
- is the transition function
- is the start state
- is the set of all accepted states.
Intuitively, we can describe any structured format (JSON, YAML, etc.) using a context-free grammar due to its expressive nature. ↩
-
See this blog post from Transformers for using logit processors to control generations process. ↩
-
The following was excerpt from vllm-project/vllm#5329:
-
There are a few PR trying to cover this usage. There was one bugfix PR on vLLM and one upstream ↩