Continuous batching

Tag

ml

published
08 Feb. 2024
modified
07 Nov. 2024

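Continuous batching (Yu et al., 2022) addresses the main inefficiency of static batching, where a batch only finishes once its longest sequence does. Instead, new requests are continuously appended to the running batch, alongside the existing KV cache, as soon as a slot frees up. This reduces cost and improves throughput. ¹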
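
To make the difference concrete, here is a minimal sketch that only counts decode iterations. It assumes every request needs a known, positive number of decode steps and that the batch has a fixed number of slots. The function names `static_batching_steps` and `continuous_batching_steps` are hypothetical, and no real KV cache is modelled; this is a toy scheduler, not how Orca, vLLM, or TGI are implemented.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Count decode iterations when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests sit idle until the longest one in the batch finishes.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Count decode iterations when free slots are refilled at every iteration."""
    waiting = deque(lengths)  # requests not yet admitted
    running = []              # remaining decode steps per in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slot before the next iteration.
        # (Assumption: in a real engine this is where a request's KV cache
        # would join the running batch.)
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration: every running request emits one token;
        # finished requests leave the batch immediately.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [6, 1, 1, 1, 1, 1]  # one long request, five short ones
    print(static_batching_steps(lengths, batch_size=2))      # 8
    print(continuous_batching_steps(lengths, batch_size=2))  # 6
```

With one long request and several short ones, the static schedule spends iterations waiting on the long request, while the continuous schedule keeps the second slot busy with the short requests.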

Bibliography

Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., & Chun, B.-G. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 521–538. https://www.usenix.org/conference/osdi22/presentation/yu

Note

¹ See the paper and its accompanying presentation. The most notable open-source implementation is vLLM.

P.S.: I believe it was actually first implemented in huggingface/tgi.
