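Orca (Yu et al., 2022) addresses the inefficiency of static batching. It reduces cost and improves throughput by scheduling at the level of a single iteration: new requests are continuously added to the running batch, alongside the existing KV cache, instead of waiting for the whole batch to finish. [1]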
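To make the difference concrete, here is a minimal toy simulation, not Orca's or vLLM's actual scheduler. It only counts decode iterations and ignores prefill, KV cache memory limits, and selective batching. The function names, the `batch_size` parameter, and the assumption that each request's output length is known up front are all hypothetical simplifications for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: a batch occupies the engine until its longest
    request finishes, so short requests wait on long ones."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Iteration-level scheduling: after every decode step, finished
    requests leave and waiting requests fill the freed slots."""
    waiting = deque(lengths)  # tokens still to generate, per queued request
    running = []              # tokens still to generate, per in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every in-flight request emits one token.
        running = [r - 1 for r in running]
        # Retire requests that have produced all their tokens.
        running = [r for r in running if r > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 3, 1, 5, 2]  # hypothetical output lengths in tokens
    print(static_batching_steps(lengths, batch_size=2))
    print(continuous_batching_steps(lengths, batch_size=2))
```

With these six hypothetical requests and two slots, the static version needs 16 iterations and the continuous version 11, because freed slots are refilled immediately rather than idling until the longest request in the batch completes.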
Bibliography
- Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., & Chun, B.-G. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 521–538. https://www.usenix.org/conference/osdi22/presentation/yu
Notes
- [1] See the paper and its accompanying presentation. The most notable open-source implementation is vLLM. P.S.: I think it was actually first implemented in huggingface/tgi.