
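Orca (Yu et al., 2022) addresses the inefficiency of static batching, where every request in a batch waits for the slowest one to finish. It schedules at the granularity of a single iteration, continuously appending new requests to the running batch and its existing KV cache, which reduces cost and improves throughput. [1]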
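To make the difference concrete, here is a minimal toy simulation, not Orca's or vLLM's actual scheduler. It only counts decode iterations, and every name in it (`Request`, `make_requests`, `static_batching`, `continuous_batching`, `max_batch_size`) is hypothetical. It assumes each running request produces exactly one token per iteration and ignores prefill, memory limits, and real KV cache management.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    tokens_to_generate: int  # assumed known here; a real server only learns it at EOS
    generated: int = 0


def make_requests(lengths):
    """Build fresh requests, one per output length (illustrative helper)."""
    return [Request(rid=i, tokens_to_generate=n) for i, n in enumerate(lengths)]


def static_batching(requests, max_batch_size):
    """Each batch runs until its longest request finishes; return total iterations."""
    steps = 0
    for start in range(0, len(requests), max_batch_size):
        batch = requests[start:start + max_batch_size]
        steps += max(r.tokens_to_generate for r in batch)
    return steps


def continuous_batching(requests, max_batch_size):
    """Refill free slots before every iteration; return (iterations, finish order)."""
    waiting = deque(requests)
    running = []
    finished = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slot of the running batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every running request emits one token.
        for r in running:
            r.generated += 1
        steps += 1
        # Retire completed requests right away so their slots can be reused.
        still_running = []
        for r in running:
            if r.generated >= r.tokens_to_generate:
                finished.append(r.rid)
            else:
                still_running.append(r)
        running = still_running
    return steps, finished


if __name__ == "__main__":
    lengths = [2, 8, 3, 1]
    print(static_batching(make_requests(lengths), max_batch_size=2))      # 11
    print(continuous_batching(make_requests(lengths), max_batch_size=2))  # (8, [0, 2, 3, 1])
```

With output lengths `[2, 8, 3, 1]` and two slots, the static schedule needs 8 + 3 = 11 iterations, while the continuous one finishes in 8 because the short requests reuse the slot next to the long one.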

Bibliography

  • Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., & Chun, B.-G. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 521–538. https://www.usenix.org/conference/osdi22/presentation/yu

Footnotes

  1. See the paper and its accompanying presentation. The most notable open-source implementation is vLLM.

    P.S. Actually, I think it was first implemented in huggingface/tgi.