Paper: Attention Is All You Need
See also Transformers, or distraction
Self-attention
Multi-head Attention
RingAttention
PagedAttention
Used in conjunction with Continuous batching
Reduces memory usage of the attention mechanism by swapping the KV cache in and out of memory. The block manager plays a role similar to virtual memory in an OS.
Essentially, it's a form of paging: the KV cache of each sequence is partitioned into KV blocks, so logically contiguous keys and values can be stored in non-contiguous physical memory.
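A rough sketch of the idea, assuming a per-sequence block table (logical block index → physical block id) plus a free list, analogous to an OS page table. The class and method names below are hypothetical, not vLLM's actual API.

```python
# Minimal sketch of a KV-cache block manager (names are illustrative only).
class BlockManager:
    def __init__(self, num_physical_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))   # free physical block ids
        self.block_tables: dict[int, list[int]] = {}          # seq_id -> logical->physical map

    def allocate(self, seq_id: int, num_tokens: int) -> list[int]:
        """Map a sequence's logical KV blocks onto free physical blocks."""
        num_blocks = -(-num_tokens // self.block_size)         # ceil division
        if num_blocks > len(self.free_blocks):
            raise MemoryError("no free KV blocks; sequence must be preempted or swapped out")
        table = [self.free_blocks.pop() for _ in range(num_blocks)]
        self.block_tables[seq_id] = table
        return table

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's physical blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```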
Given:
- each block contains the KV vectors for a fixed number of tokens, denoted as the block size $B$.
- Key block $K_j = (k_{(j-1)B+1}, \ldots, k_{jB})$
- Value block $V_j = (v_{(j-1)B+1}, \ldots, v_{jB})$

the attention computation becomes block-wise:

$$A_{ij} = \frac{\exp(q_i^\top K_j / \sqrt{d})}{\sum_{t=1}^{\lceil i/B \rceil} \exp(q_i^\top K_t / \sqrt{d})\,\mathbf{1}}, \qquad o_i = \sum_{j=1}^{\lceil i/B \rceil} V_j A_{ij}^\top$$

where $A_{ij} = (a_{i,(j-1)B+1}, \ldots, a_{i,jB})$ is the row vector of attention scores on the j-th KV block.
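A minimal NumPy sketch of this block-wise computation for a single query token; the function name and the list-of-blocks interface are illustrative assumptions, not vLLM's kernel API.

```python
import numpy as np

def paged_attention_single_query(q, key_blocks, value_blocks):
    """Compute o_i = sum_j V_j A_ij^T over the KV blocks of one sequence.

    q:            (d,) query vector for token i
    key_blocks:   list of (B, d) arrays  -> K_j
    value_blocks: list of (B, d) arrays  -> V_j
    """
    d = q.shape[0]
    # Unnormalized scores per block: exp(q_i^T K_j / sqrt(d)), each a length-B vector
    scores = [np.exp(K @ q / np.sqrt(d)) for K in key_blocks]
    # Softmax normalizer summed over all cached tokens (all blocks)
    denom = sum(s.sum() for s in scores)
    out = np.zeros(d)
    for s, V in zip(scores, value_blocks):
        a = s / denom        # A_ij: row vector of attention scores on block j
        out += V.T @ a       # V_j A_ij^T
    return out
```

In practice the key and value blocks would be gathered through the block table above rather than passed as Python lists, which is what lets the kernel read a logically contiguous sequence from non-contiguous physical blocks.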