---
date: '2026-05-27'
description: prefix-tree caching of KV pages with LRU eviction, shared request prefixes reuse computed K,V.
id: attention-radix
modified: 2026-06-06 01:39:05 GMT-04:00
seealso:
  - '[[thoughts/Attention|Attention]]'
  - '[[thoughts/Radix tree|Radix tree]]'
  - '[[thoughts/paged attention|Paged Attention]]'
  - '[[thoughts/KV compression|KV compression]]'
tags:
  - ml
  - llm
  - technical
title: Radix Attention
created: '2026-05-27'
published: '2026-05-27'
pageLayout: default
slug: thoughts/radix-attention
permalink: https://aarnphm.xyz/thoughts/radix-attention.md
generator:
  quartz: v4.6.0
  hostedProvider: Cloudflare
  baseUrl: aarnphm.xyz
full: https://aarnphm.xyz/llms-full.txt
---
RadixAttention \[@zheng2024sglangefficientexecutionstructured\] maintains an LRU eviction policy to keep relevant [[thoughts/KV compression|KV cache]] entries for all requests within a [[thoughts/Radix tree|radix tree]], implemented in <https://github.com/sgl-project/sglang>.

Every request inserts its prefix tokens $\pi = (t_1,\ldots,t_m)$ along the tree; each node stores a pointer to the KV page that realises that prefix. During decoding the runtime walks the tree to find the deepest cached prefix shared with the new suffix $\sigma$ and reuses the cached $K,V$ tensors before appending freshly computed blocks:

$$
\text{reuse}(\pi, \sigma) = \bigoplus_{j=1}^{m} \text{KV}(t_{1:j}) \;\Vert\; \bigoplus_{k=1}^{|\sigma|} \text{attend}(t_{1:m+k}).
$$

The LRU policy keeps the union of active prefixes resident on GPU while evicting the coldest leaf pages; evicted prefixes spill to host memory so they can be faulted back in if a later request revisits them.

> \[!example\] request routing pseudo-code
>
> ```python
> def serve(request):
>     prefix, suffix = split_prefix_suffix(request.tokens)
>     node = radix.longest_prefix(prefix)
>     kv_pages = node.cached_kv()
>     for token in suffix:
>         logits, kv_new = transformer.step(token, kv_pages)
>         kv_pages.push(kv_new)
>     radix.touch(node)        # update lru state
>     radix.insert(prefix+suffix, kv_pages)
> ```
>
> Each insert/touch updates the LRU ordering so hot conversations stay resident while rarely used prefixes migrate to CPU or disk.

```jsx imports={Zoomable,RadixPrefixTree}
<Zoomable label="radix prefix tree">
  <RadixPrefixTree caption="Click a request to walk the deepest matching prefix (coral) and watch newly cached tokens append (olive). Prompt #4 falls back to root and evicts the coldest branch under LRU; the readout reports the per-request hit rate $H = 1 - C / \sum |r|$." />
</Zoomable>
```

_dynamic evolution of the radix tree in response to various requests._

## cache-aware scheduling

We define the hit rate as

$$
\begin{aligned}
\text{hit rate} &= \frac{\sum_{r \in R} \text{number of cached prefill tokens in } r}{\sum_{r \in R} \text{number of prefill tokens in } r} \\[8pt]
&=1 - \frac{C}{\sum_{r \in R} \text{number of prefill tokens}}
\end{aligned}
$$

_in batch settings: sort requests by matching prefix length and prioritise one with longer matched prefixes instead of FIFO schedule._

<div class="ps-root" data-inline-macros=""><span type="button" class="clipboard-button ps-clipboard" aria-label="Copy pseudocode to clipboard"><svg width="16" height="16" viewBox="0 0 16 16" class="copy-icon"><use href="#github-copy"></use></svg><svg width="16" height="16" viewBox="0 0 16 16" class="check-icon"><use href="#github-check" fill-rule="evenodd" fill="rgb(63, 185, 80)"></use></svg></span><span class="ps-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><annotation encoding="application/x-tex">"\\begin{algorithm}\n\\caption{Cache-Aware Scheduling}\n\\begin{algorithmic}\n\\State \\textbf{Input:} Radix tree $T$, Memory pool $P$.\n\\State \\textbf{Input:} current running batch $B$, waiting queue $Q$.\n\\State \\textbf{Output:} Finished requests and updated system state.\n\\State // Get all requests from the waiting queue\n\\State requests $\\gets Q.\\text{get\\_all\\_requests}()$\n\\State // Search for prefix matching for all waiting request\n\\For{req $\\in$ requests}\n    \\State req.prefix\\_node, req.prefix\\_len $\\gets$ T.match\\_prefix(req.input\\_tokens)\n\\EndFor\n\\State // Sort the request according to matched prefix lengths\n\\State requests.sort()\n\\State // Select requests for the next batch\n\\State available\\_size $\\gets$ T.evictable\\_size() + P.available\\_size()\n\\State current\\_size $\\gets$ 0\n\\State new\\_batch $\\gets$ []\n\\For{req $\\in$ requests}\n    \\If{req.size() + current\\_size $\\le$ available\\_size}\n        \\State new\\_batch.append(req)\n        \\State $\\delta \\gets T.\\text{increase\\_ref\\_counter}(req.\\text{prefix\\_node})$\n        \\State available\\_size $\\gets$ available\\_size + $\\delta$\n    \\EndIf\n\\EndFor\n\\State Q.remove\\_requests(new\\_batch)\n\\State // Insert requests into the current running batch\n\\State B.merge(new\\_batch)\n\\State // Allocate new memory and do eviction if necessary\n\\State needed\\_size $\\gets$ B.needed\\_size()\n\\State success, buffer $\\gets$ P.alloc(needed\\_size)\n\\If{$\\neg \\text{success}$}\n    \\State T.evict(needed\\_size)\n    \\State success, buffer $\\gets$ P.alloc(needed\\_size)\n\\EndIf\n\\State B.run(buffer)\n\\State // Process finished requests\n\\State finished\\_requests $\\gets$ B.drop\\_finished\\_requests()\n\\For{req $\\in$ finished\\_requests}\n    \\State T.decrease\\_ref\\_counter(req.prefix\\_node)\n    \\State T.insert(req)\n\\EndFor\n\\State \\Return finished\\_requests\n\\end{algorithmic}\n\\end{algorithm}"</annotation></semantics></math></span>
<div class="ps-algorithm with-caption">
<p class="ps-line" style="text-indent:-0.6em;padding-left:0.6em;">
<span class="ps-keyword">Algorithm 2 </span>Cache-Aware Scheduling</p>
<div class="ps-algorithmic">
<div class="ps-block" style="margin-left:0.6em;">
<p class="ps-line ps-code">
<span style="font-weight:bold;">Input:</span> Radix tree <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>T</mi></mrow><annotation encoding="application/x-tex">T</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.1389em;">T</span></span></span></span>, Memory pool <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>P</mi></mrow><annotation encoding="application/x-tex">P</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.1389em;">P</span></span></span></span>.</p>
<p class="ps-line ps-code">
<span style="font-weight:bold;">Input:</span> current running batch <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>B</mi></mrow><annotation encoding="application/x-tex">B</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.0502em;">B</span></span></span></span>, waiting queue <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Q</mi></mrow><annotation encoding="application/x-tex">Q</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8778em;vertical-align:-0.1944em;"></span><span class="mord mathnormal">Q</span></span></span></span>.</p>
<p class="ps-line ps-code">
<span style="font-weight:bold;">Output:</span> Finished requests and updated system state.</p>
<p class="ps-line ps-code">
// Get all requests from the waiting queue</p>
<p class="ps-line ps-code">
requests <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>←</mo><mi>Q</mi><mi mathvariant="normal">.</mi><mtext>get_all_requests</mtext><mo stretchy="false">(</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\gets Q.\text{get\_all\_requests}()</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em;"></span><span class="mrel">←</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em;"></span><span class="mord mathnormal">Q</span><span class="mord">.</span><span class="mord text"><span class="mord">get_all_requests</span></span><span class="mopen">(</span><span class="mclose">)</span></span></span></span></p>
<p class="ps-line ps-code">
// Search for prefix matching for all waiting request</p>
<p class="ps-line ps-code">
<span class="ps-keyword">for </span>req <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>∈</mo></mrow><annotation encoding="application/x-tex">\in</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5782em;vertical-align:-0.0391em;"></span><span class="mrel">∈</span></span></span></span> requests<span class="ps-keyword"> do</span></p>
<div class="ps-block" style="margin-left:0.6em;">
<p class="ps-line ps-code">
req.prefix_node, req.prefix_len <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>←</mo></mrow><annotation encoding="application/x-tex">\gets</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em;"></span><span class="mrel">←</span></span></span></span> T.match_prefix(req.input_tokens)</p>
</div>
<p class="ps-line ps-code">
<span class="ps-keyword">end for</span></p>
<p class="ps-line ps-code">
// Sort the request according to matched prefix lengths</p>
<p class="ps-line ps-code">
requests.sort()</p>
<p class="ps-line ps-code">
// Select requests for the next batch</p>
<p class="ps-line ps-code">
available_size <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>←</mo></mrow><annotation encoding="application/x-tex">\gets</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em;"></span><span class="mrel">←</span></span></span></span> T.evictable_size() + P.available_size()</p>
<p class="ps-line ps-code">
current_size <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>←</mo></mrow><annotation encoding="application/x-tex">\gets</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em;"></span><span class="mrel">←</span></span></span></span> 0</p>
<p class="ps-line ps-code">
new_batch <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>←</mo></mrow><annotation encoding="application/x-tex">\gets</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em;"></span><span class="mrel">←</span></span></span></span> []</p>
<p class="ps-line ps-code">
<span class="ps-keyword">for </span>req <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>∈</mo></mrow><annotation encoding="application/x-tex">\in</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5782em;vertical-align:-0.0391em;"></span><span class="mrel">∈</span></span></span></span> requests<span class="ps-keyword"> do</span></p>
<div class="ps-block" style="margin-left:0.6em;">
<p class="ps-line ps-code">
<span class="ps-keyword">if </span>req.size() + current_size <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>≤</mo></mrow><annotation encoding="application/x-tex">\le</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7719em;vertical-align:-0.136em;"></span><span class="mrel">≤</span></span></span></span> available_size<span class="ps-keyword"> then</span></p>
<div class="ps-block" style="margin-left:0.6em;">
<p class="ps-line ps-code">
new_batch.append(req)</p>
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>δ</mi><mo>←</mo><mi>T</mi><mi mathvariant="normal">.</mi><mtext>increase_ref_counter</mtext><mo stretchy="false">(</mo><mi>r</mi><mi>e</mi><mi>q</mi><mi mathvariant="normal">.</mi><mtext>prefix_node</mtext><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\delta \gets T.\text{increase\_ref\_counter}(req.\text{prefix\_node})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em;"></span><span class="mord mathnormal" style="margin-right:0.0379em;">δ</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">←</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em;"></span><span class="mord mathnormal" style="margin-right:0.1389em;">T</span><span class="mord">.</span><span class="mord text"><span class="mord">increase_ref_counter</span></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.0278em;">r</span><span class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-right:0.0359em;">q</span><span class="mord">.</span><span class="mord text"><span class="mord">prefix_node</span></span><span class="mclose">)</span></span></span></span></p>
<p class="ps-line ps-code">
available_size <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>←</mo></mrow><annotation encoding="application/x-tex">\gets</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em;"></span><span class="mrel">←</span></span></span></span> available_size + <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>δ</mi></mrow><annotation encoding="application/x-tex">\delta</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em;"></span><span class="mord mathnormal" style="margin-right:0.0379em;">δ</span></span></span></span></p>
</div>
<p class="ps-line ps-code">
<span class="ps-keyword">end if</span></p>
</div>
<p class="ps-line ps-code">
<span class="ps-keyword">end for</span></p>
<p class="ps-line ps-code">
Q.remove_requests(new_batch)</p>
<p class="ps-line ps-code">
// Insert requests into the current running batch</p>
<p class="ps-line ps-code">
B.merge(new_batch)</p>
<p class="ps-line ps-code">
// Allocate new memory and do eviction if necessary</p>
<p class="ps-line ps-code">
needed_size <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>←</mo></mrow><annotation encoding="application/x-tex">\gets</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em;"></span><span class="mrel">←</span></span></span></span> B.needed_size()</p>
<p class="ps-line ps-code">
success, buffer <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>←</mo></mrow><annotation encoding="application/x-tex">\gets</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em;"></span><span class="mrel">←</span></span></span></span> P.alloc(needed_size)</p>
<p class="ps-line ps-code">
<span class="ps-keyword">if </span><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">¬</mi><mtext>success</mtext></mrow><annotation encoding="application/x-tex">\neg \text{success}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord">¬</span><span class="mord text"><span class="mord">success</span></span></span></span></span><span class="ps-keyword"> then</span></p>
<div class="ps-block" style="margin-left:0.6em;">
<p class="ps-line ps-code">
T.evict(needed_size)</p>
<p class="ps-line ps-code">
success, buffer <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>←</mo></mrow><annotation encoding="application/x-tex">\gets</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em;"></span><span class="mrel">←</span></span></span></span> P.alloc(needed_size)</p>
</div>
<p class="ps-line ps-code">
<span class="ps-keyword">end if</span></p>
<p class="ps-line ps-code">
B.run(buffer)</p>
<p class="ps-line ps-code">
// Process finished requests</p>
<p class="ps-line ps-code">
finished_requests <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>←</mo></mrow><annotation encoding="application/x-tex">\gets</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em;"></span><span class="mrel">←</span></span></span></span> B.drop_finished_requests()</p>
<p class="ps-line ps-code">
<span class="ps-keyword">for </span>req <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>∈</mo></mrow><annotation encoding="application/x-tex">\in</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5782em;vertical-align:-0.0391em;"></span><span class="mrel">∈</span></span></span></span> finished_requests<span class="ps-keyword"> do</span></p>
<div class="ps-block" style="margin-left:0.6em;">
<p class="ps-line ps-code">
T.decrease_ref_counter(req.prefix_node)</p>
<p class="ps-line ps-code">
T.insert(req)</p>
</div>
<p class="ps-line ps-code">
<span class="ps-keyword">end for</span></p>
<p class="ps-line ps-code">
</p>
<p class="ps-line ps-code">
<span class="ps-keyword">return </span>finished_requests</p>
</div>
</div>
</div>
</div>

We got lower bound:

$$
C \ge \sum_{e \in \text{edges}(T)} \mid e \mid
$$

Consider we visit radix tree $T$ in DFS order. For each edge $e$ of $T$, the first time we compute KV cache associated with $e$, then we will compute the whole subtree of $e$.

During computation of $e$ subtree, then edge $e$ will be continuously hit, thus no additional computation will happen.

> \[!tip\] cache hit
>
> with cache size $\ge$ maximum request length (which will equals to longest path in radix tree), edge $e$ **WILL NOT** be evicted during computation of its subtree
> since the common prefix including $e$ of the subtree will be continuously hit.

We can show that longest-shared-prefix-first order is equivalent to DFS order by induction [^proof]

[^proof]: _base_: a random request correspond to node $x \in T$ will be processed.

    - All requests correspond to nodes $\{v_{1}, \ldots, v_{n}\}$ on path $x \gets \text{root}$ doesn’t need recomputation.
    - Thus, computation complexity for requests of nodes $\{v_{1}, \ldots, v_{n}, x\}$ is aligned with DFS

    _induction_: assume we visit node $y \in T$, and the visited node align with DFS order. Let $P$ denote _path of_ $y \gets \text{root}$.

    - Each node that has not been visited has the lowest common ancestor with visited nodes on $P$.
    - Since nodes on $P$ are cached, a node $z$ that has yet to be visited with lowest common ancestor on $P$ will have the _longest shared prefix_
    - longest-shared-prefix-first order will select $z$, which is a valid DFS
      $\boxed{}$

