---
aliases:
  - specdec
  - speculative
date: '2025-05-21'
description: draft-and-verify using smaller models to generate a head tokens
id: Speculative decoding
modified: 2026-06-05 15:08:24 GMT-04:00
socials:
  slides: https://docs.google.com/presentation/d/1p1xE-EbSAnXpTSiSI0gmy_wdwxN5XaULO3AnCWWoRe4/edit#slide=id.p
tags:
  - ml
  - inference
title: Speculative decoding
transclude:
  title: false
created: '2025-05-21'
published: '2025-05-21'
pageLayout: default
slug: thoughts/Speculative-decoding
permalink: https://aarnphm.xyz/thoughts/Speculative-decoding.md
generator:
  quartz: v4.6.0
  hostedProvider: Cloudflare
  baseUrl: aarnphm.xyz
full: https://aarnphm.xyz/llms-full.txt
---
<blockquote class="twitter-tweet" data-lang="fr" data-dnt="true"><p lang="en" dir="ltr">Speculative execution for LLMs is an excellent inference-time optimization.<br><br>It hinges on the following unintuitive observation: forwarding an LLM on a single input token takes about as much time as forwarding an LLM on K input tokens in a batch (for larger K than you might… <a href="https://t.co/FiwTwqsfho">https://t.co/FiwTwqsfho</a></p>&mdash; Andrej Karpathy (@karpathy) <a href="https://x.com/karpathy/status/1697318534555336961?ref_src=twsrc%5Etfw">31 août 2023</a></blockquote>



Intuition:

- we generate a small set of lookahead tokens, albeit 2-5 tokens with smaller speculators
- uses the larger [[thoughts/Transformers#model]] to “verify” the input sequences + draft tokens (then replace tokens that aren’t valid from rejection sampler)

In a sense, we are verify these in parallel instead of [[thoughts/Autoregressive models|autoregressive decoding]]. Given that verification is a lot cheaper comparing to next-token prediction

A few techniques such as [[#ngrams|ngrams]], [[#EAGLE|EAGLE]] are supported in [[thoughts/vllm|vLLM]]

## papers

- [Utility-Driven Speculative Decoding for Mixture-of-Experts](https://arxiv.org/abs/2506.20675) \[@saxena2025utilitydrivenspeculativedecodingmixtureofexperts\]&#x20;

## EAGLE

_Extrapolation Algorithm for Greater Language-model Efficiency_

- [EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test](https://arxiv.org/abs/2503.01840) \[@li2025eagle3scalinginferenceacceleration\]&#x20;
- [EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees](https://arxiv.org/abs/2406.16858) \[@li2024eagle2fasterinferencelanguage\]&#x20;
- [EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty](https://arxiv.org/abs/2401.15077) \[@li2025eaglespeculativesamplingrequires\]&#x20;

Motivation:

- Some tokens are easier to generate than other tokens
  - Think of conjunctive adverbs, comparing to more complex word such as “annoyingly”
- [[#speculative sampling|speculative sampling]] relies on the draft models having similar distributions as the target models.
  - use smaller models. i.e: Llama 3.2 3B as draft for Llama 3.3 70B.
  - high overhead for stepping through the whole models would outweighs the benefits

> \[!note\] Difference between [EAGLE-1](#eagle-1) and [EAGLE-3](#eagle-3)
>
> - EAGLE-1’s limitation at its feature prediction constraints, via LM head architecture
> - EAGLE-3 addresses this by use direct token prediction and rely on multi-layer feature fusion called “training-time test”, similar to [[#MLP Speculator|MLP Speculator]]

> \[!tip\] distribution skew
>
> EAGLE _does not_ involve any fine-tuning of the target model, therefore preservation of outputs distributions by EAGLE is theoretically guaranteed for both greedy and non-greedy sampling. This is not the case with Lookahead and Medusa.

### EAGLE-1

Observations:

> autoregressive on feature-level [^denote] is simpler than token-level, given that there are more regularity.

[^denote]: features here refer to the hidden states of the decoder layers second-to-top-layer of the LLM, before the LM head. Not to be confused with [[thoughts/mechanistic interpretability#features|interpretable features]]

> uncertainty in sampling process hinders the performance of predicting the next feature.
>
> _feature-level_ are high-dimensional and continuous, meaning sampling “am” or “always” will results in different feature sequences.

EAGLE address this by **inputs the token sequence from one time step ahead including the sampling outcomes into the draft models**.

- predicting $f_{\text{always}}$ based on $f_{\text{I}}$ and $t_\text{always}$
- predicting $f_{\text{am}}$ based on $f_{\text{I}}$ and $t_\text{am}$

![[thoughts/images/eagle-feature-prediction-one-time-step.webp]]

#### notation.

- “Features” refers to second-to-top-layer feature of LLM, or the hidden states before LM head
- Token by $t$, embedding by $e$, features by $f$, distributions by $p$
- Sequences are referred as $T_{i:j}$ for $(t_i, t_{i+1},\ldots, t_j)$ [^forward-pass-simplified]

[^forward-pass-simplified]: Vanilla [[thoughts/Autoregressive models|autoregressive]] at token-level is described by $T_{1:j} \rightarrow E_{1:j} \rightarrow f_j \rightarrow p_{j+1} \rightarrow t_{j+1}$:

    - input $T_{1:j}$ is then transformed into embeddings $E_{1:j}$
    - then into features $F_{1:j}$,
    - LM Head maps $f_j$ to a distribution $p_{j+1} = \text{LM\_Head}(f_j)$
    - sampling next token $t_{j+1}$

#### architecture

![[thoughts/images/eagle-figure-5-comparison.webp]]

![[thoughts/images/eagle-figure-6-architecture.webp]]

- See [vllm-project/vllm#20078](https://github.com/vllm-project/vllm/pull/20078) for triton fused ops.
  ```python
  [feature_seq, token_seq] # [bs, seq_len, hidden_dim], [bs, seq_len]
  token_seq -> token_emb # [bs, seq_len] -> [bs, seq_len, hidden_dim]
  fused_seq = feature_seq * token_emb # [bs, seq_len, 2 x hidden_dim]
  ```
- autoregressive\_head:
  - FC layer <span>&rarr;</span> `reduce # [bs, seq_len, hidden_dim]{:prolog}`
  - decoder layer <span>&rarr;</span> `features`
- using [[thoughts/Attention#TreeAttention|tree attention]] to generate a draft tree of depth $m$ and more than $m$ tokens for $m$ forward pass. [^tree-attention]

[^tree-attention]: Aligns with DistillSpec and Medusa

#### training

- Smooth [[thoughts/university/twenty-four-twenty-five/sfwr-4ml3/Bias and intercept#overfitting.|L1]] loss:
  $$
    L_\text{reg} = \text{Smooth L1}(f_{i+1} \text{draft}(T_{2:i+1}, F_{1:i}))
  $$

- classification loss to optimize given objectives:
  $$
  \begin{aligned}
  p_{i+2} &= \text{Softmax}(\text{LM\_Head}(f_{i+1})) \\
  \hat{p}_{i+2} &= \text{Softmax}(\text{LM\_Head}(\hat{f}_{i+1})) \\
  L_{\text{cls}} &= \text{CrossEntropy}(p_{i+2}, \hat{p}_{i+2})
  \end{aligned}
  $$

- Autoregressive head with loss $L = L_{\text{reg}} + w_{\text{cls}} L_{\text{cls}}$
  - set $w_{\text{cls}}=0.1$ given that classification loss is in order magnitude bigger than regression loss

- Dataset: ShareGPT, 68k dialogue

- Hyperparameter:
  - LR: $3e^{-5}$
  - AdamW with beta $(\beta_1, \beta_2)=(0.9,0.95)$
  - gradient clipping: $0.5$

### EAGLE-2

tl/dr: Improvement on EAGLE-1 via context-aware dynamic draft tree into this drafting modeling.

### EAGLE-3

![[thoughts/images/eagle-3-inference-pipeline.webp]]

> \[!note\] Qwen3 replication
>
> <https://mp.weixin.qq.com/s/Dmdg6aLgFHZEcm6TY1vKkA>

## HASS

[Learning Harmonized Representations for Speculative Sampling](https://arxiv.org/abs/2408.15766) \[@zhang2025learningharmonizedrepresentationsspeculative\]&#x20;

<https://github.com/HArmonizedSS/HASS>

## Falcon

[Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree](https://arxiv.org/abs/2412.12639v1) \[@gao2025falconfasterparallelinference\]&#x20;

## MLP Speculator

_via combined tokens/embedding speculators_

[Accelerating Production LLMs with Combined Token/Embedding Speculators](https://arxiv.org/abs/2404.19124v1) \[@wertheimer2024acceleratingproductionllmscombined\]&#x20;

## DistillSpec

[DistillSpec: Improving Speculative Decoding via Knowledge Distillation](https://arxiv.org/abs/2310.08461) \[@zhou2024distillspecimprovingspeculativedecoding\]&#x20;

## SpecInfer

\[@Miao\_2024\]

## Medusa

<https://sites.google.com/view/medusa-llm>

<https://github.com/FasterDecoding/Medusa>

The idea is pretty interesting, if we add $\textbf{LM\_Head}$ to understand how to speculative certain tokens, then uses [[thoughts/Attention|Tree Attention]] to create a masks for certain future prediction tokens.

Instead of [[#von Neumann acceptance rejection|rejection sampling]], they propose <mark>typical acceptance</mark> to select plausible candidates, inspired by @hewitt2022truncationsamplinglanguagemodel. This is to circumvent greedy-decoding when using in conjunction with other sampling parameters (such as top-p, top-k, temperature)

> \[!note\] typical acceptance
>
> Given $x_{1},x_{2},\ldots,x_{n+K+1}$, composed by both top-prediction from the language model heads and MEDUSA heads, consider the following condition:
>
> $$
> p_{\text{original}}(x_{n+k}|x_1, x_2, \ldots, x_{n+k-1}) > \min(\varepsilon, \delta \exp(-H(p_{\text{original}}(\cdot|x_1, x_2, \ldots, x_{n+k-1}))))
> $$

where $H(\cdot)$ denotes [[thoughts/Entropy|entropy function]], and $\varepsilon, \delta$ denotes hard-threshold and entropy-dependent threshold respectively.

### self distillation.

a variant of LoRA for PEFT.

### search through optimized tree construction.

## ngrams

<https://github.com/apoorvumang/prompt-lookup-decoding>

_also known as Prompt Lookup Decoding (PLD)_, [HF’s assisted generations](https://huggingface.co/blog/assisted-generation)

idea: to use string matching from prompt to generate candidate tokens, instead of using a draft-based models.

```python
def find_candidate_pred_tokens(
  input_ids, max_ngram_size=3, num_pred_tokens=10
):
  input_length = input_ids.size(1)

  for ngram_size in range(max_ngram_size, 0, -1):
    # Extract the last n tokens as our search ngram
    ngram = input_ids[0, -ngram_size:].tolist()

    # Create sliding windows of size ngram_size
    windows = input_ids.unfold(dimension=1, size=ngram_size, step=1)

    # Convert ngram to a tensor for comparison
    ngram_tensor = torch.tensor(ngram, device=input_ids.device).unsqueeze(0)

    # Find where the windows match the ngram
    matches = (windows == ngram_tensor).all(dim=2)

    # Get the indices of matches
    match_indices = matches.nonzero(as_tuple=True)[1]

    # Iterate through match indices to find a valid continuation
    for idx in match_indices:
      start_idx = idx + ngram_size
      end_idx = start_idx + num_pred_tokens
      # Ensure we don't go beyond the length of input_ids and avoid self-match
      if end_idx <= input_length and start_idx < input_length - ngram_size:
        return input_ids[0, start_idx:end_idx]

  # If no match is found, return an empty tensor
  return torch.tensor([], dtype=torch.long, device=input_ids.device)
```

## lookahead decoding

see also: [LMSYS blog](https://lmsys.org/blog/2023-11-21-lookahead-decoding/),

## SPiRE

## MagicDec

---

## optimization

\[@liu2024optimizingspeculativedecodingserving\] proposes SmartSpec via optimizing goodput.

### speculative length

_number of effective tokens generated by draft-models per iteration_

Improvement factor (IF) [[#wall-time improvement|determines]] the value of $\alpha$.

[Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models](https://arxiv.org/abs/2405.04304v1) \[@mamou2024dynamicspeculationlookaheadaccelerates\]  proposes a dynamic speculative length to optimize for best decoding quality. fwiw `num_speculative_tokens=5` has been found to be a pretty good balance between latency and quality trade-off for larger models.
They propose an oracle classifier per draft requests to determine whether they should increase/decrease SL as follows:

$$
C_i = \text{FFN}(\operatorname{Concat}(\text{top\_k}({y_i}^D), \text{entropy}({y_i}^D), i))
$$

where it takes the probability vectors of draft models $y_i^D$ for token position $i$ to generate a confidence score $C_i$ [^discussion]

[^discussion]: This seems like an premature optimization. For use-cases where the batch sizes fluctuates, the calculation for an optimal speculative length would probably too overkill when the improvement could be minimal.

## distributed sps

[Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318) \[@chen2023acceleratinglargelanguagemodel\]&#x20;

## von Neumann acceptance rejection

see also: von Neumann (1951), foundational statement of the acceptance–rejection method.

> \[!note\] problem statement
>
> We often need exact samples from a target distribution, but two obstacles arise:
>
> - only an unnormalized density $f(x)$ is available (unknown normalizer $Z=\int f$), and/or
> - the inverse CDF $F^{-1}$ is unavailable or numerically unstable.
>
> The acceptance–rejection (AR) method circumvents both by using an easy-to-sample proposal $g$ and an envelope constant $M\ge 1$ such that $f(x)\le M\,g(x)$ pointwise.
>
> AR produces samples exactly from the normalized target $p^\star(x)=f(x)/Z$.
>
> Efficiency is controlled by $M$: the expected proposals per accepted draw is $\mathbb E[N]=M/Z$ (equals $M$ when $f$ is normalized).

Classical scheme to sample from a target density/mass $p^\star(x)\propto f(x)$ using proposals from an easy distribution $g$ under an envelope bound.

### setup and algorithm

- Target (unnormalized): $f:\mathcal X\to[0,\infty)$ with mass $Z=\int f(x)\,dx\in(0,\infty)$; desired law $p^\star(x)=f(x)/Z$.
- Proposal: probability density/mass $g$ with support covering that of $f$.
- Envelope: constant $M\ge 1$ such that $\forall x,\ f(x)\le M\,g(x)$.
- Algorithm (one attempt):
  1. Draw $Y\sim g$ and $U\sim\mathrm{Unif}(0,1)$.
  2. Accept $Y$ if $U\le f(Y)/(M g(Y))$; otherwise reject and repeat.

We can also think as follows:

1. Choose a proposal $g$ that approximates the shape of $f$ but is cheap to sample; verify support coverage ($f>0\Rightarrow g>0$).
2. Find an envelope constant $M\ge \sup_x f(x)/g(x)$ (analytically if possible; otherwise pick a safe upper bound and refine empirically).
3. Repeat until you have one sample:
   - Sample $Y\sim g$ and a uniform $U\sim\mathrm{Unif}(0,1)$.
   - Compute the acceptance ratio $a(Y)=f(Y)/(M g(Y))\in[0,1]$.
   - If $U\le a(Y)$, return $Y$; else reject and go back to the first bullet.
4. Diagnostics: empirical acceptance $\approx Z/M$; tighten $M$ or pick a closer $g$ to increase acceptance.

For correctness:

Accepted region: $A=\{(y,u): 0\le u\le f(y)/(M g(y))\}$. Then

$$
\Pr(Y\in dy,\ \text{accept})
\;=\;\int_0^{f(y)/(M g(y))} g(y)\,du\,dy
\;=\;\frac{f(y)}{M}\,dy.
$$

Hence $\Pr(\text{accept})=\int f/M=Z/M$ and

$$
p_{Y\mid \text{acc}}(y)
\;=\;\frac{\tfrac{f(y)}{M}}{\tfrac{Z}{M}}
\;=\;\frac{f(y)}{Z}
\;=\;p^\star(y).
$$

Therefore accepted samples are exactly distributed as the normalized target.

### acceptance rate

- Acceptance probability per proposal: $\Pr(\text{accept})=Z/M$ (equals $1/M$ when $f$ is normalized).
- Number of proposals until first accept is geometric with mean $\mathbb E[N]=M/Z$ (equals $M$ when $Z=1$).

> \[!question\] why do we choose uniform distribution?
>
> also known as _Bernoulli view_
>
> For any realized proposal $Y=y$, the accept/reject decision is a Bernoulli trial with success probability
>
> $$
> \Pr(\text{accept}\mid Y=y)\;=\;\frac{f(y)}{M\,g(y)}\;\in[0,1].
> $$

Sampling $U\sim\mathrm{Unif}(0,1)$ and checking $U\le f(Y)/(M g(Y))$ implements exactly that coin. Geometrically, the uniform supplies the “vertical” measure that carves an acceptance slice of area $f(y)/M$ under the rectangle of height 1 over $y$.

> \[!tip\] for <ref slug="thoughts/LLMs">
>
> For vocab $\mathcal V$, define target mass $p(i)\propto f(i)$, proposal $q(i)=g(i)$, and constant $M\ge\max_i f(i)/g(i)$. One attempt draws $I\sim q$, $U\sim\mathrm{Unif}(0,1)$ and accepts if $U\le f(I)/(M q(I))$. This yields
>
> $$
> \Pr(\text{output}=i)
> \;=\;\frac{ f(i) }{ \sum_j f(j) }\;=\;\frac{f(i)}{Z}.
> $$

variants:

- Pointwise envelope: if $f(x)\le c(x)g(x)$ with $c(x)\ge 1$, accept iff $U\le f(Y)/(c(Y)g(Y))$. Correctness is unchanged; acceptance is $\int f/c$.
- Squeezing: if a cheap lower bound $h(x)\le f(x)\le M g(x)$ is available, early-accept if $U\le h(Y)/(M g(Y))$; otherwise evaluate $f$ and apply the standard test.

The following is the pseudocode:

```text
Given target f (unnormalized OK), proposal g, envelope M ≥ sup_x f(x)/g(x)
repeat:
  y ~ g
  u ~ Uniform(0,1)
  if u ≤ f(y)/(M*g(y)):
     return y
```

## speculative sampling

aliases: SpS, speculative decoding.

Based on:

- [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192) \[@leviathan2023fastinferencetransformersspeculative\]  [^standard-sampling] [^independent-finding]
- [Blockwise Parallel Decoding for Deep Autoregressive Models](https://arxiv.org/abs/1811.03115) \[@stern2018blockwiseparalleldecodingdeep\]&#x20;
- [`vllm/v1/sample/rejection_sampler.py`](https://github.com/vllm-project/vllm/blob/02f0c7b220422792f5e53de2a7d51d2d3ff2df28/vllm/v1/sample/rejection_sampler.py)

[^standard-sampling]: Note that we refer to standard sampling to methods such as argmax, top-k, nucleus, temperatures, et al., albeit each have a different ways to process logits.
    We will consider these as _standard sampling from an adjusted distribution_

[^independent-finding]: This work from DeepMind was performed concurrently and independently from @leviathan2023fastinferencetransformersspeculative. The work at DeepMind focuses more on [[#distributed sps|distributed settings]] of speculative decoding

### tl/dr

- Latency is improved at the cost of increasing ops, with $\gamma=5$ [^alias]
- This is not useful when computation resources are limited.
- [[#wall-time improvement|Wall-time]] improvement by $\frac{1-\alpha^{\gamma +1}}{(1-\alpha)(\gamma c + 1)}$ where $\alpha$ is the approximation $E(\beta)$ [^approx-beta]
- extension of rejection sampling [^discrepancy-with-rejection-sampling]
- Lenience factor $l$ to perform speed versus quality trade-off [^lenience] when draft-models distributions is different from target-models’. [^greedy-and-non-greedy]

[^alias]: also referred in practice as `num_speculative_tokens`

[^approx-beta]: or natural measure of the acceptance rate $\beta$

[^discrepancy-with-rejection-sampling]: Rejection sampling follows a iterative sampling procedure that might looks superficially similar to speculative sampling:

    1. Sample $x \sim q(x)$ and returns $r \sim U(0,1)$
    2. If $r < \frac{p(x)}{M q(x)}$ return $x$
    3. then go to 1

    Where $M = \operatorname{max}_{x} \frac{p(x)}{q(x)}$

    We could employ non-iterative version of rejection sampling instead of speculative sampling here (go through step 1 and 2, and otherwise sample an _unmodified_ $p(x)$ directly)

    Specifically, the expected accept probability:

    $$
    E_{x\sim q(x)} \frac{p(x)}{M q(x)} = \sum_{x}  p(x) \min_{x^{'}}{\frac{q(x^{'})}{p(x^{'})}} \le \sum_{x} p(x) \min{(1, \frac{q(x)}{p(x)})} = \sum_{x} \min{(p(x), q(x))}
    $$

[^greedy-and-non-greedy]: Note that we can’t use `temperature=0` (i.e argmax sampling):

    - Instead we allow some lenience before standardizing the distribution (accept token $x$ sampled from $M_q$ in case of $p(x) \le l \cdot \max{(p)}$)
    - In this case, then similar empirical increases to $\alpha$ to those of `temperature=1`

[^lenience]: A lenience parameter $l \in [0,1]$ to introduce further trade-off. This is useful when the distributions of draft models does not match the target model exactly.

    Specifically we have:

    $$
    \begin{aligned}
    \alpha
      &= \mathbb{E}_{x\sim q(x)}
        \!\left[
          \begin{cases}
            1, & \text{if } l\,q(x) \le p(x),\\[6pt]
            \displaystyle\frac{p(x)}{l\,q(x)}, & \text{if } l\,q(x) > p(x)
          \end{cases}
        \right] \\[10pt]
      &= \mathbb{E}_{x\sim q(x)}\!
        \frac{p(x)}{\max\!\bigl(p(x),\,l\,q(x)\bigr)} \\[8pt]
      &= \sum_{x}
        \frac{p(x)\,q(x)}{\max\!\bigl(p(x),\,l\,q(x)\bigr)} \\[8pt]
      &= \frac{1}{l}\sum_{x}
        \min\!\bigl(p(x),\,l\,q(x)\bigr) \\[8pt]
      &= \sum_{x}
        \min\!\Bigl(\tfrac{p(x)}{l},\,q(x)\Bigr).
    \end{aligned}
    $$

    > \[!tip\] Important
    >
    > this relies on _q_ is sampled from this given distributions, and $l$ increases $\alpha$

    In the case of greedy decoding (`temperature=0`), the draft essentially outputs $x^{'}_q = \argmax{q(x)}$, so scaling $l q(x)$ becomes a no-op, given that the argmax will be unchanged in this case.

### goal and algorithm

Let $M_p$ be the target model for task $X$, and $p(x_t \mid x_{<t})$ the distribution we get from model for a prefix $x_{<t}$

Let $M_q$ be the draft/approximation models at the same task, and $q(x_t \mid x_{<t})$ the distribution we get from model for a prefix $x_{<t}$

_Objective_: to use $M_q$ to generate $\gamma \in \mathbb{Z}^{+}$ completions, and use $M_p$ to verify $\gamma$ tokens _in parallel_

- Keep when $q(x) \le p(x)$
- Reject when $q(x) \ge p(x)$ for **sample** with $P=1-\frac{p(x)}{q(x)}$ and sample $x$ again from $p^{'}(x) = \textit{norm}(\textit{max}(0, p(x) - q(x)))$ [^a.1]

<div class="ps-root" data-inline-macros=""><span type="button" class="clipboard-button ps-clipboard" aria-label="Copy pseudocode to clipboard"><svg width="16" height="16" viewBox="0 0 16 16" class="copy-icon"><use href="#github-copy"></use></svg><svg width="16" height="16" viewBox="0 0 16 16" class="check-icon"><use href="#github-check" fill-rule="evenodd" fill="rgb(63, 185, 80)"></use></svg></span><span class="ps-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><annotation encoding="application/x-tex">"\\begin{algorithm}\n\\caption{SpeculativeDecodingStep}\n\\begin{algorithmic}\n\n\\INPUT{$M_p,\\;M_q,\\;\\textit{prefix}$}\n\n\\State $\\triangleright$ Sample $\\gamma$ guesses $x_1,\\dots,x_\\gamma$ from $M_q$\n\\FOR{$i \\gets 1$ \\TO $\\gamma$}\n    \\STATE $q_i(x) \\gets M_q\\!\\bigl(\\textit{prefix} + [x_1,\\dots,x_{i-1}]\\bigr)$\n    \\STATE $x_i \\sim q_i(x)$\n\\ENDFOR\n\n\\State $\\triangleright$ Run $M_p$ in parallel\n\\STATE $p_1(x),\\dots,p_{\\gamma+1}(x) \\gets\n       M_p(\\textit{prefix}),\\dots,\n       M_p\\!\\bigl(\\textit{prefix} + [x_1,\\dots,x_\\gamma]\\bigr)$\n\n\\State $\\triangleright$ Determine the number of accepted guesses $n$\n\\STATE $r_1,\\dots,r_\\gamma \\sim U(0,1)$\n\\STATE $n \\gets \\min\\!\\bigl(\\{\\,i-1 \\mid\n          1\\le i\\le\\gamma,\\;\n          r_i > \\frac{p_i(x)}{q_i(x)}\\,\\}\\cup\\{\\gamma\\}\\bigr)$\n\n\\State $\\triangleright$ Adjust $M_p$’s distribution if needed\n\\STATE $p'(x) \\gets p_{n+1}(x)$\n\\IF{$n &#x3C; \\gamma$}\n    \\STATE $p'(x) \\gets \\mathrm{norm}\\!\\bigl(\\max\\!\\bigl(0,\\;\n           p_{n+1}(x)-q_{n+1}(x)\\bigr)\\bigr)$\n\\ENDIF\n\n\\State $\\triangleright$ Emit one token from $M_p$ and $n$ from $M_q$\n\\STATE $t \\sim p'(x)$\n\\RETURN $\\textit{prefix} + [x_1,\\dots,x_n,t]$\n\n\\end{algorithmic}\n\\end{algorithm}"</annotation></semantics></math></span>
<div class="ps-algorithm with-caption">
<p class="ps-line" style="text-indent:-0.6em;padding-left:0.6em;">
<span class="ps-keyword">Algorithm 1 </span>SpeculativeDecodingStep</p>
<div class="ps-algorithmic">
<p class="ps-line" style="text-indent:-0.6em;padding-left:0.6em;">
<span class="ps-keyword">Input: </span><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>M</mi><mi>p</mi></msub><mo separator="true">,</mo><mtext>  </mtext><msub><mi>M</mi><mi>q</mi></msub><mo separator="true">,</mo><mtext>  </mtext><mtext mathvariant="italic">prefix</mtext></mrow><annotation encoding="application/x-tex">M_p,\;M_q,\;\textit{prefix}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9805em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.109em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.109em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">p</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.109em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.109em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em;">q</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord text"><span class="mord textit">prefix</span></span></span></span></span></p>
<div class="ps-block" style="margin-left:0.6em;">
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>▹</mo></mrow><annotation encoding="application/x-tex">\triangleright</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4653em;"></span><span class="mord">▹</span></span></span></span> Sample <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>γ</mi></mrow><annotation encoding="application/x-tex">\gamma</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.0556em;">γ</span></span></span></span> guesses <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>x</mi><mn>1</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>x</mi><mi>γ</mi></msub></mrow><annotation encoding="application/x-tex">x_1,\dots,x_\gamma</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0556em;">γ</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span> from <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>M</mi><mi>q</mi></msub></mrow><annotation encoding="application/x-tex">M_q</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.109em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.109em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em;">q</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span></p>
<p class="ps-line ps-code">
<span class="ps-keyword">for </span><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi><mo>←</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">i \gets 1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em;"></span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">←</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">1</span></span></span></span> <span class="ps-keyword">to</span> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>γ</mi></mrow><annotation encoding="application/x-tex">\gamma</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.0556em;">γ</span></span></span></span><span class="ps-keyword"> do</span></p>
<div class="ps-block" style="margin-left:0.6em;">
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>q</mi><mi>i</mi></msub><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>←</mo><msub><mi>M</mi><mi>q</mi></msub><mtext> ⁣</mtext><mo fence="true" stretchy="true" minsize="1.2em" maxsize="1.2em">(</mo><mtext mathvariant="italic">prefix</mtext><mo>+</mo><mo stretchy="false">[</mo><msub><mi>x</mi><mn>1</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>x</mi><mrow><mi>i</mi><mo>−</mo><mn>1</mn></mrow></msub><mo stretchy="false">]</mo><mo fence="true" stretchy="true" minsize="1.2em" maxsize="1.2em">)</mo></mrow><annotation encoding="application/x-tex">q_i(x) \gets M_q\!\bigl(\textit{prefix} + [x_1,\dots,x_{i-1}]\bigr)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0359em;">q</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">←</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.2em;vertical-align:-0.35em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.109em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.109em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em;">q</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="mopen"><span class="delimsizing size1">(</span></span><span class="mord text"><span class="mord textit">prefix</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1.2em;vertical-align:-0.35em;"></span><span class="mopen">[</span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em;"><span></span></span></span></span></span></span><span class="mclose">]</span><span class="mclose"><span class="delimsizing size1">)</span></span></span></span></span></p>
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>x</mi><mi>i</mi></msub><mo>∼</mo><msub><mi>q</mi><mi>i</mi></msub><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">x_i \sim q_i(x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">∼</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0359em;">q</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span></p>
</div>
<p class="ps-line ps-code">
<span class="ps-keyword">end for</span></p>
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>▹</mo></mrow><annotation encoding="application/x-tex">\triangleright</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4653em;"></span><span class="mord">▹</span></span></span></span> Run <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>M</mi><mi>p</mi></msub></mrow><annotation encoding="application/x-tex">M_p</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.109em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.109em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">p</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span> in parallel</p>
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>p</mi><mn>1</mn></msub><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>p</mi><mrow><mi>γ</mi><mo>+</mo><mn>1</mn></mrow></msub><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>←</mo><msub><mi>M</mi><mi>p</mi></msub><mo stretchy="false">(</mo><mtext mathvariant="italic">prefix</mtext><mo stretchy="false">)</mo><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>M</mi><mi>p</mi></msub><mtext> ⁣</mtext><mo fence="true" stretchy="true" minsize="1.2em" maxsize="1.2em">(</mo><mtext mathvariant="italic">prefix</mtext><mo>+</mo><mo stretchy="false">[</mo><msub><mi>x</mi><mn>1</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>x</mi><mi>γ</mi></msub><mo stretchy="false">]</mo><mo fence="true" stretchy="true" minsize="1.2em" maxsize="1.2em">)</mo></mrow><annotation encoding="application/x-tex">p_1(x),\dots,p_{\gamma+1}(x) \gets
       M_p(\textit{prefix}),\dots,
       M_p\!\bigl(\textit{prefix} + [x_1,\dots,x_\gamma]\bigr)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0361em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.0556em;">γ</span><span class="mbin mtight">+</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">←</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.2em;vertical-align:-0.35em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.109em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.109em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">p</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord text"><span class="mord textit">prefix</span></span><span class="mclose">)</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.109em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.109em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">p</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="mopen"><span class="delimsizing size1">(</span></span><span class="mord text"><span class="mord textit">prefix</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1.2em;vertical-align:-0.35em;"></span><span class="mopen">[</span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0556em;">γ</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mclose">]</span><span class="mclose"><span class="delimsizing size1">)</span></span></span></span></span></p>
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>▹</mo></mrow><annotation encoding="application/x-tex">\triangleright</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4653em;"></span><span class="mord">▹</span></span></span></span> Determine the number of accepted guesses <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi></mrow><annotation encoding="application/x-tex">n</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">n</span></span></span></span></p>
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>r</mi><mn>1</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>r</mi><mi>γ</mi></msub><mo>∼</mo><mi>U</mi><mo stretchy="false">(</mo><mn>0</mn><mo separator="true">,</mo><mn>1</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">r_1,\dots,r_\gamma \sim U(0,1)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0278em;">r</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0278em;">r</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0556em;">γ</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">∼</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.109em;">U</span><span class="mopen">(</span><span class="mord">0</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord">1</span><span class="mclose">)</span></span></span></span></p>
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mo>←</mo><mi>min</mi><mo>⁡</mo><mtext> ⁣</mtext><mo fence="true" stretchy="true" minsize="1.2em" maxsize="1.2em">(</mo><mo stretchy="false">{</mo><mtext> </mtext><mi>i</mi><mo>−</mo><mn>1</mn><mo>∣</mo><mn>1</mn><mo>≤</mo><mi>i</mi><mo>≤</mo><mi>γ</mi><mo separator="true">,</mo><mtext>  </mtext><msub><mi>r</mi><mi>i</mi></msub><mo>></mo><mfrac><mrow><msub><mi>p</mi><mi>i</mi></msub><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><mrow><msub><mi>q</mi><mi>i</mi></msub><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mfrac><mtext> </mtext><mo stretchy="false">}</mo><mo>∪</mo><mo stretchy="false">{</mo><mi>γ</mi><mo stretchy="false">}</mo><mo fence="true" stretchy="true" minsize="1.2em" maxsize="1.2em">)</mo></mrow><annotation encoding="application/x-tex">n \gets \min\!\bigl(\{\,i-1 \mid
          1\le i\le\gamma,\;
          r_i > \frac{p_i(x)}{q_i(x)}\,\}\cup\{\gamma\}\bigr)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">n</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">←</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.2em;vertical-align:-0.35em;"></span><span class="mop">min</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="mopen"><span class="delimsizing size1">(</span></span><span class="mopen">{</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord">1</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">∣</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.7804em;vertical-align:-0.136em;"></span><span class="mord">1</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.7955em;vertical-align:-0.136em;"></span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.7335em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.0556em;">γ</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0278em;">r</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">></span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.53em;vertical-align:-0.52em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.01em;"><span style="top:-2.655em;"><span class="pstrut" style="height:3em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em;">q</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3281em;"><span style="top:-2.357em;margin-left:-0.0359em;margin-right:0.0714em;"><span class="pstrut" style="height:2.5em;"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.143em;"><span></span></span></span></span></span></span><span class="mopen mtight">(</span><span class="mord mathnormal mtight">x</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.485em;"><span class="pstrut" style="height:3em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight">p</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3281em;"><span style="top:-2.357em;margin-left:0em;margin-right:0.0714em;"><span class="pstrut" style="height:2.5em;"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.143em;"><span></span></span></span></span></span></span><span class="mopen mtight">(</span><span class="mord mathnormal mtight">x</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.52em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mclose">}</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">∪</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1.2em;vertical-align:-0.35em;"></span><span class="mopen">{</span><span class="mord mathnormal" style="margin-right:0.0556em;">γ</span><span class="mclose">}</span><span class="mclose"><span class="delimsizing size1">)</span></span></span></span></span></p>
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>▹</mo></mrow><annotation encoding="application/x-tex">\triangleright</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4653em;"></span><span class="mord">▹</span></span></span></span> Adjust <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>M</mi><mi>p</mi></msub></mrow><annotation encoding="application/x-tex">M_p</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.109em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.109em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">p</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span>’s distribution if needed</p>
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>p</mi><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>←</mo><msub><mi>p</mi><mrow><mi>n</mi><mo>+</mo><mn>1</mn></mrow></msub><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">p'(x) \gets p_{n+1}(x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0019em;vertical-align:-0.25em;"></span><span class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7519em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span></span></span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">←</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">n</span><span class="mbin mtight">+</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em;"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span></p>
<p class="ps-line ps-code">
<span class="ps-keyword">if </span><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mo>&#x3C;</mo><mi>γ</mi></mrow><annotation encoding="application/x-tex">n &#x3C; \gamma</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5782em;vertical-align:-0.0391em;"></span><span class="mord mathnormal">n</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">&#x3C;</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.0556em;">γ</span></span></span></span><span class="ps-keyword"> then</span></p>
<div class="ps-block" style="margin-left:0.6em;">
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>p</mi><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>←</mo><mrow><mi mathvariant="normal">n</mi><mi mathvariant="normal">o</mi><mi mathvariant="normal">r</mi><mi mathvariant="normal">m</mi></mrow><mtext> ⁣</mtext><mo fence="true" stretchy="true" minsize="1.2em" maxsize="1.2em">(</mo><mi>max</mi><mo>⁡</mo><mtext> ⁣</mtext><mo fence="true" stretchy="true" minsize="1.2em" maxsize="1.2em">(</mo><mn>0</mn><mo separator="true">,</mo><mtext>  </mtext><msub><mi>p</mi><mrow><mi>n</mi><mo>+</mo><mn>1</mn></mrow></msub><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>−</mo><msub><mi>q</mi><mrow><mi>n</mi><mo>+</mo><mn>1</mn></mrow></msub><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo fence="true" stretchy="true" minsize="1.2em" maxsize="1.2em">)</mo><mo fence="true" stretchy="true" minsize="1.2em" maxsize="1.2em">)</mo></mrow><annotation encoding="application/x-tex">p'(x) \gets \mathrm{norm}\!\bigl(\max\!\bigl(0,\;
           p_{n+1}(x)-q_{n+1}(x)\bigr)\bigr)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0019em;vertical-align:-0.25em;"></span><span class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7519em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span></span></span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">←</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.2em;vertical-align:-0.35em;"></span><span class="mord"><span class="mord mathrm">norm</span></span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="mopen"><span class="delimsizing size1">(</span></span><span class="mop">max</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="mopen"><span class="delimsizing size1">(</span></span><span class="mord">0</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">n</span><span class="mbin mtight">+</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em;"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1.2em;vertical-align:-0.35em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0359em;">q</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">n</span><span class="mbin mtight">+</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em;"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mclose"><span class="delimsizing size1">)</span></span><span class="mclose"><span class="delimsizing size1">)</span></span></span></span></span></p>
</div>
<p class="ps-line ps-code">
<span class="ps-keyword">end if</span></p>
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>▹</mo></mrow><annotation encoding="application/x-tex">\triangleright</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4653em;"></span><span class="mord">▹</span></span></span></span> Emit one token from <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>M</mi><mi>p</mi></msub></mrow><annotation encoding="application/x-tex">M_p</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.109em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.109em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">p</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi></mrow><annotation encoding="application/x-tex">n</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">n</span></span></span></span> from <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>M</mi><mi>q</mi></msub></mrow><annotation encoding="application/x-tex">M_q</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.109em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.109em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em;">q</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span></p>
<p class="ps-line ps-code">
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mo>∼</mo><msup><mi>p</mi><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">t \sim p'(x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em;"></span><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">∼</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.0019em;vertical-align:-0.25em;"></span><span class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7519em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span></span></span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span></p>
<p class="ps-line ps-code">
<span class="ps-keyword">return </span><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mtext mathvariant="italic">prefix</mtext><mo>+</mo><mo stretchy="false">[</mo><msub><mi>x</mi><mn>1</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>x</mi><mi>n</mi></msub><mo separator="true">,</mo><mi>t</mi><mo stretchy="false">]</mo></mrow><annotation encoding="application/x-tex">\textit{prefix} + [x_1,\dots,x_n,t]</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span class="mord text"><span class="mord textit">prefix</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">[</span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">n</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord mathnormal">t</span><span class="mclose">]</span></span></span></span></p>
</div>
</div>
</div>
</div>

[^a.1]: _On Correctness of Speculative Sampling (SpS)_

    We will show that $\forall p(x) \text{ and } q(x)$, _tokens sampled via speculative sampling_ from $p(x)$ and $q(x)$ are **distributed identically** to those sampled from $p(x)$ alone.

    Let $\beta$ be the [[#acceptance probability|acceptance probability]]

    Note that

    $$
    \begin{aligned}
    p'(x)
      &= \operatorname{norm}\!\bigl(\max(0,\;p(x)-q(x))\bigr) \\
      &= \frac{p(x)-\min\!\bigl(q(x),\,p(x)\bigr)}
            {\displaystyle \sum_{x'}\!\bigl(p(x')-\min\!\bigl(q(x'),\,p(x')\bigr)\bigr)} \\
      &= \frac{p(x)-\min\!\bigl(q(x),\,p(x)\bigr)}{1-\beta},
    \end{aligned}
    $$

    so the normalising constant for the adjusted distribution $p'(x)$ is $1-\beta$;
    the last equality follows immediately from Lemma 3.3 and Theorem 3.5.

    Now

    $$
    P(x = x') \;=\;
    P(\text{guess accepted},\,x = x') \;+\;
    P(\text{guess rejected},\,x = x').
    $$

    **Where**

    $$
    P(\text{guess accepted},\,x = x')
      \;=\; q(x')\,\min\!\bigl(1,\tfrac{p(x')}{q(x')}\bigr)
      \;=\; \min\!\bigl(q(x'),\,p(x')\bigr),
    $$

    and

    $$
    P(\text{guess rejected},\,x = x')
      \;=\; (1-\beta)\,p'(x')
      \;=\; p(x') - \min\!\bigl(q(x'),\,p(x')\bigr).
    $$

    **Overall**

    $$
    \begin{aligned}
    P(x = x')
      &= \min\!\bigl(p(x'),\,q(x')\bigr)
        \;+\; p(x') - \min\!\bigl(p(x'),\,q(x')\bigr) \\
      &= p(x').
    \end{aligned}
    $$

    $\boxed{}$

### acceptance probability

alias: acceptance rate

> \[!definition\] Definition 3.1.
>
> _acceptance rate_ $\beta_{x<t}$ given a prefix $x_{<t}$ is the probability of accepting $x_t \sim q(x_t\mid x_{<t})$ via speculative sampling.

$E(\beta)$ is the natural measure of how well $M_q$ approximates $M_p$

$\alpha  = E(\beta)$ assuming $\beta$ are i.i.d, (1) is a capped geometrics variables, with success probability of $1 - \alpha$ and cap $\gamma + 1$:

$$
  E(\text{\# generated tokens}) = \frac{1-\alpha^{\gamma +1}}{1-\alpha}
$$

#### calculating $\alpha$

> \[!definition\] Definition 3.2.
>
> Let natural divergence $D_{LK}$ be:
>
> $$
> D_{LK}(p,q) = \sum_{x} |p(x) - M(x)| = \sum_{x} \mid q(x) - M(x) \mid
> $$
>
> where $M(x) = \frac{p(x) + q(x)}{2}$

> \[!lemma\] Lemma 3.3.
>
> $D_{LK}(p,q) = 1 - \sum_{x} \min(p(x), q(x))$ [^proof-3-3]

[^proof-3-3]: $$
    \begin{aligned}
      D_{LK}(p,q) &= \sum_{x}  |p(x) - M(x)| = \sum_{x} \frac{|p-q|}{2} \\
      &= 1- \sum_{x} \frac{p+q - |p-q|}{2}  \\
      &= 1 - \sum_{x} \min(p(x), q(x))
    \end{aligned}
    $$

    $\boxed{}$

> \[!corollary\] Corollary 3.4.
>
> $D_{LK}(p,q)$ is a symmetric divergence in $[0,1]$, where
>
> $D_{LK}(p,q)=0 \Longleftrightarrow p=q$
>
> $D_{LK}(p,q)=1 \Longleftrightarrow \text{p and q have disjoint support}$

> \[!theorem\] Theorem 3.5.
>
> $\beta = 1 - D_{LK}(p,q)$ [^proof-3-5]

> \[!corollary\] Corollary 3.6.
>
> $\alpha = 1 - E(D_{LK}(p,q)) = E(\min(p,q))$

[^proof-3-5]: $$
    \begin{aligned}
    \beta
      &= \mathbb{E}_{x \sim q(x)}
        \Biggl[
          \begin{cases}
            1 & \text{if } q(x) \le p(x), \\[6pt]
            \displaystyle\frac{p(x)}{q(x)} & \text{if } q(x) > p(x)
          \end{cases}
        \Biggr] \\[8pt]
      &= \sum_{x} \min\!\bigl(p(x),\,q(x)\bigr).
    \end{aligned}
    \qquad\square
    $$

### wall-time improvement

With i.i.d assumption speculative sampling reduces $\text{\# of calls}$ to target models by $\frac{1-\alpha^{\gamma +1}}{1-\alpha }$, assuming running on compute resources that support increased concurrency (GPUs.)

For wall-time [^definition] analysis, assuming we can run $\gamma +1$ concurrent evaluation of $M_p$:

[^definition]: also known as [elapsed real time](https://en.wikipedia.org/wiki/Elapsed_real_time). This is different from CPU time, given that it measure the _actual time taken from the start of the computer program_, where as CPU time only measures _time during which processor is actively working on a certain task or process_

> \[!definition\] Definition 3. cost-efficient
>
> let $c$ be the ratio between time for single run of $M_q$ and the time for single run $M_p$
>
> $c$ is highly dependent on hardware measure. From the paper, $c \approx 0$ to avoid expectancy biases

> \[!theorem\] Theorem 3.8.
>
> expected improvement factor in total wall-time by $\frac{1-\alpha^{\gamma +1}}{(1-\alpha)(\gamma c + 1)}$ [^proof-3-8]
>
> Note that we assume there are long enough generations sequence here.

[^proof-3-8]: Denote the cost of running single steps of $M_p$ by $T$.

    Each run will then costs $T c \gamma  + T = T(c \gamma +1)$ (running $M_q$ $\gamma$ times and running $M_p$ once)

    Given (1) produces $\frac{1-\alpha^{\gamma +1}}{1-\alpha}$ tokens

    The cost to produces a token with speculative sampling would be $\frac{(c \gamma +1)(1-\alpha )}{1-\alpha^{\gamma +1}} T$

    $\boxed{}$

> \[!corollary\] Corollary 3.9.
>
> $\forall \alpha > c \space \exists \space \gamma \mid \text{ we will get improvement by a factor of } \frac{1+\alpha }{1+c}$

If we get an improvement for $\gamma$, we’d also get improvement for any $0 < \gamma^{*} < \gamma$, hence we can use (3.8) for $\gamma = 1$, which yields $\frac{1+\alpha}{1+c}$

### arithmetic operations

> \[!definition\] Definition 4. arithmetic operations per token
>
> let $\hat{c}$ be the ratio of arithmetics operations per tokens of $M_q$ to that of $M_p$
>
> Note that the number of operations will then grow by $\gamma +1$, given that we will produce at most $\gamma +1$ tokens per run.

> \[!theorem\] Theorem 3.11.
>
> The expected factor of increase in number of operations is $\frac{(1-\alpha)(\gamma \hat{c} + \gamma + 1)}{1-\alpha^{\gamma +1}}$ [^proof-3-11]

[^proof-3-11]: Denote by $\hat{T}$ the number of arithmetic operations done by standard decoding per tokens, therefore [[#speculative sampling|speculative sampling]] costs $\hat{T} \hat{c} \gamma + \hat{T}(\gamma +1)$ operations. Then divided by the expected tokens we got the desired results $\boxed{}$

---

### blog post draft

<https://philkrav.com/posts/speculative/>, @zhou2024distillspecimprovingspeculativedecoding

