---
chatgpt:
  threads:
    693baf26-7f00-8005-8844-8c9be18ab5c3: tweets research
date: '2025-12-03'
description: DSA and RL from pre-training
id: DS32
modified: 2026-06-05 15:08:12 GMT-04:00
seealso:
  - '[[thoughts/Attention]]'
socials:
  elie: https://x.com/eliebakouch/status/1972719388668084525
  inference: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/inference/generate.py
  raschka: https://magazine.sebastianraschka.com/p/technical-deepseek
  wh: https://x.com/nrehiew_/status/1973193918662713510
  zhihu: https://x.com/ZhihuFrontier/status/1993231992876421156
tags:
  - model
  - attention
title: DeepSeek V3.2
created: '2025-12-03'
published: '2025-12-03'
pageLayout: default
slug: thoughts/DS32
permalink: https://aarnphm.xyz/thoughts/DS32.md
generator:
  quartz: v4.6.0
  hostedProvider: Cloudflare
  baseUrl: aarnphm.xyz
full: https://aarnphm.xyz/llms-full.txt
---
The major contribution of [[thoughts/DeepSeek]] V3.2 is DeepSeek Sparse Attention (DSA), which has two components:

- lightning indexer
- top\_k token selection

![[thoughts/images/DSA.webp]]

We can think about DSA as a noncontiguous sliding window where each token only attends to 2048 other tokens. This means that for decode, both the memory loaded and the FLOPs spent in attention are constant at $O(2048)$ per step, independent of context length.

For the lightning indexer, we calculate an index score $I_{t,s}$ between the query token $h_{t} \in \mathbb{R}^{d}$ and a preceding token $h_{s} \in \mathbb{R}^d$:

$$
I_{t,s} = \sum_{j=1}^{H^{I}} w^{I}_{t,j} \cdot \operatorname{ReLU}(q^{I}_{t,j} \cdot k^{I}_{s})
$$

where $H^{I}$ denotes the number of indexer heads; $q^{I}_{t,j} \in \mathbb{R}^{d^I}$ and $w^I_{t,j} \in \mathbb{R}$ are derived from token $h_{t}$, and $k^I_{s} \in \mathbb{R}^{d^I}$ is derived from preceding token $h_{s}$
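
A minimal sketch of the index score in plain Python, just to make the formula concrete. The sizes ($H^I = 2$, $d^I = 3$) and all values are toy assumptions, and `index_score` is a hypothetical helper name, not something from the released code.

```python
def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def index_score(q_heads, w_heads, k_s):
    """I_{t,s} = sum_j w_{t,j} * ReLU(q_{t,j} . k_s)."""
    return sum(w * relu(dot(q, k_s)) for q, w in zip(q_heads, w_heads))


# Toy inputs (assumed): 2 indexer heads, indexer dim 3, 3 preceding tokens.
q_t = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # one query vector per indexer head
w_t = [0.5, 2.0]                           # one scalar weight per indexer head
keys = [[1.0, 2.0, 0.0], [-1.0, 0.0, 1.0], [0.0, 1.0, 3.0]]  # a single shared key per token

scores = [index_score(q_t, w_t, k) for k in keys]
print(scores)
```

Note that the key has no head index: every indexer head scores against the same $k^I_s$, which is part of why the indexer cache stays small.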

To select the 2048 tokens, they do $\operatorname{top\_k}(\operatorname{ReLU}(QK^T))$ [^note-relu]
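
The selection step itself is just a top-k over the index scores of the preceding tokens. A small sketch, assuming the scores for one query token are already computed; `select_top_k` is a hypothetical helper, and returning the indices in position order is my choice rather than something stated in the source.

```python
import heapq


def select_top_k(scores, k):
    """Indices of the k highest-scoring preceding tokens, returned in position order."""
    k = min(k, len(scores))  # short contexts simply keep every token
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


# Toy index scores for one query token over 6 preceding tokens (assumed values).
scores_for_t = [0.1, 2.3, 0.0, 1.7, 0.4, 3.1]
selected = select_top_k(scores_for_t, k=3)
print(selected)
```

With `k=2048` and fewer than 2048 preceding tokens, this degenerates to dense attention over the whole prefix.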

[^note-relu]: The reason for separating $w$ from $q$ (instead of completely eliminating $w$ and letting $q$ learn $w$ as its magnitude) seems to be that $w$ can be negative for some heads.

    Heads with $w < 0$ will capture how badly matched those source tokens are.

    For each query token, the top-k tokens are selected, shared across all heads.

    These top-k tokens are chosen using an indexer that is cheap to compute and supports FP8, as low precision is sufficient.

    They trained it with a top-k of 2048 for 128k-token sequences, so each query attends to 64x fewer tokens at full length.

    Unlike NSA, this technique can likely be applied to any pre-trained model through continued training.

Note that the indexer is still quadratic in sequence length, but since indexing is done directly in FP8 and both the number of indexer heads and the indexer head dim are small, it is pretty cheap. It basically does $(Q_{\text{fp8}}\space @ \space K_{\text{fp8}}^T) * q_{\text{scale}} * k_{\text{scale}}$
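
To see why multiplying by the two scales after the low-precision matmul recovers the right answer, here is a sketch with per-vector scales. Big caveat: Python has no FP8 type, so rounding to an integer grid with a max of 448 is only a stand-in for a real FP8 format, and `quantize` is a hypothetical helper.

```python
def quantize(vec, qmax=448.0):
    """Per-vector symmetric quantization: returns (integer codes, scale)."""
    amax = max(abs(x) for x in vec)
    scale = amax / qmax if amax > 0 else 1.0
    # Integer rounding is a stand-in for casting to FP8 (assumption).
    return [round(x / scale) for x in vec], scale


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.12, -1.5, 0.7, 2.0]  # toy indexer query (assumed values)
k = [1.1, 0.3, -0.4, 0.9]   # toy indexer key (assumed values)

q_codes, q_scale = quantize(q)
k_codes, k_scale = quantize(k)

# (Q_q @ K_q^T) * q_scale * k_scale: the scales factor out of the dot product.
approx = dot(q_codes, k_codes) * q_scale * k_scale
exact = dot(q, k)
print(exact, approx)
```

Since the indexer only needs a ranking of tokens, a small error like this is fine as long as it rarely reorders the tokens around the top-k cutoff.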

Instead of loading the entire `[S, 512 + 64]` KV cache ($d_{\text{content}} + d_{\text{rope}}$) to score tokens, it adds an `[S, 128]` FP8 cache (the compressed indexer key, a 128-dim proxy vector stored in FP8) and an `[S, 1]` FP32 block-scale cache (the de-quantization scale, `fp8_value * scale`; they basically store _one scaling factor per token_) [^implementation]

With decode length=1, the indexer needs an extra:

- 2x1536x128x64 (proj) + 2x7168x128 (kproj) + 2x7168x64 (weights) = 0.027918336 GFLOPs
- loads of FP8\[S,128\] + FP32\[S,1\], plus 2xSx128x64 FLOPs (dot product)
- In exchange, attention itself only does a 2048/S fraction of the FLOPs and loads
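
The arithmetic above can be checked directly. The constant names below are my reading of what each number stands for (query latent width, hidden size, indexer head dim, indexer heads), so treat them as assumptions; only the numbers themselves come from the note.

```python
Q_LATENT = 1536  # width feeding the indexer query projection (assumed meaning)
HIDDEN = 7168    # model hidden size (assumed meaning)
D_INDEX = 128    # indexer head dimension
H_INDEX = 64     # number of indexer heads
TOP_K = 2048

q_proj = 2 * Q_LATENT * D_INDEX * H_INDEX  # "proj"
k_proj = 2 * HIDDEN * D_INDEX              # "kproj"
w_proj = 2 * HIDDEN * H_INDEX              # "weights"
fixed_flops = q_proj + k_proj + w_proj     # independent of sequence length


def indexer_decode_flops(seq_len: int) -> int:
    """Fixed projections plus the per-token dot product against S cached keys."""
    return fixed_flops + 2 * seq_len * D_INDEX * H_INDEX


def attention_fraction(seq_len: int) -> float:
    """Fraction of dense-attention FLOPs and loads left after top-k selection."""
    return min(1.0, TOP_K / seq_len)


print(fixed_flops / 1e9)            # GFLOPs for the fixed part
print(attention_fraction(131072))   # share of dense attention work at S = 131072
```

The fixed part stays tiny; the term that grows is the `2 * S * 128 * 64` dot product, which is the "still quadratic" piece when summed over a whole sequence.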

[^implementation]: This actually aligns with the block-wise quantization scale here, with dim=128

![[thoughts/images/mqa-mha-mode.webp]]

These are essentially two modes, one for training and one for inference (MHA and MQA respectively).

standard MHA projects inputs to $d$-dimensional vectors, then splits them into $h$ heads.
kv cache size per token would be $n_{\text{heads}} \times d_{\text{head}} \times 2$ (one each for K and V).
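
For a sense of scale, a quick comparison of per-token cache sizes (in elements, per layer). The `n_heads=128, d_head=128` config is an assumption for illustration; the `512 + 64` latent width is the one quoted earlier in this note.

```python
def mha_cache_per_token(n_heads: int, d_head: int) -> int:
    """Standard MHA stores a full K and a full V vector per head."""
    return n_heads * d_head * 2


def latent_cache_per_token(d_content: int = 512, d_rope: int = 64) -> int:
    """Latent-style cache stores one compressed vector plus the RoPE part."""
    return d_content + d_rope


mha = mha_cache_per_token(n_heads=128, d_head=128)  # assumed head config
mla = latent_cache_per_token()
print(mha, mla, round(mha / mla, 1))
```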

### training mode (MHA-style)

during training, we need gradients to flow through distinct up-projections to learn head-specific features.

$$
k_t = c_{KV} W_{UK}
$$

$$
v_t = c_{KV} W_{UV}
$$

$$
A = \text{Softmax}\left(\frac{q_t (c_{KV} W_{UK})^T}{\sqrt{d}}\right) (c_{KV} W_{UV})
$$

This makes sense given that training needs full-rank MHA, with the per-head keys and values fully materialized as activations.

### inference mode (MQA-style)

Associativity of matrix multiplication allows us to merge the up-projection $W_{UK}$ into the query projection $W_{Q}$.

recall that $A \propto q k^T$.
if we expand $k$:

$$
\text{score} = q_t (c_{KV} W_{UK})^T = q_t W_{UK}^T c_{KV}^T
$$

notice we can group $(q_t W_{UK}^T)$. let’s define an **absorbed query**:

$$
Q_{\text{absorbed}} = Q_{\text{original}} \cdot W_{UK}^T
$$

now the attention score is:

$$
\text{score} = Q_{\text{absorbed}} \cdot c_{KV}^T
$$

Now, we _never_ have to compute or store the full high-dim key matrix $K$ during inference. We only store the tiny compressed latent $c_{KV}$, which is nice.
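
A tiny numeric check of the absorption trick, using the same row-vector convention as the equations ($k_t = c_{KV} W_{UK}$). Sizes and values are toy assumptions ($d_c = 2$, $d_h = 3$), and the helpers are hypothetical.

```python
def matvec_row(vec, mat):
    """Row vector times matrix: vec [n] @ mat [n][m] -> [m]."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(len(mat[0]))]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


W_UK = [[1.0, 0.0, 2.0],
        [0.5, -1.0, 1.0]]                        # [d_c][d_h] up-projection
q_t = [1.0, 2.0, -1.0]                           # [d_h] query for one head
latents = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]  # cached c_KV for 3 tokens

# Training-style: materialize every key, then score.
naive = [dot(q_t, matvec_row(c, W_UK)) for c in latents]

# Inference-style: absorb W_UK into the query once, then score against latents.
q_absorbed = matvec_row(q_t, transpose(W_UK))    # q_t W_UK^T, shape [d_c]
absorbed = [dot(q_absorbed, c) for c in latents]

print(naive, absorbed)
```

The absorbed path does one small projection per query instead of one up-projection per cached token, and it never touches anything wider than $d_c$ on the key side.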

### notes

![[thoughts/images/indexer-dsa-hf.webp]]

hmm, the indexer implementation on HF does a Hadamard transform[^hadamard] on $Q$ and $K$ before dot product.

- prevent certain dimensions from being too influential on the dot product
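
A sketch of what the transform does to a vector with one outlier dimension. This is a plain fast Walsh-Hadamard transform with orthonormal scaling, written from the textbook definition; it is not the HF kernel, and the input values are made up.

```python
import math


def hadamard(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # size-2 butterfly
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [8.0, 0.1, -0.2, 0.1]   # one outlier dimension (assumed values)
k = [1.0, 0.5, 0.25, -0.5]
hq, hk = hadamard(q), hadamard(k)

print(max(abs(x) for x in q), max(abs(x) for x in hq))  # outlier gets spread out
print(dot(q, k), dot(hq, hk))                           # dot product is preserved
```

Because the transform is orthonormal, the dot product is unchanged in exact arithmetic. So my guess is that the benefit shows up once $Q$ and $K$ are quantized: with the outlier spread across dimensions, a single per-token scale wastes less of the FP8 range.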

[^hadamard]: The Hadamard transform can be regarded as being built out of size-2 [discrete Fourier transforms](https://en.wikipedia.org/wiki/Discrete_Fourier_transform "Discrete Fourier transform") (DFTs); the Hadamard matrix transforms $2^m$ real numbers $x_{n}$ into $2^m$ real numbers $X_{k}$

> Why don’t people do this in normal attention?

- possibly too expensive
- indexing is _important_ since it is a “hard” selection of the tokens with no recovery (i.e. if we choose the wrong tokens, then we are screwed!)

![[thoughts/images/nsa-infllm-dsa-dinstinction.webp]]

- The logic for NSA/InfLLM v2 is that they do some “sliding window” component for short-context performance (size=512 for NSA) while for long context they compress the info per block plus attend to each token of the most relevant block.
- We can probably think conceptually about the indexer as a “router” (with multiple tokens to attend). Instead of block like NSA this is mostly per token.
- essentially we don’t really have any global information wrt previous blocks, which could be a _limitation_? But the reason this doesn’t affect performance is probably the causal implications in feature ablation.
- pretty impressive for continual learning, given that they first do some warm up in pre-training with top\_k token selection, then tuned the indexer to fit with their RL pipeline (similar to GLM-4.5 training pipeline here.)