Q: What were the 1.3B/100B-token comparison results?

Trained at 1.3B parameters on 100B FineWeb-Edu tokens, Gated DeltaNet-2 beat Mamba-2, Mamba-3, GDN, and KDA overall. Its recurrent average was 53.11 (Mamba-3 MIMO 52.39, KDA 52.28) and its hybrid average 53.97 (Mamba-3 MIMO 52.72). On RULER retrieval, compared with KDA, S-NIAH-3 (2K) rose from 63.2 to 89.8 and MK-NIAH-1 (4K) rose from 28.0 to 37.8.

Sub-quadratic (linear) attention processes long context cheaply but has not been able to edit its fixed-size memory safely. Gated DeltaNet-2 targets that weakness with one clear mechanistic fix: breaking the coupling between erase and write. The numbers are at the 1.3B/100B scale, so generalization to larger models needs further verification.

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Q: What does Gated DeltaNet-2 decouple?

It decouples erasing from writing, which prior linear attention tied to a single scalar gate. It keeps KDA's channel-wise decay while adding a channel-wise erase gate on the key axis and a channel-wise write gate on the value axis, so editing memory disturbs existing associations less. It reduces to KDA when both gates collapse to the same scalar, and to Gated DeltaNet when the decay collapses too.

Gated DeltaNet-2 is a recurrent attention layer from NVIDIA that separates "how much to erase" from "how much to write" when updating the fixed-size memory state of linear attention. Released in May 2026 by Ali Hatamizadeh, Yejin Choi, and Jan Kautz, the paper addresses a shared weakness of Gated DeltaNet and KDA, which tie erasing and writing to a single scalar gate, by splitting that gate into a channel-wise erase gate on the key axis and a channel-wise write gate on the value axis. In the reported comparison at 1.3B parameters trained on 100B tokens, it beats Mamba-2, Mamba-3, GDN, and KDA.

The Dilemma of Editing a Fixed-Size Memory

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, giving a cost that is linear in sequence length. As of 2026, Mamba-2, Gated DeltaNet (GDN), and KDA all belong to this family, and the core advantage is processing long context in constant memory.
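To make the cost difference concrete, here is a minimal sketch that counts stored elements per attention head. The head dimensions and the function names are hypothetical and chosen only for illustration; real models add layers, heads, and dtype sizes on top of this.

```python
def softmax_cache_elems(seq_len: int, d_k: int, d_v: int) -> int:
    """Elements cached per head by softmax attention: one key and one value per token."""
    return seq_len * (d_k + d_v)


def linear_state_elems(d_k: int, d_v: int) -> int:
    """Elements in a matrix-valued recurrent state per head; independent of sequence length."""
    return d_k * d_v


# Hypothetical head dimensions, not taken from the paper.
d_k = d_v = 128

# Map each context length to (softmax cache size, recurrent state size).
sizes = {
    n: (softmax_cache_elems(n, d_k, d_v), linear_state_elems(d_k, d_v))
    for n in (2_048, 4_096, 65_536)
}
```

In this sketch the softmax cache grows in proportion to the context length, while the recurrent state stays at d_k * d_v elements no matter how many tokens have been read.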

The problem is that when the state size is fixed, admitting new information forces something old to be given up. The prior delta rule uses a single scalar gate to decide both "how much old association to erase (key side)" and "how much new value to write (value side)." Because write strength and erase strength ride on the same knob, any attempt to edit memory also disturbs intact existing associations. The loss of editing precision was the price paid for giving up the unbounded cache.
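The coupling can be seen in a few lines of NumPy. This is a minimal sketch of a generic delta-rule step with one scalar gate, not code from the paper; the function name, the (d_k, d_v) state layout, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np


def tied_delta_step(S, k, v, beta):
    """One delta-rule update where a single scalar beta gates both erase and write.

    S: (d_k, d_v) memory state, k: (d_k,) unit-norm key, v: (d_v,) value.
    """
    erased = S - beta * np.outer(k, k @ S)  # remove part of what S returns for k
    return erased + beta * np.outer(k, v)   # write the new value with the same strength


# Toy example: store v_old under key k, then try to change only the last channel.
k = np.array([1.0, 0.0, 0.0, 0.0])
v_old = np.array([1.0, 2.0, 3.0])
v_new = np.array([0.0, 0.0, 9.0])

S = tied_delta_step(np.zeros((4, 3)), k, v_old, beta=1.0)
S_half = tied_delta_step(S, k, v_new, beta=0.5)

# For a unit-norm key the readout is (1 - beta) * v_old + beta * v_new.
readout = k @ S_half
```

With beta = 0.5 the last channel moves toward its new value, but the first two channels are pulled halfway toward zero as well, because the one knob that sets write strength also sets erase strength.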

Splitting One Knob Into Two

Gated DeltaNet-2's core move is splitting the tied scalar delta gate into two channel-wise gates. It keeps KDA's channel-wise decay but adds a channel-wise erase gate b_t on the key axis and a channel-wise write gate w_t on the value axis as separate controls.

The effect of decoupling is finer-grained memory editing. The model performs three independent operations: clearing broad context through decay, removing only selected stale associations through erase, and inserting only the value channels that should persist through write. What stands out is that the design subsumes prior models as special cases: it reduces to KDA when both gates collapse to the same scalar, and to Gated DeltaNet when the decay collapses too. Because the new rule contains the earlier generation as special cases rather than replacing it, it can in principle represent any update those models can; whether training actually reaches at least their quality is an empirical question.
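The three operations and the two reductions can be sketched in NumPy. The exact update equation is not reproduced in this article, so the form below is an assumption chosen only to be consistent with the description above: a (d_k, d_v) state, a decay vector here called alpha, the erase gate b_t applied on the key axis, and the write gate w_t applied on the value axis. The function names are hypothetical, and the paper's actual parameterization may differ.

```python
import numpy as np


def decoupled_step(S, k, v, alpha, b, w):
    """Assumed decoupled update. S: (d_k, d_v); alpha, b: (d_k,); w: (d_v,)."""
    S = np.asarray(alpha).reshape(-1, 1) * S  # decay: per key channel
    S = S - np.outer(b * k, k @ S)            # erase: gated per key channel
    return S + np.outer(k, w * v)             # write: gated per value channel


def tied_step(S, k, v, alpha, beta):
    """Tied baseline: one scalar beta for erase and write; alpha is a scalar or (d_k,)."""
    S = np.asarray(alpha).reshape(-1, 1) * S
    return S - beta * np.outer(k, k @ S) + beta * np.outer(k, v)


rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S0 = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)
v = rng.normal(size=d_v)
alpha = rng.uniform(0.5, 1.0, size=d_k)
beta = 0.7

# Both gates collapsed to the same scalar: same result as the tied rule
# with channel-wise decay (the KDA-style case).
kda_case = np.allclose(
    decoupled_step(S0, k, v, alpha, np.full(d_k, beta), np.full(d_v, beta)),
    tied_step(S0, k, v, alpha, beta),
)

# Decay collapsed to a scalar as well (the Gated DeltaNet-style case).
gdn_case = np.allclose(
    decoupled_step(S0, k, v, np.full(d_k, 0.9), np.full(d_k, beta), np.full(d_v, beta)),
    tied_step(S0, k, v, 0.9, beta),
)
```

Under this assumed form, the only difference from the tied baseline is that b and w are vectors that can take different values per channel; setting them to a shared scalar recovers the older rules.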

Where Single and Decoupled Gates Diverge

Decoupled gates are what break the "write only as much as you erase" coupling that a single scalar gate enforces. The table below compares what GDN, KDA, and Gated DeltaNet-2 handle channel-wise during memory updates.

Model	Decay	Erase / write gate
Gated DeltaNet (GDN)	Scalar	Single scalar (tied)
KDA	Channel-wise	Single scalar (tied)
Gated DeltaNet-2	Channel-wise	Erase / write decoupled, channel-wise

Read top to bottom, the table traces degrees of freedom being released one at a time. GDN keeps decay scalar; KDA frees decay to be channel-wise; Gated DeltaNet-2 pulls apart the last remaining coupling, erase and write, and makes both channel-wise as well. The last row is the paper's one-line fix: with erase and write separated per channel, writing new information disturbs fewer unrelated existing associations.

How Much It Wins at 1.3B/100B Tokens

Gated DeltaNet-2 is the strongest overall among comparable linear models when trained at 1.3B parameters on 100B FineWeb-Edu tokens. In the recurrent setting it averages 53.11 across LAMBADA and the reasoning suite, above Mamba-3 MIMO at 52.39 and KDA at 52.28.

The hybrid setting shows the same ordering, and the gains are largest on long-context retrieval. The key numbers are as follows:

The hybrid-setting average is 53.97, above Mamba-3 MIMO at 52.72.
On RULER retrieval, S-NIAH-3 (2K) rises from 63.2 to 89.8 over KDA.
In the same comparison, MK-NIAH-1 (4K) rises from 28.0 to 37.8.

How to Read the Numbers

When reading these numbers, the overall average and the retrieval scores should not be judged on the same scale. The overall-average gap of 53.11 versus 52.28 is 0.83 points, too small to call a leap in language modeling itself. By contrast, S-NIAH-3 (2K) jumping from 63.2 to 89.8 is a gain of 26.6 points, and MK-NIAH-1 (4K) moving from 28.0 to 37.8 is a relative rise of 35 percent.
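The gaps quoted above can be checked with a short calculation. The helper name is hypothetical; the input scores are the ones reported in this article.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), both rounded."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


avg = gain(52.28, 53.11)     # recurrent average, KDA vs Gated DeltaNet-2
s_niah3 = gain(63.2, 89.8)   # S-NIAH-3 (2K)
mk_niah1 = gain(28.0, 37.8)  # MK-NIAH-1 (4K)
```

Comparing the absolute and relative figures side by side makes the asymmetry between the overall average and the retrieval tasks easy to see.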

That asymmetry actually supports the design intent. The purpose of decoupling erase and write is to scramble memory less, and that effect should show up most not in general language modeling but in retrieval tasks that require holding specific information for a long time and pulling it out precisely. The gap widening sharply on long-context retrieval is what the decoupled gates were designed to produce. Still, some absolute retrieval scores remain low (37.8 on MK-NIAH-1 at 4K), so it is too early to say that results at this scale show linear attention is ready to displace softmax attention.

Implications for Researchers and Practitioners

There are takeaways for research and serving teams too. First, because the result comes at the 1.3B/100B academic scale, this is a low-cost experimental space that can be verified without a large foundation model. A one-line fix to the gate structure leaves room to attempt reproduction and extension without growing the parameter count.

Second, the long-context retrieval gains tie directly to document summarization and RAG serving, where demand is high. Processing long inputs in constant memory while raising retrieval accuracy is especially attractive for on-premise and small-scale serving environments that have been squeezed by softmax cache cost.

That said, the numbers above are results at the 1.3B/100B scale, and generalization to larger models and other data needs further verification. Gated DeltaNet-2's value is not an elaborate architecture but a single, clear fix aimed at the memory mechanism, targeting sub-quadratic attention's long-standing weakness of not being able to edit fixed memory safely.

Reference: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention (Hatamizadeh, Choi, Kautz, NVIDIA, 2026)