Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

ASAP
2026-06-19 · 4 min read

Gated DeltaNet-2 is a recurrent attention layer from NVIDIA that separates how much to erase from how much to write when updating the fixed-size memory state of linear attention. Released in May 2026 by Ali Hatamizadeh, Yejin Choi, and Jan Kautz, the paper targets a weakness shared by Gated DeltaNet and KDA, which both tie erasing and writing to a single scalar gate. It replaces that gate with a channel-wise erase gate on the key axis and a channel-wise write gate on the value axis. At 1.3B parameters trained on 100B tokens, the reported results put it ahead of Mamba-2, Mamba-3, GDN, and KDA.

Why Linear Attention's Memory Breaks

Linear attention is a family of layers that replaces the ever-growing cache of softmax attention with a fixed-size recurrent state, so cost grows linearly with sequence length. As of 2026, Mamba-2, Gated DeltaNet (GDN), and KDA all belong to this family, and their core advantage is processing long context in constant memory.
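To make the "constant memory" point concrete, here is a rough per-head count of stored numbers. It assumes one key and one value vector cached per past token for softmax attention and a single d_k x d_v state matrix for linear attention. The function names and the head size of 128 are illustrative choices, not figures from the paper.

```python
def kv_cache_floats(seq_len: int, d_k: int, d_v: int) -> int:
    """Softmax attention: one key (d_k) and one value (d_v) per past token."""
    return seq_len * (d_k + d_v)


def recurrent_state_floats(d_k: int, d_v: int) -> int:
    """Linear attention: one d_k x d_v state matrix, independent of length."""
    return d_k * d_v


if __name__ == "__main__":
    d_k = d_v = 128  # illustrative head sizes (assumption)
    for seq_len in (1_024, 32_768, 1_048_576):
        print(seq_len, kv_cache_floats(seq_len, d_k, d_v),
              recurrent_state_floats(d_k, d_v))
```

The cache count grows in step with the sequence length, while the recurrent state stays at d_k * d_v regardless of how many tokens have been processed.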

The problem lies in how that fixed state is updated. The delta rule in these earlier models uses a single scalar gate to decide both how much of an old association to erase (key side) and how much of the new value to write (value side). Because the two amounts cannot be set independently, any edit to memory also disturbs existing associations that should have stayed intact.
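The coupling can be seen in a minimal sketch of a tied-gate delta step. It assumes a unit-norm key, no decay, and a state stored as a d_k x d_v list of lists. The names `read` and `tied_delta_step` are hypothetical, and the code illustrates the general delta rule rather than any model's actual kernel.

```python
def read(S, k):
    """Return k^T S: the value the state currently associates with key k."""
    return [sum(k[i] * S[i][j] for i in range(len(k)))
            for j in range(len(S[0]))]


def tied_delta_step(S, k, v, beta):
    """Delta rule with one scalar gate.

    S_new = S - beta * k (k^T S) + beta * k v^T
    The same beta scales both the erase term and the write term.
    """
    old = read(S, k)
    return [[S[i][j] + beta * k[i] * (v[j] - old[j])
             for j in range(len(v))]
            for i in range(len(k))]


if __name__ == "__main__":
    S = [[0.0, 0.0], [0.0, 0.0]]
    k = [1.0, 0.0]                                   # unit-norm key (assumption)
    S = tied_delta_step(S, k, [3.0, 5.0], beta=1.0)  # full overwrite
    S = tied_delta_step(S, k, [7.0, 5.0], beta=0.5)  # half erase, half write
    print(read(S, k))
```

With a single `beta` there is no setting that erases the old value fully while writing only part of the new one, or the reverse.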

What Gated DeltaNet-2 Decouples

Gated DeltaNet-2's core move is splitting the tied scalar delta gate into two channel-wise gates. It keeps KDA's channel-wise decay, but adds a channel-wise erase gate b_t on the key axis and a channel-wise write gate w_t on the value axis as separate controls.

The effect of decoupling is finer-grained memory editing. The model performs three independent operations: clearing broad context through decay, removing only selected stale associations through erase, and inserting only the value channels that should persist through write. The design also generalizes cleanly: it reduces to KDA when both gates collapse to the same scalar, and to Gated DeltaNet when the decay collapses too.
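The article does not give the update equation, so the sketch below is one assumed form, chosen only because it reproduces the reductions just described. It is not taken from the paper. The per-channel decay `a` and erase gate `b` act on the key axis, and the write gate `w` acts on the value axis. The function names and the placement of `b` on the read key are assumptions.

```python
def decoupled_step(S, k, v, a, b, w):
    """One assumed decoupled update (not the paper's verified equation).

    S: d_k x d_v state; k: key (d_k); v: value (d_v)
    a: decay per key channel; b: erase gate per key channel
    w: write gate per value channel
    """
    # Decay: scale row i of the state by a[i].
    S = [[a[i] * x for x in row] for i, row in enumerate(S)]
    # Erase: read the old value through the erase-gated key (b * k).
    old = [sum(b[i] * k[i] * S[i][j] for i in range(len(k)))
           for j in range(len(v))]
    # Write: insert only the value channels that w lets through.
    return [[S[i][j] - k[i] * old[j] + k[i] * w[j] * v[j]
             for j in range(len(v))]
            for i in range(len(k))]


def tied_step(S, k, v, alpha, beta):
    """Reference step: scalar decay alpha and one tied scalar gate beta."""
    S = [[alpha * x for x in row] for row in S]
    old = [sum(k[i] * S[i][j] for i in range(len(k)))
           for j in range(len(v))]
    return [[S[i][j] + beta * k[i] * (v[j] - old[j])
             for j in range(len(v))]
            for i in range(len(k))]


if __name__ == "__main__":
    S = [[3.0, 5.0], [0.0, 0.0]]
    k, v, ones = [1.0, 0.0], [7.0, 9.0], [1.0, 1.0]
    # Erase the association for k without writing anything.
    print(decoupled_step(S, k, v, a=ones, b=ones, w=[0.0, 0.0]))
    # Write only value channel 0 without erasing anything.
    print(decoupled_step(S, k, v, a=ones, b=[0.0, 0.0], w=[1.0, 0.0]))
```

Setting every entry of `b` and `w` to the same scalar and every entry of `a` to one scalar makes `decoupled_step` agree with `tied_step`, which mirrors the collapse to the tied-gate case described above.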

Single Gate vs. Decoupled Gates: What Differs

Decoupled gates are what break the "write only as much as you erase" coupling that a single scalar gate enforces. The table below compares what GDN, KDA, and Gated DeltaNet-2 handle channel-wise during memory updates.

Model                  Decay          Erase / write gate
Gated DeltaNet (GDN)   Scalar         Single scalar (tied)
KDA                    Channel-wise   Single scalar (tied)
Gated DeltaNet-2       Channel-wise   Erase / write decoupled, channel-wise

The last row is the paper's one-line fix. Pulling erase and write apart per channel means that writing new information disturbs fewer unrelated existing associations.

How Much It Wins at 1.3B/100B Tokens

Gated DeltaNet-2 is the strongest overall among comparable linear models when trained at 1.3B parameters on 100B FineWeb-Edu tokens. In the recurrent setting it averages 53.11 across LAMBADA and the reasoning suite, above Mamba-3 MIMO at 52.39 and KDA at 52.28.

The gains are largest on long-context retrieval. The key numbers are as follows:

  1. The hybrid-setting average is 53.97, above Mamba-3 MIMO at 52.72.
  2. On RULER retrieval, S-NIAH-3 (2K) rises from 63.2 with KDA to 89.8.
  3. On MK-NIAH-1 (4K), the same comparison against KDA rises from 28.0 to 37.8.

The gap is widest on long-context retrieval, which is consistent with the design intent: decoupled gates disturb stored associations less.
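The size of each gap can be checked directly from the scores quoted above. The snippet only subtracts the reported numbers. The dictionary labels and the `gains` helper are illustrative names.

```python
# (baseline score, Gated DeltaNet-2 score), as quoted in this article.
reported = {
    "recurrent avg vs Mamba-3 MIMO": (52.39, 53.11),
    "recurrent avg vs KDA": (52.28, 53.11),
    "hybrid avg vs Mamba-3 MIMO": (52.72, 53.97),
    "S-NIAH-3 (2K) vs KDA": (63.2, 89.8),
    "MK-NIAH-1 (4K) vs KDA": (28.0, 37.8),
}


def gains(table):
    """Absolute point difference for each comparison, rounded to 2 places."""
    return {name: round(new - old, 2) for name, (old, new) in table.items()}


if __name__ == "__main__":
    for name, delta in gains(reported).items():
        print(f"{name}: +{delta:.2f}")
```

The averaged benchmarks differ by about 0.7 to 1.3 points, while the two retrieval tasks differ by 26.6 and 9.8 points.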

Why It Matters: Sub-Quadratic Long Context Fixed by One Idea

Gated DeltaNet-2's value lies not in an elaborate architecture but in a single, clear fix aimed at the memory mechanism. As of 2026, sub-quadratic (linear) attention is a leading candidate for processing long context cheaply, yet it has carried the weakness of being unable to edit its fixed memory safely.

Breaking the coupling between erase and write targets that weakness directly. That said, the numbers above are results at the 1.3B/100B scale, and generalization to larger models and other data needs further verification.


Reference: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention (Hatamizadeh, Choi, Kautz, NVIDIA, 2026)
