
Understanding or Generation Is the Wrong Question: Can One Multimodal Model Do Both Without a "Generation Tax"?

2026-06-19 · 4 min read

SenseNova-U1 is SenseTime's 2026 native unified model, which fuses multimodal understanding and generation into a single process and is still reported to match understanding-only VLMs. Released on May 12, 2026, by Haiwen Diao, Penghao Wu, and colleagues, it argues that the standard recipe of bolting a generation head onto an understanding backbone leaves the two representation spaces misaligned. Instead, it uses a native Mixture-of-Transformers (MoT) to treat understanding and generation as synergistic views of one process. The crucial reported result is that the model avoids the "generation tax" that unified models usually incur.

Why Did Earlier "Unified" Models Pay a Generation Tax?

The generation tax in earlier unified models is a structural problem: most of them bolt a separate generation head onto an understanding backbone, leaving the two representation spaces misaligned. SenseTime researchers Haiwen Diao, Penghao Wu, and colleagues argue this misalignment is not merely a lack of training but a consequence of the split design. The result is a "generation tax": the more generation capability you add, the more understanding and reasoning performance erodes.

This trade-off has long been the weak point of unified models. Teaching one model to draw tended to degrade its ability to see and reason.

How Does the Mixture-of-Transformers Unify Understanding and Generation?

The native Mixture-of-Transformers (MoT) is an architecture that places an understanding stream and a generation stream inside a single backbone while letting them interact through shared attention. SenseNova-U1 processes pixel and text inputs directly, without a separate vision encoder (VE) or variational autoencoder (VAE), and builds this MoT on a native unified paradigm called NEO-unify.

The two streams share the same token sequence and attention structure but use separate parameters for understanding versus generation. As a result, understanding and generation evolve together as two views of one process.
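To make "shared attention, separate parameters" concrete, here is a minimal NumPy sketch of one such layer. It is an illustration under stated assumptions, not SenseNova-U1's actual code: the single attention head, the missing normalization and masking, the ReLU feed-forward, and every name (`init_stream_params`, `route`, `mot_layer`, `stream_ids`) are hypothetical choices made for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_stream_params(rng, d_model, d_ff):
    """One full parameter set per stream (hypothetical layout)."""
    s = 1.0 / np.sqrt(d_model)
    return {
        "wq": rng.normal(0, s, (d_model, d_model)),
        "wk": rng.normal(0, s, (d_model, d_model)),
        "wv": rng.normal(0, s, (d_model, d_model)),
        "wo": rng.normal(0, s, (d_model, d_model)),
        "w1": rng.normal(0, s, (d_model, d_ff)),
        "w2": rng.normal(0, 1.0 / np.sqrt(d_ff), (d_ff, d_model)),
    }

def route(x, stream_ids, params, key):
    """Multiply each token by the weight matrix of its own stream."""
    out = np.zeros((x.shape[0], params[0][key].shape[1]))
    for s, p in enumerate(params):
        mask = stream_ids == s
        out[mask] = x[mask] @ p[key]
    return out

def mot_layer(x, stream_ids, params):
    """x: (tokens, d_model); stream_ids: 0 = understanding, 1 = generation."""
    d = x.shape[1]
    # Projections are stream-specific...
    q = route(x, stream_ids, params, "wq")
    k = route(x, stream_ids, params, "wk")
    v = route(x, stream_ids, params, "wv")
    # ...but attention runs over the whole mixed sequence, so the
    # two streams can read from each other.
    attn = softmax(q @ k.T / np.sqrt(d))
    h = x + route(attn @ v, stream_ids, params, "wo")
    # Feed-forward is stream-specific again.
    ff = np.maximum(route(h, stream_ids, params, "w1"), 0.0)
    return h + route(ff, stream_ids, params, "w2")

rng = np.random.default_rng(0)
params = [init_stream_params(rng, 16, 32) for _ in range(2)]
x = rng.normal(size=(6, 16))
stream_ids = np.array([0, 0, 0, 1, 1, 1])  # 3 understanding + 3 generation tokens
y = mot_layer(x, stream_ids, params)
```

In this toy layer, changing a generation token's input changes the understanding tokens' outputs because both sit in the same attention matrix, while the generation stream's feed-forward weights touch only generation tokens.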

Its operation can be summarized in three stages:

  1. Near-lossless visual interface: two-layer convolutional encoding and MLP-like decoding take pixels directly, minimizing information loss.
  2. Native MoT backbone: the understanding and generation streams are separated internally but linked through shared attention.
  3. X2I generation: a single model performs any-to-image generation from text, images, or other inputs.
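The first stage above can be sketched as a shape-level toy in NumPy. Everything beyond "two convolutional layers in, MLP-like decoder out" is an assumption for illustration: the kernel-equals-stride convolutions (written as patchify plus matrix multiply), the 4 and 4 patch sizes, the ReLU activations, the layer widths, and all function names are hypothetical. The weights are random, so this shows only how pixels become tokens and tokens become pixels, not a trained near-lossless reconstruction.

```python
import numpy as np

def patchify(x, p):
    """(H, W, C) -> (H//p, W//p, p*p*C): group non-overlapping p x p patches."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h // p, w // p, p * p * c)

def unpatchify(x, p, c):
    """Inverse of patchify: (H', W', p*p*C) -> (H'*p, W'*p, C)."""
    hp, wp, _ = x.shape
    x = x.reshape(hp, wp, p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(hp * p, wp * p, c)

def strided_conv(x, w, p):
    """Convolution with kernel size = stride = p, as patchify + matmul."""
    return patchify(x, p) @ w

def encode(img, w1, w2, p1, p2):
    """Two stacked conv layers turn raw pixels into a token sequence."""
    h = np.maximum(strided_conv(img, w1, p1), 0.0)
    h = strided_conv(h, w2, p2)
    return h.reshape(-1, h.shape[-1])

def decode(tokens, m1, m2, grid, patch, channels):
    """A two-layer MLP maps each token back to one pixel patch."""
    h = np.maximum(tokens @ m1, 0.0) @ m2
    return unpatchify(h.reshape(grid[0], grid[1], -1), patch, channels)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
p1, p2, d_mid, d_model = 4, 4, 64, 128
w1 = rng.normal(0, 0.1, (p1 * p1 * 3, d_mid))
w2 = rng.normal(0, 0.1, (p2 * p2 * d_mid, d_model))
patch = p1 * p2                                   # each token covers 16 x 16 pixels
m1 = rng.normal(0, 0.1, (d_model, 256))
m2 = rng.normal(0, 0.1, (256, patch * patch * 3))

tokens = encode(img, w1, w2, p1, p2)              # shape (4, 128)
recon = decode(tokens, m1, m2, (2, 2), patch, 3)  # shape (32, 32, 3)
```

The point of the design, as the article describes it, is that no separate pretrained vision encoder or VAE sits between the pixels and the backbone; the token sequence produced here would feed the MoT directly.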

Is "No Generation Tax" Proven by Benchmarks?

SenseNova-U1 is reported to rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence, while simultaneously achieving strong any-to-image (X2I) generation. In short, the central claim is that adding generation capability did not erode the understanding side.

Laid out side by side, the conventional wisdom about unified models flips.

Aspect | Earlier unified models | SenseNova-U1
Architecture | Understanding backbone + generation head (split) | NEO-unify · native MoT
Representation space | Misaligned | Aligned via shared attention
Understanding | Drops when generation is added (generation tax) | On par with understanding-only VLMs
Generation | Separate head | Single-model X2I

At What Scale Was It Released?

SenseNova-U1 is available as two variants built on an 8B dense and a 30B-A3B mixture-of-experts (MoE) understanding baseline, respectively. They are SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, and SenseTime fully open-sourced them.

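The scales break down as follows:

Variant | Understanding baseline
SenseNova-U1-8B-MoT | 8B dense
SenseNova-U1-A3B-MoT | 30B-A3B mixture-of-experts (MoE)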

Why Does This Unification Matter?

The real significance of SenseNova-U1 lies in showing that "understanding or generation" is the wrong question to ask. As of 2026, binding understanding and generation into one process while preserving understanding performance challenges, at the architectural level, the assumption that unification must come at a cost. That said, this rests on the benchmarks and design claims SenseTime presents, and results may vary by task and evaluation.


Reference: SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture (Diao, Wu et al., SenseTime, 2026)
