What is Mirage's latent spatial memory?

Mirage is a method that caches a video world model's 3D spatial memory directly in the diffusion model's latent space rather than as an RGB point cloud. It stores seen latent tokens in 3D via depth-guided back-projection and queries new viewpoints by latent warping, eliminating the per-step round trip of rendering to pixels and re-encoding.

How much faster and lighter is Mirage than prior methods?

According to the paper released in 2026 by Microsoft Research and collaborators (arXiv 2606.09828), Mirage increases generation speed by up to 10.6x (10.57x) and cuts GPU memory by up to 55x versus legacy RGB point-cloud memory, while reaching the top score on the WorldScore spatial-consistency benchmark.

What is the problem with legacy RGB point-cloud memory?

The legacy approach stores the seen scene as an RGB point cloud and, whenever a new viewpoint is needed, rasterizes the point cloud and re-encodes the result into a latent. This round trip inflates per-frame computation and, by passing through pixel space, erodes latent information, which weakens consistency.

The Next Generation of World Models Remembers the World as "Latent 3D," Not Pixels: What Mirage Proves About Latent Spatial Memory

Mirage is a new method that caches a video world model's spatial memory directly in the diffusion model's latent space rather than as an RGB point cloud. The paper, released in June 2026 by researchers at Microsoft Research, Zhejiang University, Adelaide, and Monash (arXiv 2606.09828), describes how the method eliminates the per-step round trip of re-rendering to pixels and re-encoding, boosting generation speed by up to 10.6x, cutting memory by up to 55x, and reaching the top score on the WorldScore benchmark. The takeaway is simple: spatial memory does not need to live in pixel space at all.

Drift: An Old Nuisance Resurfaces

The core challenge that Mirage targets in long-horizon video world models is "spatial consistency": redrawing the same scene coherently when the camera comes back around. Given only a single image and a camera trajectory, a model that keeps generating new viewpoints tends to forget what was around a previously seen corner, and the scene drifts.

The reason this problem is back in the spotlight is that world models are crossing over from watch-only video generators into steerable simulators. The moment a user is free to swing the camera around, "is the place I came from still there" stops being a matter of aesthetics and becomes a matter of trust.

The standard fix is an explicit memory. The model stores the seen scene as an RGB-colored 3D point cloud, and whenever a new viewpoint is needed, it renders that point cloud and feeds it back in as model input.

The Double Tax of Routing Memory Through Pixels

The bottleneck of legacy RGB point-cloud memory is the "rasterize-and-encode round trip" that happens every step. The point cloud must be rasterized into a pixel image, and that image must then be re-encoded into the diffusion model's latent, so heavy decode, render, and encode operations repeat at each step.

This round trip levies two taxes:

Compute cost: converting latent → pixel → latent at every step inflates the per-frame computation.
Information loss: passing through pixel space erodes the latent information the diffusion model relied on, which weakens consistency.

Notably, these two are not a trade-off. Routing through pixels makes the model both slower and less accurate at once, so removing the round trip leaves room to improve speed and quality together.

Moving the Unit of Memory From Pixels to Latents

The core idea of Mirage is to accumulate 3D geometric information directly in the diffusion latent space rather than in pixels. It builds the cache by lifting seen latent tokens into 3D coordinates via depth-guided back-projection, and when a new viewpoint is needed, it queries that latent cache by projecting it through direct latent warping.

The procedure reduces to three steps:

Store: back-project generated latent tokens by depth and cache them as latents at 3D positions.
Query: warp the latent cache to the new camera viewpoint through a single latent-resolution projection.
Generate: condition the next frame on the warped latent, preserving consistency without any descent to pixels and back.
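
The store and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes a pinhole camera defined at latent resolution, a translation-only camera pose, and nearest-cell splatting with a z-buffer, and the names Camera, LatentCache, store, and query are hypothetical. The generate step (conditioning the diffusion model on the warped latent) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Camera:
    """Pinhole camera at latent resolution (assumed; translation-only pose)."""
    focal: float                        # focal length in latent-grid cells
    cx: float                           # principal point, column
    cy: float                           # principal point, row
    position: tuple = (0.0, 0.0, 0.0)   # camera center in world coordinates


@dataclass
class LatentCache:
    """Hypothetical cache: one latent token per stored 3D point."""
    points: list = field(default_factory=list)   # (x, y, z) world positions
    tokens: list = field(default_factory=list)   # latent vector at each point


def store(cache, latents, depth, cam):
    """Back-project every latent token to 3D using its depth and cache it."""
    px, py, pz = cam.position
    for v, row in enumerate(latents):
        for u, token in enumerate(row):
            z = depth[v][u]
            x = (u - cam.cx) * z / cam.focal
            y = (v - cam.cy) * z / cam.focal
            cache.points.append((x + px, y + py, z + pz))
            cache.tokens.append(token)


def query(cache, cam, height, width):
    """Warp cached tokens into a new view; unseen cells stay None."""
    warped = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    px, py, pz = cam.position
    for (x, y, z), token in zip(cache.points, cache.tokens):
        xc, yc, zc = x - px, y - py, z - pz
        if zc <= 0:
            continue  # point is behind the camera
        u = round(cam.focal * xc / zc + cam.cx)
        v = round(cam.focal * yc / zc + cam.cy)
        # Keep only the nearest token that lands in each latent cell.
        if 0 <= u < width and 0 <= v < height and zc < zbuf[v][u]:
            zbuf[v][u] = zc
            warped[v][u] = token
    return warped


# Toy 4x4 latent grid with 2-channel tokens and a flat wall at depth 2.
cam0 = Camera(focal=4.0, cx=1.5, cy=1.5)
latents = [[(float(v), float(u)) for u in range(4)] for v in range(4)]
depth = [[2.0] * 4 for _ in range(4)]

cache = LatentCache()
store(cache, latents, depth, cam0)

same_view = query(cache, cam0, 4, 4)    # original viewpoint
cam1 = Camera(focal=4.0, cx=1.5, cy=1.5, position=(0.5, 0.0, 0.0))
shifted = query(cache, cam1, 4, 4)      # camera moved half a unit sideways
```

In this toy setup, querying from the original camera returns the stored grid unchanged, while the shifted camera sees every token move one cell over and gets an empty (None) column where nothing has been observed yet. In a real system that hole is what the generator would have to fill in. Nothing in the loop touches pixels: both the cache and the warped output stay at latent resolution.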

The idea itself is simple, but its implication is large. If the old approach was "save what you saw as a picture, then redraw it to show it again," Mirage is closer to "save what you saw in the model's own language (latents) and hand it back in that same language." Because the latent is never unpacked into pixels, the rasterize-and-encode round trip itself disappears.

How to Read the 10.6x and 55x Numbers

As of 2026, Mirage is the top-scoring (state-of-the-art) method on WorldScore, a standardized spatial-consistency benchmark, while also being faster and lighter than prior approaches. With consistency held equal, generation speed increases by up to 10.6x (10.57x in the paper) and GPU memory usage drops by up to 55x.

The cue to watch when reading these figures is the word "up to." These are likely the numbers at the most favorable operating point among several conditions, so the average gain may be gentler. Even so, the striking part is that speed, memory, and consistency improved together rather than at one another's expense. Usually pushing one of the three pushes back on another, but changing the representation unit sidesteps that tension.
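
A tiny arithmetic sketch makes the "up to" caveat concrete. The per-configuration speedups below are invented purely for illustration and are not taken from the paper; the point is only that a headline maximum and an average over operating points can differ substantially.

```python
from statistics import mean

# Hypothetical speedups (baseline time / new time) at three operating points.
# These values are made up for illustration and are NOT from the paper.
speedups = {"config A": 2.0, "config B": 4.0, "config C": 9.0}

best = max(speedups.values())       # what an "up to" headline would quote
typical = mean(speedups.values())   # what an average user might experience

print(f"up to {best:.1f}x, mean {typical:.1f}x")  # up to 9.0x, mean 5.0x
```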

Aspect	Legacy RGB point-cloud memory	Mirage latent spatial memory
Memory storage	Pixel space (RGB point cloud)	Diffusion latent space (3D cache)
Viewpoint query	Rasterize-then-re-encode round trip	Single latent-resolution warping
Generation speed	Baseline	Up to 10.6x faster
Memory usage	Baseline	Up to 55x smaller
WorldScore	Comparison point	Top score (SOTA)

Implications for Practitioners and the Open Questions

The real implication of Mirage is that abstracting the representation unit of spatial memory up one level can win speed, memory, and consistency at once. Dropping the intuitive but heavy pixel representation and managing memory on top of the latent representation the model already uses is what produced the 10.6x and 55x numbers.

Cutting memory by 55x is especially concrete for teams on a tight GPU budget, from startups to smaller labs. If you are aiming at interactive simulators or game-like content, it signals that sustaining long camera moves may become feasible without top-tier hardware.

From ASAP's vantage point, though, the limits are clear. These are a single paper's reported figures, and they depend on depth-estimation quality and camera-trajectory setup. If the depth is wrong, a latent gets cached at the wrong 3D position, and that error can propagate straight into the next frame without a pixel round trip to catch it. Reproduction across other scenes and resolutions, and under abrupt viewpoint changes, warrants follow-up verification. The very strength of "no information loss because it never touches pixels" leaves open the question of whether it flips into the weakness of "no pixel step to check the work either."
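
The depth concern can be made concrete with basic pinhole geometry. The sketch below is an illustration under assumed values, not a measurement from the paper: the camera parameters, the sideways baseline, and the function name reprojected_column are all hypothetical. It back-projects one latent cell with a given depth and then projects it into a camera that has moved sideways, once with the true depth and once with an underestimated depth.

```python
def reprojected_column(u, depth, focal, cx, baseline):
    """Column where a point seen at column u (with the given depth) lands
    after the camera translates sideways by `baseline`."""
    x = (u - cx) * depth / focal                 # back-project to 3D
    return focal * (x - baseline) / depth + cx   # project into the moved camera


# Hypothetical latent-resolution camera and sideways camera move.
focal, cx, baseline = 4.0, 1.5, 0.5
true_depth, estimated_depth = 2.0, 1.0   # depth underestimated by half

u_true = reprojected_column(3, true_depth, focal, cx, baseline)
u_est = reprojected_column(3, estimated_depth, focal, cx, baseline)
error = u_est - u_true

print(u_true, u_est, error)  # 2.0 1.0 -1.0
```

With these numbers the token lands a full latent cell away from where it belongs. The displacement equals focal * baseline * (1/true_depth - 1/estimated_depth), so it grows with the size of the camera move and with the depth error, which is why abrupt viewpoint changes are a natural stress test.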

Reference: Latent Spatial Memory for Video World Models (Weijie Wang et al., 2026)