arXiv 2026

Stateful Visual Encoders for Vision-Language Models

Voio, Inc. · Berkeley AI Research · Sky Computing Lab, UC Berkeley · UCSF
Figure: Stateless (left) vs. Stateful (right) visual encoder. The Stateful Visual Encoder injects cross-image interaction at every encoder layer, exposing each visual representation to previous images before tokens reach the LLM.

Abstract

Summary

Vision-language models are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. Yet in existing open-weight VLMs, visual comparison happens only inside the language model; the visual encoder itself remains stateless, encoding each image without access to prior visual context. As a result, small but task-critical changes may be attenuated before the language model can compare them. We introduce a Stateful Visual Encoder (SVE), which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with SVEs achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning, and these gains hold across input resolutions, language-model sizes, and VLM backbones. On real-world tasks (longitudinal radiology, fine-grained image comparison, and remote sensing), SVEs consistently improve generalist VLM baselines and can match or surpass specialized models.

  • 256²–768²: robust across resolutions
  • 0.8B–9B: robust across model sizes
  • 5 VLM families improved
  • 3 real-world domains

Contributions

What we add

1. The Stateful Visual Encoder

A simple architectural extension that injects cross-image interaction inside the visual encoder of open-weight VLMs, with no new backbone and no full retraining.

2. A practical SFT recipe

Initialization and optimization choices (weight cloning, zero-init outputs, and stop-gradient on prior features) that stabilize finetuning and let the encoder learn state-dependent representations.

3. Broad validation

Effective and general across controlled comparison tasks, resolutions, model sizes, and VLM families, plus three real-world comparison domains.

Method

Four designs, one winning recipe

We study four ways to make a pretrained encoder stateful: Self-Ext, AdaLN-Zero, Cross, and Cross+FFN. Cross+FFN inserts token-level cross-attention (queries from the current image, keys/values from the previous image) followed by a feed-forward block, and it performs best on every controlled-task metric reported below.

Figure: The SVE design space. From left: the standard stateless VLM block, the four stateful designs, and a detailed layer view of Cross+FFN (Ours) with weight cloning, zero-initialized output projections, and a stop-gradient on the previous-image branch.

The recipe that matters

  • Weight cloning: copy Q/K/V and the first FFN layer from the matched self-attention block.
  • Zero-init outputs: preserve the pretrained feature distribution at the start of training (removing this hurts most).
  • Stop-gradient (K,V): treat previous-image features as a stable retrieval context.
  • Positional embeddings in the cross-attention pathway, plus a first-image key/value fallback.
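
The recipe above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions, not the authors' implementation. The parameter names (`wq`, `wk`, `wv`, `wo`, `w1`, `w2`), the dict layout, and the helpers `init_cross_ffn` and `cross_ffn` are hypothetical. Layer norms and positional embeddings are omitted, the GELU activation is assumed, and the first-image fallback is assumed to reuse the current image's own tokens as keys and values. The sketch shows why zero-initialized output projections leave the pretrained features untouched at the start of training.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gelu(x):
    """tanh approximation of GELU (the activation is an assumption)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def init_cross_ffn(self_block):
    """Build the added pathway from a pretrained self-attention block."""
    return {
        # Weight cloning: copy Q/K/V and the first FFN layer.
        "wq": self_block["wq"].copy(),
        "wk": self_block["wk"].copy(),
        "wv": self_block["wv"].copy(),
        "w1": self_block["w1"].copy(),
        # Zero-init outputs: both output projections start at zero.
        "wo": np.zeros_like(self_block["wo"]),
        "w2": np.zeros_like(self_block["w2"]),
    }


def cross_ffn(params, cur, prev=None):
    """cur: (N, D) current-image tokens; prev: (M, D) previous-image tokens or None."""
    # Assumed fallback for the first image: attend to the current tokens.
    kv = cur if prev is None else prev
    # NumPy has no autodiff. In a framework that does, kv would pass through
    # a stop-gradient here so no gradient reaches the previous-image branch.
    q = cur @ params["wq"]
    k = kv @ params["wk"]
    v = kv @ params["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    h = cur + (attn @ v) @ params["wo"]                # residual cross-attention
    return h + gelu(h @ params["w1"]) @ params["w2"]   # residual feed-forward


# Random stand-ins for pretrained weights and image tokens.
rng = np.random.default_rng(0)
D, H = 8, 16
shapes = {"wq": (D, D), "wk": (D, D), "wv": (D, D), "wo": (D, D), "w1": (D, H), "w2": (H, D)}
self_block = {name: rng.normal(size=shape) for name, shape in shapes.items()}
params = init_cross_ffn(self_block)
cur, prev = rng.normal(size=(5, D)), rng.normal(size=(7, D))
out = cross_ffn(params, cur, prev)
print(np.allclose(out, cur))  # True: the block is an identity at initialization
```

Because both output projections start at zero, the residual branches contribute nothing until training moves them, so the stateful encoder initially reproduces the stateless one.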

Capacity is not the explanation

A capacity-matched Self+FFN baseline uses the identical added pathway but cannot attend to the previous image. It improves over the stateless baseline yet stays below Cross+FFN on nearly every task, which isolates the gain to statefulness rather than added parameters or FLOPs.

Controlled Tasks

Three diagnostics for cross-image comparison

Figure: Controlled synthetic tasks. (1) Cross-image spatial aggregation over rich computer-use backgrounds, (2) multi-object visual differencing on CLEVR scenes with 30–40 objects, and (3) visual trajectory behavior cloning on VisGym interactive tasks.

Cross-image spatial aggregation

MAE / RMSE (×10⁻²), averaged across dot-distance & polygon-area tasks. Lower is better.

Method               | MAE ↓ | RMSE ↓
Baseline (Stateless) | 1.15  | 1.60
Self-Ext.            | 1.44  | 2.02
AdaLN-Zero           | 1.17  | 1.60
Cross                | 1.03  | 1.39
Cross+FFN (Ours)     | 0.72  | 0.96
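
For readers who want to reproduce this kind of number, the two error metrics can be computed as below. This is a generic sketch: the function name `mae_rmse`, the `scale` parameter, and the sample values are illustrative, and the source does not state how predictions and targets are normalized before the errors are taken.

```python
import math


def mae_rmse(preds, targets, scale=100.0):
    """Return (MAE, RMSE), multiplied by `scale`.

    scale=100 reports the errors in units of 10^-2.
    """
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equal length")
    errors = [p - t for p, t in zip(preds, targets)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae * scale, rmse * scale


# Made-up predictions and targets, for illustration only.
preds = [0.50, 0.20, 0.90]
targets = [0.49, 0.22, 0.90]
mae, rmse = mae_rmse(preds, targets)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}")
```

RMSE weights large errors more heavily than MAE, so a method can narrow the RMSE gap faster than the MAE gap when it mainly removes outliers.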

Visual differencing & behavior cloning

CLEVR-Multi-Change (CIDEr and accuracy; higher is better) and VisGym perplexity (lower is better).

Method               | CIDEr ↑ | Acc ↑ | VisGym PPL ↓
Baseline (Stateless) | 529.5   | 91.1  | 2.074
Self-Ext.            | 538.1   | 92.5  | 2.132
AdaLN-Zero           | 531.8   | 91.4  | 2.069
Cross                | 515.0   | 89.3  | 2.009
Cross+FFN (Ours)     | 543.9   | 92.7  | 1.944

VisGym PPL shown for the 3D Mental Rotation (Cube) task; Cross+FFN improves all four VisGym tasks.
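
Perplexity here can be read with the standard definition: the exponential of the mean negative log-likelihood over target tokens. The sketch below assumes natural-log token probabilities. The function name `perplexity` and the sample values are illustrative, and the source does not specify which tokens VisGym scoring covers.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustration: a model that assigns probability 0.5 to every target token
# is as uncertain as a fair coin, so its perplexity is about 2.
log_probs = [math.log(0.5)] * 4
ppl = perplexity(log_probs)
print(round(ppl, 3))
```

Under this definition a perplexity near 2 corresponds to roughly a two-way uncertainty per token, which gives a feel for the scale of the values in the table.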

Generality

Gains hold across resolution, size & backbone

On multi-object visual differencing, SVE beats the stateless baseline from 256² to 768² resolution and from 0.8B to 9B parameters, and smaller SVE models can match or outperform much larger stateless ones.

Figure: Generalization across resolution and model size. SVE (blue) vs. stateless baseline (yellow) across input resolution (top) and model size (bottom), on perplexity, BLEU-4, CIDEr, METEOR, ROUGE-L, and accuracy.

SVE generalizes across VLM backbones

CLEVR-Multi-Change (30–40 objects). ▲ = improvement over each backbone's stateless baseline (+SVE values shown).

Backbone       | PPL ↓        | BLEU-4 ↑  | CIDEr ↑      | METEOR ↑  | ROUGE-L ↑ | Acc ↑
Qwen3-VL-4B    | 1.268 ▲0.004 | 82.5 ▲2.5 | 482.1 ▲15.1 | 88.6 ▲1.3 | 86.8 ▲1.5 | 87.3 ▲0.7
Qwen3.5-4B     | 1.219 ▲0.010 | 92.7 ▲2.2 | 543.9 ▲14.4 | 95.4 ▲1.9 | 93.9 ▲1.6 | 92.7 ▲1.6
GLM-4.6V-Flash | 1.236 ▲0.005 | 92.4 ▲0.7 | 542.0 ▲3.8  | 95.0 ▲0.4 | 93.6 ▲0.4 | 92.2 ▲0.1
InternVL3.5-4B | 1.332 ▲0.026 | 68.2 ▲1.7 | 389.5 ▲11.5 | 77.8 ▲1.1 | 76.3 ▲1.2 | 77.4 ▲1.8
Gemma-3-4B     | 1.316 ▲0.083 | 68.4 ▲8.0 | 387.0 ▲45.6 | 78.0 ▲5.9 | 76.3 ▲5.9 | 77.9 ▲7.9

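
As a worked reading of the table, the stateless baseline can be recovered from each +SVE value and its improvement. The sketch assumes that for perplexity (lower is better) the improvement is a reduction, so the baseline is the +SVE value plus the delta. The helper name `stateless_baseline` is illustrative. For Qwen3.5-4B the recovered CIDEr and accuracy agree with the Baseline (Stateless) row of the visual differencing table in the Controlled Tasks section.

```python
# +SVE values and improvements for Qwen3.5-4B, taken from the table above.
sve = {"PPL": 1.219, "CIDEr": 543.9, "Acc": 92.7}
delta = {"PPL": 0.010, "CIDEr": 14.4, "Acc": 1.6}
lower_is_better = {"PPL"}


def stateless_baseline(metric):
    """Undo the improvement: add it back for lower-is-better metrics,
    subtract it for higher-is-better metrics."""
    if metric in lower_is_better:
        return round(sve[metric] + delta[metric], 3)
    return round(sve[metric] - delta[metric], 3)


baselines = {m: stateless_baseline(m) for m in sve}
for metric, value in baselines.items():
    print(metric, value)
```
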
Real-World Validation

From X-rays to satellites

Figure: Real-world qualitative examples. Longitudinal radiology (top), fine-grained image-edit comparison (bottom-left), and remote-sensing change captioning (bottom-right). SVE captures subtle changes the stateless baseline misses.

Longitudinal radiology: Medical-Diff-VQA

Captioning + RATE-style finding-level F1 over 27 chest findings.

Method           | CIDEr ↑ | ROUGE-L ↑ | Micro F1 ↑ | Change Acc ↑
Qwen3.5-4B (SFT) | 145.1   | 62.7      | 31.55      | 86.83
+ SVE (Ours)     | 178.9   | 66.3      | 32.20      | 89.21
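
Micro-averaged F1 pools true positives, false positives, and false negatives across all findings before computing a single score. The sketch below shows that generic computation. It is not the RATE protocol itself, whose extraction and matching steps are not described here, and the function name `micro_f1` and the counts are made up for illustration.

```python
def micro_f1(per_finding_counts):
    """per_finding_counts: iterable of (tp, fp, fn) tuples, one per finding."""
    counts = list(per_finding_counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    # With no positives predicted or present, define F1 as 0.
    return 2 * tp / denom if denom else 0.0


# Made-up counts for three findings (the benchmark uses 27).
counts = [(8, 2, 1), (1, 0, 3), (0, 1, 0)]
score = micro_f1(counts)
print(round(100 * score, 2))  # reported as a percentage
```

Because counts are pooled, frequent findings dominate the micro average, in contrast to a macro average that weights each finding equally.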

Fine-grained image comparison: ImgEdit

MLLM-as-judge pairwise preference counts.

vs. Baseline          | Base Win | Tied | SVE Win
Reference instruction | 296      | 758  | 346
Qwen3.5-4B (SFT)      | 171      | 1020 | 209
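
One way to read these counts is to separate ties from decisive comparisons. The sketch below uses the counts from the table; the summary statistics (tie rate, and the SVE share of decisive comparisons) are an illustrative choice, not metrics reported in the source, and the function name `summarize` is hypothetical.

```python
# (base_win, tied, sve_win) per comparison, from the table above.
counts = {
    "Reference instruction": (296, 758, 346),
    "Qwen3.5-4B (SFT)": (171, 1020, 209),
}


def summarize(base_win, tied, sve_win):
    """Total comparisons, tie rate, and SVE's share of non-tied outcomes."""
    total = base_win + tied + sve_win
    decisive = base_win + sve_win
    return {
        "total": total,
        "tie_rate": tied / total,
        "sve_share_of_decisive": sve_win / decisive,
    }


summary = {name: summarize(*c) for name, c in counts.items()}
for name, s in summary.items():
    print(f"{name}: n={s['total']}, ties={s['tie_rate']:.1%}, "
          f"SVE share of decisive={s['sve_share_of_decisive']:.1%}")
```

Both rows sum to the same number of comparisons, and in both SVE wins more of the decisive comparisons than it loses, although most comparisons are ties.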

Remote sensing change captioning: LEVIR-CC

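SVE improves the generalist baseline on every metric and surpasses all prior specialist models on ROUGE-L, CIDEr, and Sm*, while the specialists retain higher BLEU-4 and two of them (SAGE-CC and SACNet) higher METEOR. Sm* is the mean of BLEU-4, METEOR, ROUGE-L, and CIDEr.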

Method           | BLEU-4 ↑ | METEOR ↑ | ROUGE-L ↑ | CIDEr ↑ | Sm* ↑
Best prior specialist models
Chg2Cap          | 62.98    | 39.42    | 74.34     | 136.25  | 78.25
PromptCC         | 63.54    | 38.82    | 73.72     | 136.44  | 78.13
SAGE-CC          | 65.50    | 39.92    | 74.77     | 137.50  | 79.42
SACNet           | 65.57    | 40.30    | 75.68     | 138.34  | 79.97
Generalist VLMs
Qwen3.5-4B (SFT) | 60.70    | 39.42    | 76.03     | 142.26  | 79.60
+ SVE (Ours)     | 61.33    | 39.91    | 76.26     | 144.35  | 80.46

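
The Sm* column can be checked directly from the other four, since it is their arithmetic mean. The sketch below recomputes it from the table values; the function name `s_m_star` is illustrative.

```python
# (BLEU-4, METEOR, ROUGE-L, CIDEr, reported Sm*) from the table above.
rows = {
    "Chg2Cap": (62.98, 39.42, 74.34, 136.25, 78.25),
    "PromptCC": (63.54, 38.82, 73.72, 136.44, 78.13),
    "SAGE-CC": (65.50, 39.92, 74.77, 137.50, 79.42),
    "SACNet": (65.57, 40.30, 75.68, 138.34, 79.97),
    "Qwen3.5-4B (SFT)": (60.70, 39.42, 76.03, 142.26, 79.60),
    "+ SVE (Ours)": (61.33, 39.91, 76.26, 144.35, 80.46),
}


def s_m_star(bleu4, meteor, rouge_l, cider):
    """Arithmetic mean of the four captioning metrics."""
    return (bleu4 + meteor + rouge_l + cider) / 4


recomputed = {name: s_m_star(*vals[:4]) for name, vals in rows.items()}
for name, value in recomputed.items():
    print(f"{name}: {value:.2f} (reported {rows[name][4]:.2f})")
```

Because CIDEr is on a larger scale than the other three metrics, it carries the most weight in this average, which is why the generalist models can lead on Sm* while trailing on BLEU-4.
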
Cite

BibTeX

@article{wang2026sve,
  title   = {Stateful Visual Encoders for Vision-Language Models},
  author = {Wang, Zirui and Yu, Junwei and Yala, Adam and Chan, David M.
         and Gonzalez, Joseph E. and Darrell, Trevor},
  journal = {arXiv preprint arXiv:2606.04433},
  year  = {2026}
}