Vision-language models are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. Yet in existing open-weight VLMs, visual comparison happens only inside the language model: the visual encoder itself remains stateless, encoding each image without access to prior visual context. As a result, small but task-critical changes may be attenuated before the language model can compare them. We introduce a Stateful Visual Encoder (SVE), which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with SVEs achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning, and these gains hold across input resolutions, language-model sizes, and VLM backbones. On real-world tasks (longitudinal radiology, fine-grained image comparison, and remote sensing), SVEs consistently improve generalist VLM baselines and can match or surpass specialized models.
A simple architectural extension that injects cross-image interaction inside the visual encoder of open-weight VLMs, with no new backbone and no full retraining.
Initialization and optimization choices (weight cloning, zero-init outputs, and stop-gradient on prior features) that stabilize finetuning and learn state-dependent representations.
Effective and general across controlled comparison tasks, resolutions, model sizes, and VLM families, plus three real-world comparison domains.
We study four ways to make a pretrained encoder stateful: Self-Ext, AdaLN-Zero, Cross, and Cross+FFN. Cross+FFN inserts token-level cross-attention (queries from the current image, keys/values from the previous image) followed by a feed-forward block, and wins across the board.
A capacity-matched Self+FFN baseline uses the identical added pathway but cannot attend to the previous image. It improves over the stateless baseline, yet stays below Cross+FFN on nearly every task, isolating the gain to statefulness rather than added parameters or FLOPs.
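
The Cross+FFN pathway and its capacity-matched Self+FFN control can be sketched roughly as follows. This is a minimal, standard-library illustration rather than the authors' implementation: the single attention head, the ReLU activation, the absence of normalization layers, and every name (`cross_ffn_block`, `w_attn_out`, and so on) are assumptions made for the example. Treating Self+FFN as the same block attending to the current image is also an assumption. Stop-gradient on prior features needs an autograd framework and is only noted in a comment.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_ffn_block(cur, ctx, w_q, w_k, w_v, w_attn_out, w_ffn_in, w_ffn_out):
    """Residual cross-attention followed by a residual feed-forward block.

    cur: tokens of the current image, shape [n_cur][d]
    ctx: tokens being attended to, shape [n_ctx][d]. In a real training loop
         these prior features would be detached (stop-gradient).
    """
    q = matmul(cur, w_q)  # queries come from the current image
    k = matmul(ctx, w_k)  # keys come from the context image
    v = matmul(ctx, w_v)  # values come from the context image
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(q[0]))
    attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    attended = matmul(matmul(attn, v), w_attn_out)
    h = [[c + a for c, a in zip(c_row, a_row)] for c_row, a_row in zip(cur, attended)]
    hidden = [[max(0.0, x) for x in row] for row in matmul(h, w_ffn_in)]  # ReLU (assumed)
    ffn = matmul(hidden, w_ffn_out)
    return [[x + f for x, f in zip(h_row, f_row)] for h_row, f_row in zip(h, ffn)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Toy tokens (hypothetical values, d = 2).
cur = [[1.0, 0.0], [0.0, 1.0]]
prev = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
eye, zero = identity(2), zeros(2, 2)

# Zero-initialized output projections: the block starts as an exact identity,
# so the pretrained (stateless) features are unchanged at step 0.
out_init = cross_ffn_block(cur, prev, eye, eye, eye, zero, eye, zero)

# Once the attention output projection is non-zero, the output depends on prev.
out_cross = cross_ffn_block(cur, prev, eye, eye, eye, eye, eye, zero)

# Self+FFN-style control: same pathway, but it only sees the current image.
out_self = cross_ffn_block(cur, cur, eye, eye, eye, eye, eye, zero)

print(out_init == cur, out_cross != out_self)
```

The zero-initialized outputs make the added pathway a no-op at the start of finetuning, which is one plausible reading of why that choice stabilizes training; the same function called with `ctx=cur` shows how a control can share capacity without seeing the previous image.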
MAE / RMSE (×10⁻²), averaged across dot-distance & polygon-area tasks. Lower is better.
| Method | MAE ↓ | RMSE ↓ |
|---|---|---|
| Baseline (Stateless) | 1.15 | 1.60 |
| Self-Ext. | 1.44 | 2.02 |
| AdaLN-Zero | 1.17 | 1.60 |
| Cross | 1.03 | 1.39 |
| Cross+FFN (Ours) | 0.72 | 0.96 |
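
For reference, the two error metrics in this table can be computed as in the sketch below. The predictions and targets are hypothetical values chosen for illustration, and the way targets are normalized before scoring is not stated here, so only the formulas themselves should be read as definite.

```python
import math


def mae(preds, targets):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


def rmse(preds, targets):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


# Hypothetical predictions and ground-truth values (not taken from the paper).
preds = [0.10, 0.25, 0.40]
targets = [0.12, 0.25, 0.37]

# The table reports both metrics scaled by 10^2.
print(f"MAE  (x10^-2): {mae(preds, targets) * 100:.2f}")
print(f"RMSE (x10^-2): {rmse(preds, targets) * 100:.2f}")
```

RMSE is never smaller than MAE on the same errors and penalizes large misses more heavily, which is why the two columns are reported side by side.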
CLEVR-Multi-Change (higher is better) and VisGym perplexity (lower is better).
| Method | CIDEr ↑ | Acc ↑ | VisGym PPL ↓ |
|---|---|---|---|
| Baseline (Stateless) | 529.5 | 91.1 | 2.074 |
| Self-Ext. | 538.1 | 92.5 | 2.132 |
| AdaLN-Zero | 531.8 | 91.4 | 2.069 |
| Cross | 515.0 | 89.3 | 2.009 |
| Cross+FFN (Ours) | 543.9 | 92.7 | 1.944 |
VisGym PPL shown for the 3D Mental Rotation (Cube) task; Cross+FFN improves all four VisGym tasks.
On multi-object visual differencing, SVE beats the stateless baseline from 256² to 768² resolution and from 0.8B to 9B parameters, and smaller SVE models can match or outperform much larger stateless ones.
CLEVR-Multi-Change (30–40 objects). ▲ = improvement over each backbone's stateless baseline (+SVE shown).
| Backbone | PPL ↓ | BLEU-4 ↑ | CIDEr ↑ | METEOR ↑ | ROUGE-L ↑ | Acc ↑ |
|---|---|---|---|---|---|---|
| Qwen3-VL-4B | 1.268 ▲.004 | 82.5 ▲2.5 | 482.1 ▲15.1 | 88.6 ▲1.3 | 86.8 ▲1.5 | 87.3 ▲.7 |
| Qwen3.5-4B | 1.219 ▲.010 | 92.7 ▲2.2 | 543.9 ▲14.4 | 95.4 ▲1.9 | 93.9 ▲1.6 | 92.7 ▲1.6 |
| GLM-4.6V-Flash | 1.236 ▲.005 | 92.4 ▲.7 | 542.0 ▲3.8 | 95.0 ▲.4 | 93.6 ▲.4 | 92.2 ▲.1 |
| InternVL3.5-4B | 1.332 ▲.026 | 68.2 ▲1.7 | 389.5 ▲11.5 | 77.8 ▲1.1 | 76.3 ▲1.2 | 77.4 ▲1.8 |
| Gemma-3-4B | 1.316 ▲.083 | 68.4 ▲8.0 | 387.0 ▲45.6 | 78.0 ▲5.9 | 76.3 ▲5.9 | 77.9 ▲7.9 |
Captioning + RATE-style finding-level F1 over 27 chest findings.
| Method | CIDEr ↑ | R-L ↑ | Micro F1 ↑ | Chg Acc ↑ |
|---|---|---|---|---|
| Qwen3.5-4B (SFT) | 145.1 | 62.7 | 31.55 | 86.83 |
| + SVE (Ours) | 178.9 | 66.3 | 32.20 | 89.21 |
MLLM-as-judge pairwise preference counts.
| vs. Baseline | Base Win | Tied | SVE Win |
|---|---|---|---|
| Reference instruction | 296 | 758 | 346 |
| Qwen3.5-4B (SFT) | 171 | 1020 | 209 |
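
One way to read these counts is to turn them into rates. The sketch below does that arithmetic for both rows of the table; the function name `preference_summary` and the choice of summary statistics are illustrative assumptions, not something the source defines.

```python
def preference_summary(base_win, tied, sve_win):
    """Convert pairwise preference counts into simple rates."""
    total = base_win + tied + sve_win
    decided = base_win + sve_win  # comparisons where the judge picked a side
    return {
        "total": total,
        "base_win_rate": base_win / total,
        "tie_rate": tied / total,
        "sve_win_rate": sve_win / total,
        "sve_share_of_decided": sve_win / decided,
    }


# Counts copied from the table above (Base Win, Tied, SVE Win).
rows = {
    "Reference instruction": (296, 758, 346),
    "Qwen3.5-4B (SFT)": (171, 1020, 209),
}

summaries = {name: preference_summary(*counts) for name, counts in rows.items()}
for name, s in summaries.items():
    print(
        f"{name}: n={s['total']}, SVE wins {s['sve_win_rate']:.1%}, "
        f"baseline wins {s['base_win_rate']:.1%}, "
        f"SVE share of decided pairs {s['sve_share_of_decided']:.1%}"
    )
```

Both rows sum to the same number of comparisons, most of which are ties, so the share of decided pairs is a more informative figure than the raw win counts.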
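SVE improves the generalist baseline on every metric and surpasses the best prior specialist models on ROUGE-L, CIDEr, and Sm*, while the strongest specialists remain ahead on BLEU-4 and METEOR. Sm* averages BLEU-4, METEOR, ROUGE-L, and CIDEr.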
| Method | BLEU-4 ↑ | METEOR ↑ | ROUGE-L ↑ | CIDEr ↑ | Sm* ↑ |
|---|---|---|---|---|---|
| Best prior specialist models | | | | | |
| Chg2Cap | 62.98 | 39.42 | 74.34 | 136.25 | 78.25 |
| PromptCC | 63.54 | 38.82 | 73.72 | 136.44 | 78.13 |
| SAGE-CC | 65.50 | 39.92 | 74.77 | 137.50 | 79.42 |
| SACNet | 65.57 | 40.30 | 75.68 | 138.34 | 79.97 |
| Generalist VLMs | | | | | |
| Qwen3.5-4B (SFT) | 60.70 | 39.42 | 76.03 | 142.26 | 79.60 |
| + SVE (Ours) | 61.33 | 39.91 | 76.26 | 144.35 | 80.46 |
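
The Sm* column can be reproduced from the other four columns as a plain arithmetic mean. The short check below uses three rows copied from the table; the helper name `sm_star` is an assumption made for the example.

```python
def sm_star(bleu4, meteor, rouge_l, cider):
    """Average of BLEU-4, METEOR, ROUGE-L, and CIDEr."""
    return (bleu4 + meteor + rouge_l + cider) / 4


# (BLEU-4, METEOR, ROUGE-L, CIDEr) copied from the table above.
rows = {
    "SACNet": (65.57, 40.30, 75.68, 138.34),
    "Qwen3.5-4B (SFT)": (60.70, 39.42, 76.03, 142.26),
    "+ SVE (Ours)": (61.33, 39.91, 76.26, 144.35),
}

for name, scores in rows.items():
    print(f"{name}: Sm* = {sm_star(*scores):.2f}")
```

Because CIDEr sits on a much larger scale than the other three metrics, it dominates the average, which helps explain how SVE can lead on Sm* while trailing the specialists on BLEU-4.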