Visual-OPSD

Cross-Modal On-Policy Self-Distillation
for Efficient Unified Multimodal Reasoning

Distill the visual generation pathway's reasoning knowledge into text-only inference — no image generation at test time, yet outperforming the generative teacher.

14.3×
Inference Speedup
+3.40pp
Avg. over Teacher
6/9
Tasks Improved
+63.8pp
vs VLMs on VSP
Pengyu Li1,3†, Zhitao Gao1,3, Lingling Zhang1,2*, Muye Huang1,3, Yuanming Li4, Fangzhi Xu1, Jun Liu1,2
1Xi'an Jiaotong University    2MOE KLINNS Lab    3Shaanxi Province Key Lab of Big Data Knowledge Engineering    4Sun Yat-sen University

What is Visual-OPSD?

Unified multimodal models generate intermediate "visual thoughts" (VTs) via diffusion during reasoning, but this raises inference latency by roughly 14×. We find that the rendered VT pixels are not load-bearing, yet the generation pathway encodes valuable reasoning knowledge. Visual-OPSD distills this knowledge into text-only inference via cross-modal on-policy self-distillation.

Visual-OPSD main results: matches teacher at 14x lower latency

Visual-OPSD matches its VT-generating teacher at 14× lower latency. (a) Visual-OPSD (green) ≥ teacher (purple) on 6/9 tasks. (b) Per-task Δ: VSP +10.0, VisPuzzle +8.5, BLINK-J +11.3. (c) Pareto: 74.0%/10.0s vs. 70.6%/142.8s.

Visual Thoughts: Load-Bearing Pixels or Pathway Knowledge?

Through controlled interventions on ThinkMorph, we reveal a surprising gap: rendered VT pixels contribute little to accuracy, yet the generation pathway encodes substantial distributional knowledge measurable via KL divergence (4.64 nats/token).
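As a rough illustration of how a per-token gap in nats could be measured, the sketch below averages KL(p_VT ‖ p_question-only) over the tokens of one completion. It is a minimal sketch under stated assumptions, not the authors' diagnostic code: the KL direction, the helper names (`softmax`, `kl_nats`, `mean_token_kl`), and the toy logits are all hypothetical.

```python
import math

def softmax(logits):
    """Convert one row of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_nats(p, q, eps=1e-12):
    """KL(p || q) in nats; natural log gives nats rather than bits."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

def mean_token_kl(vt_logits, question_only_logits):
    """Average per-token KL between the VT-conditioned and question-only
    next-token distributions, scored on the same completion (assumed setup)."""
    per_token = [
        kl_nats(softmax(a), softmax(b))
        for a, b in zip(vt_logits, question_only_logits)
    ]
    return sum(per_token) / len(per_token), per_token

# Toy data: two token positions over a three-word vocabulary (hypothetical).
vt_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
question_only_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mean_kl, per_token = mean_token_kl(vt_logits, question_only_logits)
```

In this toy case only the first position contributes: the two contexts disagree there and agree exactly at the second position, so the mean is half the first token's divergence.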

Pilot intervention study

Inference-Time VT Intervention

Removing or corrupting intermediate VTs at inference leaves accuracy largely unchanged across all nine benchmarks. The rendered pixels are not load-bearing.

Attention analysis

Cross-Modal Attention Analysis

Once generated, a VT dominates subsequent reasoning attention regardless of its content: even Gaussian noise receives the same attention allocation.

Teacher VT = None (Text-only SFT): 63.75 average accuracy over 9 benchmarks
Teacher VT = Gaussian noise (Visual-OPSD-Noise): 64.15 average accuracy, +0.40pp (≈ regularization only)
Teacher VT = Real VT (Visual-OPSD, ours): 74.03 average accuracy, +10.28pp (semantic content)

VT Information Quality Scaling. The decisive gap (+10.28pp vs +0.40pp) confirms knowledge originates from VT semantic content, not regularization.
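The gap can be made concrete with the three averages reported above. The worked arithmetic below is only a restatement of those numbers; the symbols Δ_noise and Δ_real are introduced here for illustration and do not appear in the source.

```latex
\begin{align*}
\Delta_{\text{noise}} &= 64.15 - 63.75 = 0.40\ \text{pp} \\
\Delta_{\text{real}}  &= 74.03 - 63.75 = 10.28\ \text{pp} \\
\frac{\Delta_{\text{noise}}}{\Delta_{\text{real}}} &= \frac{0.40}{10.28} \approx 0.039
\end{align*}
```

Under this reading, a noise-conditioned teacher recovers roughly 4% of the average accuracy gain obtained with real VTs.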

Six Pillars of Visual-OPSD

🧠

Cross-Modal Self-Distillation

First OPSD framework bridging the generation–understanding gap within a unified multimodal architecture. Teacher and student share identical weights.

⚡

14.3× Faster Inference

Eliminates 50-step diffusion at test time: 10.0s vs 142.8s per sample. No architectural changes or extra parameters needed.

📊

Student > Teacher

Distribution-level distillation de-noises VT artifacts: the student outperforms its generative teacher on 6/9 benchmarks (+3.40pp average).

🎯

On-Policy Trajectories

The student samples its own completions, keeping training on-policy. Image-skip injection prevents collapse into generation mode.

🔬

Rigorous Controls

Gaussian-noise control (+0.40pp vs +10.28pp) and KL gap-closing analysis (58.4% closure) confirm semantic VT content is the knowledge source.

🌟

Spatial Reasoning Champion

Dominates same-scale VLMs on spatial tasks: VSP 85.8% vs InternVL3.5-8B 8.2% and Qwen3-VL-8B 22.0%.

Cross-Modal On-Policy Self-Distillation

Visual-OPSD exploits the information asymmetry between generation and understanding pathways within a single unified multimodal model. No external teacher, no architectural changes.

Visual-OPSD pipeline overview

Overview of Visual-OPSD. A student π_θ(· | C_S) sees only [sys, ViT(x), q], while an EMA teacher π_θ̄(· | C_T) additionally receives privileged VTs. Both rescore the student's on-policy completion via per-token JSD. At inference, the student runs text-only: 14.3× faster and +3.40pp over the teacher.

1

KL Diagnostic

Measure the distributional gap between VT-conditioned and question-only contexts. Confirms 4.64 nats/token of distillable generation knowledge.

2

On-Policy Sampling

Student generates completions from its current policy. Image-skip injection handles <image_start> tokens, keeping sampling on-policy without generation mode collapse.

3

JSD Distillation

Per-token JSD with pointwise clipping (τ=0.05) and top-K=256 truncation. EMA teacher (α=0.995) provides stable, progressively updated targets.

$$\mathcal{L}(\theta) = \frac{1}{T}\sum_{t=1}^{T} \mathrm{JSD}_{\beta}\!\left(p_T^{(t)} \,\|\, p_S^{(t)}\right), \qquad \beta = 0.5 \ \text{(symmetric JSD)}$$

Visual-OPSD vs prior methods comparison

Prior interleaved visual CoT vs. Visual-OPSD. (Top) Previous methods iterate generate-then-reason, rendering each VT via 50-step diffusion at high cost. (Bottom) Visual-OPSD distills the generation pathway into a text-only student, yielding +3.40pp at 14.3× speedup with no image generation at inference.
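The distillation objective described in the method (per-token JSD with β = 0.5, top-K = 256 truncation, pointwise clipping at τ = 0.05, and an EMA teacher with α = 0.995) can be sketched in a few lines of plain Python. This is a minimal sketch under several assumptions that the page does not spell out: JSD_β is taken to be the generalized form β·KL(p_T ‖ m) + (1−β)·KL(p_S ‖ m) with m = β·p_T + (1−β)·p_S, top-K is selected by teacher probability and both distributions are renormalized, "pointwise clipping" is read as capping each vocabulary entry's contribution at τ, and the EMA update is the usual θ̄ ← α·θ̄ + (1−α)·θ. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert one row of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def jsd_token(p_teacher, p_student, beta=0.5, top_k=256, tau=0.05):
    """Generalized JSD for one token position (assumed form, see lead-in)."""
    # Top-K truncation: keep the K entries the teacher rates highest (assumption).
    idx = sorted(range(len(p_teacher)), key=lambda i: p_teacher[i], reverse=True)[:top_k]
    z_t = sum(p_teacher[i] for i in idx)
    z_s = sum(p_student[i] for i in idx)
    total = 0.0
    for i in idx:
        pt = p_teacher[i] / z_t  # renormalize over the kept entries
        ps = p_student[i] / z_s
        mix = beta * pt + (1.0 - beta) * ps
        term = 0.0
        if pt > 0.0:
            term += beta * pt * math.log(pt / mix)
        if ps > 0.0:
            term += (1.0 - beta) * ps * math.log(ps / mix)
        # Pointwise clipping: cap each entry's contribution at tau (assumption).
        total += min(term, tau)
    return total

def sequence_jsd_loss(teacher_logits, student_logits, **kwargs):
    """Mean per-token JSD over a completion of T tokens: (1/T) * sum_t JSD."""
    per_token = [
        jsd_token(softmax(t), softmax(s), **kwargs)
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

def ema_update(teacher_params, student_params, alpha=0.995):
    """Move each teacher parameter a small step toward the student."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher_params, student_params)]

# Toy usage on a three-word vocabulary (hypothetical logits).
teacher_logits = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
student_logits = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
loss = sequence_jsd_loss(teacher_logits, student_logits)
```

With β = 0.5 each per-entry term is non-negative, so capping it at τ bounds how much any single vocabulary entry can contribute to the loss. That is one plausible reading of why clipping would stabilize training, offered here as an interpretation rather than the authors' stated rationale.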

Main Results Across 9 Benchmarks

Visual-OPSD achieves the best average accuracy among open 7–8B models while running 14.3× faster than ThinkMorph.

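| Method | VSP ↑ | VisPuzzle ↑ | ChartQA ↑ | VStar ↑ | BLINK-J ↑ | MMVP ↑ | SAT ↑ | BLINK ↑ | CV-Bench ↑ | Avg ↑ | Lat. (s) ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vision-Language Models (VLMs) | | | | | | | | | | | |
| InternVL3.5-8B | 8.17 | 34.75 | 76.26 | 68.59 | 71.33 | 76.33 | 45.33 | 59.60 | 81.99 | 58.04 | – |
| Qwen3-VL-8B | 22.00 | 37.00 | 82.55 | 84.29 | 68.66 | 77.66 | 54.00 | 69.43 | 85.63 | 64.58 | – |
| GPT-4o | 33.50 | 43.75 | 76.34 | 61.78 | 72.67 | 84.67 | 28.00 | 60.28 | 75.61 | 59.62 | – |
| GPT-5 | 57.33 | 78.00 | 80.85 | 71.73 | 77.33 | 86.33 | 73.30 | 69.86 | 85.46 | 75.58 | – |
| Unified Multimodal Models (UMMs) | | | | | | | | | | | |
| Janus-Pro-7B | 0.00 | 33.50 | 43.08 | 38.22 | 50.67 | 63.33 | 22.00 | 38.51 | 67.83 | 39.68 | – |
| BAGEL-7B | 0.83 | 35.00 | 61.82 | 55.49 | 67.33 | 70.33 | 44.67 | 47.66 | 76.03 | 51.02 | – |
| ThinkMorph | 75.83 | 77.50 | 78.00 | 67.01 | 66.00 | 78.33 | 52.67 | 59.49 | 80.86 | 70.63 | 142.8 |
| Ours | | | | | | | | | | | |
| Text-only SFT | 49.17 | 63.50 | 81.66 | 56.02 | 68.67 | 76.33 | 46.63 | 54.39 | 77.37 | 63.75 | 28.5 |
| Visual-OPSD-Noise | 50.83 | 64.50 | 73.77 | 61.70 | 68.66 | 75.33 | 48.00 | 55.49 | 79.09 | 64.15 | 14.9 |
| Visual-OPSD (Ours) | 85.83 | 86.00 | 78.79 | 64.92 | 77.33 | 77.33 | 54.00 | 61.44 | 80.64 | 74.03 | 10.0 |
| Δ vs. ThinkMorph | +10.00 | +8.50 | +0.79 | −2.09 | +11.33 | −1.00 | +1.33 | +1.95 | −0.22 | +3.40 | 14.3× |
| Δ vs. Text-only SFT | +36.66 | +22.50 | −2.87 | +8.90 | +8.66 | +1.00 | +7.37 | +7.05 | +3.27 | +10.28 | 2.9× |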
Win/Loss analysis
Per-sample Win/Loss. Visual-OPSD wins substantially more on VT-useful spatial tasks (BLINK-J net +11.3pp, VSP +10.0pp), while deficits are near-symmetric.

KL token heatmap
Per-token KL Divergence. Generation knowledge concentrates on content tokens (spatial labels, object references, quantities), while function words carry near-zero divergence.

Knowledge transfer by task
Task-Specific Knowledge Transfer. Spatial reasoning tasks benefit most (e.g. VSP +36.66pp over SFT, against a mean Δ of +10.28pp across all nine benchmarks), while chart understanding shows minimal change.

Latency comparison
Inference Efficiency. Visual-OPSD is 14.3× faster than the VT teacher and 2.9× faster than text-only SFT, enabled by ~2× shorter outputs.
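The headline figures (average accuracy, +3.40pp over the teacher, 6/9 tasks improved, 14.3× speedup) can be re-derived from the per-benchmark scores and latencies in the results table. The short script below is only a consistency check on those published numbers; the variable and function names are hypothetical.

```python
# Per-benchmark accuracy in table order:
# VSP, VisPuzzle, ChartQA, VStar, BLINK-J, MMVP, SAT, BLINK, CV-Bench
SCORES = {
    "ThinkMorph":    [75.83, 77.50, 78.00, 67.01, 66.00, 78.33, 52.67, 59.49, 80.86],
    "Text-only SFT": [49.17, 63.50, 81.66, 56.02, 68.67, 76.33, 46.63, 54.39, 77.37],
    "Visual-OPSD":   [85.83, 86.00, 78.79, 64.92, 77.33, 77.33, 54.00, 61.44, 80.64],
}
# Seconds per sample, from the latency column.
LATENCY_S = {"ThinkMorph": 142.8, "Text-only SFT": 28.5, "Visual-OPSD": 10.0}

def average(values):
    """Unweighted mean over the nine benchmarks."""
    return sum(values) / len(values)

def deltas(ours, baseline):
    """Per-benchmark difference in percentage points, rounded as in the table."""
    return [round(a - b, 2) for a, b in zip(ours, baseline)]

avg = {name: average(vals) for name, vals in SCORES.items()}
gain_over_teacher = round(avg["Visual-OPSD"] - avg["ThinkMorph"], 2)
gain_over_sft = round(avg["Visual-OPSD"] - avg["Text-only SFT"], 2)

delta_teacher = deltas(SCORES["Visual-OPSD"], SCORES["ThinkMorph"])
tasks_improved = sum(d > 0 for d in delta_teacher)

speedup_vs_teacher = LATENCY_S["ThinkMorph"] / LATENCY_S["Visual-OPSD"]

print(f"avg gain over teacher: {gain_over_teacher:+.2f}pp")
print(f"avg gain over text-only SFT: {gain_over_sft:+.2f}pp")
print(f"tasks improved over teacher: {tasks_improved}/9")
print(f"speedup vs teacher: {speedup_vs_teacher:.1f}x")
```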

When VT Images Mislead Reasoning

We present cases where ThinkMorph's generated VT images actively harm reasoning, while Visual-OPSD avoids these failure modes by reasoning directly from the original input.

Key Takeaways

We introduced Visual On-Policy Self-Distillation (Visual-OPSD), the first On-Policy Self-Distillation framework that operates across modalities within a single unified multimodal model. Visual-OPSD provides direct evidence that the visual generation pathway of UMMs encodes reasoning knowledge into the model's representations beyond what the generated pixels themselves contain, and that this knowledge can be distilled into the text understanding pathway via on-policy JSD without any architectural changes. The Visual-OPSD student outperforms its generative teacher on 6/9 benchmarks (+3.40pp on average) while achieving a 14.3× inference speedup, and substantially exceeds same-scale dedicated VLMs on spatial reasoning tasks. The Visual-OPSD-Noise control (+0.40pp vs. +10.28pp) and the post-distillation KL closing analysis (58.4% vs. 3.5%) together confirm that the transferred signal specifically requires the generation pathway's semantic content, ruling out regularization as the primary mechanism.

BibTeX

@article{li2026visual,
    title={Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning},
    author={Li, Pengyu and Gao, Zhitao and Zhang, Lingling and Huang, Muye and Li, Yuanming and Xu, Fangzhi and Liu, Jun},
    journal={arXiv preprint arXiv:2606.18974},
    year={2026}
}

Acknowledgement

We thank Jialong Wu for valuable discussions.