Visual-OPSD

Cross-Modal On-Policy Self-Distillation
for Efficient Unified Multimodal Reasoning

Distill the visual generation pathway's reasoning knowledge into text-only inference — no image generation at test time, yet outperforming the generative teacher.

14.3×

Inference Speedup

+3.40pp

Avg. over Teacher

6/9

Tasks Improved

+63.8pp

vs VLMs on VSP

📄 Read Paper 💻 GitHub

Pengyu Li^1,3†, Zhitao Gao^1,3, Lingling Zhang^1,2*, Muye Huang^1,3, Yuanming Li⁴, Fangzhi Xu¹, Jun Liu^1,2

¹Xi'an Jiaotong University ²MOE KLINNS Lab ³Shaanxi Province Key Lab of Big Data Knowledge Engineering ⁴Sun Yat-sen University

Overview

What is Visual-OPSD?

Unified multimodal models generate intermediate "visual thoughts" (VTs) via diffusion during reasoning, but this costs ~14× latency. We discover that rendered VT pixels are not load-bearing — yet the generation pathway encodes valuable reasoning knowledge. Visual-OPSD distills this knowledge into text-only inference via cross-modal on-policy self-distillation.

Visual-OPSD main results: matches teacher at 14x lower latency

Visual-OPSD matches its VT-generating teacher at 14× lower latency. (a) Visual-OPSD (green) ≥ teacher (purple) on 6/9 tasks. (b) Per-task Δ: VSP +10.0, VisPuzzle +8.5, BLINK-J +11.3. (c) Pareto: 74.0%/10.0s vs. 70.6%/142.8s.

Key Finding

Visual Thoughts: Load-Bearing Pixels or Pathway Knowledge?

Through controlled interventions on ThinkMorph, we reveal a surprising gap: rendered VT pixels contribute little to accuracy, yet the generation pathway encodes substantial distributional knowledge measurable via KL divergence (4.64 nats/token).

Inference-Time VT Intervention

Removing or corrupting intermediate VTs at inference leaves accuracy largely unchanged across all nine benchmarks. The rendered pixels are not load-bearing.

Cross-Modal Attention Analysis

Once generated, a VT dominates subsequent reasoning attention regardless of its content — even Gaussian noise receives the same attention allocation.

Teacher VT = None

Text-only SFT

63.75

Avg Accuracy (9 benchmarks)

Teacher VT = Gaussian Noise

Visual-OPSD-Noise

64.15

+0.40pp (≈ regularization only)

Teacher VT = Real VT

Visual-OPSD (Ours)

74.03

+10.28pp (semantic content)

VT Information Quality Scaling. The decisive gap (+10.28pp vs +0.40pp) confirms knowledge originates from VT semantic content, not regularization.

Key Capabilities

Six Pillars of Visual-OPSD

🧠

Cross-Modal Self-Distillation

First OPSD framework bridging the generation–understanding gap within a unified multimodal architecture. Teacher and student share identical weights.

⚡

14.3× Faster Inference

Eliminates 50-step diffusion at test time: 10.0s vs 142.8s per sample. No architectural changes or extra parameters needed.

📊

Student > Teacher

Distribution-level distillation de-noises VT artifacts: the student outperforms its generative teacher on 6/9 benchmarks (+3.40pp average).

🎯

On-Policy Trajectories

Student samples its own completions, keeping training on-policy. Image-skip injection prevents collapse into generation mode.

🔬

Rigorous Controls

Gaussian-noise control (+0.40pp vs +10.28pp) and KL gap-closing analysis (58.4% closure) confirm semantic VT content is the knowledge source.

🌟

Spatial Reasoning Champion

Dominates same-scale VLMs on spatial tasks: VSP 85.8% vs InternVL3.5-8B 8.2% and Qwen3-VL-8B 22.0%.

Method

Cross-Modal On-Policy Self-Distillation

Visual-OPSD exploits the information asymmetry between generation and understanding pathways within a single unified multimodal model. No external teacher, no architectural changes.

Overview of Visual-OPSD. A student π_θ(·|C_S) sees only [sys, ViT(x), q], while an EMA teacher π_θ̄(·|C_T) additionally receives privileged VTs. Both rescore the student's on-policy completion via per-token JSD. At inference, the student runs text-only — 14.3× faster, +3.40pp over teacher.

KL Diagnostic

Measure the distributional gap between VT-conditioned and question-only contexts. Confirms 4.64 nats/token of distillable generation knowledge.

On-Policy Sampling

Student generates completions from its current policy. Image-skip injection handles <image_start> tokens, keeping sampling on-policy without generation mode collapse.

JSD Distillation

Per-token JSD with pointwise clipping (τ=0.05) and top-K=256 truncation. EMA teacher (α=0.995) provides stable, progressively updated targets.

L(θ) = (1/T) Σ_t JSD_β(p_T^(t) ‖ p_S^(t)), β = 0.5 (symmetric JSD)

Prior interleaved visual CoT vs. Visual-OPSD. (Top) Previous methods iterate generate-then-reason, rendering each VT via 50-step diffusion at high cost. (Bottom) Visual-OPSD distills the generation pathway into a text-only student, yielding +3.40pp at 14.3× speedup with no image generation at inference.

Experiments

Main Results Across 9 Benchmarks

Visual-OPSD achieves the best average accuracy among open 7–8B models while running 14.3× faster than ThinkMorph.

Method	VSP ↑	VisPuzzle ↑	ChartQA ↑	VStar ↑	BLINK-J ↑	MMVP ↑	SAT ↑	BLINK ↑	CV-Bench ↑	Avg ↑	Lat. (s) ↓
Vision-Language Models (VLMs)
InternVL3.5-8B	8.17	34.75	76.26	68.59	71.33	76.33	45.33	59.60	81.99	58.04	–
Qwen3-VL-8B	22.00	37.00	82.55	84.29	68.66	77.66	54.00	69.43	85.63	64.58	–
GPT-4o	33.50	43.75	76.34	61.78	72.67	84.67	28.00	60.28	75.61	59.62	–
GPT-5	57.33	78.00	80.85	71.73	77.33	86.33	73.30	69.86	85.46	75.58	–
Unified Multimodal Models (UMMs)
Janus-Pro-7B	0.00	33.50	43.08	38.22	50.67	63.33	22.00	38.51	67.83	39.68	–
BAGEL-7B	0.83	35.00	61.82	55.49	67.33	70.33	44.67	47.66	76.03	51.02	–
ThinkMorph	75.83	77.50	78.00	67.01	66.00	78.33	52.67	59.49	80.86	70.63	142.8
Ours
Text-only SFT	49.17	63.50	81.66	56.02	68.67	76.33	46.63	54.39	77.37	63.75	28.5
Visual-OPSD-Noise	50.83	64.50	73.77	61.70	68.66	75.33	48.00	55.49	79.09	64.15	14.9
Visual-OPSD (Ours)	85.83	86.00	78.79	64.92	77.33	77.33	54.00	61.44	80.64	74.03	10.0
Δ vs. ThinkMorph	+10.00	+8.50	+0.79	−2.09	+11.33	−1.00	+1.33	+1.95	−0.22	+3.40	14.3×
Δ vs. Text-only SFT	+36.66	+22.50	−2.87	+8.90	+8.66	+1.00	+7.37	+7.05	+3.27	+10.28	2.9×

Per-sample Win/Loss. Visual-OPSD wins substantially more on VT-useful spatial tasks (BLINK-J net +11.3pp, VSP +10.0pp), while deficits are near-symmetric.

Per-token KL Divergence. Generation knowledge concentrates on content tokens (spatial labels, object references, quantities), while function words carry near-zero divergence.

Task-Specific Knowledge Transfer. Spatial reasoning tasks benefit most (mean Δ=+10.28pp over SFT), while chart understanding shows minimal change.

Inference Efficiency. Visual-OPSD is 14.3× faster than the VT teacher and 2.9× faster than text-only SFT, enabled by ~2× shorter outputs.

Case Studies

When VT Images Mislead Reasoning

We present cases where ThinkMorph's generated VT images actively harm reasoning, while Visual-OPSD avoids these failure modes by reasoning directly from the original input.

Case 1: Self-Reinforcing VT Confirmation
Q: Which image is the missing part in the first image? GT: B

ThinkMorph ✗ A

VT generated after committing to A visually reinforces the error, creating a self-reinforcing feedback loop.

Visual-OPSD ✓ B

Directly compares both options against original input, correctly identifying contour alignment.

Case 2: Pixel-Level Noise Masking
Q: Which image is the missing part in the first image? GT: B

ThinkMorph ✗ A

VT blending artifacts at patch boundaries mask spatial discontinuities that reveal incorrect placement.

Visual-OPSD ✓ B

Identifies correct spatial continuity by analyzing alignment of visual elements directly from the original.

Case 3: VT Strips Spatial Context
Q: Is she about to swing the racket towards her left or right? GT: A (right)

ThinkMorph ✗ B (left)

VT bounding box isolates the racket from full-body context, causing reasoning from a spatially impoverished representation.

Visual-OPSD ✓ A (right)

Integrates holistic body cues — foot positioning, torso orientation, and shoulder angle — to determine correct direction.

Case 4: VT-Induced Self-Contradiction
Q: Were any objects moved from their original positions? GT: A (chair moved right & towards camera)

ThinkMorph ✗ B

Round 0 reasoning contradicts itself; VT resolves internal ambiguity toward the wrong spatial direction.

Visual-OPSD ✓ A

Consistent reasoning without VT-induced self-contradiction. Correctly assesses chair movement direction.

Conclusion

Key Takeaways

We introduced Visual On-Policy Self-Distillation (Visual-OPSD), the first On-Policy SelfDistillation framework that operates across modalities within a single unified multimodal model. Visual-OPSD provides direct evidence that the visual generation pathway of UMMs encodes reasoning knowledge into the model’s representations beyond what the generated pixels themselves contain, and that this knowledge can be distilled into the text understanding pathway via on-policy JSD without any architectural changes. The Visual-OPSD student outperforms its generative teacher on 6/9 benchmarks (+3.40pp on average) while achieving a 14.3× inference speedup, and substantially exceeds same-scale dedicated VLMs on spatial reasoning tasks. The Visual-OPSD-Noise control (+0.40pp vs. +10.28pp) and the post-distillation KL closing analysis (58.4% vs. 3.5%) together confirm that the transferred signal specifically requires the generation pathway’s semantic content, ruling out regularization as the primary mechanism.

Visual-OPSD

Cross-Modal On-Policy Self-Distillation
for Efficient Unified Multimodal Reasoning

What is Visual-OPSD?