Vision-Language-Action (VLA) models that couple pretrained VLMs with continuous action experts achieve strong manipulation performance, yet generalize poorly to out-of-distribution (OOD) language instructions. The cause is a structural imbalance in VLA data: language is far less diverse than visual and action content, which biases policies toward visual shortcuts. Unlike discrete-action methods, which are protected by vision-language co-training, continuous action experts learn from scratch on this imbalanced data and produce noisy gradients that corrupt the VLM. From a Bayesian perspective, we factorize the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage method emphasizing Action expert PreTraining. Stage 1 pretrains the action expert as a VA prior on vision-action pairs, conditioned on visual tokens from a frozen VLM, which bypasses the language imbalance. Stage 2 injects language through a gated fusion mechanism that preserves the learned visuomotor prior. APT applies to mainstream VLA architectures (π- and GR00T-style) and consistently improves generalization to unseen instructions and compositional tasks.
We factorize the VLA policy from a Bayesian perspective, separating action generation into a language-agnostic Vision-Action (VA) prior \(\pi^{p}(\mathbf{a}\mid\mathbf{v})\) and a language-conditioned VLA likelihood \(L(\ell\mid\mathbf{v},\mathbf{a})\). The key observation is that although full VLA triplets suffer from language-vision imbalance, vision-action pairs alone are well-balanced and create no shortcut incentive.
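The factorization above can be written out with Bayes' rule. The display below is a sketch under stated assumptions: the symbol \(\pi(\mathbf{a}\mid\mathbf{v},\ell)\) for the full language-conditioned policy and the normalizer \(p(\ell\mid\mathbf{v})\) are our notation, not taken from the text, and the authors' exact formulation may differ.

\[
\pi(\mathbf{a}\mid\mathbf{v},\ell)
= \frac{\pi^{p}(\mathbf{a}\mid\mathbf{v})\, L(\ell\mid\mathbf{v},\mathbf{a})}{p(\ell\mid\mathbf{v})}
\;\propto\; \pi^{p}(\mathbf{a}\mid\mathbf{v})\, L(\ell\mid\mathbf{v},\mathbf{a})
\]

Since the normalizer does not depend on \(\mathbf{a}\), language only reweights actions that the VA prior already considers plausible for the current observation.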
Stage 1: VA Prior. A diffusion-based action expert is pretrained as a VA prior conditioned solely on visual tokens from a frozen VLM backbone, with language tokens masked from all self-attention computations. This bypasses the language imbalance and provides a principled initialization before any language is introduced.
Stage 2: VLA Likelihood. Language tokens are injected through newly introduced interleaved attention layers, and all layers are jointly finetuned starting from the Stage 1 checkpoint. A novel action expert design with a layer-wise gated fusion mechanism integrates VLM features into the action expert, inheriting the VLM's representational capacity while preserving the pretrained visuomotor prior.
The two-stage scheme applies to mainstream continuous-action VLA architectures, such as the π- and GR00T-style architectures.
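
The Stage 1 language masking and the Stage 2 gating described above can be illustrated with a toy, dependency-free sketch. Everything here is an assumption made for illustration: the function names `masked_attention` and `gated_fusion`, the single-query attention, and the scalar sigmoid gate are hypothetical and are not the authors' implementation, whose gate parameterization is not specified in this text.

```python
import math


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    peak = max(scores)  # finite as long as at least one token is unmasked
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(query, keys, values, is_language, use_language):
    """Single-query dot-product attention over context tokens (toy version).

    is_language[i] marks context token i as a language token.
    use_language=False mimics Stage 1 in this sketch: language tokens are
    masked out, so the output depends on visual tokens only.
    Requires at least one unmasked token.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = []
    for key, lang in zip(keys, is_language):
        if lang and not use_language:
            scores.append(float("-inf"))  # masked: weight becomes exactly 0
        else:
            scores.append(scale * sum(q * k for q, k in zip(query, key)))
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


def gated_fusion(action_hidden, vlm_feature, gate_logit):
    """Add a VLM feature to the action stream through a scalar sigmoid gate.

    Hypothetical form: a strongly negative gate_logit leaves the action
    stream almost unchanged, which is one way a pretrained prior could be
    preserved when fusion starts.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [h + gate * f for h, f in zip(action_hidden, vlm_feature)]


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]    # last token is language
    values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    is_language = [False, False, True]

    stage1_out, stage1_w = masked_attention(query, keys, values, is_language, False)
    stage2_out, stage2_w = masked_attention(query, keys, values, is_language, True)
    print("Stage 1 weights:", stage1_w)
    print("Stage 2 weights:", stage2_w)
    print("Fused:", gated_fusion(stage1_out, stage2_out, gate_logit=-4.0))
```

In this sketch the only difference between the two stages is the `use_language` flag, and the gate controls how strongly the language-aware feature perturbs the output of the visual-only pathway.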
| Method | Spatial (Pos) | Spatial (Task) | Object (Pos) | Object (Task) | Goal (Pos) | Goal (Task) | Long (Pos) | Long (Task) | Avg |
|---|---|---|---|---|---|---|---|---|---|
| OpenVLA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| π0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| π0.5 | 20 | 1 | 17 | 1 | 38 | 0 | 8 | 1 | 11 |
| LangForce | 11 | 48 | 10 | 10 | 4 | 11 | 2 | 15 | 14 |
| CaP-X | 12 | 14 | 22 | 18 | 26 | 17 | – | – | – |
| APT | 44 | 48 | 7 | 10 | 23 | 11 | 6 | 3 | 19 |
| APT (Ft VLM) | 62 | 62 | 24 | 17 | 10 | 20 | 12 | 12 | 27 |
| Method | KI | 2-Stage | Ft VLM | SO | UO | UC | UOUE |
|---|---|---|---|---|---|---|---|
| π0 | – | – | ✓ | 42 | 30 | 26 | 16 |
| π0.5 | ✓ | – | ✓ | 84 | 70 | 86 | 50 |
| APT | – | – | ✓ | 88 | 56 | 66 | 34 |
| APT | ✓ | – | – | 90 | 58 | 40 | 40 |
| APT | ✓ | ✓ | – | 96 | 74 | 90 | 62 |
| APT | – | ✓ | ✓ | 98 | 84 | 92 | 58 |
Test settings of increasing OOD difficulty: Seen Object (SO), Unseen Object (UO), Unseen Container (UC), Unseen Object & Unseen Environment (UOUE).
OOD language and object generalization within individual tasks.
| Method | Pick-Place (SO) | Pick-Place (UO) | Pick-Place (UOUC) | Pick-Place (UOUCUE) | Clutter Pick-Place (SO) | Clutter Pick-Place (UC) | Clutter Pick-Place (UO) | Clutter Pick-Place (UOUE) |
|---|---|---|---|---|---|---|---|---|
| π0.5 | 27/30 | 11/20 | 9/20 | 16/40 | 18/30 | 18/30 | 4/10 | 3/10 |
| APT | 29/30 | 17/20 | 16/20 | 28/40 | 25/30 | 22/30 | 7/10 | 6/10 |
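
To compare the two methods at a glance, the per-setting counts in the table above can be pooled into one success rate per task. The snippet below is a small sketch of that arithmetic. The pooling itself is our own aggregation and is not reported by the authors; the `RESULTS` layout and the `pooled` helper are hypothetical names introduced for illustration.

```python
# Success counts copied from the table above as "successes/trials" strings.
RESULTS = {
    "pi0.5": {
        "Pick-Place": ["27/30", "11/20", "9/20", "16/40"],
        "Clutter Pick-Place": ["18/30", "18/30", "4/10", "3/10"],
    },
    "APT": {
        "Pick-Place": ["29/30", "17/20", "16/20", "28/40"],
        "Clutter Pick-Place": ["25/30", "22/30", "7/10", "6/10"],
    },
}


def pooled(cells):
    """Sum successes and trials over a list of 'successes/trials' strings."""
    successes = 0
    trials = 0
    for cell in cells:
        s, t = cell.split("/")
        successes += int(s)
        trials += int(t)
    return successes, trials


for method, tasks in RESULTS.items():
    for task, cells in tasks.items():
        s, t = pooled(cells)
        print(f"{method:6s} {task:20s} {s}/{t} ({s / t:.1%})")
```

Pooled this way, π0.5 succeeds in 63/110 Pick-Place trials and 43/80 Clutter Pick-Place trials, against 90/110 and 60/80 for APT. Because pooling weights each setting by its trial count, the per-setting columns remain the more informative view of OOD difficulty.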
Pick-Place
Pick a specified object and place it on a specified container, under unseen objects (UO), unseen containers (UC), and unseen environments (UE).
Clutter Pick-Place
Pick-and-place amid cluttered distractor objects, under seen objects (SO), unseen objects (UO), and unseen environments (UE).
Sequential multi-task instruction following, via task coaching (per-task instructions issued in sequence) and task chaining (a single concatenated instruction).
@article{xu2026apt,
title = {APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies},
author = {Xu, Kechun and Zhu, Zhenjie and Chen, Anzhe and Xiong, Rong and Wang, Yue},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2026}
}