APT: Action Expert Pretraining Improves Instruction
Generalization of Vision-Language-Action Policies

Kechun Xu¹, Zhenjie Zhu¹, Anzhe Chen¹, Rong Xiong¹,², Yue Wang¹
¹Zhejiang University, ²Zhejiang Humanoid Robot Innovation Center

TL;DR. We improve out-of-distribution language generalization of continuous-action VLA policies through action expert pretraining. Guided by a Bayesian factorization \(\pi(\mathbf{a}\mid\mathbf{v},\ell)\propto\pi^{p}(\mathbf{a}\mid\mathbf{v})\,L(\ell\mid\mathbf{v},\mathbf{a})\), we first pretrain the action expert as a language-agnostic Vision-Action (VA) prior \(\pi^{p}(\mathbf{a}\mid\mathbf{v})\), then inject language to form the VLA likelihood.


Abstract

Vision-Language-Action (VLA) models that couple pretrained VLMs with continuous action experts achieve strong manipulation performance, yet generalize poorly to out-of-distribution (OOD) language instructions. The cause is a structural imbalance in VLA data: language is far less diverse than visual and action content, which biases policies toward visual shortcuts. Unlike discrete-action methods, which are protected by vision-language co-training, continuous action experts learn from scratch on this imbalanced data and produce noisy gradients that corrupt the VLM. From a Bayesian perspective, we factorize the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage method centered on Action expert PreTraining. Stage 1 pretrains the action expert as a VA prior on vision-action pairs, using visual tokens from a frozen VLM, which bypasses the language imbalance. Stage 2 injects language through a gated fusion mechanism that preserves the learned visuomotor prior. APT applies to mainstream VLA architectures (π- and GR00T-style) and consistently improves generalization to unseen instructions and compositional tasks.

Figure: Motivation of APT. A randomly initialized action expert gravitates toward visual shortcuts and produces noisy gradients that corrupt the VLM. Pretraining the action expert as a VA prior (VA Pretrain) instead yields clean gradients and enables effective instruction following.

Method

We factorize the VLA policy from a Bayesian perspective, separating action generation into a language-agnostic Vision-Action (VA) prior \(\pi^{p}(\mathbf{a}\mid\mathbf{v})\) and a language-conditioned VLA likelihood \(L(\ell\mid\mathbf{v},\mathbf{a})\). The key observation is that although full VLA triplets suffer from language-vision imbalance, vision-action pairs alone are well-balanced and create no shortcut incentive.

$$ \pi(\mathbf{a}\mid\mathbf{v},\ell)\;\propto\;\pi^{p}(\mathbf{a}\mid\mathbf{v})\;\cdot\;L(\ell\mid\mathbf{v},\mathbf{a}) $$
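The proportionality is Bayes' rule applied to the instruction. A short derivation follows, reading \(L(\ell\mid\mathbf{v},\mathbf{a})\) as the probability of instruction \(\ell\) given the observation and action, and writing \(Z(\mathbf{v},\ell)\) for the normalizer (a symbol introduced here for illustration, not used in the source):

$$ \pi(\mathbf{a}\mid\mathbf{v},\ell) \;=\; \frac{\pi^{p}(\mathbf{a}\mid\mathbf{v})\,L(\ell\mid\mathbf{v},\mathbf{a})}{Z(\mathbf{v},\ell)}, \qquad Z(\mathbf{v},\ell) \;=\; \int \pi^{p}(\mathbf{a}'\mid\mathbf{v})\,L(\ell\mid\mathbf{v},\mathbf{a}')\,d\mathbf{a}' $$

Because \(Z(\mathbf{v},\ell)\) does not depend on \(\mathbf{a}\), taking logarithms and differentiating with respect to \(\mathbf{a}\) gives

$$ \nabla_{\mathbf{a}}\log\pi(\mathbf{a}\mid\mathbf{v},\ell) \;=\; \nabla_{\mathbf{a}}\log\pi^{p}(\mathbf{a}\mid\mathbf{v}) \;+\; \nabla_{\mathbf{a}}\log L(\ell\mid\mathbf{v},\mathbf{a}) $$

Under this reading, language contributes an additive correction on top of the prior, and an instruction whose likelihood is constant in \(\mathbf{a}\) leaves the prior unchanged. The score-level form is an interpretation offered for intuition, not a statement of how the source trains its diffusion-based expert.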

Stage 1: VA Prior. A diffusion-based action expert is pretrained as a VA prior conditioned solely on visual tokens from a frozen VLM backbone, with language tokens masked from all self-attention computations. This bypasses the language imbalance and provides a principled initialization before any language is introduced.

Stage 2: VLA Likelihood. Language tokens are injected through newly introduced interleaved attention layers, and all layers are jointly finetuned starting from the Stage 1 checkpoint. A novel action expert design with a layer-wise gated fusion mechanism integrates VLM features into the action expert, inheriting the VLM's representational capacity while preserving the pretrained visuomotor prior.

The two-stage scheme applies to mainstream continuous-action VLA architectures, such as π- and GR00T-style architectures.
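The Stage 1 masking can be pictured as a boolean attention mask in which no query may attend to a language key. The sketch below is a minimal pure-Python illustration, not the authors' implementation: the token layout (vision, then language, then action), the function name `build_attention_mask`, and the `use_language` flag are all assumptions made for the example.

```python
from typing import List


def build_attention_mask(n_vision: int, n_language: int, n_action: int,
                         use_language: bool) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed (hypothetical) token order: vision, then language, then action.
    With use_language=False (Stage 1) every language key is hidden from all
    queries; with use_language=True (Stage 2) language keys become visible.
    """
    n_total = n_vision + n_language + n_action
    lang_start, lang_end = n_vision, n_vision + n_language
    mask = []
    for _query in range(n_total):
        row = []
        for key in range(n_total):
            is_language_key = lang_start <= key < lang_end
            row.append(use_language or not is_language_key)
        mask.append(row)
    return mask


# Toy sizes: 3 vision tokens, 2 language tokens, 2 action tokens.
stage1_mask = build_attention_mask(3, 2, 2, use_language=False)
stage2_mask = build_attention_mask(3, 2, 2, use_language=True)
```

In a real attention layer the `False` entries would typically be turned into a large negative bias before the softmax, so that masked keys receive zero weight. Switching `use_language` to `True` is the only change needed to expose language tokens in this toy setup.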

Figure: Overview of APT's two-stage training. Stage 1 (VA Prior Initialization): the action expert is pretrained on vision tokens from a frozen VLM, with language tokens masked. Stage 2 (VLA Joint Training): the learned weights are copied, language tokens are injected, and the VLM and action expert are jointly finetuned.
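One way to see how a gate can preserve the Stage 1 prior is a residual update whose gate starts at zero, so that the Stage 2 model reproduces the Stage 1 outputs before any finetuning step. The sketch below assumes a scalar `tanh` gate with zero initialization, a form common in gated cross-attention designs; the source does not specify the gate's form, and `gated_fusion`, the parameter names, and the toy values are all hypothetical.

```python
import copy
import math
from typing import Dict, List


def gated_fusion(action_hidden: List[float], vlm_feature: List[float],
                 gate: float) -> List[float]:
    """Add VLM features to the action stream, scaled by tanh(gate).

    Assumed form: out = h + tanh(gate) * f. With gate == 0 the output equals
    the action-expert hidden state, i.e. the Stage 1 behaviour is untouched.
    """
    scale = math.tanh(gate)
    return [h + scale * f for h, f in zip(action_hidden, vlm_feature)]


# Hypothetical Stage 1 checkpoint with toy parameter values.
stage1_params: Dict[str, List[float]] = {
    "layer0.weight": [0.5, -1.0],
    "layer1.weight": [2.0, 0.25],
}

# Stage 2 starts from a copy of the Stage 1 weights and adds one new gate
# per layer, initialised to zero (an assumption of this sketch).
stage2_params = copy.deepcopy(stage1_params)
stage2_gates = {"layer0.gate": 0.0, "layer1.gate": 0.0}

hidden = [0.3, -0.7]        # action-expert hidden state (toy values)
vlm_feature = [5.0, 5.0]    # language-conditioned VLM feature (toy values)

at_init = gated_fusion(hidden, vlm_feature, stage2_gates["layer0.gate"])
after_training = gated_fusion(hidden, vlm_feature, gate=1.0)
```

At initialization `at_init` equals `hidden`, so the visuomotor prior is intact. As the gate moves away from zero during joint finetuning, the VLM features contribute progressively more to the action stream.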

Simulation Results

Quantitative Results

Method | Spatial (Pos, Task) | Object (Pos, Task) | Goal (Pos, Task) | Long (Pos, Task) | Avg
OpenVLA000000000
π0000000000
π0.52011713808111
LangForce1148101041121514
CaP-X121422182617
APT444871023116319
APT (Ft VLM)626224171020121227
Table 1: Results on LIBERO-PRO (success rate %). Pos perturbs object positions; Task replaces the manipulated object for OOD language generalization.
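| Method | KI | 2-Stage | Ft VLM | SO | UO | UC | UOUE |
|---|---|---|---|---|---|---|---|
| π0 | | | | 42 | 30 | 26 | 16 |
| π0.5 | | | | 84 | 70 | 86 | 50 |
| APT | | | | 88 | 56 | 66 | 34 |
| APT | | | | 90 | 58 | 40 | 40 |
| APT | | | | 96 | 74 | 90 | 62 |
| APT | | | | 98 | 84 | 92 | 58 |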
Table 2: Results on Rigid Object Pick-Place (success rate %). All four APT rows are trained solely on VLA data and ablate three dimensions: KI (Knowledge Insulation), 2-Stage (action expert pretraining), and Ft VLM (joint VLM finetuning). SO: Seen Object, UO: Unseen Object, UC: Unseen Container, UE: Unseen Environment (UOUE combines UO and UE).

Qualitative Rollouts — Rigid Object Pick-Place

Test settings of increasing OOD difficulty: Seen Object (SO), Unseen Object (UO), Unseen Container (UC), Unseen Object & Unseen Environment (UOUE).

SO: “Move the blue Pepsi to the bowl.”
\( \pi_{0.5} \): 🙂 Success
APT (Ours): 🙂 Success
UO: “Move the orange bottle to the bowl.”
\( \pi_{0.5} \): 🙁 Failure
APT (Ours): 🙂 Success
UC: “Move the black Pepsi to the mug.”
\( \pi_{0.5} \): 🙁 Failure
APT (Ours): 🙂 Success
UOUE: “Move the blue can to the bowl.”
\( \pi_{0.5} \): 🙁 Failure
APT (Ours): 🙂 Success

Architecture Generalization

Figure: Bar chart of success-rate gains from action expert pretraining (2-Stage vs. 1-Stage) for π-style, GR00T-style, and APT architectures on the SO, UO, UC, and UOUE settings. Action expert pretraining applies to diverse architectures: the two-stage scheme improves generalization across almost all settings, with the largest gains on APT and GR00T-style backbones. Labels denote the success-rate gain of 2-Stage over 1-Stage.

Real-World Results

Single Task Generalization

OOD language and object generalization within individual tasks.

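| Method | Pick-Place: SO | Pick-Place: UO | Pick-Place: UOUC | Pick-Place: UOUCUE | Clutter Pick-Place: SO | Clutter Pick-Place: UC | Clutter Pick-Place: UO | Clutter Pick-Place: UOUE |
|---|---|---|---|---|---|---|---|---|
| π0.5 | 27/30 | 11/20 | 9/20 | 16/40 | 18/30 | 18/30 | 4/10 | 3/10 |
| APT | 29/30 | 17/20 | 16/20 | 28/40 | 25/30 | 22/30 | 7/10 | 6/10 |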
Table 3: Real-world single-task generalization results (successes / trials). 30 demonstrations per task; APT is compared against π0.5.
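Because Table 3 reports successes over trials with different trial counts per setting, a pooled rate is one simple way to summarize each method. The snippet below is a small arithmetic sketch using the counts transcribed from Table 3; the pooled figures are derived here and are not reported in the source, and `pooled_counts` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# (successes, trials) per setting, transcribed from Table 3 in column order.
results: Dict[str, List[Tuple[int, int]]] = {
    "pi0.5": [(27, 30), (11, 20), (9, 20), (16, 40),
              (18, 30), (18, 30), (4, 10), (3, 10)],
    "APT":   [(29, 30), (17, 20), (16, 20), (28, 40),
              (25, 30), (22, 30), (7, 10), (6, 10)],
}


def pooled_counts(cells: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Sum successes and trials over all settings."""
    successes = sum(s for s, _ in cells)
    trials = sum(t for _, t in cells)
    return successes, trials


for method, cells in results.items():
    successes, trials = pooled_counts(cells)
    print(f"{method}: {successes}/{trials} = {successes / trials:.1%}")
# pi0.5: 106/190 = 55.8%
# APT: 150/190 = 78.9%
```

Pooling weights each setting by its number of trials, so settings with 40 trials count more than those with 10. An unweighted mean of per-setting rates would give a slightly different summary.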

Pick-Place
Pick a specified object and place it on a specified container, under unseen objects (UO), unseen containers (UC), and unseen environments (UE).

UOUC: “Pick up the grape and place it on the drawer.”
\( \pi_{0.5} \): 🙁 Failure
APT (Ours): 🙂 Success
UOUCUE: “Pick up the grape and place it on the drawer.”
\( \pi_{0.5} \): 🙁 Failure
APT (Ours): 🙂 Success

Clutter Pick-Place
Pick-and-place amid cluttered distractor objects, under seen objects (SO), unseen objects (UO), and unseen environments (UE).

SO: “Pick up the pepper and place it on the blue plate.”
\( \pi_{0.5} \): 🙁 Failure
APT (Ours): 🙂 Success
UOUE: “Pick up the grape and place it on the pink plate.”
\( \pi_{0.5} \): 🙁 Failure
APT (Ours): 🙂 Success

Compositional Task Generalization

Sequential multi-task instruction following, via task coaching (per-task instructions issued in sequence) and task chaining (a single concatenated instruction).
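The difference between the two protocols comes down to how many prompts the policy receives. The sketch below illustrates this with a stub in place of a real rollout; `Policy`, `run_coaching`, `run_chaining`, and `stub_policy` are hypothetical names, and joining sub-instructions with a space is an assumption (the chaining examples on this page are phrased slightly differently from a literal concatenation).

```python
from typing import Callable, List

# Hypothetical policy interface: takes a language prompt, returns success.
Policy = Callable[[str], bool]


def run_coaching(policy: Policy, instructions: List[str]) -> List[bool]:
    """Task coaching: issue each per-task instruction in order, one prompt per task."""
    return [policy(instruction) for instruction in instructions]


def run_chaining(policy: Policy, instructions: List[str]) -> bool:
    """Task chaining: issue a single prompt that concatenates all instructions."""
    return policy(" ".join(instructions))


prompts_seen: List[str] = []


def stub_policy(prompt: str) -> bool:
    """Stand-in for a real rollout: records the prompt and reports success."""
    prompts_seen.append(prompt)
    return True


subtasks = [
    "Put the red tea bag into the box, then close the storage box.",
    "Pick up the pepper and place it on the blue plate.",
]

coaching_outcomes = run_coaching(stub_policy, subtasks)   # two prompts
chaining_outcome = run_chaining(stub_policy, subtasks)    # one prompt
```

Under coaching the policy only ever sees one subtask at a time, whereas under chaining it must work out the subtask boundaries and their order from a single prompt.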

Figure: Success rates of \( \pi_{0.5} \) and APT on Coaching (seen), Coaching (unseen), and Chaining, reported for the first task (T1), second task (T2), and the full sequence (T1+T2). APT degrades far less than \( \pi_{0.5} \) under task chaining, where a single instruction must be parsed into multiple subtasks.

Coaching
Per-task instructions issued in sequence.

Example 1: “(1) Put the red tea bag into the box, then close the storage box. (2) Pick up the pepper and place it on the blue plate.”
\( \pi_{0.5} \): 🙁 Failure
APT (Ours): 🙂 Success
Example 2: “(1) Put the pepper into the box, then close the storage box. (2) Pick up the red tea bag and place it on the blue plate.”
\( \pi_{0.5} \): 🙁 Failure
APT (Ours): 🙂 Success

Chaining
A single concatenated instruction.

Highlight. Task chaining is the hardest test of compositional generalization: a single instruction must be parsed into multiple sequential subtasks with no per-step prompting. On the full sequence (T1+T2), \( \pi_{0.5} \) collapses to 0%, while APT reaches 65% (and 90% / 70% on the individual subtasks), demonstrating compositional instruction following beyond what existing open-source VLA policies achieve.
Example 1: “Put the red tea bag into the box. Then close the storage box. Pick up the pepper and place it on the blue plate.”
\( \pi_{0.5} \): 🙁 Failure
APT (Ours): 🙂 Success
Example 2: “Put the red tea bag into the storage box. Then close the box. Pick up the carrot and place it on the blue plate.”
\( \pi_{0.5} \): 🙁 Failure
APT (Ours): 🙂 Success


BibTeX

@article{xu2026apt,
  title   = {APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies},
  author  = {Xu, Kechun and Zhu, Zhenjie and Chen, Anzhe and Xiong, Rong and Wang, Yue},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}