The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. While co-training with external reasoning data helps, it requires careful tuning and incurs data-collection overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is far lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visual-action prior, which supports seeing-to-act, and a language-conditioned likelihood, which enables prompting-to-specify. This factorization inherently preserves generalization and promotes instruction following. We further incorporate pre- and post-contact phases to better leverage pre-trained foundation models. An information-theoretic analysis formally validates the effectiveness of our approach in mitigating shortcut learning. Extensive experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods.
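The factorization described above follows Bayes' rule; as a sketch of the decomposition (the symbol \( \pi^{l} \) for the likelihood term is our shorthand, mirroring the prior notation \( \pi^{p} \) used below):

```latex
\pi(\mathbf{a} \mid \mathbf{v}, \mathbf{l})
  \;\propto\;
  \underbrace{\pi^{l}(\mathbf{l} \mid \mathbf{a}, \mathbf{v})}_{\text{language-conditioned likelihood}}
  \;\cdot\;
  \underbrace{\pi^{p}(\mathbf{a} \mid \mathbf{v})}_{\text{visual-action prior}}
```

The prior alone captures what actions are plausible given the scene; the likelihood reweights those actions toward the ones the instruction specifies.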
Given VLA datasets with modality imbalance, BayesVLA models the VLA policy as the product of a prior and a likelihood. In stage 1, we train a prior model \( \pi^{p}(\mathbf{a}\mid\mathbf{v}) \) that takes visual input and generates a multimodal action distribution. Building on this prior, in stage 2, we train the likelihood model to further align the action prior with the language instruction.
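As a minimal sketch of how such a factorized policy could be queried at test time, one can draw candidate actions from the visual prior and reweight them by a language-conditioned likelihood. The sampler and scorer below are hypothetical stand-ins (Gaussian samples and a squared-error score), not the paper's trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_actions(visual_feat, n=8):
    # Hypothetical stand-in for the stage-1 prior pi^p(a | v):
    # returns n candidate action vectors conditioned on the visual input.
    return rng.normal(loc=visual_feat.mean(), scale=1.0, size=(n, 4))

def language_log_likelihood(actions, instruction_embed):
    # Hypothetical stand-in for the stage-2 likelihood: scores how well
    # each candidate action matches the language instruction.
    return -np.sum((actions - instruction_embed) ** 2, axis=1)

def select_action(visual_feat, instruction_embed, n=8):
    # Bayesian reweighting: posterior weight ∝ likelihood × prior sample.
    candidates = sample_prior_actions(visual_feat, n)
    log_w = language_log_likelihood(candidates, instruction_embed)
    w = np.exp(log_w - log_w.max())  # stabilized softmax over log-weights
    w /= w.sum()                     # normalized posterior weights
    return candidates[np.argmax(w)], w

action, weights = select_action(np.zeros(3), np.zeros(4))
```

Because the prior is trained without language, it retains visual generalization; the instruction only reranks its proposals, which is the intuition behind the two-stage design.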
Test Settings: Seen Object (SO), Unseen Object (UO), Unseen Container (UC), Unseen Environment (UE)
[Video gallery: side-by-side rollouts of \( \pi_{0.5} \) and BayesVLA under the SO, UO, UC, UO+UC, UO+UE, and UO+UC+UE settings.]
@article{xu2025bayesvla,
title={Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy},
author={Xu, Kechun and Zhu, Zhenjie and Chen, Anzhe and Zhao, Shuqi and Huang, Qing and Yang, Yifei and Lu, Haojian and Xiong, Rong and Tomizuka, Masayoshi and Wang, Yue},
journal={arXiv preprint arXiv:},
year={2025}
}