Seeing to Act, Prompting to Specify:
A Bayesian Factorization of Vision Language Action Policy

Kechun Xu§,¶     Zhenjie Zhu§     Anzhe Chen§      Shuqi Zhao      Qing Huang§      Yifei Yang§
Haojian Lu§      Rong Xiong§     Masayoshi Tomizuka¶,†      Yue Wang§,†
§Zhejiang University     ¶UC Berkeley     †Corresponding Author
Paper | Code (Coming Soon)

TL;DR: BayesVLA decomposes the policy into a vision-action prior and a language-conditioned likelihood. The vision-action prior leverages visual information for action generation (seeing to act), while the language-conditioned likelihood aligns this action prior with the language instruction (prompting to specify).

Abstract

The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. Co-training with external reasoning data helps, but it requires careful tuning and adds data-collection overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a vision-action prior, which supports seeing to act, and a language-conditioned likelihood, which enables prompting to specify. This factorization inherently preserves generalization and promotes instruction following. We further incorporate pre- and post-contact phases to better leverage pre-trained foundation models. An information-theoretic analysis formally validates the effectiveness of our approach in mitigating shortcut learning. Extensive experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods.



Method

Given VLA datasets with modality imbalance, BayesVLA models the VLA policy as the product of a prior and a likelihood. In stage 1, we train a prior model \( \pi^{p}(\mathbf{a}\mid\mathbf{v}) \) that takes visual input \( \mathbf{v} \) and generates a multimodal action distribution. In stage 2, building on this prior, we train a likelihood model \( L(\ell\mid\mathbf{v},\mathbf{a}) \) that aligns the action prior with the language instruction \( \ell \):

$$ \pi(\mathbf{a}\mid\mathbf{v},\ell) \;\propto\; \pi^{p}(\mathbf{a}\mid\mathbf{v})\; L(\ell\mid\mathbf{v},\mathbf{a}) $$
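The factorization above can be read as a sample-and-reweight procedure: the prior proposes visually plausible actions, and the likelihood rescores them against the instruction. Below is a minimal inference-time sketch of this idea, not necessarily the exact procedure used in the paper; prior and likelihood are hypothetical callables standing in for the two trained models.

import torch

def factorized_policy_sample(prior, likelihood, v, instruction, num_candidates=64):
    """Sketch of pi(a | v, l) ∝ pi^p(a | v) * L(l | v, a) via sample-and-reweight.

    prior:      hypothetical callable, prior(v, n) -> (n, action_dim) samples from pi^p(a | v)
    likelihood: hypothetical callable, likelihood(v, a, l) -> (n,) values of log L(l | v, a)
    """
    with torch.no_grad():
        # Seeing to act: propose multimodal action candidates from vision alone.
        candidates = prior(v, num_candidates)              # (n, action_dim)

        # Prompting to specify: score each candidate against the instruction.
        log_lik = likelihood(v, candidates, instruction)   # (n,)

        # Self-normalized weights over the candidate set; the prior's weight is
        # implicit because all candidates were drawn from pi^p(a | v).
        weights = torch.softmax(log_lik, dim=0)            # (n,)

        # Pick the most instruction-consistent candidate (or resample by weight).
        action = candidates[torch.argmax(weights)]
    return action, candidates, weights

Under this two-stage view, the prior is trained only on vision-action pairs and is never exposed to the low-diversity language channel, while the likelihood learned in stage 2 is what pulls the shared prior toward a specific instruction.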

Simulation Results

Test Settings: Seen Object (SO), Unseen Object (UO), Unseen Container (UC), Unseen Environment (UE)

Rigid Object Pick-Place

SO. Task instruction: Move the green bottle to the bowl (\( \pi_{0.5} \) vs. BayesVLA)

UO. Task instruction: Move the orange bottle to the bowl (\( \pi_{0.5} \) vs. BayesVLA)

UC. Task instruction: Move the black Pepsi to the mug (\( \pi_{0.5} \) vs. BayesVLA)

UO+UE. Task instruction: Move the orange soda to the bowl (\( \pi_{0.5} \) vs. BayesVLA)

Articulated Object Manipulation

SO. Task instruction: Open the oven (\( \pi_{0.5} \) vs. BayesVLA)

SO. Task instruction: Open the drawer (\( \pi_{0.5} \) vs. BayesVLA)

UO. Task instruction: Open the microwave (\( \pi_{0.5} \) vs. BayesVLA)

Real-World Results

Pick-Place

UO+UC (\( \pi_{0.5} \) vs. BayesVLA)

UO+UC+UE (\( \pi_{0.5} \) vs. BayesVLA)

Fridge Storage

UO (\( \pi_{0.5} \) vs. BayesVLA)

UO+UE (\( \pi_{0.5} \) vs. BayesVLA)

Clutter Pick-Place

SO (\( \pi_{0.5} \) vs. BayesVLA)

UC (\( \pi_{0.5} \) vs. BayesVLA)

BibTeX

@article{xu2025bayesvla,
  title={Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy},
  author={Xu, Kechun and Zhu, Zhenjie and Chen, Anzhe and Zhao, Shuqi and Huang, Qing and Yang, Yifei and Lu, Haojian and Xiong, Rong and Tomizuka, Masayoshi and Wang, Yue},
  journal={arXiv preprint arXiv:},
  year={2025}
}