Abstract
Zero-shot sim-to-real transfer is demonstrated for robotic manipulation using large-scale synthetic data and vision-language models with flow-matching action heads, reaching a 79.2% real-world success rate on tabletop pick-and-place without any real-world fine-tuning.
A prevailing view in robot learning is that simulation alone is not enough; effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse synthetic training data, we show that zero-shot transfer to the real world is not only possible but effective for both static and mobile manipulation. We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.8 million expert trajectories for articulated-object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot, a Molmo2-based multi-frame vision-language model with a flow-matching action head; MolmoBot-Pi0, which replicates the π_0 architecture to enable direct comparison; and MolmoBot-SPOC, a lightweight policy suitable for edge deployment and amenable to RL fine-tuning. We evaluate on two robotic platforms: the Franka FR3 for tabletop manipulation and the Rainbow Robotics RB-Y1 mobile manipulator for door opening, drawer manipulation, cabinet interaction, and mobile pick-and-place. Without any real-world fine-tuning, our policies achieve zero-shot transfer to unseen objects and environments. On tabletop pick-and-place, MolmoBot achieves a 79.2% success rate in real-world evaluations across four settings, outperforming π_{0.5} at 39.2%. Our results demonstrate that procedural environment generation combined with diverse articulated assets can produce robust manipulation policies that generalize broadly to the real world. Technical Blog: https://allenai.org/blog/molmobot-robot-manipulation
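
To make the "flow-matching action head" concrete, here is a minimal sketch of the general technique: the head regresses the velocity field that transports Gaussian noise to an expert action chunk, conditioned on a vision-language embedding, and actions are sampled at inference by integrating the learned ODE. This is an illustration of standard conditional flow matching, not the paper's implementation; the MLP backbone, the dimensions, the linear interpolant, and the Euler sampler are all assumptions.

```python
import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    """Illustrative flow-matching action head (not the paper's architecture).
    Predicts the velocity that moves a noisy action toward the expert action,
    conditioned on an observation embedding from a VLM backbone."""

    def __init__(self, obs_dim=512, action_dim=7, hidden=256):
        super().__init__()
        self.action_dim = action_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs_emb, noisy_action, t):
        # t has shape (batch, 1); concatenate all conditioning inputs.
        return self.net(torch.cat([obs_emb, noisy_action, t], dim=-1))

def flow_matching_loss(head, obs_emb, expert_action):
    """Conditional flow-matching objective: regress the constant velocity
    (action - noise) along the linear path from noise to the expert action."""
    eps = torch.randn_like(expert_action)
    t = torch.rand(expert_action.shape[0], 1)     # time in [0, 1]
    x_t = (1 - t) * eps + t * expert_action       # linear interpolant
    v_target = expert_action - eps                # velocity of that path
    v_pred = head(obs_emb, x_t, t)
    return torch.mean((v_pred - v_target) ** 2)

@torch.no_grad()
def sample_action(head, obs_emb, steps=10):
    """Euler integration of the learned ODE from pure noise to an action."""
    x = torch.randn(obs_emb.shape[0], head.action_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs_emb.shape[0], 1), i * dt)
        x = x + dt * head(obs_emb, x, t)
    return x
```

In practice, π_0-style models of this kind condition on multi-frame visual tokens and predict short action chunks rather than single steps, but the train/sample loop above is the same shape: one regression loss during training, a few ODE steps at control time.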
Community
MolmoBot demonstrates zero-shot real-world manipulation via large-scale procedural simulation, releasing MolmoBot-Data and the open-source MolmoBot-Engine pipeline to train robust policies without real-world fine-tuning.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- MobileManiBench: Simplifying Model Verification for Mobile Manipulation (2026)
- Point Bridge: 3D Representations for Cross Domain Policy Learning (2026)
- AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation (2026)
- Mirage2Matter: A Physically Grounded Gaussian World Model from Video (2026)
- Scaling World Model for Hierarchical Manipulation Policies (2026)
- Green-VLA: Staged Vision-Language-Action Model for Generalist Robots (2026)
- LAP: Language-Action Pre-Training Enables Zero-shot Cross-Embodiment Transfer (2026)