HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
Abstract
HiVLA presents a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert with cascaded cross-attention for improved robotic manipulation.
While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. At the high level, a VLM planner performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. To translate these plans into physical actions, the low level employs a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, allowing the DiT to focus purely on robust execution. The decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling at long-horizon skill composition and fine-grained manipulation of small objects in cluttered scenes.
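To make the decoupling concrete, here is a minimal sketch of the planner-then-expert control loop described in the abstract. All names (`Plan`, `plan_step`, `velocity_field`, `sample_actions`) are our own illustrative stand-ins, not the paper's API; the velocity field is a toy linear function, whereas HiVLA trains a DiT for it.

```python
# Hypothetical sketch of HiVLA's decoupled loop (names are ours, not the
# paper's). The VLM planner emits a structured plan (subtask + bbox); the
# flow-matching action expert integrates a learned velocity field from
# Gaussian noise toward an action vector with a few Euler steps.
from dataclasses import dataclass
import numpy as np

@dataclass
class Plan:
    subtask: str    # e.g. "pick up the red mug"
    bbox: tuple     # (x1, y1, x2, y2) target grounding box

def plan_step(image, instruction):
    """Stand-in for the VLM planner: task decomposition + visual grounding.
    A real system would query the VLM; here we return a fixed plan."""
    return Plan(subtask="grasp target", bbox=(40, 30, 90, 80))

def velocity_field(a_t, t, plan_features):
    """Stand-in for the DiT action expert's learned velocity v(a_t, t, c).
    Flow matching trains v to transport noise toward the action target;
    this toy linear field simply pulls a_t toward plan_features."""
    return plan_features - a_t

def sample_actions(plan_features, action_dim=7, steps=10, seed=0):
    """Euler integration of the flow ODE da/dt = v(a, t, c) from t=0 to 1."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(action_dim)   # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        a = a + dt * velocity_field(a, i * dt, plan_features)
    return a
```

With the toy field, each Euler step contracts the gap to the conditioning vector by a factor of `1 - dt`, so the sample ends up strictly closer to the target than the initial noise; a trained DiT would instead produce a multimodal action-chunk distribution.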
Community
✨ HiVLA is a hierarchical embodied manipulation agent system that combines visual grounding, semantic planning, and robust action execution for long-horizon and fine-grained robotic manipulation.
The cascaded cross-attention inside the DiT action expert is the clever hinge here: it sequentially fuses global features, high-res object crops grounded by the bbox, and a language embedding that encodes the required skill. That design keeps the low-level policy focused on robust execution while the high-level planner handles the grounding, which seems essential for long-horizon tasks in clutter. I'd be curious to see an ablation that removes the local-crop path, or replaces the language conditioning with just the global features, to quantify each piece's contribution. The emergent error correction, where the planner can re-propose a grounding if the diffusion expert misses a grasp, feels promising, but I'd like to see more systematic failure cases. Btw, the arxivlens breakdown helped me parse the method details; it's a solid walkthrough that covers this setup, including how the cascaded attention plays into the conditioning: https://arxivlens.com/PaperView/Details/hivla-a-visual-grounded-centric-hierarchical-embodied-manipulation-system-60-0dabfd0c
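For readers who want the fusion order spelled out, here is one plausible reading of that cascade as code. This is our interpretation, not the released implementation: we omit learned projection matrices and multi-head splitting, treating inputs as already-projected features, and only show the stage ordering (global → crop → language) with residual updates.

```python
# Minimal single-head sketch of cascaded cross-attention (our reading of
# the paper's description, not official code). Action tokens attend in
# sequence to (1) global scene features, (2) bbox-cropped local features,
# (3) the skill-language embedding, each stage refining the queries.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    """Scaled dot-product cross-attention; projection weights omitted
    for brevity, so queries/context act as pre-projected Q and K=V."""
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context

def cascaded_cross_attention(action_tokens, global_feats, crop_feats, lang_feats):
    d = action_tokens.shape[-1]
    x = action_tokens
    # Stage 1: global scene context
    x = x + cross_attention(x, global_feats, d)
    # Stage 2: high-resolution object-centric crop (grounded by the bbox)
    x = x + cross_attention(x, crop_feats, d)
    # Stage 3: skill semantics from the subtask instruction
    x = x + cross_attention(x, lang_feats, d)
    return x
```

An ablation like the one suggested above would amount to dropping Stage 2 or feeding `global_feats` in place of `lang_feats` and measuring the hit to success rate.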
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models (2026)
- Universal Pose Pretraining for Generalizable Vision-Language-Action Policies (2026)
- HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning (2026)
- VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models (2026)
- SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation (2026)
- Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models (2026)
- ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation (2026)