Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
Abstract
Agents with meta-cognitive deficits struggle with tool usage decisions, leading to inefficiencies; a new framework called HDPO addresses this through decoupled optimization channels for accuracy and efficiency.
The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective into a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum: the agent must first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
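The decoupled, conditional advantage scheme described above can be sketched as follows. This is a minimal reading of the abstract, not the paper's implementation: the function name, the GRPO-style group normalization, and the `eff_coef` mixing weight are all assumptions for illustration.

```python
import numpy as np

def hdpo_advantages(rewards_acc, tool_calls, eff_coef=0.1):
    """Sketch of a two-channel advantage estimate in the spirit of HDPO
    (our reading of the abstract; the paper's exact formulation may differ).

    rewards_acc: 1.0 if a rollout answered correctly, else 0.0.
    tool_calls:  number of tool invocations per rollout in the group.
    """
    r = np.asarray(rewards_acc, dtype=float)
    t = np.asarray(tool_calls, dtype=float)

    # Accuracy channel: standard group-normalized advantage.
    adv_acc = (r - r.mean()) / (r.std() + 1e-8)

    # Efficiency channel: normalized ONLY over correct rollouts, so the
    # tool-use penalty never competes with (or is drowned out by) the
    # variance of the accuracy reward during normalization.
    adv_eff = np.zeros_like(t)
    correct = r == 1.0
    if correct.sum() > 1:
        tc = t[correct]
        # Fewer tool calls among correct rollouts -> higher advantage.
        adv_eff[correct] = -(tc - tc.mean()) / (tc.std() + 1e-8)

    # Channels are combined only at the end, with a small efficiency weight.
    return adv_acc + eff_coef * adv_eff
```

Because incorrect rollouts receive zero efficiency signal, the penalty is strictly conditional: early in training (few correct rollouts) the gradient is dominated by accuracy, and tool economy only starts to matter once the task is being solved, yielding the curriculum effect the abstract describes.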
Community
🔗 Project Page: https://Accio-Lab.github.io/Metis
💻 GitHub: https://github.com/Accio-Lab/Metis
🤗 HuggingFace: https://huggingface.co/Accio-Lab/Metis-8B-RL
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization (2026)
- AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning (2026)
- PyVision-RL: Forging Open Agentic Vision Models via RL (2026)
- PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment (2026)
- VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning (2026)
- Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning (2026)
- SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning (2026)
Get this paper in your agent:
hf papers read 2604.08545
Don't have the latest CLI? Install it with:
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 2
Datasets citing this paper: 2