New Papers - a ThreeSR Collection

ThreeSR 's Collections

New Papers

updated 20 days ago

Upvote

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Paper • 2503.10615 • Published Mar 13, 2025 • 17
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation

Paper • 2503.10630 • Published Mar 13, 2025 • 6
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Paper • 2503.09516 • Published Mar 12, 2025 • 41
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Paper • 2503.07536 • Published Mar 10, 2025 • 89
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Paper • 2503.07572 • Published Mar 10, 2025 • 48
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Paper • 2503.08625 • Published Mar 11, 2025 • 27
OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

Paper • 2503.03734 • Published Mar 5, 2025 • 1
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

Paper • 2503.10460 • Published Mar 13, 2025 • 30
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Paper • 2503.21696 • Published Mar 27, 2025 • 23
ViLBench: A Suite for Vision-Language Process Reward Modeling

Paper • 2503.20271 • Published Mar 26, 2025 • 7
Gemini Robotics: Bringing AI into the Physical World

Paper • 2503.20020 • Published Mar 25, 2025 • 31
Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26, 2025 • 174
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Paper • 2503.19757 • Published Mar 25, 2025 • 51
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users

Paper • 2503.02268 • Published Mar 4, 2025 • 11
Unified Reward Model for Multimodal Understanding and Generation

Paper • 2503.05236 • Published Mar 7, 2025 • 124
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Paper • 2503.05592 • Published Mar 7, 2025 • 27
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Paper • 2503.13444 • Published Mar 17, 2025 • 20
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

Paper • 2503.11579 • Published Mar 14, 2025 • 21
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Paper • 2503.10291 • Published Mar 13, 2025 • 36
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Paper • 2503.15558 • Published Mar 18, 2025 • 52
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

Paper • 2503.12797 • Published Mar 17, 2025 • 32
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Paper • 2504.01990 • Published Mar 31, 2025 • 305
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Paper • 2504.00072 • Published Mar 31, 2025 • 6
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Paper • 2503.24290 • Published Mar 31, 2025 • 62
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Paper • 2503.01743 • Published Mar 3, 2025 • 91
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model

Paper • 2502.18906 • Published Feb 26, 2025 • 12
Qwen2.5-VL Technical Report

Paper • 2502.13923 • Published Feb 19, 2025 • 218
Agentic Knowledgeable Self-awareness

Paper • 2504.03553 • Published Apr 4, 2025 • 27
Slow-Fast Architecture for Video Multi-Modal Large Language Models

Paper • 2504.01328 • Published Apr 2, 2025 • 7
Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10, 2025 • 142
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

Paper • 2504.07934 • Published Apr 10, 2025 • 21
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

Paper • 2504.07096 • Published Apr 9, 2025 • 78
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

Paper • 2504.05541 • Published Apr 7, 2025 • 16
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Paper • 2504.06958 • Published Apr 9, 2025 • 13
Efficient Reinforcement Finetuning via Adaptive Curriculum Learning

Paper • 2504.05520 • Published Apr 7, 2025 • 11
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning

Paper • 2503.22738 • Published Mar 26, 2025 • 17
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Paper • 2504.03601 • Published Apr 4, 2025 • 18
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

Paper • 2504.03561 • Published Apr 4, 2025 • 18
Towards Trustworthy GUI Agents: A Survey

Paper • 2503.23434 • Published Mar 30, 2025 • 21
Sleep-time Compute: Beyond Inference Scaling at Test-time

Paper • 2504.13171 • Published Apr 17, 2025 • 16
BitNet b1.58 2B4T Technical Report

Paper • 2504.12285 • Published Apr 16, 2025 • 87
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14, 2025 • 311
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Paper • 2504.08942 • Published Apr 11, 2025 • 29
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

Paper • 2504.09641 • Published Apr 13, 2025 • 16
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

Paper • 2504.10449 • Published Apr 14, 2025 • 15
SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

Paper • 2504.07891 • Published Apr 10, 2025 • 5
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Paper • 2504.11536 • Published Apr 15, 2025 • 63
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Paper • 2504.11468 • Published Apr 10, 2025 • 30
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

Paper • 2504.16656 • Published Apr 23, 2025 • 59
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

Paper • 2504.17192 • Published Apr 24, 2025 • 124
Process Reward Models That Think

Paper • 2504.16828 • Published Apr 23, 2025 • 19
Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 66
Progent: Programmable Privilege Control for LLM Agents

Paper • 2504.11703 • Published Apr 16, 2025 • 6
TTRL: Test-Time Reinforcement Learning

Paper • 2504.16084 • Published Apr 22, 2025 • 123
Learning to Reason under Off-Policy Guidance

Paper • 2504.14945 • Published Apr 21, 2025 • 88
ToolRL: Reward is All Tool Learning Needs

Paper • 2504.13958 • Published Apr 16, 2025 • 49
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark

Paper • 2504.13805 • Published Apr 18, 2025 • 11
AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale

Paper • 2505.08311 • Published May 13, 2025 • 19
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

Paper • 2505.07263 • Published May 12, 2025 • 30
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

Paper • 2505.05467 • Published May 8, 2025 • 13
Visual Planning: Let's Think Only with Images

Paper • 2505.11409 • Published May 16, 2025 • 57
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Paper • 2505.13227 • Published May 19, 2025 • 46
Efficient Agent Training for Computer Use

Paper • 2505.13909 • Published May 20, 2025 • 44
One RL to See Them All: Visual Triple Unified Reinforcement Learning

Paper • 2505.18129 • Published May 23, 2025 • 63
Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models

Paper • 2505.17826 • Published May 23, 2025 • 10
Interactive Post-Training for Vision-Language-Action Models

Paper • 2505.17016 • Published May 22, 2025 • 6
ARM: Adaptive Reasoning Model

Paper • 2505.20258 • Published May 26, 2025 • 45
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

Paper • 2506.00123 • Published May 30, 2025 • 35
ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Paper • 2505.23762 • Published May 29, 2025 • 45
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Paper • 2505.22618 • Published May 28, 2025 • 47
SWE-bench Goes Live!

Paper • 2505.23419 • Published May 29, 2025 • 21
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Paper • 2505.24864 • Published May 30, 2025 • 146
ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL

Paper • 2505.24875 • Published May 30, 2025 • 10
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

Paper • 2505.24726 • Published May 30, 2025 • 283
WebSailor: Navigating Super-human Reasoning for Web Agent

Paper • 2507.02592 • Published Jul 3, 2025 • 127
MemOS: A Memory OS for AI System

Paper • 2507.03724 • Published Jul 4, 2025 • 168
RoboBrain 2.0 Technical Report

Paper • 2507.02029 • Published Jul 2, 2025 • 36
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Paper • 2506.11763 • Published Jun 13, 2025 • 74
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions

Paper • 2506.09038 • Published Jun 10, 2025 • 6
GTA1: GUI Test-time Scaling Agent

Paper • 2507.05791 • Published Jul 8, 2025 • 27
Budget-Aware Tool-Use Enables Effective Agent Scaling

Paper • 2511.17006 • Published Nov 21, 2025 • 34
Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

Paper • 2511.13288 • Published Nov 17, 2025 • 19
EvoVLA: Self-Evolving Vision-Language-Action Model

Paper • 2511.16166 • Published Nov 20, 2025 • 6
WMPO: World Model-based Policy Optimization for Vision-Language-Action Models

Paper • 2511.09515 • Published Nov 12, 2025 • 21
TiDAR: Think in Diffusion, Talk in Autoregression

Paper • 2511.08923 • Published Nov 12, 2025 • 130
Robot Learning from a Physical World Model

Paper • 2511.07416 • Published Nov 10, 2025 • 32
Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale

Paper • 2511.05705 • Published Nov 7, 2025 • 10
Real-Time Reasoning Agents in Evolving Environments

Paper • 2511.04898 • Published Nov 7, 2025 • 13
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Paper • 2511.02778 • Published Nov 4, 2025 • 104
World Simulation with Video Foundation Models for Physical AI

Paper • 2511.00062 • Published Oct 28, 2025 • 46
Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games

Paper • 2510.26298 • Published Oct 30, 2025 • 46
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Paper • 2510.25992 • Published Oct 29, 2025 • 48
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Paper • 2602.22675 • Published Feb 26 • 23
OmniGAIA: Towards Native Omni-Modal AI Agents

Paper • 2602.22897 • Published Feb 26 • 53
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Paper • 2603.19835 • Published Mar 20 • 353
Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Paper • 2603.29620 • Published Mar 31 • 49
AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents

Paper • 2603.27490 • Published Mar 29 • 20
Mind DeepResearch Technical Report

Paper • 2604.14518 • Published Apr 17 • 23
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Paper • 2604.21375 • Published Apr 23 • 19
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

Paper • 2604.25256 • Published Apr 28 • 30
Co-Director: Agentic Generative Video Storytelling

Paper • 2604.24842 • Published Apr 27 • 16
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

Paper • 2604.19859 • Published Apr 21 • 54
Image Generators are Generalist Vision Learners

Paper • 2604.20329 • Published Apr 22 • 22
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Paper • 2604.19741 • Published Apr 21 • 17
From Web to Pixels: Bringing Agentic Search into Visual Perception

Paper • 2605.12497 • Published May 12 • 14
SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Paper • 2605.23904 • Published May 22 • 254
DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch

Paper • 2606.10728 • Published Jun 9 • 34

Upvote

Collection guide
Browse collections