Video-BrowseComp: Benchmarking Agentic Video Research on Open Web Paper • 2512.23044 • Published 8 days ago • 9
OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation Paper • 2512.08294 • Published 28 days ago • 17
AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation Paper • 2512.01334 • Published Dec 1, 2025
MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval Paper • 2509.26378 • Published Sep 30, 2025
Emu3.5: Native Multimodal Models are World Learners Paper • 2510.26583 • Published Oct 30, 2025 • 108
OmniGen2: Exploration to Advanced Multimodal Generation Paper • 2506.18871 • Published Jun 23, 2025 • 78
OmniGen2: Exploration to Advanced Multimodal Generation Paper • 2506.18871 • Published Jun 23, 2025 • 78
MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos Paper • 2502.12558 • Published Feb 18, 2025
Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval Paper • 2502.11431 • Published Feb 17, 2025
VideoDeepResearch: Long Video Understanding With Agentic Tool Using Paper • 2506.10821 • Published Jun 12, 2025 • 19
OmniGen2: Exploration to Advanced Multimodal Generation Paper • 2506.18871 • Published Jun 23, 2025 • 78
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models Paper • 2506.01667 • Published Jun 2, 2025 • 21
MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos Paper • 2502.12558 • Published Feb 18, 2025
MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding Paper • 2406.04264 • Published Jun 6, 2024 • 2
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding Paper • 2409.14485 • Published Sep 22, 2024 • 2
TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control Paper • 2410.10133 • Published Oct 14, 2024 • 1
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding Paper • 2503.18478 • Published Mar 24, 2025 • 1
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models Paper • 2502.06788 • Published Feb 10, 2025 • 13
Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions Paper • 2406.10638 • Published Jun 15, 2024
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval Paper • 2412.14475 • Published Dec 19, 2024 • 55