WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models Paper • 2602.02537 • Published 12 days ago
Generative Frame Sampler for Long Video Understanding Paper • 2503.09146 • Published Mar 12, 2025
Kimi-VL-A3B Collection Moonshot's efficient MoE VLMs, exceptional on agent, long-context, and thinking • 7 items • Updated 13 days ago
Article: Kimi-VL-A3B-Thinking-2506: A Quick Navigation • Published Jun 21, 2025
ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks Paper • 2503.06885 • Published Mar 10, 2025
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? Paper • 2505.23359 • Published May 29, 2025
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning Paper • 2504.08837 • Published Apr 10, 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper • 2504.10479 • Published Apr 14, 2025
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published Feb 20, 2025
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Paper • 2411.13281 • Published Nov 20, 2024
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Paper • 2501.12948 • Published Jan 22, 2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs Paper • 2501.12599 • Published Jan 22, 2025
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published Dec 6, 2024
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper • 2412.00927 • Published Dec 1, 2024