Papers - Multimodal
• TinyLLaVA: A Framework of Small-scale Large Multimodal Models (arXiv:2402.14289)
• ImageBind: One Embedding Space To Bind Them All (arXiv:2305.05665)
• DocLLM: A layout-aware generative language model for multimodal document understanding (arXiv:2401.00908)
• Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts (arXiv:2206.02770)
• arXiv:2104.03964
• MoAI: Mixture of All Intelligence for Large Language and Vision Models (arXiv:2403.07508)
• Veagle: Advancements in Multimodal Representation Learning (arXiv:2403.08773)
• mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (arXiv:2304.14178)
• Gemini: A Family of Highly Capable Multimodal Models (arXiv:2312.11805)
• Flamingo: a Visual Language Model for Few-Shot Learning (arXiv:2204.14198)
• Training Compute-Optimal Large Language Models (arXiv:2203.15556)
• arXiv:2309.16609
• AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (arXiv:2402.12226)
• Unifying Vision, Text, and Layout for Universal Document Processing (arXiv:2212.02623)
• Uni-SMART: Universal Science Multimodal Analysis and Research Transformer (arXiv:2403.10301)
• VideoAgent: Long-form Video Understanding with Large Language Model as Agent (arXiv:2403.10517)
• LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (arXiv:2403.11703)
• TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation (arXiv:2403.12906)
• HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models (arXiv:2403.13447)
• MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (arXiv:2403.14624)
• A Multimodal Approach to Device-Directed Speech Detection with Large Language Models (arXiv:2403.14438)
• InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (arXiv:2403.15377)
• Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (arXiv:2403.18814)
• FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction (arXiv:2305.02549)
• Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities (arXiv:2308.12966)
• LVLM-Intrepret: An Interpretability Tool for Large Vision-Language Models (arXiv:2404.03118)
• No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance (arXiv:2404.04125)
• Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (arXiv:2404.05719)
• Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (arXiv:2404.07973)
• BLINK: Multimodal Large Language Models Can See but Not Perceive (arXiv:2404.12390)
• TokenPacker: Efficient Visual Projector for Multimodal LLM (arXiv:2407.02392)
• Data curation via joint example selection further accelerates multimodal learning (arXiv:2406.17711)
• The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective (arXiv:2407.08583)
• PaliGemma: A versatile 3B VLM for transfer (arXiv:2407.07726)
• Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model (arXiv:2407.07053)
• VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models (arXiv:2407.11691)
• Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling (arXiv:2408.03695)
• mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models (arXiv:2408.04840)
• xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (arXiv:2408.08872)
• Law of Vision Representation in MLLMs (arXiv:2408.16357)
• Geodesic Multi-Modal Mixup for Robust Fine-Tuning (arXiv:2203.03897)
• Multimodal Autoregressive Pre-training of Large Vision Encoders (arXiv:2411.14402)
• SONAR: Sentence-Level Multimodal and Language-Agnostic Representations (arXiv:2308.11466)
• Apollo: An Exploration of Video Understanding in Large Multimodal Models (arXiv:2412.10360)