RULER: What's the Real Context Size of Your Long-Context Language Models? • 2404.06654 • Published Apr 9, 2024
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens • 2603.23516 • Published Mar 2026
It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization • 2504.13173 • Published Apr 17, 2025
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding • 2603.22458 • Published Mar 2026
Perception Encoder: The best visual embeddings are not at the output of the network • 2504.13181 • Published Apr 17, 2025
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features • 2502.14786 • Published Feb 20, 2025
Learning Transferable Architectures for Scalable Image Recognition • 1707.07012 • Published Jul 21, 2017
CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks • 2401.14109 • Published Jan 25, 2024
A Comprehensive Overview and Comparative Analysis on Deep Learning Models: CNN, RNN, LSTM, GRU • 2305.17473 • Published May 27, 2023
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models • 2402.19427 • Published Feb 29, 2024
Short window attention enables long-term memorization • 2509.24552 • Published Sep 29, 2025
Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration • 2602.11937 • Published Feb 12, 2026
Puzzle: Distillation-Based NAS for Inference-Optimized LLMs • 2411.19146 • Published Nov 28, 2024