Reading List
• XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference (arXiv:2404.15420)
• OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework (arXiv:2404.14619)
• Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (arXiv:2404.14219)
• How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study (arXiv:2404.14047)
• LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency (arXiv:2404.12872)
• TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (arXiv:2404.11912)
• Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference (arXiv:2403.09636)
• Recurrent Drafter for Fast Speculative Decoding in Large Language Models (arXiv:2403.09919)
• Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding (arXiv:2402.05109)
• Speculative Streaming: Fast LLM Inference without Auxiliary Models (arXiv:2402.11131)
• Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (arXiv:2401.10774)
• Break the Sequential Dependency of LLM Inference Using Lookahead Decoding (arXiv:2402.02057)
• FP8-LM: Training FP8 Large Language Models (arXiv:2310.18313)
• LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models (arXiv:2310.08659)
• Norm Tweaking: High-performance Low-bit Quantization of Large Language Models (arXiv:2309.02784)
• ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers (arXiv:2309.16119)
• LLM-FP4: 4-Bit Floating-Point Quantized Transformers (arXiv:2310.16836)
• Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing (arXiv:2306.12929)
• Matryoshka Representation Learning (arXiv:2205.13147)
• MambaByte: Token-free Selective State Space Model (arXiv:2401.13660)
• QuIP: 2-Bit Quantization of Large Language Models With Guarantees (arXiv:2307.13304)
• SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (arXiv:2306.03078)
• Efficient LLM inference solution on Intel GPU (arXiv:2401.05391)
• A Careful Examination of Large Language Model Performance on Grade School Arithmetic (arXiv:2405.00332)
• JetMoE: Reaching Llama2 Performance with 0.1M Dollars (arXiv:2404.07413)
• Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (arXiv:2405.01535)
• H2O-Danube3 Technical Report (arXiv:2407.09276)
• Qwen2 Technical Report (arXiv:2407.10671)
• Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients (arXiv:2407.08296)
• Inference Performance Optimization for Large Language Models on CPUs (arXiv:2407.07304)
• Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU (arXiv:2403.06504)