stereoplegic's Collections: Inference
• S³: Increasing GPU Utilization during Generative Inference for Higher Throughput (arXiv:2306.06000)
• Fast Distributed Inference Serving for Large Language Models (arXiv:2305.05920)
• Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline (arXiv:2305.13144)
• Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference (arXiv:2303.06182)
• Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers (arXiv:2305.15805)
• FlashDecoding++: Faster Large Language Model Inference on GPUs (arXiv:2311.01282)
• S-LoRA: Serving Thousands of Concurrent LoRA Adapters (arXiv:2311.03285)
• Fast Inference from Transformers via Speculative Decoding (arXiv:2211.17192)
• Prompt Cache: Modular Attention Reuse for Low-Latency Inference (arXiv:2311.04934)
• RecycleGPT: An Autoregressive Language Model with Recyclable Module (arXiv:2308.03421)
• Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens (arXiv:2305.04241)
• Latency Adjustable Transformer Encoder for Language Understanding (arXiv:2201.03327)
• Punica: Multi-Tenant LoRA Serving (arXiv:2310.18547)
• Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity (arXiv:2309.10285)
• Distributed Inference and Fine-tuning of Large Language Models Over The Internet (arXiv:2312.08361)
• SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985)
• Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (arXiv:2401.00448)
• Fast Inference of Mixture-of-Experts Language Models with Offloading (arXiv:2312.17238)
• Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference (arXiv:2401.08383)
• Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models (arXiv:2402.07033)
• IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (arXiv:2405.02842)