Original Paper: Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems (arxiv.org)

Summary

Generative Large Language Models (LLMs) have shown remarkable advancements in AI, offering capabilities spanning from text generation to complex problem-solving. However, deploying these models efficiently in real-world applications poses a significant challenge due to their computational and memory demands. This survey aims to address these challenges by exploring algorithms and system-level optimizations.

Methodologies

To curate the findings, the survey reviews an extensive body of literature and practical implementations focused on LLM serving efficiency. Selection criteria included methodological innovation, deployment scalability, and demonstrated improvements in serving latency and throughput.

Background

Refresher on how LLMs work

Auto-regressive decoding is central to how LLMs generate text: each new token is produced conditioned on the entire sequence generated so far.
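The decoding loop can be sketched minimally as follows. Here `next_token` is a toy stand-in (a deterministic function of the prefix), not a real model; in an actual LLM it would be a forward pass that returns the most likely next token given the whole context.

```python
def next_token(sequence):
    # Toy stand-in for an LLM forward pass: derives the next token
    # deterministically from the entire prefix. A real model would run
    # the full sequence through the network here.
    return sum(sequence) % 50 + 1

def generate(prompt, max_new_tokens, eos_token=0):
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(sequence)   # conditions on the whole prefix
        if token == eos_token:
            break
        sequence.append(token)         # the new token joins the context
    return sequence

print(generate([3, 7], max_new_tokens=4))  # -> [3, 7, 11, 22, 44, 38]
```

Note that each step re-reads the full prefix; this is exactly why serving systems cache per-token intermediate state (the KV cache) instead of recomputing it.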


Challenges for efficient LLM serving

Latency & Response time

Balancing model complexity with inference speed is a critical challenge that necessitates optimizing algorithms and system architectures to minimize response time without compromising accuracy.

Memory footprint & Model size

LLMs have significant memory requirements due to their size and parameter count. Deploying them on memory-constrained devices is therefore difficult, which demands effective ways of compressing models without sacrificing performance.
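One common compression approach is weight quantization: storing weights in a low-bit integer format plus a scale factor. The sketch below shows symmetric int8 quantization on a plain Python list; the function names are illustrative and not from the survey.

```python
def quantize_int8(weights):
    # Scale so the largest-magnitude weight maps to +/-127,
    # then round each weight to the nearest int8 value.
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Recover approximate float weights from int8 values + scale.
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
print(q)  # int8 codes, e.g. [50, -127, 2, 100]
```

Storing one byte per weight instead of four (fp32) cuts the footprint roughly 4x, at the cost of small rounding error in the recovered weights.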

Scalability & Throughput

Request loads in production inference systems vary over time, so the system must scale with demand and distribute the workload effectively across available resources.

Hardware Compatibility and Acceleration

Adapting LLM inference to diverse hardware platforms and architectures, including CPUs and GPUs, is crucial. This requires hardware-aware algorithm design and optimization to exploit the full potential of the underlying hardware.

Trade-offs between Accuracy and Efficiency