A high-throughput and memory-efficient inference and serving engine for LLMs
Project Description
vLLM is a high-performance, open-source library for efficient and scalable large language model (LLM) inference and serving. It delivers state-of-the-art serving throughput through **PagedAttention** memory management, continuous batching, CUDA/HIP graph execution, and a range of quantization methods (e.g., GPTQ, AWQ, INT4, INT8, FP8). vLLM integrates seamlessly with popular Hugging Face models, provides an OpenAI-compatible API server, and supports distributed inference with tensor and pipeline parallelism. It runs on a wide range of hardware (NVIDIA, AMD, Intel, TPU, AWS Neuron) and supports many model families, including Transformer-based LLMs, Mixture-of-Experts, and multi-modal models. vLLM is community-driven, with contributions from academia and industry, and is backed by sponsors including a16z, Google Cloud, and NVIDIA.
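As a rough illustration of the offline inference workflow described above, the following is a minimal sketch using vLLM's `LLM` and `SamplingParams` entry points; the model name (`facebook/opt-125m`), prompts, and sampling values are placeholders, and the commented-out options (tensor parallelism, quantization) are shown only to indicate where those features plug in.

```python
from vllm import LLM, SamplingParams

# Example prompts and sampling settings (values are illustrative).
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a Hugging Face model. Distributed inference or quantization can be
# enabled via constructor arguments, e.g.:
#   LLM(model=..., tensor_parallel_size=2)
#   LLM(model=..., quantization="awq")
llm = LLM(model="facebook/opt-125m")

# Generate completions with continuous batching handled internally.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Completion: {output.outputs[0].text!r}")
```

For online serving, the same model can be exposed through the OpenAI-compatible API server (e.g., `vllm serve facebook/opt-125m`) and queried with any OpenAI client pointed at the local endpoint.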