A high-throughput and memory-efficient inference and serving engine for LLMs
Project Description
vLLM is a high-performance, open-source library for efficient and scalable large language model (LLM) inference and serving. It delivers state-of-the-art serving throughput through **PagedAttention** memory management, continuous batching, CUDA/HIP graph execution, and a range of quantization methods (e.g., GPTQ, AWQ, INT4, INT8, FP8). vLLM integrates seamlessly with popular Hugging Face models, provides an OpenAI-compatible API server, and supports distributed inference with tensor and pipeline parallelism. It runs on a wide range of hardware (NVIDIA, AMD, Intel, TPU, AWS Neuron) and supports many model families, including Transformer-based LLMs, Mixture-of-Experts, and multi-modal models. vLLM is community-driven, with contributions from academia and industry, and is backed by sponsors including a16z, Google Cloud, and NVIDIA.
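As a rough illustration of the offline inference workflow described above, the following is a minimal sketch using vLLM's `LLM` and `SamplingParams` entry points; the model name (`facebook/opt-125m`), prompts, and sampling values are placeholders, and the commented-out options (tensor parallelism, quantization) are shown only to indicate where those features plug in.

```python
from vllm import LLM, SamplingParams

# Example prompts and sampling settings (values are illustrative).
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a Hugging Face model. Distributed inference or quantization can be
# enabled via constructor arguments, e.g.:
#   LLM(model=..., tensor_parallel_size=2)
#   LLM(model=..., quantization="awq")
llm = LLM(model="facebook/opt-125m")

# Generate completions with continuous batching handled internally.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Completion: {output.outputs[0].text!r}")
```

For online serving, the same model can be exposed through the OpenAI-compatible API server (e.g., `vllm serve facebook/opt-125m`) and queried with any OpenAI client pointed at the local endpoint.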