vllm-project

vllm

Uncategorized

A high-throughput and memory-efficient inference and serving engine for LLMs

Stars: 46.1k
Forks: 7.1k
Issues: 1.7k
Contributors: 1.1k
Watchers: 380
Topics: gpt, llm, pytorch, llmops, mlops, model-serving, transformer, llm-serving, inference, llama, amd, rocm, cuda, inferentia, trainium, tpu, xpu, hpu, deepseek, qwen
Language: Python
License: Apache License 2.0 (Apache-2.0)

Project Description

vLLM is a high-performance, open-source library for efficient and scalable large language model (LLM) inference and serving. It delivers state-of-the-art serving throughput, manages KV-cache memory efficiently with **PagedAttention**, and supports advanced techniques such as continuous batching of incoming requests, CUDA/HIP graph execution, and a range of quantization methods (e.g., GPTQ, AWQ, INT4, INT8, FP8). vLLM integrates seamlessly with popular Hugging Face models, provides an OpenAI-compatible API server, and supports distributed inference with tensor and pipeline parallelism. It runs on a wide range of hardware (NVIDIA, AMD, Intel, TPU, AWS Neuron) and serves many model families, including Transformer-based LLMs, Mixture-of-Experts models, and multi-modal models. vLLM is community-driven, with contributions from academia and industry, and is backed by sponsors such as a16z, Google Cloud, and NVIDIA.
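As a brief illustration of the offline-inference entry point described above, the sketch below uses vLLM's `LLM` and `SamplingParams` classes; the model name and sampling settings are placeholder choices for this example, not part of the project listing.

```python
# Minimal offline-inference sketch with vLLM (placeholder model and settings).
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Large language models are",
]

# Decoding settings: temperature, nucleus sampling, and output length cap.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# LLM() loads the model and manages the KV cache with PagedAttention,
# batching requests continuously under the hood.
llm = LLM(model="facebook/opt-125m")  # small example checkpoint; swap in any supported model

# generate() processes all prompts as a batch and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Completion: {output.outputs[0].text!r}")
```

For online serving, the same engine can be exposed through the project's OpenAI-compatible API server (e.g. `python -m vllm.entrypoints.openai.api_server --model <model>`), which standard OpenAI client SDKs can then query.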
