vLLM

High-throughput, memory-efficient LLM inference and serving engine with PagedAttention and continuous batching — built for production GPU workloads.

model-serving-frameworksRecently released

Hero Score

Popularity

Performance

100

Ecosystem

Maturity

Dev Experience

⭐ 86,106 stars⬇ 1.5M downloads/wkFirst release: Jun 2023Last release: Jul 2026

Async Support: YesPlugin Extensions: HighSpeed: Very fastDoc Quality: HighLearning Curve: Medium

• PagedAttention and continuous batching deliver state-of-the-art throughput on GPU workloads
• Ships an OpenAI-compatible HTTP server, so existing clients can point at vLLM with no code changes
• Broad model support including LoRA adapters and multiple quantization formats