Releases · vllm-project/vllm

13 Sep 06:37

simon-mo

v0.10.2

e017120

v0.10.2 Latest

Latest

Highlights

This release contains 740 commits from 266 contributors (97 new)!

Breaking Changes: This release includes PyTorch 2.8.0 upgrade, V0 deprecations, and API changes - please review the changelog carefully.

aarch64 support: This release features native support for aarch64 allowing usage of vLLM on GB200 platform. The docker image vllm/vllm-openai should already be multiplatform. To install the wheels, you can download the wheels from this release artifact or install via

uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto

Model Support

New model families and enhancements: Apertus (#23068), LFM2 (#22845), MiDashengLM (#23652), Motif-1-Tiny (#23414), Seed-Oss (#23241), Google EmbeddingGemma-300m (#24318), GTE sequence classification (#23524), Donut OCR model (#23229), KeyeVL-1.5-8B (#23838), R-4B vision model (#23246), Ernie4.5 VL (#22514), MiniCPM-V 4.5 (#23586), Ovis2.5 (#23084), Qwen3-Next with hybrid attention (#24526), InternVL3.5 with video support (#23658), Qwen2Audio embeddings (#23625), NemotronH Nano VLM (#23644), BLOOM V1 engine support (#23488), and Whisper encoder-decoder for V1 (#21088).
Pipeline parallelism expansion: Added PP support for Hunyuan (#24212), Ovis2.5 (#23405), GPT-OSS (#23680), and Kimi-VL-A3B-Thinking-2506 (#23114).
Data parallelism for vision models: Enabled DP for ViT across Qwen2.5VL (#22742), MiniCPM-V (#23948, #23327), Kimi-VL (#23817), and GLM-4.5V (#23168).
LoRA ecosystem expansion: Added LoRA support to Voxtral (#24517), Qwen-2.5-Omni (#24231), and DeepSeek models V2/V3/R1-0528 (#23971), with significantly faster LoRA startup performance (#23777).
Classification and pooling enhancements: Multi-label classification support (#23173), logit bias and sigmoid normalization (#24031), and FP32 precision heads for pooling models (#23810).
Performance optimizations: Removed unnecessary CUDA sync from GLM-4.1V (#24332) and Qwen2VL (#24334) preprocessing, eliminated redundant all-reduce in Qwen3 MoE (#23169), optimized InternVL CPU threading (#24519), and GLM4.5-V video frame decoding (#24161).

Engine Core

V1 engine maturation: Extended V1 support to compute capability < 8.0 (#23614, #24022), added cross-attention KV cache for encoder-decoder models (#23664), request-level logits processor integration (#23656), and KV events from connectors (#19737).
Backend expansion: Terratorch backend integration (#23513) enabling non-language model tasks like semantic segmentation and geospatial applications with --model-impl terratorch support.
Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#22594), disabled prefix caching for hybrid/Mamba models (#23716), added FP32 SSM kernel support (#23506), full CUDA graph support for Mamba1 (#23035), and V1 as default for Mamba models (#23650).
Performance core improvements: --safetensors-load-strategy for NFS based file loading acceleration (#24469), critical CUDA graph capture throughput fix (#24128), scheduler optimization for single completions (#21917), multi-threaded model weight loading (#23928), and tensor core usage enforcement for FlashInfer decode (#23214).
Multimodal enhancements: Multimodal cache tracking with mm_hash (#22711), UUID-based multimodal identifiers (#23394), improved V1 video embedding estimation (#24312), and simplified multimodal UUID handling (#24271).
Sampling and structured outputs: Support for all prompt logprobs (#23868), final logprobs (#22387), grammar bitmask optimization (#23361), and user-configurable KV cache memory size (#21489).
Distributed: Support Decode Context Parallel (DCP) for MLA (#23734)

Hardware & Performance

NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#22357), MXFP4 fused CUTLASS MoE (#23696), default MXFP4 MoE on Blackwell (#23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#23608).
Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#24521).
Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#14258, #23958), V1 cross-attention support (#23297), FP8 support for FlashMLA (#22668), fused grouped TopK for MoE (#23274), Flash Linear Attention kernels (#24518), and W4A8 support on Hopper (#23198).
Performance improvements: 13.7x speedup for token conversion (#20413), TTIT/TTFT improvements for disaggregated serving (#22760), symmetric memory all-reduce by default (#24111), FlashInfer warmup during startup (#23439), V1 model execution overlap (#23569), and various Triton configuration tuning (#23748, #23939).
Platform expansion: Apple Silicon bfloat16 support for M2+ (#24129), IBM Z V1 engine support (#22725), Intel XPU torch.compile (#22609), XPU MoE data parallelism (#22887), XPU Triton attention (#24149), XPU FP8 quantization (#23148), and ROCm pipeline parallelism with Ray (#24275).
Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#24698, #24688, #24699, #24695), GLM-4.5-Air-FP8 B200 configs (#23695), Kimi K2 optimization (#24597), and QWEN3 Coder/Thinking configs (#24266, #24330).

Quantization

New quantization capabilities: Per-layer quantization routing (#23556), GGUF quantization with layer skipping (#23188), NFP4+FP8 MoE support (#22674), W4A8 channel scales (#23570), and AMD CDNA2/CDNA3 FP4 support (#22527).
Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#22486) enabling techniques like SpinQuantR1R2R4 and QuIP quantization methods.
FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#24197), FP8-qkv attention kernels (#23647), and FP8 per-tensor GEMMs (#22895).
Platform-specific quantization: ROCm TorchAO quantization enablement (#24400) and TorchAO module swap configuration (#21982).
Performance optimizations: MXFP4 MoE loading cache optimization (#24154) and compressed tensors version updates (#23202).
Breaking change: Removed original Marlin quantization format (#23204).

API & Frontend

OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#23735), transcription response usage statistics (#23576), and return_token_ids parameter (#22587).
Response API improvements: Streaming support for non-harmony responses (#23741), non-streaming logprobs (#23319), MCP tool background mode (#23494), MCP streaming+background support (#23927), and tool output token reporting (#24285).
Frontend optimizations: Error stack traces with --log-error-stack (#22960), collective RPC endpoint (#23075), beam search concurrency optimization (#23599), unnecessary detokenization skipping (#24236), and custom media UUIDs (#23449).
Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#23595), EPLB configuration parameter (#20562), embedding endpoint chat request support (#23931), and LM Format Enforcer V1 integration (#22564).

Dependencies

Major updates: PyTorch 2.8.0 upgrade (#20358) - breaking change requiring environment updates, FlashInfer v0.3.0 upgrade (#24086), and FlashInfer 0.2.14.post1 maintenance update (#23537).
Supporting updates: XGrammar 0.1.23 (#22988), TPU core dump fix with tpu_info 0.4.0 (#23135), and compressed tensors version bump (#23202).
Deployment improvements: FlashInfer cubin directory environment variable (#22675) for offline environments and pre-cached CUDA binaries.

V0 Deprecation

Backend removals: V0 Neuron backend deprecation (#21159), V0 pooling model support removal (#23434), V0 FlashInfer attention backend removal (#22776), and V0 test cleanup (#23418, #23862).
API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#18800), LoRA extra vocab size deprecation warning (#23635), LoRA bias parameter deprecation (#24339), and metrics naming change from TPOT to ITL (#24110).

Breaking Changes

PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions
FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
Quantizations - Removed quantized Mixtral hack implementation, and original Marlin format.
Metrics renaming - TPOT deprecated in favor of ITL

What's Changed

[Misc] Minor code cleanup for _get_prompt_logprobs_dict by @WoosukKwon in #23064
[Misc] enhance static type hint by @andyxning in #23059
[Bugfix] fix Qwen2.5-Omni processor output mapping by @DoubleVII in #23058
[Bugfix][CI] Machete kernels: deterministic ordering for more cache hits by @andylolu2 in #23055
[Misc] refactor function name by @andyxning in #23029
[Misc] Fix backward compatibility from #23030 by @ywang96 in #23070
[XPU] Fix compile size for xpu by @jikunshang in #23069
[XPU][CI]add xpu env vars in CI scripts by @jikunshang in #22946
[Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs by @DarkLight1337 in #23053
[Bugfix] fix IntermediateTensors equal method by @andyxning in https://github.com/vllm-project/...

Contributors

markmc, rasmith, and 197 other contributors

Assets 6

20 Aug 21:20

github-actions

v0.10.1.1

1da94e6

v0.10.1.1

This is a critical bugfix and security release:

Fix CUTLASS MLA Full CUDAGraph (#23200)
Limit HTTP header count and size (#23267): GHSA-rxc4-3w6r-4v47
Do not use eval() to convert unknown types (#23266): GHSA-79j6-g2m3-jgfw

Full Changelog: v0.10.1...v0.10.1.1

Assets 6

18 Aug 04:39

github-actions

v0.10.1

aab5498

v0.10.1

Highlights

v0.10.1 release includes 727 commits, 245 committers (105 new contributors).

NOTE: This release deprecates V0 FA3 support and as a result FP8 kv-cache in V0 may have issues

Model Support

New model families: GPT-OSS with comprehensive tool calling and streaming support (#22327, #22330, #22332, #22335, #22339, #22340, #22342), Command-A-Vision (#22660), mBART (#22883), and SmolLM3 using Transformers backend (#22665).
Vision-language models: Official Eagle multimodal support with Llama4 backend (#20788), Step3 vision-language models (#21998), Gemma3n multimodal (#20495), MiniCPM-V 4.0 (#22166), HyperCLOVAX-SEED-Vision-Instruct-3B (#20931), Emu3 with Transformers backend (#21319), Intern-S1 (#21628), and Prithvi in online serving mode (#21518).
Enhanced existing models: NemotronH support (#22349), Ernie 4.5 Base 0.3B model name change (#21735), GLM-4.5 series improvements (#22215), Granite models with fused MoE configurations (#21332) and quantized checkpoint loading (#22925), Ultravox support for Llama 4 and Gemma 3 backends (#17818), Mamba1 and Jamba model support in V1 (without CUDA graphs) (#21249)
Advanced model capabilities: Qwen3 EPLB (#20815) and dual-chunk attention support (#21924), Qwen native Eagle3 target support (#22333).
Architecture expansions: Encoder-only models without KV-cache enabling BERT-style architectures (#21270), expanded tensor parallelism support in Transformers backend (#22651), tensor parallelism for Deepseek_vl2 vision transformer (#21494), and tensor/pipeline parallelism with Mamba2 kernel for PLaMo2 (#19674).
V1 engine compatibility: Extended support for additional pooling models (#21747) and Step3VisionEncoder distributed processing option (#22697).

Engine Core

CUDA graph performance: Full CUDA graph support with separate attention routines, adding FA2 and FlashInfer compatibility (#20059), plus 6% end-to-end throughput improvement from Cutlass MLA (#22763).
Attention system advances: Multiple attention metadata builders per KV cache specification (#21588), tree attention backend for v1 engine (experimental) (#20401), FlexAttention encoder-only support (#22273), upgraded FlashAttention 3 with attention sink support (#22313), and multiple attention groups for KV sharing patterns (#22672).
Speculative decoding optimizations: N-gram speculative decoding with single KMP token proposal algorithm (#22437), explicit EAGLE3 interface for enhanced compatibility (#22642).
Default behavior improvements: Pooling models now default to chunked prefill and prefix caching (#20930), disabled chunked local attention by default for Llama4 for better performance (#21761).
Extensibility and configuration: Model loader plugin system (#21067), custom operations support for FusedMoe (#22509), rate limiting with bucket algorithm for proxy server (#22643), torch.compile support for bailing MoE (#21664).
Performance optimizations: Improved startup time by disabling C++ compilation of symbolic shapes (#20836), enhanced headless models for pooling in Transformers backend (#21767).

Hardware & Performance

NVIDIA Blackwell (SM100) optimizations: CutlassMLA as default backend (#21626), FlashInfer MoE per-tensor scale FP8 backend (#21458), SM90 CUTLASS FP8 GEMM with kernel tuning and swap AB support (#20396).
NVIDIA RTX 5090/RTX PRO 6000 (SM120) support: Block FP8 quantization (#22131) and CUTLASS NVFP4 4-bit weights/activations support (#21309).
AMD ROCm platform enhancements: Flash Attention backend for Qwen-VL models (#22069), AITER HIP block quantization kernels (#21242), reduced device-to-host transfers (#22683), and optimized kernel performance for small batch sizes 1-4 (#21350).
Attention and compute optimizations: FlashAttention 3 attention sinks performance boost (#22478), Triton-based multi-dimensional RoPE replacing PyTorch implementation (#22375), async tensor parallelism for scaled matrix multiplication (#20155), optimized FlashInfer metadata building (#21137).
Memory and throughput improvements: Mamba2 reduced device-to-device copy overhead (#21075), fused Triton kernels for RMSNorm (#20839, #22184), improved multimodal hasher performance for repeated image prompts (#22825), multithreaded async multimodal loading (#22710).
Parallelization and MoE optimizations: Guided decoding throughput improvements (#21862), balanced expert sharding for MoE models (#21497), expanded fused kernel support for topk softmax (#22211), fused MoE for nomic-embed-text-v2-moe (#18321).
Hardware compatibility and kernels: ARM CPU build fixes for systems without BF16 support (#21848), Machete memory-bound performance improvements (#21556), FlashInfer TRT-LLM prefill attention kernel support (#22095), optimized reshape_and_cache_flash CUDA kernel (#22036), CPU transfer support in NixlConnector (#18293).
Specialized CUDA kernels: GPT-OSS activation functions (#22538), RLHF weight loading acceleration (#21164).

Quantization

Advanced quantization techniques: MXFP4 and bias support for Marlin kernel (#22428), NVFP4 GEMM FlashInfer backends (#22346), compressed-tensors mixed-precision model loading (#22468), FlashInfer MoE support for NVFP4 (#21639).
Hardware-optimized quantization: Dynamic 4-bit quantization with Kleidiai kernels for CPU inference (#17112), TensorRT-LLM FP4 quantization optimized for MoE low-latency inference (#21331).
Expanded model quantization support: BitsAndBytes quantization for InternS1 (#21953) and additional MoE models (#21370, #21548), Gemma3n quantization compatibility (#21974), calibration-free RTN quantization for MoE models (#20766), ModelOpt Qwen3 NVFP4 support (#20101).
Performance and compatibility improvements: CUDA kernel optimization for Int8 per-token group quantization (#21476), non-contiguous tensor support in FP8 quantization (#21961), automatic detection of ModelOpt quantization formats (#22073).
Breaking change: Removed AQLM quantization support (#22943) - users should migrate to alternative quantization methods.

API & Frontend

OpenAI API compatibility: Unix domain socket support for local communication (#18097), improved error response format matching upstream specification (#22099), aligned tool_choice="required" behavior with OpenAI when tools list is empty (#21052).
New API capabilities: Dedicated LLM.reward interface for reward models (#21720), chunked processing for long inputs in embedding models (#22280), AsyncLLM proper response handling for aborted requests (#22283).
Configuration and environment: Multiple API keys support for enhanced authentication (#18548), custom vLLM tuned configuration paths (#22791), environment variable control for logging statistics (#22905), multimodal cache size (#22441), and DeepGEMM E8M0 scaling behavior (#21968).
CLI and tooling improvements: V1 API support for run-batch command (#21541), custom process naming for better monitoring (#21445), improved help display showing available choices (#21760), optional memory profiling skip for multimodal models (#22950), enhanced logging of non-default arguments (#21680).
Tool and parser support: HermesToolParser for models without special tokens (#16890), multi-turn conversation benchmarking tool (#20267).
Distributed serving enhancements: Enhanced hybrid distributed serving with multiple API servers in load balancing mode (#21510), request_id support for external load balancers (#21009).
User experience enhancements: Improved error messaging for multimodal items (#22114), per-request pooling control via PoolingParams (#20538).

Dependencies

FlashInfer updates: Updated to v0.2.8 for improved performance (#21385), moved to optional dependency install with pip install vllm[flashinfer] for flexible installation (#21959).
Mamba SSM restructuring: Updated to version 2.2.5 (#21421), removed from core requirements to reduce installation complexity (#22541).
Docker and deployment: Docker-aware precompiled wheel support for easier containerized deployment (#21127, #22106).
Python package updates: OpenAI Python dependency updated to latest version for API compatibility (#22316).
Dependency optimizations: Removed xformers requirement for Mistral-format Pixtral and Mistral3 models (#21154), deprecation warnings added for old DeepGEMM version (#22194).

V0 Deprecation

Important: As part of the ongoing V0 engine cleanup, several breaking changes have been introduced:

CLI flag updates: Replaced --task with --runner and --convert options (#21470), deprecated --disable-log-requests in favor of --enable-log-requests for clearer semantics (#21739), renamed --expand-tools-even-if-tool-choice-none to --exclude-tools-when-tool-choice-none for consistency (#20544).
API cleanup: Removed previously deprecated arguments and methods as part of ongoing V0 engine codebase cleanup (#21907).

What's Changed

Deduplicate Transformers backend code using inheritance by @hmellor in #21461
[Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in #21205
[TPU][Bugfix] fix moe layer by @yaochengji in #21340
[v1][Core] Clean up usages of SpecializedManager by @zhouwfang in #21407
[Misc] Fix duplicate FusedMoEConfig debug messages by @njhill in #21455
[Core] Support model loader plugins by @22quinn in #21067
remove GLM-4 quantization wrong Code by @zRzRzRzRzRzRzR in #21435
Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none ...

Contributors

rasmith, russellb, and 198 other contributors

Assets 6

17 Aug 22:57

github-actions

v0.10.1rc1

0fc8fa7

v0.10.1rc1 Pre-release

Pre-release

What's Changed

Deduplicate Transformers backend code using inheritance by @hmellor in #21461
[Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in #21205
[TPU][Bugfix] fix moe layer by @yaochengji in #21340
[v1][Core] Clean up usages of SpecializedManager by @zhouwfang in #21407
[Misc] Fix duplicate FusedMoEConfig debug messages by @njhill in #21455
[Core] Support model loader plugins by @22quinn in #21067
remove GLM-4 quantization wrong Code by @zRzRzRzRzRzRzR in #21435
Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none for v0.10.0 by @okdshin in #20544
[Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_devices() by @ruisearch42 in #21501
[Feat] Allow custom naming of vLLM processes by @chaunceyjiang in #21445
bump flashinfer to v0.2.8 by @cjackal in #21385
[Attention] Optimize FlashInfer MetadataBuilder Build call by @LucasWilkinson in #21137
[Model] Officially support Emu3 with Transformers backend by @hmellor in #21319
[Bugfix] Fix CUDA arch flags for MoE permute by @minosfuture in #21426
[Fix] Update mamba_ssm to 2.2.5 by @elvischenv in #21421
[Docs] Update Tensorizer usage documentation by @sangstar in #21190
[Docs] Rewrite Distributed Inference and Serving guide by @crypdick in #20593
[Bug] Fix Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access by @yewentao256 in #21465
Update flashinfer CUTLASS MoE Kernel by @wenscarl in #21408
[XPU] Conditionally import CUDA-specific passes to avoid import errors on xpu platform by @chaojun-zhang in #21036
[P/D] Move FakeNixlWrapper to test dir by @ruisearch42 in #21328
[P/D] Support CPU Transfer in NixlConnector by @juncgu in #18293
[Docs][minor] Fix broken gh-file link in distributed serving docs by @crypdick in #21543
[Docs] Add Expert Parallelism Initial Documentation by @simon-mo in #21373
update flashinfer to v0.2.9rc1 by @weireweire in #21485
[TPU][TEST] HF_HUB_DISABLE_XET=1 the test 3. by @QiliangCui in #21539
[MoE] More balanced expert sharding by @WoosukKwon in #21497
[Frontend] run-batch supports V1 by @DarkLight1337 in #21541
[Docs] Fix site_url for RunLLM by @hmellor in #21564
[Bug] Fix DeepGemm Init Error by @yewentao256 in #21554
Fix GLM-4 PP Missing Layer When using with PP. by @zRzRzRzRzRzRzR in #21531
[Kernel] adding fused_moe configs for upcoming granite4 by @bringlein in #21332
[Bugfix] DeepGemm utils : Fix hardcoded type-cast by @varun-sundar-rabindranath in #21517
[DP] Support api-server-count > 0 in hybrid DP LB mode by @njhill in #21510
[TPU][Test] Temporarily suspend this MoE model in test_basic.py. by @QiliangCui in #21560
[Docs] Add requirements/common.txt to run unit tests by @zhouwfang in #21572
Integrate TensorSchema with shape validation for Phi3VImagePixelInputs by @bbeckca in #21232
[CI] Update CODEOWNERS for CPU and Intel GPU by @bigPYJ1151 in #21582
[Bugfix] fix modelscope snapshot_download serialization by @andyxning in #21536
[Model] Support tensor parallel for timm ViT in Deepseek_vl2 by @wzqd in #21494
[Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings by @hfan in #21479
[Misc][Tools] make max-model-len a parameter in auto_tune script by @yaochengji in #21321
[CI/Build] fix cpu_extension for apple silicon by @ignaciosica in #21195
[Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS by @chenyang78 in #21262
[TPU][Bugfix] fix OOM issue in CI test by @yaochengji in #21550
[Tests] Harden DP tests by @njhill in #21508
Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @Xu-Wenqing in #21598
[Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribute 'startswith' by @kebe7jun in #21579
[Quantization] Enable BNB support for more MoE models by @jeejeelee in #21370
[V1] Get supported tasks from model runner instead of model config by @DarkLight1337 in #21585
[Bugfix][Logprobs] Fix logprobs op to support more backend by @MengqingCao in #21591
[Model] Fix Ernie4.5MoE e_score_correction_bias parameter by @xyxinyang in #21586
[MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B by @bigshanedogg in #20931
[Frontend] Add request_id to the Request object so they can be controlled better via external load balancers by @kouroshHakha in #21009
[Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel by @cyang49 in #20839
[ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. by @fsx950223 in #20295
[Kernel] Improve machete memory bound perf by @czhu-cohere in #21556
Add support for Prithvi in Online serving mode by @mgazz in #21518
[CI] Unifying Dockerfiles for ARM and X86 Builds by @kebe7jun in #21343
[Docs] add auto-round quantization readme by @wenhuach21 in #21600
[TPU][Test] Rollback PR-21550. by @QiliangCui in #21619
Add Unsloth to RLHF.md by @danielhanchen in #21636
[Perf] Cuda Kernel for Int8 Per Token Group Quant by @yewentao256 in #21476
Add interleaved RoPE test for Llama4 (Maverick) by @sarckk in #21478
[Bugfix] Fix sync_and_slice_intermediate_tensors by @ruisearch42 in #21537
[Bugfix] Always set RAY_ADDRESS for Ray actor before spawn by @ruisearch42 in #21540
[TPU] Update ptxla nightly version to 20250724 by @yaochengji in #21555
[Feature] Add support for MoE models in the calibration-free RTN-based quantization by @sakogan in #20766
[Model] Ultravox: Support Llama 4 and Gemma 3 backends by @farzadab in #17818
[Docs] add offline serving multi-modal video input expamle Qwen2.5-VL by @david6666666 in #21530
Correctly kill vLLM processes after finishing serving benchmarks by @huydhn in #21641
[Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds to use dtype comparison by @Mitix-EPI in #21612
[TPU][Test] Divide TPU v1 Test into 2 parts. by @QiliangCui in #21431
Support Intern-S1 by @lvhan028 in #21628
[Misc] remove unused try-except in pooling config check by @reidliu41 in #21618
[Take 2] Correctly kill vLLM processes after benchmarks by @huydhn in #21646
Migrate AriaImagePixelInputs to TensorSchema for shape validation by @bbeckca in #21620
Migrate AyaVisionImagePixelInputs to TensorSchema for shape validation by @bbeckca in #21622
[Bugfix] Investigate Qwen2-VL failing test by @Isotr0py in #21527
Support encoder-only models without KV-Cache by @maxdebayser in #21270
[Bug] Fix has_flashinfer_moe Import Error when it is not installed by @yewentao256 in #21634
[Misc] Improve memory profiling debug message by @yeqcharlotte in #21429
[BugF...

Contributors

rasmith, russellb, and 198 other contributors

Assets 2

24 Jul 22:43

github-actions

v0.10.0

6d8d0a2

v0.10.0

Highlights

v0.10.0 release includes 308 commits, 168 contributors (62 new!).

NOTE: This release begins the cleanup of V0 engine codebase. We have removed V0 CPU/XPU/TPU/HPU backends (#20412), long context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far and plan to continued to delete code that is no longer used.

Model Support

New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Microsoft Phi-4-mini-flash-reasoning (#20702), Hunyuan V1 Dense + A13B with reasoning/tool parsing (#21368, #20625, #20820), Ling MoE models (#20680), JinaVL Reranker (#20260), Nemotron-Nano-VL-8B-V1 (#20349), Arcee (#21296), Voxtral (#20970).
Enhanced compatibility: BERT/RoBERTa with AutoWeightsLoader (#20534), HF format support for MiniMax (#20211), Gemini configuration (#20971), GLM-4 updates (#20736).
Architecture expansions: Attention-free model support (#20811), Hybrid SSM/Attention models on V1 (#20016), LlamaForSequenceClassification (#20807), expanded Mamba2 layer support (#20660).
VLM improvements: VLM support with transformers backend (#20543), PrithviMAE on V1 engine (#20577).

Engine Core

Experimental async scheduling --async-scheduling flag to overlap engine core scheduling with GPU runner (#19970).
V1 engine improvements: backend-agnostic local attention (#21093), MLA FlashInfer ragged prefill (#20034), hybrid KV cache with local chunked attention (#19351).
Multi-task support: models can now support multiple tasks (#20771), multiple poolers (#21227), and dynamic pooling parameter configuration (#21128).
RLHF Support: new RPC methods for runtime weight reloading (#20096) and config updates (#20095), logprobs mode for selecting which stage of logprobs to return (#21398).
Enhanced caching: multi-modal caching for transformers backend (#21358), reproducible prefix cache hashing using SHA-256 + CBOR (#20511).
Startup time reduction via CUDA graph capture speedup via frozen GC (#21146).
Elastic expert parallel for dynamic GPU scaling while preserving state (#20775).

Hardwares & Performance

NVIDIA Blackwell/SM100 optimizations: CUTLASS block scaled group GEMM for smaller batches (#20640), FP8 groupGEMM support (#20447), DeepGEMM integration (#20087), FlashInfer MoE blockscale FP8 backend (#20645), CUDNN prefill API for MLA (#20411), Triton Fused MoE kernel config for FP8 E=16 on B200 (#20516).
Performance improvements: 48% request duration reduction via microbatch tokenization for concurrent requests (#19334), fused MLA QKV + strided layernorm (#21116), Triton causal-conv1d for Mamba models (#18218).
Hardware expansion: ARM CPU int8 quantization (#14129), PPC64LE/ARM V1 support (#20554), Intel XPU ray distributed execution (#20659), shared-memory pipeline parallel for CPU (#21289), FlashInfer ARM CUDA support (#21013).

Quantization

New quantization support: MXFP4 for MoE models (#17888), BNB support for Mixtral and additional MoE models (#20893, #21100), in-flight quantization for MoE (#20061).
Hardware-specific: FP8 KV cache quantization on TPU (#19292), FP8 support for BatchedTritonExperts (#18864), optimized INT8 vectorization kernels (#20331).
Performance optimizations: Triton backend for DeepGEMM per-token group quantization (#20841), CUDA kernel for per-token group quantization (#21083), CustomOp abstraction for FP8 (#19830).

API & Frontend

OpenAI compatibility: Responses API implementation (#20504, #20975), image object support in llm.chat (#19635), tool calling with required choice and $defs (#20629).
New endpoints: get_tokenizer_info for tokenizer/chat-template information (#20575), cache_salt support for completions/responses (#20981).
Model loading: Tensorizer S3 integration with arbitrary arguments (#19619), HF repo paths & URLs for GGUF models (#20793), tokenization_kwargs for embedding truncation (#21033).
CLI improvements: --help=page option for enhanced help documentation (#20961), default model changed to Qwen3-0.6B (#20335).

Dependencies

Updated PyTorch to 2.7.1 for CUDA (#21011)
FlashInfer updated to v0.2.8rc1 (#20718)

What's Changed

[Docs] Note that alternative structured output backends are supported by @russellb in #19426
[ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
[Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
[BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
[New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
[BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
[CI] Disable failing GGUF model test by @mgoin in #19454
[Misc] Remove unused MultiModalHasher.hash_prompt_mm_data by @lgeiger in #19422
Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
Fix Typo in Documentation and Function Name by @leopardracer in #19442
[ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
[Kernel] Support deep_gemm for linear methods by @artetaout in #19085
[Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
[Doc] Fix quantization link titles by @DarkLight1337 in #19478
[Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
[Misc] Reduce warning message introduced in env_override by @houseroad in #19476
Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
Add cache to cuda get_device_capability by @mgoin in #19436
Fix some typo by @Ximingwang-09 in #19475
Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
[Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
[CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
[doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
[Misc] Fix misleading ROCm warning by @jeejeelee in #19486
[Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
[Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
[AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
[UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
[CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
[CI] change spell checker from codespell to typos by @andyxning in #18711
[BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
[Frontend] Improve error message in tool_choice validation by @22quinn in #19239
[BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
[BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
[AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
Fix typo by @2niuhe in #19525
[Security] Prevent new imports of (cloud)pickle by @russellb in #18018
[Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
[Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
[Quantization] Improve AWQ logic by @jeejeelee in #19431
[Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
[NixlConnector] Drop num_blocks check by @NickLucche in #19532
[Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
Fix TorchAOConfig skip layers by @mobicham in #19265
[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in https://github.com/vllm-proj...

Contributors

rasmith, kzjeef, and 198 other contributors

Assets 6

24 Jul 05:04

github-actions

v0.10.0rc2

6d8d0a2

v0.10.0rc2 Pre-release

Pre-release

What's Changed

[Model] use AutoWeightsLoader for bart by @calvin0327 in #18299
[Model] Support VLMs with transformers backend by @zucchini-nlp in #20543
[bugfix] fix syntax warning caused by backslash by @1195343015 in #21251
[CI] Cleanup modelscope version constraint in Dockerfile by @yankay in #21243
[Docs] Add RFC Meeting to Issue Template by @simon-mo in #21279
Add the instruction to run e2e validation manually before release by @huydhn in #21023
[Bugfix] Fix missing placeholder in logger debug by @DarkLight1337 in #21280
[Model][1/N] Support multiple poolers at model level by @DarkLight1337 in #21227
[Docs] Fix hardcoded links in docs by @hmellor in #21287
[Docs] Make tables more space efficient in supported_models.md by @hmellor in #21291
[Misc] unify variable for LLM instance by @andyxning in #20996
Add Nvidia ModelOpt config adaptation by @Edwardf0t1 in #19815
[Misc] Add sliding window to flashinfer test by @WoosukKwon in #21282
[CPU] Enable shared-memory based pipeline parallel for CPU backend by @bigPYJ1151 in #21289
[BugFix] make utils.current_stream thread-safety (#21252) by @simpx in #21253
[Misc] Add dummy maverick test by @minosfuture in #21199
[Attention] Clean up iRoPE in V1 by @LucasWilkinson in #21188
[DP] Fix Prometheus Logging by @robertgshaw2-redhat in #21257
Fix bad lm-eval fork by @mgoin in #21318
[perf] Speed up align sum kernels by @hj-mistral in #21079
[v1][sampler] Inplace logprobs comparison to get the token rank by @houseroad in #21283
[XPU] Enable external_launcher to serve as an executor via torchrun by @chaojun-zhang in #21021
[Doc] Fix CPU doc format by @bigPYJ1151 in #21316
[Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU by @ratnampa in #21338
Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) by @minosfuture in #21334
[Core] Minimize number of dict lookup in _maybe_evict_cached_block by @Jialin in #21281
[V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible by @tdoublep in #21300
[Refactor] Fix Compile Warning #1444-D by @yewentao256 in #21208
Fix kv_cache_dtype handling for out-of-tree HPU plugin by @kzawora-intel in #21302
[Misc] DeepEPHighThroughtput - Enable Inductor pass by @varun-sundar-rabindranath in #21311
[Bug] DeepGemm: Fix Cuda Init Error by @yewentao256 in #21312
Update fp4 quantize API by @wenscarl in #21327
[Feature][eplb] add verify ep or tp or dp by @lengrongfu in #21102
Add arcee model by @alyosha-swamy in #21296
[Bugfix] Fix eviction cached blocked logic by @simon-mo in #21357
[Misc] Remove deprecated args in v0.10 by @kebe7jun in #21349
[Core] Optimize update checks in LogitsProcessor by @Jialin in #21245
[benchmark] Port benchmark request sent optimization to benchmark_serving by @Jialin in #21209
[Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool by @Jialin in #21222
[Misc] unify variable for LLM instance v2 by @andyxning in #21356
[perf] Add fused MLA QKV + strided layernorm by @mickaelseznec in #21116
[feat]: add SM100 support for cutlass FP8 groupGEMM by @djmmoss in #20447
[Perf] Cuda Kernel for Per Token Group Quant by @yewentao256 in #21083
Adds parallel model weight loading for runai_streamer by @bbartels in #21330
[feat] Enable mm caching for transformers backend by @zucchini-nlp in #21358
Revert "[Refactor] Fix Compile Warning #1444-D (#21208)" by @yewentao256 in #21384
Add tokenization_kwargs to encode for embedding model truncation by @Receiling in #21033
[Bugfix] Decode Tokenized IDs to Strings for hf_processor in llm.chat() with model_impl=transformers by @ariG23498 in #21353
[CI/Build] Fix test failure due to updated model repo by @DarkLight1337 in #21375
Fix Flashinfer Allreduce+Norm enable disable calculation based on fi_allreduce_fusion_max_token_num by @xinli-git in #21325
[Model] Add Qwen3CoderToolParser by @ranpox in #21396
[Misc] Copy HF_TOKEN env var to Ray workers by @ruisearch42 in #21406
[BugFix] Fix ray import error mem cleanup bug by @joerunde in #21381
[CI/Build] Fix model executor tests by @DarkLight1337 in #21387
[Bugfix][ROCm][Build] Fix build regression on ROCm by @gshtras in #21393
Simplify weight loading in Transformers backend by @hmellor in #21382
[BugFix] Update python to python3 calls for image; fix prefix & input calculations. by @ericehanley in #21391
[BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update by @xuechendi in #21414
[Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported by @elvischenv in #21420
Changing "amdproduction" allocation. by @Alexei-V-Ivanov-AMD in #21409
[Bugfix] Fix nightly transformers CI failure by @Isotr0py in #21427
[Core] Add basic unit test for maybe_evict_cached_block by @Jialin in #21400
[Cleanup] Only log MoE DP setup warning if DP is enabled by @mgoin in #21315
add clear messages for deprecated models by @youkaichao in #21424
[Bugfix] ensure tool_choice is popped when tool_choice:null is passed in json payload by @gcalmettes in #19679
Fixed typo in profiling logs by @sergiopaniego in #21441
[Docs] Fix bullets and grammars in tool_calling.md by @windsonsea in #21440
[Sampler] Introduce logprobs mode for logging by @houseroad in #21398
Mamba V2 Test not Asserting Failures. by @fabianlim in #21379
[Misc] fixed nvfp4_moe test failures due to invalid kwargs by @chenyang78 in #21246
[Docs] Clean up v1/metrics.md by @windsonsea in #21449
[Model] add Hunyuan V1 Dense Model support. by @kzjeef in #21368
[V1] Check all pooling tasks during profiling by @DarkLight1337 in #21299
[Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. by @sighingnow in #21364
[Tests] Add tests for headless internal DP LB by @njhill in #21450
[Core][Model] PrithviMAE Enablement on vLLM v1 engine by @christian-pinto in #20577
Add test case for compiling multiple graphs by @sarckk in #21044
[TPU][TEST] Fix the downloading issue in TPU v1 test 11. by @QiliangCui in #21418
[Core] Add reload_weights RPC method by @22quinn in #20096
[V1] Fix local chunked attention always disabled by @sarckk in #21419
[V0 Deprecation] Remove Prompt Adapters by @mgoin in #20588
[Core] Freeze gc during cuda graph capture to speed up init by @mgoin in #21146
feat(gguf_loader): accept HF repo paths & URLs for GGUF by @hardikkgupta in #20793
[Frontend] Set MAX_AUDIO_CLI...

Contributors

kzjeef, simpx, and 64 other contributors

Assets 2

20 Jul 05:17

github-actions

v0.10.0rc1

d1fb65b

v0.10.0rc1 Pre-release

Pre-release

What's Changed

[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
[Misc] Fix Unable to detect current VLLM config. Defaulting to NHD kv cache layout warning by @NickLucche in #20400
[Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
Change warn_for_unimplemented_methods to debug by @mgoin in #20455
[Platform] Add custom default max tokens by @gmarinho2 in #18557
Add ignore consolidated file in mistral example code by @princepride in #20420
[Misc] small update by @reidliu41 in #20462
[Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in #20365
[Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
[Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
Support Llama 4 for fused_marlin_moe by @mgoin in #20457
[Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
[Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
[Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
[CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
[Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
[feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
[CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
[Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in #20497
[Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
[doc] small fix by @reidliu41 in #20506
[Misc] Remove the unused LoRA test code by @jeejeelee in #20494
Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
[v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
[Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
[Misc] remove unused import by @reidliu41 in #20517
test_attention compat with coming xformers change by @bottler in #20487
[BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
[Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
[BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
[TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
[Frontend] Support image object in llm.chat by @sfeng33 in #19635
[Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
[Misc] call the pre-defined func by @reidliu41 in #20518
[V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
[V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
[BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
[Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler by @DarkLight1337 in #20527
Implement OpenAI Responses API [1/N] by @WoosukKwon in #20504
[Misc] add a tip for pre-commit by @reidliu41 in #20536
[Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU by @dbyoung18 in #19410
[CI/Build] Enable phi2 lora test by @jeejeelee in #20540
[XPU][CI] add v1/core test in xpu hardware ci by @Liangliang-Ma in #20537
Add docstrings to url_schemes.py to improve readability by @windsonsea in #20545
[XPU] log clean up for XPU platform by @yma11 in #20553
[Docs] Clean up tables in supported_models.md by @windsonsea in #20552
[Misc] remove unused jinaai_serving_reranking by @Abirdcfly in #18878
[Misc] Set the minimum openai version by @jeejeelee in #20539
[Doc] Remove extra whitespace from CI failures doc by @hmellor in #20565
[Doc] Use gh-pr and gh-issue everywhere we can in the docs by @hmellor in #20564
[Doc] Fix internal links so they don't always point to latest by @hmellor in #20563
[Doc] Add outline for content tabs by @hmellor in #20571
[Doc] Fix some MkDocs snippets used in the installation docs by @hmellor in #20572
[Model][Last/4] Automatic conversion of CrossEncoding model by @noooop in #19675
[Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled by @panpan0000 in #20486
[Feature] microbatch tokenization by @ztang2370 in #19334
[DP] Copy environment variables to Ray DPEngineCoreActors by @ruisearch42 in #20344
[Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel by @jvlunteren in #20308
[Misc] Add fully interleaved support for multimodal 'string' content format by @Dekakhrone in #14047
[Misc] feat output content in stream response by @lengrongfu in #19608
Fix links in multi-modal model contributing page by @hmellor in #18615
[Config] Refactor mistral configs by @patrickvonplaten in #20570
[Misc] Improve logging for dynamic shape cache compilation by @kyolebu in #20573
[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe by @minosfuture in #20167
[Optimize] Don't send token ids when kv connector is not used by @WoosukKwon in #20586
Make distinct code and console admonitions so readers are less likely to miss them by @hmellor in #20585
[Bugfix]: Fix messy code when using logprobs by @chaunceyjiang in #19209
[Doc] Syntax highlight request responses as JSON instead of bash by @hmellor in #20582
[Docs] Rewrite offline inference guide by @crypdick in #20594
[Docs] Improve docstring for ray data llm example by @crypdick in #20597
[Docs] Add Ray Serve LLM section to openai compatible server guide by @crypdick in #20595
[Docs] Add Anyscale to frameworks by @crypdick in #20590
[Misc] improve error msg by @reidliu41 in #20604
[CI/Build][CPU] Fix CPU CI and remove all CPU V0 files by @bigPYJ1151 in #20560
[TPU] Temporary fix vmem oom for long model len by reducing page size by @Chenyaaang in #20278
[Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load by @sangstar in #19619
[PD][Nixl] Remote consumer READ timeout for clearing request blocks by @NickLucche in #20139
[Docs] Improve documentation for Deepseek R1 on Ray Serve LLM by @crypdick in #20601
Remove unnecessary explicit title anchors and use relative links instead by @hmellor in #20620
Stop using title frontmatter and fix doc that can only be ...

Contributors

rabi, kzjeef, and 133 other contributors

Assets 2

07 Jul 17:05

github-actions

v0.9.2

a5dd03c

v0.9.2

Highlights

This release contains 452 commits from 167 contributors (31 new!)

NOTE: This is the last version where V0 engine code and features stay intact. We highly recommend migrating to V1 engine.

Engine Core

Priority Scheduling is now implemented in V1 engine (#19057), embedding models in V1 (#16188), Mamba2 in V1 (#19327).
Full CUDA‑Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix‑caching. CUDA graph now has a live capture progress bar makes debugging easier (#20301, #18581, #19617, #19501).
FlexAttention update – any head size, FP32 fallback (#20467, #19754).
Shared CachedRequestData objects and cached sampler‑ID stores deliver perf enhancements (#20232, #20291).

Model Support

New families: Ernie 4.5 (+MoE) (#20220), MiniMax‑M1 (#19677, #20297), Slim‑MoE “Phi‑tiny‑MoE‑instruct” (#20286), Tencent HunYuan‑MoE‑V1 (#20114), Keye‑VL‑8B‑Preview (#20126), GLM‑4.1 V (#19331), Gemma‑3 (text‑only, #20134), Tarsier 2 (#19887), Qwen 3 Embedding & Reranker (#19260), dots1 (#18254), GPT‑2 for Sequence Classification (#19663).
Granite hybrid MoE configurations with shared experts are fully supported (#19652).

Large‑Scale Serving & Engine Improvements

Expert‑Parallel Load Balancer (EPLB) has been added! (#18343, #19790, #19885).
Disaggregated serving enhancements: Avoid stranding blocks in P when aborted in D's waiting queue (#19223), let toy proxy handle /chat/completions (#19730)
Native xPyD P2P NCCL transport as a base case for native PD without external dependency (#18242, #20246).

Hardware & Performance

NVIDIA Blackwell
- SM120: CUTLASS W8A8/FP8 kernels and related tuning, added to Dockerfile (#17280, #19566, #20071, #19794)
- SM100: block‑scaled‑group GEMM, INT8/FP8 vectorization, deep‑GEMM kernels, activation‑chunking for MoE, and group‑size 64 for Machete (#19757, #19572, #19168, #19085, #20290, #20331).
Intel GPU (V1) backend with Flash‑Attention support (#19560).
AMD ROCm: full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill (#19158, #19744, #18596).
- Split‑KV support landed in the unified Triton Attention kernel, boosting long‑context throughput (#19152).
- Full‑graph mode enabled in ROCm AITER MLA V1 decode path (#20254).
TPU: dynamic‑grid KV‑cache updates, head‑dim less than 128, tuned paged‑attention kernels, and KV‑padding fixes (#19928, #20235, #19620, #19813, #20048, #20339).
- Add models and features supporting matrix. (#20230)

Quantization

Calibration‑free RTN INT4/INT8 pipeline for effortless, accurate compression (#18768).
Compressed‑Tensor NVFP4 (including MoE) + emulation; FP4 emulation removed on < SM100 devices (#19879, #19990, #19563).
Dynamic MoE‑layer quant (Marlin/GPTQ) and INT8 vectorization primitives (#19395, #20331, #19233).
Bits‑and‑Bytes 0.45 + with improved double‑quant logic and AWQ quality (#20424, #20033, #19431, #20076).

API · CLI · Frontend

API Server: Eliminate api_key and x_request_id headers middleware overhead (#19946)
New OpenAI‑compatible endpoints: /v1/audio/translations & revamped /v1/audio/transcriptions (#19615, #20179, #19597).
Token‑level progress bar for LLM.beam_search and cached template‑resolution speed‑ups (#19301, #20065).
Image‑object support in llm.chat, tool‑choice expansion, and custom‑arg passthroughs enrich multi‑modal agents (#19635, #17177, #16862).
CLI QoL: better parsing for -O/--compilation-config, batch‑size‑sweep benchmarking, richer --help, faster startup (#20156, #20516, #20430, #19941).
Metrics: Deprecate metrics with gpu_ prefix for non GPU specific metrics (#18354), Export NaNs in logits to scheduler_stats if output is corrupted (#18777)

Platform & Deployment

No‑privileged CPU / Docker / K8s mode (#19241) and custom default max‑tokens for hosted platforms (#18557).
Security hardening – runtime (cloud)pickle imports forbidden (#18018).
Hermetic builds and wheel slimming (FA2 8.0 + PTX only) shrink supply‑chain surface (#18064, #19336).

What's Changed

[Docs] Note that alternative structured output backends are supported by @russellb in #19426
[ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
[Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
[BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
[New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
[BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
[CI] Disable failing GGUF model test by @mgoin in #19454
[Misc] Remove unused MultiModalHasher.hash_prompt_mm_data by @lgeiger in #19422
Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
Fix Typo in Documentation and Function Name by @leopardracer in #19442
[ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
[Kernel] Support deep_gemm for linear methods by @artetaout in #19085
[Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
[Doc] Fix quantization link titles by @DarkLight1337 in #19478
[Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
[Misc] Reduce warning message introduced in env_override by @houseroad in #19476
Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
Add cache to cuda get_device_capability by @mgoin in #19436
Fix some typo by @Ximingwang-09 in #19475
Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
[Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
[CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
[doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
[Misc] Fix misleading ROCm warning by @jeejeelee in #19486
[Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
[Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
[AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
[UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
[CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
[CI] change spell checker from codespell to typos by @andyxning in #18711
[BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
[Frontend] Improve error message in tool_choice validation by @22quinn in #19239
[BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
[BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
[AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
Fix typo by @2niuhe in #19525
[Security] Prevent new imports of (cloud)pickle by @russellb in #18018
[Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
[Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
[Quantization] Improve AWQ logic by @jeejeelee in #19431
[Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
[NixlConnector] Drop num_blocks check by @NickLucche in #19532
[Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
Fix TorchAOConfig skip layers by @mobicham in #19265
[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in #16756
[doc] Make top navigatio...

Contributors

rasmith, shawntan, and 166 other contributors

Assets 6

06 Jul 21:03

github-actions

v0.9.2rc2

a5dd03c

v0.9.2rc2 Pre-release

Pre-release

What's Changed

[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
[Misc] Fix Unable to detect current VLLM config. Defaulting to NHD kv cache layout warning by @NickLucche in #20400
[Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
Change warn_for_unimplemented_methods to debug by @mgoin in #20455
[Platform] Add custom default max tokens by @gmarinho2 in #18557
Add ignore consolidated file in mistral example code by @princepride in #20420
[Misc] small update by @reidliu41 in #20462
[Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in #20365
[Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
[Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
Support Llama 4 for fused_marlin_moe by @mgoin in #20457
[Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
[Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
[Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
[CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
[Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
[feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
[CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
[Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in #20497
[Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
[doc] small fix by @reidliu41 in #20506
[Misc] Remove the unused LoRA test code by @jeejeelee in #20494
Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
[v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
[Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
[Misc] remove unused import by @reidliu41 in #20517
test_attention compat with coming xformers change by @bottler in #20487
[BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
[Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
[BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
[TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
[Frontend] Support image object in llm.chat by @sfeng33 in #19635
[Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
[Misc] call the pre-defined func by @reidliu41 in #20518
[V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
[V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
[BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
[Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler by @DarkLight1337 in #20527

New Contributors

@sangbumlikeagod made their first contribution in #18809
@djmmoss made their first contribution in #19757
@GuyStone made their first contribution in #20497
@bottler made their first contribution in #20487

Full Changelog: v0.9.2rc1...v0.9.2rc2

Contributors

bottler, mgoin, and 25 other contributors

Assets 2

03 Jul 21:54

github-actions

v0.9.2rc1

2f2fcb3

v0.9.2rc1 Pre-release

Pre-release

What's Changed

[Docs] Note that alternative structured output backends are supported by @russellb in #19426
[ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
[Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
[BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
[New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
[BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
[CI] Disable failing GGUF model test by @mgoin in #19454
[Misc] Remove unused MultiModalHasher.hash_prompt_mm_data by @lgeiger in #19422
Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
Fix Typo in Documentation and Function Name by @leopardracer in #19442
[ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
[Kernel] Support deep_gemm for linear methods by @artetaout in #19085
[Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
[Doc] Fix quantization link titles by @DarkLight1337 in #19478
[Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
[Misc] Reduce warning message introduced in env_override by @houseroad in #19476
Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
Add cache to cuda get_device_capability by @mgoin in #19436
Fix some typo by @Ximingwang-09 in #19475
Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
[Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
[CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
[doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
[Misc] Fix misleading ROCm warning by @jeejeelee in #19486
[Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
[Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
[AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
[UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
[CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
[CI] change spell checker from codespell to typos by @andyxning in #18711
[BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
[Frontend] Improve error message in tool_choice validation by @22quinn in #19239
[BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
[BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
[AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
Fix typo by @2niuhe in #19525
[Security] Prevent new imports of (cloud)pickle by @russellb in #18018
[Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
[Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
[Quantization] Improve AWQ logic by @jeejeelee in #19431
[Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
[NixlConnector] Drop num_blocks check by @NickLucche in #19532
[Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
Fix TorchAOConfig skip layers by @mobicham in #19265
[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in #16756
[doc] Make top navigation sticky by @reidliu41 in #19540
[Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets by @ekagra-ranjan in #18847
[Misc] Turn MOE_DP_CHUNK_SIZE into an env var by @varun-sundar-rabindranath in #19506
[Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant by @mgoin in #19452
[Doc] Unify structured outputs examples by @aarnphm in #18196
[V1] Resolve failed concurrent structred output requests by @russellb in #19565
Revert "[Build/CI] Add tracing deps to vllm container image (#15224)" by @kouroshHakha in #19378
[BugFix] : Fix Batched DeepGemm Experts by @varun-sundar-rabindranath in #19515
[Bugfix] Fix EAGLE vocab embedding for multimodal target model by @zixi-qi in #19570
[Doc] uses absolute links for structured outputs by @aarnphm in #19582
[doc] fix incorrect link by @reidliu41 in #19586
[Misc] Correct broken docs link by @Zerohertz in #19553
[CPU] Refine default config for the CPU backend by @bigPYJ1151 in #19539
[Fix] bump mistral common to support magistral by @princepride in #19533
[Fix] The zip function in Python 3.9 does not have the strict argument by @princepride in #19549
use base version for version comparison by @BoyuanFeng in #19587
[torch.compile] reorganize the cache directory to support compiling multiple models by @youkaichao in #19064
[BugFix] Honor enable_caching in connector-delayed kvcache load case by @njhill in #19435
[Model] Fix minimax model cache & lm_head precision by @qscqesze in #19592
[Refactor] Remove unused variables in moe_permute_unpermute_kernel.inl by @yewentao256 in #19573
[doc][mkdocs] fix the duplicate Supported features sections in GPU docs by @reidliu41 in #19606
[CUDA] Enable full cudagraph for FlashMLA by @ProExpertProg in #18581
[Doc] Add troubleshooting section to k8s deployment by @annapendleton in #19377
[torch.compile] Use custom ops when use_inductor=False by @WoosukKwon in #19618
Adding "AMD: Multi-step Tests" to amdproduction. by @Concurrensee in #19508
[BugFix] Fix DP Coordinator incorrect debug log message by @njhill in #19624
[V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. by @sahelib25 in #18354
[Bugfix][1/n] Fix the speculative decoding test by setting the target dtype by @houseroad in #19633
[Misc] Modularize CLI Argument Parsing in Benchmark Scripts by @reidliu41 in #19593
[Bugfix] Fix auto dtype casting for BatchFeature by @Isotr0py in #19316
[Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization by @jiahanc in #19500
Only build CUTLASS MoE kernels on Hopper by @huydhn in #19648
[Bugfix] Don't attempt to use triton if no driver is active by @kzawora-intel in #19561
[Fix] Convert kv_transfer_config from dict to KVTransferConfig by @maobaolong in #19262
[Perf...

Contributors

rasmith, shawntan, and 158 other contributors

Assets 2

Uh oh!

Releases: vllm-project/vllm

v0.10.2

Highlights

Model Support

Engine Core

Hardware & Performance

Quantization

API & Frontend

Dependencies

V0 Deprecation

Breaking Changes

What's Changed

Contributors

Uh oh!

v0.10.1.1

Uh oh!

v0.10.1

Highlights

Model Support

Engine Core

Hardware & Performance

Quantization

API & Frontend

Dependencies

V0 Deprecation

What's Changed

Contributors

Uh oh!

v0.10.1rc1

What's Changed

Contributors

Uh oh!

v0.10.0

Highlights

Model Support

Engine Core

Hardwares & Performance

Quantization

API & Frontend

Dependencies

What's Changed

Contributors

Uh oh!

v0.10.0rc2

What's Changed

Contributors

Uh oh!

v0.10.0rc1

What's Changed

Contributors

Uh oh!

v0.9.2

Highlights

Engine Core

Model Support

Large‑Scale Serving & Engine Improvements

Hardware & Performance

Quantization

API · CLI · Frontend

Platform & Deployment

What's Changed

Contributors

Uh oh!

v0.9.2rc2

What's Changed

New Contributors

Contributors

Uh oh!

v0.9.2rc1

What's Changed

Contributors

Uh oh!