[Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector #23624

KuntaiDu · 2025-08-26T05:53:58Z

The checklist at the bottom has been considered.

Purpose

This PR adds support for the code path that combines the hybrid allocator with the KV cache connector.

Design doc: link

Related to #23079
Solves #22292

Test Plan

The local correctness test passed. Instructions for reproducing it will follow.

Core test logic:

        # `llm` and `print_output` are defined elsewhere in the test script (not shown).
        first_prompt = "Hello, how are you?" * 5000 + "Hello, my name is"
        # Shares its first 1000 repetitions with first_prompt.
        second_prompt = [
            "Hello, how are you?" * 1000 + "Tell me a very long story",
        ]
        sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=10)
        print_output(llm, [first_prompt], sampling_params, "first")
        # Two more long requests that differ from first_prompt in their first
        # character; they are meant to push the first request out of the GPU cache.
        print_output(llm, ["1" + first_prompt], sampling_params, "second")
        print_output(llm, ["2" + first_prompt], sampling_params, "second")

        # Now the first request is evicted. Run this request.
        # It will trigger KV cache loading from LMCache.
        print_output(llm, second_prompt, sampling_params, "third")
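
The snippet above calls a `print_output` helper that the PR description does not show. A minimal sketch of what such a helper might look like, inferred only from the log format in the test result; the signature, the timing logic, and the return value are assumptions, not the author's code. It relies only on `llm.generate(prompts, sampling_params)` returning objects with `.outputs[0].text`, as vLLM's `LLM.generate` does.

```python
import time


def print_output(llm, prompts, sampling_params, req_str):
    """Hypothetical helper: run one batch, print the text and the latency."""
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start

    print("-" * 50)
    for output in outputs:
        # Each result carries one or more completions; print the first one.
        print(f"Generated text: {output.outputs[0].text!r}")
    print(f"Generation took {elapsed:.2f} seconds, {req_str} request done.")
    print("-" * 50)
    return elapsed
```

Returning the elapsed time is a convenience added here so that a cache hit (short latency) can be compared against a cold run programmatically.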

Test Result

For the last request:

[2025-08-26 05:36:37,718] LMCache INFO: Reqid: 3, Total tokens 6007, LMCache hit tokens: 5888, need to load: 5888 (vllm_v1_adapter.py:1091:lmcache.integration.vllm.vllm_v1_adapter)
[2025-08-26 05:36:37,820] LMCache INFO: Retrieved 5888 tokens (vllm_v1_adapter.py:822:lmcache.integration.vllm.vllm_v1_adapter)
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.26it/s, est. speed input: 7594.87 toks/s, output: 12.64 toks/s]
--------------------------------------------------
Generated text: '.\n\nOkay, here we go...\n\nOnce'
Generation took 0.80 seconds, third request done.
--------------------------------------------------
[2025-08-26 05:36:44,886] LMCache INFO: Storage manager closed. (storage_manager.py:472:lmcache.v1.storage_backend.storage_manager)
[2025-08-26 05:36:48,332] LMCache INFO: LMCacheEngine closed. (cache_engine.py:965:lmcache.v1.cache_engine)
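
The hit count in the log is a whole number of 256-token chunks (5888 = 23 × 256), which is consistent with prefix matching at chunk granularity. A small sketch of that arithmetic follows. The chunk size of 256 and the shared-prefix length of 6001 tokens are assumptions inferred from the log and the prompts; neither is stated in the PR.

```python
def chunk_aligned_hit(shared_prefix_tokens: int, chunk_size: int = 256) -> int:
    """Largest multiple of chunk_size that fits inside the shared prefix."""
    return (shared_prefix_tokens // chunk_size) * chunk_size


total_tokens = 6007  # "Total tokens" from the log
hit_tokens = 5888    # "LMCache hit tokens" from the log

# Assumed shared prefix: roughly 1000 repetitions at ~6 tokens each plus BOS.
assumed_shared_prefix = 6001

print(chunk_aligned_hit(assumed_shared_prefix))  # 5888
print(hit_tokens // 256)                         # 23 full chunks
print(total_tokens - hit_tokens)                 # 119 tokens left to prefill
```

Under these assumptions, any shared prefix between 5888 and 6143 tokens would produce the same hit count, so the log alone does not pin down the exact prefix length.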

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify · 2025-08-26T05:54:33Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @KuntaiDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist · 2025-08-26T06:01:23Z

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

vllm/v1/core/block_pool.py

vllm/v1/core/kv_cache_manager.py

heheda12345 · 2025-09-06T06:19:14Z

Will take a deeper look later.

mergify · 2025-09-06T06:19:18Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @KuntaiDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

heheda12345

Some random ideas to discuss.

vllm/v1/core/sched/scheduler.py

vllm/v1/core/kv_cache_manager.py

KuntaiDu added 5 commits August 26, 2025 05:51

initial release

7e61f1a

Signed-off-by: KuntaiDu <[email protected]>

fall back to simpler case: even when the allocation can fully fit int…

96910d7

…o GPU memory, the inference results are wrong. Fix this first. Signed-off-by: KuntaiDu <[email protected]>

vllm side of hybrid allocator impl

28d5d8e

Signed-off-by: KuntaiDu <[email protected]>

remove previous debug footprint

5d0b504

Signed-off-by: KuntaiDu <[email protected]>

remove debugging codes

9f0ac8c

Signed-off-by: KuntaiDu <[email protected]>

KuntaiDu requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac, alexm-redhat, simon-mo, youkaichao, mgoin, tlrmchlsmth, houseroad, hmellor, yewentao256 and ProExpertProg as code owners August 26, 2025 05:54

mergify bot added the v1 label Aug 26, 2025

mergify bot added the needs-rebase label Aug 26, 2025

merge from main, and resolve conflict in worker

2f4b7b2

Signed-off-by: KuntaiDu <[email protected]>

mergify bot removed the needs-rebase label Aug 26, 2025

clean up some code diff footprint, and remove some debugging statements

a926b1d

Signed-off-by: KuntaiDu <[email protected]>

KuntaiDu mentioned this pull request Aug 26, 2025

[feat] Support hybrid allocator LMCache/LMCache#1436

Open

KuntaiDu marked this pull request as draft August 26, 2025 06:05

KuntaiDu added 2 commits August 26, 2025 21:51

allow allocating when GPU memory is limited and make formatter happy

1cd2654

Signed-off-by: KuntaiDu <[email protected]>

add an empty line to improve readability

34fbe1d

Signed-off-by: KuntaiDu <[email protected]>

KuntaiDu added 4 commits September 4, 2025 22:21

[misc] adjust function order to reduce unnecessary code diff

93ed683

Signed-off-by: KuntaiDu <[email protected]>

Merge branch 'main' into kuntai-support-hybrid-allocator

1e79663

[lint] add kv_cache_config when initializing base class of kv cache c…

02dc4a1

…onnector Signed-off-by: KuntaiDu <[email protected]>

Merge branch 'kuntai-support-hybrid-allocator' of https://github.com/…

4ba7d14

…KuntaiDu/vllm into kuntai-support-hybrid-allocator Signed-off-by: KuntaiDu <[email protected]>

heheda12345 reviewed Sep 6, 2025

vllm/v1/core/block_pool.py (outdated, resolved)

vllm/v1/core/block_pool.py (outdated, resolved)

vllm/v1/core/kv_cache_manager.py (resolved)

mergify bot added the needs-rebase label Sep 6, 2025

KuntaiDu and others added 3 commits September 8, 2025 14:10

Update vllm/v1/core/block_pool.py

3e679a1

Co-authored-by: Chen Zhang <[email protected]> Signed-off-by: Kuntai Du <[email protected]>

[Bugfix] fix fail-to-initialize for TPU

0b3f4b4

Signed-off-by: KuntaiDu <[email protected]>

Merge branch 'kuntai-support-hybrid-allocator' of https://github.com/…

7c16621

…KuntaiDu/vllm into kuntai-support-hybrid-allocator Signed-off-by: KuntaiDu <[email protected]>

KuntaiDu requested a review from zhuohan123 as a code owner September 8, 2025 21:53

mergify bot added the tpu label (Related to Google TPUs) Sep 8, 2025

merge from main and resolve conflicts

0865316

Signed-off-by: KuntaiDu <[email protected]>

mergify bot removed the needs-rebase label Sep 8, 2025

[Misc] fix chen's suggestions

1545c49

Signed-off-by: KuntaiDu <[email protected]>

heheda12345 reviewed Sep 9, 2025

vllm/v1/core/sched/scheduler.py (resolved)

vllm/v1/core/sched/scheduler.py (outdated, resolved)

vllm/v1/core/kv_cache_manager.py (outdated, resolved)

vllm/v1/core/kv_cache_manager.py (outdated, resolved)

KuntaiDu added 6 commits September 10, 2025 23:57

[API] change the set of APIs to reflect Chen's advice.

a143072

Signed-off-by: KuntaiDu <[email protected]>

[bugfix] uncomment fullattentionmanager

20bb31c

Signed-off-by: KuntaiDu <[email protected]>

[bugfix] fix tests

080ead9

Signed-off-by: KuntaiDu <[email protected]>

[API] avoid use internal APIs

c981208

Signed-off-by: KuntaiDu <[email protected]>

Merge branch 'main' into kuntai-support-hybrid-allocator

364ce21

[bugfix] align variable names

498931f

Signed-off-by: KuntaiDu <[email protected]>

KuntaiDu requested a review from NickLucche as a code owner September 11, 2025 17:26

KuntaiDu added 6 commits September 11, 2025 14:13

[bugfix] fix the bug that get_num_blocks_to_allocate < 0

cf4a7d0

Signed-off-by: KuntaiDu <[email protected]>

[fix] re-trigger CI

01a1248

Signed-off-by: KuntaiDu <[email protected]>

Merge branch 'main' into kuntai-support-hybrid-allocator

0f74416

[bugfix] resolve double definition in Mamba:get_num_blocks_to_allocate

aad71ba

Signed-off-by: KuntaiDu <[email protected]>

[bugfix] add mamba speculative decoding into consideration

f144b1b

Signed-off-by: KuntaiDu <[email protected]>

[misc] adjust function order to minimize diff

d863ef4

Signed-off-by: KuntaiDu <[email protected]>
