
Conversation

@BoyuanFeng (Contributor) commented Jul 24, 2025

Currently we rely on memory profiling to estimate the kv cache memory.

However, there are two issues:

  1. The estimation does not include CUDAGraph memory, since it happens before CUDAGraph capture. This underestimates memory consumption, so too much KV cache memory is allocated and the run OOMs later. For example, on Llama 4 Maverick the estimation misses ~1.3 GB of CUDAGraph memory, so it OOMs when gpu-memory-utilization=0.99.
  2. Some users may want to fully utilize GPU memory. Currently they have to trial-and-error with different gpu-memory-utilization values.

This PR adds a kv_cache_memory config so that users can specify the KV cache memory size in bytes.

By default, kv_cache_memory is None and we still rely on memory profiling to estimate the KV cache memory. In addition, memory profiling now suggests an optimal kv_cache_memory value that users can pass in future runs. For example, with vllm bench latency --model "google/gemma-3-4b-it", there would be a log like:

Free memory on device (177.66/178.36 GiB) on startup. Desired GPU memory utilization is (0.9, 160.53 GiB). Actual usage is 8.58 GiB for weight, 9.96 GiB for peak activation, 0.07 GiB for non-torch memory, and 0.65 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=151519866880` to fit into requested memory, or `--kv-cache-memory=169914314752` to fully utilize gpu memory. Current kv cache memory in use is 152379699200 bytes.

In future runs, users can pass the suggested value, e.g. vllm bench latency --model "google/gemma-3-4b-it" --kv-cache-memory=169914314752. This skips memory profiling and uses the user-specified KV cache memory size directly.
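For illustration, here is a back-of-the-envelope sketch of where the suggested numbers in the log above come from. This is not vLLM's actual code; the variable names are made up, and because the GiB figures in the log are rounded, the computed values only approximately reproduce the exact byte counts:

```python
# Back-of-the-envelope reconstruction of the log's suggestion (illustrative only;
# rounded GiB figures mean the results only roughly match the exact byte counts).
GiB = 1024**3

total_memory    = 178.36 * GiB   # total HBM reported on startup
free_on_startup = 177.66 * GiB   # free HBM reported on startup
gpu_mem_util    = 0.9            # requested --gpu-memory-utilization
weights         = 8.58 * GiB
peak_activation = 9.96 * GiB
non_torch       = 0.07 * GiB
cudagraph       = 0.65 * GiB

overhead = weights + peak_activation + non_torch + cudagraph

# "fit into requested memory": stay within total * gpu_memory_utilization
kv_within_budget = total_memory * gpu_mem_util - overhead   # ~151.7 GB

# "fully utilize gpu memory": spend everything that was free on startup
kv_full = free_on_startup - overhead                        # ~170.1 GB

print(f"--kv-cache-memory={int(kv_within_budget)}")  # log suggests 151519866880
print(f"--kv-cache-memory={int(kv_full)}")           # log suggests 169914314752
```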

Tested on: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8, meta-llama/Llama-4-Scout-17B-16E-Instruct,
Qwen/Qwen3-30B-A3B, and google/gemma-3-4b-it.

Closes: #19480

@gemini-code-assist bot left a comment


Code Review

This pull request improves the estimation of available KV Cache memory by accounting for CUDAGraph and non-torch memory, which helps prevent out-of-memory errors. The changes look good and correctly address the described issues. I've added a couple of suggestions to improve robustness and maintainability: one to prevent a potential division-by-zero error, and another to replace a magic number with a named constant for better clarity.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@ywang96 (Member) left a comment


I think it was originally designed on purpose that gpu_memory_utilization does not take CUDA graph memory into account, so that users can tune this parameter given their CUDA graph configuration (whether to turn it on or not, and how many graphs to capture).

This means vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --disable-log-requests -tp 8 --max-model-len 4096 --gpu-memory-utilization 0.99 --trust-remote-code is most definitely going to OOM by design, but not if the user adds --enforce-eager

While I very much agree that adding this sort of estimation should, generally speaking, improve UX, it is indeed a change in behavior of how our memory profiling works, so I'd like others to chime in too. cc @youkaichao @WoosukKwon

@NickLucche (Collaborator) commented Jul 25, 2025

Gemma's overestimation looks considerable

@houseroad (Collaborator) left a comment


As @ywang96 mentioned, shall we introduce a flag that disables the profiling, maps gpu_memory_utilization onto the full HBM, and lets power users just play with it?

The benefits would be 1) simpler startup logic with a smaller chance of failure, 2) faster startup, and 3) more options for power users to tweak.
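For illustration only, a minimal sketch of what such a profiling-free path could compute (hypothetical function name, not an actual vLLM API): the KV cache budget comes straight from total HBM, and the user leaves headroom for weights, activations, and CUDAGraph capture by tuning the fraction.

```python
# Hypothetical sketch of the proposal above, not vLLM's implementation:
# skip memory profiling and apply gpu_memory_utilization directly to total HBM.
import torch

def kv_cache_budget_no_profiling(gpu_memory_utilization: float, device: int = 0) -> int:
    """Return a KV cache budget in bytes as a fixed fraction of total device memory."""
    _free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    # The user tunes gpu_memory_utilization so that weights, activations, and
    # CUDAGraph memory still fit in the remaining (1 - fraction) of HBM.
    return int(total_bytes * gpu_memory_utilization)
```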

@yinghai (Contributor) commented Jul 26, 2025

I agree with @ywang96 that we are just adding another guesstimation on top of gpu_memory_utilization.

@mergify bot added the frontend label Aug 12, 2025
@BoyuanFeng marked this pull request as draft August 12, 2025 05:32

mergify bot commented Aug 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @BoyuanFeng.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Aug 13, 2025
@BoyuanFeng changed the title from "improve estimation of available KV Cache memory" to "Allow users to specify kv cache memory size" Aug 13, 2025
@BoyuanFeng marked this pull request as ready for review August 13, 2025 22:47
@mergify bot removed the needs-rebase label Aug 13, 2025
Signed-off-by: Boyuan Feng <[email protected]>
@BoyuanFeng force-pushed the bf/memory_utilization branch from 15dada4 to 9b16c00 on August 21, 2025 21:29
@mergify bot removed the tpu and needs-rebase labels Aug 21, 2025
necessary for implementing this optimization in some models (e.g. Gemma3n)
"""

kv_cache_memory: Optional[int] = None
Collaborator

Should this be named with a unit, something like kv_cache_memory_gb? Also, why not make it a float?

Collaborator

kv_cache_memory_bytes?

Collaborator

If it's bytes, then no need to use float.

Contributor Author

Yes, renamed to kv_cache_memory_bytes.
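For reference, a minimal sketch of how the renamed field could look in a config dataclass (illustrative only; everything except the kv_cache_memory_bytes name is an assumption, not vLLM's actual class):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheConfig:  # illustrative stand-in, not vLLM's actual definition
    # Fraction of GPU memory to target when kv_cache_memory_bytes is not set.
    gpu_memory_utilization: float = 0.9
    # Explicit KV cache size in bytes. When set, memory profiling is skipped
    # and this value is used directly; bytes are integral, so no float needed.
    kv_cache_memory_bytes: Optional[int] = None
```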


mergify bot commented Aug 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @BoyuanFeng.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Aug 26, 2025
@mergify bot removed the needs-rebase label Aug 26, 2025
@hmellor (Member) left a comment


LGTM

@hmellor enabled auto-merge (squash) September 11, 2025 09:59
@github-actions bot added the ready label Sep 11, 2025
@hmellor merged commit 94e6b2d into vllm-project:main Sep 11, 2025
53 of 54 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
Signed-off-by: Boyuan Feng <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
Signed-off-by: Boyuan Feng <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Boyuan Feng <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Labels
ci/build, deepseek, documentation, frontend, gpt-oss, llama, multi-modality, new-model, performance, qwen, ready, rocm, speculative-decoding, structured-output, tool-calling, v1
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug]: Compile inductor / CUDA Graph build before the memory profiling
7 participants