Allow users to specify kv cache memory size #21489
Conversation
Code Review
This pull request improves the estimation of available KV Cache memory by accounting for CUDAGraph and non-torch memory, which helps prevent out-of-memory errors. The changes look good and correctly address the described issues. I've added a couple of suggestions to improve robustness and maintainability: one to prevent a potential division-by-zero error, and another to replace a magic number with a named constant for better clarity.
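To make the review's points concrete, here is a minimal sketch of that kind of accounting; every name, the fixed CUDAGraph allowance, and the block-size guard below are illustrative assumptions, not vLLM's actual profiling code:

```python
# Illustrative sketch only -- not vLLM's actual profiling code.
# The names and the fixed CUDAGraph allowance are assumptions.
CUDAGRAPH_MEMORY_BYTES = 512 * 1024 * 1024  # named constant instead of a magic number


def estimate_kv_cache_bytes(
    total_gpu_bytes: int,
    gpu_memory_utilization: float,
    weights_bytes: int,
    peak_activation_bytes: int,
    non_torch_bytes: int,
    use_cudagraph: bool,
) -> int:
    """Memory left for the KV cache after the other consumers are subtracted."""
    budget = int(total_gpu_bytes * gpu_memory_utilization)
    overhead = weights_bytes + peak_activation_bytes + non_torch_bytes
    if use_cudagraph:
        overhead += CUDAGRAPH_MEMORY_BYTES
    return max(budget - overhead, 0)


def num_kv_blocks(kv_cache_bytes: int, bytes_per_block: int) -> int:
    """Guard the division so a zero-sized block never raises ZeroDivisionError."""
    if bytes_per_block <= 0:
        return 0
    return kv_cache_bytes // bytes_per_block
```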
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
I think it was originally designed on purpose so that gpu_memory_utilization does not take CUDA graph memory into account, letting users tune this parameter for their CUDA graph configuration (whether to turn it on or not, and how many graphs to capture).
This means vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --disable-log-requests -tp 8 --max-model-len 4096 --gpu-memory-utilization 0.99 --trust-remote-code
is most definitely going to OOM by design, but not if the user adds --enforce-eager.
While I very much agree that adding this sort of estimation generally improves UX, it is indeed a behavior change in how our memory profiling works, so I'd like others to chime in too. cc @youkaichao @WoosukKwon
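For intuition, a toy calculation (numbers made up) of why the headroom outside the gpu_memory_utilization budget vanishes once CUDA graphs are captured:

```python
# Toy numbers on an 80 GiB GPU, purely to illustrate the headroom argument above.
GiB = 1024**3
total_bytes = 80 * GiB
gpu_memory_utilization = 0.99

budget = total_bytes * gpu_memory_utilization    # what profiling is allowed to plan for
headroom = total_bytes - budget                  # ~0.8 GiB left outside the budget

cudagraph_bytes = 2 * GiB                        # captured graphs are paid outside the budget
print((headroom - cudagraph_bytes) / GiB)        # negative -> OOM unless --enforce-eager
```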
Gemma's overestimation looks considerable.
Like @ywang96 mentioned, shall we introduce a flag which disables the profiling, maps gpu_mem_utilization onto the full HBM, and lets power users just play with it?
The benefits would be: 1) simpler startup logic with a smaller chance of failure, 2) faster startup, 3) more options for power users to tweak.
I agree with @ywang96 that we are just adding another guesstimation on top of the existing memory profiling.
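Read as a rough sketch (the flag name and the accounting below are hypothetical, just to pin down what "map the gpu_mem_utilization to the full HBM" would mean):

```python
# Hypothetical sketch of the proposed "skip profiling" path; names are made up.
from typing import Optional


def plan_kv_cache_bytes(
    total_hbm_bytes: int,
    gpu_memory_utilization: float,
    weights_bytes: int,
    skip_profiling: bool,
    profiled_kv_cache_bytes: Optional[int] = None,
) -> int:
    if skip_profiling:
        # Interpret gpu_memory_utilization against the full HBM and let
        # power users tune it themselves.
        return int(total_hbm_bytes * gpu_memory_utilization) - weights_bytes
    # Otherwise keep the current behavior: use the profiled estimate.
    assert profiled_kv_cache_bytes is not None
    return profiled_kv_cache_bytes
```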
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Boyuan Feng <[email protected]>
Force-pushed from 15dada4 to 9b16c00
Signed-off-by: Boyuan Feng <[email protected]>
vllm/config/cache.py (Outdated)
necessary for implementing this optimization in some models (e.g. Gemma3n)
"""

kv_cache_memory: Optional[int] = None
Should this be named with a unit, something like kv_cache_memory_gb? Also, why not make it a float?
kv_cache_memory_bytes?
If it's bytes, then no need to use float.
Yes, renamed to kv_cache_memory_bytes.
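For readers skimming the thread, a rough sketch of how the renamed field would be consumed; the worker-side wrapper here is hypothetical, and only the "None means fall back to profiling" default comes from the PR description:

```python
# Hypothetical consumer of the renamed field; only the "None means fall back
# to memory profiling" default is taken from the PR description.
def determine_kv_cache_bytes(cache_config, profile_available_kv_memory) -> int:
    if cache_config.kv_cache_memory_bytes is not None:
        # The user pinned the KV cache size explicitly; skip memory profiling.
        return cache_config.kv_cache_memory_bytes
    # Default path: profile to estimate how much memory is actually available.
    return profile_available_kv_memory()
```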
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Boyuan Feng <[email protected]>
Signed-off-by: Boyuan Feng <[email protected]>
LGTM
Signed-off-by: Boyuan Feng <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Co-authored-by: Harry Mellor <[email protected]>
Currently we rely on memory profiling to estimate the kv cache memory.
However, there are two issues:
This PR provides a kv_cache_memory config such that users can specify the kv cache memory size in bytes.
By default, kv_cache_memory is None and we still rely on memory profiling to estimate the kv cache memory, but memory profiling now logs a suggested kv_cache_memory value that users can pass in future runs. For example, with vllm bench latency --model "google/gemma-3-4b-it", the log suggests a value, and in future runs users can pass the suggested config as vllm bench latency --model "google/gemma-3-4b-it" --kv-cache-memory=169914314752. This skips memory profiling and uses the user-specified kv cache memory size.
Tested on: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8, meta-llama/Llama-4-Scout-17B-16E-Instruct, Qwen/Qwen3-30B-A3B, and google/gemma-3-4b-it.
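Presumably the same knob is reachable from the Python entrypoint, since engine arguments are forwarded as keyword arguments; the kwarg spelling below follows the thread's final naming (kv_cache_memory_bytes) and is an assumption, while the value is the one from the CLI example above:

```python
# Assumed Python-side usage; the kwarg name follows the thread's final naming
# (kv_cache_memory_bytes) and should be checked against the merged code.
from vllm import LLM

llm = LLM(
    model="google/gemma-3-4b-it",
    kv_cache_memory_bytes=169_914_314_752,  # skip profiling and use this KV cache budget
)
```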
Closes: #19480