
Conversation

orozery (Contributor) commented Sep 9, 2025

Currently, in v1, this field (num_cpu_blocks) is always zero.
This PR sets it by dividing the swap_space parameter by the total number of bytes per KV block.

This will be useful for the v1 CPU offloading feature (#19854), as it allows users to specify the CPU cache size in GB instead of as a number of blocks.
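Roughly, the computation the PR describes looks like the following sketch (the function and parameter names here are illustrative, not the PR's actual code):

```python
# Minimal sketch: divide the swap-space budget (in bytes) by the byte size of
# one KV block to get the number of CPU blocks.
def compute_num_cpu_blocks(swap_space_bytes: int, kv_bytes_per_block: int) -> int:
    if kv_bytes_per_block <= 0:
        return 0
    return swap_space_bytes // kv_bytes_per_block


# e.g. a 4 GiB swap space with 2 MiB per KV block yields 2048 CPU blocks
assert compute_num_cpu_blocks(4 * 2**30, 2 * 2**20) == 2048
```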

Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.

This commit sets the vllm_config.cache_config.num_cpu_blocks according
to vllm_config.cache_config.swap_space.

Signed-off-by: Or Ozeri <[email protected]>
ApostaC (Collaborator) left a comment:

Otherwise LGTM.

Comment on lines +213 to +214
num_cpu_blocks = (int(vllm_config.cache_config.swap_space_bytes) //
                  kv_cache_configs[0].kv_bytes_per_block)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also a knob called offloaded_block_size in #22595. IIUC, it also impacts the calculation of num_cpu_blocks, right? (i.e., if we have larger CPU blocks, we should end up with fewer CPU blocks)
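For illustration, the adjustment the reviewer is pointing at could look like this sketch (names are illustrative; offloaded_block_size comes from #22595):

```python
# If each offloaded CPU block spans block_size_factor times as many tokens as a
# GPU block, its byte size grows by the same factor, so the same swap space
# holds proportionally fewer CPU blocks.
def num_cpu_blocks_with_offloaded_block_size(swap_space_bytes: int,
                                             kv_bytes_per_gpu_block: int,
                                             block_size_factor: int) -> int:
    kv_bytes_per_cpu_block = kv_bytes_per_gpu_block * block_size_factor
    return swap_space_bytes // kv_bytes_per_cpu_block
```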

orozery (Contributor, Author):

In v0, the offloading was part of the core.
My suggestion for v1 is to have the offloading as a connector.
I wanted to follow the convention for connectors, where all of their arguments are actually defined in their kv_connector_extra_config.

However, deriving num_cpu_blocks from some kind of a swap_space parameter requires knowledge of kv_bytes_per_block.
So basically, I need my connector (both scheduler-side and worker-side) to be aware of kv_bytes_per_block.
This requires changing things in the core, so I tried to keep the changes minimal and came up with the following approach:

- For the scheduler-side connector: report kv_bytes_per_block by setting the existing v0 field num_cpu_blocks.
- For the worker-side connector: pass on kv_cache_configs via the register_kv_caches function (in a follow-up PR).

When the offloading connector gets this num_cpu_blocks (expressed in units of GPU-sized blocks), it can derive the actual number of CPU blocks by dividing by block_size_factor.
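In other words (a sketch with illustrative names, matching the description above):

```python
# The core reports num_cpu_blocks in units of GPU-sized blocks; the offloading
# connector converts that count to its own, larger block size.
def actual_num_cpu_blocks(num_cpu_blocks_gpu_units: int,
                          block_size_factor: int) -> int:
    return num_cpu_blocks_gpu_units // block_size_factor
```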

To sum up, I'm trying to make minimal changes to the core.
This results in the actual offloading configuration parameters being split between vllm_config.cache_config and kv_connector_extra_config.

I'm open to taking a different approach.
Your thoughts?
Perhaps we should ask other relevant folks for their opinion here?

ApostaC (Collaborator):

Yeah. This is a good point. I think at a high level, there should be two parameters that users can configure: (1) total_cpu_buffer_size and (2) cpu_buffer_block_size (how many tokens in each CPU block).

For (1), it's also worth thinking about whether it's per rank or per vLLM instance (i.e., summed across all ranks). If it's per rank, it's probably better to pass it in the KV connector config, while a "global" cache size makes more sense when it's set by a global option like --swap-space.

For (2), I think it should definitely go into the KV connector config, as it's a CPU-offloading-connector-specific setting.

To sum up, I feel like putting all the configs into the KV connector config would probably be better and less confusing. WDYT?
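As a rough illustration of the "everything in the connector config" option, both knobs could be passed via kv_connector_extra_config. The connector name and key names below are placeholders, not settled API:

```python
from vllm.config import KVTransferConfig

# Hedged sketch only: "OffloadingConnector" and the two keys are stand-ins for
# whatever the CPU offloading connector ends up reading.
kv_transfer_config = KVTransferConfig(
    kv_connector="OffloadingConnector",
    kv_role="kv_both",
    kv_connector_extra_config={
        "total_cpu_buffer_size_gb": 64,  # (1) total CPU buffer size
        "cpu_buffer_block_size": 256,    # (2) tokens per CPU block
    },
)
```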

mergify bot commented Sep 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Sep 16, 2025