
Conversation

@sdavidbd (Contributor) commented Jul 30, 2025


Purpose

This PR refactors the KV connector integration in GPUModelRunner.execute_model by introducing a context manager that encapsulates the lifecycle of the KV Connector. This clarifies the execution flow and improves modularity.

Additionally, this PR simplifies IntermediateTensors and ModelRunnerOutput by consolidating multiple ad-hoc KV-related fields into a single kv_connector_output field of type KVConnectorOutput, which improves readability and maintainability.
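For readers skimming the diff, here is a minimal sketch of the shape of this refactor. It assumes the connector methods discussed later in this thread (bind_connector_metadata, clear_connector_metadata); the field set and helper name are illustrative, not the PR's literal code:

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVConnectorOutput:
    # Consolidates the previously ad-hoc KV-related fields into a single
    # structure (hypothetical field set; see the PR diff for the real one).
    finished_sending: Optional[set[str]] = None
    finished_recving: Optional[set[str]] = None


@contextmanager
def kv_connector_lifecycle(connector, scheduler_output):
    """Scope the KV connector's lifecycle to one scheduling step."""
    # Bind the metadata the scheduler produced for this step ...
    connector.bind_connector_metadata(scheduler_output.kv_connector_metadata)
    try:
        yield
    finally:
        # ... and always clear it, even if the forward pass raises.
        connector.clear_connector_metadata()
```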

Test Plan

Run all existing tests.

Test Result

All tests pass.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the v1 and tpu (Related to Google TPUs) labels on Jul 30, 2025.
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a valuable refactoring by encapsulating the KV connector lifecycle within a context manager in GPUModelRunner. This significantly improves code clarity and maintainability. The consolidation of KV-related fields into a single kv_connector_output in ModelRunnerOutput and IntermediateTensors is also a welcome change that enhances readability.

However, a critical issue has been introduced in the TPUModelRunner. The refactoring was not applied to it, and it now calls methods that have been removed from KVConnectorModelRunnerMixin, which will cause runtime failures. This needs to be addressed before merging.

@NickLucche (Collaborator) left a comment

Hey, thanks for your work!
I'm just wondering: why can't we have KVConnector.get_finished return a KVConnectorOutput? That would make for easier extensibility as we move more logic from workers to the executor.

@sdavidbd (Contributor, Author) replied:

Hey, thanks for your work! I'm just wondering: why can't we have KVConnector.get_finished return a KVConnectorOutput? That would make for easier extensibility as we move more logic from workers to the executor.

Thanks @NickLucche! That's a great question.

Shaping the output returned from the connector into a general structure is still a work in progress. To avoid locking in a premature design, I believe it’s best to construct KVConnectorOutput externally rather than returning it directly from get_finished. For example, in #19330, I use an additional API to extend the output without modifying the get_finished interface.
A key concern here is API stability - we want to minimize breaking changes to the connector interface, especially given the number of third-party implementations already in use. By keeping the connector output external, we can evolve the design more safely and incrementally.
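To make the "construct externally" idea concrete, a hedged sketch: the model runner assembles the KVConnectorOutput from the existing connector calls itself, so get_finished keeps its current signature (the helper name is hypothetical, and it reuses the KVConnectorOutput sketch above):

```python
def collect_kv_connector_output(connector,
                                finished_req_ids: set[str]) -> KVConnectorOutput:
    # Query the existing, stable connector API ...
    finished_sending, finished_recving = connector.get_finished(finished_req_ids)
    # ... and shape the result on the runner side; adding a new field later
    # touches only this helper and KVConnectorOutput, not the connector API.
    return KVConnectorOutput(
        finished_sending=finished_sending,
        finished_recving=finished_recving,
    )
```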

@njhill (Member) left a comment

Thanks @sdavidbd, this looks great to me.

I also feel it would make sense to return KVConnectorOutput from get_finished(), but since there's not yet agreement on that, we could get this merged ASAP and handle it as a follow-on.

@sdavidbd requested a review from njhill on July 31, 2025 11:43.
@njhill (Member) left a comment

Thanks @sdavidbd! We can continue the discussion in follow-on PRs.

@njhill added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Jul 31, 2025.
@lk-chen (Collaborator) commented Jul 31, 2025

For example, in #19330, I use an additional API to extend the output without modifying the get_finished interface. A key concern here is API stability - we want to minimize breaking changes to the connector interface, especially given the number of third-party implementations already in use.

I think adding an "additional API" brings complexity in keeping atomicity. If you get done_sending from get_finished() and metrics from a new get_metrics() API, how would you guarantee these two are from the same point in time?

Regarding "third-party impl.", if this refers to the implementations inside the vllm/ repo, I guess we can just change all of them at once?

Nevertheless, I'm OK with merging this PR. At least in follow-up PRs, we won't need to touch ModelRunnerOutput etc.

@NickLucche (Collaborator) commented:

We have to fix the tests:

[2025-07-31T13:30:51Z] FAILED v1/core/test_async_scheduler.py::test_stop_by_max_tokens[1] - TypeError: ModelRunnerOutput.__init__() missing 1 required positional argument: 'kv_connector_output'
[2025-07-31T13:30:51Z] FAILED v1/core/test_async_scheduler.py::test_stop_by_max_tokens[2] - TypeError: ModelRunnerOutput.__init__() missing 1 required positional argument: 'kv_connector_output'
[2025-07-31T13:30:51Z] FAILED v1/core/test_async_scheduler.py::test_stop_by_max_tokens[3] - TypeError: ModelRunnerOutput.__init__() missing 1 required positional argument: 'kv_connector_output'
[2025-07-31T13:30:51Z] FAILED v1/core/test_async_scheduler.py::test_stop_by_max_tokens[5] - TypeError: ModelRunnerOutput.__init__() missing 1 required positional argument: 'kv_connector_output'

@DarkLight1337 (Member) commented:

There are more failing tests in the V1 test suite; please fix them. The rest should be fixed if you merge from main.

@sdavidbd (Contributor, Author) commented Aug 2, 2025

There are more failing tests in the V1 test suite; please fix them. The rest should be fixed if you merge from main.

Strange - there seems to be a mismatch between the code shown in the commit view and the version that was actually tested in CI. I'll try rebasing and pushing again to see if that resolves it.

5d3ceee9fc code
5d3ceee9fc failure

@sdavidbd force-pushed the feature/kv-connector-context-manager branch from 5d3ceee to 373df31 on August 2, 2025 19:10.
@sdavidbd (Contributor, Author) commented Aug 3, 2025

@DarkLight1337 Failed checks appear to be caused by known issues unrelated to this PR:

  • ModuleNotFoundError: No module named 'ray.experimental.channel.accelerator_context' - addressed by PR [Misc] Bump ray to 2.48.0 #22123
  • AssertionError: two copies of the same prompt should be the same in test_completion.py - seems unrelated and was mentioned in the #sig-ci channel as a known flake.

@sdavidbd force-pushed the feature/kv-connector-context-manager branch from 373df31 to 25f2873 on August 3, 2025 06:48.
@vllm-bot merged commit aefeea0 into vllm-project:main on Aug 3, 2025 (44 of 46 checks passed).
@sdavidbd (Contributor, Author) commented Aug 3, 2025

For example, in #19330, I use an additional API to extend the output without modifying the get_finished interface. A key concern here is API stability - we want to minimize breaking changes to the connector interface, especially given the number of third-party implementations already in use.

I think adding an "additional API" brings complexity in keeping atomicity. If you get done_sending from get_finished() and metrics from a new get_metrics() API, how would you guarantee these two are from the same point in time?

Regarding "third-party impl.", if this refers to the implementations inside the vllm/ repo, I guess we can just change all of them at once?

Nevertheless, I'm OK with merging this PR. At least in follow-up PRs, we won't need to touch ModelRunnerOutput etc.

Thanks, @lk-chen!

A significant part of this PR is the introduction of the KV connector context manager, which manages the connector lifecycle over a single model execution (i.e., a scheduling step). This provides a natural boundary for atomicity - between bind_connector_metadata and clear_connector_metadata. The intention is that any output retrieved within this scope (e.g., via get_finished or a potential get_metrics) reflects a consistent view of the connector state for that scheduling step. Open to feedback if there are scenarios I’ve missed or if this consistency assumption doesn’t hold in practice.
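Building on the sketches above, the guarantee would look roughly like this (connector and scheduler_output are assumed to be in scope; illustrative only):

```python
# Every connector read happens inside the managed scope, so it reflects
# the metadata bound for exactly this scheduling step.
with kv_connector_lifecycle(connector, scheduler_output):
    # ... run the model forward pass here ...
    kv_output = collect_kv_connector_output(connector, finished_req_ids=set())
# On exit the metadata is cleared; kv_output is a per-step snapshot.
```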

Regarding third-party implementations, I was referring to out-of-tree connectors that are dynamically loaded via KVTransferConfig.kv_connector_module_path, not just the ones under the vllm repo. That's where the concern for interface stability comes in - while the connector API is still a work in progress and some breaking changes are expected, we'd like to minimize the frequency of such changes to reduce churn for external users and give the interface a chance to stabilize gradually.
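For context, an out-of-tree connector is wired in purely through configuration, which is why its interface stability matters. A sketch of such a configuration (the connector class name and module path are hypothetical):

```python
from vllm.config import KVTransferConfig

# Dynamically load a third-party connector without modifying vLLM itself.
kv_config = KVTransferConfig(
    kv_connector="MyCustomConnector",                # hypothetical class
    kv_connector_module_path="my_pkg.kv_connector",  # hypothetical module
    kv_role="kv_both",
)
```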

npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
Signed-off-by: David Ben-David <[email protected]>
Co-authored-by: David Ben-David <[email protected]>
@liuzijing2014 (Collaborator) commented:

Hi @sdavidbd, I think we missed handling the edge case where KVConnectorOutput could be None. I put up a quick fix here: #22473

jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
Signed-off-by: David Ben-David <[email protected]>
Co-authored-by: David Ben-David <[email protected]>
Signed-off-by: Jinzhen Lin <[email protected]>
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
Signed-off-by: David Ben-David <[email protected]>
Co-authored-by: David Ben-David <[email protected]>
Signed-off-by: Noam Gat <[email protected]>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
Signed-off-by: David Ben-David <[email protected]>
Co-authored-by: David Ben-David <[email protected]>
Signed-off-by: Paul Pak <[email protected]>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
Signed-off-by: David Ben-David <[email protected]>
Co-authored-by: David Ben-David <[email protected]>
Signed-off-by: Diego-Castan <[email protected]>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: David Ben-David <[email protected]>
Co-authored-by: David Ben-David <[email protected]>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: David Ben-David <[email protected]>
Co-authored-by: David Ben-David <[email protected]>