[WIP][Feature] mooncake connector support GQA transport #2947
Conversation
Signed-off-by: zzy-ContiLearn <[email protected]>
…g errors caused by .npu()
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces support for GQA transport in the mooncake connector, which involves significant changes to handle chunked KV cache transfers. The changes include modifications to the sending and receiving threads, metadata handling, and the logic for selecting tensor-parallel ranks for data transfer. While the overall direction looks correct for enabling more flexible data transport, I found a critical issue in the KV cache transfer logic: the source and destination addresses appear to be swapped, and the remote address calculation for chunked transfers seems incorrect. This could lead to incorrect data transfer or corruption, so addressing it is crucial for the feature to work correctly.
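For orientation before the diff: a minimal sketch of the chunked-transfer arithmetic under discussion, using illustrative values. `block_len`, `num_need_pulls`, and `offset` are the names from the diff below; `chunk_range` and all numeric values are hypothetical.

```python
# Illustrative sketch, not the PR's code: a KV block of block_len bytes is
# split into num_need_pulls equal chunks, and each pull copies the chunk
# selected by `offset`.
block_len = 4096          # bytes per KV block (example value)
num_need_pulls = 4        # e.g. prefill-to-decode TP ratio for GQA heads
inner_block_len = block_len // num_need_pulls


def chunk_range(base_addr: int, block_id: int, offset: int) -> tuple[int, int]:
    """Start address and length of chunk `offset` inside block `block_id`."""
    start = base_addr + block_id * block_len + offset * inner_block_len
    return start, inner_block_len


# The num_need_pulls chunks exactly tile one block:
base = 0x10000
starts = [chunk_range(base, block_id=2, offset=o)[0]
          for o in range(num_need_pulls)]
assert starts[0] == base + 2 * block_len
assert all(b - a == inner_block_len for a, b in zip(starts, starts[1:]))
```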
```diff
 src_list, dst_list, length_list = [], [], []
 for k, (src_layer_base_addr, dst_layer_base_addr) in enumerate(
         zip(local_kv_caches_base_addrs, remote_kv_caches_base_addrs)):
-    block_len = (self.block_len[k % 2]
-                 if self.use_mla else self.block_len[0])
-    for i, remote_block_id in enumerate(grouped_remote_block_ids):
-        local_block_ids = grouped_local_block_ids[i]
-        src = src_layer_base_addr + local_block_ids[0] * block_len
-        dst = dst_layer_base_addr + remote_block_id[0] * block_len
-        length = len(local_block_ids) * block_len
+    block_len = self.block_len[k % 2]
+    inner_block_len = block_len // self.num_need_pulls
+    for remote_block_id, local_block_id in zip(grouped_remote_block_ids,
+                                               grouped_local_block_ids):
+        src = src_layer_base_addr + local_block_id[0] * block_len + offset * inner_block_len
+        dst = dst_layer_base_addr + remote_block_id[0] * inner_block_len
+        length = inner_block_len * len(local_block_id)
         src_list.append(src)
         dst_list.append(dst)
         length_list.append(length)
-ret = self.engine.batch_transfer_sync_read(session_id, src_list,
-                                           dst_list, length_list)
+ret = self.engine.batch_transfer_sync_read(session_id, src_list, dst_list,
+                                           length_list)
```
There appears to be a critical issue with the `batch_transfer_sync_read` call and the address calculations. The `KVCacheRecvingThread` should be reading from a remote source to a local destination, but the source and destination addresses seem to be swapped.

- **Swapped source/destination:** `src_list` is populated with local addresses and `dst_list` with remote addresses. Since `batch_transfer_sync_read` is a read operation, the source should be remote and the destination local; the logic for calculating `src` and `dst` appears to be swapped.
- **Incorrect remote address calculation:** the remote address (assigned to `dst` in the current code) does not use the `offset` parameter, which will cause multi-chunk transfers to read the same chunk from the remote source repeatedly. It should likely incorporate both `block_len` and `offset`.

Here is a suggested correction that assumes `batch_transfer_sync_read` takes `(session, remote_src_addrs, local_dst_addrs, lengths)` and fixes the address calculations. The confusing variable names (`src_layer_base_addr` for local, `dst_layer_base_addr` for remote) are renamed for clarity within the suggestion:
```python
src_list, dst_list, length_list = [], [], []
for k, (local_layer_base_addr, remote_layer_base_addr) in enumerate(
        zip(local_kv_caches_base_addrs, remote_kv_caches_base_addrs)):
    block_len = self.block_len[k % 2]
    inner_block_len = block_len // self.num_need_pulls
    for remote_block_id, local_block_id in zip(grouped_remote_block_ids,
                                               grouped_local_block_ids):
        # remote source address
        src = remote_layer_base_addr + remote_block_id[0] * block_len + offset * inner_block_len
        # local destination address
        dst = local_layer_base_addr + local_block_id[0] * block_len + offset * inner_block_len
        length = inner_block_len * len(local_block_id)
        src_list.append(src)
        dst_list.append(dst)
        length_list.append(length)
ret = self.engine.batch_transfer_sync_read(session_id, src_list, dst_list,
                                           length_list)
```
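As a quick sanity check of the second point above (all numeric values are made up for illustration): the current remote-address formula yields the same address for every `offset`, while the corrected one steps through distinct chunks of the remote block.

```python
# Illustrative values only; demonstrates the flagged bug vs. the suggestion.
block_len, num_need_pulls = 1024, 4
inner_block_len = block_len // num_need_pulls  # 256

remote_base, remote_block_id = 0x2000, 3

# Current code: `offset` is never applied to the remote address.
buggy = [remote_base + remote_block_id * inner_block_len
         for offset in range(num_need_pulls)]
# Suggested fix: stride by block_len, then step by offset * inner_block_len.
fixed = [remote_base + remote_block_id * block_len + offset * inner_block_len
         for offset in range(num_need_pulls)]

assert len(set(buggy)) == 1               # every pull reads the same chunk
assert len(set(fixed)) == num_need_pulls  # each pull reads a distinct slice
```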
Signed-off-by: zzy-ContiLearn <[email protected]>
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Force-pushed from 6c0d05d to f002afb
…ache eviction in the Prefill node. Signed-off-by: chenxiao <[email protected]>
Force-pushed from f002afb to 82db3b5
What this PR does / why we need it?
TODO:
Does this PR introduce any user-facing change?
How was this patch tested?