
@chenxi-yang (Contributor)

Summary: Clean up the GLM model architecture.

Test Plan:
max_batch_size=128

  CUDA_VISIBLE_DEVICES=6,7 \
  VLLM_DISABLE_COMPILE_CACHE=1 \
  VLLM_MQ_MAX_CHUNK_BYTES_MB=256 \
  VLLM_GPU_MEMORY_UTILIZATION=0.9 \
  buck2 run @//mode/{opt,inplace} \
    -c fbcode.enable_vllm=true \
    -c fbcode.enable_gpu_sections=true \
    -c fbcode.nvcc_arch=h100a \
    //smart/inference_platform_sp/llm_predictor_gpu:service -- \
    --local_cache_dir "$HOME/local/models/GLM-4.5V-FP8" \
    --try_local_cache \
    --max_seq_len=16384 \
    --max_batch_size ${max_batch_size} \
    --thrift_server_port 12345 \
    --enable_warmup=true \
    --model_mf_bucket=llm_inference \
    --model_mf_path=tree/oss/GLM-4.5V-FP8 \
    --force_llm_format=true \
    --allow_custom_stop_tokens \
    --model_parallel_size 2 \
    --vllm_engine \
    --cpu_offload_gb=0 \
    --kv_cache_quantization 8 \
    2>&1 | tee "/tmp/$USER/server_vllm_glm4_5v.log"

Before:
QPS:                 1.13
Avg latency:         104.364s
Avg TTFT (client):   47662.44ms
P50 TTFT (client):   55740.47ms
P99 TTFT (client):   57673.41ms
Avg TTIT (client):   56.70ms
P50 TTIT (client):   53.03ms
P99 TTIT (client):   92.62ms
Avg TTFT (server):   45252.67ms
Avg TTIT (server):   56.23ms
Avg prefill len:     2643.00 tokens
P50 prefill len:     2643.00 tokens
P99 prefill len:     2643.00 tokens
Avg decode len:      1000.00 tokens
P50 decode len:      1000.00 tokens
P99 decode len:      1000.00 tokens

After:
QPS:                 1.26
Avg latency:         49.998s
Avg TTFT (client):   1679.44ms
P50 TTFT (client):   1584.17ms
P99 TTFT (client):   5748.46ms
Avg TTIT (client):   48.32ms
P50 TTIT (client):   48.21ms
P99 TTIT (client):   59.81ms
Avg TTFT (server):   2481.96ms
Avg TTIT (server):   48.06ms
Avg prefill len:     2643.00 tokens
P50 prefill len:     2643.00 tokens
P99 prefill len:     2643.00 tokens
Avg decode len:      1000.00 tokens
P50 decode len:      1000.00 tokens
P99 decode len:      1000.00 tokens
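
As a quick editorial illustration (not part of the original test output), the improvement implied by the numbers above can be computed directly:

  # Back-of-the-envelope ratios from the reported before/after numbers.
  before = {"avg_latency_s": 104.364, "avg_ttft_client_ms": 47662.44, "avg_ttit_client_ms": 56.70}
  after  = {"avg_latency_s": 49.998,  "avg_ttft_client_ms": 1679.44,  "avg_ttit_client_ms": 48.32}

  for key in before:
      print(f"{key}: {before[key] / after[key]:.1f}x improvement")
  # avg_latency_s: 2.1x improvement
  # avg_ttft_client_ms: 28.4x improvement
  # avg_ttit_client_ms: 1.2x improvement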

Rollback Plan:

Differential Revision: D80552189

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D80552189

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request removes a redundant all-gather and split operation within the split_qkv method of the GLM-4.1V model's attention mechanism. This change significantly improves performance by avoiding unnecessary data communication and manipulation across tensor parallel ranks, as demonstrated by the impressive benchmark results. The previous implementation gathered sharded QKV tensors from all ranks only to immediately split them back into local shards, which was an expensive no-op. The new implementation correctly operates on the local QKV shards directly. This change appears correct and is a valuable optimization.
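
To make the pattern concrete, here is a minimal single-process sketch of the round trip the review describes. The tensor sizes are arbitrary and the all-gather is simulated with a plain concatenation; this is an editorial illustration of the shape round trip, not the actual vLLM split_qkv code.

  import torch

  # Illustrative sizes (assumptions, not taken from the GLM-4.5V config).
  seq_len, total_heads, head_dim = 16, 8, 64
  tp_size, tp_rank = 2, 0
  heads_per_rank = total_heads // tp_size

  # Each tensor-parallel rank holds only its shard of the fused QKV output.
  local_qkv = torch.randn(seq_len, 3 * heads_per_rank * head_dim)

  # Old pattern (as described above), simulated in one process: gather the
  # shards from every rank, split into full q/k/v, then immediately re-slice
  # each one back down to this rank's shard.  In the real code the gather is
  # a tensor-parallel all-gather collective, not a torch.cat.
  qkv_full = torch.cat([local_qkv] * tp_size, dim=-1)   # [seq, 3 * total_heads * head_dim]
  q_full, k_full, v_full = qkv_full.chunk(3, dim=-1)
  q_old = q_full.chunk(tp_size, dim=-1)[tp_rank]        # back to the local shard
  k_old = k_full.chunk(tp_size, dim=-1)[tp_rank]
  v_old = v_full.chunk(tp_size, dim=-1)[tp_rank]

  # New pattern: the local fused output already contains this rank's q/k/v
  # shards, so split it directly with no communication.
  q_new, k_new, v_new = local_qkv.chunk(3, dim=-1)

  print(q_old.shape, q_new.shape)  # both torch.Size([16, 256])

Dropping the gather-then-re-split step avoids a cross-rank collective plus the extra tensor slicing on every attention call, which is consistent with the latency improvement reported in the benchmarks above.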

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D80552189

@DarkLight1337 (Member) left a comment

cc @zRzRzRzRzRzRzR can you verify this?

@talwolman

@zRzRzRzRzRzRzR any chance you can help review and get this merged? @chenxi-yang is working on very high priority projects, and these PRs are critical for them.

@chenxi-yang (Contributor, Author) commented Sep 10, 2025

> cc @zRzRzRzRzRzRzR can you verify this?

Wanted to check if there is any update, thanks. The accuracy on MMMU is 0.74 (on par with the original model architecture).

@houseroad (Collaborator)

Okay, this is a useless round-trip q/k/v conversion; we can skip it.

@houseroad (Collaborator)

We have verified it's working.

@houseroad added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 10, 2025
@vllm-bot merged commit d133601 into vllm-project:main on Sep 11, 2025
38 of 41 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025