
@chenxi-yang (Contributor)

Summary: Clean up the GLM model architecture.

Test Plan:
max_batch_size=128

  CUDA_VISIBLE_DEVICES=6,7 \
  VLLM_DISABLE_COMPILE_CACHE=1 \
  VLLM_MQ_MAX_CHUNK_BYTES_MB=256 \
  VLLM_GPU_MEMORY_UTILIZATION=0.9 \
  buck2 run @//mode/{opt,inplace} \
    -c fbcode.enable_vllm=true \
    -c fbcode.enable_gpu_sections=true \
    -c fbcode.nvcc_arch=h100a \
    //smart/inference_platform_sp/llm_predictor_gpu:service -- \
    --local_cache_dir "$HOME/local/models/GLM-4.5V-FP8" \
    --try_local_cache \
    --max_seq_len=16384 \
    --max_batch_size ${max_batch_size} \
    --thrift_server_port 12345 \
    --enable_warmup=true \
    --model_mf_bucket=llm_inference \
    --model_mf_path=tree/oss/GLM-4.5V-FP8 \
    --force_llm_format=true \
    --allow_custom_stop_tokens \
    --model_parallel_size 2 \
    --vllm_engine \
    --cpu_offload_gb=0 \
    --kv_cache_quantization 8 \
    2>&1 | tee "/tmp/$USER/server_vllm_glm4_5v.log"

Before:
QPS:                 1.13
Avg latency:         104.364s
Avg TTFT (client):   47662.44ms
P50 TTFT (client):   55740.47ms
P99 TTFT (client):   57673.41ms
Avg TTIT (client):   56.70ms
P50 TTIT (client):   53.03ms
P99 TTIT (client):   92.62ms
Avg TTFT (server):   45252.67ms
Avg TTIT (server):   56.23ms
Avg prefill len:     2643.00 tokens
P50 prefill len:     2643.00 tokens
P99 prefill len:     2643.00 tokens
Avg decode len:      1000.00 tokens
P50 decode len:      1000.00 tokens
P99 decode len:      1000.00 tokens

After:
QPS:                 1.26
Avg latency:         49.998s
Avg TTFT (client):   1679.44ms
P50 TTFT (client):   1584.17ms
P99 TTFT (client):   5748.46ms
Avg TTIT (client):   48.32ms
P50 TTIT (client):   48.21ms
P99 TTIT (client):   59.81ms
Avg TTFT (server):   2481.96ms
Avg TTIT (server):   48.06ms
Avg prefill len:     2643.00 tokens
P50 prefill len:     2643.00 tokens
P99 prefill len:     2643.00 tokens
Avg decode len:      1000.00 tokens
P50 decode len:      1000.00 tokens
P99 decode len:      1000.00 tokens
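
As a quick editorial illustration (not part of the original test output), the improvement implied by the numbers above can be computed directly:

  # Back-of-the-envelope ratios from the reported before/after numbers.
  before = {"avg_latency_s": 104.364, "avg_ttft_client_ms": 47662.44, "avg_ttit_client_ms": 56.70}
  after  = {"avg_latency_s": 49.998,  "avg_ttft_client_ms": 1679.44,  "avg_ttit_client_ms": 48.32}

  for key in before:
      print(f"{key}: {before[key] / after[key]:.1f}x improvement")
  # avg_latency_s: 2.1x improvement
  # avg_ttft_client_ms: 28.4x improvement
  # avg_ttit_client_ms: 1.2x improvement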

Rollback Plan:

Differential Revision: D80552189

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D80552189

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request removes a redundant all-gather and split operation within the split_qkv method of the GLM-4.1V model's attention mechanism. This change significantly improves performance by avoiding unnecessary data communication and manipulation across tensor parallel ranks, as demonstrated by the impressive benchmark results. The previous implementation gathered sharded QKV tensors from all ranks only to immediately split them back into local shards, which was an expensive no-op. The new implementation correctly operates on the local QKV shards directly. This change appears correct and is a valuable optimization.
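
To make the pattern concrete, here is a minimal single-process sketch of the round trip the review describes. The tensor sizes are arbitrary and the all-gather is simulated with a plain concatenation; this is an editorial illustration of the shape round trip, not the actual vLLM split_qkv code.

  import torch

  # Illustrative sizes (assumptions, not taken from the GLM-4.5V config).
  seq_len, total_heads, head_dim = 16, 8, 64
  tp_size, tp_rank = 2, 0
  heads_per_rank = total_heads // tp_size

  # Each tensor-parallel rank holds only its shard of the fused QKV output.
  local_qkv = torch.randn(seq_len, 3 * heads_per_rank * head_dim)

  # Old pattern (as described above), simulated in one process: gather the
  # shards from every rank, split into full q/k/v, then immediately re-slice
  # each one back down to this rank's shard.  In the real code the gather is
  # a tensor-parallel all-gather collective, not a torch.cat.
  qkv_full = torch.cat([local_qkv] * tp_size, dim=-1)   # [seq, 3 * total_heads * head_dim]
  q_full, k_full, v_full = qkv_full.chunk(3, dim=-1)
  q_old = q_full.chunk(tp_size, dim=-1)[tp_rank]        # back to the local shard
  k_old = k_full.chunk(tp_size, dim=-1)[tp_rank]
  v_old = v_full.chunk(tp_size, dim=-1)[tp_rank]

  # New pattern: the local fused output already contains this rank's q/k/v
  # shards, so split it directly with no communication.
  q_new, k_new, v_new = local_qkv.chunk(3, dim=-1)

  print(q_old.shape, q_new.shape)  # both torch.Size([16, 256])

Dropping the gather-then-re-split step avoids a cross-rank collective plus the extra tensor slicing on every attention call, which is consistent with the latency improvement reported in the benchmarks above.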

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D80552189

@DarkLight1337 (Member) left a comment

cc @zRzRzRzRzRzRzR can you verify this?

@talwolman

@zRzRzRzRzRzRzR any chance you can help review and get this merged? @chenxi-yang is working on very high priority projects, and these PRs are critical for them.

@chenxi-yang (Contributor, Author) commented Sep 10, 2025

> cc @zRzRzRzRzRzRzR can you verify this?

Wanted to check if there is any update, thanks. The accuracy on MMMU is 0.74 (on par with the original model architecture).

@houseroad (Collaborator)

Okay, this is a useless round-trip q/k/v conversion; we can skip it.

@houseroad (Collaborator)

We have verified it's working.

@houseroad added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 10, 2025
@vllm-bot merged commit d133601 into vllm-project:main on Sep 11, 2025
38 of 41 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025