Remove redundant all gather + split #23441
Conversation
This pull request was exported from Phabricator. Differential Revision: D80552189
Force-pushed from fbe781d to 6d2dd46 (Compare)
This pull request was exported from Phabricator. Differential Revision: D80552189
Code Review
This pull request removes a redundant all-gather and split operation within the split_qkv method of the GLM-4.1V model's attention mechanism. The previous implementation gathered the sharded QKV tensors from all tensor-parallel ranks only to immediately split them back into local shards, paying for cross-rank communication and tensor manipulation without changing the result. The new implementation operates on the local QKV shards directly, which significantly improves performance, as the benchmark results demonstrate. The change appears correct and is a valuable optimization.
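To make the reviewed pattern concrete, here is a minimal, self-contained sketch (not the actual vLLM code): the cross-rank all-gather is simulated with an in-process list, and the shapes, head counts, and names other than split_qkv are hypothetical. It shows why gathering the sharded QKV output and then re-slicing it back to the local heads returns exactly what each rank already had.

```python
import torch

tp_size = 2          # tensor-parallel world size (assumed for illustration)
heads_per_rank = 8   # attention heads owned by each rank (assumed)
head_dim = 64
seq_len = 4

# Each rank's fused QKV projection output covers only that rank's heads,
# laid out as [q_local | k_local | v_local] along the last dimension.
local_qkv = [torch.randn(seq_len, 3 * heads_per_rank * head_dim)
             for _ in range(tp_size)]


def split_qkv_old(rank: int):
    """Old pattern: gather every rank's shard, rebuild the full q/k/v tensors,
    then immediately slice this rank's heads back out again."""
    # "All-gather" simulated by reading every rank's shard from the list.
    qs, ks, vs = zip(*(shard.chunk(3, dim=-1) for shard in local_qkv))
    q_full = torch.cat(qs, dim=-1)
    k_full = torch.cat(ks, dim=-1)
    v_full = torch.cat(vs, dim=-1)
    # Re-split: keep only the columns belonging to this rank's heads.
    width = heads_per_rank * head_dim
    cols = slice(rank * width, (rank + 1) * width)
    return q_full[:, cols], k_full[:, cols], v_full[:, cols]


def split_qkv_new(rank: int):
    """New pattern: the local shard already holds exactly this rank's heads,
    so a plain split needs no cross-rank communication at all."""
    return local_qkv[rank].chunk(3, dim=-1)


for r in range(tp_size):
    for old, new in zip(split_qkv_old(r), split_qkv_new(r)):
        assert torch.equal(old, new)  # same result, minus the round trip
print("old and new split_qkv agree on every rank")
```

In the real model the gather is a collective over the tensor-parallel group, so removing it eliminates both the communication and the intermediate full-size tensors.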
Force-pushed from 6d2dd46 to 76f717e (Compare)
This pull request was exported from Phabricator. Differential Revision: D80552189
cc @zRzRzRzRzRzRzR can you verify this?
@zRzRzRzRzRzRzR any chance you can help review and get this merged? @chenxi-yang is working on very high-priority projects and these PRs are critical for them.
Wanted to check if there is any update, thanks. The accuracy on MMMU is 0.74 (on par with the original model architecture).
Okay, this is a useless round-trip q/k/v conversion. We can skip it.
We have verified it's working. |
Co-authored-by: Chenxi Yang <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Summary: Clean up the GLM model architecture
Test Plan:
max_batch_size=128
CUDA_VISIBLE_DEVICES=6,7 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_MQ_MAX_CHUNK_BYTES_MB=256 \
VLLM_GPU_MEMORY_UTILIZATION=0.9 \
buck2 run @//mode/{opt,inplace} \
-c fbcode.enable_vllm=true \
-c fbcode.enable_gpu_sections=true \
-c fbcode.nvcc_arch=h100a \
//smart/inference_platform_sp/llm_predictor_gpu:service -- \
--local_cache_dir "$HOME/local/models/GLM-4.5V-FP8" \
--try_local_cache \
--max_seq_len=16384 \
--max_batch_size ${max_batch_size} \
--thrift_server_port 12345 \
--enable_warmup=true \
--model_mf_bucket=llm_inference \
--model_mf_path=tree/oss/GLM-4.5V-FP8 \
--force_llm_format=true \
--allow_custom_stop_tokens \
--model_parallel_size 2 \
--vllm_engine \
--cpu_offload_gb=0 \
--kv_cache_quantization 8 \
2>&1 | tee "/tmp/$USER/server_vllm_glm4_5v.log"
Before:
QPS: 1.13
Avg latency: 104.364s
Avg TTFT (client): 47662.44ms
P50 TTFT (client): 55740.47ms
P99 TTFT (client): 57673.41ms
Avg TTIT (client): 56.70ms
P50 TTIT (client): 53.03ms
P99 TTIT (client): 92.62ms
Avg TTFT (server): 45252.67ms
Avg TTIT (server): 56.23ms
Avg prefill len: 2643.00 tokens
P50 prefill len: 2643.00 tokens
P99 prefill len: 2643.00 tokens
Avg decode len: 1000.00 tokens
P50 decode len: 1000.00 tokens
P99 decode len: 1000.00 tokens
After:
QPS: 1.26
Avg latency: 49.998s
Avg TTFT (client): 1679.44ms
P50 TTFT (client): 1584.17ms
P99 TTFT (client): 5748.46ms
Avg TTIT (client): 48.32ms
P50 TTIT (client): 48.21ms
P99 TTIT (client): 59.81ms
Avg TTFT (server): 2481.96ms
Avg TTIT (server): 48.06ms
Avg prefill len: 2643.00 tokens
P50 prefill len: 2643.00 tokens
P99 prefill len: 2643.00 tokens
Avg decode len: 1000.00 tokens
P50 decode len: 1000.00 tokens
P99 decode len: 1000.00 tokens
Rollback Plan:
Differential Revision: D80552189