
Conversation

frank-wei
Contributor

@frank-wei frank-wei commented Sep 3, 2025

Summary:
On GB200, the MoE MXFP4 weight transpose takes quite a long time when the gpt-oss model is loaded. Cache the permutation indices used for the weight transpose so that the per-expert weight transpose time drops sharply.
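The caching idea, as a minimal runnable sketch (hypothetical names, not the actual vLLM code): assuming the permutation indices depend only on the weight shape, as the caching implies, they can be computed once per unique shape and reused for every expert.

```
from functools import lru_cache

import torch


@lru_cache(maxsize=None)
def cached_permute_indices(numel: int) -> torch.Tensor:
    # Stand-in for the real (expensive) index computation; any deterministic
    # permutation works for illustration. With the cache, it runs once per
    # unique size instead of once per expert.
    g = torch.Generator().manual_seed(0)
    return torch.randperm(numel, generator=g)


def permute_expert_weight(w: torch.Tensor) -> torch.Tensor:
    # Flatten, apply the cached permutation on the weight's device, restore shape.
    idx = cached_permute_indices(w.numel()).to(w.device)
    return w.reshape(-1)[idx].reshape(w.shape).contiguous()


# All experts with the same weight shape hit the cache after the first call.
experts = [torch.randn(128, 64) for _ in range(8)]
permuted = [permute_expert_weight(w) for w in experts]
```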

20b:
Before: Model loading took 94 sec

```
(EngineCore_0 pid=3397977) INFO 09-01 19:27:08 [default_loader.py:267] Loading weights took 2.83 seconds
(EngineCore_0 pid=3397977) INFO 09-01 19:28:41 [gpu_model_runner.py:1977] Model loading took 14.1643 GiB and 94.110470 seconds
```

After: Model loading took 5.9 sec

```
(EngineCore_0 pid=3005216) INFO 09-02 16:54:43 [default_loader.py:267] Loading weights took 2.54 seconds
(EngineCore_0 pid=3005216) INFO 09-02 16:54:47 [gpu_model_runner.py:1977] Model loading took 14.1693 GiB and 5.918206 seconds
```

120b:
Loading time verification:

Before (P1928776629):
E2E predictor warm-up took 17:28:53 to 17:39:59 = 11 min 6 sec; model loading took 568.133048 seconds.

```
(EngineCore_0 pid=344869) INFO 09-02 17:29:45 [default_loader.py:267] Loading weights took 8.25 seconds
(EngineCore_0 pid=344869) INFO 09-02 17:39:05 [gpu_model_runner.py:1977] Model loading took 68.7019 GiB and 568.133048 seconds
```

After (P1928762318):
E2E predictor warm-up took 17:26:12 to 17:28:15 = 2 min 3 sec; model loading took 15.083996 seconds.

```
(EngineCore_0 pid=156514) INFO 09-02 17:27:05 [default_loader.py:267] Loading weights took 9.18 seconds
(EngineCore_0 pid=156514) INFO 09-02 17:27:12 [gpu_model_runner.py:1977] Model loading took 68.7093 GiB and 15.083996 seconds
```

Accuracy verification:

aime25 medium (P1928806083):
```
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-medium_temp1.0_20250902_175112', 'metric': 0.7875}]
```

aime25 high (P1928898566):
```
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20250902_180141', 'metric': 0.9}]
```

Test Plan:
Compared the transposed weights before and after the change; they match. [link]

python test_eq.py

```
import torch

# Dumps of the per-expert GEMM1 [weights, scales, biases], saved before
# ("wei") and after ("wei2") this change.
[g1w, g1s, g1b] = torch.load("/tmp/gemm1_wei.pt")
[g1w2, g1s2, g1b2] = torch.load("/tmp/gemm1_wei2.pt")

# Every expert's weight, scale, and bias should match bit-for-bit.
for i in range(len(g1w)):
    print(i)
    print(torch.equal(g1w[i], g1w2[i]))
    print(torch.equal(g1s[i], g1s2[i]))
    print(torch.equal(g1b[i], g1b2[i]))

# Same check for the GEMM2 dumps.
[g2w, g2s, g2b] = torch.load("/tmp/gemm2_wei.pt")
[g2w2, g2s2, g2b2] = torch.load("/tmp/gemm2_wei2.pt")

for i in range(len(g2w)):
    print(i)
    print(torch.equal(g2w[i], g2w2[i]))
    print(torch.equal(g2s[i], g2s2[i]))
    print(torch.equal(g2b[i], g2b2[i]))
```
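
The same equality check could also be phrased as a small pytest unit test (a sketch only; it assumes the dump files above exist and keep the [weights, scales, biases] layout):

```
import pytest
import torch

DUMPS = [
    ("/tmp/gemm1_wei.pt", "/tmp/gemm1_wei2.pt"),
    ("/tmp/gemm2_wei.pt", "/tmp/gemm2_wei2.pt"),
]


@pytest.mark.parametrize("before_path,after_path", DUMPS)
def test_transposed_weights_match(before_path, after_path):
    # Each file holds [weights, scales, biases], one tensor per expert.
    before = torch.load(before_path)
    after = torch.load(after_path)
    for old_group, new_group in zip(before, after):
        for old, new in zip(old_group, new_group):
            assert torch.equal(old, new)
```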

Rollback Plan:

Differential Revision: D81544286

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D81544286

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a caching mechanism for permutation indices to accelerate weight loading in MoE layers, which is a valuable optimization that demonstrates significant performance gains. The overall implementation is solid, but I've identified a critical bug where an incorrect device is used for a tensor operation, which could lead to runtime errors or incorrect behavior.

Comment on lines 424 to 411
```
w2_weight_scale[i]
.view(torch.uint8)[
    permute_sf_indices.to(w13_weight_scale.device)
]
.contiguous()
```

critical

There appears to be a typo on line 426. The device should be w2_weight_scale.device instead of w13_weight_scale.device. While this might work if both tensors are on the same device, it is safer and more correct to use the device of the tensor being processed to avoid potential runtime errors or incorrect behavior.

Suggested change
```
-w2_weight_scale[i]
-.view(torch.uint8)[
-    permute_sf_indices.to(w13_weight_scale.device)
-]
-.contiguous()
+w2_weight_scale[i]
+.view(torch.uint8)[
+    permute_sf_indices.to(w2_weight_scale.device)
+]
+.contiguous()
```
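
One way to make that expectation explicit in the loader would be a cheap device assertion (a sketch with stand-in tensors, not the actual vLLM source):

```
import torch

# Stand-ins for the two scale tensors; in the real loader they come from
# the checkpoint and normally share a device.
w13_weight_scale = torch.zeros(4, 4, dtype=torch.uint8)
w2_weight_scale = torch.zeros(4, 4, dtype=torch.uint8)

# Fail loudly if the assumption ever breaks, instead of silently indexing
# with indices moved to the wrong device.
assert w13_weight_scale.device == w2_weight_scale.device
```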


mergify bot commented Sep 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @frank-wei.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 4, 2025
frank-wei added a commit to frank-wei/vllm that referenced this pull request Sep 4, 2025

frank-wei added a commit to frank-wei/vllm that referenced this pull request Sep 4, 2025
frank-wei added a commit to frank-wei/vllm that referenced this pull request Sep 4, 2025

frank-wei added a commit to frank-wei/vllm that referenced this pull request Sep 4, 2025

frank-wei added a commit to frank-wei/vllm that referenced this pull request Sep 4, 2025
yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Sep 6, 2025
@22quinn
Collaborator

22quinn commented Sep 9, 2025

Looks good! Let's add a unit test?

@22quinn 22quinn added the performance, quantization, moe, and gpt-oss labels Sep 9, 2025
@yeqcharlotte
Collaborator

yeqcharlotte commented Sep 9, 2025

Thanks for the change! This is huge!

Could you update your PR title and replace the internal pastebin links with a gist? :)

cc: @houseroad @mgoin @LucasWilkinson

@frank-wei frank-wei changed the title "reduce the weight loading time" to "[Misc] Reduce the gpt-oss model loading time" Sep 9, 2025
@frank-wei
Contributor Author

Thanks @yeqcharlotte and @22quinn for the review. I have updated this PR as suggested.

@jwfromm

jwfromm commented Sep 9, 2025

Great fix! Just curious, do we know why this issue is so much more noticeable on GB200 than other GPUs? It seems like this improvement is backend agnostic.

@frank-wei
Contributor Author

> Great fix! Just curious, do we know why this issue is so much more noticeable on GB200 than other GPUs? It seems like this improvement is backend agnostic.

@jwfromm, the issue came up while enabling MXFP4 for the MoE weights. AFAIK, only Blackwell supports this format.

@22quinn 22quinn added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 9, 2025
@22quinn 22quinn changed the title "[Misc] Reduce the gpt-oss model loading time" to "[gpt-oss] Cache permute indices for faster MXFP4 MoE layer loading" Sep 10, 2025
@22quinn 22quinn enabled auto-merge (squash) September 10, 2025 02:09
@22quinn 22quinn merged commit 0efdb5c into vllm-project:main Sep 10, 2025
46 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
rogeryoungh pushed a commit to MiniMax-AI/vllm that referenced this pull request Sep 15, 2025
cboss6 pushed a commit to cboss6/vllm that referenced this pull request Sep 16, 2025
cboss6 pushed a commit to cboss6/vllm that referenced this pull request Sep 16, 2025
langc23 pushed a commit to zte-riscv/vllm that referenced this pull request Sep 23, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025