
Conversation

frank-wei
Contributor

@frank-wei frank-wei commented Sep 3, 2025

Summary:
On GB200, the MoE MXFP4 weight transpose takes quite a long time when the gpt-oss model is loaded. Cache the permutation indices used for the weight transpose so that the per-expert weight transpose time drops sharply.
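The caching idea, as a minimal runnable sketch (hypothetical names, not the actual vLLM code): assuming the permutation indices depend only on the weight shape, as the caching implies, they can be computed once per unique shape and reused for every expert.

```
from functools import lru_cache

import torch


@lru_cache(maxsize=None)
def cached_permute_indices(numel: int) -> torch.Tensor:
    # Stand-in for the real (expensive) index computation; any deterministic
    # permutation works for illustration. With the cache, it runs once per
    # unique size instead of once per expert.
    g = torch.Generator().manual_seed(0)
    return torch.randperm(numel, generator=g)


def permute_expert_weight(w: torch.Tensor) -> torch.Tensor:
    # Flatten, apply the cached permutation on the weight's device, restore shape.
    idx = cached_permute_indices(w.numel()).to(w.device)
    return w.reshape(-1)[idx].reshape(w.shape).contiguous()


# All experts with the same weight shape hit the cache after the first call.
experts = [torch.randn(128, 64) for _ in range(8)]
permuted = [permute_expert_weight(w) for w in experts]
```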

20b:
Before: Model loading took 94 sec

```
(EngineCore_0 pid=3397977) INFO 09-01 19:27:08 [default_loader.py:267] Loading weights took 2.83 seconds
(EngineCore_0 pid=3397977) INFO 09-01 19:28:41 [gpu_model_runner.py:1977] Model loading took 14.1643 GiB and 94.110470 seconds
```

After: Model loading took 5.9 sec

```
(EngineCore_0 pid=3005216) INFO 09-02 16:54:43 [default_loader.py:267] Loading weights took 2.54 seconds
(EngineCore_0 pid=3005216) INFO 09-02 16:54:47 [gpu_model_runner.py:1977] Model loading took 14.1693 GiB and 5.918206 seconds
```

120b:
Loading time verification:

Before (P1928776629):
E2E predictor warm-up took 17:28:53 to 17:39:59 = 11 min 6 sec; model loading took 568.133048 seconds.

```
(EngineCore_0 pid=344869) INFO 09-02 17:29:45 [default_loader.py:267] Loading weights took 8.25 seconds
(EngineCore_0 pid=344869) INFO 09-02 17:39:05 [gpu_model_runner.py:1977] Model loading took 68.7019 GiB and 568.133048 seconds
```

After (P1928762318):
E2E predictor warm-up took 17:26:12 to 17:28:15 = 2 min 3 sec; model loading took 15.083996 seconds.

```
(EngineCore_0 pid=156514) INFO 09-02 17:27:05 [default_loader.py:267] Loading weights took 9.18 seconds
(EngineCore_0 pid=156514) INFO 09-02 17:27:12 [gpu_model_runner.py:1977] Model loading took 68.7093 GiB and 15.083996 seconds
```

Accuracy verification:

aime25 medium (P1928806083):
```
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-medium_temp1.0_20250902_175112', 'metric': 0.7875}]
```

aime25 high (P1928898566):
```
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20250902_180141', 'metric': 0.9}]
```

Test Plan:
Compared the transposed weights before and after the change; they match. [link]

python test_eq.py

```
import torch

# Dumps of the per-expert GEMM1 [weights, scales, biases], saved before
# ("wei") and after ("wei2") this change.
[g1w, g1s, g1b] = torch.load("/tmp/gemm1_wei.pt")
[g1w2, g1s2, g1b2] = torch.load("/tmp/gemm1_wei2.pt")

# Every expert's weight, scale, and bias should match bit-for-bit.
for i in range(len(g1w)):
    print(i)
    print(torch.equal(g1w[i], g1w2[i]))
    print(torch.equal(g1s[i], g1s2[i]))
    print(torch.equal(g1b[i], g1b2[i]))

# Same check for the GEMM2 dumps.
[g2w, g2s, g2b] = torch.load("/tmp/gemm2_wei.pt")
[g2w2, g2s2, g2b2] = torch.load("/tmp/gemm2_wei2.pt")

for i in range(len(g2w)):
    print(i)
    print(torch.equal(g2w[i], g2w2[i]))
    print(torch.equal(g2s[i], g2s2[i]))
    print(torch.equal(g2b[i], g2b2[i]))
```
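
The same equality check could also be phrased as a small pytest unit test (a sketch only; it assumes the dump files above exist and keep the [weights, scales, biases] layout):

```
import pytest
import torch

DUMPS = [
    ("/tmp/gemm1_wei.pt", "/tmp/gemm1_wei2.pt"),
    ("/tmp/gemm2_wei.pt", "/tmp/gemm2_wei2.pt"),
]


@pytest.mark.parametrize("before_path,after_path", DUMPS)
def test_transposed_weights_match(before_path, after_path):
    # Each file holds [weights, scales, biases], one tensor per expert.
    before = torch.load(before_path)
    after = torch.load(after_path)
    for old_group, new_group in zip(before, after):
        for old, new in zip(old_group, new_group):
            assert torch.equal(old, new)
```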

Rollback Plan:

Differential Revision: D81544286

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D81544286

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a caching mechanism for permutation indices to accelerate weight loading in MoE layers, which is a valuable optimization that demonstrates significant performance gains. The overall implementation is solid, but I've identified a critical bug where an incorrect device is used for a tensor operation, which could lead to runtime errors or incorrect behavior.

Comment on lines 424 to 411
```
w2_weight_scale[i]
.view(torch.uint8)[
    permute_sf_indices.to(w13_weight_scale.device)
]
.contiguous()
```

critical

There appears to be a typo on line 426. The device should be w2_weight_scale.device instead of w13_weight_scale.device. While this might work if both tensors are on the same device, it is safer and more correct to use the device of the tensor being processed to avoid potential runtime errors or incorrect behavior.

Suggested change
```
-w2_weight_scale[i]
-.view(torch.uint8)[
-    permute_sf_indices.to(w13_weight_scale.device)
-]
-.contiguous()
+w2_weight_scale[i]
+.view(torch.uint8)[
+    permute_sf_indices.to(w2_weight_scale.device)
+]
+.contiguous()
```
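
One way to make that expectation explicit in the loader would be a cheap device assertion (a sketch with stand-in tensors, not the actual vLLM source):

```
import torch

# Stand-ins for the two scale tensors; in the real loader they come from
# the checkpoint and normally share a device.
w13_weight_scale = torch.zeros(4, 4, dtype=torch.uint8)
w2_weight_scale = torch.zeros(4, 4, dtype=torch.uint8)

# Fail loudly if the assumption ever breaks, instead of silently indexing
# with indices moved to the wrong device.
assert w13_weight_scale.device == w2_weight_scale.device
```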


mergify bot commented Sep 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @frank-wei.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 4, 2025
frank-wei added a commit to frank-wei/vllm that referenced this pull request Sep 4, 2025

frank-wei added a commit to frank-wei/vllm that referenced this pull request Sep 4, 2025
frank-wei added a commit to frank-wei/vllm that referenced this pull request Sep 4, 2025

frank-wei added a commit to frank-wei/vllm that referenced this pull request Sep 4, 2025

frank-wei added a commit to frank-wei/vllm that referenced this pull request Sep 4, 2025
yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Sep 6, 2025
@22quinn
Collaborator

22quinn commented Sep 9, 2025

Looks good! Let's add a unit test?

@22quinn 22quinn added the performance, quantization, moe, and gpt-oss labels Sep 9, 2025
@yeqcharlotte
Collaborator

yeqcharlotte commented Sep 9, 2025

Thanks for the change! This is huge!

Could you update your PR title and replace the internal pastebin links with a gist? :)

cc: @houseroad @mgoin @LucasWilkinson

@frank-wei frank-wei changed the title "reduce the weight loading time" to "[Misc] Reduce the gpt-oss model loading time" Sep 9, 2025
@frank-wei
Contributor Author

Thanks @yeqcharlotte and @22quinn for the review. I have updated this PR as suggested.

@jwfromm

jwfromm commented Sep 9, 2025

Great fix! Just curious, do we know why this issue is so much more noticeable on GB200 than other GPUs? It seems like this improvement is backend agnostic.

@frank-wei
Contributor Author

> Great fix! Just curious, do we know why this issue is so much more noticeable on GB200 than other GPUs? It seems like this improvement is backend agnostic.

@jwfromm, the issue came up while enabling MXFP4 for the MoE weights. AFAIK, only Blackwell supports this format.

@22quinn 22quinn added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 9, 2025
@22quinn 22quinn changed the title "[Misc] Reduce the gpt-oss model loading time" to "[gpt-oss] Cache permute indices for faster MXFP4 MoE layer loading" Sep 10, 2025
@22quinn 22quinn enabled auto-merge (squash) September 10, 2025 02:09
@22quinn 22quinn merged commit 0efdb5c into vllm-project:main Sep 10, 2025
46 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
rogeryoungh pushed a commit to MiniMax-AI/vllm that referenced this pull request Sep 15, 2025
cboss6 pushed a commit to cboss6/vllm that referenced this pull request Sep 16, 2025
cboss6 pushed a commit to cboss6/vllm that referenced this pull request Sep 16, 2025
langc23 pushed a commit to zte-riscv/vllm that referenced this pull request Sep 23, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025