Conversation

@suachong (Contributor) commented Sep 5, 2025

This is a resubmission of PR #434 due to a failed CLA check from @gyula-htec.

@suachong requested review from a team as code owners on September 5, 2025 at 15:38
github-actions bot commented Sep 5, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@@ -25,7 +25,6 @@
 - KEY:
     NAME:  opt_learning_rate_decay_steps
     REQ:   EXACTLY_ONE
-    CHECK: " v['value'] == 1200000 "
Contributor
Why is this removed? We need to make sure all submissions set it to 1200000 so that the cosine decay matches.

According to the training policies, it should be ceil(1_200_000 / global_batch_size) - ceil(8000 * 1152 / global_batch_size), rather than checking that the value equals 1200000.

We could do the check:

- KEY:
    NAME:  opt_learning_rate_decay_steps
    REQ:   EXACTLY_ONE
    CHECK: " v['value'] == math.ceil(1_200_000 / s['global_batch_size'] ) - math.ceil(8000 * 1152 / s['global_batch_size'] ) "

or follow the same constraint check for the 405b model:

- KEY:
    NAME:  opt_learning_rate_decay_steps
    REQ:   EXACTLY_ONE
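
For illustration, a minimal Python sketch of what the proposed CHECK expression computes; the helper name and usage here are hypothetical, while v['value'] and s['global_batch_size'] are the names the CHECK string itself uses for the logged value and the submission's hyperparameters:

import math

def expected_decay_steps(global_batch_size: int) -> int:
    # Mirrors the proposed CHECK expression:
    #   math.ceil(1_200_000 / gbs) - math.ceil(8000 * 1152 / gbs)
    return (math.ceil(1_200_000 / global_batch_size)
            - math.ceil(8000 * 1152 / global_batch_size))

# In the checker, the logged v['value'] for opt_learning_rate_decay_steps would
# be compared against expected_decay_steps(s['global_batch_size']).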

Contributor
Let's do the actual check to ensure that all submissions indeed set this correctly.

So can you please add this to both llama3.1 8b and llama3.1 405b as well?

- KEY:
    NAME:  opt_learning_rate_decay_steps
    REQ:   EXACTLY_ONE
    CHECK: " v['value'] == math.ceil(1_200_000 / s['global_batch_size'] ) - math.ceil(8000 * 1152 / s['global_batch_size'] ) "

@suachong (Contributor Author) commented Sep 10, 2025
I think the check works for the 405b model since it subtracts the warmup_steps, which are fixed for that model. However, for the 8b model, warmup_steps is unconstrained. How should we modify the check?

@ShriyaRishab (Contributor)
cc @mmarcinkiewicz

@ShriyaRishab (Contributor)
@suachong - fixed and merged by #436
Can we close this?

@suachong changed the title from "Remove constraint for opt_learning_rate_decay_steps" to "Updated opt_learning_rate_warmup_steps and opt_learning_rate_decay_steps constraint check for llama 3.1 8b and 405b model" on Sep 11, 2025
@suachong changed the title from "Updated opt_learning_rate_warmup_steps and opt_learning_rate_decay_steps constraint check for llama 3.1 8b and 405b model" to "Update opt_learning_rate_warmup_steps and opt_learning_rate_decay_steps constraint check for llama 3.1 8b and 405b model" on Sep 11, 2025