Conversation

@suachong (Contributor) commented Sep 5, 2025

This is a resubmission of PR #434 due to a failed CLA check from @gyula-htec.

@suachong requested review from a team as code owners on September 5, 2025 at 15:38
github-actions bot commented Sep 5, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@@ -25,7 +25,6 @@
 - KEY:
     NAME:  opt_learning_rate_decay_steps
     REQ:   EXACTLY_ONE
-    CHECK: " v['value'] == 1200000 "
Contributor
Why is this removed? We need to make sure all submissions set it to 1200000 so that the cosine decay matches.

According to the training policies, it should be ceil(1_200_000 / global_batch_size) - ceil(8000 * 1152 / global_batch_size), rather than checking that the value equals 1200000.

We could do the check:

- KEY:
    NAME:  opt_learning_rate_decay_steps
    REQ:   EXACTLY_ONE
    CHECK: " v['value'] == math.ceil(1_200_000 / s['global_batch_size'] ) - math.ceil(8000 * 1152 / s['global_batch_size'] ) "

or follow the same constraint check for the 405b model:

- KEY:
    NAME:  opt_learning_rate_decay_steps
    REQ:   EXACTLY_ONE
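
For illustration, a minimal Python sketch of what the proposed CHECK expression computes; the helper name and usage here are hypothetical, while v['value'] and s['global_batch_size'] are the names the CHECK string itself uses for the logged value and the submission's hyperparameters:

import math

def expected_decay_steps(global_batch_size: int) -> int:
    # Mirrors the proposed CHECK expression:
    #   math.ceil(1_200_000 / gbs) - math.ceil(8000 * 1152 / gbs)
    return (math.ceil(1_200_000 / global_batch_size)
            - math.ceil(8000 * 1152 / global_batch_size))

# In the checker, the logged v['value'] for opt_learning_rate_decay_steps would
# be compared against expected_decay_steps(s['global_batch_size']).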

Contributor
Let's do the actual check to ensure that all submissions indeed set this correctly.

So can you please add this to both llama3.1 8b and llama3.1 405b as well?

- KEY:
    NAME:  opt_learning_rate_decay_steps
    REQ:   EXACTLY_ONE
    CHECK: " v['value'] == math.ceil(1_200_000 / s['global_batch_size'] ) - math.ceil(8000 * 1152 / s['global_batch_size'] ) "

@suachong (Contributor Author) commented Sep 10, 2025
I think the check works for the 405b model since it subtracts the warmup_steps, which are fixed for that model. However, for the 8b model, warmup_steps is unconstrained. How should we modify the check?

@ShriyaRishab (Contributor)
cc @mmarcinkiewicz

@ShriyaRishab (Contributor)
@suachong - fixed and merged by #436
Can we close this?

@suachong changed the title from "Remove constraint for opt_learning_rate_decay_steps" to "Updated opt_learning_rate_warmup_steps and opt_learning_rate_decay_steps constraint check for llama 3.1 8b and 405b model" on Sep 11, 2025
@suachong changed the title from "Updated opt_learning_rate_warmup_steps and opt_learning_rate_decay_steps constraint check for llama 3.1 8b and 405b model" to "Update opt_learning_rate_warmup_steps and opt_learning_rate_decay_steps constraint check for llama 3.1 8b and 405b model" on Sep 11, 2025