Update opt_learning_rate_warmup_steps and opt_learning_rate_decay_steps constraint check for llama 3.1 8b and 405b model #435
base: master
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
@@ -25,7 +25,6 @@
 - KEY:
     NAME: opt_learning_rate_decay_steps
     REQ: EXACTLY_ONE
-    CHECK: " v['value'] == 1200000 "
Why is this removed? We need to make sure all submissions set it to 1200000 so that the cosine decay matches.
From the training policies, it should be ceil(1_200_000 / global_batch_size) - ceil(8000 * 1152 / global_batch_size) instead of checking that the step count equals 1200000.
We could do the check:
- KEY:
    NAME: opt_learning_rate_decay_steps
    REQ: EXACTLY_ONE
    CHECK: " v['value'] == math.ceil(1_200_000 / s['global_batch_size'] ) - math.ceil(8000 * 1152 / s['global_batch_size'] ) "
or follow the same constraint check for the 405b model:
logging/mlperf_logging/compliance_checker/training_5.1.0/closed_llama31_405b.yaml
Lines 26 to 28 in 497b7c1
- KEY:
    NAME: opt_learning_rate_decay_steps
    REQ: EXACTLY_ONE
Let's do the actual check to ensure that all submissions indeed set this correctly.
So can you please add this to both llama3.1 8b and llama3.1 405b as well?
- KEY:
    NAME: opt_learning_rate_decay_steps
    REQ: EXACTLY_ONE
    CHECK: " v['value'] == math.ceil(1_200_000 / s['global_batch_size'] ) - math.ceil(8000 * 1152 / s['global_batch_size'] ) "
I think the check works for the 405b model since it subtracts the warmup_steps. The warmup_steps is fixed for the 405b model; however, for the 8b model the warmup_steps is unconstrained. How should we modify the check?
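Purely as an illustration of one possible direction, and not something decided in this thread: the 8b check could subtract the submission's own logged warmup steps instead of the fixed 8000-step constant. This assumes the checker's state dict s records the logged opt_learning_rate_warmup_steps, which is an assumption here.

import math

# Hypothetical variant for the 8b model: warmup is unconstrained, so subtract
# the logged warmup steps rather than ceil(8000 * 1152 / GBS).
def check_decay_steps_8b(v, s):
    expected = (math.ceil(1_200_000 / s['global_batch_size'])
                - s['opt_learning_rate_warmup_steps'])
    return v['value'] == expected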
This is a resubmission of PR #434 due to a failed CLA check from @gyula-htec.