chore(mlobs): infer the name of the job from the submission ID #14540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dubloom/ray-v0
Conversation
Bootstrap import analysis

Comparison of import times between this PR and base.

Summary
The average import time from this PR is: 273 ± 2 ms.
The average import time from base is: 270 ± 3 ms.
The import time difference between this PR and base is: 2.3 ± 0.1 ms.

Import time breakdown
The following import paths have grown:
""" | ||
if submission_id is None: | ||
submission_id = os.environ.get("_RAY_SUBMISSION_ID") or "" | ||
match = JOB_NAME_REGEX.match(submission_id) |
Instead of extracting the first section of the submission ID, I would recommend asking users to send an additional parameter in the job metadata, else defaulting to the submission ID.
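For illustration, the user-facing side of that suggestion could look roughly like the following sketch using Ray's JobSubmissionClient; the job_name metadata key is only a hypothetical convention here, not an existing one:

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
client.submit_job(
    entrypoint="python train.py",
    # Hypothetical convention: carry the logical job name in the job metadata
    # so the tracer can use it as the service name.
    metadata={"job_name": "my_test_job"},
)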
@kanwang thoughts?
@savitagm I agree and like your idea about passing the job id through the job metadata, the reason I didn't do it right away in the same PR is that from a quick look it didn't seem like Ray passes the job metadata through the runtime context into every span. So I was going to make that change in a separate follow-up PR. I think if there is no model name in the metadata, we should still default to the first part of the submission ID and not to the entire submission ID, so that similar jobs get grouped together. Let me know if you disagree with either of these.
I agree with savita on that.
We should let the users choose the name of the service, and otherwise set it to something meaningful. I think the two best choices would be: ray-job, or the submission ID as-is, since having raysubmit as a service name could lack a bit of meaning.
@dubloom @savitagm I have updated the code to support two ways of specifying the job name:
ray job submit --metadata-json='{"job_name": "my_test_job"}'
ray job submit --submission_id="job:my_test_job,run:1"
If neither one of these is specified (no job name in the metadata, and the submission ID is not in the job:X,run:Y format), then the job name will fall back on the full submission ID string, as in Louis' original code. Let me know if you think this is reasonable.
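As a rough sketch of that resolution order (the helper name _infer_job_name and the exact regex are illustrative, not necessarily what the PR uses):

import os
import re

# Matches submission IDs of the form "job:<name>,run:<anything>".
JOB_NAME_REGEX = re.compile(r"^job:(?P<job_name>[^,]+),run:.+$")

def _infer_job_name(submission_id=None, metadata=None):
    # 1. Explicit job name passed via --metadata-json='{"job_name": "..."}'.
    if metadata and metadata.get("job_name"):
        return metadata["job_name"]
    # 2. Structured submission ID: --submission_id="job:my_test_job,run:1".
    if submission_id is None:
        submission_id = os.environ.get("_RAY_SUBMISSION_ID") or ""
    match = JOB_NAME_REGEX.match(submission_id)
    if match:
        return match.group("job_name")
    # 3. Fall back to the full submission ID string.
    return submission_id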
Performance SLOs
Candidate: yakov.shapiro/MLOB-3361/get-service-name-from-ray (f16e29b)

❌ Test Failures (1 suite)
❌ flasksimple - 16/17 — 🔵 No baseline data available for this suite
  ✅ appsec-get: Time ✅ 4.590ms (SLO: <4.750ms, -3.4%); Memory ✅ 61.912MB (SLO: <64.500MB, -4.0%)
  ✅ appsec-post: Time ✅ 6.607ms (SLO: <6.750ms, -2.1%); Memory ✅ 62.167MB (SLO: <64.500MB, -3.6%)
  ✅ appsec-telemetry: Time ✅ 4.606ms (SLO: <4.750ms, -3.0%); Memory ✅ 61.774MB (SLO: <64.500MB, -4.2%)
  ❌ debugger: Time ✅ 1.858ms (SLO: <2.000ms, -7.1%); Memory ❌ 45.141MB (SLO: <45.000MB, +0.3%)
  ✅ iast-get: Time ✅ 1.860ms (SLO: <2.000ms, -7.0%); Memory ✅ 42.074MB (SLO: <49.000MB, 📉 -14.1%)
  ✅ profiler: Time ✅ 1.911ms (SLO: <2.100ms, -9.0%); Memory ✅ 44.807MB (SLO: <46.500MB, -3.6%)
  ✅ tracer: Time ✅ 3.374ms (SLO: <3.650ms, -7.6%); Memory ✅ 51.197MB (SLO: <53.500MB, -4.3%)
  ✅ tracer-native: Time ✅ 3.378ms (SLO: <3.650ms, -7.5%); Memory ✅ 52.475MB (SLO: <53.500MB, 🟡 -1.9%)

🔵 No Baseline Data (23 suites)
  🔵 coreapiscenario - 12/12 (2 unstable)
  🔵 No baseline data available for this suite
@@ -21,13 +20,13 @@ def process_trace(self, trace):
    if span.get_tag(RAY_JOB_ID_TAG_KEY) is not None:
        span.span_type = f"ray.{span.name}"
        span.name = DEFAULT_SPAN_NAME
        span.service = config.service or DEFAULT_SERVICE_NAME
        del span.tags["_dd.base_service"]
Note that this code is for the OpenTelemetry integration and is not used by my integration.
What is the reason behind this deletion?
Thank you. I wanted to have only one service on the span and not two, but since that's not really related to the rest of my changes, I'll just drop this change instead of moving it to the correct processor.
""" | ||
if submission_id is None: | ||
submission_id = os.environ.get("_RAY_SUBMISSION_ID") or "" | ||
match = JOB_NAME_REGEX.match(submission_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with savita on that.
We should let the users choose the name of the service otherwise setting something meaningful. I think the two bests choice would be : ray-job or the submission-id as it is as I think having raysubmit
as a service name could lack a bit of meaning
# Inject the context of the job so that ray.job.run is its child
env_vars = kwargs.setdefault("runtime_env", {}).setdefault("env_vars", {})
_TraceContext._inject(job_span.context, env_vars)
env_vars["_RAY_SUBMISSION_ID"] = submission_id
if job_name:
    env_vars["_RAY_JOB_NAME"] = job_name
I think we will eventually want to make _RAY_JOB_NAME and _RAY_SUBMISSION_ID string constants instead of inline string literals, but in the interest of minimizing merge conflicts, I am postponing that change until a future PR.
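That follow-up could be as small as something like this (the module location and constant names here are illustrative):

# Shared constants for the environment variables used to propagate job context,
# instead of repeating the raw strings at each call site.
RAY_SUBMISSION_ID_ENV_VAR = "_RAY_SUBMISSION_ID"
RAY_JOB_NAME_ENV_VAR = "_RAY_JOB_NAME"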
In the new observability UI that we are building, we want to group together different runs of one and the same job that train one and the same model on different days and/or with different training data. Right now we are using the span.service field to convey the name of the model; we do not want this field to be unique per trace, as Ray submission IDs usually are. We want to group together similar submission IDs in the UI.

Ray will, by default, assign submission IDs that look like raysubmit_VFFkXZzz5CGiiJqx, and in our own Ray staging environment we use submission IDs that look like yakov.shapiro-235da2fc-6645-4c6a-94a3-3804a286f019. Because of this, I am implementing the following heuristic: take the first part of the submission ID that contains only alphanumeric characters or dots (.), and use that as the service name on the spans. This is not perfect, but in the short term it accomplishes the goal of stripping away meaningless suffixes from job names.
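As an illustration of that heuristic (the exact pattern is a sketch; note that the revision discussed in the conversation above falls back to the full submission ID instead):

import re

# First run of characters that are alphanumeric or dots at the start of the
# submission ID.
_LEADING_NAME = re.compile(r"^[A-Za-z0-9.]+")

def _service_name_from_submission_id(submission_id: str) -> str:
    match = _LEADING_NAME.match(submission_id)
    return match.group(0) if match else submission_id

# "raysubmit_VFFkXZzz5CGiiJqx"                          -> "raysubmit"
# "yakov.shapiro-235da2fc-6645-4c6a-94a3-3804a286f019"  -> "yakov.shapiro"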
For testing, I followed these instructions to launch a Docker container to run tests in my repository and then ran the new tests; I verified that one test passed and 30 others were deselected.
For an end-to-end test, I installed Ray and ddtrace, with ddtrace installed so that the library installation is symlinked to the repo and you do not need to reinstall after every code change. I then started Ray and ran a sample job using the attached script:
louis.py
I verified that span.service on the resulting spans (the ones that are sent to the Datadog backend by the Datadog agent running on my laptop) is now set to sinecure and not yakovs.test.job-19, so that one and the same job can now be run with multiple submission IDs without having to restart the cluster in between.
I also tested the alternate way of setting the job name through the submission ID.
I confirmed that this has the effect of setting the service names on the spans to frobulation_model.
We also have some ideas for how users should be able to propagate the job name via the job metadata and/or how we can infer it based on the job's entry point. I will possibly add these other mechanisms in a future PR.
Checklist
Reviewer Checklist