
Conversation

@Snektron (Collaborator) commented Aug 25, 2025

Description

This is an initial implementation of profiling using rocPROF for MI300 jobs. It works as follows: when a profiling run is selected, the application is simply executed under rocPROF with options set to gather a number of traces. These are then packaged with the RunResults and finally sent to the user (a rough sketch of this flow is included after the list below). This current version is the most straightforward way to implement this, but there are some missing features and limitations:

  • The tracing results ZIP needs to be less than 8 MB. While Discord should support 25 MB for non-boosted servers and 100 MB for boosted servers, this is apparently an API limitation; see "Request entity too large" (Rapptz/discord.py#1733). Test traces appear to be about 5 MB, but I don't know yet how large they get for longer programs and multi-GPU programs.
    • A GitHub link is now shared.
  • The HSA trace is not collected. It is too large for Perfetto to even open in a reasonable time, so for now I've left it out. Perhaps later.
    • I think it's best to leave this out.
  • I've not yet tested how well it works with multi-GPU programs. I'll check by rebasing on top of Multi-GPU #335.
  • I'd like to add code object dumps to the profile results, but those increase the size quite a bit too.
  • eval.py does not yet contain rocTX markers.
  • rocPROF prints a bunch of Perfetto messages that always appear in stderr. This looks like an issue on rocPROF's side; logging isn't configured properly. Not sure if we should solve this now...
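
For context, a minimal sketch of the overall flow described above. The flag names assume a rocprofv3-style CLI and the output directory name is an assumption; the actual invocation in this PR may differ.

    # Hedged sketch: run the submission under rocPROF with tracing enabled,
    # writing traces into a directory that is later zipped into the RunResults.
    import subprocess

    def run_with_rocprof(app_cmd: list[str], out_dir: str = "profile_data") -> None:
        cmd = [
            "rocprofv3",
            "--hip-trace",                   # trace HIP API calls
            "--kernel-trace",                # trace kernel dispatches
            "--output-format", "pftrace",    # emit a Perfetto trace
            "-d", out_dir,                   # output directory (assumed flag)
            "--",
            *app_cmd,                        # the user's application and its arguments
        ]
        subprocess.run(cmd, check=True)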

@Snektron Snektron force-pushed the amd-profiling branch 7 times, most recently from 0dfd175 to 58ce131 on August 28, 2025 22:05

github-actions bot commented Aug 28, 2025

Coverage report

This PR does not seem to contain any modification to coverable code.

@Snektron Snektron force-pushed the amd-profiling branch 9 times, most recently from ae1ff0a to 710e5d6 on August 31, 2025 16:31
@msaroufim msaroufim self-requested a review September 2, 2025 05:18
@msaroufim (Member) commented Sep 2, 2025

Sorry for the delay, I'll have time to review tomorrow afternoon PST

EDIT: PR looks reasonable to me. Probably would want a final stamp by @ngc92 before merge

But my feedback would be:

  1. Can we have a standalone PR where we update examples/?
  2. CI still failing
  3. Rebase on the multi-GPU branch; the PR will become shorter

On size limits: GPU MODE is max boosted and that'll likely continue in the future.

@msaroufim msaroufim requested review from ngc92 and S1ro1 September 2, 2025 05:18
@ngc92 (Collaborator) commented Sep 2, 2025

Quickly scrolled over the changes, didn't spot anything that looked problematic. I'd echo Mark's comment, though, that it'd be nice if the PR was a bit smaller :)
There's a bunch of changes here that aren't strictly about profiling (the example updates, but also, e.g., reporting CUDA vs ROCm); if we could get them in separate PRs, those would be trivial to approve and this one would be a bit more focused.

@Snektron (Collaborator, Author) commented Sep 2, 2025

Can we have a standalone PR where we update examples/

Yeah, sure, that seems like a good idea. Should the reference kernels also be updated after this?

CI still failing

Hmm, I'm not exactly sure what is causing this. I thought it was unrelated, but it doesn't seem to be a problem for other pull requests...

Rebase on the multi-GPU branch; the PR will become shorter

It's my impression that the multi-GPU support already in main is all that is needed. Am I missing something?

@msaroufim (Member) commented Sep 3, 2025

  1. So, re the CI failure: my guess is that the nvcc command construction is seeing utils.h.

It's more of a guess, but what I'd try in run_eval.py is making this change, since CUDA_FLAGS does get modified in place and a repeated run may be causing clunky behavior (see the small sketch at the end of this comment):

  if flags is None:
      # copy, so later in-place edits don't mutate the shared module-level CUDA_FLAGS
      flags = CUDA_FLAGS.copy()
  2. On the multi-GPU stuff: my bad, I thought you weren't rebased.

  3. I don't think we need to update the reference kernels with a dev suffix, but yeah, if any name is outright wrong I'm happy to accept a PR fixing it.
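
For illustration, a small sketch of the aliasing pitfall hinted at above. Names are hypothetical and simplified; the actual run_eval.py code differs.

    CUDA_FLAGS = ["-O3", "-std=c++17"]   # module-level default flags

    def build_nvcc_command(flags=None):
        if flags is None:
            flags = CUDA_FLAGS           # aliases the module-level list
        flags.append("-Iutils")          # in-place edit also mutates CUDA_FLAGS
        return ["nvcc", *flags]

    build_nvcc_command()
    build_nvcc_command()
    print(CUDA_FLAGS)                    # flags accumulate across repeated runs

With flags = CUDA_FLAGS.copy(), the mutation stays local to the call instead of leaking into subsequent runs.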

@Snektron (Collaborator, Author) commented Sep 9, 2025

I've split up the pull request as requested, see #354 and #355. This pull request is now based on #354; that one should be merged first because GitHub can't handle stacked pull requests.

This uses rocPROF to fetch some interesting data and put it in the profile_data directory, the download link of which is then returned to the user.

rocPROF generates one trace for every process. Simply combine them into a single trace for ease of use. Also remove the individual traces, as they are no longer useful afterwards.
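
One possible way to implement that merge, assuming the per-process outputs are Perfetto protobuf traces (which can be combined by simple concatenation). The file names and paths are assumptions; the actual merge logic in this PR may differ.

    # Hedged sketch: concatenate per-process Perfetto traces into one file,
    # then remove the individual traces since they are no longer needed.
    from pathlib import Path

    profile_dir = Path("profile_data")
    traces = sorted(profile_dir.glob("*.pftrace"))

    with open(profile_dir / "combined.pftrace", "wb") as out:
        for trace in traces:
            out.write(trace.read_bytes())
            trace.unlink()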
@Snektron Snektron requested review from msaroufim and removed request for msaroufim September 11, 2025 07:32
@Snektron Snektron merged commit 51af552 into main Sep 13, 2025
5 checks passed