rocshmem dependencies #349
Could you also share a toy user submission that uses rocshmem? I just want to get a sense of what things will look like end to end. |
Also @saienduri to sanity check |
Vibe coded this, but it's going to look similar to HIP kernels in Python |
Looks good to me. Starting a test docker build here to check status: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17545534459. |
Ooh! Looks like there is some issue with UCX. I'll debug it today! |
@saienduri I made some changes but not sure if it works, is there a way to test the workflow without approval? I don't have MI300X to test 😅 |
Thanks, trying a build here now: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17701378282. You can also try building the Docker image locally just to see if the build passes. |
Cool, the build passed and a sanity test passed here: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17702258708 |
@saienduri added one, lmk if it works! |
Hmm getting |
You want the example working with load_inline in PyTorch |
done but idk if it works 😬 |
@saienduri can we test the provided payload example on the server directly? If it's fine then we should be good to merge |
ok running the payload in github actions yielded the following (https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17790562194):
I think it will be the same error on the server itself as well. |
Pushed a commit to fix the import issue.
|
Ok, I'll test this on runpod and push a working version. Apologies for all the back and forth! |
@chivatam Hi, I don't have permission to push commits directly to your repo, so I corrected your payload; you can refer to that. Just use extra_ldflags instead
|
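For readers following along, here is a minimal sketch of what passing linker flags to PyTorch's `load_inline` through `extra_ldflags` might look like. This is not the payload from the thread: the `ROCSHMEM_PATH` environment variable, the `/opt/rocshmem` default prefix, the `include`/`lib` layout, the `-lrocshmem` library name, and the helper names are all assumptions for illustration, and a real rocshmem build will likely need additional compiler and linker flags.

```python
import os


def rocshmem_build_kwargs(rocshmem_root=None):
    """Build keyword arguments for torch.utils.cpp_extension.load_inline.

    The install prefix, directory layout, and library name are assumptions.
    """
    # Hypothetical env var and default prefix; point this at the real install.
    root = rocshmem_root or os.environ.get("ROCSHMEM_PATH", "/opt/rocshmem")
    include_dir = os.path.join(root, "include")
    lib_dir = os.path.join(root, "lib")
    return {
        "extra_include_paths": [include_dir],
        # Linker flags go in extra_ldflags (plural), as a list of strings.
        "extra_ldflags": ["-L" + lib_dir, "-lrocshmem"],
        "verbose": True,
    }


def build_module(cpp_src, hip_src, functions):
    """Compile the inline sources; requires a ROCm build of PyTorch."""
    # Imported lazily so the helper above can be inspected without torch.
    from torch.utils.cpp_extension import load_inline

    return load_inline(
        name="rocshmem_example",  # hypothetical module name
        cpp_sources=cpp_src,
        cuda_sources=hip_src,  # device code is passed via cuda_sources
        functions=functions,
        **rocshmem_build_kwargs(),
    )
```

Keeping the flag construction separate from the `load_inline` call makes it easy to print and check the flags on a machine without a GPU before triggering a run.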
@saienduri Hi Sai, could you please replace the current one with mine above and trigger the test again? Thanks |
@danielhua23 just gave you write access as well |
Latest log @danielhua23:
|
You can always trigger a run as you make changes like this (make sure to select the same branch and runner name). After it runs, you can download the artifacts. Also, if you have access to an MI3x server, you can use this Docker image. Just want to make sure I'm not slowing y'all down here :) |
Big thanks for your tutorials, Sai. I will give it a try! |
The new payload with the new Docker image currently works well on my local MI3x machines, but how do I trigger a job with a new image built from the new Dockerfile? I already pinged Sai; if you have a solution, please chime in too. Thanks! |
@danielhua23 For the Dockerfile, you can publish a new image here: https://github.com/gpu-mode/discord-cluster-manager/actions/workflows/publish_amd_docker.yml. Just link it to your branch; my understanding is that @saienduri's infra should automatically pick it up. |
Description
Added rocshmem dependencies to the Dockerfile.
@msaroufim