ci: Add CUDA + arm64 release builds #21201
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
Code Review
This pull request adds a new Buildkite CI step to build a release wheel for arm64 with CUDA 12.9. The torch_cuda_arch_list build argument in the docker build command does not include CUDA compute capabilities 10.0 or 12.0, which should be added to ensure compatibility.
.buildkite/release-pipeline.yaml
Outdated
This arm64 build step will likely fail because the torch_cuda_arch_list build argument in the docker build command does not include CUDA compute capabilities 10.0 or 12.0. These architectures should be added to ensure compatibility with the target CUDA version and architecture.
"DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0+PTX' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
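One way to sanity-check an arch list like the one in the suggestion is to parse it and look for the capabilities in question. The following is a minimal sketch, assuming the space-separated format with an optional `+PTX` suffix used in the command above; `parse_arch_list` is a hypothetical helper for illustration, not part of vLLM or PyTorch.

```python
def parse_arch_list(arch_list: str) -> dict[str, bool]:
    """Map each compute capability to whether it carries a +PTX suffix.

    Assumes space-separated entries, as in the suggested docker command.
    """
    suffix = "+PTX"
    parsed = {}
    for entry in arch_list.split():
        has_ptx = entry.endswith(suffix)
        capability = entry[: -len(suffix)] if has_ptx else entry
        parsed[capability] = has_ptx
    return parsed


# The arch list from the suggested command above.
suggested = "7.0 7.5 8.0 8.9 9.0 10.0 12.0+PTX"
archs = parse_arch_list(suggested)

# Capabilities the review asked for; the list is empty when both are present.
missing = [cc for cc in ("10.0", "12.0") if cc not in archs]
print(missing)
```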
.buildkite/release-pipeline.yaml
Outdated
For the arches, I think we only need to include the compute capabilities that have SKUs with ARM CPUs.
Referencing https://developer.nvidia.com/cuda-gpus, I think that is mostly GH200 and GB200, which are 9.0 and 10.0 respectively. We could include 8.7 to cover the Jetson boards as well.
So I am proposing torch_cuda_arch_list='8.7 9.0 10.0+PTX'
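The proposal above can be sketched as a small mapping from ARM-based SKUs to compute capabilities, joined into the build argument. This is only an illustration: the SKU-to-capability pairs are taken from the comment itself and are not verified here, and `build_arch_list` and `ARM_SKU_CAPABILITIES` are hypothetical names, not anything in the vLLM build scripts.

```python
# SKU -> compute capability, as stated in the comment above (assumption:
# these pairs are copied from the discussion, not checked against NVIDIA docs).
ARM_SKU_CAPABILITIES = {
    "Jetson": "8.7",
    "GH200": "9.0",
    "GB200": "10.0",
}


def build_arch_list(capabilities):
    """Join capabilities in ascending order, adding +PTX to the highest one."""
    ordered = sorted(
        set(capabilities),
        # Sort numerically so that 10.0 comes after 9.0, not before it.
        key=lambda cc: tuple(int(part) for part in cc.split(".")),
    )
    if not ordered:
        raise ValueError("at least one compute capability is required")
    return " ".join(ordered[:-1] + [ordered[-1] + "+PTX"])


arch_list = build_arch_list(ARM_SKU_CAPABILITIES.values())
print(f"--build-arg torch_cuda_arch_list='{arch_list}'")
```

Sorting on the numeric parts rather than on the raw strings matters here, since a plain string sort would place 10.0 ahead of 8.7 and attach `+PTX` to the wrong entry.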
This sounds reasonable to me
can we get this reviewed & merged?
@seemethere could you fix the DCO issue? thanks
@seemethere Could you rebase this PR and fix the DCO issue? Thanks!
.buildkite/release-pipeline.yaml
Outdated
Is this for torch 2.7.0? Does this change for 2.7.1 or 2.8.0?
This is still the case for all PyTorch releases today afaik
Signed-off-by: Eli Uriegas <[email protected]>
575304e to 846aa6d
Rebased + DCO should be fixed now!
q: how do we actually run this?
buildkite, release pipeline, run with this branch and commit
Also explicitly sets VLLM_TARGET_DEVICE as an environment variable Signed-off-by: Eli Uriegas <[email protected]>
1895656 to cf453f3
Will this actually work on a forked PR?
seems to be working now! https://buildkite.com/vllm/release/builds/7267/steps/canvas?jid=0198aa78-11c6-4d9f-9f8c-b27dfe8e05df
🥳
Signed-off-by: Eli Uriegas <[email protected]> Signed-off-by: Yiwen Chen <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]> Signed-off-by: Duncan Moss <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
Purpose
Adds CUDA + arm64 builds
Test Plan
CI
Test Result
(Optional) Documentation Update