add the codes to check AMD Instinct GPU number #22367
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
Code Review
This pull request adds support for detecting AMD Instinct GPUs to ensure the environment has at least two GPUs. The implementation has been updated to check for AMD GPUs by looking for a ROCm installation and using `lspci`. My feedback focuses on making this detection more robust by using the standard `rocm-smi` utility instead of relying on installation paths and parsing `lspci` output.
if [ ! -d "/opt/rocm" ]; then
    num_gpus=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
else
    num_gpus=$(lspci | grep Instinct | wc -l)
fi
The current method for detecting AMD GPUs has a few potential issues that could make it unreliable:
- Dependency on `/opt/rocm`: It assumes ROCm is always installed in `/opt/rocm`. This might not be true for all installations or distributions.
- Fragile `lspci` parsing: Relying on `lspci | grep Instinct` is brittle. The output format of `lspci` can change, and the string "Instinct" might not be present for all AMD data center GPUs, or it could accidentally match other devices.
A more robust approach is to use `rocm-smi`, which is the AMD equivalent of `nvidia-smi`. We can check for the availability of `nvidia-smi` or `rocm-smi` and then use the appropriate command to count GPUs. This avoids hardcoded paths and fragile text parsing.
- if [ ! -d "/opt/rocm" ]; then
-     num_gpus=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
- else
-     num_gpus=$(lspci | grep Instinct | wc -l)
- fi
+ if command -v nvidia-smi &> /dev/null; then
+     num_gpus=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
+ elif command -v rocm-smi &> /dev/null; then
+     # rocm-smi --showid lists the GPU IDs, one per line.
+     num_gpus=$(rocm-smi --showid | wc -l)
+ else
+     echo "Error: Neither nvidia-smi nor rocm-smi found. Cannot determine GPU count."
+     exit 1
+ fi
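One caveat worth checking before relying on a raw line count: depending on the `rocm-smi` version, `--showid` output may include banner lines and more than one line per device, in which case `wc -l` would overcount. The sketch below is not part of this PR. It assumes per-device lines begin with `GPU[<index>]` (an assumption to verify against the installed `rocm-smi`), and `count_rocm_gpus` is a hypothetical helper name.

```shell
# Hypothetical helper (not from the PR): count distinct GPU indices in
# rocm-smi-style text read from stdin.
# Assumption: per-device lines start with "GPU[<index>]"; banner lines do not.
count_rocm_gpus() {
    # -o prints only the matched "GPU[n]" prefix; sort -u collapses repeats
    # when a device is reported on several lines; tr strips wc's padding.
    grep -oE '^GPU\[[0-9]+]' | sort -u | wc -l | tr -d ' '
}
```

It could then be used as `num_gpus=$(rocm-smi --showid | count_rocm_gpus)`, which keeps the parsing in one place if the output format turns out to differ.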
Signed-off-by: Zhang Jason <[email protected]>
LGTM, thanks
Signed-off-by: Zhang Jason <[email protected]> Signed-off-by: avtc <[email protected]>
Signed-off-by: Zhang Jason <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
Signed-off-by: Zhang Jason <[email protected]> Signed-off-by: Noam Gat <[email protected]>
Signed-off-by: Zhang Jason <[email protected]> Signed-off-by: Paul Pak <[email protected]>
Signed-off-by: Zhang Jason <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
Signed-off-by: Zhang Jason <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.
Purpose
The LMCache PD example needs at least 2 GPUs, but the existing code uses `nvidia-smi`, which doesn't work on AMD GPUs. This PR adds code to check the number of AMD Instinct GPUs.
Test Plan
Run the `check_num_gpus` shell script on both AMD and NVIDIA platforms.
Test Result
The `check_num_gpus` function returns the correct number of AMD/NVIDIA GPUs.
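For reviewers without access to both platforms, the threshold part of the check can be exercised separately from the hardware query. The following is a minimal sketch, not the code in this PR: `require_min_gpus` is a hypothetical helper that takes an already computed count, and the default minimum of 2 comes from the requirement stated under Purpose.

```shell
# Hypothetical helper (not from the PR): succeed only if the given GPU count
# is a number greater than or equal to the minimum (default 2).
require_min_gpus() {
    local num_gpus="$1" min="${2:-2}"
    # A non-numeric or empty count makes the test error out, which is
    # treated as a failure here as well.
    if ! [ "$num_gpus" -ge "$min" ] 2>/dev/null; then
        echo "Need at least $min GPUs, found: ${num_gpus:-none}" >&2
        return 1
    fi
}
```

For example, `require_min_gpus "$num_gpus" || exit 1` would stop the example script early when fewer than two GPUs were counted.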
(Optional) Documentation Update