feat(nvidia-fabricmanager): support Blackwell baseboards (DGX/HGX B100/B200/B300) #717
base: main
Conversation
@frezbo I'm building rdma-core here, so perhaps #127 could be used instead? Edit: on second thought, it seemed too complex to have a decision tree determining whether an…
Force-pushed from 2c717e0 to 228b53b
Force-pushed from c46b952 to 6e89cc9
Force-pushed from c6a79a7 to 39fd9d3
The issue was due to the configuration not being found because of an incorrect path. The PR has been tested on a DGX-B200 system, and I expect all new baseboards to work on Talos now. Blackwell baseboard users will need to load the following kernel modules:

```yaml
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
      - name: ib_umad
```
Thanks, this looks like a really complex task you accomplished. Please see some of the questions I had; maybe those will help us clarify a few things.
```yaml
args:
  - --config
  - /usr/local/share/nvidia/nvswitch/fabricmanager.cfg
entrypoint: /usr/bin/nvidia-fabricmanager-wrapper
```
Should this change be done for all models (e.g. A100 only needs nvfm, not nvlsm)? Has this been regression-tested against a system not requiring nvlsm?
This could perhaps be done in the integration-aws-nvidia-nonfree CI workflow if you cannot test, so that's not a big deal I guess.
> Has this been regression-tested against a system not requiring nvlsm?
No, unfortunately I can't at this time and it might not even happen. So using the CI workflow sounds like the best option.
What was tested on Blackwell is altering the IB lookup function so that no NVSwitch configuration port is found. In this case NVLSM was skipped as expected and NVFM started as it used to. But being a Blackwell system, it then failed to initialize the baseboard.
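For illustration only, that fallback branching could look like the minimal Go sketch below. It is not the wrapper's actual code (which queries libibumad via CGO); the `findNVSwitchPortGUID` helper is hypothetical, and the `nv-fabricmanager` flag and paths are assumptions.

```go
package main

import (
	"log"
	"os/exec"
)

// findNVSwitchPortGUID stands in for the wrapper's InfiniBand lookup; it
// would return an NVSwitch LPF port GUID, or false when none is present.
func findNVSwitchPortGUID() (string, bool) {
	// VPD / SMDL inspection elided in this sketch.
	return "", false
}

func main() {
	const configPath = "/usr/local/share/nvidia/nvswitch/fabricmanager.cfg"

	guid, found := findNVSwitchPortGUID()
	if !found {
		// Pre-Blackwell path: start FabricManager alone, as before.
		log.Println("no NVSwitch LPF port found, skipping NVLSM")
		if err := exec.Command("/usr/bin/nv-fabricmanager", "-c", configPath).Run(); err != nil {
			log.Fatal(err)
		}
		return
	}

	// Blackwell path: both daemons are started with the shared port GUID.
	log.Printf("found NVSwitch LPF port GUID %s, starting NVLSM and FabricManager", guid)
	// Launching and supervising both processes is elided here.
}
```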
> No, unfortunately I can't at this time and it might not even happen. So using the CI workflow sounds like the best option.
Okay, that should be possible. Keeping this thread open so Noel or someone else keeps the need for a CI test in mind; I recall that test being occasionally disabled due to the pricey GPU, and often failing because no resources were available in the cloud.
Is this likely to merge for 1.11.0, or are you in a freeze window already? I guess it wouldn't hit a patch level later.
If we can sort out everything before siderolabs/talos#10907 beta.0, we can pull this in.
Force-pushed from 4173afb to 20474a0
@npdgm is this good for another round of review?
Yes! It's been in use with multi-GPU inference in production for 3 months, no problem. Most recently built on 1.10.4.
Cool, I'll get this worked on and run our tests to see if this works for existing setups.
```yaml
shell: /bin/bash
dependencies:
  - stage: base
  - stage: base
```
I think we don't need to do this hack anymore, since this stage doesn't seem to use anything from Wolfi anymore?
Fixed.

Sorry, I had it fixed in `production` but not `lts`. Both are the same now, except for vars.

So the base stage is back to `/`, and Wolfi is pulled to `/wolfi-base`. We still need Wolfi because of the `libgcc_s.so.1` file we copy to the extension rootfs, for NVLSM. No package is installed, so that file is locked down by `WOLFI_BASE_REF`. Anyway, GCC's runtime library is quite small and portable; we could pick one from any distro or version and it would work.
```yaml
- stage: base
  from: /
  to: /base-rootfs
- image: cgr.dev/chainguard/wolfi-base@{{ .WOLFI_BASE_REF }}
```
Agreed, this shouldn't need Wolfi IIUC, since no Glibc build happens here
Thanks. See my answer to frezbo above: Wolfi is moved to its own root but still needed, as a source for GCC's runtime library that's not present in Talos.
Let's create a meta `pkg` and copy the file to the rootfs in there, then use that as a dependency here, so the intent is clear and obvious.
It looks much better indeed. Let me know what you think; I wasn't sure how to name it.
I'm satisfied with how it builds so far. Once this round of review is cleared I'll schedule an upgrade for validation.
This looks good, could you validate from your side? Then we'll see about getting this merged and run through our NVIDIA tests.
so is this good to go?
Can't tell yet, sorry. The last maintenance window was wasted on the issue with containerd file permissions.
I've got a new slot from my client; we'll upgrade nodes on Tuesday at 11 GMT.
It failed again. I don't get it; it looks like I can't make working builds anymore.

Previously I never had to call the imager with `--system-extension-image ghcr.io/siderolabs/glibc:x`. Now if I don't, the service container fails to bind mount `/usr/local/glibc` because it doesn't exist.

But then, when explicitly imaging with the glibc extension, I still get errors I had never seen before about libraries, such as:

```
nv-fabricmanager: /usr/bin/nv-fabricmanager: error while loading shared libraries: libz.so.1: cannot open shared object file: No such file or directory
```
If glibc is added as a dependency it should work; maybe it's looking for `libz` from the alpine base.
Sorry about the noise, I figured it out. Silly problem on my side.
So it's good to go, I validated the current PR head:
- rebased on 1.10.6: OK, and now used in production
- rebased on release-1.11: OK too
Force-pushed from 700c4db to 778abf1
Force-pushed from c6dbe0f to d8696d3
Force-pushed from d8696d3 to 4c3649c
```yaml
NVIDIA_FABRIC_MANAGER_PRODUCTION_ARM64_SHA512: 6606cb088f055e4511c94cdd772bdf3f0ae118b8cb5f60533351b587e058d2fdb34078ff7aac6e96db2c2f6d993f69229f1d8ed655c8e8a29c4d2992d545694d
NVIDIA_FABRIC_MANAGER_PRODUCTION_AMD64_SHA256: 8d24cacde4554d471899ad426f46a349d5ca0a2e8acd45c2a76381c8f496491e
NVIDIA_FABRIC_MANAGER_PRODUCTION_AMD64_SHA512: f893f92e144c46d2f2c114399c11873ab843bfec480652878453a07ec3a8ed9af429937ca33f60039e120512a47a75ca032f0de7d37f8a8ebf385d90a0099007
NVIDIA_NVLSM_PRODUCTION_VERSION: 2025.03.1
```
How are the production and LTS versions determined?
Force-pushed from 4c3649c to 279ed65
Quoting: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

> On NVIDIA DGX-B200, HGX-B200, and HGX-B100 systems and later, the FabricManager package needs an additional NVLSM dependency for proper operation.
> NVLink Subnet manager (NVLSM) originated from the InfiniBand networking and contains additional logic to program NVSwitches and NVLinks.

Not running the NVLSM on Blackwell and newer fabrics with NVLink 5.0+ will result in FabricManager failing to start with the error `NV_WARN_NOTHING_TO_DO`. NVSwitches will remain uninitialized and applications will fail with the `CUDA_ERROR_SYSTEM_NOT_READY` or `cudaErrorSystemNotReady` error. The CUDA initialization process can only begin after the GPUs complete their registration process with the NVLink fabric.

A GPU's fabric registration status can be verified with the command `nvidia-smi -q -i 0 | grep -i -A 2 Fabric`. An `In Progress` state indicates that the GPU is being registered and FabricManager is likely not running or missing the NVLSM dependency. A `Completed` state is shown when the GPU is successfully registered with the NVLink fabric.

The FabricManager package includes the script `nvidia-fabricmanager-start.sh`, which is used to selectively start the FabricManager and NVLSM processes depending on the underlying platform.

A key aspect of determining whether an NVLink 5.0+ fabric is present is to look for a Limited Physical Function (LPF) port in InfiniBand devices. To differentiate LPFs, the Vital Product Data (VPD) information includes a vendor-specific field called `SMDL`, with a non-zero value defined as `SW_MNG`. The first such device is then selected, and its port GUID is extracted and passed to NVLSM and FabricManager.

So both services share a configuration key that results from a common initialization process. Additionally, they communicate with each other over a Unix socket.

This patch introduces the following changes:

* Adds NVLSM to the nvidia-fabricmanager extension.
* Introduces a new `nvidia-fabricmanager-wrapper` program to replicate the initialization process from `nvidia-fabricmanager-start.sh`:
  * Detects NVLink 5.0+ fabrics and extracts an NVSwitch LPF port GUID. This is done by calling libibumad directly with CGO instead of parsing the output of the `ibstat` command as in the upstream script.
  * Starts FabricManager, and NVLSM only when needed.
  * Keeps both process lifecycles synchronized and ensures the Talos container will restart if either process crashes.
* Refactors the nvidia-fabricmanager container to be self-contained, as this service does not share files with other nvidia-gpu extensions.

Signed-off-by: Thibault VINCENT <[email protected]>
Signed-off-by: Noel Georgi <[email protected]>
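To make the lifecycle-synchronization bullet above concrete, here is a minimal Go sketch of one way to do it, not the actual wrapper code from this PR: it starts both daemons and exits as soon as either one stops, so Talos restarts the whole extension container. The binary paths and flags are assumptions, and both processes are assumed to run in the foreground.

```go
package main

import (
	"log"
	"os/exec"
)

// start launches one daemon and reports its path on done once it exits.
func start(done chan<- string, path string, args ...string) error {
	cmd := exec.Command(path, args...)
	if err := cmd.Start(); err != nil {
		return err
	}
	go func() {
		_ = cmd.Wait()
		done <- path
	}()
	return nil
}

func main() {
	done := make(chan string, 2)

	// Paths and flags below are placeholders for this sketch.
	daemons := [][]string{
		{"/opt/nvidia/nvlsm/sbin/nvlsm"},
		{"/usr/bin/nv-fabricmanager", "-c", "/usr/local/share/nvidia/nvswitch/fabricmanager.cfg"},
	}

	for _, d := range daemons {
		if err := start(done, d[0], d[1:]...); err != nil {
			log.Fatalf("starting %s: %v", d[0], err)
		}
	}

	// Exit as soon as either process stops; the container is then restarted
	// and both daemons come back up together.
	log.Fatalf("%s exited, terminating wrapper", <-done)
}
```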