Conversation


@npdgm npdgm commented May 31, 2025

Quoting: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

On NVIDIA DGX-B200, HGX-B200, and HGX-B100 systems and later, the FabricManager package needs an additional NVLSM dependency for proper operation.
NVLink Subnet manager (NVLSM) originated from the InfiniBand networking and contains additional logic to program NVSwitches and NVLinks.

Not running the NVLSM on Blackwell and newer fabrics with NVLink 5.0+ will result in FabricManager failing to start with error NV_WARN_NOTHING_TO_DO. NVSwitches will remain uninitialized and applications will fail with the CUDA_ERROR_SYSTEM_NOT_READY or cudaErrorSystemNotReady error. The CUDA initialization process can only begin after the GPUs complete their registration process with the NVLink fabric.
A GPU fabric registration status can be verified with the command: nvidia-smi -q -i 0 | grep -i -A 2 Fabric. An In Progress state indicates that the GPU is being registered and FabricManager is likely not running or missing the NVLSM dependency. A Completed state is shown when the GPU is successfully registered with the NVLink fabric.
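For reference, a successfully registered GPU reports something like the following for that query (illustrative output only; the exact field layout may vary by driver version):

    Fabric
        State                             : Completed
        Status                            : Success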

The FabricManager package includes the script nvidia-fabricmanager-start.sh, which is used to selectively start FabricManager and NVLSM processes depending on the underlying platform.
A key aspect of determining whether an NVLink 5.0+ fabric is present is to look for a Limited Physical Function (LPF) port in InfiniBand devices. To differentiate LPFs, the Vital Product Data (VPD) information includes a vendor-specific field called SMDL, with a non-zero value defined as SW_MNG. The first such device is then selected, and its port GUID is extracted and passed to NVLSM and FabricManager.
So both services share a configuration key that results from a common initialization process. Additionally, they communicate with each other over a Unix socket.

This patch introduces the following changes:

  • Adds NVLSM to the nvidia-fabricmanager extension.
  • Introduces a new nvidia-fabricmanager-wrapper program to replicate the initialization process from nvidia-fabricmanager-start.sh:
    • Detects NVLink 5.0+ fabrics and extracts an NVSwitch LPF port GUID. This is done by calling libibumad directly with CGO instead of parsing the output of the ibstat command as in the upstream script (a libibumad sketch follows this list).
    • Starts FabricManager, and NVLSM only when needed.
    • Keeps both process lifecycles synchronized and ensures the Talos container will restart if either process crashes (a supervision sketch also follows below).
  • Refactors the nvidia-fabricmanager container to be self-contained, as this service does not share files with other nvidia-gpu extensions.
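For illustration, here is a minimal CGO sketch of the device-enumeration half of that lookup (not the wrapper's actual code): it lists local InfiniBand devices through libibumad and prints each port GUID. The real wrapper additionally inspects the device VPD for the SMDL/SW_MNG marker before selecting a port, and error handling is reduced here for brevity.

// guid_sketch.go - illustrative only, not the wrapper's implementation.
// Enumerates local InfiniBand devices via libibumad and prints each port GUID.
package main

/*
#cgo LDFLAGS: -libumad
#include <infiniband/umad.h>
*/
import "C"

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

func main() {
	if C.umad_init() != 0 {
		panic("umad_init failed (is the ib_umad module loaded?)")
	}
	defer C.umad_done()

	// List the names of all local channel adapters (HCAs / NVSwitch LPFs).
	var names [C.UMAD_MAX_DEVICES][C.UMAD_CA_NAME_LEN]C.char
	n := C.umad_get_cas_names(&names[0], C.UMAD_MAX_DEVICES)

	for i := C.int(0); i < n; i++ {
		var ca C.umad_ca_t
		if C.umad_get_ca(&names[i][0], &ca) != 0 {
			continue
		}

		// Ports are indexed 1..numports; entry 0 is unused.
		for p := 1; p <= int(ca.numports); p++ {
			port := ca.ports[p]
			if port == nil {
				continue
			}
			// port_guid is stored in network byte order; decode it explicitly.
			guid := binary.BigEndian.Uint64((*[8]byte)(unsafe.Pointer(&port.port_guid))[:])
			fmt.Printf("%s port %d GUID 0x%016x\n", C.GoString(&names[i][0]), p, guid)
		}

		C.umad_release_ca(&ca)
	}
}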

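And a minimal sketch of the lifecycle synchronization (binary paths, flags, and the GUID below are placeholders; the real wrapper derives its arguments during initialization): start both daemons, wait for either one to exit, then stop the other and exit non-zero so Talos restarts the extension container.

// supervise_sketch.go - illustrative only; binary paths and flags are assumptions.
// If either process exits, the other is terminated and the program exits
// non-zero so the extension container gets restarted.
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func start(path string, args ...string) *exec.Cmd {
	cmd := exec.Command(path, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatalf("starting %s: %v", path, err)
	}
	return cmd
}

func main() {
	// Placeholder invocations; the real wrapper only starts NVLSM when an
	// NVSwitch LPF port was detected.
	fm := start("/usr/bin/nv-fabricmanager", "-c", "/usr/local/share/nvidia/nvswitch/fabricmanager.cfg")
	sm := start("/opt/nvidia/nvlsm/sbin/nvlsm", "-g", "0x0000000000000000")

	done := make(chan string, 2)
	go func() { fm.Wait(); done <- "fabricmanager" }()
	go func() { sm.Wait(); done <- "nvlsm" }()

	// Whichever process exits first takes the whole container down.
	who := <-done
	log.Printf("%s exited, stopping peer", who)
	fm.Process.Signal(syscall.SIGTERM)
	sm.Process.Signal(syscall.SIGTERM)
	os.Exit(1)
}
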
@npdgm npdgm marked this pull request as draft May 31, 2025 00:37
@talos-bot talos-bot moved this to In Review in Planning May 31, 2025
@npdgm

This comment was marked as outdated.

@npdgm
Author

npdgm commented May 31, 2025

@frezbo I'm building rdma-core here, so perhaps #127 could be used instead?

Edit: on second thought, it seemed too complex to have a decision tree determining whether an rdma-core extension dependency is necessary for nvidia-fabricmanager to run, depending on the platform. This was especially true considering it's only for pulling a 50 KB library file. So instead, I switched to building only libibumad.a and linking the archive.
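
For context, linking the archive from CGO looks roughly like this (the install path is just an example):

// static_link_sketch.go - illustrative only; the archive location is an assumption.
package main

/*
#cgo LDFLAGS: /usr/local/lib/libibumad.a
#include <infiniband/umad.h>
*/
import "C"

import "fmt"

func main() {
	// With the static archive linked in, no shared libibumad.so (and thus no
	// rdma-core extension) is needed at runtime.
	C.umad_init()
	fmt.Println("libibumad linked statically")
	C.umad_done()
}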

@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch 2 times, most recently from 2c717e0 to 228b53b Compare May 31, 2025 20:00
@frezbo
Member

frezbo commented Jun 2, 2025

@frezbo I'm building rdma-core here, so perhaps #127 could be used instead?

I guess we can split this PR out, rdma-core in a different one, feel free to work on #127

@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch from 228b53b to 2844254 Compare June 2, 2025 16:45
@npdgm npdgm changed the title from "feat(nvidia-fabricmanager): add support for Blackwell HGX B200/B100 baseboards and DGX systems" to "feat(nvidia-fabricmanager): support Blackwell baseboards (DGX/HGX B100/B200/B300)" Jun 2, 2025
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch 5 times, most recently from c46b952 to 6e89cc9 Compare June 3, 2025 19:10
@npdgm

This comment was marked as outdated.

@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch 2 times, most recently from c6a79a7 to 39fd9d3 Compare June 3, 2025 19:30
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch from 39fd9d3 to f78bfa8 Compare June 12, 2025 21:03
@npdgm
Author

npdgm commented Jun 12, 2025

  • not initializing (or loading?) the gRPC plugin
  • no Unix socket file created for FM

The issue was due to the configuration not being found because of an incorrect path.

The PR has been tested on a DGX-B200 system, and I expect all new baseboards to work on Talos now.
It depends on siderolabs/pkgs#1245 and siderolabs/talos#11211 to be effective. However, the lack of the IB kernel module won't pose any problems for starting nvidia-fabricmanager on older GPU architectures.

Blackwell baseboard users will need to load the ib_umad kernel module so that the extension can detect the NVSwitch and start NVLSM:

machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
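      # required so the extension can detect the NVSwitch LPF and start NVLSM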
      - name: ib_umad

@npdgm npdgm marked this pull request as ready for review June 12, 2025 21:22
Member

@dsseng dsseng left a comment


Thanks, this looks like a really complex task you accomplished. Please see some of the questions I had; maybe those will help us clarify a few points.

args:
- --config
- /usr/local/share/nvidia/nvswitch/fabricmanager.cfg
entrypoint: /usr/bin/nvidia-fabricmanager-wrapper
Member


Should this change be done for all models (e.g. A100 only needs nvfm, not nvlsm)? Has this been regression-tested against a system not requiring nvlsm?

This could perhaps be done in the integration-aws-nvidia-nonfree CI workflow if you cannot test, so that's not a big deal I guess

Author

@npdgm npdgm Jun 18, 2025


Has this been regression-tested against a system not requiring nvlsm?

No, unfortunately I can't at this time and it might not even happen. So using the CI workflow sounds like the best option.

What I did test on Blackwell was altering the IB lookup function so that no NVSwitch configuration port is found. In that case NVLSM was skipped as expected and NVFM started as it used to. But being a Blackwell system, it then failed to initialize the baseboard.

Member


No, unfortunately I can't at this time and it might not even happen. So using the CI workflow sounds like the best option.

Okay, that should be possible. Keeping this thread open so Noel or someone else keeps the need for a CI test in mind, as I recall that test being occasionally disabled due to the pricey GPU and frequent failures from no resources being available in the cloud

@dsseng dsseng requested a review from frezbo June 13, 2025 20:19
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch from f78bfa8 to 3826055 Compare June 18, 2025 17:25
@npdgm
Author

npdgm commented Jul 1, 2025

Is this likely to merge for 1.11.0 or are you in a freeze window already? I guess it wouldn't hit a patchlevel later.
No pressure, I just need a little insight to adjust ops planning with clients. Thanks!

@frezbo
Member

frezbo commented Jul 1, 2025

Is this likely to merge for 1.11.0 or are you in a freeze window already? I guess it wouldn't hit a patchlevel later. No pressure, I just need a little insight to adjust ops planning with clients. Thanks!

if we can sort out everything before siderolabs/talos#10907 beta.0, we can pull this in

@smira smira moved this from In Review to On Hold in Planning Jul 7, 2025
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch from 3826055 to 4173afb Compare July 22, 2025 14:34
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch from 4173afb to 20474a0 Compare August 5, 2025 15:39
@frezbo
Member

frezbo commented Aug 12, 2025

@npdgm is this good for another round of review?

@npdgm
Author

npdgm commented Aug 12, 2025

@npdgm is this good for another round of review?

Yes! It's been in use with multi-GPU inference in production for 3 months, no problem. Most recently built on 1.10.4.
I still can't validate this on pre-Blackwell baseboards myself, sorry. Also, any new changes would take some time to test on B200, as I would need to schedule a maintenance window. Resources are scarce...

@frezbo
Member

frezbo commented Aug 12, 2025

@npdgm is this good for another round of review?

Yes! It's been in use with multi-GPU inference in production for 3 months, no problem. Most recently built on 1.10.4. I still can't validate this on pre-Blackwell baseboards myself, sorry. Also, any new changes would take some time to test on B200, as I would need to schedule a maintenance window. Resources are scarce...

Cool, I'll get this worked on and run our tests and see if this works for existing stuff

shell: /bin/bash
dependencies:
- stage: base
- stage: base
Member


i think we don't need to do this hack anymore? since this stage doesn't seem to use anything from wolfi anymore?

Author

@npdgm npdgm Aug 13, 2025


Fixed.

Sorry, I had it fixed in production but not LTS. Both are the same now, except for vars.
So the base stage is back to /, and Wolfi is pulled to /wolfi-base. We still need Wolfi because of the libgcc_s.so.1 file we copy to the extension rootfs, for NVLSM. No package is installed, so that file is pinned by WOLFI_BASE_REF. Anyway, GCC's runtime library is quite small and portable; we could pick one from any distro or version and it would work.

- stage: base
from: /
to: /base-rootfs
- image: cgr.dev/chainguard/wolfi-base@{{ .WOLFI_BASE_REF }}
Member


Agreed, this shouldn't need Wolfi IIUC, since no Glibc build happens here

Author


Thanks. See my answer to frezbo above: Wolfi is moved to its own root but is still needed, as a source for GCC's runtime library, which is not present in Talos.

Member


Let's create a meta pkg and copy the file to the rootfs in there, then use that as a dependency here, so the intent is clear and obvious

Author


It looks much better indeed. Let me know what you think; I wasn't sure how to name it.
I'm satisfied with how it builds so far. Once this round of review is cleared, I'll schedule an upgrade for validation.

Member


this looks good, could you validate from your side? Then we'll see about getting this merged and run it through our NVIDIA tests

Member


so is this good to go?

Author


Can't tell yet, sorry. The last maintenance window was wasted on the containerd file-permission issue.
I've got a new slot from my client; we'll upgrade nodes on Tuesday at 11:00 GMT.

Author


It failed again. I don't get it; it looks like I can't make working builds anymore.

Previously, I never had to call the imager with --system-extension-image ghcr.io/siderolabs/glibc:x. Now if I don't, the service container fails to bind mount /usr/local/glibc because it doesn't exist.
But even when explicitly imaging with the glibc extension, I still get errors I had never seen before about libraries, such as:

nv-fabricmanager: /usr/bin/nv-fabricmanager: error while loading shared libraries: libz.so.1: cannot open shared object file: No such file or directory

Member


if glibc is added as a dependency it should work, maybe it's looking for libz from the alpine base

Author


Sorry about the noise, I figured it out. Silly problem on my side.

So it's good to go; I validated the current PR head:

  • rebased on 1.10.6: OK, and now used in production
  • rebased on release-1.11: OK too

@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch 4 times, most recently from 700c4db to 778abf1 Compare August 13, 2025 12:04
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch 2 times, most recently from c6dbe0f to d8696d3 Compare August 22, 2025 15:11
NVIDIA_FABRIC_MANAGER_PRODUCTION_ARM64_SHA512: 6606cb088f055e4511c94cdd772bdf3f0ae118b8cb5f60533351b587e058d2fdb34078ff7aac6e96db2c2f6d993f69229f1d8ed655c8e8a29c4d2992d545694d
NVIDIA_FABRIC_MANAGER_PRODUCTION_AMD64_SHA256: 8d24cacde4554d471899ad426f46a349d5ca0a2e8acd45c2a76381c8f496491e
NVIDIA_FABRIC_MANAGER_PRODUCTION_AMD64_SHA512: f893f92e144c46d2f2c114399c11873ab843bfec480652878453a07ec3a8ed9af429937ca33f60039e120512a47a75ca032f0de7d37f8a8ebf385d90a0099007
NVIDIA_NVLSM_PRODUCTION_VERSION: 2025.03.1
Member


how are the production and LTS versions determined?

…0/B200/B300)

Quoting: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

> On NVIDIA DGX-B200, HGX-B200, and HGX-B100 systems and later, the FabricManager package needs an additional NVLSM dependency for proper operation.
> NVLink Subnet manager (NVLSM) originated from the InfiniBand networking and contains additional logic to program NVSwitches and NVLinks.

Not running the NVLSM on Blackwell and newer fabrics with NVLink 5.0+ will result in FabricManager failing to start with error `NV_WARN_NOTHING_TO_DO`. NVSwitches will remain uninitialized and applications will fail with the `CUDA_ERROR_SYSTEM_NOT_READY` or `cudaErrorSystemNotReady` error.  The CUDA initialization process can only begin after the GPUs complete their registration process with the NVLink fabric.
A GPU fabric registration status can be verified with the command: `nvidia-smi -q -i 0 | grep -i -A 2 Fabric`. An `In Progress` state indicates that the GPU is being registered and FabricManager is likely not running or missing the NVLSM dependency. A `Completed` state is shown when the GPU is successfully registered with the NVLink fabric.

The FabricManager package includes the script `nvidia-fabricmanager-start.sh`, which is used to selectively start FabricManager and NVLSM processes depending on the underlying platform.
A key aspect of determining whether an NVLink 5.0+ fabric is present is to look for a Limited Physical Function (LPF) port in InfiniBand devices. To differentiate LPFs, the Vital Product Data (VPD) information includes a vendor-specific field called `SMDL`, with a non-zero value defined as `SW_MNG`. The first such device is then selected, and its port GUID is extracted and passed to NVLSM and FabricManager.
So both services share a configuration key that results from a common initialization process. Additionally, they communicate with each other over a Unix socket.

This patch introduces the following changes:
 * Adds NVLSM to the nvidia-fabricmanager extension.
 * Introduces a new `nvidia-fabricmanager-wrapper` program to replicate the initialization process from `nvidia-fabricmanager-start.sh`:
   * Detects NVLink 5.0+ fabrics and extracts an NVSwitch LPF port GUID. This is done by calling libibumad directly with CGO instead of parsing the output of the `ibstat` command as in the upstream script.
   * Starts FabricManager, and NVLSM only when needed.
   * Keeps both process lifecycles synchronized and ensures the Talos container will restart if either process crashes.
 * Refactors the nvidia-fabricmanager container to be self-contained, as this service does not share files with other nvidia-gpu extensions.

Signed-off-by: Thibault VINCENT <[email protected]>
Signed-off-by: Noel Georgi <[email protected]>
@github-project-automation github-project-automation bot moved this from On Hold to Approved in Planning Sep 9, 2025