Conversation


@npdgm npdgm commented May 31, 2025

Quoting: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

On NVIDIA DGX-B200, HGX-B200, and HGX-B100 systems and later, the FabricManager package needs an additional NVLSM dependency for proper operation.
NVLink Subnet manager (NVLSM) originated from the InfiniBand networking and contains additional logic to program NVSwitches and NVLinks.

Not running the NVLSM on Blackwell and newer fabrics with NVLink 5.0+ will result in FabricManager failing to start with error NV_WARN_NOTHING_TO_DO. NVSwitches will remain uninitialized and applications will fail with the CUDA_ERROR_SYSTEM_NOT_READY or cudaErrorSystemNotReady error. The CUDA initialization process can only begin after the GPUs complete their registration process with the NVLink fabric.
A GPU fabric registration status can be verified with the command: nvidia-smi -q -i 0 | grep -i -A 2 Fabric. An In Progress state indicates that the GPU is being registered and FabricManager is likely not running or missing the NVLSM dependency. A Completed state is shown when the GPU is successfully registered with the NVLink fabric.
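For reference, a successfully registered GPU reports something like the following for that query (illustrative output only; the exact field layout may vary by driver version):

    Fabric
        State                             : Completed
        Status                            : Success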

The FabricManager package includes the script nvidia-fabricmanager-start.sh, which is used to selectively start FabricManager and NVLSM processes depending on the underlying platform.
A key aspect of determining whether an NVLink 5.0+ fabric is present is to look for a Limited Physical Function (LPF) port in InfiniBand devices. To differentiate LPFs, the Vital Product Data (VPD) information includes a vendor-specific field called SMDL, with a non-zero value defined as SW_MNG. The first such device is then selected, and its port GUID is extracted and passed to NVLSM and FabricManager.
So both services share a configuration key that results from a common initialization process. Additionally, they communicate with each other over a Unix socket.

This patch introduces the following changes:

  • Adds NVLSM to the nvidia-fabricmanager extension.
  • Introduces a new nvidia-fabricmanager-wrapper program to replicate the initialization process from nvidia-fabricmanager-start.sh:
    • Detects NVLink 5.0+ fabrics and extracts an NVSwitch LPF port GUID. This is done by calling libibumad directly with CGO instead of parsing the output of the ibstat command as in the upstream script (a libibumad sketch follows this list).
    • Starts FabricManager, and NVLSM only when needed.
    • Keeps both process lifecycles synchronized and ensures the Talos container will restart if either process crashes (a supervision sketch also follows below).
  • Refactors the nvidia-fabricmanager container to be self-contained, as this service does not share files with other nvidia-gpu extensions.
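For illustration, here is a minimal CGO sketch of the device-enumeration half of that lookup (not the wrapper's actual code): it lists local InfiniBand devices through libibumad and prints each port GUID. The real wrapper additionally inspects the device VPD for the SMDL/SW_MNG marker before selecting a port, and error handling is reduced here for brevity.

// guid_sketch.go - illustrative only, not the wrapper's implementation.
// Enumerates local InfiniBand devices via libibumad and prints each port GUID.
package main

/*
#cgo LDFLAGS: -libumad
#include <infiniband/umad.h>
*/
import "C"

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

func main() {
	if C.umad_init() != 0 {
		panic("umad_init failed (is the ib_umad module loaded?)")
	}
	defer C.umad_done()

	// List the names of all local channel adapters (HCAs / NVSwitch LPFs).
	var names [C.UMAD_MAX_DEVICES][C.UMAD_CA_NAME_LEN]C.char
	n := C.umad_get_cas_names(&names[0], C.UMAD_MAX_DEVICES)

	for i := C.int(0); i < n; i++ {
		var ca C.umad_ca_t
		if C.umad_get_ca(&names[i][0], &ca) != 0 {
			continue
		}

		// Ports are indexed 1..numports; entry 0 is unused.
		for p := 1; p <= int(ca.numports); p++ {
			port := ca.ports[p]
			if port == nil {
				continue
			}
			// port_guid is stored in network byte order; decode it explicitly.
			guid := binary.BigEndian.Uint64((*[8]byte)(unsafe.Pointer(&port.port_guid))[:])
			fmt.Printf("%s port %d GUID 0x%016x\n", C.GoString(&names[i][0]), p, guid)
		}

		C.umad_release_ca(&ca)
	}
}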

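And a minimal sketch of the lifecycle synchronization (binary paths, flags, and the GUID below are placeholders; the real wrapper derives its arguments during initialization): start both daemons, wait for either one to exit, then stop the other and exit non-zero so Talos restarts the extension container.

// supervise_sketch.go - illustrative only; binary paths and flags are assumptions.
// If either process exits, the other is terminated and the program exits
// non-zero so the extension container gets restarted.
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func start(path string, args ...string) *exec.Cmd {
	cmd := exec.Command(path, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatalf("starting %s: %v", path, err)
	}
	return cmd
}

func main() {
	// Placeholder invocations; the real wrapper only starts NVLSM when an
	// NVSwitch LPF port was detected.
	fm := start("/usr/bin/nv-fabricmanager", "-c", "/usr/local/share/nvidia/nvswitch/fabricmanager.cfg")
	sm := start("/opt/nvidia/nvlsm/sbin/nvlsm", "-g", "0x0000000000000000")

	done := make(chan string, 2)
	go func() { fm.Wait(); done <- "fabricmanager" }()
	go func() { sm.Wait(); done <- "nvlsm" }()

	// Whichever process exits first takes the whole container down.
	who := <-done
	log.Printf("%s exited, stopping peer", who)
	fm.Process.Signal(syscall.SIGTERM)
	sm.Process.Signal(syscall.SIGTERM)
	os.Exit(1)
}
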
@npdgm npdgm marked this pull request as draft May 31, 2025 00:37
@talos-bot talos-bot moved this to In Review in Planning May 31, 2025
@npdgm

This comment was marked as outdated.

@npdgm
Author

npdgm commented May 31, 2025

@frezbo I'm building rdma-core here, so perhaps #127 could be used instead?

Edit: on second thought, it seemed too complex to have a decision tree determining whether an rdma-core extension dependency is necessary for nvidia-fabricmanager to run, depending on the platform. This was especially true considering it's only for pulling a 50 KB library file. So instead, I switched to building only libibumad.a and linking the archive.
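
For context, linking the archive from CGO looks roughly like this (the install path is just an example):

// static_link_sketch.go - illustrative only; the archive location is an assumption.
package main

/*
#cgo LDFLAGS: /usr/local/lib/libibumad.a
#include <infiniband/umad.h>
*/
import "C"

import "fmt"

func main() {
	// With the static archive linked in, no shared libibumad.so (and thus no
	// rdma-core extension) is needed at runtime.
	C.umad_init()
	fmt.Println("libibumad linked statically")
	C.umad_done()
}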

@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch 2 times, most recently from 2c717e0 to 228b53b Compare May 31, 2025 20:00
@frezbo
Member

frezbo commented Jun 2, 2025

@frezbo I'm building rdma-core here, so perhaps #127 could be used instead?

I guess we can split this PR out, rdma-core in a different one, feel free to work on #127

@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch from 228b53b to 2844254 Compare June 2, 2025 16:45
@npdgm npdgm changed the title from "feat(nvidia-fabricmanager): add support for Blackwell HGX B200/B100 baseboards and DGX systems" to "feat(nvidia-fabricmanager): support Blackwell baseboards (DGX/HGX B100/B200/B300)" Jun 2, 2025
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch 5 times, most recently from c46b952 to 6e89cc9 Compare June 3, 2025 19:10
@npdgm

This comment was marked as outdated.

@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch 2 times, most recently from c6a79a7 to 39fd9d3 Compare June 3, 2025 19:30
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch from 39fd9d3 to f78bfa8 Compare June 12, 2025 21:03
@npdgm
Author

npdgm commented Jun 12, 2025

  • not initializing (or loading?) the gRPC plugin
  • no Unix socket file created for FM

The issue was due to the configuration not being found because of an incorrect path.

The PR has been tested on a DGX-B200 system, and I expect all new baseboards to work on Talos now.
It depends on siderolabs/pkgs#1245 and siderolabs/talos#11211 to be effective. However, the lack of the IB kernel module won't pose any problems for starting nvidia-fabricmanager on older GPU architectures.

Blackwell baseboard users will need to load the ib_umad kernel module so that the extension can detect the NVSwitch and start NVLSM:

machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
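      # required so the extension can detect the NVSwitch LPF and start NVLSM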
      - name: ib_umad

@npdgm npdgm marked this pull request as ready for review June 12, 2025 21:22
Member

@dsseng dsseng left a comment


Thanks, this looks like a really complex task you accomplished. Please see some of the questions I had; maybe those will help us clarify a few points.

args:
- --config
- /usr/local/share/nvidia/nvswitch/fabricmanager.cfg
entrypoint: /usr/bin/nvidia-fabricmanager-wrapper
Member


Should this change be done for all models (e.g. A100 only needs nvfm, not nvlsm)? Has this been regression-tested against a system not requiring nvlsm?

This could perhaps be done in the integration-aws-nvidia-nonfree CI workflow if you cannot test, so that's not a big deal I guess

Author

@npdgm npdgm Jun 18, 2025


Has this been regression-tested against a system not requiring nvlsm?

No, unfortunately I can't at this time and it might not even happen. So using the CI workflow sounds like the best option.

What I did test on Blackwell was altering the IB lookup function so that no NVSwitch configuration port is found. In that case NVLSM was skipped as expected and NVFM started as it used to. But being a Blackwell system, it then failed to initialize the baseboard.

Member


No, unfortunately I can't at this time and it might not even happen. So using the CI workflow sounds like the best option.

Okay, that should be possible. Keeping this thread open so Noel or someone else keeps the need for a CI test in mind, as I recall that test being occasionally disabled due to the pricey GPU and frequent failures from no resources being available in the cloud

@dsseng dsseng requested a review from frezbo June 13, 2025 20:19
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch from f78bfa8 to 3826055 Compare June 18, 2025 17:25
@npdgm
Author

npdgm commented Jul 1, 2025

Is this likely to merge for 1.11.0 or are you in a freeze window already? I guess it wouldn't hit a patchlevel later.
No pressure, I just need a little insight to adjust ops planning with clients. Thanks!

@frezbo
Member

frezbo commented Jul 1, 2025

Is this likely to merge for 1.11.0 or are you in a freeze window already? I guess it wouldn't hit a patchlevel later. No pressure, I just need a little insight to adjust ops planning with clients. Thanks!

if we can sort out everything before siderolabs/talos#10907 beta.0, we can pull this in

@smira smira moved this from In Review to On Hold in Planning Jul 7, 2025
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch from 3826055 to 4173afb Compare July 22, 2025 14:34
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch from 4173afb to 20474a0 Compare August 5, 2025 15:39
@frezbo
Member

frezbo commented Aug 12, 2025

@npdgm is this good for another round of review?

@npdgm
Author

npdgm commented Aug 12, 2025

@npdgm is this good for another round of review?

Yes! It's been in use with multi-GPU inference in production for 3 months, no problem. Most recently built on 1.10.4.
I still can't validate this on pre-Blackwell baseboards myself, sorry. Also, any new changes would take some time to test on B200, as I would need to schedule a maintenance window. Resources are scarce...

@frezbo
Member

frezbo commented Aug 12, 2025

@npdgm is this good for another round of review?

Yes! It's been in use with multi-GPU inference in production for 3 months, no problem. Most recently built on 1.10.4. I still can't validate this on pre-Blackwell baseboards myself, sorry. Also, any new changes would take some time to test on B200, as I would need to schedule a maintenance window. Resources are scarce...

Cool, I'll get this worked on and run our tests and see if this works for existing stuff

shell: /bin/bash
dependencies:
- stage: base
- stage: base
Member


i think we don't need to do this hack anymore? since this stage doesn't seem to use anything from wolfi anymore?

Author

@npdgm npdgm Aug 13, 2025


Fixed.

Sorry, I had it fixed in production but not LTS. Both are the same now, except for vars.
So the base stage is back to /, and Wolfi is pulled to /wolfi-base. We still need Wolfi because of the libgcc_s.so.1 file we copy to the extension rootfs, for NVLSM. No package is installed, so that file is pinned by WOLFI_BASE_REF. Anyway, GCC's runtime library is quite small and portable; we could pick one from any distro or version and it would work.

- stage: base
from: /
to: /base-rootfs
- image: cgr.dev/chainguard/wolfi-base@{{ .WOLFI_BASE_REF }}
Member


Agreed, this shouldn't need Wolfi IIUC, since no Glibc build happens here

Author


Thanks. See my answer to frezbo above: Wolfi is moved to its own root but is still needed, as a source for GCC's runtime library, which is not present in Talos.

Member


Let's create a meta pkg and copy the file to the rootfs in there, then use that as a dependency here, so the intent is clear and obvious

Author


It looks much better indeed. Let me know what you think; I wasn't sure how to name it.
I'm satisfied with how it builds so far. Once this round of review is cleared, I'll schedule an upgrade for validation.

Member


this looks good, could you validate from your side? Then we'll see about getting this merged and run it through our NVIDIA tests

Member


so is this good to go?

Author


Can't tell yet, sorry. The last maintenance window was wasted on the containerd file-permission issue.
I've got a new slot from my client; we'll upgrade nodes on Tuesday at 11:00 GMT.

Author


It failed again. I don't get it; it looks like I can't make working builds anymore.

Previously, I never had to call the imager with --system-extension-image ghcr.io/siderolabs/glibc:x. Now if I don't, the service container fails to bind mount /usr/local/glibc because it doesn't exist.
But even when explicitly imaging with the glibc extension, I still get errors I had never seen before about libraries, such as:

nv-fabricmanager: /usr/bin/nv-fabricmanager: error while loading shared libraries: libz.so.1: cannot open shared object file: No such file or directory

Member


if glibc is added as a dependency it should work, maybe it's looking for libz from the alpine base

Author


Sorry about the noise, I figured it out. Silly problem on my side.

So it's good to go; I validated the current PR head:

  • rebased on 1.10.6: OK, and now used in production
  • rebased on release-1.11: OK too

@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch 4 times, most recently from 700c4db to 778abf1 Compare August 13, 2025 12:04
@npdgm npdgm force-pushed the feat/nvidia-nvlsm branch 2 times, most recently from c6dbe0f to d8696d3 Compare August 22, 2025 15:11
NVIDIA_FABRIC_MANAGER_PRODUCTION_ARM64_SHA512: 6606cb088f055e4511c94cdd772bdf3f0ae118b8cb5f60533351b587e058d2fdb34078ff7aac6e96db2c2f6d993f69229f1d8ed655c8e8a29c4d2992d545694d
NVIDIA_FABRIC_MANAGER_PRODUCTION_AMD64_SHA256: 8d24cacde4554d471899ad426f46a349d5ca0a2e8acd45c2a76381c8f496491e
NVIDIA_FABRIC_MANAGER_PRODUCTION_AMD64_SHA512: f893f92e144c46d2f2c114399c11873ab843bfec480652878453a07ec3a8ed9af429937ca33f60039e120512a47a75ca032f0de7d37f8a8ebf385d90a0099007
NVIDIA_NVLSM_PRODUCTION_VERSION: 2025.03.1
Member


how are the production and LTS versions determined?

…0/B200/B300)

Quoting: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

> On NVIDIA DGX-B200, HGX-B200, and HGX-B100 systems and later, the FabricManager package needs an additional NVLSM dependency for proper operation.
> NVLink Subnet manager (NVLSM) originated from the InfiniBand networking and contains additional logic to program NVSwitches and NVLinks.

Not running the NVLSM on Blackwell and newer fabrics with NVLink 5.0+ will result in FabricManager failing to start with error `NV_WARN_NOTHING_TO_DO`. NVSwitches will remain uninitialized and applications will fail with the `CUDA_ERROR_SYSTEM_NOT_READY` or `cudaErrorSystemNotReady` error.  The CUDA initialization process can only begin after the GPUs complete their registration process with the NVLink fabric.
A GPU fabric registration status can be verified with the command: `nvidia-smi -q -i 0 | grep -i -A 2 Fabric`. An `In Progress` state indicates that the GPU is being registered and FabricManager is likely not running or missing the NVLSM dependency. A `Completed` state is shown when the GPU is successfully registered with the NVLink fabric.

The FabricManager package includes the script `nvidia-fabricmanager-start.sh`, which is used to selectively start FabricManager and NVLSM processes depending on the underlying platform.
A key aspect of determining whether an NVLink 5.0+ fabric is present is to look for a Limited Physical Function (LPF) port in InfiniBand devices. To differentiate LPFs, the Vital Product Data (VPD) information includes a vendor-specific field called `SMDL`, with a non-zero value defined as `SW_MNG`. The first such device is then selected, and its port GUID is extracted and passed to NVLSM and FabricManager.
So both services share a configuration key that results from a common initialization process. Additionally, they communicate with each other over a Unix socket.

This patch introduces the following changes:
 * Adds NVLSM to the nvidia-fabricmanager extension.
 * Introduces a new `nvidia-fabricmanager-wrapper` program to replicate the initialization process from `nvidia-fabricmanager-start.sh`:
   * Detects NVLink 5.0+ fabrics and extracts an NVSwitch LPF port GUID. This is done by calling libibumad directly with CGO instead of parsing the output of the `ibstat` command as in the upstream script.
   * Starts FabricManager, and NVLSM only when needed.
   * Keeps both process lifecycles synchronized and ensures the Talos container will restart if either process crashes.
 * Refactors the nvidia-fabricmanager container to be self-contained, as this service does not share files with other nvidia-gpu extensions.

Signed-off-by: Thibault VINCENT <[email protected]>
Signed-off-by: Noel Georgi <[email protected]>
@github-project-automation github-project-automation bot moved this from On Hold to Approved in Planning Sep 9, 2025