Commit 20474a0
feat(nvidia-fabricmanager): support Blackwell baseboards (DGX/HGX B100/B200/B300)
Quoting https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf:

> On NVIDIA DGX-B200, HGX-B200, and HGX-B100 systems and later, the FabricManager package needs an additional NVLSM dependency for proper operation.
> NVLink Subnet manager (NVLSM) originated from the InfiniBand networking and contains additional logic to program NVSwitches and NVLinks.

Not running NVLSM on Blackwell and newer fabrics with NVLink 5.0+ results in FabricManager failing to start with error `NV_WARN_NOTHING_TO_DO`. NVSwitches remain uninitialized, and applications fail with the `CUDA_ERROR_SYSTEM_NOT_READY` or `cudaErrorSystemNotReady` error. The CUDA initialization process can only begin after the GPUs complete their registration with the NVLink fabric.

A GPU's fabric registration status can be verified with the command `nvidia-smi -q -i 0 | grep -i -A 2 Fabric`. An `In Progress` state indicates that the GPU is still being registered and FabricManager is likely not running or missing the NVLSM dependency. A `Completed` state is shown when the GPU is successfully registered with the NVLink fabric.

The FabricManager package includes the script `nvidia-fabricmanager-start.sh`, which selectively starts the FabricManager and NVLSM processes depending on the underlying platform. A key step in determining whether an NVLink 5.0+ fabric is present is to look for a Limited Physical Function (LPF) port among the InfiniBand devices. To differentiate LPFs, the Vital Product Data (VPD) information includes a vendor-specific field called `SMDL`, with a non-zero value defined as `SW_MNG`. The first such device is selected, and its port GUID is extracted and passed to both NVLSM and FabricManager. Both services therefore share a configuration key that results from a common initialization process, and they communicate with each other over a Unix socket.

This patch introduces the following changes:

* Adds NVLSM to the nvidia-fabricmanager extension.
* Introduces a new `nvidia-fabricmanager-wrapper` program to replicate the initialization process from `nvidia-fabricmanager-start.sh`:
  * Detects NVLink 5.0+ fabrics and extracts an NVSwitch LPF port GUID. This is done by calling libibumad directly with CGO instead of parsing the output of the `ibstat` command as the upstream script does.
  * Starts FabricManager, and starts NVLSM only when needed.
  * Keeps both process lifecycles synchronized and ensures the Talos container restarts if either process crashes.
* Refactors the nvidia-fabricmanager container to be self-contained, as this service does not share files with other nvidia-gpu extensions.

Signed-off-by: Thibault VINCENT <[email protected]>
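For illustration only, here is a minimal, hypothetical pure-Go sketch of the kind of discovery described above: it enumerates InfiniBand devices through sysfs and derives each port GUID from GID index 0. The actual wrapper in this commit queries libibumad through CGO and additionally filters on the VPD `SMDL` field (value `SW_MNG`) to keep only NVSwitch LPF ports; that filtering and the sysfs paths used here are assumptions, not the commit's implementation.

```go
// Hypothetical sketch: list InfiniBand ports from sysfs and print a GUID
// derived from GID index 0 (gid_prefix:port_guid). The real wrapper uses
// libibumad via CGO and filters on the VPD SMDL field instead.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	devs, err := filepath.Glob("/sys/class/infiniband/*")
	if err != nil {
		panic(err)
	}

	for _, dev := range devs {
		ports, _ := filepath.Glob(filepath.Join(dev, "ports", "*"))
		for _, port := range ports {
			raw, err := os.ReadFile(filepath.Join(port, "gids", "0"))
			if err != nil {
				continue
			}
			// GID 0 looks like fe80:0000:0000:0000:xxxx:xxxx:xxxx:xxxx;
			// its last four groups carry the 64-bit port GUID.
			groups := strings.Split(strings.TrimSpace(string(raw)), ":")
			if len(groups) != 8 {
				continue
			}
			fmt.Printf("%s port %s GUID 0x%s\n",
				filepath.Base(dev), filepath.Base(port), strings.Join(groups[4:], ""))
		}
	}
}
```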
1 parent: 7ef078a

File tree: 11 files changed, +485 -96 lines


go.work

Lines changed: 1 addition & 0 deletions

@@ -5,4 +5,5 @@ use (
 	./examples/hello-world-service/src
 	./nvidia-gpu/nvidia-container-toolkit/nvidia-container-runtime-wrapper
 	./nvidia-gpu/nvidia-container-toolkit/nvidia-persistenced-wrapper
+	./nvidia-gpu/nvidia-fabricmanager/nvidia-fabricmanager-wrapper
 )

nvidia-gpu/nvidia-fabricmanager/lts/nvidia-fabricmanager.yaml

Lines changed: 19 additions & 26 deletions

@@ -1,10 +1,7 @@
 # https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
 name: nvidia-fabricmanager
 container:
-  entrypoint: /usr/local/bin/nv-fabricmanager
-  args:
-    - --config
-    - /usr/local/share/nvidia/nvswitch/fabricmanager.cfg
+  entrypoint: /usr/bin/nvidia-fabricmanager-wrapper
   mounts:
     # device files
     - source: /dev
@@ -28,44 +25,40 @@ container:
       options:
         - bind
         - ro
-    # nvidia libraries
-    - source: /usr/local/lib
-      destination: /usr/local/lib
-      type: bind
-      options:
-        - bind
-        - ro
     # service state file
+    # * nvlsm:
+    #   - pid file that can't be disabled
+    #   - unix socket /var/run/nvidia-fabricmanager/fm_sm_ipc.socket
+    #     can't be changed, path is hardcoded into fabricmanager
+    # * fabricmanager
+    #   - state file
+    #   - database files
     - source: /var/run/nvidia-fabricmanager
-      destination: /var/run/nvidia-fabricmanager
+      destination: /var/run
       type: bind
      options:
        - rshared
        - rbind
        - rw
-    # log files
-    - source: /var/log
-      destination: /var/log
+    # service cache file
+    # * nvlsm: database files
+    - source: /var/cache/nvidia-fabricmanager
+      destination: /var/cache
      type: bind
      options:
        - rshared
        - rbind
        - rw
-    # fabric topology files
-    - source: /usr/local/share/nvidia/nvswitch
-      destination: /usr/local/share/nvidia/nvswitch
+    # service log files
+    # * nvlsm:
+    #   - mandatory dump files hardcoded to /var/log/<file>, so /var/log must be writable
+    - source: /var/log/nvidia-fabricmanager
+      destination: /var/log
      type: bind
      options:
        - rshared
        - rbind
-        - ro
-    # binaries
-    - source: /usr/local/bin
-      destination: /usr/local/bin
-      type: bind
-      options:
-        - bind
-        - ro
+        - rw
 depends:
   - service: cri
     # we need to depend on udevd so that the nvidia device files are created

nvidia-gpu/nvidia-fabricmanager/lts/pkg.yaml

Lines changed: 59 additions & 22 deletions

@@ -2,56 +2,93 @@ name: nvidia-fabricmanager-lts
 variant: scratch
 shell: /bin/bash
 dependencies:
-  - stage: base
+  - stage: base
+    from: /
+    to: /base-rootfs
+  - image: cgr.dev/chainguard/wolfi-base@{{ .WOLFI_BASE_REF }}
+  - stage: nvidia-fabricmanager-wrapper
+install:
+  - bash
 steps:
   - sources:
       # {{ if eq .ARCH "aarch64" }} This in fact is YAML comment, but Go templating instruction is evaluated by bldr
      - url: https://developer.download.nvidia.com/compute/nvidia-driver/redist/fabricmanager/linux-sbsa/fabricmanager-linux-sbsa-{{ .NVIDIA_DRIVER_LTS_VERSION }}-archive.tar.xz
        destination: fabricmanager.tar.xz
        sha256: {{ .NVIDIA_FABRIC_MANAGER_LTS_ARM64_SHA256 }}
        sha512: {{ .NVIDIA_FABRIC_MANAGER_LTS_ARM64_SHA512 }}
+      - url: https://developer.download.nvidia.com/compute/cuda/redist/nvlsm/linux-sbsa/nvlsm-linux-sbsa-{{ .NVIDIA_NVLSM_VERSION }}-archive.tar.xz
+        destination: nvlsm.tar.xz
+        sha256: {{ .NVIDIA_NVLSM_ARM64_SHA256 }}
+        sha512: {{ .NVIDIA_NVLSM_ARM64_SHA512 }}
      # {{ else }} This in fact is YAML comment, but Go templating instruction is evaluated by bldr
      - url: https://developer.download.nvidia.com/compute/nvidia-driver/redist/fabricmanager/linux-x86_64/fabricmanager-linux-x86_64-{{ .NVIDIA_DRIVER_LTS_VERSION }}-archive.tar.xz
        destination: fabricmanager.tar.xz
        sha256: {{ .NVIDIA_FABRIC_MANAGER_LTS_AMD64_SHA256 }}
        sha512: {{ .NVIDIA_FABRIC_MANAGER_LTS_AMD64_SHA512 }}
-      # {{ end }} This in fact is YAML comment, but Go templating instruction is evaluated by bldr
+      - url: https://developer.download.nvidia.com/compute/cuda/redist/nvlsm/linux-x86_64/nvlsm-linux-x86_64-{{ .NVIDIA_NVLSM_VERSION }}-archive.tar.xz
+        destination: nvlsm.tar.xz
+        sha256: {{ .NVIDIA_NVLSM_AMD64_SHA256 }}
+        sha512: {{ .NVIDIA_NVLSM_AMD64_SHA512 }}
+      # {{ end }} This in fact is YAML comment, but Go templating instruction is evaluated by bld
     prepare:
       - |
-        tar -xf fabricmanager.tar.xz --strip-components=1
-
+        mkdir fm sm
+        tar -xf fabricmanager.tar.xz --strip-components=1 -C fm
+        tar -xf nvlsm.tar.xz --strip-components=1 -C sm
+      - |
         sed -i 's#$VERSION#{{ .VERSION }}#' /pkg/manifest.yaml
     install:
       - |
-        mkdir -p /rootfs/usr/local/bin \
-                 /rootfs/usr/local/lib \
-                 /rootfs/usr/local/share/nvidia/nvswitch \
-                 /rootfs/usr/local/lib/containers/nvidia-fabricmanager \
-                 /rootfs/usr/local/etc/containers
+        mkdir -p /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/bin \
+                 /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/lib \
+                 /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch \
+                 /rootfs/usr/local/lib/containers/nvidia-fabricmanager/opt/nvidia/nvlsm/sbin \
+                 /rootfs/usr/local/lib/containers/nvidia-fabricmanager/opt/nvidia/nvlsm/lib \
+                 /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvlsm
+      # nvlsm
+      - |
+        cp sm/sbin/nvlsm /rootfs/usr/local/lib/containers/nvidia-fabricmanager/opt/nvidia/nvlsm/sbin/

-        cp lib/libnvfm.so.1 /rootfs/usr/local/lib/libnvfm.so.1
-        ln -s libnvfm.so.1 /rootfs/usr/local/lib/libnvfm.so
+        cp sm/lib/libgrpc_mgr.so \
+           /usr/lib/libgcc_s.so.1 \
+           /rootfs/usr/local/lib/containers/nvidia-fabricmanager/opt/nvidia/nvlsm/lib/

-        cp bin/nv-fabricmanager /rootfs/usr/local/bin/
-        cp bin/nvswitch-audit /rootfs/usr/local/bin/
+        cp sm/share/nvidia/nvlsm/device_configuration.conf \
+           sm/share/nvidia/nvlsm/grpc_mgr.conf \
+           sm/share/nvidia/nvlsm/nvlsm.conf \
+           /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvlsm/
+      # fabricmanager
+      - |
+        cp fm/bin/nv-fabricmanager \
+           fm/bin/nvswitch-audit \
+           /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/bin/

-        cp share/nvidia/nvswitch/dgx2_hgx2_topology /rootfs/usr/local/share/nvidia/nvswitch/
-        cp share/nvidia/nvswitch/dgxa100_hgxa100_topology /rootfs/usr/local/share/nvidia/nvswitch/
+        cp fm/lib/libnvfm.so.1 /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/lib/
+        ln -s libnvfm.so.1 /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/lib/libnvfm.so

-        cp etc/fabricmanager.cfg /rootfs/usr/local/share/nvidia/nvswitch/
+        cp fm/share/nvidia/nvswitch/* \
+           fm/etc/fabricmanager.cfg \
+           /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch/

-        cp /pkg/nvidia-fabricmanager.yaml /rootfs/usr/local/etc/containers/nvidia-fabricmanager.yaml
+        sed -i 's/DAEMONIZE=.*/DAEMONIZE=0/g' /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch/fabricmanager.cfg
+        sed -i 's/LOG_FILE_NAME=.*/LOG_FILE_NAME=/g' /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch/fabricmanager.cfg
+        sed -i 's#STATE_FILE_NAME=.*#STATE_FILE_NAME=/var/run/fabricmanager.state#g' /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch/fabricmanager.cfg

-        sed -i 's/DAEMONIZE=.*/DAEMONIZE=0/g' /rootfs/usr/local/share/nvidia/nvswitch/fabricmanager.cfg
-        sed -i 's/STATE_FILE_NAME=.*/STATE_FILE_NAME=\/var\/run\/nvidia-fabricmanager\/fabricmanager.state/g' /rootfs/usr/local/share/nvidia/nvswitch/fabricmanager.cfg
-        sed -i 's/TOPOLOGY_FILE_PATH=.*/TOPOLOGY_FILE_PATH=\/usr\/local\/share\/nvidia\/nvswitch/g' /rootfs/usr/local/share/nvidia/nvswitch/fabricmanager.cfg
-        sed -i 's/DATABASE_PATH=.*/DATABASE_PATH=\/usr\/local\/share\/nvidia\/nvswitch/g' /rootfs/usr/local/share/nvidia/nvswitch/fabricmanager.cfg
+        if grep -q '^DATABASE_PATH=' /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch/fabricmanager.cfg
+        then
+          sed -i 's#DATABASE_PATH=.*#DATABASE_PATH=/var/run#g' /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch/fabricmanager.cfg
+        else
+          echo -e '\nDATABASE_PATH=/var/run\n' >>/rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch/fabricmanager.cfg
+        fi
+      - |
+        mkdir -p /rootfs/usr/local/etc/containers
+        cp /pkg/nvidia-fabricmanager.yaml /rootfs/usr/local/etc/containers/nvidia-fabricmanager.yaml
     test:
       - |
         mkdir -p /extensions-validator-rootfs
         cp -r /rootfs/ /extensions-validator-rootfs/rootfs
         cp /pkg/manifest.yaml /extensions-validator-rootfs/manifest.yaml
-        /extensions-validator validate --rootfs=/extensions-validator-rootfs --pkg-name="${PKG_NAME}"
+        /base-rootfs/extensions-validator validate --rootfs=/extensions-validator-rootfs --pkg-name="${PKG_NAME}"
 finalize:
   - from: /rootfs
     to: /rootfs
nvidia-gpu/nvidia-fabricmanager/nvidia-fabricmanager-wrapper/go.mod

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+module nvidia-fabricmanager-wrapper
+
+go 1.23.0
+
+require github.com/goaux/decowriter v1.0.0
nvidia-gpu/nvidia-fabricmanager/nvidia-fabricmanager-wrapper/go.sum

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+github.com/goaux/decowriter v1.0.0 h1:f1mfBWGFIo3Upev3gswfGLQzQvC4SBVYi2ZAkNZsIaU=
+github.com/goaux/decowriter v1.0.0/go.mod h1:8GKUmiBlNCYxVHU2vlZoQHwLvYh7Iw1c7/tRekJbX7o=
Lines changed: 156 additions & 0 deletions

@@ -0,0 +1,156 @@
+// This Source Code Form is subject to the terms of the Mozilla Public
+// License, v. 2.0. If a copy of the MPL was not distributed with this
+// file, You can obtain one at http://mozilla.org/MPL/2.0/.
+
+package main
+
+import (
+	"bufio"
+	"context"
+	"fmt"
+	"log"
+	"os"
+	"os/exec"
+	"os/signal"
+	"path/filepath"
+	"strings"
+	"sync"
+	"syscall"
+	"time"
+
+	"github.com/goaux/decowriter"
+)
+
+const (
+	// FabricManager
+	fmCmdFile     = "/usr/bin/nv-fabricmanager"
+	fmConfigFile  = "/usr/share/nvidia/nvswitch/fabricmanager.cfg"
+	fmStopTimeout = 5 * time.Second
+
+	// NVLSM
+	smCmdFile     = "/opt/nvidia/nvlsm/sbin/nvlsm"
+	smConfigFile  = "/usr/share/nvidia/nvlsm/nvlsm.conf"
+	smPidFile     = "/var/run/nvlsm.pid"
+	smSocket      = "/var/run/nvidia-fabricmanager/fm_sm_ipc.socket"
+	smStopTimeout = 5 * time.Second
+	smSocketWait  = 15 * time.Second
+)
+
+func runCommand(ctx context.Context, wg *sync.WaitGroup, doneCb func(), waitDelay time.Duration, path string, arg ...string) {
+	wg.Add(1)
+
+	cmd := exec.CommandContext(ctx, path, arg...)
+	cmd.WaitDelay = waitDelay
+	cmd.Cancel = func() error {
+		return cmd.Process.Signal(os.Interrupt)
+	}
+
+	// TODO line writer to log module
+	name := filepath.Base(path)
+	cmd.Stdout = decowriter.New(bufio.NewWriter(os.Stdout), []byte(name+": "), []byte{})
+	cmd.Stderr = decowriter.New(bufio.NewWriter(os.Stderr), []byte(name+": "), []byte{})
+
+	go func() {
+		log.Printf("nvidia-fabricmanager-wrapper: running command: %s %s\n", path, strings.Join(arg, " "))
+
+		err := cmd.Run()
+		if err == nil {
+			log.Printf("nvidia-fabricmanager-wrapper: command %s [%d] completed successfully\n", path, cmd.Process.Pid)
+		} else if exitErr, ok := err.(*exec.ExitError); ok {
+			if exitErr.Exited() {
+				log.Printf("nvidia-fabricmanager-wrapper: command %s [%d] exited with code %d\n", path, exitErr.Pid(),
+					exitErr.ExitCode())
+			} else {
+				log.Printf("nvidia-fabricmanager-wrapper: command %s [%d] was terminated\n", path, exitErr.Pid())
+			}
+		} else {
+			log.Printf("nvidia-fabricmanager-wrapper: failed to run command %s: %v\n", path, err)
+		}
+
+		wg.Done()
+		doneCb()
+	}()
+}
+
+func waitForFile(ctx context.Context, filepath string, timeout time.Duration) error {
+	timer := time.NewTimer(timeout)
+	defer timer.Stop()
+
+	for {
+		select {
+		case <-ctx.Done():
+			return fmt.Errorf("parent context canceled: %w", ctx.Err())
+		case <-timer.C:
+			return fmt.Errorf("timeout waiting for file")
+		default:
+			if _, err := os.Stat(filepath); err == nil {
+				return nil
+			}
+			time.Sleep(100 * time.Millisecond)
+		}
+	}
+}
+
+func main() {
+	var cmdWg sync.WaitGroup
+
+	signal.Ignore(syscall.SIGHUP)
+
+	runCtx, gracefulShutdown := context.WithCancel(context.Background())
+
+	signalsChan := make(chan os.Signal, 1)
+	signal.Notify(signalsChan, os.Interrupt)
+	signal.Notify(signalsChan, syscall.SIGTERM)
+
+	go func() {
+		received := <-signalsChan
+		signal.Stop(signalsChan)
+		log.Printf("nvidia-fabricmanager-wrapper: received signal '%s', initiating a graceful shutdown\n", received.String())
+		gracefulShutdown()
+	}()
+
+	nvswitchPorts := findNvswitchMgmtPorts()
+	for _, port := range nvswitchPorts {
+		log.Printf("nvidia-fabricmanager-wrapper: found NVSwitch LPF: device=%s guid=0x%x\n", port.IBDevice, port.PortGUID)
+	}
+
+	fmSmMgmtPortGUID := ""
+	if len(nvswitchPorts) > 0 {
+		fmSmMgmtPortGUID = fmt.Sprintf("0x%x", nvswitchPorts[0].PortGUID)
+		log.Printf("nvidia-fabricmanager-wrapper: using NVSwitch management port GUID: %s\n", fmSmMgmtPortGUID)
+	} else {
+		log.Println("nvidia-fabricmanager-wrapper: No InfiniBand NVSwitch detected. On Blackwell HGX baseboards and newer",
+			"with NVLink 5.0+, please load kernel module 'ib_umad' for NVLSM to run along FabricManager. Otherwise it will",
+			"fail to start with error NV_WARN_NOTHING_TO_DO, and GPU workloads will report CUDA_ERROR_SYSTEM_NOT_READY.")
+	}
+
+	if fmSmMgmtPortGUID != "" {
+		if err := os.Mkdir(filepath.Dir(smSocket), 0755); err != nil {
+			log.Printf("nvidia-fabricmanager-wrapper: error creating socket directory: %v\n", err)
+		}
+
+		runCommand(runCtx, &cmdWg, gracefulShutdown, smStopTimeout, smCmdFile, "--config", smConfigFile,
+			"--guid", fmSmMgmtPortGUID, "--pid_file", smPidFile, "--log_file", "stdout")
+
+		// vendor startup script waits for 5 seconds for NVLSM socket to be available before starting FM
+		// let's wait for the actual GRPC socket to be created by the plugin
+		log.Println("nvidia-fabricmanager-wrapper: waiting for socket creation at", smSocket)
+		err := waitForFile(runCtx, smSocket, smSocketWait)
+		if err != nil {
+			log.Printf("nvidia-fabricmanager-wrapper: error waiting for socket: %v\n", err)
+		} else {
+			log.Println("nvidia-fabricmanager-wrapper: socket found at", smSocket)
+		}
+		// for safety
+		time.Sleep(time.Second)
+	}
+
+	fmCmdArgs := []string{"--config", fmConfigFile}
+	if fmSmMgmtPortGUID != "" {
+		fmCmdArgs = append(fmCmdArgs, "--fm-sm-mgmt-port-guid", fmSmMgmtPortGUID)
+	}
+	runCommand(runCtx, &cmdWg, gracefulShutdown, fmStopTimeout, fmCmdFile, fmCmdArgs...)
+
+	log.Println("nvidia-fabricmanager-wrapper: initialization completed")
+	cmdWg.Wait()
+}
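The wrapper above calls findNvswitchMgmtPorts(), which is defined in a separate CGO source file of this commit that is not part of this excerpt. The following is a hypothetical sketch of the contract it must satisfy, inferred purely from the call sites above; the struct name and stub body are assumptions, while the real implementation queries libibumad and filters ports on the VPD SMDL field.

```go
// Hypothetical sketch only: the shape of the value returned by
// findNvswitchMgmtPorts(), inferred from how main.go uses it
// (port.IBDevice, port.PortGUID). The real definition lives in the
// commit's CGO file, which calls libibumad and keeps only ports whose
// VPD SMDL field reads SW_MNG.
package main

type nvswitchMgmtPort struct {
	IBDevice string // InfiniBand device name, e.g. "mlx5_0"
	PortGUID uint64 // port GUID, formatted as 0x%x for nvlsm and nv-fabricmanager
}

func findNvswitchMgmtPorts() []nvswitchMgmtPort {
	// Placeholder body; the actual implementation enumerates umad devices via CGO.
	return nil
}
```

Note also how the supervision requirement from the commit message is met: runCommand's doneCb is the shared context's cancel function, so when either nvlsm or nv-fabricmanager exits, the other receives SIGINT through cmd.Cancel, cmdWg.Wait() returns, and the wrapper exits so that Talos restarts the whole container.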
