
Commit 39fd9d3

feat(nvidia-fabricmanager): support Blackwell baseboards (DGX/HGX B100/B200/B300)
Quoting: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

> On NVIDIA DGX-B200, HGX-B200, and HGX-B100 systems and later, the FabricManager package needs an additional NVLSM dependency for proper operation.

> NVLink Subnet manager (NVLSM) originated from the InfiniBand networking and contains additional logic to program NVSwitches and NVLinks.

Not running the NVLSM on Blackwell and newer fabrics with NVLink 5.0+ will result in FabricManager failing to start with error `NV_WARN_NOTHING_TO_DO`. NVSwitches will remain uninitialized and applications will fail with the `CUDA_ERROR_SYSTEM_NOT_READY` or `cudaErrorSystemNotReady` error.

The CUDA initialization process can only begin after the GPUs complete their registration process with the NVLink fabric. A GPU's fabric registration status can be verified with the command: `nvidia-smi -q -i 0 | grep -i -A 2 Fabric`. An `In Progress` state indicates that the GPU is still being registered, which usually means FabricManager is not running or is missing the NVLSM dependency. A `Completed` state is shown when the GPU is successfully registered with the NVLink fabric.

The FabricManager package includes the script `nvidia-fabricmanager-start.sh`, which is used to selectively start FabricManager and NVLSM processes depending on the underlying platform.

A key aspect of determining whether an NVLink 5.0+ fabric is present is to look for a Limited Physical Function (LPF) port in InfiniBand devices. To differentiate LPFs, the Vital Product Data (VPD) information includes a vendor-specific field called `SMDL`, with a non-zero value defined as `SW_MNG`. The first such device is then selected, and its port GUID is extracted and passed to NVLSM and FabricManager. So both services share a configuration key that results from a common initialization process. Additionally, they communicate with each other over a Unix socket.

This patch introduces the following changes:

* Adds NVLSM to the nvidia-fabricmanager extension.
* Introduces a new `nvidia-fabricmanager-wrapper` program to replicate the initialization process from `nvidia-fabricmanager-start.sh`:
  * Detects NVLink 5.0+ fabrics and extracts an NVSwitch LPF port GUID. This is done by calling libibumad directly with CGO instead of parsing the output of the `ibstat` command as in the upstream script.
  * Starts FabricManager, and NVLSM only when needed.
  * Keeps both process lifecycles synchronized and ensures the Talos container will restart if either process crashes.
* Refactors the nvidia-fabricmanager container to be self-contained, as this service does not share files with other nvidia-gpu extensions.

Signed-off-by: Thibault VINCENT <[email protected]>
1 parent f9b5bf6 commit 39fd9d3
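
The commit message says the wrapper keeps the FabricManager and NVLSM lifecycles synchronized so that the container restarts if either process crashes. The wrapper's Go source is not shown in this view, so the following is only a minimal sketch of that idea under stated assumptions: the function name `runTogether` and its signature are hypothetical, and the real wrapper may handle signals, logging, and start ordering differently.

```go
package main

import (
	"context"
	"os"
	"os/exec"
)

// exitEvent records which child exited and the error returned by Wait.
type exitEvent struct {
	index int
	err   error
}

// runTogether (hypothetical helper) starts every command line in argvs and
// blocks until all of them have exited. As soon as one child exits, the
// shared context is cancelled, which kills the remaining children. It returns
// the index of the first child to exit and that child's Wait error
// (nil when it exited with status 0).
func runTogether(argvs [][]string) (int, error) {
	if len(argvs) == 0 {
		return -1, nil
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	events := make(chan exitEvent, len(argvs))
	started := 0
	for i, argv := range argvs {
		cmd := exec.CommandContext(ctx, argv[0], argv[1:]...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Start(); err != nil {
			// Could not start this child: stop the ones already running.
			cancel()
			for ; started > 0; started-- {
				<-events
			}
			return i, err
		}
		started++
		go func(i int, cmd *exec.Cmd) {
			events <- exitEvent{index: i, err: cmd.Wait()}
		}(i, cmd)
	}

	first := <-events // the first child to exit decides the outcome
	cancel()          // kill the siblings so lifecycles stay in sync
	for n := 1; n < started; n++ {
		<-events
	}
	return first.index, first.err
}
```

A caller would pass the two command lines (for example the `nv-fabricmanager` and `nvlsm` invocations) and exit non-zero once `runTogether` returns, leaving the restart to the container supervisor.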

File tree

14 files changed

+600
-97
lines changed

go.work

Lines changed: 1 addition & 0 deletions
@@ -4,4 +4,5 @@ use (
  ./examples/hello-world-service/src
  ./nvidia-gpu/nvidia-container-toolkit/nvidia-container-runtime-wrapper
  ./nvidia-gpu/nvidia-container-toolkit/nvidia-persistenced-wrapper
+ ./nvidia-gpu/nvidia-fabricmanager/nvidia-fabricmanager-wrapper
  )

nvidia-gpu/nvidia-fabricmanager/lts/nvidia-fabricmanager.yaml

Lines changed: 23 additions & 26 deletions
@@ -1,10 +1,12 @@
  # https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
  name: nvidia-fabricmanager
  container:
- entrypoint: /usr/local/bin/nv-fabricmanager
- args:
- - --config
- - /usr/local/share/nvidia/nvswitch/fabricmanager.cfg
+ entrypoint: /usr/bin/nvidia-fabricmanager-wrapper
+ environment:
+ # for NVLSM to find [/usr/local/lib/containers/nvidia-fabricmanager]/usr/lib/libgcc_s.so.1
+ # - LD_LIBRARY_PATH=/usr/lib
+ # security:
+ # writeableSysfs: true
  mounts:
  # device files
  - source: /dev
@@ -28,44 +30,39 @@ container:
  options:
  - bind
  - ro
- # nvidia libraries
- - source: /usr/local/lib
- destination: /usr/local/lib
- type: bind
- options:
- - bind
- - ro
  # service state file
+ # - nvlsm:
+ # - pid file that can't be disabled
+ # - unix socket /var/run/nvidia-fabricmanager/fm_sm_ipc.socket
+ # don't change it, path is hardcoded into fabricmanager
  - source: /var/run/nvidia-fabricmanager
- destination: /var/run/nvidia-fabricmanager
+ destination: /var/run
  type: bind
  options:
  - rshared
  - rbind
  - rw
- # log files
- - source: /var/log
- destination: /var/log
+ # service cache file
+ # - nvlsm: database files
+ - source: /var/cache/nvidia-fabricmanager
+ destination: /var/cache
  type: bind
  options:
  - rshared
  - rbind
  - rw
- # fabric topology files
- - source: /usr/local/share/nvidia/nvswitch
- destination: /usr/local/share/nvidia/nvswitch
+ # service log files
+ # - nvlsm:
+ # - mandatory dump files hardcoded to /var/log/<file>, so /var/log must be writable
+ # - fabricmanager:
+ # - log files, with self-managed rotation and size limit
+ - source: /var/log/nvidia-fabricmanager
+ destination: /var/log
  type: bind
  options:
  - rshared
  - rbind
- - ro
- # binaries
- - source: /usr/local/bin
- destination: /usr/local/bin
- type: bind
- options:
- - bind
- - ro
+ - rw
  depends:
  - service: cri
  # we need to depend on udevd so that the nvidia device files are created

nvidia-gpu/nvidia-fabricmanager/lts/pkg.yaml

Lines changed: 51 additions & 22 deletions
@@ -2,56 +2,85 @@ name: nvidia-fabricmanager-lts
  variant: scratch
  shell: /bin/bash
  dependencies:
- - stage: base
+ - stage: base
+ from: /
+ to: /base-rootfs
+ - image: cgr.dev/chainguard/wolfi-base@{{ .WOLFI_BASE_REF }}
+ - stage: nvidia-fabricmanager-wrapper
+ install:
+ - bash
  steps:
  - sources:
  # {{ if eq .ARCH "aarch64" }} This in fact is YAML comment, but Go templating instruction is evaluated by bldr
  - url: https://developer.download.nvidia.com/compute/nvidia-driver/redist/fabricmanager/linux-sbsa/fabricmanager-linux-sbsa-{{ .NVIDIA_DRIVER_LTS_VERSION }}-archive.tar.xz
  destination: fabricmanager.tar.xz
  sha256: {{ .NVIDIA_FABRIC_MANAGER_LTS_ARM64_SHA256 }}
  sha512: {{ .NVIDIA_FABRIC_MANAGER_LTS_ARM64_SHA512 }}
+ - url: https://developer.download.nvidia.com/compute/cuda/redist/nvlsm/linux-sbsa/nvlsm-linux-sbsa-{{ .NVIDIA_NVLSM_LTS_VERSION }}-archive.tar.xz
+ destination: nvlsm.tar.xz
+ sha256: {{ .NVIDIA_NVLSM_LTS_ARM64_SHA256 }}
+ sha512: {{ .NVIDIA_NVLSM_LTS_ARM64_SHA512 }}
  # {{ else }} This in fact is YAML comment, but Go templating instruction is evaluated by bldr
  - url: https://developer.download.nvidia.com/compute/nvidia-driver/redist/fabricmanager/linux-x86_64/fabricmanager-linux-x86_64-{{ .NVIDIA_DRIVER_LTS_VERSION }}-archive.tar.xz
  destination: fabricmanager.tar.xz
  sha256: {{ .NVIDIA_FABRIC_MANAGER_LTS_AMD64_SHA256 }}
  sha512: {{ .NVIDIA_FABRIC_MANAGER_LTS_AMD64_SHA512 }}
- # {{ end }} This in fact is YAML comment, but Go templating instruction is evaluated by bldr
+ - url: https://developer.download.nvidia.com/compute/cuda/redist/nvlsm/linux-x86_64/nvlsm-linux-x86_64-{{ .NVIDIA_NVLSM_LTS_VERSION }}-archive.tar.xz
+ destination: nvlsm.tar.xz
+ sha256: {{ .NVIDIA_NVLSM_LTS_AMD64_SHA256 }}
+ sha512: {{ .NVIDIA_NVLSM_LTS_AMD64_SHA512 }}
+ # {{ end }} This in fact is YAML comment, but Go templating instruction is evaluated by bld
  prepare:
  - |
- tar -xf fabricmanager.tar.xz --strip-components=1
-
+ mkdir fm sm
+ tar -xf fabricmanager.tar.xz --strip-components=1 -C fm
+ tar -xf nvlsm.tar.xz --strip-components=1 -C sm
+ - |
  sed -i 's#$VERSION#{{ .VERSION }}#' /pkg/manifest.yaml
  install:
  - |
- mkdir -p /rootfs/usr/local/bin \
- /rootfs/usr/local/lib \
- /rootfs/usr/local/share/nvidia/nvswitch \
- /rootfs/usr/local/lib/containers/nvidia-fabricmanager \
- /rootfs/usr/local/etc/containers
+ mkdir -p /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/bin \
+ /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/lib \
+ /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch \
+ /rootfs/usr/local/lib/containers/nvidia-fabricmanager/opt/nvidia/nvlsm/sbin \
+ /rootfs/usr/local/lib/containers/nvidia-fabricmanager/opt/nvidia/nvlsm/lib \
+ /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvlsm
+ # nvlsm
+ - |
+ cp sm/sbin/nvlsm /rootfs/usr/local/lib/containers/nvidia-fabricmanager/opt/nvidia/nvlsm/sbin/

- cp lib/libnvfm.so.1 /rootfs/usr/local/lib/libnvfm.so.1
- ln -s libnvfm.so.1 /rootfs/usr/local/lib/libnvfm.so
+ cp sm/lib/libgrpc_mgr.so \
+ /usr/lib/libgcc_s.so.1 \
+ /rootfs/usr/local/lib/containers/nvidia-fabricmanager/opt/nvidia/nvlsm/lib/

- cp bin/nv-fabricmanager /rootfs/usr/local/bin/
- cp bin/nvswitch-audit /rootfs/usr/local/bin/
+ cp sm/share/nvidia/nvlsm/device_configuration.conf \
+ sm/share/nvidia/nvlsm/grpc_mgr.conf \
+ sm/share/nvidia/nvlsm/nvlsm.conf \
+ /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvlsm/
+ # fabricmanager
+ - |
+ cp fm/bin/nv-fabricmanager \
+ fm/bin/nvswitch-audit \
+ /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/bin/

- cp share/nvidia/nvswitch/dgx2_hgx2_topology /rootfs/usr/local/share/nvidia/nvswitch/
- cp share/nvidia/nvswitch/dgxa100_hgxa100_topology /rootfs/usr/local/share/nvidia/nvswitch/
+ cp fm/lib/libnvfm.so.1 /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/lib/
+ ln -s libnvfm.so.1 /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/lib/libnvfm.so

- cp etc/fabricmanager.cfg /rootfs/usr/local/share/nvidia/nvswitch/
+ cp fm/share/nvidia/nvswitch/* \
+ fm/etc/fabricmanager.cfg \
+ /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch/

+ sed -i 's/DAEMONIZE=.*/DAEMONIZE=0/g' /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch/fabricmanager.cfg
+ sed -i 's#STATE_FILE_NAME=.*#STATE_FILE_NAME=/var/run/fabricmanager.state#g' /rootfs/usr/local/lib/containers/nvidia-fabricmanager/usr/share/nvidia/nvswitch/fabricmanager.cfg
+ - |
+ mkdir -p /rootfs/usr/local/etc/containers
  cp /pkg/nvidia-fabricmanager.yaml /rootfs/usr/local/etc/containers/nvidia-fabricmanager.yaml
-
- sed -i 's/DAEMONIZE=.*/DAEMONIZE=0/g' /rootfs/usr/local/share/nvidia/nvswitch/fabricmanager.cfg
- sed -i 's/STATE_FILE_NAME=.*/STATE_FILE_NAME=\/var\/run\/nvidia-fabricmanager\/fabricmanager.state/g' /rootfs/usr/local/share/nvidia/nvswitch/fabricmanager.cfg
- sed -i 's/TOPOLOGY_FILE_PATH=.*/TOPOLOGY_FILE_PATH=\/usr\/local\/share\/nvidia\/nvswitch/g' /rootfs/usr/local/share/nvidia/nvswitch/fabricmanager.cfg
- sed -i 's/DATABASE_PATH=.*/DATABASE_PATH=\/usr\/local\/share\/nvidia\/nvswitch/g' /rootfs/usr/local/share/nvidia/nvswitch/fabricmanager.cfg
  test:
  - |
  mkdir -p /extensions-validator-rootfs
  cp -r /rootfs/ /extensions-validator-rootfs/rootfs
  cp /pkg/manifest.yaml /extensions-validator-rootfs/manifest.yaml
- /extensions-validator validate --rootfs=/extensions-validator-rootfs --pkg-name="${PKG_NAME}"
+ /base-rootfs/extensions-validator validate --rootfs=/extensions-validator-rootfs --pkg-name="${PKG_NAME}"
  finalize:
  - from: /rootfs
  to: /rootfs
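
The `sed` calls in the install step rewrite `KEY=value` settings in `fabricmanager.cfg`. As an illustration only (the package build uses `sed`, not Go), the same substitution can be sketched in Go; `setConfigValue` is a hypothetical helper name, and like the unanchored `sed` expression it replaces everything from `KEY=` to the end of each matching line.

```go
package main

import "regexp"

// setConfigValue (hypothetical helper) mirrors `sed 's/KEY=.*/KEY=value/g'`:
// every occurrence of "KEY=" followed by the rest of the line is replaced
// with "KEY=value". The key is quoted so that it is matched literally, and
// the replacement is inserted literally (no "$1"-style expansion).
func setConfigValue(cfg, key, value string) string {
	re := regexp.MustCompile(regexp.QuoteMeta(key) + `=.*`)
	return re.ReplaceAllLiteralString(cfg, key+"="+value)
}
```

Calling it with `DAEMONIZE`/`0` and then `STATE_FILE_NAME`/`/var/run/fabricmanager.state` reproduces the two edits in the new install step. The shorter state-file path matches the new mount layout, where the host directory `/var/run/nvidia-fabricmanager` is bound to `/var/run` inside the container.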
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+ CA 'mlx5_0'
+ CA type: MT4129
+ Number of ports: 1
+ Firmware version: 28.42.1274
+ Hardware version: 0
+ Node GUID: 0xe09d730300e400e8
+ System image GUID: 0xe09d730300e400e8
+ Port 1:
+ State: Active
+ Physical state: LinkUp
+ Rate: 100
+ Base lid: 1
+ LMC: 0
+ SM lid: 1
+ Capability mask: 0xa751e84a
+ Port GUID: 0xe09d730300e400e8
+ Link layer: InfiniBand
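
The fixture above has the shape of `ibstat` output, which is what the upstream start script parses to obtain a port GUID. The wrapper in this commit is described as calling libibumad through CGO instead, so the following is not its implementation; it is only a small sketch of how text in this format could be reduced to (adapter, port GUID) pairs. The type `caPort` and the function `parseIbstat` are hypothetical names.

```go
package main

import (
	"bufio"
	"strings"
)

// caPort is a hypothetical record for one port found in ibstat-style text.
type caPort struct {
	CA       string // channel adapter name, e.g. "mlx5_0"
	PortGUID string // hex string exactly as printed
}

// parseIbstat walks ibstat-style text line by line. A "CA '<name>'" header
// sets the current adapter; every "Port GUID:" line that follows is recorded
// under that adapter. Indentation is ignored.
func parseIbstat(out string) []caPort {
	var ports []caPort
	ca := ""
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if rest, ok := strings.CutPrefix(line, "CA '"); ok {
			ca = strings.TrimSuffix(rest, "'")
			continue
		}
		if guid, ok := strings.CutPrefix(line, "Port GUID:"); ok {
			ports = append(ports, caPort{CA: ca, PortGUID: strings.TrimSpace(guid)})
		}
	}
	return ports
}
```

Fed the fixture above, this yields a single entry for `mlx5_0` with port GUID `0xe09d730300e400e8`.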
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+ 05:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
+ Subsystem: Mellanox Technologies Device 0087
+ Physical Slot: 14
+ Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
+ Latency: 0, Cache Line Size: 64 bytes
+ Interrupt: pin A routed to IRQ 16
+ NUMA node: 0
+ IOMMU group: 154
+ Region 0: Memory at 20fff4000000 (64-bit, prefetchable) [size=32M]
+ Expansion ROM at a9400000 [disabled] [size=1M]
+ Capabilities: [60] Express (v2) Endpoint, MSI 00
+ DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
+ ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
+ DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
+ RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
+ MaxPayload 256 bytes, MaxReadReq 512 bytes
+ DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
+ LnkCap: Port #0, Speed 16GT/s, Width x2, ASPM not supported
+ ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
+ LnkSta: Speed 8GT/s (downgraded), Width x2 (ok)
+ TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
+ DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
+ 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt- EETLPPrefix-
+ EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
+ FRS- TPHComp- ExtTPHComp-
+ AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+
+ DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis- LTR- OBFF Disabled,
+ AtomicOpsCtl: ReqEn-
+ LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
+ LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
+ Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
+ Compliance De-emphasis: -6dB
+ LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
+ Retimer- 2Retimers- CrosslinkRes: unsupported
+ Capabilities: [48] Vital Product Data
+ Product Name: Nvidia ConnectX-7 mezz internal for Nvidia Umbriel system
+ Read-only fields:
+ [PN] Part number: 692-9X760-00SE-S00
+ [EC] Engineering changes: A4
+ [V2] Vendor specific: 692-9X760-00SE-S00
+ [SN] Serial number: MT2503603H5V
+ [V3] Vendor specific: 96375e6109d7ef118000e09d73e400e8
+ [VA] Vendor specific: MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:SMDL=SW_MNG:MODL=C7010Z
+ [V0] Vendor specific: PCIeGen4 x2
+ [VU] Vendor specific: MT2503603H5VMLNXS0D0F0
+ [RV] Reserved: checksum good, 2 byte(s) reserved
+ End
+ Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
+ Vector table: BAR=0 offset=00002000
+ PBA: BAR=0 offset=00003000
+ Capabilities: [c0] Vendor Specific Information: Len=18 <?>
+ Capabilities: [40] Power Management version 3
+ Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
+ Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
+ Capabilities: [100 v1] Advanced Error Reporting
+ UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
+ UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC+ UnsupReq+ ACSViol-
+ UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
+ CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
+ AERCap: First Error Pointer: 08, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
+ MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
+ HeaderLog: 00000000 00000000 00000000 00000000
+ Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
+ ARICap: MFVC- ACS-, Next Function: 1
+ ARICtl: MFVC- ACS-, Function Group: 0
+ Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
+ IOVCap: Migration-, Interrupt Message Number: 000
+ IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
+ IOVSta: Migration-
+ Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
+ VF offset: 4, stride: 1, Device ID: 101e
+ Supported Page Size: 000007ff, System Page Size: 00000001
+ Region 0: Memory at 000020fffc000000 (64-bit, prefetchable)
+ VF Migration: offset: 00000000, BIR: 0
+ Capabilities: [1c0 v1] Secondary PCI Express
+ LnkCtl3: LnkEquIntrruptEn- PerformEqu-
+ LaneErrStat: 0
+ Capabilities: [230 v1] Access Control Services
+ ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
+ ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
+ Capabilities: [240 v1] Precision Time Measurement
+ PTMCap: Requester:+ Responder:- Root:-
+ PTMClockGranularity: Unimplemented
+ PTMControl: Enabled:+ RootSelected:-
+ PTMEffectiveGranularity: 2ns
+ Capabilities: [320 v1] Lane Margining at the Receiver <?>
+ Capabilities: [370 v1] Physical Layer 16.0 GT/s <?>
+ Capabilities: [420 v1] Data Link Feature <?>
+ Kernel driver in use: mlx5_core
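
The commit message identifies a Limited Physical Function port by the `SMDL` entry in the device's Vital Product Data, which appears in the `[VA] Vendor specific:` line of the fixture above as `SMDL=SW_MNG`. How the wrapper actually reads VPD is not visible in this view; the sketch below only shows how that one line could be decoded from `lspci -vv`-style text. Both function names are hypothetical.

```go
package main

import "strings"

// vpdVendorFields (hypothetical helper) finds the "[VA] Vendor specific:"
// line in lspci-style text and splits its colon-separated payload into
// key=value pairs. Segments without "=" (such as the leading "MLX") are skipped.
func vpdVendorFields(lspciOut string) map[string]string {
	fields := map[string]string{}
	for _, line := range strings.Split(lspciOut, "\n") {
		_, value, ok := strings.Cut(line, "[VA] Vendor specific:")
		if !ok {
			continue
		}
		for _, kv := range strings.Split(strings.TrimSpace(value), ":") {
			if k, v, ok := strings.Cut(kv, "="); ok {
				fields[k] = v
			}
		}
	}
	return fields
}

// isSwitchManagementPort reports whether the VPD carries SMDL=SW_MNG, the
// marker the commit message describes for an NVSwitch LPF port.
func isSwitchManagementPort(lspciOut string) bool {
	return vpdVendorFields(lspciOut)["SMDL"] == "SW_MNG"
}
```

For the fixture above, `isSwitchManagementPort` returns true; a device whose `[VA]` line has no `SMDL` entry is treated as an ordinary InfiniBand port.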
407 Bytes
Binary file not shown.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+ module nvidia-fabricmanager-wrapper
+
+ go 1.23.0
+
+ require github.com/goaux/decowriter v1.0.0
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+ github.com/goaux/decowriter v1.0.0 h1:f1mfBWGFIo3Upev3gswfGLQzQvC4SBVYi2ZAkNZsIaU=
+ github.com/goaux/decowriter v1.0.0/go.mod h1:8GKUmiBlNCYxVHU2vlZoQHwLvYh7Iw1c7/tRekJbX7o=
