Commit 2c717e0
feat(nvidia-fabricmanager): add support for HGX B200/B100 baseboards
Quoting: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

> On NVIDIA DGX-B200, HGX-B200, and HGX-B100 systems and later, the FabricManager package needs an additional NVLSM dependency for proper operation.
> NVLink Subnet Manager (NVLSM) originated from InfiniBand networking and contains additional logic to program NVSwitches and NVLinks.

Not running NVLSM on Blackwell and newer fabrics with NVLink 5.0+ causes FabricManager to fail to start with error `NV_WARN_NOTHING_TO_DO`. NVSwitches remain uninitialized, and applications fail with `CUDA_ERROR_SYSTEM_NOT_READY` (`cudaErrorSystemNotReady`), because CUDA initialization can only begin after the GPUs complete their registration with the NVLink fabric.

A GPU's fabric registration status can be verified with the command `nvidia-smi -q -i 0 | grep -i -A 2 Fabric`. An `In Progress` state indicates that the GPU is still registering and that FabricManager is likely not running or missing the NVLSM dependency. A `Completed` state is shown when the GPU is successfully registered with the NVLink fabric.

The FabricManager package includes the script `nvidia-fabricmanager-start.sh`, which selectively starts the FabricManager and NVLSM processes depending on the underlying platform. A key step in determining whether an NVLink 5.0+ fabric is present is looking for a Limited Physical Function (LPF) port among the InfiniBand devices. To differentiate LPFs, the Vital Product Data (VPD) information includes a vendor-specific field called `SMDL`, with a non-zero value defined as `SW_MNG`. The first such device is selected, and its port GUID is extracted and passed to both NVLSM and FabricManager, so the two services share a configuration key that results from a common initialization process. Additionally, they communicate with each other over a Unix socket.

This patch introduces the following changes:

* Adds NVLSM to the nvidia-fabricmanager extension.
* Introduces a new `nvidia-fabricmanager-wrapper` program to replicate the initialization process from `nvidia-fabricmanager-start.sh`:
  * Detects NVLink 5.0+ fabrics and extracts an NVSwitch LPF port GUID. This is done by calling libibumad directly via CGO instead of parsing the output of the `ibstat` command as the upstream script does.
  * Starts FabricManager always, and NVLSM only when needed.
  * Keeps both process lifecycles synchronized and ensures the Talos container will restart if either process crashes.
* Refactors the nvidia-fabricmanager container to be self-contained, as this service does not share files with other nvidia-gpu extensions.
1 parent 1efc06b commit 2c717e0

File tree

12 files changed: +526 additions, -80 deletions

go.work

Lines changed: 1 addition & 0 deletions
@@ -4,4 +4,5 @@ use (
 	./examples/hello-world-service/src
 	./nvidia-gpu/nvidia-container-toolkit/nvidia-container-runtime-wrapper
 	./nvidia-gpu/nvidia-container-toolkit/nvidia-persistenced-wrapper
+	./nvidia-gpu/nvidia-fabricmanager/nvidia-fabricmanager-wrapper
 )
Lines changed: 17 additions & 0 deletions
CA 'mlx5_0'
	CA type: MT4129
	Number of ports: 1
	Firmware version: 28.42.1274
	Hardware version: 0
	Node GUID: 0xe09d730300e400e8
	System image GUID: 0xe09d730300e400e8
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 1
		LMC: 0
		SM lid: 1
		Capability mask: 0xa751e84a
		Port GUID: 0xe09d730300e400e8
		Link layer: InfiniBand
Lines changed: 94 additions & 0 deletions
05:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
	Subsystem: Mellanox Technologies Device 0087
	Physical Slot: 14
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	NUMA node: 0
	IOMMU group: 154
	Region 0: Memory at 20fff4000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at a9400000 [disabled] [size=1M]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
		DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap: Port #0, Speed 16GT/s, Width x2, ASPM not supported
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed 8GT/s (downgraded), Width x2 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
			10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt- EETLPPrefix-
			EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			FRS- TPHComp- ExtTPHComp-
			AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+
		DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis- LTR- OBFF Disabled,
			AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [48] Vital Product Data
		Product Name: Nvidia ConnectX-7 mezz internal for Nvidia Umbriel system
		Read-only fields:
			[PN] Part number: 692-9X760-00SE-S00
			[EC] Engineering changes: A4
			[V2] Vendor specific: 692-9X760-00SE-S00
			[SN] Serial number: MT2503603H5V
			[V3] Vendor specific: 96375e6109d7ef118000e09d73e400e8
			[VA] Vendor specific: MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:SMDL=SW_MNG:MODL=C7010Z
			[V0] Vendor specific: PCIeGen4 x2
			[VU] Vendor specific: MT2503603H5VMLNXS0D0F0
			[RV] Reserved: checksum good, 2 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC+ UnsupReq+ ACSViol-
		UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		AERCap: First Error Pointer: 08, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap: MFVC- ACS-, Next Function: 1
		ARICtl: MFVC- ACS-, Function Group: 0
	Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
		IOVCap: Migration-, Interrupt Message Number: 000
		IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
		IOVSta: Migration-
		Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
		VF offset: 4, stride: 1, Device ID: 101e
		Supported Page Size: 000007ff, System Page Size: 00000001
		Region 0: Memory at 000020fffc000000 (64-bit, prefetchable)
		VF Migration: offset: 00000000, BIR: 0
	Capabilities: [1c0 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [230 v1] Access Control Services
		ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [240 v1] Precision Time Measurement
		PTMCap: Requester:+ Responder:- Root:-
		PTMClockGranularity: Unimplemented
		PTMControl: Enabled:+ RootSelected:-
		PTMEffectiveGranularity: 2ns
	Capabilities: [320 v1] Lane Margining at the Receiver <?>
	Capabilities: [370 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [420 v1] Data Link Feature <?>
	Kernel driver in use: mlx5_core
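The `[VA]` vendor-specific field in the dump above is a colon-separated list of `KEY=VALUE` pairs, and the `SMDL` entry is what marks an LPF. A minimal parser for that field (the helper name is illustrative; the wrapper itself only does a substring match):

```go
package main

import (
	"fmt"
	"strings"
)

// parseVAField splits a Mellanox [VA] vendor-specific VPD string such as
// "MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:SMDL=SW_MNG:MODL=C7010Z"
// into a key/value map. Tokens without '=' (like the leading "MLX") are skipped.
func parseVAField(va string) map[string]string {
	fields := map[string]string{}
	for _, tok := range strings.Split(va, ":") {
		if k, v, ok := strings.Cut(tok, "="); ok {
			fields[k] = v
		}
	}
	return fields
}

func main() {
	va := "MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:SMDL=SW_MNG:MODL=C7010Z"
	fields := parseVAField(va)
	fmt.Println(fields["SMDL"]) // SW_MNG identifies a Limited Physical Function
}
```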
407 Bytes
Binary file not shown.
Lines changed: 5 additions & 0 deletions
module nvidia-fabricmanager-wrapper

go 1.23.0

require github.com/goaux/decowriter v1.0.0
Lines changed: 2 additions & 0 deletions
github.com/goaux/decowriter v1.0.0 h1:f1mfBWGFIo3Upev3gswfGLQzQvC4SBVYi2ZAkNZsIaU=
github.com/goaux/decowriter v1.0.0/go.mod h1:8GKUmiBlNCYxVHU2vlZoQHwLvYh7Iw1c7/tRekJbX7o=
Lines changed: 122 additions & 0 deletions
// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at http://mozilla.org/MPL/2.0/.

package main

import (
	"bufio"
	"context"
	"fmt"
	"log"
	"os"
	"os/exec"
	"os/signal"
	"path/filepath"
	"strings"
	"sync"
	"syscall"
	"time"

	"github.com/goaux/decowriter"
)

const (
	// FabricManager
	fmCmdFile      = "/usr/local/bin/nv-fabricmanager"
	fmConfigFile   = "/usr/local/share/nvidia/nvswitch/fabricmanager.cfg"
	fmStopDeadline = 5 * time.Second

	// NVLSM
	smCmdFile      = "/usr/bin/nvlsm"
	smConfigFile   = "/usr/share/nvidia/nvlsm/nvlsm.cfg"
	smStateFolder  = "/var/run/nvidia-nvlsm"
	smPidFile      = smStateFolder + "/" + "nvlsm.pid"
	smStopDeadline = 5 * time.Second
)

func runCommand(ctx context.Context, wg *sync.WaitGroup, doneCb func(), waitDelay time.Duration, path string, arg ...string) {
	wg.Add(1)

	cmd := exec.CommandContext(ctx, path, arg...)
	cmd.WaitDelay = waitDelay
	cmd.Cancel = func() error {
		return cmd.Process.Signal(os.Interrupt)
	}

	// TODO: move the line-prefixing writers into a log module.
	name := filepath.Base(path)
	cmd.Stdout = decowriter.New(bufio.NewWriter(os.Stdout), []byte(name+": "), []byte{})
	cmd.Stderr = decowriter.New(bufio.NewWriter(os.Stderr), []byte(name+": "), []byte{})

	go func() {
		log.Printf("nvidia-fabricmanager-wrapper: running command: %s %s\n", path, strings.Join(arg, " "))

		err := cmd.Run()
		if err == nil {
			log.Printf("nvidia-fabricmanager-wrapper: command %s [%d] completed successfully\n", path, cmd.Process.Pid)
		} else if exitErr, ok := err.(*exec.ExitError); ok {
			if exitErr.Exited() {
				log.Printf("nvidia-fabricmanager-wrapper: command %s [%d] exited with code %d\n", path, exitErr.Pid(),
					exitErr.ExitCode())
			} else {
				log.Printf("nvidia-fabricmanager-wrapper: command %s [%d] was terminated\n", path, exitErr.Pid())
			}
		} else {
			log.Printf("nvidia-fabricmanager-wrapper: failed to run command %s: %v\n", path, err)
		}

		wg.Done()
		doneCb()
	}()
}

func main() {
	var cmdWg sync.WaitGroup

	signal.Ignore(syscall.SIGHUP)

	runCtx, gracefulShutdown := context.WithCancel(context.Background())

	signalsChan := make(chan os.Signal, 1)
	signal.Notify(signalsChan, os.Interrupt)
	signal.Notify(signalsChan, syscall.SIGTERM)

	go func() {
		received := <-signalsChan
		signal.Stop(signalsChan)
		log.Printf("nvidia-fabricmanager-wrapper: received signal '%s', initiating a graceful shutdown\n", received.String())
		gracefulShutdown()
	}()

	nvswitchPorts := findNvswitchMgmtPorts()
	for _, port := range nvswitchPorts {
		log.Printf("nvidia-fabricmanager-wrapper: found NVSwitch LPF: device=%s guid=0x%x\n", port.IBDevice, port.PortGUID)
	}

	fmSmMgmtPortGUID := ""
	if len(nvswitchPorts) > 0 {
		fmSmMgmtPortGUID = fmt.Sprintf("0x%x", nvswitchPorts[0].PortGUID)
		log.Printf("nvidia-fabricmanager-wrapper: using NVSwitch management port GUID: %s\n", fmSmMgmtPortGUID)
	} else {
		log.Println("nvidia-fabricmanager-wrapper: no InfiniBand NVSwitch detected. On Blackwell HGX baseboards and newer",
			"with NVLink 5.0+, please load kernel module 'ib_umad' for NVLSM to run alongside FabricManager. Otherwise it will",
			"fail to start with error NV_WARN_NOTHING_TO_DO, and GPU workloads will report CUDA_ERROR_SYSTEM_NOT_READY.")
	}

	if fmSmMgmtPortGUID != "" {
		runCommand(runCtx, &cmdWg, gracefulShutdown, smStopDeadline, smCmdFile, "--config", smConfigFile,
			"--guid", fmSmMgmtPortGUID, "--pid_file", smPidFile, "--log_file", "stdout")
	}

	fmCmdArgs := []string{"--config", fmConfigFile}
	if fmSmMgmtPortGUID != "" {
		fmCmdArgs = append(fmCmdArgs, "--fm-sm-mgmt-port-guid", fmSmMgmtPortGUID)
	}
	runCommand(runCtx, &cmdWg, gracefulShutdown, fmStopDeadline, fmCmdFile, fmCmdArgs...)

	log.Println("nvidia-fabricmanager-wrapper: initialization completed")
	cmdWg.Wait()
}
Lines changed: 106 additions & 0 deletions
// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at http://mozilla.org/MPL/2.0/.

package main

// #cgo LDFLAGS: -libumad
// #include <linux/types.h> /* __be64 */
// #include <infiniband/umad.h>
import "C"

import (
	"bytes"
	"encoding/binary"
	"os"
	"path"
	"unsafe"
)

type NVSwitchMgmtPort struct {
	IBDevice string
	PortGUID uint64
}

/*
Find InfiniBand devices with the capability to configure NVSwitches.

---
From: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

The CX7 bridge device is integrated into the GPU baseboard, which includes two physical ports. Each port exposes one
physical function (FC PF) and one Limited physical function (LPF) to the host system, which totals four PFs. The PFs
are categorized into the following PFs:

  - Limited PFs (LPF) are designated for specific tasks in the system.
    They are used by the FM and the NVLSM to configure and set up NVSwitches, GPU, and NVLink routing information.
    LPFs are also used by telemetry agents, such as NVIBDM and DCGM, to monitor and collect data. Resetting this
    PF with FLR also resets the corresponding NVSwitch device.

To differentiate between LPFs and FC PFs, the LPF VPD information includes a vendor-specific field called SMDL, with
a non-zero value defined as SW_MNG. For bare-metal, full pass-through, and shared NVSwitch deployments, the prelaunch
script in the FM service unit file will run and query the available CX7 devices for this VPD information. The file
populates the required FM and NVLSM configuration values so that these communication entities can access the relevant
devices.
*/
func findLpfDevices() (devices []string) {
	const ibPath = "/sys/class/infiniband"

	devDir, err := os.ReadDir(ibPath)
	if err != nil {
		return
	}

	for _, device := range devDir {
		vpd, err := os.ReadFile(path.Join(ibPath, device.Name(), "device/vpd"))
		if err != nil {
			continue
		}

		// TODO: parse the VPD structure like lspci does instead of a substring match.
		if bytes.Contains(vpd, []byte("SMDL=SW_MNG")) {
			devices = append(devices, device.Name())
		}
	}

	return
}

func findNvswitchMgmtPorts() (ports []NVSwitchMgmtPort) {
	lpfDevs := findLpfDevices()
	if len(lpfDevs) == 0 {
		return
	}

	if C.umad_init() < 0 {
		return
	}

	for _, lpf := range lpfDevs {
		const maxPorts = 16
		var portGUIDs [maxPorts]C.__be64

		/*
			$ man 3 umad_get_ca_portguids

			On success, umad_get_ca_portguids() returns a non-negative value equal to the number of port GUIDs actually
			filled. Not all filled entries may be valid. Invalid entries will be 0. For example, on a CA node with only
			one port, this function returns a value of 2. In this case, the value at index 0 will be invalid as it is
			reserved for switches. On failure, a negative value is returned.
		*/
		numPort := C.umad_get_ca_portguids(C.CString(lpf), &portGUIDs[0], maxPorts)

		for i := range int(numPort) {
			var guid uint64

			// Convert kernel __be64 (big-endian) to host-order uint64.
			buf := bytes.NewReader((*[8]byte)(unsafe.Pointer(&portGUIDs[i]))[:])
			if err := binary.Read(buf, binary.BigEndian, &guid); err != nil {
				continue
			}

			if guid != 0 {
				ports = append(ports, NVSwitchMgmtPort{lpf, guid})
			}
		}
	}

	C.umad_done()
	return
}
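The `__be64`-to-`uint64` conversion above can also be written without a reader by viewing the value's storage as bytes and decoding with `binary.BigEndian.Uint64`. A pure-Go sketch (no CGO; a raw 8-byte type stands in for `C.__be64`):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

// be64 stands in for the kernel's __be64: a 64-bit value stored big-endian.
type be64 [8]byte

// beToUint64 converts a big-endian 64-bit value to a host-order uint64 by
// reinterpreting its storage as a byte array, the same unsafe.Pointer trick
// the wrapper applies to C.__be64 (here a plain v[:] would also work, since
// be64 is already a byte array).
func beToUint64(v be64) uint64 {
	return binary.BigEndian.Uint64((*[8]byte)(unsafe.Pointer(&v))[:])
}

func main() {
	// The port GUID 0xe09d730300e400e8 from the ibstat fixture, laid out
	// big-endian in memory as the kernel would hand it back.
	guid := be64{0xe0, 0x9d, 0x73, 0x03, 0x00, 0xe4, 0x00, 0xe8}
	fmt.Printf("0x%x\n", beToUint64(guid)) // prints 0xe09d730300e400e8
}
```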
