Conversation

mseeger
Contributor

@mseeger mseeger commented May 30, 2025

The existing test assumes that parameter names for Gemma 3 start with "vision_tower", "language_model", or "multi_modal_projector", but they actually start with "model.vision_tower", "model.language_model", and "model.multi_modal_projector".

I have transformers==4.52.4, which is the most recent release.
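A minimal sketch of how the conversion could tolerate both naming schemes; the helper below is hypothetical and not the actual convert_hf_checkpoint code:

# Hypothetical helper: accept both "vision_tower..." and "model.vision_tower..."
# parameter names when converting a Gemma 3 HF checkpoint.
GEMMA3_PREFIXES = ("vision_tower", "language_model", "multi_modal_projector")

def normalize_gemma3_name(name: str) -> str:
    # Strip a leading "model." so newer transformers checkpoints match the old names.
    prefix = "model."
    if name.startswith(prefix) and name[len(prefix):].startswith(GEMMA3_PREFIXES):
        return name[len(prefix):]
    return name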

@mseeger mseeger requested review from lantiga, t-vi and Borda as code owners May 30, 2025 15:19
@Borda Borda changed the title Fix in convert_hf_checkpoint.py related to Gemma 3 Fix in convert_hf_checkpoint related to Gemma 3 May 31, 2025
@Borda
Member

Borda commented May 31, 2025

I have transformers==4.52.4, which is the most recent release

Can we keep this backward compatible, i.e., have the expected prefixes depend on the transformers version?
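A rough sketch of what such version gating could look like; the 4.52 cut-over used here is an assumption, and the exact transformers release that introduced the "model." prefix would need to be verified:

# Assumed cut-over at transformers 4.52; adjust once the actual version is confirmed.
from packaging.version import Version

import transformers

if Version(transformers.__version__) >= Version("4.52"):
    expected_prefixes = ("model.vision_tower", "model.language_model", "model.multi_modal_projector")
else:
    expected_prefixes = ("vision_tower", "language_model", "multi_modal_projector")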

@mseeger mseeger force-pushed the fix_test branch 3 times, most recently from 297b047 to 9596dcd on May 31, 2025 09:24
@mseeger
Contributor Author

mseeger commented May 31, 2025

It seems these two failing tests are flaky.

@Borda
Member

Borda commented Jun 4, 2025

It seems we may need to skip this test for our RTX 3090 (a possible skip condition is sketched after the log below)...

>       work = group.broadcast([tensor], opts)
E       torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
E       ncclUnhandledCudaError: Call to CUDA function failed.
E       Last error:
E       Cuda failure 217 'peer access is not supported between these two devices'

/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2714: DistBackendError
------------------------------ Captured log call -------------------------------
INFO     lightning.pytorch.utilities.rank_zero:cuda.py:166 You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
INFO     lightning.fabric.utilities.distributed:distributed.py:297 Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
INFO     lightning.pytorch.utilities.rank_zero:distributed.py:305 ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
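One possible skip condition, assuming the failure really is about missing peer access rather than the GPU model itself; the check and test name below are illustrative, not the actual test code:

import pytest
import torch

def _p2p_unsupported() -> bool:
    # True when there are fewer than two GPUs or device 0 cannot access device 1.
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return True
    return not torch.cuda.can_device_access_peer(0, 1)

@pytest.mark.skipif(_p2p_unsupported(), reason="peer access is not supported between these two devices")
def test_two_gpu_broadcast():
    ...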

@t-vi
Collaborator

t-vi commented Jun 4, 2025

Maybe setting NCCL_IGNORE_DISABLED_P2P=1 helps? Not sure why we're seeing this failure at this specific point.
(e.g. vllm-project/vllm#406 (comment))
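A minimal sketch of applying that workaround before the process group is created; placing it in a conftest.py is an assumption, not where litgpt actually sets environment variables:

import os

# Must run before torch.distributed initializes NCCL.
os.environ.setdefault("NCCL_IGNORE_DISABLED_P2P", "1")

The shell equivalent in the CI job would be export NCCL_IGNORE_DISABLED_P2P=1.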

@Borda Borda mentioned this pull request Jun 10, 2025
@t-vi
Collaborator

t-vi commented Jun 10, 2025

would NCCL_DEBUG=INFO give interesting information?

@Borda
Member

Borda commented Jun 11, 2025

would NCCL_DEBUG=INFO give interesting information?

Here is the failing run with NCCL_DEBUG=INFO:

======================== 1 failed, 3 warnings in 8.40s =========================
7996498d3326:10492:10492 [1] NCCL INFO cudaDriverVersion 12080
7996498d3326:10492:10492 [1] NCCL INFO Bootstrap: Using eth0:172.18.0.2<0>
7996498d3326:10492:10492 [1] NCCL INFO NCCL version 2.26.2+cuda12.2
7996498d3326:10492:10492 [1] NCCL INFO Comm config Blocking set to 1
7996498d3326:10492:11102 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
7996498d3326:10492:11102 [1] NCCL INFO NET/IB : No device found.
7996498d3326:10492:11102 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.18.0.2<0>
7996498d3326:10492:11102 [1] NCCL INFO NET/Socket : Using [0]eth0:172.18.0.2<0>
7996498d3326:10492:11102 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
7996498d3326:10492:11102 [1] NCCL INFO Using network Socket
7996498d3326:10492:11102 [1] NCCL INFO ncclCommInitRankConfig comm 0x2c537d10 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 25000 commId 0xbd604eb987af1e4f - Init START
7996498d3326:10492:11102 [1] NCCL INFO RAS client listening socket at ::1<28028>
7996498d3326:10492:11102 [1] NCCL INFO Bootstrap timings total 0.041625 (create 0.000067, send 0.000205, recv 0.040397, ring 0.000063, delay 0.000002)
7996498d3326:10492:11102 [1] NCCL INFO NCCL_IGNORE_DISABLED_P2P set by environment to 1.
7996498d3326:10492:11102 [1] NCCL INFO Setting affinity for GPU 1 to ffff,0000ffff
7996498d3326:10492:11102 [1] NCCL INFO comm 0x2c537d10 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
7996498d3326:10492:11102 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
7996498d3326:10492:11102 [1] NCCL INFO P2P Chunksize set to 131072
7996498d3326:10492:11108 [1] NCCL INFO [Proxy Service] Device 1 CPU core 6
7996498d3326:10492:11111 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 5
7996498d3326:10492:11102 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
7996498d3326:10492:11102 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
7996498d3326:10492:11102 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
7996498d3326:10492:11102 [1] NCCL INFO ncclCommInitRankConfig comm 0x2c537d10 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 25000 commId 0xbd604eb987af1e4f - Init COMPLETE
7996498d3326:10492:11102 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.46 (kernels 0.24, alloc 0.15, bootstrap 0.04, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.01, rest 0.00)
7996498d3326:10492:11114 [1] NCCL INFO Channel 00 : 1[1] -> 0[6] via SHM/direct/direct
7996498d3326:10492:11114 [1] NCCL INFO Channel 01 : 1[1] -> 0[6] via SHM/direct/direct

[2025-06-11 09:33:24] 7996498d3326:10492:11114 [1] transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices'
7996498d3326:10492:11114 [1] NCCL INFO transport/shm.cc:169 -> 1
7996498d3326:10492:11114 [1] NCCL INFO transport.cc:197 -> 1
7996498d3326:10492:11114 [1] NCCL INFO transport/generic.cc:19 -> 1
7996498d3326:10492:11114 [1] NCCL INFO group.cc:148 -> 1
7996498d3326:10492:11114 [1] NCCL INFO group.cc:75 -> 1 [Async thread]
7996498d3326:10492:10492 [1] NCCL INFO group.cc:460 -> 1
7996498d3326:10492:10492 [1] NCCL INFO group.cc:581 -> 1
7996498d3326:10492:10492 [1] NCCL INFO enqueue.cc:2299 -> 1
7996498d3326:10492:11108 [1] NCCL INFO misc/socket.cc:881 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:64 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:80 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:829 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:64 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:80 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:829 -> 3
7996498d3326:10492:11108 [1] NCCL INFO misc/socket.cc:881 -> 3
7996498d3326:10492:11134 [1] NCCL INFO comm 0x2c537d10 rank 1 nranks 2 cudaDev 1 busId 25000 - Abort COMPLETE

@Borda
Member

Borda commented Jun 11, 2025

Wondering why the failure reports NCCL 2.26 when we have 2.24.3 installed.
cc: @t-vi @k223kim
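PyTorch wheels ship their own NCCL, so the 2.26.2 in the log most likely comes from the bundled library rather than the system-installed 2.24.3. A quick way to check what the installed torch reports:

import torch

print(torch.__version__)
print(torch.cuda.nccl.version())  # e.g. (2, 26, 2) for the NCCL torch was built with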

@Borda Borda merged commit 3d33a05 into Lightning-AI:main Jun 16, 2025
24 checks passed
mseeger added a commit to mseeger/litgpt that referenced this pull request Jul 4, 2025
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka B <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>