Conversation

mseeger
Contributor

@mseeger mseeger commented May 30, 2025

The existing test assumes that parameter names for Gemma 3 start with "vision_tower", "language_model", or "multi_modal_projector", but they actually start with "model.vision_tower", "model.language_model", and "model.multi_modal_projector".

I have transformers==4.52.4, which is the most recent release.
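A minimal sketch of how the conversion could tolerate both naming schemes; the helper below is hypothetical and not the actual convert_hf_checkpoint code:

# Hypothetical helper: accept both "vision_tower..." and "model.vision_tower..."
# parameter names when converting a Gemma 3 HF checkpoint.
GEMMA3_PREFIXES = ("vision_tower", "language_model", "multi_modal_projector")

def normalize_gemma3_name(name: str) -> str:
    # Strip a leading "model." so newer transformers checkpoints match the old names.
    prefix = "model."
    if name.startswith(prefix) and name[len(prefix):].startswith(GEMMA3_PREFIXES):
        return name[len(prefix):]
    return name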

@mseeger mseeger requested review from lantiga, t-vi and Borda as code owners May 30, 2025 15:19
@Borda Borda changed the title Fix in convert_hf_checkpoint.py related to Gemma 3 Fix in convert_hf_checkpoint related to Gemma 3 May 31, 2025
@Borda
Member

Borda commented May 31, 2025

I have transformers==4.52.4, which is the most recent release

Can we keep this backward compatible, i.e., have the expected prefixes depend on the transformers version?
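A rough sketch of what such version gating could look like; the 4.52 cut-over used here is an assumption, and the exact transformers release that introduced the "model." prefix would need to be verified:

# Assumed cut-over at transformers 4.52; adjust once the actual version is confirmed.
from packaging.version import Version

import transformers

if Version(transformers.__version__) >= Version("4.52"):
    expected_prefixes = ("model.vision_tower", "model.language_model", "model.multi_modal_projector")
else:
    expected_prefixes = ("vision_tower", "language_model", "multi_modal_projector")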

@mseeger mseeger force-pushed the fix_test branch 3 times, most recently from 297b047 to 9596dcd on May 31, 2025 09:24
@mseeger
Contributor Author

mseeger commented May 31, 2025

It seems these two failing tests are flaky.

@Borda
Member

Borda commented Jun 4, 2025

It seems we may need to skip this test for our RTX 3090 (a possible skip condition is sketched after the log below)...

>       work = group.broadcast([tensor], opts)
E       torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
E       ncclUnhandledCudaError: Call to CUDA function failed.
E       Last error:
E       Cuda failure 217 'peer access is not supported between these two devices'

/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2714: DistBackendError
------------------------------ Captured log call -------------------------------
INFO     lightning.pytorch.utilities.rank_zero:cuda.py:166 You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
INFO     lightning.fabric.utilities.distributed:distributed.py:297 Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
INFO     lightning.pytorch.utilities.rank_zero:distributed.py:305 ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
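One possible skip condition, assuming the failure really is about missing peer access rather than the GPU model itself; the check and test name below are illustrative, not the actual test code:

import pytest
import torch

def _p2p_unsupported() -> bool:
    # True when there are fewer than two GPUs or device 0 cannot access device 1.
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return True
    return not torch.cuda.can_device_access_peer(0, 1)

@pytest.mark.skipif(_p2p_unsupported(), reason="peer access is not supported between these two devices")
def test_two_gpu_broadcast():
    ...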

@t-vi
Collaborator

t-vi commented Jun 4, 2025

Maybe setting NCCL_IGNORE_DISABLED_P2P=1 helps? Not sure why we're seeing this failure at this specific point.
(e.g. vllm-project/vllm#406 (comment))
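A minimal sketch of applying that workaround before the process group is created; placing it in a conftest.py is an assumption, not where litgpt actually sets environment variables:

import os

# Must run before torch.distributed initializes NCCL.
os.environ.setdefault("NCCL_IGNORE_DISABLED_P2P", "1")

The shell equivalent in the CI job would be export NCCL_IGNORE_DISABLED_P2P=1.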

@Borda Borda mentioned this pull request Jun 10, 2025
@t-vi
Collaborator

t-vi commented Jun 10, 2025

would NCCL_DEBUG=INFO give interesting information?

@Borda
Member

Borda commented Jun 11, 2025

would NCCL_DEBUG=INFO give interesting information?

Here is the failing run with NCCL_DEBUG=INFO:

======================== 1 failed, 3 warnings in 8.40s =========================
7996498d3326:10492:10492 [1] NCCL INFO cudaDriverVersion 12080
7996498d3326:10492:10492 [1] NCCL INFO Bootstrap: Using eth0:172.18.0.2<0>
7996498d3326:10492:10492 [1] NCCL INFO NCCL version 2.26.2+cuda12.2
7996498d3326:10492:10492 [1] NCCL INFO Comm config Blocking set to 1
7996498d3326:10492:11102 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
7996498d3326:10492:11102 [1] NCCL INFO NET/IB : No device found.
7996498d3326:10492:11102 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.18.0.2<0>
7996498d3326:10492:11102 [1] NCCL INFO NET/Socket : Using [0]eth0:172.18.0.2<0>
7996498d3326:10492:11102 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
7996498d3326:10492:11102 [1] NCCL INFO Using network Socket
7996498d3326:10492:11102 [1] NCCL INFO ncclCommInitRankConfig comm 0x2c537d10 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 25000 commId 0xbd604eb987af1e4f - Init START
7996498d3326:10492:11102 [1] NCCL INFO RAS client listening socket at ::1<28028>
7996498d3326:10492:11102 [1] NCCL INFO Bootstrap timings total 0.041625 (create 0.000067, send 0.000205, recv 0.040397, ring 0.000063, delay 0.000002)
7996498d3326:10492:11102 [1] NCCL INFO NCCL_IGNORE_DISABLED_P2P set by environment to 1.
7996498d3326:10492:11102 [1] NCCL INFO Setting affinity for GPU 1 to ffff,0000ffff
7996498d3326:10492:11102 [1] NCCL INFO comm 0x2c537d10 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
7996498d3326:10492:11102 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
7996498d3326:10492:11102 [1] NCCL INFO P2P Chunksize set to 131072
7996498d3326:10492:11108 [1] NCCL INFO [Proxy Service] Device 1 CPU core 6
7996498d3326:10492:11111 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 5
7996498d3326:10492:11102 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
7996498d3326:10492:11102 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
7996498d3326:10492:11102 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
7996498d3326:10492:11102 [1] NCCL INFO ncclCommInitRankConfig comm 0x2c537d10 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 25000 commId 0xbd604eb987af1e4f - Init COMPLETE
7996498d3326:10492:11102 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.46 (kernels 0.24, alloc 0.15, bootstrap 0.04, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.01, rest 0.00)
7996498d3326:10492:11114 [1] NCCL INFO Channel 00 : 1[1] -> 0[6] via SHM/direct/direct
7996498d3326:10492:11114 [1] NCCL INFO Channel 01 : 1[1] -> 0[6] via SHM/direct/direct

[2025-06-11 09:33:24] 7996498d3326:10492:11114 [1] transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices'
7996498d3326:10492:11114 [1] NCCL INFO transport/shm.cc:169 -> 1
7996498d3326:10492:11114 [1] NCCL INFO transport.cc:197 -> 1
7996498d3326:10492:11114 [1] NCCL INFO transport/generic.cc:19 -> 1
7996498d3326:10492:11114 [1] NCCL INFO group.cc:148 -> 1
7996498d3326:10492:11114 [1] NCCL INFO group.cc:75 -> 1 [Async thread]
7996498d3326:10492:10492 [1] NCCL INFO group.cc:460 -> 1
7996498d3326:10492:10492 [1] NCCL INFO group.cc:581 -> 1
7996498d3326:10492:10492 [1] NCCL INFO enqueue.cc:2299 -> 1
7996498d3326:10492:11108 [1] NCCL INFO misc/socket.cc:881 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:64 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:80 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:829 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:64 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:80 -> 3
7996498d3326:10492:11132 [1] NCCL INFO misc/socket.cc:829 -> 3
7996498d3326:10492:11108 [1] NCCL INFO misc/socket.cc:881 -> 3
7996498d3326:10492:11134 [1] NCCL INFO comm 0x2c537d10 rank 1 nranks 2 cudaDev 1 busId 25000 - Abort COMPLETE

@Borda
Member

Borda commented Jun 11, 2025

Wondering why the failure reports NCCL 2.26 when we have 2.24.3 installed.
cc: @t-vi @k223kim
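PyTorch wheels ship their own NCCL, so the 2.26.2 in the log most likely comes from the bundled library rather than the system-installed 2.24.3. A quick way to check what the installed torch reports:

import torch

print(torch.__version__)
print(torch.cuda.nccl.version())  # e.g. (2, 26, 2) for the NCCL torch was built with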

@Borda Borda merged commit 3d33a05 into Lightning-AI:main Jun 16, 2025
24 checks passed
mseeger added a commit to mseeger/litgpt that referenced this pull request Jul 4, 2025
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka B <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>