Fix #3982: Fix DPO Trainer support for Gemma 3 vision models #4022

akshay-babbar · 2025-09-06T12:47:08Z

What

This PR addresses an issue with the DPO Trainer's handling of vision-language models, specifically for Gemma 3. The changes enhance model type detection to properly support image-text-to-text models.

Fixes #3982

Changes made:

- Import MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES from transformers' modeling_auto
- Update the is_vision_model check to include both vision-to-sequence and image-text-to-text model types
- Add pixel_values and pixel_attention_mask to the signature columns to properly process vision inputs
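
For illustration, the detection and column changes can be sketched roughly as follows. This is a minimal, self-contained sketch and not the code in this PR: IMAGE_TEXT_TO_TEXT_NAMES is a stand-in for transformers' MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES with only two illustrative entries, and SIGNATURE_COLUMNS, is_vision_model() and drop_unused_columns() are hypothetical names introduced here.

```python
from types import SimpleNamespace

# Stand-in (assumption) for transformers' MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES,
# which maps model_type strings to class names. Two illustrative entries only.
IMAGE_TEXT_TO_TEXT_NAMES = {
    "gemma3": "Gemma3ForConditionalGeneration",
    "llava": "LlavaForConditionalGeneration",
}

# Hypothetical, shortened column list; the real trainer keeps more columns.
SIGNATURE_COLUMNS = [
    "prompt_input_ids",
    "chosen_input_ids",
    "rejected_input_ids",
    "pixel_values",
    "pixel_attention_mask",
]


def is_vision_model(model) -> bool:
    """Return True when config.model_type is a known image-text-to-text type."""
    config = getattr(model, "config", None)
    return getattr(config, "model_type", None) in IMAGE_TEXT_TO_TEXT_NAMES


def drop_unused_columns(row: dict, columns: list) -> dict:
    """Keep only the keys listed in `columns`, mimicking unused-column removal."""
    return {key: value for key, value in row.items() if key in columns}


# Lightweight fake models: only config.model_type matters for detection.
gemma = SimpleNamespace(config=SimpleNamespace(model_type="gemma3"))
text_only = SimpleNamespace(config=SimpleNamespace(model_type="gpt2"))
row = {"prompt_input_ids": [1, 2], "pixel_values": [[0.5]], "extra": "dropped"}

print(is_vision_model(gemma))
print(is_vision_model(text_only))
print(drop_unused_columns(row, SIGNATURE_COLUMNS))
```

If pixel_values were missing from the column list, the filter would silently drop the image tensors, which is the situation the third change is meant to avoid.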

Testing

- Unit test added: test_dpo_trainer_gemma3_vision_model_detection verifies that Gemma3 models are correctly identified as vision models and processed through the appropriate pipeline
- Integration verification: Confirmed fix resolves the original issue where DPOTrainer incorrectly routed Gemma3 models through tokenizer path instead of processor path
- End-to-end validation: Successfully instantiated DPOTrainer with Gemma3 vision models and verified is_vision_model=True

The test suite ensures that Gemma3 and other image-text-to-text models are properly detected and routed through the vision processing pipeline, preventing the processor/tokenizer confusion that caused training failures.
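
The routing described above can be pictured with a small sketch. It is an assumption-laden illustration, not TRL's implementation: fake_tokenizer and fake_processor are toy stand-ins for a real tokenizer and processor, and select_row_fn is a hypothetical helper that picks the row-preparation function from the is_vision_model flag.

```python
def fake_tokenizer(text: str) -> list:
    """Toy tokenizer (stand-in): one integer per whitespace-separated word."""
    return [len(word) for word in text.split()]


def fake_processor(text: str, images: list) -> dict:
    """Toy processor (stand-in): tokenizes text and emits one pixel row per image."""
    return {
        "input_ids": fake_tokenizer(text),
        "pixel_values": [[0.0] for _ in images],
    }


def tokenize_row(row: dict) -> dict:
    """Text-only path: images, if any, are ignored."""
    return {"prompt_input_ids": fake_tokenizer(row["prompt"])}


def process_row(row: dict) -> dict:
    """Vision path: the processor output keeps the pixel tensors."""
    out = fake_processor(row["prompt"], row["images"])
    return {"prompt_input_ids": out["input_ids"], "pixel_values": out["pixel_values"]}


def select_row_fn(is_vision_model: bool):
    """Hypothetical router: processor path for vision models, tokenizer path otherwise."""
    return process_row if is_vision_model else tokenize_row


row = {"prompt": "describe this image", "images": ["img0"]}

print(select_row_fn(True)(row))   # vision path keeps pixel_values
print(select_row_fn(False)(row))  # tokenizer path has no pixel_values
```

A model that is wrongly flagged as text-only takes the second branch, so its images never reach the model; this is the kind of misrouting the detection fix targets.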

Review Request 🙏

I've included unit tests and verified the fix resolves the original issue. Would appreciate maintainer review when possible.

Thanks!

qgallouedec · 2025-09-09T01:45:10Z

trl/trainer/dpo_trainer.py

-from transformers.models.auto.modeling_auto import MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES
+from transformers.models.auto.modeling_auto import (
+    MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES,
+    MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES,


MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES and AutoModelForVision2Seq are actually deprecated. Can you simply replace it with MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES?

Sure, done!

akshay-babbar added 2 commits September 6, 2025 18:01

Fix huggingface#3982: DPOTrainer: detect Gemma 3 as vision‑text (multimodal) and keep pixel tensors

1e1a13c

tests(DPOTrainer): assert Gemma‑3 is detected as vision‑text and pixel tensors are preserved.

c755d89

akshay-babbar marked this pull request as ready for review September 6, 2025 15:12

refactor(test): improve comments in vision-text model detection test

5ed1636

qgallouedec reviewed Sep 9, 2025

fix: replace deprecated Vision2Seq mapping with ImageTextToText

b0ea075

akshay-babbar requested a review from qgallouedec September 9, 2025 18:24
