Skip to content

Conversation

akshay-babbar
Copy link

@akshay-babbar akshay-babbar commented Sep 6, 2025

What

This PR addresses an issue with the DPO Trainer's handling of vision-language models, specifically for Gemma 3. The changes enhance model type detection to properly support image-text-to-text models.

Fixes #3982

Changes made:

  • Import MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES from transformers' modeling_auto
  • Update the is_vision_model check to include both vision-to-sequence and image-text-to-text model types
  • Add pixel_values and pixel_attention_mask to the signature columns to properly process vision inputs

Testing

  • Unit test added: test_dpo_trainer_gemma3_vision_model_detection in verifies that Gemma3 models are correctly identified as vision models and processed through the appropriate pipeline
    • Integration verification: Confirmed fix resolves the original issue where DPOTrainer incorrectly routed Gemma3 models through tokenizer path instead of processor path
    • End-to-end validation: Successfully instantiated DPOTrainer with Gemma3 vision models and verified is_vision_model=True

The test suite ensures that Gemma3 and other image-text-to-text models are properly detected and routed through the vision processing pipeline, preventing the processor/tokenizer confusion that caused training failures.

Review Request 🙏

I've included unit tests and verified the fix resolves the original issue. Would appreciate maintainer review when possible.

Thanks!

@akshay-babbar akshay-babbar marked this pull request as ready for review September 6, 2025 15:12
from transformers.models.auto.modeling_auto import MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES
from transformers.models.auto.modeling_auto import (
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES,
MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES and AutoModelForVision2Seq are actually deprecated, can you simply replace it by MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DPOTrainer uses tokenizer instead of processor for Gemma3 vision
2 participants