Conversation

davanstrien (Contributor):

Very nice package! For large datasets, or when running on powerful GPUs, it would be nice to be able to raise or lower the batch size for the embedding step. Currently it uses the Sentence Transformers (ST) default of 32. I've had embedding-atlas crash when running larger embedding models on my local machine; for those, it would be nice to run with a batch size of 8 or so.

This PR adds:

  • Add --batch-size option to CLI with default value of 32
  • Pass batch_size parameter through to compute_text_projection()
  • Update _projection_for_texts() to use SentenceTransformer's built-in batch_size parameter
  • Include batch_size in cache key for proper caching
  • Add documentation for the new parameter

This change shouldn't impact other parts of the project (the frontend, etc.), but I mostly looked at the Python side of the codebase, so I might have missed something.


This allows users to control memory usage and performance when processing embeddings.
Larger batch sizes use more memory but may be faster, while smaller batch sizes
use less memory at the cost of potentially slower processing.
fredhohman requested a review from donghaoren on August 13, 2025.
domoritz (Member) left a review:


Just one comment on the default value. Otherwise looks good.

```python
@click.option(
    "--batch-size",
    type=int,
    default=32,
)
```
domoritz (Member), commenting on the snippet above:


Could the default from ST ever change? Then it might make sense to set None as the default here and pass that through.

davanstrien (Contributor, author):


Yeah, maybe None is safer here. cc @tomaarsen: do you ever plan to change the ST default batch size?

tomaarsen (Aug 13, 2025):


No plans. Sentence Transformers is too mature to allow such backwards-incompatible changes. I imagine it'll most likely never happen.


davanstrien (Contributor, author):


@domoritz are you okay with leaving the batch-size default as 32 for now, in that case?

domoritz (Member):


Sure. That's okay if ST is stable.

Collaborator:


I think we should also add the batch_size parameter to compute_image_projection (which currently has a default of 16). So it may still be a good idea to default to None and use different defaults depending on the modality; in the future we might even be able to auto-adjust it based on available VRAM.

davanstrien (Contributor, author):


Agreed, that makes sense. Maybe we could default to None in the CLI, then pass 32 for text and 16 for images when it is None? It might also be worth logging a short message when batch_size is None, to let users know they can adjust it if they run out of memory/VRAM, or have a powerful GPU and want to speed things up.
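The fallback logic proposed here could be sketched like this. The function name, defaults table, and log wording are illustrative assumptions, not the project's actual code:

```python
import logging

logger = logging.getLogger("embedding_atlas")

# Hypothetical modality-specific defaults mirroring the discussion:
# 32 for text, 16 for images.
DEFAULT_BATCH_SIZES = {"text": 32, "image": 16}


def resolve_batch_size(batch_size, modality):
    """Fill in a modality-specific default when the CLI passed None."""
    if batch_size is not None:
        return batch_size
    default = DEFAULT_BATCH_SIZES[modality]
    logger.info(
        "Using default batch size %d for %s embeddings; pass --batch-size "
        "to lower it if you run out of memory/VRAM, or raise it to speed "
        "things up on a powerful GPU.",
        default,
        modality,
    )
    return default
```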

Collaborator:


Yes, both sound great!

davanstrien (Contributor, author):


Implemented:

  • batch size for images
  • default batch sizes for images/text (applied when None is passed to the CLI, i.e. the default behaviour)
  • more informative logging for users

- Add batch_size parameter to image processing functions
- Use None as CLI default with modality-specific defaults (32 for text, 16 for images)
- Add educational logging when using defaults
- Include batch_size in image cache keys
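Including batch_size in the cache keys means runs that differ only in batch size get distinct cache entries. A minimal, hypothetical key builder (not the project's actual caching code) could look like:

```python
import hashlib
import json


def projection_cache_key(model_name, texts, batch_size):
    # Hash every input that determines the cached result; batch_size is
    # included so runs with different batch sizes do not collide.
    payload = json.dumps(
        {"model": model_name, "texts": texts, "batch_size": batch_size},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```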
donghaoren (Collaborator) left a review:


Thank you for the contribution!

donghaoren merged commit caad819 into apple:main on Aug 13, 2025. 4 checks passed.