Conversation

davanstrien (Contributor):

Very nice package! For large datasets, or when running on powerful GPUs, it would be nice to be able to raise or lower the batch size for the embedding step. Currently it uses the Sentence Transformers (ST) default of 32. I've had embedding-atlas crash when running larger embedding models on my local machine; for those, it would be nice to run with a batch size of 8 or so.

This PR adds:

  • Add --batch-size option to CLI with default value of 32
  • Pass batch_size parameter through to compute_text_projection()
  • Update _projection_for_texts() to use SentenceTransformer's built-in batch_size parameter
  • Include batch_size in cache key for proper caching
  • Add documentation for the new parameter

This change shouldn't impact other parts of the project (the frontend, etc.), but I mostly looked at the Python side of the codebase, so I might have missed something.


This allows users to control memory usage and performance when processing embeddings.
Larger batch sizes use more memory but may be faster, while smaller batch sizes
use less memory at the cost of potentially slower processing.
fredhohman requested a review from donghaoren on August 13, 2025.
domoritz (Member) left a review:


Just one comment on the default value. Otherwise looks good.

```python
@click.option(
    "--batch-size",
    type=int,
    default=32,
)
```
domoritz (Member), commenting on the snippet above:


Could the default from ST ever change? Then it might make sense to set None as the default here and pass that through.

davanstrien (Contributor, author):


Yeah, maybe None is safer here. cc @tomaarsen: do you ever plan to change the ST default batch size?

tomaarsen (Aug 13, 2025):


No plans. Sentence Transformers is too mature to allow such backwards-incompatible changes. I imagine it'll most likely never happen.


davanstrien (Contributor, author):


@domoritz are you okay with leaving the batch-size default as 32 for now, in that case?

domoritz (Member):


Sure. That's okay if ST is stable.

Collaborator:


I think we should also add the batch_size parameter to compute_image_projection (which currently has a default of 16). So it may still be a good idea to default to None and use different defaults depending on the modality; in the future we might even be able to auto-adjust it based on available VRAM.

davanstrien (Contributor, author):


Agreed, that makes sense. Maybe we could default to None in the CLI, then pass 32 for text and 16 for images when it is None? It might also be worth logging a short message when batch_size is None, to let users know they can adjust it if they run out of memory/VRAM, or have a powerful GPU and want to speed things up.
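The fallback logic proposed here could be sketched like this. The function name, defaults table, and log wording are illustrative assumptions, not the project's actual code:

```python
import logging

logger = logging.getLogger("embedding_atlas")

# Hypothetical modality-specific defaults mirroring the discussion:
# 32 for text, 16 for images.
DEFAULT_BATCH_SIZES = {"text": 32, "image": 16}


def resolve_batch_size(batch_size, modality):
    """Fill in a modality-specific default when the CLI passed None."""
    if batch_size is not None:
        return batch_size
    default = DEFAULT_BATCH_SIZES[modality]
    logger.info(
        "Using default batch size %d for %s embeddings; pass --batch-size "
        "to lower it if you run out of memory/VRAM, or raise it to speed "
        "things up on a powerful GPU.",
        default,
        modality,
    )
    return default
```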

Collaborator:


Yes, both sound great!

davanstrien (Contributor, author):


Implemented:

  • batch size for images
  • default batch sizes for images/text (applied when None is passed to the CLI, i.e. the default behaviour)
  • more informative logging for users

- Add batch_size parameter to image processing functions
- Use None as CLI default with modality-specific defaults (32 for text, 16 for images)
- Add educational logging when using defaults
- Include batch_size in image cache keys
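Including batch_size in the cache keys means runs that differ only in batch size get distinct cache entries. A minimal, hypothetical key builder (not the project's actual caching code) could look like:

```python
import hashlib
import json


def projection_cache_key(model_name, texts, batch_size):
    # Hash every input that determines the cached result; batch_size is
    # included so runs with different batch sizes do not collide.
    payload = json.dumps(
        {"model": model_name, "texts": texts, "batch_size": batch_size},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```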
donghaoren (Collaborator) left a review:


Thank you for the contribution!

donghaoren merged commit caad819 into apple:main on Aug 13, 2025. 4 checks passed.