
Commit 8c3503b

docs: add information about default model and reproducibility (#36)
1 parent 843440f commit 8c3503b

1 file changed

+17
-1
lines changed

packages/docs/tool.md

Lines changed: 17 additions & 1 deletion
@@ -49,7 +49,9 @@ embedding-atlas huggingface_org/dataset_name
4949

5050
## Visualizing Embeddings
5151

52-
The script will use [SentenceTransformers](https://sbert.net/) to compute embedding vectors for the specified column containing the text data. The script will then project the high-dimensional embedding vectors to 2D with [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html).
52+
The script will use [SentenceTransformers](https://sbert.net/) to compute embedding vectors for the specified column containing the text or image data. You may use the `--model` option to specify an embedding model. If not specified, a default model will be used. The current defaults are `all-MiniLM-L6-v2` for text and `google/vit-base-patch16-384` for images, but these are subject to change in future releases.
53+
54+
After embedding vectors are computed, the script will then project the high-dimensional vectors to 2D with [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html).
5355

5456
::: tip
5557
Optionally, if you know what column your text data is in beforehand, you can specify which column to use with the `--text` flag, for example:
@@ -74,6 +76,20 @@ If this column is specified, you'll be able to see nearest neighbors for a selec
7476

7577
Once this script completes, it will print out a URL like `http://localhost:5055/`. Open the URL in a web browser to view the embedding.
7678

79+
## Reproducibility
80+
81+
For reproducible embedding visualizations, we recommend pre-computing both the embedding vectors and their UMAP projections, and storing them with your dataset. This ensures consistency for three reasons: the default embedding model may change over time, floating-point results may vary across devices, and UMAP introduces randomness through both its default random initialization and its use of parallelism (see the [UMAP reproducibility guide](https://umap-learn.readthedocs.io/en/latest/reproducibility.html)).
82+
83+
The `embedding_atlas` package provides utility functions to compute the embedding projections:
84+
85+
```python
86+
from embedding_atlas.projection import compute_text_projection
87+
88+
compute_text_projection(df, text="text_column",
89+
x="projection_x", y="projection_y", neighbors="neighbors"
90+
)
91+
```
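To store the result with your dataset, one option is to compute the projection only when its columns are missing and then write the data frame back to disk. The sketch below is illustrative rather than part of the package: `missing_projection_columns` and `precompute_and_store` are hypothetical helpers, the Parquet paths are placeholders, and it assumes (as the call above suggests) that `compute_text_projection` adds the columns to `df` in place and that those columns serialize to Parquet.

```python
# Column names match the example call above.
PROJECTION_COLUMNS = ("projection_x", "projection_y", "neighbors")


def missing_projection_columns(df, columns=PROJECTION_COLUMNS):
    """Return the projection columns that are not yet present in df."""
    return [name for name in columns if name not in df.columns]


def precompute_and_store(input_path, output_path, text_column="text_column"):
    """Hypothetical helper: compute the projection once and save it with the data."""
    # Imported lazily so the column check above works without these packages.
    import pandas as pd
    from embedding_atlas.projection import compute_text_projection

    df = pd.read_parquet(input_path)
    missing = missing_projection_columns(df)
    if missing:
        # Assumption: the columns are added to df in place.
        compute_text_projection(
            df,
            text=text_column,
            x="projection_x",
            y="projection_y",
            neighbors="neighbors",
        )
    # Persist the data together with the stored projection columns.
    df.to_parquet(output_path)
    return missing
```

Because the stored file already contains the projection columns, later runs can reuse them instead of recomputing embeddings and UMAP coordinates that might differ between machines or releases.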
92+
7793
## Usage
7894

7995
```
