2 changes: 2 additions & 0 deletions fern/docs.yml
@@ -162,6 +162,8 @@ navigation:
path: pages/08-concepts/account_management.mdx
- page: Webhooks
path: pages/05-guides/webhooks.mdx
- page: Evaluating STT models
path: pages/08-concepts/evals.mdx
- link: Available cloud regions
href: /docs/speech-to-text/pre-recorded-audio/select-the-region
- link: Security page
78 changes: 78 additions & 0 deletions fern/pages/08-concepts/evals.mdx
@@ -0,0 +1,78 @@
---
title: "Evaluating STT models"
---

# Introduction

The high-level objective of an STT model evaluation is to answer the question: *"Which Speech-to-text model is the best for my product?"*

This guide provides a step-by-step framework for evaluating and benchmarking Speech-to-text models so you can select the best fit for your use case.

<Note>
Need help evaluating our Speech-to-text products? [Contact our Sales team](https://www.assemblyai.com/contact/sales) to request an evaluation.
</Note>

# Common evaluation metrics

## Word error rate (WER)

$$
WER = \frac{S + D + I}{N}
$$

This formula takes the number of substitutions (S), deletions (D), and insertions (I), and divides their sum by the total number of words (N).
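
For example, if a reference transcript contains 100 words and the model's output has 5 substitutions, 3 deletions, and 2 insertions, then WER = (5 + 3 + 2) / 100 = 0.10, or 10%.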

While the WER calculation may seem simple, it requires a methodical, granular approach and reliable reference data.
Word Error Rate tells you how “different” the automatic transcription is from the human transcription, and *generally* this is a reliable metric for determining how “good” a transcription is. For more info on WER as a metric, read our CEO Dylan's [blog post on Word Error Rate](https://www.assemblyai.com/blog/word-error-rate).

## Sentence error rate (SER)

Sentence Error Rate measures the percentage of sentences that contain at least one error. It provides a different perspective on accuracy, especially for use cases where entire sentences must be transcribed perfectly.
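
Expressed as a formula, in the same style as above:

$$
SER = \frac{\text{sentences with at least one error}}{\text{total number of sentences}}
$$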

## Diarization error rate (DER)

$$
DER = \frac{\text{false alarm} + \text{missed detection} + \text{confusion}}{\text{total}}
$$

This formula takes the duration of non-speech incorrectly classified as speech (false alarm), the duration of speech incorrectly classified as non-speech (missed detection), and the duration of speaker confusion (confusion), and divides their sum by the total speech duration.
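
For example, if a 10-minute (600-second) recording contains 12 seconds of false alarm, 18 seconds of missed detection, and 30 seconds of speaker confusion, then DER = (12 + 18 + 30) / 600 = 0.10, or 10%.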

# The evaluation process

This section provides a step-by-step guide on how to run an evaluation.

To produce meaningful results, the evaluation process should closely match your production environment, including the files you intend to transcribe, the model you intend to use, and the settings applied to that model.

## Step 1: Get files to benchmark

Ensure that the files you use to benchmark are representative of the files you plan to use in production. For example, if you plan to transcribe meetings, gather a corpus of meeting recordings.

Then, gather human-labeled data to act as your source of ground truth. Ground truth is accurately transcribed audio data that will serve as the "correct answer" for your benchmark. Human-labeled data can be purchased from an external vendor or created manually.

## Step 2: Transcribe your files

Next, transcribe your files using AssemblyAI's API.
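
For example, here's a minimal sketch using the AssemblyAI Python SDK. The API key placeholder, file path, and `speaker_labels` setting are illustrative; apply whatever model and settings you plan to use in production.

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Mirror your production setup: use the same model and settings
# you plan to run live, e.g. speaker labels if you need diarization.
config = aai.TranscriptionConfig(speaker_labels=True)

transcriber = aai.Transcriber(config=config)
transcript = transcriber.transcribe("./meeting_recording.mp3")

print(transcript.text)
```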

## Step 3: Text normalization

Before calculating any metrics, both the reference (ground truth) and hypothesis (model-generated) texts need to be normalized to ensure a fair comparison.

This accounts for differences in:
- Punctuation and capitalization
- Number formatting (e.g., “twenty-one” vs. “21”)
- Contractions and abbreviations
- Other stylistic variations that don’t affect meaning

Normalization can be done with a library like [Whisper Normalizer](https://pypi.org/project/whisper-normalizer/).
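
As a minimal sketch (assuming the package's `EnglishTextNormalizer` interface), apply the same normalizer to both the reference and the hypothesis before scoring:

```python
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

reference = normalizer("I'm twenty-one years old, Dr. Smith!")
hypothesis = normalizer("i am 21 years old doctor smith")

# Ideally both normalized strings now match, so purely stylistic
# differences no longer count as errors in the WER calculation.
print(reference)
print(hypothesis)
```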

## Step 4: Compare and calculate

Calculate the error rates using the formulas above, or consider using a library like [pyannote.metrics](https://github.com/pyannote/pyannote-metrics).
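
As a rough sketch, the WER formula above can be implemented directly with a word-level edit distance. The sample strings are only illustrative, and both should already be normalized as described in Step 3:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + D + I) / N via word-level edit distance."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()

    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp_words) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j - 1],  # substitution
                    dp[i - 1][j],      # deletion
                    dp[i][j - 1],      # insertion
                )

    return dp[-1][-1] / len(ref_words)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the dog"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # ~22.22%
```

For DER and other diarization metrics, a dedicated library like pyannote.metrics is usually a better fit than rolling your own implementation.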

# Beyond the metrics

Consider running an A/B test against your current provider in production and tracking which provider generates more support tickets; the provider whose transcripts produce fewer tickets is likely serving your users better.

# Conclusion

Need help? [Contact our Sales team](https://www.assemblyai.com/contact/sales) and we can help run an evaluation on our end.