2 changes: 2 additions & 0 deletions fern/docs.yml
@@ -162,6 +162,8 @@ navigation:
path: pages/08-concepts/account_management.mdx
- page: Webhooks
path: pages/05-guides/webhooks.mdx
- page: Evaluating STT models
path: pages/08-concepts/evals.mdx
- link: Available cloud regions
href: /docs/speech-to-text/pre-recorded-audio/select-the-region
- link: Security page
78 changes: 78 additions & 0 deletions fern/pages/08-concepts/evals.mdx
@@ -0,0 +1,78 @@
---
title: "Evaluating STT models"
---

# Introduction

The high-level objective of an STT model evaluation is to answer the question: *"Which Speech-to-text model is the best for my product?"*

This guide provides a step-by-step framework for evaluating and benchmarking Speech-to-text models so you can select the best fit for your use case.

<Note>
Need help evaluating our Speech-to-text products? [Contact our Sales team](https://www.assemblyai.com/contact/sales) to request an evaluation.
</Note>

# Common evaluation metrics

## Word error rate (WER)

$$
WER = \frac{S + D + I}{N}
$$

This formula takes the number of substitutions (S), deletions (D), and insertions (I), and divides their sum by the total number of words (N).
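
For example, if a reference transcript contains 100 words and the model's output has 5 substitutions, 3 deletions, and 2 insertions, then WER = (5 + 3 + 2) / 100 = 0.10, or 10%.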

While the WER calculation may seem simple, it requires a methodical, granular approach and reliable reference data.
Word Error Rate tells you how “different” the automatic transcription is from the human transcription, and *generally* this is a reliable metric for determining how “good” a transcription is. For more info on WER as a metric, read our CEO Dylan's [blog post on Word Error Rate](https://www.assemblyai.com/blog/word-error-rate).

## Sentence error rate (SER)

Sentence Error Rate measures the percentage of sentences that contain at least one error. It provides a different perspective on accuracy, especially for use cases where entire sentences must be transcribed perfectly.
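
Expressed as a formula, in the same style as above:

$$
SER = \frac{\text{sentences with at least one error}}{\text{total number of sentences}}
$$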

## Diarization error rate (DER)

$$
DER = \frac{\text{false alarm} + \text{missed detection} + \text{confusion}}{\text{total}}
$$

This formula takes the duration of non-speech incorrectly classified as speech (false alarm), the duration of speech incorrectly classified as non-speech (missed detection), and the duration of speaker confusion (confusion), and divides their sum by the total speech duration.
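
For example, if a 10-minute (600-second) recording contains 12 seconds of false alarm, 18 seconds of missed detection, and 30 seconds of speaker confusion, then DER = (12 + 18 + 30) / 600 = 0.10, or 10%.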

# The evaluation process

This section provides a step-by-step guide on how to run an evaluation.

To produce meaningful results, the evaluation process should closely match your production environment, including the files you intend to transcribe, the model you intend to use, and the settings applied to that model.

## Step 1: Get files to benchmark

Ensure that the files you use to benchmark are representative of the files you plan to use in production. For example, if you plan to transcribe meetings, gather a corpus of meeting recordings.

Then, gather human-labeled data to act as your source of ground truth. Ground truth is accurately transcribed audio data that will serve as the "correct answer" for your benchmark. Human-labeled data can be purchased from an external vendor or created manually.

## Step 2: Transcribe your files

Next, transcribe your files using AssemblyAI's API.
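
For example, here's a minimal sketch using the AssemblyAI Python SDK. The API key placeholder, file path, and `speaker_labels` setting are illustrative; apply whatever model and settings you plan to use in production.

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Mirror your production setup: use the same model and settings
# you plan to run live, e.g. speaker labels if you need diarization.
config = aai.TranscriptionConfig(speaker_labels=True)

transcriber = aai.Transcriber(config=config)
transcript = transcriber.transcribe("./meeting_recording.mp3")

print(transcript.text)
```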

## Step 3: Text normalization

Before calculating any metrics, both the reference (ground truth) and hypothesis (model-generated) texts need to be normalized to ensure a fair comparison.

This accounts for differences in:
- Punctuation and capitalization
- Number formatting (e.g., “twenty-one” vs. “21”)
- Contractions and abbreviations
- Other stylistic variations that don’t affect meaning

Normalization can be done with a library like [Whisper Normalizer](https://pypi.org/project/whisper-normalizer/).
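
As a minimal sketch (assuming the package's `EnglishTextNormalizer` interface), apply the same normalizer to both the reference and the hypothesis before scoring:

```python
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

reference = normalizer("I'm twenty-one years old, Dr. Smith!")
hypothesis = normalizer("i am 21 years old doctor smith")

# Ideally both normalized strings now match, so purely stylistic
# differences no longer count as errors in the WER calculation.
print(reference)
print(hypothesis)
```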

## Step 4: Compare and calculate

Calculate the error rates using the formulas above, or consider using a library like [pyannote.metrics](https://github.com/pyannote/pyannote-metrics).
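
As a rough sketch, the WER formula above can be implemented directly with a word-level edit distance. The sample strings are only illustrative, and both should already be normalized as described in Step 3:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + D + I) / N via word-level edit distance."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()

    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp_words) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j - 1],  # substitution
                    dp[i - 1][j],      # deletion
                    dp[i][j - 1],      # insertion
                )

    return dp[-1][-1] / len(ref_words)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the dog"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # ~22.22%
```

For DER and other diarization metrics, a dedicated library like pyannote.metrics is usually a better fit than rolling your own implementation.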

# Beyond the metrics

Consider running an A/B test against your current provider in production and tracking which provider generates more support tickets; the provider whose transcripts produce fewer tickets is likely serving your users better.

# Conclusion

Need help? [Contact our Sales team](https://www.assemblyai.com/contact/sales) and we can help run an evaluation on our end.