# fast-langdetect

**fast-langdetect** is an ultra-fast and highly accurate language detection library based on FastText, a library developed by Facebook. It is up to 80x faster than conventional methods and delivers up to 95% accuracy.
- Supports Python 3.9 to 3.13.
- Works offline with the lite model.
- No `numpy` required (thanks to @dalf).
This project builds upon zafercavdar/fasttext-langdetect with enhancements in packaging. For more information about the underlying model, see the official FastText documentation: Language Identification.
The lite model runs offline and is memory-friendly; the full model is larger and offers higher accuracy.
Approximate memory usage (RSS after load):
- Lite: ~45–60 MB
- Full: ~170–210 MB
- Auto: tries full first, falls back to lite only on MemoryError.
Notes:
- Measurements vary by Python version, OS, allocator, and import graph; treat these as practical ranges.
- Validate on your system if constrained; see `examples/memory_usage_check.py` (credit: script by github@JackyHe398); a minimal sketch of the same idea follows this list.
- Run memory checks in a clean terminal session. IDEs/REPLs may preload frameworks and inflate peak RSS (`ru_maxrss`), leading to very large peaks with near-zero deltas.
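The bundled script is the reference; as a rough illustration, here is a minimal RSS probe (assumes a Unix-like OS; note that `ru_maxrss` is reported in kilobytes on Linux but bytes on macOS):

```python
import resource

def peak_rss_mb() -> float:
    # peak resident set size of this process, in MB (Linux units assumed)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

baseline = peak_rss_mb()
from fast_langdetect import detect
detect("warm-up text", model="lite")  # first call loads the model
print(f"peak RSS delta after load: {peak_rss_mb() - baseline:.1f} MB")
```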
Choose the model that best fits your constraints.
To install fast-langdetect, you can use either `pip` or `pdm`:

```bash
pip install fast-langdetect
pdm add fast-langdetect
```
For higher accuracy, prefer the full model via `detect(text, model='full')`. For robust behavior under memory pressure, use `detect(text, model='auto')`, which falls back to the lite model only on `MemoryError`.
- If the sample is too long (generally over 80 characters) or too short, accuracy will be reduced.
- The model downloads to the system temporary directory by default. You can customize this by:
  - Setting the `FTLANG_CACHE` environment variable (as sketched below)
  - Using `LangDetectConfig(cache_dir="your/path")`
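For example, redirecting the cache via the environment variable (a sketch; the path is hypothetical, and the variable should be set before the first model load in the process):

```python
import os

# Hypothetical directory; set before fast_langdetect loads a model.
os.environ["FTLANG_CACHE"] = "/path/to/model-cache"

from fast_langdetect import detect
print(detect("Hello, world!", model="lite", k=1))
```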
Basic usage:

```python
from fast_langdetect import detect

print(detect("Hello, world!", model="auto", k=1))
print(detect("Hello 世界 こんにちは", model="auto", k=3))
```
`detect` always returns a list of candidates ordered by score. Use `model="full"` for the best accuracy or `model="lite"` for an offline-only workflow.
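A sketch of consuming the ranked candidates; each entry is a dict exposing the language code under `"lang"` (the key used later in this README), plus whatever other fields the model returns:

```python
from fast_langdetect import detect

for candidate in detect("Hello 世界 こんにちは", model="auto", k=3):
    # print the language code, then the full candidate dict
    print(candidate["lang"], candidate)
```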
To configure caching or per-instance defaults, create a `LangDetector` with a `LangDetectConfig`:

```python
from fast_langdetect import LangDetectConfig, LangDetector

config = LangDetectConfig(cache_dir="/custom/cache", model="lite")
detector = LangDetector(config)
print(detector.detect("Bonjour", k=1))
print(detector.detect("Hola", model="full", k=1))
```
Each `LangDetector` instance maintains its own in-memory model cache. Once loaded, models are reused for subsequent calls within the same instance. The global `detect()` function uses a shared default detector, so it also benefits from automatic caching.

Create a custom `LangDetector` instance when you need specific configuration (custom cache directory, input limits, etc.) or isolated model management.
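For example, a single long-lived detector loads its model once and reuses it across calls (a sketch using only the API shown above):

```python
from fast_langdetect import LangDetectConfig, LangDetector

detector = LangDetector(LangDetectConfig(model="lite"))
for text in ["Hello", "Bonjour", "Hola"]:
    # the model is loaded on the first call and served from the
    # instance cache afterwards
    print(detector.detect(text, k=1))
```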
Keep It Simple!
- Only `MemoryError` triggers fallback (in `model='auto'`): when loading the full model runs out of memory, it falls back to the lite model.
- I/O, network, permission, path, and integrity errors raise standard exceptions (e.g., `FileNotFoundError`, `PermissionError`) or library-specific errors where applicable; there is no silent fallback.
- `model='lite'` and `model='full'` never fall back, by design.
- Base error: `FastLangdetectError` (library-specific failures).
- Model loading failures: `ModelLoadError`.
- Standard Python exceptions (e.g., `ValueError`, `TypeError`, `FileNotFoundError`, `MemoryError`) propagate when they are not library-specific.
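A sketch of handling these errors, assuming both exception classes are importable from the package root:

```python
from fast_langdetect import detect
# Assumption: the error classes are exported at the package root.
from fast_langdetect import FastLangdetectError, ModelLoadError

def detect_or_none(text: str):
    try:
        # only MemoryError triggers the full -> lite fallback inside 'auto'
        return detect(text, model="auto")
    except ModelLoadError:
        return None  # model could not be loaded (library-specific)
    except FastLangdetectError:
        return None  # any other library-specific failure
    # standard exceptions (FileNotFoundError, PermissionError, ...) propagate
```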
For text splitting based on language, please refer to the split-lang repository.
You can control log verbosity and input normalization via `LangDetectConfig`:

```python
from fast_langdetect import LangDetectConfig, LangDetector

config = LangDetectConfig(max_input_length=200)
detector = LangDetector(config)
print(detector.detect("Some very long text..." * 5))
```
- Newlines are always replaced with spaces to avoid FastText errors (silent, no log).
- When truncation happens, a WARNING is logged because it may reduce accuracy.
- The default `max_input_length` is 80 characters (optimal for accuracy); increase it if you need longer samples, or set it to `None` to disable truncation entirely (see the sketch below).
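For instance, disabling truncation entirely (a sketch; mind the accuracy caveat above):

```python
from fast_langdetect import LangDetectConfig, LangDetector

# max_input_length=None switches truncation off; long inputs may reduce accuracy
detector = LangDetector(LangDetectConfig(max_input_length=None))
print(detector.detect("A sample much longer than eighty characters ..." * 3, k=1))
```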
- Default cache: if `cache_dir` is not set, models are stored under a system temp-based directory specified by `FTLANG_CACHE` or an internal default. This directory is created automatically when needed.
- User-provided `cache_dir`: if you set `LangDetectConfig(cache_dir=...)` to a path that does not exist, the library raises `FileNotFoundError` instead of silently creating or using another location. Create the directory yourself if that's intended (a sketch follows).
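If you do want a custom location, create it yourself first (a sketch with a hypothetical path):

```python
from pathlib import Path
from fast_langdetect import LangDetectConfig, LangDetector

cache = Path("/custom/cache")             # hypothetical path
cache.mkdir(parents=True, exist_ok=True)  # the library will not create it for you

detector = LangDetector(LangDetectConfig(cache_dir=str(cache)))
```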
The constructor exposes a few advanced knobs (`proxy`, `normalize_input`, `max_input_length`). These are rarely needed for typical usage and can be ignored. Prefer `detect(..., model=...)` unless you know you need them.
fastText reports BCP-47 style tags such as `en`, `zh-cn`, `pt-br`, and `yue`. The detector keeps those codes so you can decide how to display them. Choose the approach that fits your product:

- Small, fixed list? Maintain a hand-written mapping and fall back to the raw code for anything unexpected.
```python
FASTTEXT_DISPLAY_NAMES = {
    "en": "English",
    "zh": "Chinese",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "pt": "Portuguese",
    "pt-br": "Portuguese (Brazil)",
    "yue": "Cantonese",
    "wuu": "Wu Chinese",
    "arz": "Egyptian Arabic",
    "ckb": "Central Kurdish",
    "kab": "Kabyle",
}

def code_to_display_name(code: str) -> str:
    # fall back to the raw code for anything not in the mapping
    return FASTTEXT_DISPLAY_NAMES.get(code.lower(), code)

print(code_to_display_name("pt-br"))  # Portuguese (Brazil)
print(code_to_display_name("de"))     # de (unmapped, raw code returned)
```
- Need coverage for all 176 fastText languages? Use a language database library that understands subtags and scripts. Two popular libraries are `langcodes` and `pycountry`.
```python
# pip install langcodes
from langcodes import Language

from fast_langdetect import detect

LANG_OVERRIDES = {
    "pt-br": "Portuguese (Brazil)",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "yue": "Cantonese",
}

def fasttext_to_name(code: str) -> str:
    normalized = code.replace("_", "-").lower()
    if normalized in LANG_OVERRIDES:
        return LANG_OVERRIDES[normalized]
    try:
        return Language.get(normalized).display_name("en")
    except Exception:
        # retry with the base subtag, e.g. "pt" for "pt-xx"
        base = normalized.split("-")[0]
        try:
            return Language.get(base).display_name("en")
        except Exception:
            return code

result = detect("Olá mundo", model="full", k=1)
print(fasttext_to_name(result[0]["lang"]))
```
`pycountry` works similarly (`pip install pycountry`). Use `pycountry.languages.lookup("pt")` for fuzzy matching or `pycountry.languages.get(alpha_2="pt")` for exact lookups, and pair it with a small override dictionary for non-standard tags such as `pt-br`, `zh-cn`, or dialect codes like `yue`.
```python
# pip install pycountry
import pycountry

from fast_langdetect import detect

FASTTEXT_OVERRIDES = {
    "pt-br": "Portuguese (Brazil)",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "yue": "Cantonese",
}

def fasttext_to_name_pycountry(code: str) -> str:
    normalized = code.replace("_", "-").lower()
    if normalized in FASTTEXT_OVERRIDES:
        return FASTTEXT_OVERRIDES[normalized]
    try:
        return pycountry.languages.lookup(normalized).name
    except LookupError:
        # retry with the base subtag, e.g. "pt" for "pt-xx"
        base = normalized.split("-")[0]
        try:
            return pycountry.languages.lookup(base).name
        except LookupError:
            return code

result = detect("Olá mundo", model="full", k=1)
print(fasttext_to_name_pycountry(result[0]["lang"]))
```
To load a model file directly, point `custom_model_path` at it:

```python
from importlib import resources

from fast_langdetect import LangDetectConfig, LangDetector

with resources.path("fast_langdetect.resources", "lid.176.ftz") as model_path:
    config = LangDetectConfig(custom_model_path=str(model_path))
    detector = LangDetector(config)
    print(detector.detect("Hello world", k=1))
```
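On Python 3.9+, the non-deprecated `importlib.resources.files()` API achieves the same thing (a sketch assuming the same resource layout as above):

```python
from importlib.resources import as_file, files

from fast_langdetect import LangDetectConfig, LangDetector

# as_file() materializes the packaged resource as a real filesystem path
with as_file(files("fast_langdetect.resources") / "lid.176.ftz") as model_path:
    detector = LangDetector(LangDetectConfig(custom_model_path=str(model_path)))
    print(detector.detect("Hello world", k=1))
```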
When using a custom model via `custom_model_path`, the `model` parameter in `detect()` calls is ignored, since your custom model file is always loaded directly. The `model="lite"`, `model="full"`, and `model="auto"` parameters only apply to the built-in models.
For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

```bibtex
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```
[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

```bibtex
@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}
```
- Code: Released under the MIT License (see `LICENSE`).
- Models: This package uses the pre-trained fastText language identification models (`lid.176.ftz`, bundled for offline use, and `lid.176.bin`, downloaded as needed). These models are licensed under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license.
- Attribution: fastText language identification models by Facebook AI Research. See the fastText docs and license for details.
- Note: If you redistribute or modify the model files, you must comply with CC BY-SA 3.0. Inference usage via this library does not change the license of the model files themselves.