AI-Assisted Translation Pipeline
A fully reproducible, seven-step workflow turning raw Voyant UI strings into human-validated translations for underrepresented languages — with provenance tracked at every stage.
The seven steps
A Voyant Tools administrator exports the UI string bundle as a CSV file.
Our Django management command ingests every row into the StringUnit
model, capturing the location, message ID, and English source text.
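The ingest step can be sketched with the standard library alone. This is a minimal illustration, not the project's Django management command: `StringUnitRow` is a stand-in for the real StringUnit model, and the column names `location`, `message_id`, and `source` are assumptions about the export header. The sample rows are invented.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class StringUnitRow:
    """Stand-in for one StringUnit record (not the project's Django model)."""
    location: str
    message_id: str
    source_text: str


def read_string_units(csv_file):
    """Yield one StringUnitRow per CSV row that has English source text.

    The column names used here are assumptions; adjust them to the real header.
    """
    for row in csv.DictReader(csv_file):
        source = row.get("source") or ""
        if not source.strip():
            continue  # nothing to translate in this row
        yield StringUnitRow(
            location=(row.get("location") or "").strip(),
            message_id=(row.get("message_id") or "").strip(),
            source_text=source,
        )


# Invented sample data: two rows, the second has no source text.
sample = io.StringIO(
    "location,message_id,source\n"
    "example.panel,title,Word Cloud\n"
    "example.panel,empty,\n"
)
units = list(read_string_units(sample))
```

In a Django command, each yielded row would presumably be saved through the model rather than collected in a list.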
UI strings often contain ICU message syntax ({count}),
HTML tags, or positional arguments. Before translation, a regex pass
swaps these tokens for numbered sentinels (■1■)
so the translation engine never mangles them.
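A minimal sketch of the sentinel swap, assuming three token shapes: simple ICU placeholders such as {count}, HTML tags, and printf-style positional arguments such as %1$s. The regex and the function names `protect` and `restore` are illustrative; the pipeline's actual pattern may differ, and nested ICU plural or select messages would need a real parser.

```python
import re

# Assumed token shapes: {name}, HTML tags, and %s / %d / %1$s style arguments.
TOKEN_RE = re.compile(r"\{[^{}]*\}|</?[A-Za-z][^<>]*>|%(?:\d+\$)?[sd]")
SENTINEL_RE = re.compile(r"■(\d+)■")


def protect(text):
    """Replace each token with a numbered sentinel; return (masked, tokens)."""
    tokens = []

    def _swap(match):
        tokens.append(match.group(0))
        return f"■{len(tokens)}■"

    return TOKEN_RE.sub(_swap, text), tokens


def restore(text, tokens):
    """Put the original tokens back; unknown sentinel numbers are left as-is."""
    def _swap(match):
        index = int(match.group(1)) - 1
        return tokens[index] if 0 <= index < len(tokens) else match.group(0)

    return SENTINEL_RE.sub(_swap, text)


original = "Showing <b>{count}</b> of %1$s terms"
masked, tokens = protect(original)
# masked is "Showing ■1■■2■■3■ of ■4■ terms"
```

Because the sentinels are numbered, `restore` still works when the translation engine reorders them, which matters for target languages with a different word order.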
One or more translation engines produce a candidate translation. NLLB-200 runs locally (no API key). OpenAI and Ollama are optional pluggable alternatives controlled via environment variables. Multiple drafts can be stored side-by-side for cross-model comparison.
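One way the environment-variable switch could look is sketched below. The variable name `TRANSLATION_ENGINES`, the engine keys, and both helper functions are hypothetical; the project's real configuration may be organised differently. The translator callables are placeholders, not real engine clients.

```python
import os

# Hypothetical names: the real variable and engine keys may differ.
ENGINE_ENV_VAR = "TRANSLATION_ENGINES"
KNOWN_ENGINES = ("nllb", "openai", "ollama")
DEFAULT_ENGINES = ("nllb",)  # NLLB-200 runs locally, so it is the safe default


def enabled_engines(environ=None):
    """Return the engines to run, in order, from a comma-separated variable."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENGINE_ENV_VAR, "")
    requested = [name.strip().lower() for name in raw.split(",") if name.strip()]
    unknown = [name for name in requested if name not in KNOWN_ENGINES]
    if unknown:
        raise ValueError(f"Unknown translation engine(s): {', '.join(unknown)}")
    # dict.fromkeys drops duplicates while keeping the requested order.
    return list(dict.fromkeys(requested)) or list(DEFAULT_ENGINES)


def collect_drafts(source_text, engines, translators):
    """Run each enabled engine and store its draft under the engine name."""
    return {name: translators[name](source_text) for name in engines}
```

Keying the drafts by engine name keeps them side by side for cross-model comparison and preserves which engine produced which candidate.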
The draft is translated back to English using NLLB-200. This round-trip text is later compared with the original English source to detect semantic drift: strings that "lost their meaning" in translation are flagged automatically.
XLM-R (via sentence-transformers) embeds both the original
English source and the back-translated text into a shared cross-lingual vector space.
Cosine similarity between the two embeddings gives the score: mathematically it can range over [-1, 1], but for sentence pairs like these it typically falls in [0, 1].
Strings scoring below the similarity threshold are marked as QA warnings.
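The scoring itself is a few lines of arithmetic. The sketch below uses plain Python lists in place of real XLM-R embeddings, clamps negative cosine values to 0 so the score stays in the [0, 1] range the text describes, and uses an illustrative threshold of 0.80; none of these choices is confirmed as the pipeline's own. In practice the two vectors would come from a sentence-transformers model's `encode` call.

```python
import math

SIMILARITY_THRESHOLD = 0.80  # illustrative value, not the project's setting


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as "no similarity"
    return dot / (norm_a * norm_b)


def similarity_score(source_vec, back_vec):
    """Cosine similarity clamped to [0, 1] (the clamp is an assumption)."""
    return max(0.0, min(1.0, cosine_similarity(source_vec, back_vec)))


def needs_warning(source_vec, back_vec, threshold=SIMILARITY_THRESHOLD):
    """True when the back-translation has drifted below the threshold."""
    return similarity_score(source_vec, back_vec) < threshold


# Toy vectors standing in for embeddings of the source and back-translation.
close_pair = needs_warning([1.0, 2.0, 3.0], [1.1, 2.0, 2.9])
drifted_pair = needs_warning([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Here `close_pair` is `False` and `drifted_pair` is `True`: nearly parallel vectors pass, orthogonal ones are flagged.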
A lightweight rule engine checks: placeholder integrity (sentinels
restored correctly), length ratio (translation suspiciously shorter
or longer than the source), Unicode validity, and similarity score threshold.
Each failed check is stored as a structured flag on the
Translation record for reviewer visibility.
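The four checks can be sketched as one function that returns a list of structured flags. The flag layout, the check names, and the default bounds (length ratio between 0.5 and 2.0, similarity of at least 0.8) are assumptions chosen for illustration, not the project's configured values.

```python
import re

SENTINEL_RE = re.compile(r"■\d+■")


def run_qa_checks(source, translation, similarity, expected_tokens=(),
                  min_ratio=0.5, max_ratio=2.0, min_similarity=0.8):
    """Return one flag dict per failed check; an empty list means all passed."""
    flags = []

    # Placeholder integrity: no sentinel left behind, every token restored.
    leftover = SENTINEL_RE.findall(translation)
    missing = [tok for tok in expected_tokens if tok not in translation]
    if leftover or missing:
        flags.append({"check": "placeholder_integrity",
                      "leftover": leftover, "missing": missing})

    # Length ratio: translation much shorter or longer than the source.
    ratio = len(translation) / max(len(source), 1)
    if not min_ratio <= ratio <= max_ratio:
        flags.append({"check": "length_ratio", "ratio": round(ratio, 2)})

    # Unicode validity: lone surrogates fail to encode; U+FFFD signals
    # text that was already damaged by an earlier decoding step.
    try:
        translation.encode("utf-8")
        valid = "\ufffd" not in translation
    except UnicodeEncodeError:
        valid = False
    if not valid:
        flags.append({"check": "unicode_validity"})

    # Similarity threshold on the back-translation score.
    if similarity < min_similarity:
        flags.append({"check": "similarity", "score": similarity})

    return flags


# A draft where the sentinel was never restored and the meaning drifted.
bad_flags = run_qa_checks("Showing {count} terms", "■1■", 0.41,
                          expected_tokens=["{count}"])
failed = [flag["check"] for flag in bad_flags]
# failed is ["placeholder_integrity", "length_ratio", "similarity"]
```

Returning every failed check, rather than stopping at the first, gives reviewers the full picture for each string.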
Approved volunteer reviewers (speakers of the target language) inspect the AI draft alongside the source, back-translation, similarity score, and QA flags. They can accept, edit, or reject the draft. Only reviewer-approved strings are exported in the final Voyant CSV.
Models at a glance
NLLB-200 is Meta AI's 200-language neural MT model. Primary engine for translation drafts and back-translation. Runs locally, with no API key required.
OpenAI and Ollama models are optional engines used for alternative draft generation and terminology suggestions. Swap in any model via an environment variable.
XLM-R provides cross-lingual sentence embeddings for back-translation similarity scoring, which detects semantic drift automatically.
UBC-NLP's multilingual MT benchmark for evaluating quality on underrepresented language pairs beyond Voyant strings.
Translation statuses
Reproducibility
Open Python
Every pipeline step is implemented in plain Python scripts and Jupyter notebooks. No proprietary tooling, no locked-in services.
Portable
Point the same pipeline at any Voyant CSV and any NLLB-200-supported language. Amharic, Swahili, isiZulu — same codebase.
Auditable
Every translation record carries its engine source, similarity score, QA flags, and full reviewer history — provenance-aware by design.
Ready to help validate translations?
Native or fluent speakers of underrepresented languages are welcome.