AI-Assisted Translation Pipeline

A fully reproducible, seven-step workflow turning raw Voyant UI strings into human-validated translations for underrepresented languages — with provenance tracked at every stage.

Every string has a story

From machine draft to approved translation, each string carries its source engine, back-translation, similarity score, and reviewer history — fully auditable and replicable.

1
📥
Import Voyant CSV

A Voyant Tools administrator exports the UI string bundle as a CSV file. Our Django management command ingests every row into the StringUnit model, capturing the location, message ID, and English source text.

import_voyant_csv StringUnit model
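The import step can be sketched as a small parsing helper that the `import_voyant_csv` management command would call before saving `StringUnit` rows. The column names (`location`, `message_id`, `source_en`) are illustrative assumptions; the real Voyant export's headers may differ.

```python
import csv
import io

def parse_voyant_rows(csv_text):
    """Parse a Voyant UI-string CSV export into dicts ready for ingestion.

    NOTE: column names below are assumptions for illustration, not the
    actual Voyant export schema.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "location": row["location"],
            "message_id": row["message_id"],
            "source_text": row["source_en"],
        }
        for row in reader
    ]
```

In the pipeline, each returned dict would be persisted as one `StringUnit` record.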
2
🛡️
Protect Placeholders

UI strings often contain ICU message syntax ({count}), HTML tags, or positional arguments. Before translation, a regex pass swaps these tokens for numbered sentinels (■1■) so the translation engine never mangles them.

placeholder_guard.py ICU / HTML-safe
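A minimal sketch of the sentinel swap in `placeholder_guard.py`. The ■n■ sentinel format comes from the pipeline description; the exact regex covering ICU arguments, HTML tags, and positional arguments is an assumption.

```python
import re

# Illustrative pattern: ICU arguments like {count}, HTML tags like <b>,
# and positional arguments like %1$s. The production regex may differ.
TOKEN_RE = re.compile(r"\{[^{}]+\}|<[^<>]+>|%\d+\$[sd]")

def guard(text):
    """Swap protected tokens for numbered sentinels (■1■, ■2■, ...)."""
    tokens = []
    def _swap(match):
        tokens.append(match.group(0))
        return f"■{len(tokens)}■"
    return TOKEN_RE.sub(_swap, text), tokens

def unguard(text, tokens):
    """Restore the original tokens after translation."""
    for i, tok in enumerate(tokens, start=1):
        text = text.replace(f"■{i}■", tok)
    return text
```

Round-tripping `guard` and `unguard` must be lossless, which is exactly what the placeholder-integrity QA check later verifies.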
3
🤖
Generate Machine Draft

One or more translation engines produce a candidate translation. NLLB-200 runs locally (no API key). OpenAI and Ollama are optional pluggable alternatives controlled via environment variables. Multiple drafts can be stored side-by-side for cross-model comparison.

NLLB-200 OpenAI (optional) Ollama (optional)
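Engine selection might look like the sketch below, with NLLB-200 as the local default. The environment-variable names are assumptions, not the project's actual configuration keys.

```python
def select_engine(env):
    """Pick a translation backend from environment-style settings.

    Variable names are illustrative; NLLB-200 needs no key and is the
    fallback when no optional engine is configured.
    """
    if env.get("TRANSLATE_OPENAI_KEY"):
        return "openai"
    if env.get("TRANSLATE_OLLAMA_HOST"):
        return "ollama"
    return "nllb-200"
```

Because drafts are stored side by side, the same string can be run through several engines and compared before review.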
4
🔄
Back-Translation

The draft is translated back to English using NLLB-200. This round-trip text is then compared to the original English source to detect semantic drift — strings that "lost their meaning" in translation are flagged automatically.

NLLB-200 (target→en) back_translate()
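The round trip can be expressed as a small composition of two engine calls. Here `forward` and `backward` stand for any `text -> text` callables; in the pipeline both would wrap NLLB-200 with swapped language codes, and the function name `round_trip` is an illustrative assumption alongside the source's `back_translate()`.

```python
def round_trip(source_en, forward, backward):
    """Forward-translate, then back-translate to English.

    `forward` and `backward` are engine callables (text -> text);
    their exact signatures here are assumptions for illustration.
    Returns (draft, back_translation) for later similarity scoring.
    """
    draft = forward(source_en)
    back_en = backward(draft)
    return draft, back_en
```

Keeping both outputs lets the record store the draft and its back-translation side by side for auditing.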
5
📊
Semantic Similarity Scoring

XLM-R (via sentence-transformers) embeds both the original English source and the back-translated text into a shared cross-lingual vector space. Cosine similarity between the two embeddings gives a score, which in practice falls in [0, 1] for natural sentence pairs. Strings below the threshold are marked as QA warnings.

XLM-R sentence-transformers cosine similarity
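In production the two vectors would come from a sentence-transformers XLM-R model; the scoring itself reduces to plain cosine similarity, sketched here in pure Python. The 0.80 threshold is an illustrative assumption, not the pipeline's configured value.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def needs_similarity_flag(score, threshold=0.80):
    """True when a round-trip similarity score warrants a QA warning.

    The threshold value is an assumption for illustration.
    """
    return score < threshold
```

Identical embeddings score 1.0; orthogonal ones score 0.0, so low scores indicate the back-translation drifted from the source.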
6
⚠️
Automated QA Flags

A lightweight rule engine checks: placeholder integrity (sentinels restored correctly), length ratio (translation suspiciously shorter or longer), Unicode validity, and similarity score threshold. Each failed check is stored as a structured flag on the Translation record for reviewer visibility.

qa_check() qa_flags JSON field AfroLingu-MT eval
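The rule engine could be sketched as a single `qa_check()` pass returning structured flags for the `qa_flags` JSON field. The specific thresholds and flag shapes below are assumptions for illustration.

```python
import re

SENTINEL_RE = re.compile(r"■\d+■")

def qa_check(source, translation, similarity,
             threshold=0.80, ratio_bounds=(0.5, 2.0)):
    """Run lightweight QA rules; return a list of structured flags.

    Threshold and ratio bounds are illustrative assumptions.
    """
    flags = []
    # Placeholder integrity: every sentinel must survive translation.
    if set(SENTINEL_RE.findall(source)) != set(SENTINEL_RE.findall(translation)):
        flags.append({"rule": "placeholder_integrity"})
    # Length ratio: suspiciously shorter or longer translations.
    ratio = len(translation) / max(len(source), 1)
    if not (ratio_bounds[0] <= ratio <= ratio_bounds[1]):
        flags.append({"rule": "length_ratio", "ratio": round(ratio, 2)})
    # Unicode validity: replacement characters signal mojibake.
    if "\ufffd" in translation:
        flags.append({"rule": "unicode_validity"})
    # Back-translation similarity threshold.
    if similarity < threshold:
        flags.append({"rule": "low_similarity", "score": similarity})
    return flags
```

An empty list means the draft passed every automated check; anything else surfaces in the reviewer UI.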
7
👥
Human Review & Approval

Approved volunteer reviewers (speakers of the target language) inspect the AI draft alongside the source, back-translation, similarity score, and QA flags. They can accept, edit, or reject the draft. Only reviewer-approved strings are exported in the final Voyant CSV.

Review Queue UI status: APPROVED human-in-the-loop

Models at a glance

Machine Translation
NLLB-200

Meta AI's 200-language neural MT model. Primary engine for translation drafts and back-translation. Runs locally — no API key required.

LLM (optional)
OpenAI / Ollama

Used for alternative draft generation and terminology suggestions. Swap in any model via an environment variable.

Semantic Scoring
XLM-R

Cross-lingual sentence embeddings for back-translation similarity scoring. Detects semantic drift automatically.

Benchmark
AfroLingu-MT

UBC-NLP's multilingual MT benchmark for evaluating quality on underrepresented language pairs beyond Voyant strings.

machine_draft — AI draft generated, not yet reviewed
in_review — Reviewer is working on this string
⚠ QA flags — Automated checks raised concerns
approved — Human-validated, export-ready
flagged — Rejected or needs rework

Open Python

Every pipeline step is implemented in plain Python scripts and Jupyter notebooks. No proprietary tooling, no locked-in services.

Portable

Point the same pipeline at any Voyant CSV and any NLLB-200-supported language. Amharic, Swahili, isiZulu — same codebase.

Auditable

Every translation record carries its engine source, similarity score, QA flags, and full reviewer history — provenance-aware by design.

Ready to help validate translations?

Native or fluent speakers of underrepresented languages are welcome.

Apply to join Learn more