About the Project
An AI-driven, reproducible workflow for translating digital humanities tools for underrepresented languages — starting with Yoruba.
The problem
Limited NLP resources
Underrepresented languages lack the parallel corpora, pre-trained models, and evaluation benchmarks that well-resourced languages enjoy.
No localisation framework
Community-based efforts do not scale on their own, and no transparent, open, replicable workflow existed for localising DH software into underrepresented languages.
Exclusion from DH tools
Voyant Tools — widely used for text analysis in DH — lacked interfaces for many languages, excluding millions of speakers from a major research tool.
Our approach
AI-assisted, human-validated
We combine multiple translation engines — NLLB-200, OpenAI, and optionally Ollama — to generate machine drafts, then apply back-translation and XLM-R semantic similarity scoring to verify quality before a human reviewer approves each string.
The result: a provenance-aware, auditable localisation — every string carries its source engine, similarity score, and review history.
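The gate described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the threshold value, the `review_record` helper, and the field names are assumptions, and in the real pipeline the embedding vectors would come from an XLM-R sentence encoder applied to the source string and its back-translation.

```python
import math

# Similarity threshold below which a draft is flagged for human review.
# The value 0.85 is illustrative, not the project's actual setting.
THRESHOLD = 0.85

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def review_record(source, draft, engine, emb_source, emb_backtrans):
    """Build a provenance-aware record for one UI string.

    emb_source / emb_backtrans stand in for XLM-R embeddings of the
    source string and the back-translation of the machine draft.
    """
    score = cosine_similarity(emb_source, emb_backtrans)
    return {
        "source": source,
        "draft": draft,              # machine draft awaiting human review
        "engine": engine,            # e.g. "nllb-200" or "openai"
        "similarity": round(score, 3),
        "status": "auto-ok" if score >= THRESHOLD else "needs-review",
    }

# Toy vectors only; the draft text here is a placeholder, not a real translation.
rec = review_record("Export", "<draft>", "nllb-200",
                    [0.1, 0.9, 0.2], [0.12, 0.88, 0.19])
print(rec["status"])
```

Every record keeps the source engine and score alongside the string, which is what makes the final localisation auditable: a reviewer (or a later reader) can see why each string was accepted.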
Reproducible by design
Every step is implemented in open-source Python, with no proprietary tooling and no locked-in services. The workflow, scripts, and datasets are shared publicly so any DH scholar can replicate or extend the pipeline for any underrepresented language supported by NLLB-200.
Implemented in a Jupyter Notebook environment, requiring only an internet connection and (optionally) a GPU for the NLLB model.
AI models used
NLLB-200
Meta AI's 200-language model. Primary engine for translation drafts. Runs locally; no API key required.
OpenAI / Ollama
Used for alternative drafts, cross-model comparison, and terminology suggestions. Pluggable — swap in any model via environment variable.
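The environment-variable swap might look like the following sketch. The variable name `TRANSLATION_ENGINE` and the engine registry are illustrative assumptions; the stub lambdas stand in for real NLLB, OpenAI, and Ollama calls.

```python
import os

# Registry of draft engines; each entry is a callable taking (text, target_lang).
# These stubs stand in for real NLLB-200 / OpenAI / Ollama backends.
ENGINES = {
    "nllb": lambda text, lang: f"[nllb draft of {text!r} -> {lang}]",
    "openai": lambda text, lang: f"[openai draft of {text!r} -> {lang}]",
    "ollama": lambda text, lang: f"[ollama draft of {text!r} -> {lang}]",
}

def get_engine():
    """Select a translation engine from the environment, defaulting to NLLB."""
    name = os.environ.get("TRANSLATION_ENGINE", "nllb").lower()
    if name not in ENGINES:
        raise ValueError(f"unknown engine: {name}")
    return ENGINES[name]

os.environ["TRANSLATION_ENGINE"] = "ollama"
draft = get_engine()("Export", "yor_Latn")
print(draft)
```

Because every backend exposes the same callable shape, cross-model comparison is just a loop over the registry with the same input string.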
XLM-R
Cross-lingual sentence embeddings (via sentence-transformers) for back-translation similarity scoring. Detects semantic drift automatically.
AfroLingu-MT
UBC-NLP's multilingual MT benchmark dataset, used for evaluating translation quality on underrepresented language pairs beyond Voyant strings.
Why this matters
Even when full NLP analysis for a language is not yet available inside Voyant, multilingual interface "skins" are a meaningful first step. A scholar working in Yoruba should be able to navigate the tool in their own language — this reduces the cognitive overhead of digital humanities work and sends a clear signal that these languages belong in scholarly infrastructure.
This project is a template. The same pipeline can be pointed at any Voyant-supported export CSV and any target language supported by NLLB-200 or the chosen LLM. We invite scholars, students, and community members to extend it to any underrepresented language.
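Retargeting the template could be as simple as the sketch below. The two-column `key,en` layout of the export file is an assumption for illustration, not Voyant's documented format; only the idea (read key/source-text pairs, feed them to the pipeline) carries over.

```python
import csv
import io

# A tiny stand-in for a UI-strings export; real exports would be read
# from a file, and the column names here are assumed for illustration.
SAMPLE_CSV = """key,en
export.title,Export
corpus.terms,Terms
"""

def load_strings(fileobj):
    """Yield (key, source_text) pairs to feed into the translation pipeline."""
    return [(row["key"], row["en"]) for row in csv.DictReader(fileobj)]

strings = load_strings(io.StringIO(SAMPLE_CSV))
print(strings)
```

Swapping the target language then means changing only the NLLB-200 language code (or the chosen LLM's prompt), not the loader.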
Presentation overview
- Introduce the general problem of translating digital humanities resources.
- Describe how Voyant handles different language interfaces and the challenge of adding new ones.
- Walk through the AI-assisted workflow with its human-in-the-loop validation phase.
- Discuss specialised language models for underrepresented languages (NLLB-200, AfroLingu-MT).
- Call for more translated DH tools and low-resource language utilities.
Future directions
More languages
Expand beyond Yoruba to Hausa, Igbo, Swahili, Amharic, isiZulu, and more — using the same pipeline.
Domain glossaries
Build controlled terminology lists per language to enforce consistency across related UI strings.
Deeper NLP support
Work towards enabling Voyant's analysis capabilities — not just the interface — for underrepresented language texts.