About the Project
An AI-driven, reproducible workflow for translating digital humanities tools for underrepresented languages — starting with Yoruba.
The problem
Limited NLP resources
Underrepresented languages lack the parallel corpora, pre-trained models, and evaluation benchmarks that well-resourced languages enjoy.
No localisation framework
Community-based efforts do not scale on their own, and no transparent, open, replicable workflow existed for localising DH software into underrepresented languages.
Exclusion from DH tools
Voyant Tools — widely used for text analysis in DH — lacked interfaces for many languages, excluding millions of speakers from a major research tool.
Our approach
AI-assisted, human-validated
We combine multiple translation engines — NLLB-200, OpenAI, and optionally Ollama — to generate machine drafts, then apply back-translation and XLM-R semantic similarity scoring to verify quality before a human reviewer approves each string.
The result: a provenance-aware, auditable localisation — every string carries its source engine, similarity score, and review history.
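The gate described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the threshold value, the `review_record` helper, and the field names are assumptions, and in the real pipeline the embedding vectors would come from an XLM-R sentence encoder applied to the source string and its back-translation.

```python
import math

# Similarity threshold below which a draft is flagged for human review.
# The value 0.85 is illustrative, not the project's actual setting.
THRESHOLD = 0.85

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def review_record(source, draft, engine, emb_source, emb_backtrans):
    """Build a provenance-aware record for one UI string.

    emb_source / emb_backtrans stand in for XLM-R embeddings of the
    source string and the back-translation of the machine draft.
    """
    score = cosine_similarity(emb_source, emb_backtrans)
    return {
        "source": source,
        "draft": draft,              # machine draft awaiting human review
        "engine": engine,            # e.g. "nllb-200" or "openai"
        "similarity": round(score, 3),
        "status": "auto-ok" if score >= THRESHOLD else "needs-review",
    }

# Toy vectors only; the draft text here is a placeholder, not a real translation.
rec = review_record("Export", "<draft>", "nllb-200",
                    [0.1, 0.9, 0.2], [0.12, 0.88, 0.19])
print(rec["status"])
```

Every record keeps the source engine and score alongside the string, which is what makes the final localisation auditable: a reviewer (or a later reader) can see why each string was accepted.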
Reproducible by design
Every step is implemented in open-source Python, with no proprietary tooling and no locked-in services. The workflow, scripts, and datasets are shared publicly so any DH scholar can replicate or extend the pipeline for any underrepresented language supported by NLLB-200.
Implemented in a Jupyter Notebook environment, requiring only an internet connection and (optionally) a GPU for the NLLB model.
AI models used
NLLB-200
Meta AI's 200-language model. Primary engine for translation drafts. Runs locally; no API key required.
OpenAI / Ollama
Used for alternative drafts, cross-model comparison, and terminology suggestions. Pluggable — swap in any model via environment variable.
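The environment-variable swap might look like the following sketch. The variable name `TRANSLATION_ENGINE` and the engine registry are illustrative assumptions; the stub lambdas stand in for real NLLB, OpenAI, and Ollama calls.

```python
import os

# Registry of draft engines; each entry is a callable taking (text, target_lang).
# These stubs stand in for real NLLB-200 / OpenAI / Ollama backends.
ENGINES = {
    "nllb": lambda text, lang: f"[nllb draft of {text!r} -> {lang}]",
    "openai": lambda text, lang: f"[openai draft of {text!r} -> {lang}]",
    "ollama": lambda text, lang: f"[ollama draft of {text!r} -> {lang}]",
}

def get_engine():
    """Select a translation engine from the environment, defaulting to NLLB."""
    name = os.environ.get("TRANSLATION_ENGINE", "nllb").lower()
    if name not in ENGINES:
        raise ValueError(f"unknown engine: {name}")
    return ENGINES[name]

os.environ["TRANSLATION_ENGINE"] = "ollama"
draft = get_engine()("Export", "yor_Latn")
print(draft)
```

Because every backend exposes the same callable shape, cross-model comparison is just a loop over the registry with the same input string.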
XLM-R
Cross-lingual sentence embeddings (via sentence-transformers) for back-translation similarity scoring. Detects semantic drift automatically.
AfroLingu-MT
UBC-NLP's multilingual MT benchmark dataset, used for evaluating translation quality on underrepresented language pairs beyond Voyant strings.
Why this matters
Even when full NLP analysis for a language is not yet available inside Voyant, multilingual interface "skins" are a meaningful first step. A scholar working in Yoruba should be able to navigate the tool in their own language — this reduces the cognitive overhead of digital humanities work and sends a clear signal that these languages belong in scholarly infrastructure.
This project is a template. The same pipeline can be pointed at any Voyant-supported export CSV and any target language supported by NLLB-200 or the chosen LLM. We invite scholars, students, and community members to extend it to any underrepresented language.
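Retargeting the template could be as simple as the sketch below. The two-column `key,en` layout of the export file is an assumption for illustration, not Voyant's documented format; only the idea (read key/source-text pairs, feed them to the pipeline) carries over.

```python
import csv
import io

# A tiny stand-in for a UI-strings export; real exports would be read
# from a file, and the column names here are assumed for illustration.
SAMPLE_CSV = """key,en
export.title,Export
corpus.terms,Terms
"""

def load_strings(fileobj):
    """Yield (key, source_text) pairs to feed into the translation pipeline."""
    return [(row["key"], row["en"]) for row in csv.DictReader(fileobj)]

strings = load_strings(io.StringIO(SAMPLE_CSV))
print(strings)
```

Swapping the target language then means changing only the NLLB-200 language code (or the chosen LLM's prompt), not the loader.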
Presentation overview
- Introduce the general problem of translating digital humanities resources.
- Describe how Voyant handles different language interfaces and the challenge of adding new ones.
- Walk through the AI-assisted workflow with its human-in-the-loop validation phase.
- Discuss specialised language models for underrepresented languages (NLLB-200, AfroLingu-MT).
- Call for more translated DH tools and low-resource language utilities.
Future directions
More languages
Expand beyond Yoruba to Hausa, Igbo, Swahili, Amharic, isiZulu, and more — using the same pipeline.
Domain glossaries
Build controlled terminology lists per language to enforce consistency across related UI strings.
Deeper NLP support
Work towards enabling Voyant's analysis capabilities — not just the interface — for underrepresented language texts.