---
library_name: transformers
tags:
- yoruba
- tone-restoration
- diacritics
- nlp
- seq2seq
- mt5
- low-resource
- african-languages
---
# Model Card for JohnsonPedia01/mT5_base_yoruba_tone_restoration
This model is a fine-tuned version of google/mt5-base for automatic Yorùbá tone and diacritic restoration.
It restores missing tone marks in plain Yoruba text, improving readability and supporting downstream natural language processing (NLP) tasks.
## Model Details
### Model Description
This model is an mT5-base sequence-to-sequence Transformer fine-tuned specifically to restore tonal diacritics in Yoruba text. Given plain Yoruba text without accents, the model generates a toned version with appropriate diacritics.
The model is designed for low-resource language processing and can be used in text normalization, linguistic research, and speech-related applications.
- Developed by: Babarinde Johnson
- Funded by: N/A
- Shared by: JohnsonPedia01
- Model type: Seq2Seq (Text-to-Text Transformer)
- Language(s): Yoruba
- License: Apache-2.0
- Fine-tuned from: google/mt5-base
### Model Sources
- Repository: https://huggingface.co/JohnsonPedia01/mT5_base_yoruba_tone_restoration
- Base model: google/mt5-base
- Paper: N/A
- Demo: N/A
## Intended Uses
### Direct Use
- Automatic restoration of Yoruba diacritics
- Text normalization for Yoruba NLP tasks
- Improving readability of plain Yoruba text
### Downstream Use
- Post-processing for Automatic Speech Recognition (ASR) outputs (see the sketch after this list)
- Preprocessing for Text-to-Speech (TTS) systems
- Data cleaning for machine translation or language modeling
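As a concrete example of the ASR post-processing use case, the sketch below runs the restoration model over a few plain transcripts. It is a minimal sketch only: the transcript strings are hypothetical, and the `max_length=64` setting is an assumption chosen to match the evaluation setup described further down.

```python
from transformers import pipeline

# Load the restoration model as a text2text-generation pipeline.
restore = pipeline(
    "text2text-generation",
    model="JohnsonPedia01/mT5_base_yoruba_tone_restoration",
)

# Hypothetical plain (untoned) ASR transcripts.
asr_hypotheses = [
    "omo mi wa nita",
    "bawo ni o se wa",
]

# Restore diacritics transcript by transcript.
for plain in asr_hypotheses:
    toned = restore(plain, max_length=64)[0]["generated_text"]
    print(plain, "->", toned)
```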
### Out-of-Scope Use
- Languages other than Yoruba
- Heavily code-mixed text (e.g., Yoruba–English mixtures)
- Non-standard or highly informal spelling variants
## Evaluation
The model was evaluated on held-out Yoruba text using multiple automatic metrics.
Due to the nature of tone restoration, exact sentence matching is a strict metric and may underestimate real performance.
### Quantitative Results
| Metric | Score |
|---|---|
| Exact Match Accuracy | 23.68% |
| Character Error Rate (CER) | 5.64% |
| Word Error Rate (WER) | 24.76% |
| Diacritic Accuracy | 81.46% |
| Character-level Accuracy | 84.11% |
| Similarity Score | 95.66% |
| BLEU Score | 73.16% |
### Evaluation Setup
- Evaluation source: IGBÓ_OLÓDÙMARÈ.txt
- Input: Plain Yoruba text (lowercased, diacritics removed; see the sketch after this list)
- Target: Original toned Yoruba text
- Max sequence length: 64 tokens
- Evaluation type: Automatic (string- and character-level metrics)
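The snippet below is a minimal sketch of this setup: it strips diacritics to produce plain, lowercased inputs (assuming Unicode NFD decomposition) and computes string-level metrics with the `jiwer` and `sacrebleu` packages. These package choices and the example strings are assumptions for illustration; the exact evaluation script behind the reported numbers is not part of this card.

```python
import unicodedata

import sacrebleu
from jiwer import cer, wer


def strip_diacritics(text: str) -> str:
    """Lowercase and remove tone marks / under-dots via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).lower()


# Illustrative toned reference and a model prediction (hypothetical strings).
reference = "ọmọ mi wà níta"
prediction = "ọmọ mi wa níta"

print(strip_diacritics(reference))         # plain model input: "omo mi wa nita"
print("WER:", wer(reference, prediction))  # word error rate
print("CER:", cer(reference, prediction))  # character error rate
print("BLEU:", sacrebleu.corpus_bleu([prediction], [[reference]]).score)
```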
### Interpretation
- The high similarity score and BLEU score indicate strong agreement with reference texts.
- Low CER shows that the model preserves characters and structure accurately.
- High diacritic accuracy confirms effective tonal restoration.
- Lower exact match accuracy is expected due to multiple valid tonal realizations in Yoruba.
## Bias, Risks, and Limitations
- Performance may degrade on:
  - Rare words
  - Dialectal variants
  - Code-mixed Yoruba–English text
- The model may propagate biases present in the training data.
- Not suitable for languages other than Yoruba.
### Recommendations
- Human review is recommended for:
  - Educational materials
  - Published literature
  - Linguistic research
- Use as a preprocessing or assistive tool, not a final authority.
## How to Get Started
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "JohnsonPedia01/mT5_base_yoruba_tone_restoration"
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "JohnsonPedia01/mT5_base_yoruba_tone_restoration"
)

# Wrap both in a text-to-text generation pipeline.
yoruba_tone_pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Plain (untoned) Yoruba input; the model generates the toned version.
example = "omo mi wa nita nitoripe"
output = yoruba_tone_pipe(example)
print(output[0]["generated_text"])
```
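If you prefer to call the model directly instead of going through the pipeline, the following continues from the objects loaded above; the generation settings (`max_length=64`, beam search) are assumptions chosen to mirror the evaluation setup rather than documented defaults.

```python
# Tokenize the plain input and generate the toned output explicitly.
inputs = tokenizer(example, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```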