---
library_name: transformers
tags:
  - yoruba
  - tone-restoration
  - diacritics
  - nlp
  - seq2seq
  - mt5
  - low-resource
  - african-languages
---

Model Card for JohnsonPedia01/mT5_base_yoruba_tone_restoration

This model is a fine-tuned version of google/mt5-base for automatic Yorùbá tone and diacritic restoration.
It restores missing tone marks in plain Yoruba text, improving readability and supporting downstream natural language processing (NLP) tasks.


Model Details

Model Description

This model is an mT5-base sequence-to-sequence Transformer fine-tuned specifically to restore tonal diacritics in Yoruba text. Given plain Yoruba text without accents, the model generates a toned version with appropriate diacritics.

The model is designed for low-resource language processing and can be used in text normalization, linguistic research, and speech-related applications.

  • Developed by: Babarinde Johnson
  • Funded by: N/A
  • Shared by: JohnsonPedia01
  • Model type: Seq2Seq (Text-to-Text Transformer)
  • Language(s): Yoruba
  • License: Apache-2.0
  • Fine-tuned from: google/mt5-base

Model Sources

  • Repository: Hugging Face Model Hub
  • Base model: google/mt5-base
  • Paper: N/A
  • Demo: N/A

Intended Uses

Direct Use

  • Automatic restoration of Yoruba diacritics
  • Text normalization for Yoruba NLP tasks
  • Improving readability of plain Yoruba text

Downstream Use

  • Post-processing for Automatic Speech Recognition (ASR) outputs (see the sketch after this list)
  • Preprocessing for Text-to-Speech (TTS) systems
  • Data cleaning for machine translation or language modeling
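
As a sketch of the ASR post-processing case: the model can be run over untoned transcripts with the same pipeline shown in the How to Get Started section below. The transcript strings here are hypothetical placeholders, not real ASR output.

from transformers import pipeline

# Load the restoration model as a text-to-text pipeline
restorer = pipeline(
    "text2text-generation",
    model="JohnsonPedia01/mT5_base_yoruba_tone_restoration",
)

# Hypothetical untoned ASR transcripts (placeholder examples)
asr_transcripts = [
    "omo mi wa nita nitoripe",
    "bawo ni o se wa",
]

# Restore diacritics before handing the text to downstream components
for transcript in asr_transcripts:
    result = restorer(transcript, max_length=64)
    print(result[0]["generated_text"])

The same pattern applies to TTS front-ends and corpus cleaning: restore tones on each plain-text segment before it enters the downstream system.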

Out-of-Scope Use

  • Languages other than Yoruba
  • Heavily code-mixed text (e.g., Yoruba–English mixtures)
  • Non-standard or highly informal spelling variants

Evaluation

The model was evaluated on held-out Yoruba text using multiple automatic metrics.
Due to the nature of tone restoration, exact sentence matching is a strict metric and may underestimate real performance.

Quantitative Results

Metric                        Score
Exact Match Accuracy          23.68%
Character Error Rate (CER)     5.64%
Word Error Rate (WER)         24.76%
Diacritic Accuracy            81.46%
Character-level Accuracy      84.11%
Similarity Score              95.66%
BLEU Score                    73.16%
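
The exact evaluation scripts are not published. As a rough illustration, exact match, CER, and WER can be computed as in the minimal sketch below, assuming the jiwer package; the reference/prediction pair shown is a hypothetical placeholder, not data from the actual evaluation set.

import jiwer

# Hypothetical (reference, prediction) pair; the real evaluation used held-out text
references = ["ọmọ mi wà níta nítorí pé"]
predictions = ["ọmọ mi wa níta nítorí pé"]

# Exact match: fraction of sentences reproduced perfectly, including every diacritic
exact_match = sum(r == p for r, p in zip(references, predictions)) / len(references)

# Character- and word-level error rates between references and predictions
cer = jiwer.cer(references, predictions)
wer = jiwer.wer(references, predictions)

print(f"Exact match: {exact_match:.2%}  CER: {cer:.2%}  WER: {wer:.2%}")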

Evaluation Setup

  • Evaluation source: IGBÓ_OLÓDÙMARÈ.txt
  • Input: Plain Yoruba text (lowercased, diacritics removed; see the sketch after this list)
  • Target: Original toned Yoruba text
  • Max sequence length: 64 tokens
  • Evaluation type: Automatic (string- and character-level metrics)
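
The plain inputs described above can be derived from toned text with standard Unicode decomposition. The actual preprocessing script is not published, but a minimal sketch looks like this:

import unicodedata

def strip_diacritics(text: str) -> str:
    # Lowercase, decompose into base characters plus combining marks,
    # drop the combining marks (tone marks and under-dots), then recompose.
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("Ọmọ mi wà níta"))  # -> omo mi wa nita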

Interpretation

  • The high similarity score and BLEU score indicate strong agreement with reference texts.
  • Low CER shows that the model preserves characters and structure accurately.
  • High diacritic accuracy confirms effective tonal restoration.
  • Lower exact match accuracy is expected due to multiple valid tonal realizations in Yoruba.

Bias, Risks, and Limitations

  • Performance may degrade on:
    • Rare words
    • Dialectal variants
    • Code-mixed Yoruba–English text
  • The model may propagate biases present in the training data.
  • Not suitable for languages other than Yoruba.

Recommendations

  • Human review is recommended for:
    • Educational materials
    • Published literature
    • Linguistic research
  • Use as a preprocessing or assistive tool, not a final authority.

How to Get Started

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(
    "JohnsonPedia01/mT5_base_yoruba_tone_restoration"
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "JohnsonPedia01/mT5_base_yoruba_tone_restoration"
)

# Wrap the model in a text-to-text generation pipeline
yoruba_tone_pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=64,  # matches the maximum sequence length used during evaluation
)

# Plain (untoned) Yoruba input; the model returns a diacritized version
example = "omo mi wa nita nitoripe"
output = yoruba_tone_pipe(example)
print(output[0]["generated_text"])