---
library_name: transformers
tags:
  - yoruba
  - tone-restoration
  - diacritics
  - nlp
  - seq2seq
  - mt5
  - low-resource
  - african-languages
---

Model Card for JohnsonPedia01/mT5_base_yoruba_tone_restoration

This model is a fine-tuned version of google/mt5-base for automatic Yorùbá tone and diacritic restoration.
It restores missing tone marks in plain Yoruba text, improving readability and supporting downstream natural language processing (NLP) tasks.


Model Details

Model Description

This model is an mT5-base sequence-to-sequence Transformer fine-tuned specifically to restore tonal diacritics in Yoruba text. Given plain Yoruba text without accents, the model generates a toned version with appropriate diacritics.

The model is designed for low-resource language processing and can be used in text normalization, linguistic research, and speech-related applications.

  • Developed by: Babarinde Johnson
  • Funded by: N/A
  • Shared by: JohnsonPedia01
  • Model type: Seq2Seq (Text-to-Text Transformer)
  • Language(s): Yoruba
  • License: Apache-2.0
  • Fine-tuned from: google/mt5-base

Model Sources

  • Repository: Hugging Face Model Hub
  • Base model: google/mt5-base
  • Paper: N/A
  • Demo: N/A

Intended Uses

Direct Use

  • Automatic restoration of Yoruba diacritics
  • Text normalization for Yoruba NLP tasks
  • Improving readability of plain Yoruba text

Downstream Use

  • Post-processing for Automatic Speech Recognition (ASR) outputs (see the sketch after this list)
  • Preprocessing for Text-to-Speech (TTS) systems
  • Data cleaning for machine translation or language modeling
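
As a sketch of the ASR post-processing case: the model can be run over untoned transcripts with the same pipeline shown in the How to Get Started section below. The transcript strings here are hypothetical placeholders, not real ASR output.

from transformers import pipeline

# Load the restoration model as a text-to-text pipeline
restorer = pipeline(
    "text2text-generation",
    model="JohnsonPedia01/mT5_base_yoruba_tone_restoration",
)

# Hypothetical untoned ASR transcripts (placeholder examples)
asr_transcripts = [
    "omo mi wa nita nitoripe",
    "bawo ni o se wa",
]

# Restore diacritics before handing the text to downstream components
for transcript in asr_transcripts:
    result = restorer(transcript, max_length=64)
    print(result[0]["generated_text"])

The same pattern applies to TTS front-ends and corpus cleaning: restore tones on each plain-text segment before it enters the downstream system.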

Out-of-Scope Use

  • Languages other than Yoruba
  • Heavily code-mixed text (e.g., Yoruba–English mixtures)
  • Non-standard or highly informal spelling variants

Evaluation

The model was evaluated on held-out Yoruba text using multiple automatic metrics.
Due to the nature of tone restoration, exact sentence matching is a strict metric and may underestimate real performance.

Quantitative Results

Metric                        Score
Exact Match Accuracy          23.68%
Character Error Rate (CER)     5.64%
Word Error Rate (WER)         24.76%
Diacritic Accuracy            81.46%
Character-level Accuracy      84.11%
Similarity Score              95.66%
BLEU Score                    73.16%
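
The exact evaluation scripts are not published. As a rough illustration, exact match, CER, and WER can be computed as in the minimal sketch below, assuming the jiwer package; the reference/prediction pair shown is a hypothetical placeholder, not data from the actual evaluation set.

import jiwer

# Hypothetical (reference, prediction) pair; the real evaluation used held-out text
references = ["ọmọ mi wà níta nítorí pé"]
predictions = ["ọmọ mi wa níta nítorí pé"]

# Exact match: fraction of sentences reproduced perfectly, including every diacritic
exact_match = sum(r == p for r, p in zip(references, predictions)) / len(references)

# Character- and word-level error rates between references and predictions
cer = jiwer.cer(references, predictions)
wer = jiwer.wer(references, predictions)

print(f"Exact match: {exact_match:.2%}  CER: {cer:.2%}  WER: {wer:.2%}")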

Evaluation Setup

  • Evaluation source: IGBÓ_OLÓDÙMARÈ.txt
  • Input: Plain Yoruba text (lowercased, diacritics removed; see the sketch after this list)
  • Target: Original toned Yoruba text
  • Max sequence length: 64 tokens
  • Evaluation type: Automatic (string- and character-level metrics)
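
The plain inputs described above can be derived from toned text with standard Unicode decomposition. The actual preprocessing script is not published, but a minimal sketch looks like this:

import unicodedata

def strip_diacritics(text: str) -> str:
    # Lowercase, decompose into base characters plus combining marks,
    # drop the combining marks (tone marks and under-dots), then recompose.
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("Ọmọ mi wà níta"))  # -> omo mi wa nita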

Interpretation

  • The high similarity score and BLEU score indicate strong agreement with reference texts.
  • Low CER shows that the model preserves characters and structure accurately.
  • High diacritic accuracy confirms effective tonal restoration.
  • Lower exact match accuracy is expected due to multiple valid tonal realizations in Yoruba.

Bias, Risks, and Limitations

  • Performance may degrade on:
    • Rare words
    • Dialectal variants
    • Code-mixed Yoruba–English text
  • The model may propagate biases present in the training data.
  • Not suitable for languages other than Yoruba.

Recommendations

  • Human review is recommended for:
    • Educational materials
    • Published literature
    • Linguistic research
  • Use as a preprocessing or assistive tool, not a final authority.

How to Get Started

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(
    "JohnsonPedia01/mT5_base_yoruba_tone_restoration"
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "JohnsonPedia01/mT5_base_yoruba_tone_restoration"
)

# Wrap the model in a text-to-text generation pipeline
yoruba_tone_pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=64,  # matches the maximum sequence length used during evaluation
)

# Plain (untoned) Yoruba input; the model returns a diacritized version
example = "omo mi wa nita nitoripe"
output = yoruba_tone_pipe(example)
print(output[0]["generated_text"])