Spam Detection for Social Media Text

Multilingual Indonesian & English | XLM-RoBERTa

This model is XLM-RoBERTa fine-tuned to classify social media text as spam or ham (legitimate content).
It supports Indonesian, several Indonesian regional languages, Malay, and English, which makes it suitable for moderation across platforms such as Twitter/X, Instagram, TikTok, Facebook, and online forums.


✨ Key Features

  • ✅ Spam vs Ham classification
  • 🌏 Multilingual support (Indonesian & English + regional languages)
  • 🧠 Based on XLM-RoBERTa (multilingual transformer)
  • ⚡ Ready-to-use with Hugging Face pipeline
  • 📊 Strong performance on noisy social media text (0.9451 validation accuracy)

🌍 Supported Languages

  • 🇮🇩 Indonesian (Bahasa Indonesia)
  • Malay (Bahasa Melayu)
  • Javanese (Basa Jawa)
  • Other Indonesian regional languages (Acehnese, Banjarese, Buginese, Minangkabau, Sundanese, etc.)
  • 🇬🇧 English

🧪 Model Performance

Metric           Score
Accuracy         0.9451
F1 (Macro)       0.9446
F1 (Weighted)    0.9500
Precision        0.9500
Recall           0.9500
Training Loss    0.1187
Validation Loss  0.2370

Evaluated on held-out validation data with a balanced spam/ham distribution.
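
To make the difference between the macro and weighted F1 rows concrete, the following minimal sketch computes both from a binary confusion matrix. The counts are hypothetical and chosen only for illustration; they are not this model's evaluation results, and the function names are not part of any library.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from its true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (macro F1, weighted F1) for a binary problem.

    tp/fp/fn/tn are counted with spam as the positive class.
    """
    f1_spam = f1(tp, fp, fn)
    # For the ham class the roles flip: tn are its true positives,
    # fn (spam predicted as ham) are its false positives,
    # and fp (ham predicted as spam) are its false negatives.
    f1_ham = f1(tn, fn, fp)
    n_spam = tp + fn  # support: number of true spam examples
    n_ham = tn + fp   # support: number of true ham examples
    macro = (f1_spam + f1_ham) / 2
    weighted = (f1_spam * n_spam + f1_ham * n_ham) / (n_spam + n_ham)
    return macro, weighted


# Hypothetical counts, NOT taken from this model's evaluation.
macro, weighted = macro_and_weighted_f1(tp=90, fp=5, fn=10, tn=95)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

When both classes have exactly the same number of examples, the two averages coincide; they drift apart as the class distribution becomes skewed.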


🚀 Quick Start

Installation

pip install transformers torch

Single Prediction

from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="nahiar/spam-detection-xlm-roberta-v1"
)

result = classifier("PASTI DIJAMIN WDP 100%")
print(result)

Output

[{'label': 'LABEL_1', 'score': 0.9876}]

Label Mapping

LABEL_0 → SPAM
LABEL_1 → HAM
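
Since the pipeline returns generic label names, a small wrapper can translate them for downstream code. This is a minimal sketch: `LABEL_NAMES` and `to_readable` are hypothetical names, and the mapping is copied from the table above.

```python
# Mapping copied from the model card's "Label Mapping" section.
LABEL_NAMES = {"LABEL_0": "SPAM", "LABEL_1": "HAM"}


def to_readable(prediction: dict) -> dict:
    """Return a copy of one pipeline prediction with a human-readable label."""
    raw = prediction["label"]
    # Fall back to the raw label if an unmapped name ever appears.
    return {**prediction, "label": LABEL_NAMES.get(raw, raw)}


# Uses the sample output shown above, so no model download is needed.
print(to_readable({"label": "LABEL_1", "score": 0.9876}))
# {'label': 'HAM', 'score': 0.9876}
```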

📦 Batch Inference Example

# Triple-quoted strings keep the line breaks of the original posts.
texts = [
    """साइबर हमले के बाद JLR का बड़ा बयान - जानें कंपनी ने क्या कहा | Tata Motors के शेयर पर दिखेगा असर?

#TataMotors #JLR #CyberAttack

https://t.co/6WlGS77UUp""",
    """Kita sudah Ready skrg ini bagi yang memerlukan jasa pemulihan akun & Hapus All akun

 Lacak lokasi / sadap wa / Hack Akun / Revengeporn - korban pemerasan vcs / terror

TIKTOK,GMAIL,TWITER,TELEGRAM,
FACEBOOK,INSTAGRAM
#revengeporn #zonauangᅠᅠᅠ
 ☎️ https://t.co/K0AbW08qnU https://t.co/4IpWNA7a0z""",
    """💥Slot Gacor Hari ini Rute303
💥Jaminan Jackpot Maxwin malam ini

LINK SLOT GACOR HARI INI : https://t.co/QvxjCAnt8o

Tags:
Jumbo #timsekop Jumat gratis ongkir Like Crazy PSIM https://t.co/ukuRdlvgGA""",
]

results = classifier(texts)

for text, result in zip(texts, results):
    print(f"{text} -> {result['label']} ({result['score']:.4f})")
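
Because the sample posts span several lines, printing them whole makes the results hard to scan. One possible approach is a small display helper that collapses whitespace and truncates the text. `preview` is a hypothetical name, and the result below is a stand-in with a made-up score rather than real model output.

```python
def preview(text: str, max_chars: int = 60) -> str:
    """Collapse all whitespace runs to single spaces and truncate for display."""
    flat = " ".join(text.split())
    if len(flat) <= max_chars:
        return flat
    return flat[: max_chars - 3] + "..."


# Stand-in values shaped like the pipeline's output; the score is made up.
sample_text = "Slot Gacor Hari ini\n\nLINK SLOT GACOR HARI INI : https://t.co/QvxjCAnt8o"
sample_result = {"label": "LABEL_0", "score": 0.99}
print(f"{preview(sample_text, 40)} -> {sample_result['label']} ({sample_result['score']:.4f})")
```

In the loop above, `preview(text)` could replace `text` inside the f-string.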

🏗️ Training Configuration

Parameter           Value
Base Model          xlm-roberta-base
Training Samples    11,958
Validation Samples  2,989
Epochs              3
Learning Rate       2e-5
Batch Size          16
Training Date       2025-12-15
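
As a quick sanity check on these numbers, the sketch below derives the number of optimizer steps and the validation share of the data. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; none of these details are stated in the model card.

```python
import math

# Values from the Training Configuration table.
train_samples = 11_958
val_samples = 2_989
epochs = 3
batch_size = 16

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
val_fraction = val_samples / (train_samples + val_samples)

print(steps_per_epoch, total_steps, f"{val_fraction:.3f}")
# 748 2244 0.200
```

Under these assumptions the split is roughly 80/20 between training and validation data.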

🎯 Intended Use Cases

  • Social media spam moderation
  • Comment & post filtering
  • Content quality control
  • Pre-filtering for sentiment or topic analysis pipelines
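
The pre-filtering use case can be sketched as a function that drops predicted spam before texts reach a downstream pipeline. This is an illustration under assumptions: `drop_spam` and the 0.5 threshold are hypothetical, and `fake_classifier` is a stub standing in for the real pipeline so the example runs offline.

```python
from typing import Callable, Iterable

SPAM_LABEL = "LABEL_0"  # per the model card's label mapping


def drop_spam(
    texts: Iterable[str],
    classify: Callable[[list[str]], list[dict]],
    threshold: float = 0.5,
) -> list[str]:
    """Keep texts unless they are predicted spam with score >= threshold."""
    texts = list(texts)
    predictions = classify(texts)
    return [
        text
        for text, pred in zip(texts, predictions)
        if not (pred["label"] == SPAM_LABEL and pred["score"] >= threshold)
    ]


def fake_classifier(batch: list[str]) -> list[dict]:
    """Stub standing in for the real pipeline; flags anything mentioning 'slot'."""
    return [
        {"label": "LABEL_0" if "slot" in t.lower() else "LABEL_1", "score": 0.9}
        for t in batch
    ]


print(drop_spam(["Slot gacor hari ini", "Selamat pagi semua"], fake_classifier))
# ['Selamat pagi semua']
```

In real use, the `classifier` object from the Quick Start section would be passed in place of `fake_classifier`.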

⚠️ Limitations

  • Binary classification only (Spam / Ham)
  • Not optimized for formal text outside social media
  • Performance may degrade on very short or ambiguous messages
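
One way to work around the last limitation is to route very short or low-confidence messages to manual review instead of acting on the prediction. The sketch below is only an illustration: `triage` is a hypothetical helper, and the `min_tokens` and `min_score` defaults are arbitrary values that would need tuning on real data.

```python
def triage(text: str, prediction: dict, min_tokens: int = 3, min_score: float = 0.8) -> str:
    """Return 'spam', 'ham', or 'review' for one text and its pipeline prediction."""
    too_short = len(text.split()) < min_tokens  # whitespace tokens, not model tokens
    unsure = prediction["score"] < min_score
    if too_short or unsure:
        return "review"
    # LABEL_0 is spam according to the model card's label mapping.
    return "spam" if prediction["label"] == "LABEL_0" else "ham"


# Made-up predictions shaped like the pipeline's output.
print(triage("ok", {"label": "LABEL_1", "score": 0.99}))
print(triage("Slot gacor hari ini maxwin", {"label": "LABEL_0", "score": 0.95}))
```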

📜 License

Released under the Apache 2.0 License. Free for commercial and research use.


📚 Citation

If you use this model in your work, please cite:

@misc{djunaedi2025spam,
  author    = {Raihan Hidayatullah Djunaedi},
  title     = {Spam Detection for Social Media Text},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/nahiar/spam-detection-xlm-roberta-v1}
}

🙌 Acknowledgements

  • Hugging Face Transformers
  • Facebook AI Research — XLM-RoBERTa