Multi-class Offensive Language Detection in Roman Urdu

Rida Ayesha; Sarah Ali; Usman Inayat; Sajid Mahmood

doi:10.21015/vtse.v13i4.2251

Authors

Rida Ayesha University of Management and Technology https://orcid.org/0009-0004-1433-2417
Sarah Ali University of Management and Technology https://orcid.org/0009-0004-0338-8941
Usman Inayat University of Management and Technology https://orcid.org/0000-0001-8397-9995
Sajid Mahmood University of Management and Technology https://orcid.org/0000-0003-0684-0185

DOI:

https://doi.org/10.21015/vtse.v13i4.2251

Abstract

Automated hate speech detection systems play a crucial role in curbing the spread of hateful content, especially as social media user bases continue to grow. Recent research has focused on creating datasets for this purpose, but most of them target English, leaving low-resource languages such as Roman Urdu underserved. To improve hate speech identification in Roman Urdu, various machine learning and deep learning models were trained on RUSHOLD, a publicly available dataset of Roman Urdu tweets. Both machine learning and deep learning techniques were applied to multi-class classification, while binary classification was restricted to deep learning methods. Because the dataset is imbalanced, with some classes having fewer instances than others, SMOTE (Synthetic Minority Over-sampling Technique) was used to address the disparity. The findings indicate that the machine learning model outperformed the deep learning model in recall, F1, and macro F1.
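To make the oversampling step concrete, the following is a minimal sketch of the idea behind SMOTE: a synthetic minority-class point is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. It is an illustration only and not the authors' implementation. The function name `smote_sketch`, the toy two-dimensional vectors, the class counts, and the parameters `k` and `seed` are all assumptions; how the tweets were vectorised is not stated here.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_sketch(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Hypothetical helper for illustration; not a library API.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbours among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate: the new point lies on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (assumed, not taken from RUSHOLD): 3 minority vectors, 8 majority samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
majority_count = 8

# Generate enough synthetic points to match the majority class size.
new_points = smote_sketch(minority, n_new=majority_count - len(minority))
balanced_minority = minority + new_points
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.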

