Multi-class Offensive Language Detection in Roman Urdu
DOI: https://doi.org/10.21015/vtse.v13i4.2251
Abstract
Automated hate speech detection systems play a crucial role in curbing the spread of hateful content, especially as social media user bases continue to grow. Recent research has focused on building datasets for this task, but most target English, leaving low-resource languages such as Roman Urdu underserved. To improve hate speech identification in Roman Urdu, various machine learning and deep learning models were trained on RUSHOLD, a publicly available dataset of Roman Urdu tweets. Both machine learning and deep learning techniques were applied to multi-class classification, while binary classification was restricted to deep learning methods. Because the dataset is imbalanced, with some classes having fewer instances than others, the Synthetic Minority Over-sampling Technique (SMOTE) was used to address the disparity. The findings indicate that the machine learning model outperforms the deep learning model in recall, F1, and macro F1.
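The abstract does not describe the feature extraction or the classifiers in detail, so the following is only a minimal, standard-library sketch of the two ideas it names: SMOTE-style oversampling (new minority samples created by interpolating between existing minority neighbours) and macro F1 (the unweighted mean of per-class F1). The function names, the toy feature vectors, and the labels "a" and "b" are hypothetical and are not taken from the paper. In practice one would more likely rely on `imblearn.over_sampling.SMOTE` and `sklearn.metrics.f1_score(average="macro")`.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: three minority-class feature vectors.
minority_vectors = [[0.0, 1.0], [0.2, 0.8], [0.1, 1.2]]
new_points = smote_oversample(minority_vectors, n_new=4, k=2)

# A classifier that always predicts the majority class "a" reaches 75%
# accuracy here, but macro F1 exposes the missed minority class "b".
score = macro_f1(["a", "a", "a", "b"], ["a", "a", "a", "a"])
```

Macro F1 is a natural headline metric for an imbalanced multi-class task because every class contributes equally to the average, regardless of how many tweets it contains. Oversampling should be applied to the training split only, so that synthetic points never leak into the evaluation data.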
License
This work is licensed under a Creative Commons Attribution (CC BY) License.