Next Word Prediction for Urdu using Deep Learning Techniques

DOI:

https://doi.org/10.21015/vtse.v13i1.2044

Abstract

A language model for next-word prediction is a probabilistic representation of a natural language that uses text corpora to estimate word probabilities. Such models play a crucial role in text generation, machine translation, and question-answering applications. The focus of this study is to develop an improved next-word prediction algorithm for Urdu. The study implements deep learning models, including RNN, LSTM, and Bi-LSTM, on subsets of the Ur-Mono Urdu corpus containing 3,000 and 5,000 sentences. To prepare the data for experimentation, tokenization and stemming are applied as data cleaning techniques. The RNN model achieved 87% accuracy on the first 3,000 sentences of the Ur-Mono dataset and 84% accuracy on the first 5,000 sentences. In conclusion, when the corpus is small, the RNN outperforms both the LSTM and the Bi-LSTM; however, as the corpus size increases, the Bi-LSTM exhibits superior performance compared to both the RNN and the LSTM.
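The pipeline the abstract describes — tokenizing sentences, building fixed-length context windows, and scoring the next word with a recurrent network — can be sketched in pure Python. This is an illustrative toy, not the paper's implementation: the sentences are English placeholders standing in for Ur-Mono Urdu text, stemming is omitted, and the Elman-style RNN weights are random and untrained, so the output distribution is near-uniform rather than a learned prediction.

```python
import math
import random

# Toy stand-in corpus (the paper uses Urdu sentences from Ur-Mono).
sentences = [
    "the model predicts the next word",
    "the model learns word probabilities",
]

# Tokenize and build a word-to-index vocabulary.
tokens = [s.split() for s in sentences]
vocab = sorted({w for sent in tokens for w in sent})
word_to_id = {w: i for i, w in enumerate(vocab)}

# Build (context, next-word) training pairs with a fixed context length.
CONTEXT = 2
pairs = []
for sent in tokens:
    ids = [word_to_id[w] for w in sent]
    for i in range(CONTEXT, len(ids)):
        pairs.append((ids[i - CONTEXT:i], ids[i]))

# Minimal Elman-style RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1}), then a
# softmax over the vocabulary gives P(next word | context).
V, H = len(vocab), 8
random.seed(0)
W_xh = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(H)]
W_hh = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(H)]
W_hy = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(V)]

def forward(context_ids):
    h = [0.0] * H
    for wid in context_ids:                 # feed one word per time step
        x = [1.0 if j == wid else 0.0 for j in range(V)]  # one-hot input
        h = [math.tanh(sum(W_xh[i][j] * x[j] for j in range(V)) +
                       sum(W_hh[i][k] * h[k] for k in range(H)))
             for i in range(H)]
    logits = [sum(W_hy[v][i] * h[i] for i in range(H)) for v in range(V)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]            # probability per vocabulary word

probs = forward(pairs[0][0])                # distribution over next words
```

Training would adjust the three weight matrices by backpropagation through time; the LSTM and Bi-LSTM variants in the study replace the plain tanh recurrence with gated cells (and, for Bi-LSTM, a second pass over the sequence in reverse).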

References

A. Rianti, S. Widodo, A. Ayuningtyas, and B. Hermawan, “Next word prediction using LSTM,” *J. Inf. Technol. Its Util.*, vol. 5, 2022. DOI: https://doi.org/10.56873/jitu.5.1.4748

M. Saeed et al., “Effective word prediction in Urdu language using stochastic model,” *Sukkur IBA J. Comput. Math. Sci.*, vol. 2, no. 2, pp. 38–46, 2018. DOI: https://doi.org/10.30537/sjcms.v2i2.304

M. F. Ullah, A. Saeed, and N. Hussain, “Comparison of pre-trained vs custom-trained word embedding models for word sense disambiguation,” ADCAIJ Adv. Distrib. Comput. Artif. Intell. J., vol. 12, no. 1, p. 31084, 2023. DOI: https://doi.org/10.14201/adcaij.31084

H. Kour and N. K. Gondhi, “Machine learning approaches for Nastaliq style Urdu handwritten recognition: A survey,” in *Proc. 6th Int. Conf. Adv. Comput. Commun. Syst. (ICACCS)*, 2020, pp. 50–54. DOI: https://doi.org/10.1109/ICACCS48705.2020.9074294

L. Khan et al., “Urdu sentiment analysis with deep learning methods,” *IEEE Access*, vol. 9, pp. 97803–97812, 2021. DOI: https://doi.org/10.1109/ACCESS.2021.3093078

A. Daud, W. Khan, and D. Che, “Urdu language processing: A survey,” *Artif. Intell. Rev.*, vol. 47, pp. 279–311, 2017. DOI: https://doi.org/10.1007/s10462-016-9482-x

Y. Gong, L. Hua, and S. Wang, “Leveraging user’s performance in reporting patient safety events by utilizing text prediction in narrative data entry,” *Comput. Methods Programs Biomed.*, vol. 131, pp. 181–189, 2016. DOI: https://doi.org/10.1016/j.cmpb.2016.03.031

K. Trnka et al., “User interaction with word prediction: The effects of prediction quality,” *ACM Trans. Access. Comput. (TACCESS)*, vol. 1, no. 3, pp. 1–34, 2009. DOI: https://doi.org/10.1145/1497302.1497307

R. Alarcón, L. Moreno, and P. Martínez, “Lexical simplification system to improve web accessibility,” *IEEE Access*, vol. 9, pp. 58755–58767, 2021. DOI: https://doi.org/10.1109/ACCESS.2021.3072697

M. Alkhatib, A. A. Monem, and K. Shaalan, “Deep learning for Arabic error detection and correction,” *ACM Trans. Asian Low-Resour. Lang. Inf. Process. (TALLIP)*, vol. 19, no. 5, pp. 1–13, 2020. DOI: https://doi.org/10.1145/3373266

M. F. Assaneo, P. Ripollés, J. Orpella, W. M. Lin, R. de Diego-Balaguer, and D. Poeppel, “Spontaneous synchronization to speech reveals neural mechanisms facilitating language learning,” *Nat. Neurosci.*, vol. 22, no. 4, pp. 627–632, 2019. DOI: https://doi.org/10.1038/s41593-019-0353-z

O. Van Laere, M. Strobbe, P. Leroux, B. Dhoedt, F. De Turck, and P. Demeester, “Enabling platform for mobile content generation based on 2D barcodes,” in *Proc. Int. Conf. Internet Comput.*, 2008, pp. 209–214.

S. Babar and P. D. Patil, “Improving performance of text summarization,” *Procedia Comput. Sci.*, vol. 46, pp. 354–363, 2015. DOI: https://doi.org/10.1016/j.procs.2015.02.031

R. Sennrich and B. Haddow, “Linguistic input features improve neural machine translation,” *arXiv preprint arXiv:1606.02892*, 2016. DOI: https://doi.org/10.18653/v1/W16-2209

S. Ghosh, M. Dutta, and T. Das, “Indian legal text summarization: A text normalization-based approach,” in *Proc. IEEE 19th India Council Int. Conf. (INDICON)*, 2022, pp. 1–4. DOI: https://doi.org/10.1109/INDICON56171.2022.10039891

P. Kalamkar, A. Tiwari, A. Agarwal, S. Karn, S. Gupta, V. Raghavan, and A. Modi, “Corpus for automatic structuring of legal documents,” *arXiv preprint arXiv:2201.13125*, 2022.

R. Sharma, N. Goel, N. Aggarwal, P. Kaur, and C. Prakash, “Next word prediction in Hindi using deep learning techniques,” in *Proc. Int. Conf. Data Sci. Eng. (ICDSE)*, 2019, pp. 55–60. DOI: https://doi.org/10.1109/ICDSE47409.2019.8971796

J. M. Chambers and T. J. Hastie, “Statistical models,” in *Statistical Models in S*. Routledge, 2017, pp. 13–44. DOI: https://doi.org/10.1201/9780203738535-2

D. Forsyth, “Hidden Markov models,” in *Applied Machine Learning*. Cham, Switzerland: Springer, 2019, pp. 305–332. DOI: https://doi.org/10.1007/978-3-030-18114-7_13

S. Morsy and G. Karypis, “Cumulative knowledge-based regression models for next-term grade prediction,” in *Proc. SIAM Int. Conf. Data Mining*, 2017, pp. 552–560. DOI: https://doi.org/10.1137/1.9781611974973.62

M. Soam and S. Thakur, “Next word prediction using deep learning: A comparative study,” in *Proc. 12th Int. Conf. Cloud Comput., Data Sci. Eng. (Confluence)*, 2022, pp. 653–658. DOI: https://doi.org/10.1109/Confluence52989.2022.9734151

A. Atçılı, Ö. Özkaraca, G. Sarıman, and B. Patrut, “Next word prediction with deep learning models,” in *Proc. Int. Conf. Artif. Intell. Appl. Math. Eng.*, 2021, pp. 523–531. Springer. DOI: https://doi.org/10.1007/978-3-031-09753-9_38

D. M. Eberhard, G. F. Simons, and C. D. Fennig, *Ethnologue: Languages of the World*, 23rd ed. Dallas, TX, USA: SIL International, 2020. [Online]. Available: https://www.ethnologue.com

B. Jawaid, A. Kamran, and O. Bojar, “A tagged corpus and a tagger for Urdu,” in *Proc. 9th Int. Conf. Lang. Resour. Eval. (LREC’14)*, Reykjavik, Iceland, 2014, pp. 2938–2943. European Language Resources Association (ELRA).

A. Saeed, M. F. Ullah, M. Sauood, S. N. Ali, and N. Hussain, “Prediction of dengue cases and deaths using machine learning algorithm,” *Pak. J. Sci. Res.*, vol. 3, no. 2, pp. 210–216, 2023. DOI: https://doi.org/10.57041/pjosr.v3i2.1042

A. Saeed, R. M. A. Nawab, and M. Stevenson, “Investigating the feasibility of deep learning methods for Urdu word sense disambiguation,” *Trans. Asian Low-Resour. Lang. Inf. Process.*, vol. 21, no. 2, pp. 1–16, 2021. DOI: https://doi.org/10.1145/3477578

S. Avasthi, R. Chauhan, and D. P. Acharjya, “Processing large text corpus using n-gram language modeling and smoothing,” in *Proc. 2nd Int. Conf. Inf. Manag. Mach. Intell. (ICIMMI 2020)*, 2021, pp. 21–32. Springer. DOI: https://doi.org/10.1007/978-981-15-9689-6_3

D. Naik, I. Naik, and N. Naik, “Large data begets large data: Studying large language models (LLMs) and its history, types, working, benefits and limitations,” in *Proc. Int. Conf. Comput., Commun., Cybersecurity AI*, Cham, Switzerland, July 2024, pp. 293–314. Springer Nature. DOI: https://doi.org/10.1007/978-3-031-74443-3_18

C. C. Aggarwal, *Neural Networks and Deep Learning*. Cham, Switzerland: Springer, 2018. DOI: https://doi.org/10.1007/978-3-319-94463-0

J. A. Ruffolo and A. Madani, “Designing proteins with language models,” *Nat. Biotechnol.*, vol. 42, no. 2, pp. 200–202, 2024. DOI: https://doi.org/10.1038/s41587-024-02123-4

M. Jordan, G. N. N. Neto, A. Brito Jr, and P. Nohama, “Virtual keyboard with the prediction of words for children with cerebral palsy,” *Comput. Methods Programs Biomed.*, vol. 192, p. 105402, 2020. DOI: https://doi.org/10.1016/j.cmpb.2020.105402

J. Yu, J. Weng, T. Wang, P. Lin, Y. Sun, and J. Chai, “How do differences in airline passengers’ satisfaction with connectivity modes affect last-mile travel choices? A SALC modeling based on RRM,” *Transp. Res. Part A Policy Pract.*, vol. 192, p. 104374, 2025. DOI: https://doi.org/10.1016/j.tra.2025.104374

C. Spiccia, A. Augello, G. Pilato, and G. Vassallo, “A word prediction methodology for automatic sentence completion,” in *Proc. IEEE 9th Int. Conf. Semantic Comput. (ICSC)*, 2015, pp. 240–243. DOI: https://doi.org/10.1109/ICOSC.2015.7050813

K. Jing and J. Xu, “A survey on neural network language models,” 2019.

Z. Song, L. Liu, W. Song, X. Zhao, and C. Du, “A neural network model for Chinese sentence generation with keyword,” in *Proc. IEEE 9th Int. Conf. Electron. Inf. Emerg. Commun. (ICEIEC)*, 2019, pp. 334–337. DOI: https://doi.org/10.1109/ICEIEC.2019.8784475

P. P. Barman and A. Boruah, “An RNN-based approach for next word prediction in Assamese phonetic transcription,” *Procedia Comput. Sci.*, vol. 143, pp. 117–123, 2018. DOI: https://doi.org/10.1016/j.procs.2018.10.359

D. Endalie, G. Haile, and W. Abebe, “Bi-directional long short-term memory-gated recurrent unit model for Amharic next word prediction,” *PLoS ONE*, vol. 17, p. e0273156, 2022. DOI: https://doi.org/10.1371/journal.pone.0273156

R. Yacouby and D. Axman, “Probabilistic extension of precision, recall, and F1 score for more thorough evaluation of classification models,” in *Proc. First Workshop Eval. Compar. NLP Syst.*, 2020, pp. 79–91. DOI: https://doi.org/10.18653/v1/2020.eval4nlp-1.9

C. Goutte and E. Gaussier, “A probabilistic interpretation of precision, recall and F-score, with implication for evaluation,” in *Proc. Eur. Conf. Inf. Retr.*, 2005, pp. 345–359. DOI: https://doi.org/10.1007/978-3-540-31865-1_25

O. F. Rakib, S. Akter, M. A. Khan, A. K. Das, and K. M. Habibullah, “Bangla word prediction and sentence completion using GRU: An extended version of RNN on n-gram language model,” in *Proc. Int. Conf. Sustainable Technol. Ind. 4.0 (STI)*, 2019, pp. 1–6. DOI: https://doi.org/10.1109/STI47673.2019.9068063

J. L. Elman, “Finding structure in time,” *Cogn. Sci.*, vol. 14, no. 2, pp. 179–211, 1990. DOI: https://doi.org/10.1016/0364-0213(90)90002-E

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” *Neural Comput.*, vol. 9, no. 8, pp. 1735–1780, 1997. DOI: https://doi.org/10.1162/neco.1997.9.8.1735

S. Zhang, D. Zheng, X. Hu, and M. Yang, “Bidirectional long short-term memory networks for relation classification,” in *Proc. 29th Pacific Asia Conf. Lang., Inf. Comput.*, Shanghai, China, 2015, pp. 73–78.

M. F. Ullah et al., “BERT model for Roman Urdu fake review identification,” 2023. DOI: https://doi.org/10.21203/rs.3.rs-3243015/v1

M. U. Bhatti et al., “Optimizing brain tumor prediction: A comparative study of machine learning algorithms,” *VFAST Trans. Softw. Eng.*, vol. 12, no. 4, pp. 209–219, 2024.

A. Saeed, R. M. A. Nawab, M. Stevenson, and P. Rayson, “A sense annotated corpus for all-words Urdu word sense disambiguation,” *ACM Trans. Asian Low-Resour. Lang. Inf. Process. (TALLIP)*, vol. 18, no. 4, pp. 1–14, 2019. DOI: https://doi.org/10.1145/3314940

P. Li, “The study for word prediction based on Skip-gram and CBOW model,” *Theor. Nat. Sci.*, vol. 18, pp. 210–215, 2023. DOI: https://doi.org/10.54254/2753-8818/18/20230392

A. Saeed, R. M. A. Nawab, M. Stevenson, and P. Rayson, “A word sense disambiguation corpus for Urdu,” *Lang. Resour. Eval.*, vol. 53, pp. 397–418, 2019. DOI: https://doi.org/10.1007/s10579-018-9438-7

I. Muneer, G. Fatima, M. S. Khan, R. M. A. Nawab, and A. Saeed, “Developing a large benchmark corpus for Urdu semantic word similarity,” *ACM Trans. Asian Low-Resour. Lang. Inf. Process.*, vol. 22, no. 3, pp. 1–19, 2023. DOI: https://doi.org/10.1145/3566124

G. Fatima, R. M. A. Nawab, M. S. Khan, and A. Saeed, “Developing a cross-lingual semantic word similarity corpus for English–Urdu language pair,” *ACM Trans. Asian Low-Resour. Lang. Inf. Process.*, vol. 21, no. 2, pp. 1–16, 2021. DOI: https://doi.org/10.1145/3472618

M. H. K. Vardag, A. Saeed, U. Hayat, M. F. Ullah, and N. Hussain, “Contextual Urdu text emotion detection corpus and experiments using deep learning approaches,” *ADCAIJ: Adv. Distrib. Comput. Artif. Intell. J.*, vol. 11, no. 4, pp. 489–505, 2022. DOI: https://doi.org/10.14201/adcaij.30128

Published

2025-02-27

How to Cite

Ahmad, M. H., Saeed, A., Bhatti, M. U., Hussain, N., Ullah, M. F., & Anwar, M. (2025). Next Word Prediction for Urdu using Deep Learning Techniques. VFAST Transactions on Software Engineering, 13(1), 49–59. https://doi.org/10.21015/vtse.v13i1.2044