Next Word Prediction for Urdu using Deep Learning Techniques
DOI: https://doi.org/10.21015/vtse.v13i1.2044

Abstract
A language model for next-word prediction is a probabilistic representation of a natural language that uses text corpora to estimate word probabilities. Such models play a crucial role in text generation, machine translation, and question-answering applications. This study develops an improved algorithm for next-word prediction in Urdu by implementing deep learning models, including RNN, LSTM, and Bi-LSTM, on subsets of the Ur-Mono Urdu corpus containing 3,000 and 5,000 sentences. To prepare the data for experimentation, tokenization and stemming are applied as data cleaning steps. The RNN model achieved 87% accuracy on the first 3,000 sentences of the Ur-Mono dataset and 84% accuracy on the first 5,000 sentences. In conclusion, when the corpus is small, the RNN outperforms both the LSTM and Bi-LSTM; however, as the corpus size increases, the Bi-LSTM exhibits superior performance compared to both the RNN and LSTM.
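The preprocessing pipeline described above (tokenizing sentences, then deriving prefix/next-word training pairs for the recurrent models) can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code; the function names and the whitespace tokenizer are assumptions, and a real pipeline would also include the stemming step and padding of prefixes to a fixed length.

```python
def tokenize(sentence):
    # Simplified whitespace tokenization; the study's actual tokenizer
    # for Urdu text may differ.
    return sentence.split()

def build_vocab(sentences):
    # Map each distinct token to an integer id; id 0 is reserved
    # for out-of-vocabulary words.
    vocab = {"<unk>": 0}
    for s in sentences:
        for tok in tokenize(s):
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def make_training_pairs(sentences, vocab):
    # For every position in a sentence, emit (prefix ids, next-word id).
    # These pairs are the supervision signal an RNN/LSTM/Bi-LSTM
    # next-word predictor is trained on.
    pairs = []
    for s in sentences:
        ids = [vocab.get(t, 0) for t in tokenize(s)]
        for i in range(1, len(ids)):
            pairs.append((ids[:i], ids[i]))
    return pairs
```

For example, a four-word Urdu sentence yields three training pairs: the model sees one, then two, then three words of context, and is asked to predict the following word each time.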
License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (CC-By) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
This work is licensed under a Creative Commons Attribution License CC BY