Identification of Malicious PDFs Using Convolutional Neural Networks

Umm-e-Hani Tayyab; Muhammad Zain; Faiza Babar Khan; Dr. Muhammad Hanif Durad

doi:10.21015/vtse.v10i3.1114

Authors

Umm-e-Hani Tayyab, Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan
Muhammad Zain, Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan
Faiza Babar Khan, Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan
Dr. Muhammad Hanif Durad, Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan

DOI: https://doi.org/10.21015/vtse.v10i3.1114

Abstract

Among the many possible carriers of malware, PDF is one of the file formats most targeted by malware writers, owing to inherent shortcomings of the format as well as limitations of PDF readers. PDF-based attacks are among the most common document-based attacks. The format is attractive to attackers for three reasons: its widespread use, which has largely displaced Word documents; the ease of crafting a malicious PDF; and, above all, its ability to contain JavaScript. Initially, researchers focused on identifying malicious PDFs by comparing structural differences between benign and malicious files. Malware writers responded with evasion techniques that hide the malware and mask these structural differences. Researchers in turn began applying artificial intelligence to counter these evasion techniques. We developed a convolutional neural network, trained on a dataset of approximately 21,000 files, that classifies PDFs as benign or malicious using byte-level information. Our model detects malicious PDFs without human intervention or manual feature extraction, and achieves 97% accuracy.
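
To make the idea of byte-level classification concrete, the following is a minimal sketch of how raw file bytes could be fed through a small one-dimensional convolutional network. It is not the architecture described in the paper: the window length, embedding size, filter count, kernel width, and the names `bytes_to_ids`, `init_params`, and `forward` are all assumptions chosen for illustration, and the weights are random rather than trained.

```python
import math
import random

PAD_ID = 256    # id reserved for padding; byte values occupy 0..255 (assumption)
SEQ_LEN = 64    # illustrative window; a practical model would read far more bytes


def bytes_to_ids(data, seq_len=SEQ_LEN):
    """Truncate or right-pad raw bytes to a fixed-length list of integer ids."""
    ids = list(data[:seq_len])
    return ids + [PAD_ID] * (seq_len - len(ids))


def init_params(embed_dim=4, n_filters=3, kernel=5, seed=0):
    """Random, untrained weights; training is out of scope for this sketch."""
    rng = random.Random(seed)

    def r():
        return rng.uniform(-0.5, 0.5)

    return {
        # one embedding vector per byte value, plus one for PAD_ID
        "embed": [[r() for _ in range(embed_dim)] for _ in range(257)],
        # each filter spans `kernel` consecutive positions of the embedded sequence
        "filters": [[[r() for _ in range(embed_dim)] for _ in range(kernel)]
                    for _ in range(n_filters)],
        "out_w": [r() for _ in range(n_filters)],
        "out_b": 0.0,
    }


def forward(ids, p):
    """Embedding -> 1-D convolution + ReLU -> global max pool -> sigmoid."""
    x = [p["embed"][i] for i in ids]
    pooled = []
    for filt in p["filters"]:
        k = len(filt)
        best = 0.0  # ReLU output is never below zero, so 0.0 is a safe floor
        for t in range(len(x) - k + 1):
            s = sum(w * v
                    for row_w, row_x in zip(filt, x[t:t + k])
                    for w, v in zip(row_w, row_x))
            best = max(best, s)
        pooled.append(best)
    z = sum(w * a for w, a in zip(p["out_w"], pooled)) + p["out_b"]
    return 1.0 / (1.0 + math.exp(-z))  # value in (0, 1), read as P(malicious)


params = init_params()
sample = b"%PDF-1.7\n1 0 obj\n<< /JS (app.alert(1)) >>"
score = forward(bytes_to_ids(sample), params)
```

Because the input is only the byte sequence, no PDF parsing or hand-built feature extraction is involved, which is the property the abstract emphasizes. With untrained weights the resulting `score` carries no meaning; it only shows the shape of the computation.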

Author Biography

Umm-e-Hani Tayyab, Pakistan Institute of Engineering and Applied Sciences, Islamabad

Department of Computer and Information Sciences
