Identification of Malicious PDFs Using Convolutional Neural Networks
DOI:
https://doi.org/10.21015/vtse.v10i3.1114
Abstract
Among the many possible carriers of malware, PDF is one of the file formats most targeted by malware writers, owing to shortcomings inherent in the format as well as limitations of PDF readers. PDF-based attacks are among the most common document-based attacks. The features that make PDF files attractive to attackers include their widespread use, which has largely displaced Word documents; the ease of crafting a malicious PDF; and, above all, the ability to embed JavaScript. Initially, researchers focused on identifying malicious PDFs by comparing structural differences between benign and malicious files. Malware writers responded with evasion techniques that conceal those structural differences. Researchers in turn began applying artificial intelligence to counter these evasion techniques. We developed a convolutional neural network that classifies PDFs as benign or malicious from byte-level information, trained on a dataset of approximately 21,000 files. Our model detects malicious PDFs without human intervention or manual feature extraction, and it achieves 97% accuracy.
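To make the byte-level idea concrete, the following is a minimal sketch of how raw file bytes could be fed to a one-dimensional convolutional classifier. It is not the authors' architecture: the input length `MAX_LEN`, the padding token `PAD`, the layer sizes, and the names `bytes_to_input` and `TinyByteCNN` are all assumptions made for illustration, and the weights are random and untrained, so the output score carries no meaning until a model of this shape is trained on labeled files.

```python
import math
import random

MAX_LEN = 512  # assumed fixed input length; the abstract does not state one
PAD = 256      # assumed padding token, distinct from the 256 real byte values


def bytes_to_input(data, max_len=MAX_LEN):
    """Truncate or right-pad raw file bytes to a fixed-length list of ints."""
    seq = list(data[:max_len])
    return seq + [PAD] * (max_len - len(seq))


class TinyByteCNN:
    """Embedding -> 1D convolution -> ReLU -> global max pool -> sigmoid.

    Hypothetical, untrained model used only to show the data flow.
    """

    def __init__(self, embed_dim=4, n_filters=8, kernel=5, seed=0):
        rng = random.Random(seed)

        def u():
            return rng.uniform(-0.5, 0.5)

        self.kernel = kernel
        # One embedding vector per byte value, plus one for the padding token.
        self.embed = [[u() for _ in range(embed_dim)] for _ in range(257)]
        # filters[f][k][d]: weight of filter f at offset k, embedding dim d.
        self.filters = [
            [[u() for _ in range(embed_dim)] for _ in range(kernel)]
            for _ in range(n_filters)
        ]
        self.out_w = [u() for _ in range(n_filters)]
        self.out_b = 0.0

    def predict(self, seq):
        """Return a score in (0, 1); higher would mean 'malicious' once trained."""
        x = [self.embed[token] for token in seq]
        pooled = []
        for filt in self.filters:
            best = 0.0  # starting at 0 applies ReLU before the max pool
            for i in range(len(x) - self.kernel + 1):
                act = sum(
                    w * v
                    for k in range(self.kernel)
                    for w, v in zip(filt[k], x[i + k])
                )
                if act > best:
                    best = act
            pooled.append(best)
        logit = sum(w * p for w, p in zip(self.out_w, pooled)) + self.out_b
        return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    sample = bytes_to_input(b"%PDF-1.7\n")  # stand-in for a real file's bytes
    print(TinyByteCNN().predict(sample))
```

The point of the sketch is that no PDF parsing or hand-built feature appears anywhere in it: the only preprocessing is fixing the sequence length, which is what allows a model of this kind to work without manual feature extraction.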
License
This work is licensed under a Creative Commons Attribution License CC BY