

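   The electronic thesis has not yet been authorized for public release; for the print copy, please check the library catalog.
(※ If the thesis cannot be found, or its holding status shows "closed stacks, not public," the copy is not in the stacks and cannot be retrieved.)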
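System ID U0026-0608201917200900
Title (Chinese) 基於文本處理結合有效特徵選擇之惡意程式分類方法 (Malware classification method based on text processing combined with effective feature selection)
Title (English) Malware Classification based on N-gram and TF-IDF with Efficient Feature Set Reduction
Institution National Cheng Kung University
Department (Chinese) 電腦與通信工程研究所 (Institute of Computer and Communication Engineering)
Department (English) Institute of Computer & Communication
Academic year 107 (ROC calendar)
Semester 2
Publication year 108 (ROC calendar; 2019)
Student (Chinese) 劉大哲
Student (English) Ta-Che Liu
Student ID q36061305
Degree Master's
Language Chinese
Pages 52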
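Oral defense committee Advisor: 李忠憲
Committee member: 蘇淑茵
Committee member: 蘇暉凱
Convener: 鄭伯炤
Keywords (Chinese) 動態分析 (dynamic analysis)  惡意程式分類 (malware classification)  機器學習 (machine learning)  深度學習 (deep learning)  特徵選擇 (feature selection)
Keywords (English) Dynamic Analysis  Malware Classification  Machine Learning  Deep Learning  Feature Selection
Subject classification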
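Abstract (Chinese) With the rapid development of computers and networks, the convenience of the Internet has been accompanied by a rapid rise in information technology crime. Traditionally, malware is identified by matching signatures against a signature database, or by manual analysis that relies on expert experience. As malware grows increasingly varied, however, signature matching can lose accuracy when malware applies packing and similar techniques, and the exponential growth in the number of malware samples makes the signature database ever larger. For these two reasons, traditional database matching can no longer classify malware effectively and efficiently, so researchers have begun to study how machine learning and deep learning can address the problem. This thesis works with a large collection of malware analysis reports provided by the National Center for High-Performance Computing. An automated program extracts the function calls from each report and generates a text document. N-gram and term frequency-inverse document frequency (TF-IDF), algorithms used in natural language processing to handle text data, then quantify the extracted text and convert it into meaningful numbers. Finally, feature selection removes redundant features to reduce training time. In the experiments, we compare accuracy with and without TF-IDF, and compare the performance of the selected feature subset against the original features. The results show that the proposed method achieves 87.08% accuracy and saves 87.97% of training time, and that it performs better than the related work it is compared against.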
Abstract (English) With the rapid development of computers and the Internet, the convenience of going online has been accompanied by a rapid rise in information technology crime. The traditional way to identify malware is to match samples against a signature database to determine the malware type, or to rely on manual analysis by experts. However, as malware becomes increasingly varied, signature matching can lose accuracy when a sample has been packed or similarly transformed, and the exponential growth in the number of malware samples makes the signature database ever larger. For these two reasons, traditional database matching can no longer classify malware effectively and efficiently, so researchers have begun to study how machine learning and deep learning can solve this problem.
In this thesis, we study a large number of malware analysis reports provided by the National Center for High-Performance Computing. An automated program extracts the function calls from each report and generates a text document. N-gram and TF-IDF, algorithms used in natural language processing to handle text data, then quantify the extracted text and convert it into meaningful numbers. Finally, feature selection eliminates redundant features to reduce the training time cost.
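As a rough illustration of the encoding step described above, the following sketch builds n-grams over sequences of function-call names and weights them with TF-IDF. It is not the thesis's implementation: the function names `api_ngrams` and `tfidf`, the sample call traces, the choice of n = 2, and the plain `tf * log(N / df)` weighting are all assumptions made for the example.

```python
import math
from collections import Counter


def api_ngrams(calls, n=2):
    """Return the n-grams (as tuples) over a sequence of call names."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def tfidf(documents, n=2):
    """Compute TF-IDF weights of call n-grams for each document.

    tf  = occurrences of the n-gram / total n-grams in the document
    idf = log(N / df), where df counts the documents containing the n-gram
    """
    grams_per_doc = [Counter(api_ngrams(doc, n)) for doc in documents]

    # Document frequency: in how many documents each n-gram appears.
    doc_freq = Counter()
    for counts in grams_per_doc:
        doc_freq.update(counts.keys())

    num_docs = len(documents)
    weights = []
    for counts in grams_per_doc:
        total = sum(counts.values())
        weights.append({
            gram: (count / total) * math.log(num_docs / doc_freq[gram])
            for gram, count in counts.items()
        })
    return weights


# Hypothetical call traces standing in for calls extracted from reports.
reports = [
    ["NtOpenFile", "NtReadFile", "NtClose"],
    ["NtOpenFile", "NtReadFile", "NtWriteFile", "NtClose"],
    ["RegOpenKeyExW", "RegSetValueExW", "RegCloseKey"],
]
vectors = tfidf(reports, n=2)
```

In this toy data the bigram `("NtOpenFile", "NtReadFile")` occurs in two of the three traces, so it receives a lower weight than a bigram that occurs in only one, which is the effect TF-IDF is meant to have on common call patterns.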
In the experiments, we compare accuracy before and after applying TF-IDF, and compare the performance of the selected feature subset against the original features. The results show that the proposed method achieves 87.08% accuracy and saves 87.97% of training time, and that it outperforms the related work it is compared against.
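The idea of removing redundant features before training can be sketched with a simple correlation filter. The abstract does not spell out the selection procedure, so everything here is an assumption for illustration: the greedy keep-or-drop rule, the 0.95 threshold, the function names `pearson` and `drop_redundant`, and the toy feature columns.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def drop_redundant(columns, threshold=0.95):
    """Greedily keep a feature only if its absolute correlation with
    every feature kept so far is below the threshold."""
    kept = []
    for idx, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(idx)
    return kept


# Hypothetical feature columns (one list per feature, one value per sample).
features = [
    [0.1, 0.4, 0.3, 0.9],
    [0.2, 0.8, 0.6, 1.8],  # twice the first column, so fully redundant
    [0.7, 0.1, 0.5, 0.2],
]
selected = drop_redundant(features)  # indices of the features to keep
```

Training on only the selected columns shrinks the input dimension, which is where a saving in training time would come from. The size of that saving depends on the data and the classifier.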
Table of contents Abstract I
Acknowledgments XII
Contents XIII
List of Tables XV
List of Figures XVI
Chapter 1: Introduction 1
1.1 Research Background 1
1.2 Research Motivation 3
1.3 Contributions 4
1.4 Thesis Organization 5
Chapter 2: Related Work 6
2.1 Overview of Malware 6
2.2 Analysis Methods 9
2.2.1 Static Analysis 9
2.2.2 Dynamic Analysis 10
2.3 Malware Classification 12
2.4 Machine Learning and Deep Learning 14
2.4.1 Support Vector Machine 14
2.4.2 Multi-layer Perceptron 16
2.4.3 Convolutional Neural Network 18
Chapter 3: System Architecture 22
3.1 Data Source 24
3.2 Data Pre-processing 25
3.2.1 Classification Labels 25
3.2.2 Feature Extraction 27
3.3 Feature Encoding 28
3.4 Feature Selection 32
3.4.1 Correlation-based Feature Selection 32
3.4.2 Filter Method 35
3.5 System Implementation 37
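Chapter 4: Experimental Results 38
4.1 Research Environment and Evaluation Metrics 39
4.2 Change in Accuracy with TF-IDF 41
4.3 Performance Comparison of Feature Selection 43
4.4 Comparison with Related Work 47
Chapter 5: Conclusion and Future Work 49
References 50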


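Full-text access rights
  • On-campus browsing and printing of the electronic full text is authorized, open from 2024-07-17.
  • Off-campus browsing and printing of the electronic full text is authorized, open from 2024-07-17.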

