Research progress and challenges of malware detection methods based on machine learning
Jing Hongli1, Huang Na1,2, Li Jianguo1
1. Beijing Topsec Science & Technology Inc., Beijing 100085, China; 2. Beijing University of Technology, Beijing 100124, China
Abstract: As the volume of malware grows and attack techniques keep evolving, combining malware detection with machine learning has become a new direction of development. This paper first briefly introduces static and dynamic malware detection methods, summarizes the general workflow of machine-learning-based malware detection, and reviews research progress. Using the Ember 2017 and Ember 2018 data sets, methods based on structured features are analyzed and validated, including Random Forest (RF), LightGBM, Support Vector Machine (SVM), K-means, and Convolutional Neural Network (CNN) models. Using a sample set collected in 2019, methods based on serialized features are analyzed and validated, including several common deep learning models. The accuracy, precision, recall, and F1 score of the trained models on different test sets serve as evaluation metrics. Based on the experimental results, the advantages and disadvantages of each class of methods are discussed, with emphasis on verifying and analyzing the generalization ability of tree models. The results show that models generally degrade as samples continue to evolve, and directions for further research are pointed out.
Key words: malware detection; static detection of malware; machine learning; LightGBM; random forest

0 Introduction


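    Many machine learning techniques and frameworks have already been proposed and applied to malware detection, with considerable success. According to the 2016 survey by SGANDURRA D et al. [4], static detection methods that use machine learning reach an accuracy above 90%, and dynamic detection methods can reach an accuracy above 96%; with continued development in recent years, the performance of such methods has improved further. Building intelligent detection models on machine learning to form a line of defense that blocks malware is a new direction for both technical breakthroughs and market expansion, and it has significant research and application value.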
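
    To make the accuracy figures quoted above, and the precision, recall, and F1 score used as evaluation metrics in this paper, concrete, the following is a minimal sketch of how the four metrics are computed for a binary detector. The function name `binary_metrics`, the label convention (1 = malicious, 0 = benign), and the toy labels are assumptions made for illustration; they are not taken from the paper's experiments.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Assumed convention (illustrative): 1 = malicious, 0 = benign.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    truth = [1, 1, 1, 0, 0, 0, 1, 0]
    preds = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(truth, preds))
```

    Accuracy alone can be misleading when benign and malicious samples are unbalanced, which is one reason precision, recall, and F1 are commonly reported together with it.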




