Application of Electronic Technique
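A malicious PDF document detection method based on document graph structure
Information Technology and Network Security, Issue 11
Yu Yuanzhe, Wang Jinshuang, Zou Xia
(Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210007, Jiangsu, China)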
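Abstract: Current machine-learning-based methods for detecting malicious PDF documents rely on expert experience to select features, and so cannot fully reflect document attributes. Moreover, detector performance degrades markedly on adversarial samples. To address these problems, a malicious PDF document detection method based on the document graph structure and a convolutional neural network (CNN) is proposed. The method parses the document structure and builds a directed graph from the reference relationships among the objects in the document. The TF-IDF algorithm is then used to compute each node's contribution to classification, and the graph structure is simplified accordingly. Finally, the adjacency matrix and degree matrix of the simplified graph are computed to obtain the graph's Laplacian matrix, which is fed as the feature into the CNN classification model for training. Adversarial samples are also added so that the model is adversarially trained. Experimental evaluation shows that, with a training-to-test sample ratio of 9:1 and after repeated tuning of the network structure and parameters, the method reaches an accuracy of 99.71%, outperforming KNN and SVM classification models. On adversarial samples, the method achieves higher detection performance than the 67 antivirus engines on the well-known online scanning site VirusTotal.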
CLC number: TP309
Document code: A
DOI: 10.19358/j.issn.2096-5133.2021.11.003
Citation: Yu Yuanzhe, Wang Jinshuang, Zou Xia. Malicious PDF detection method based on document graph structure[J]. Information Technology and Network Security, 2021, 40(11): 16-23.
Malicious PDF detection method based on document graph structure
Yu Yuanzhe, Wang Jinshuang, Zou Xia
(Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210007, China)
Abstract: Malicious PDF detection methods based on machine learning rely on expert knowledge to select features, and therefore cannot fully reflect document attributes. Moreover, the performance of these detectors degrades easily on adversarial samples. To overcome these limitations, a malicious PDF detection method based on the PDF document graph structure and a Convolutional Neural Network (CNN) is proposed. First, a directed graph is constructed from the document structure and the reference relationships between document objects. Second, the contribution of each node is calculated using the TF-IDF algorithm, and the graph structure is simplified accordingly. Third, the adjacency and degree matrices of the simplified graph are calculated to obtain the graph's Laplacian matrix, which is used as the feature and fed to the CNN classification model for training. Adversarial samples are also added to train the model. Evaluation shows that the method achieves an accuracy of 99.71%, outperforming KNN and SVM classification models. Compared with the 67 antivirus engines on VirusTotal, it achieves higher detection performance on adversarial samples.
Key words: malicious PDF document; document graph structure; CNN; adversarial sample

0 Introduction

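PDF (Portable Document Format) documents are very widely used. As the format has evolved through successive versions, the features it supports have become increasingly varied, but some little-known features (such as file embedding, JavaScript code execution, and dynamic forms) are increasingly exploited by attackers to carry out malicious network attacks [1]. APT (Advanced Persistent Threat) attacks [2] often craft cleverly disguised malicious PDF documents and trick victims into downloading them through means such as phishing emails, thereby intruding into or damaging computer systems. Compared with traditional malicious executables, malicious documents are more deceptive.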

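Machine-learning-based detection methods are widely used by researchers and can be broadly divided into static detection, dynamic detection, and hybrid static-dynamic detection [3]. However, most existing feature selection methods for malicious documents are driven by expert knowledge: the feature set is chosen from observations made during manual analysis of malicious documents (for example, the number of call-type objects, the page count, or the version number), or the features are refined through statistical analysis (for example, the proportion of a given object type among all objects). Because the range of candidate features is very large, selecting only a subset based on experience loses part of the document's information and cannot fully express the document's characteristics.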

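Because the PDF format is complex, its logical structure carries a large amount of document semantics. Reference [4] argues that a comprehensive analysis of structural properties can explain the significant structural differences between malicious and benign PDF documents. This paper therefore analyzes the document's logical structure as a whole and uses the document's structure graph, rather than independent structural paths, as the feature for detection. Even if an attacker knows which objects are key to successful detection and modifies a specific path accordingly, doing so would break the overall structure of the document, so the cost of evading detection is high.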
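As a rough illustration of the structure-graph idea summarized in the abstract, the sketch below builds a directed graph from object references in a toy document, forms the Laplacian matrix L = D - A, and scores node labels with a textbook TF-IDF formula. It is not the authors' implementation. The object names, the choice of out-degree for D, treating each document as a bag of node labels, and the function names `build_adjacency`, `laplacian`, and `tfidf_scores` are all assumptions made for illustration; this page does not specify those details.

```python
import math
from collections import Counter

# Hypothetical toy document: each object maps to the objects it references.
REFS = {
    "Catalog": ["Pages", "OpenAction"],
    "Pages": ["Page"],
    "OpenAction": ["JavaScript"],
}
NODES = ["Catalog", "Pages", "Page", "OpenAction", "JavaScript"]


def build_adjacency(refs, nodes):
    """Return the adjacency matrix A, where A[i][j] = 1 if node i references node j."""
    index = {name: i for i, name in enumerate(nodes)}
    adj = [[0] * len(nodes) for _ in nodes]
    for src, targets in refs.items():
        for dst in targets:
            adj[index[src]][index[dst]] = 1  # directed edge src -> dst
    return adj


def laplacian(adj):
    """Return L = D - A, with D the diagonal out-degree matrix (an assumption)."""
    n = len(adj)
    lap = [[-adj[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] += sum(adj[i])  # add the out-degree of node i on the diagonal
    return lap


def tfidf_scores(docs):
    """Score node labels per document with tf * log(N / df).

    docs is a list of non-empty label lists, one list per document (assumed layout).
    """
    n_docs = len(docs)
    df = Counter(label for doc in docs for label in set(doc))
    return [
        {label: (count / len(doc)) * math.log(n_docs / df[label])
         for label, count in Counter(doc).items()}
        for doc in docs
    ]


adj = build_adjacency(REFS, NODES)
lap = laplacian(adj)
scores = tfidf_scores([
    ["Catalog", "Pages", "Page"],
    ["Catalog", "OpenAction", "JavaScript"],
])

for row in lap:
    print(row)
```

In this toy corpus the label that appears in every document, `Catalog`, receives a score of 0, while labels unique to one document score higher. A threshold-based pruning step, which is hypothetical here, would therefore drop ubiquitous nodes first. The resulting matrix would then need to be padded or resized to a fixed shape before being passed to a CNN, which is also outside what this page describes.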




For the full text of this article, please download: http://ihrv.cn/resource/share/2000003843





