Web Spam Detection Based on Single-Page Semantic Features

Application of Electronic Technique

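Chen Musheng1,2, Gao Fei1, Wu Junhua1

(1. School of Software Engineering, Jiangxi University of Science and Technology, Nanchang, Jiangxi 330013; 2. Nanchang Key Laboratory of Virtual Digital Engineering and Cultural Communication, Nanchang, Jiangxi 330013)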

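Abstract: To address the difficulty and computational cost of feature extraction in web spam detection, this paper proposes a method that extracts semantic features solely from the HTML script of the current page. First, a memoized search algorithm combining depth-first search and dynamic programming segments the domain name into words, latent Dirichlet allocation (LDA) extracts topic words, and three single-page semantic similarity features are computed from Word2Vec word vectors and Word Mover's Distance. The single-page semantic similarity features are then fused with single-page statistical features, and classifiers such as random forest are used to build models for web spam detection. Experiments show that classification using semantic features extracted from single-page content, fused with single-page statistical features, reaches an AUC of 88.0%, about 4% higher than the comparison method.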

Keywords: web spam detection; feature extraction; memoized search; latent Dirichlet allocation; word vectors

CLC number: TP391.6
Document code: A
DOI: 10.16157/j.issn.0258-7998.223376
Chinese citation format: Chen Musheng, Gao Fei, Wu Junhua. Web spam detection based on single-page semantic features[J]. Application of Electronic Technique, 2023, 49(6): 24-29.
English citation format: Chen Musheng, Gao Fei, Wu Junhua. Web spam detection based on semantic features from current page[J]. Application of Electronic Technique, 2023, 49(6): 24-29.
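
The abstract's first step, splitting a domain name into words with a memoized search that combines depth-first search and dynamic programming, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the vocabulary `VOCAB`, the function name `segment`, and the "fewest words" objective are all hypothetical choices, since the abstract does not specify the dictionary or the scoring rule.

```python
from functools import lru_cache
from typing import FrozenSet, List, Optional

# Hypothetical toy vocabulary; a real system would load a large word list.
VOCAB = frozenset({"cheap", "car", "cars", "insurance", "sure", "in", "an", "online"})


def segment(label: str, vocab: FrozenSet[str] = VOCAB) -> Optional[List[str]]:
    """Split `label` into the fewest vocabulary words, or return None.

    Assumption: "fewest words" is used as the objective for illustration only.
    """

    @lru_cache(maxsize=None)
    def best_from(start: int):
        # Depth-first search over split points. lru_cache memoizes the answer
        # for each suffix, so every suffix is solved once (the DP part).
        if start == len(label):
            return ()
        best = None
        for end in range(start + 1, len(label) + 1):
            word = label[start:end]
            if word not in vocab:
                continue
            rest = best_from(end)
            if rest is not None and (best is None or len(rest) + 1 < len(best)):
                best = (word,) + rest
        return best

    result = best_from(0)
    return None if result is None else list(result)
```

Without memoization the search can revisit the same suffix exponentially many times; caching by start index bounds the work to one pass over the candidate split points per suffix.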

Web spam detection based on semantic features from current page

Chen Musheng1,2, Gao Fei1, Wu Junhua1

(1. School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330013, China; 2. Nanchang Key Laboratory of Virtual Digital Engineering and Cultural Communication, Nanchang 330013, China)

Abstract: To address the difficulty and computational cost of feature extraction in web spam detection, a method is proposed that extracts semantic features using only the HTML script of the current page. First, the domain name is segmented into words by a memoized search algorithm that combines depth-first search and dynamic programming. Second, latent Dirichlet allocation is used to extract topic words from the web page. Third, three single-page semantic similarity features are calculated based on Word2Vec and Word Mover's Distance. The single-page semantic similarity features are then combined with single-page statistical features, and classification algorithms such as random forest are used to build models for web spam detection. Experimental results show that classification based on semantic and statistical features extracted from single-page content reaches an AUC of 88.0%, about 4% higher than the comparison method.

Key words: web spam detection; feature extraction; memoized search; latent Dirichlet allocation; Word2Vec; Word Mover's Distance; random forest
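
To give a feel for how word vectors and a word-mover style distance can yield a similarity feature between two word lists (for example, domain words versus topic words), the sketch below computes the relaxed Word Mover's Distance, a well-known lower bound on the exact distance. This is a deliberate simplification: exact Word Mover's Distance requires an optimal-transport solver, and the abstract does not say how the authors compute it. The 2-D vectors in `TOY_VECTORS` and the function names are hypothetical stand-ins for trained Word2Vec embeddings.

```python
import math
from collections import Counter

# Hypothetical 2-D "embeddings" for illustration only; real Word2Vec vectors
# are learned from a corpus and have far more dimensions.
TOY_VECTORS = {
    "cheap": (1.0, 0.0),
    "discount": (0.9, 0.1),
    "loan": (0.0, 1.0),
    "credit": (0.1, 0.9),
}


def _one_way_cost(src, dst, vectors):
    """Move each source word's mass entirely to its nearest target word."""
    weights = Counter(src)
    total = sum(weights.values())
    targets = set(dst)
    cost = 0.0
    for word, count in weights.items():
        nearest = min(math.dist(vectors[word], vectors[t]) for t in targets)
        cost += (count / total) * nearest
    return cost


def relaxed_wmd(doc_a, doc_b, vectors=TOY_VECTORS):
    """Relaxed WMD: the larger of the two one-way relaxations.

    Assumes both documents are non-empty and every word has a vector.
    """
    return max(_one_way_cost(doc_a, doc_b, vectors),
               _one_way_cost(doc_b, doc_a, vectors))
```

A smaller distance means the two word lists are semantically closer, so a value like this (or a similarity derived from it) could serve as one input feature to a classifier such as random forest.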

0 Introduction

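With the rapid growth of information on the Internet, search engines are regarded as the key tool for reaching websites, and their users account for more than 80% of all Internet users [1]. However, studies show that about 60% of users look only at the first five results on the first page [2]. Pages ranked near the top of the search results therefore attract more visitors and, in turn, more revenue. Because raising a page's ranking by legitimate means is very difficult, some websites use illegitimate means and techniques to deceive search engines into ranking their pages higher; such pages are called web spam [3]. Web spam lowers the quality of search results, wastes users' time, and encroaches on the legitimate interests of search engine companies and other content websites [4]. Although search engine companies have used a variety of methods to counter web spam, its detection remains a hard problem that search engines still need to crack, as well as a frontier topic in academic research. Detecting web spam efficiently and accurately is therefore of great significance.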

For the full text of this article, please download: http://ihrv.cn/resource/share/2000005343

Original content statement: This content is original to the AET website; reproduction without authorization is prohibited.