
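Re-identification risk assessment of de-identified datasets based on information entropy

Information Technology and Network Security, 2020, Issue 12

Chen Lei1,2, Xue Jianxin1,2, Zhang Runzi1,2, Liu Wenmao1

1. Nsfocus Information Technology Co., Ltd., Beijing 100089; 2. Department of Automation, Tsinghua University, Beijing 100084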

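Abstract: De-identification is a privacy protection technique that is widely used in data publishing. In the era of big data, however, attackers may obtain more associated data, so de-identified datasets remain exposed to re-identification attacks. Building on Shannon information entropy and an information security risk assessment framework, this paper proposes a comprehensive re-identification risk assessment method. First, the attribute combinations of a dataset that an attacker might exploit are grouped into a number of vulnerabilities; each vulnerability is then assessed along two dimensions, likelihood and harm. Finally, to assess the re-identification risk of the dataset as a whole, an evaluation algorithm based on entropy increments and weighting is constructed. Experimental results show that the proposed method reflects the distribution and trend of risk comprehensively and intuitively.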

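Keywords: privacy protection; de-identified datasets; re-identification risk assessment; information entropy

CLC number: TP399
Document code: A
DOI: 10.19358/j.issn.2096-5133.2020.12.001
Citation format: Chen Lei, Xue Jianxin, Zhang Runzi, et al. Re-identification risk assessment of de-identified datasets based on information entropy[J]. Information Technology and Network Security, 2020, 39(12): 1-7.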

Re-identification risk assessment of de-identified datasets based on information entropy

Chen Lei1,2, Xue Jianxin1,2, Zhang Runzi1,2, Liu Wenmao1

1. Nsfocus Information Technology Co., Ltd., Beijing 100089, China; 2. Department of Automation, Tsinghua University, Beijing 100084, China

Abstract: As a privacy protection technology, de-identification has been widely used in data publishing. However, in the era of big data, attackers may obtain more associated data, so de-identified datasets still face the risk of re-identification attacks. Based on information entropy and an information security risk assessment framework, this paper proposes a comprehensive re-identification risk assessment method. First, the attribute combinations of a de-identified dataset that attackers may exploit are summarized into several vulnerabilities, and each vulnerability is then evaluated along two dimensions: likelihood and impact. Finally, to comprehensively evaluate the re-identification risk of the whole dataset, this paper constructs a fast evaluation algorithm based on entropy increments and weights. Extensive experimental results demonstrate that the proposed method reflects the risk distribution and trend comprehensively and intuitively.

Key words: privacy protection; de-identified datasets; re-identification risk assessment; information entropy
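
The abstract's starting point, measuring attribute combinations with Shannon entropy, can be illustrated with a minimal Python sketch. It is not the paper's evaluation algorithm, whose entropy-increment and weighting steps are not given in this excerpt. The toy records, the attribute names, and the helper names `shannon_entropy` and `combination_entropy` are all hypothetical. The sketch computes $H = -\sum_i p_i \log_2 p_i$ over the empirical distribution of the joint values of a chosen attribute combination. Under the usual reading, a higher value means the combination splits records into smaller groups, which makes individual records easier to single out.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def combination_entropy(records, attributes):
    """Entropy of the joint values that `attributes` take across `records`."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    return shannon_entropy(keys)


# Hypothetical toy dataset, for illustration only.
records = [
    {"zip": "100089", "age": "30-39", "sex": "F"},
    {"zip": "100089", "age": "30-39", "sex": "M"},
    {"zip": "100084", "age": "20-29", "sex": "F"},
    {"zip": "100084", "age": "20-29", "sex": "F"},
]

h_zip = combination_entropy(records, ["zip"])
h_all = combination_entropy(records, ["zip", "age", "sex"])
print(f"zip only: {h_zip:.2f} bits; zip+age+sex: {h_all:.2f} bits")
```

In this toy data, adding `age` and `sex` to `zip` raises the entropy from 1.0 to 1.5 bits, against a maximum of 2 bits for four fully distinct records.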

0 Introduction

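In the era of big data, demand for data sharing, publishing, and trading keeps growing. On the one hand, this promotes the circulation of data and the use of its value; on the other hand, the personal data and privacy incidents it triggers have surged in recent years^[1].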

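In response, at the regulatory level a wave of data privacy legislation has swept the world, including the EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) in the United States. China's Cybersecurity Law, which took effect in 2017, devotes one chapter specifically to personal information security; in addition, China's Personal Information Protection Law is being drafted on an accelerated schedule. At the technical level, how to balance data utilization against privacy protection has become a major research topic in academia and industry^[2]. Techniques developed so far include Format-Preserving Encryption (FPE)^[3], Differential Privacy (DP)^[4], K-Anonymity^[5], L-Diversity^[6], and De-identification^[7]. Among these, de-identification is a privacy protection technique that aims to remove "personal identity" by applying data transformations such as partial masking, generalization, and distortion to the original personal information. Because its processing rules are simple and flexible and it is easy to parallelize (and therefore efficient), it is widely applied and deployed in practical scenarios such as privacy-preserving data publishing and data mining. In industry it is usually called "data masking".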
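
The three transformations named above (partial masking, generalization, and distortion) can be sketched in a few lines of Python. This is only an illustration under assumed rules, not a procedure from the paper. The function names, the phone-number layout, the 10-year age bands, and the uniform noise model are all hypothetical choices.

```python
import random


def mask_phone(phone, keep_head=3, keep_tail=4, fill="*"):
    """Partial masking: keep the first and last digits, hide the middle."""
    if len(phone) <= keep_head + keep_tail:
        return fill * len(phone)  # too short to keep anything safely
    hidden = len(phone) - keep_head - keep_tail
    return phone[:keep_head] + fill * hidden + phone[len(phone) - keep_tail:]


def generalize_age(age, width=10):
    """Generalization: replace an exact age with the band that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def distort(value, scale, rng):
    """Distortion: add uniform noise drawn from [-scale, scale]."""
    return value + rng.uniform(-scale, scale)


# Hypothetical record, for illustration only.
record = {"phone": "13812345678", "age": 34, "salary": 12000.0}
rng = random.Random(0)  # seeded so the sketch is repeatable

deidentified = {
    "phone": mask_phone(record["phone"]),
    "age": generalize_age(record["age"]),
    "salary": distort(record["salary"], scale=500.0, rng=rng),
}
print(deidentified)
```

Each rule operates on one field of one record at a time, which is consistent with the point above that such rules are simple and easy to run in parallel.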

For the full text of this article, please download: http://ihrv.cn/resource/share/2000003069


