Legal use of publicly available personal data in machine learning

Cyber Security and Data Governance

Wang Wanqing

China Institute for Rule of Law Strategy, East China University of Political Science and Law

Abstract: Publicly available personal data is a crucial training corpus for machine learning and should in principle be open to use. In practice, however, a permissive acquisition strategy runs into obstacles to lawful use: the permissible scope of web scraping is unclear, use in generative AI carries infringement risks, and data subjects find it difficult to exercise informational self-determination. Having examined the causes of these obstacles, a pathway for the lawful use of such data should be constructed across the full machine learning application cycle. At the data acquisition stage, the legitimacy and potential impact of scraping should be assessed; where competitive interests are involved, access should shift to lawful channels such as API authorization so that the data source is lawful. At the stage where machine learning outputs are put into application, tiered security mechanisms should be set according to the category of personal information, with real-time supervision to guard against privacy breaches and misuse. After the application is placed on the market, a training data disclosure mechanism should be established so that transparency supports user intervention and safeguards the right to informational self-determination.

Keywords: publicly available personal data; machine learning; competitive interests; personal data protection; information disclosure

CLC number: D922.17; TP181    Document code: A    DOI: 10.19358/j.issn.2097-1788.2026.02.010
Citation format: Wang Wanqing. Legal use of publicly available personal data in machine learning [J]. Cyber Security and Data Governance, 2026, 45(2): 73-80.

Introduction

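At present, artificial intelligence (AI) in China has entered a new stage of coordinating security with innovative development [1]. AI systems mostly take machine learning as their underlying technical path. Generative AI, for example, works by learning and summarizing patterns from massive data, continuously optimizing the model, and generating new content according to the operator's instructions. The process of summarizing those patterns is the machine learning stage [2]. Machine learning uses data and algorithms to improve decision accuracy step by step through model training and parameter tuning [3], ultimately forming informational intelligence such as prediction and judgment in order to achieve specific goals [4].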

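In an AI technology system driven primarily by data, machine learning depends ever more heavily on training data. Unlike traditional software development, which relies on preset fixed rules, machine learning achieves capability transfer and performance optimization by learning autonomously from massive data. High-quality corpora have therefore become a key variable in model performance. Publicly available personal data in cyberspace is easy to obtain and information-dense, which matches the training corpus needs of generative AI development. It is consequently collected at scale and has become an important component of training sets, supporting the construction and optimization of machine learning models in scenarios such as personalized user recommendation, natural language processing, face recognition training, and financial risk control and credit assessment. How to use publicly available personal data in machine learning both efficiently and in a compliant manner has thus become an important issue for AI development and for the protection of personal information rights and interests.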

For the full text of this article, please download:

http://ihrv.cn/resource/share/2000006992

Author information:

Wang Wanqing

(China Institute for Rule of Law Strategy, East China University of Political Science and Law, Shanghai 200042)

Original content statement: This content is original to the AET website. Reproduction without authorization is prohibited.
