A Survey of the Development of Computer Text Analysis Algorithms

Application of Electronic Technique, 2023, Issue 3

Sun Jinghan1, Ren Jing2

(1. Beijing University of Technology, Beijing 100124; 2. The Sixth Research Institute of China Electronics Corporation, Beijing 100083)

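Abstract: Computer text analysis is an important branch of natural language processing. It is the computer technology concerned with extracting the various kinds of information contained in a given corpus from text data. Computer text analysis has now entered a new historical stage: on the one hand, keyword extraction algorithms have gradually matured; on the other hand, the emergence of BERT has brought great progress on the problem of word vector computation. Nevertheless, both keyword extraction and word vector computation still have open problems. In addition, many existing studies for which text analysis is well suited still rely on early text analysis methods. How to reduce model size, so as to promote interdisciplinary integration and raise the overall social benefit of text analysis, will therefore become an important question for the future development of text analysis algorithms.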

Keywords: text analysis; natural language processing; algorithm

CLC number: TP181  Document code: A  DOI: 10.16157/j.issn.0258-7998.223117
Citation format: Sun Jinghan, Ren Jing. A survey of the development of computer text analysis algorithms[J]. Application of Electronic Technique, 2023, 49(3): 42-47.

A survey of the development of computer text analysis algorithms

Sun Jinghan1，Ren Jing2

(1. Beijing University of Technology, Beijing 100124, China; 2. The Sixth Research Institute of China Electronics Corporation, Beijing 100083, China)

Abstract: Computer text analysis is an important branch of natural language processing. It studies how to extract the various types of information carried by a given corpus from text data. Computer text analysis has now entered a new historical stage. On the one hand, keyword extraction algorithms have gradually matured. On the other hand, with the emergence of the BERT method, great progress has been made on the word vector calculation problem. However, there are still problems to be solved in both keyword extraction and word vector calculation. In addition, many existing studies that are well suited to text analysis still use early text analysis methods. Therefore, how to reduce model size in order to promote the integration of disciplines and improve the comprehensive social benefits of text analysis will become an important issue in the future development of text analysis algorithms.

Key words: text analysis; natural language processing; algorithm

0 Introduction

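Computer text analysis is an important branch of natural language processing (NLP). It refers to the computer techniques that analyze text data, or the texts held in a corpus, and ultimately extract various kinds of information from the given corpus, including keywords and word vectors. Some of the literature also groups the techniques of this field under NLP pre-training techniques. Text analysis originated in the 1950s and 1960s, when research focused on how to specify language rules. By the 1970s, as corpora grew richer and hardware improved, text analysis began to incorporate machine learning algorithms and developed rapidly. Since the start of the 21st century, deep learning methods have been applied to text analysis, giving rise to techniques such as Word2Vec and BERT and further broadening the range of scenarios in which text analysis can be applied. In the foreseeable future, traditional analysis methods based on numerical data will gradually fail to meet increasingly complex application demands, and text analysis, together with the natural language processing field it belongs to, will become an increasingly important direction of development.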

To read the full text of this article, download it from: http://ihrv.cn/resource/share/2000005227


Originality statement: This content is original to the AET website. Reproduction without authorization is prohibited.
