
Research on Lexical Chain Construction for News Texts Based on a Wikipedia Corpus

Published: 2018-07-05 09:57

Topics: natural language processing + Wikipedia; Source: master's thesis, Kunming University of Science and Technology, 2017


【Abstract】: An efficient text-processing method allows news texts to be handled quickly, yielding the text categories, keywords, and deeper semantic content and relations that readers need. Lexical chain construction is therefore important for rapid news-text processing. Compared with traditional keyword-extraction methods based on term frequency or machine learning, lexical chains built on a web corpus incorporate human knowledge; because an online corpus such as Wikipedia is updated frequently and has a well-organized category structure, studying news texts via lexical chains gives better results than other approaches. Existing methods for constructing Chinese lexical chains, however, do not handle word-sense disambiguation well, and the resulting chains often fail to express the semantic clusters of the text correctly, which in turn degrades the quality of the extracted keywords. To help readers grasp the gist of a news text quickly and identify its discourse structure, this thesis studies the following:

(1) Two features of Wikipedia, the category structure graph and the document link graph, are exploited separately: a depth-weighted path length (DPL) algorithm computes relations between node depths from the path information of candidate words, while an explicit semantic analysis (ESA) algorithm computes word-to-word relatedness from gloss-based text vectors derived from document classification information. These scores drive the initial construction of lexical chains. The keyword-extraction algorithm is further improved by weighting candidate words, and the initial chains are optimized using five feature items of news texts. The proposed construction algorithm was evaluated on a corpus of more than 1,500 news articles crawled from portal sites; compared against other keyword-extraction methods, the keywords extracted via the proposed lexical chains were of higher quality.

(2) Combining the subordination relations of the Wikipedia corpus, the structural properties of the corpus itself, and link-recurrence features with the classic MGKM2003 method yields the MGKM-WIKI disambiguation algorithm, which further disambiguates the initial lexical chains. Using Semval-3 as the candidate-word dataset for the word-sense disambiguation system, MGKM-WIKI was compared against other supervised and unsupervised disambiguation algorithms and achieved good results.

(3) On the basis of the completed lexical-chain construction, alignment techniques are used to build lexical chains for Vietnamese news texts, and the method is tested on a large crawled corpus of Vietnamese news articles.

(4) A prototype system integrating the above is designed; it constructs lexical chains for both Chinese and Vietnamese news texts, enabling readers to quickly grasp the gist of a news item and determine its discourse structure.
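Point (1) above combines an ESA-style word relatedness score with incremental chain building. The sketch below illustrates the general idea only: the concept vectors are invented toy data (in real ESA they would be TF-IDF weights over Wikipedia articles), and the threshold, vector values, and function names are illustrative assumptions, not the thesis's actual implementation.

```python
from math import sqrt

# Toy "concept vectors": in ESA each word is represented by a vector of
# TF-IDF weights over Wikipedia articles (concepts). These values are
# invented for illustration, not real Wikipedia statistics.
CONCEPT_VECTORS = {
    "bank":    {"Finance": 0.9, "River": 0.4},
    "loan":    {"Finance": 0.8, "Contract": 0.3},
    "deposit": {"Finance": 0.7, "Geology": 0.2},
    "river":   {"River": 0.9, "Geography": 0.5},
}

def esa_relatedness(w1, w2):
    """Cosine similarity of two words' ESA concept vectors."""
    v1 = CONCEPT_VECTORS.get(w1, {})
    v2 = CONCEPT_VECTORS.get(w2, {})
    dot = sum(v1[c] * v2.get(c, 0.0) for c in v1)
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def build_lexical_chains(words, threshold=0.3):
    """Greedy chaining: attach each word to the first existing chain
    whose every member it is sufficiently related to; otherwise start
    a new chain."""
    chains = []
    for w in words:
        for chain in chains:
            if all(esa_relatedness(w, m) >= threshold for m in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

chains = build_lexical_chains(["bank", "loan", "deposit", "river"])
# "river" is related to "bank" (shared River concept) but not to "loan"
# or "deposit", so it starts its own chain.
```

Requiring relatedness to every chain member (rather than any one member) keeps chains semantically tight; the threshold would in practice be tuned on a development corpus.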
【Degree-granting institution】: Kunming University of Science and Technology
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP391.1




Document ID: 2099827


Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2099827.html

