Research on Chinese-Vietnamese News Keyword Extraction Based on Hypergraphs

Published: 2018-10-10 10:27
[Abstract]: With the rollout of the Belt and Road Initiative, China has been paying increasing attention to Vietnam, and news, as a carrier of information, is an important way for people to obtain it. However, Vietnamese is a minor language that very few people have mastered, and online news almost never supplies keywords, which makes locating relevant news difficult. Chinese-Vietnamese news keyword extraction can save a great deal of time and improve information utilization, and it has significant research value now that Chinese-Vietnamese relations are growing closer. Current keyword extraction work usually considers only the feature information of individual words and ignores the complex relationships that exist within news documents, so expressing these relationships with a suitable model has become a pressing problem. A hyperedge in a hypergraph model can express a complex relationship among multiple entities, which matches the need to represent the many-way relationships in news documents. This thesis therefore uses hypergraph models to study keyword extraction in single-document, multi-document, and bilingual settings. The main contributions are as follows:
1. A news keyword extraction method based on hypergraph ranking for single documents. Because a hypergraph can express the relationship between words and sentences in a document, the method first analyzes the structural features of a single document, takes words as vertices, and uses term frequency, part of speech, word span, and position as word weights. It then takes sentences as hyperedges to build a single-document news hypergraph model.
2. A news keyword extraction method based on hypergraph ranking for multiple documents. Because a hyperedge can represent an entire news document, the method analyzes how features of the news web page itself affect keyword extraction, extracts the page's time and comment-count elements as hyperedge feature weights, and builds a multi-document news hypergraph model.
3. A Chinese-Vietnamese bilingual news keyword extraction method based on hypergraph ranking for multiple documents. Because a hypergraph can express Chinese-Vietnamese word correspondences through hyperedges and thereby link the two languages, the method first analyzes the characteristics of bilingual news documents, takes bilingual term frequency as the core feature of a word, and then constructs two types of hyperedges to build a bilingual news hypergraph model.
Finally, a hypergraph-based random walk algorithm ranks the vertices of the hypergraph, and the top-ranked words are output as the keywords of the news documents. Experiments demonstrate the effectiveness of the methods.
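The abstract names the ingredients of the ranking step (words as vertices, sentences or documents as hyperedges, a random walk over the hypergraph) but gives no formulas. The following is a minimal Python sketch of one common formulation, not the thesis's actual algorithm. It assumes a walker that picks an incident hyperedge with probability proportional to the hyperedge weight, then a vertex inside that hyperedge uniformly, with PageRank-style damping. The function names `hypergraph_rank` and `top_keywords`, the damping value 0.85, and the use of vertex weights as a teleport distribution are all illustrative assumptions.

```python
def hypergraph_rank(hyperedges, edge_weights=None, vertex_prior=None,
                    damping=0.85, max_iter=200, tol=1e-12):
    """Score vertices with a damped random walk on a hypergraph.

    hyperedges: list of vertex collections, e.g. the words of each sentence.
    edge_weights: optional positive weight per hyperedge (assumed, e.g. a
        recency or comment-count score for a document-level hyperedge).
    vertex_prior: optional dict of non-negative word weights, used here
        only as the teleport distribution (an assumption of this sketch).
    """
    edges = [set(e) for e in hyperedges]
    weights = [1.0] * len(edges) if edge_weights is None else list(edge_weights)
    if len(weights) != len(edges):
        raise ValueError("need exactly one weight per hyperedge")
    vertices = sorted(set().union(*edges))
    if not vertices:
        return {}

    # Weighted degree d(v): total weight of the hyperedges containing v.
    degree = {v: 0.0 for v in vertices}
    for edge, w in zip(edges, weights):
        for v in edge:
            degree[v] += w

    # Teleport distribution: uniform unless vertex weights are supplied.
    raw = {v: 1.0 if vertex_prior is None else vertex_prior.get(v, 0.0)
           for v in vertices}
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("vertex_prior must give some vertex a positive weight")
    prior = {v: x / total for v, x in raw.items()}

    score = dict(prior)
    for _ in range(max_iter):
        new = {v: (1.0 - damping) * prior[v] for v in vertices}
        for edge, w in zip(edges, weights):
            if not edge:
                continue
            # Probability mass flowing into this hyperedge from its members,
            # then split evenly among the members.
            mass = sum(score[v] * w / degree[v] for v in edge)
            share = damping * mass / len(edge)
            for u in edge:
                new[u] += share
        delta = sum(abs(new[v] - score[v]) for v in vertices)
        score = new
        if delta < tol:
            break
    return score


def top_keywords(score, k=3):
    """Return the k highest-scoring vertices, ties broken alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy single-document example: each inner list is one tokenized sentence.
    sentences = [
        ["hypergraph", "keyword", "extraction"],
        ["hypergraph", "news", "ranking"],
        ["hypergraph", "keyword", "news"],
    ]
    print(top_keywords(hypergraph_rank(sentences), k=3))
```

In this sketch a multi-document model would pass one hyperedge per document with `edge_weights` derived from page features, and a bilingual model would add extra hyperedges that link translated word pairs. How the thesis actually combines these weights is not stated in the abstract.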
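[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1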




Article ID: 2261433


Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2261433.html

