Research on Keyword Extraction from Chinese News Web Pages Based on Combined Features

Published: 2018-01-20 13:17

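Keywords: keyword extraction; combined features; compound words; directed graph; news web pages　Source: Beijing Forestry University, master's thesis, 2013　Document type: degree thesis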

[Abstract]: With the rapid development of the Internet, online information is growing explosively, and news web pages have become an important channel through which people obtain information. How to acquire and process the information in news web pages quickly and effectively has become an important research topic. In the search engine field, extracting web page content and keywords is foundational work for automatic text processing. The keywords of a web page reflect its main content and identify the page effectively, which facilitates further processing. This thesis first introduces the theoretical background of keyword extraction, including the concept of keyword extraction, natural language processing, and web page content extraction. It then introduces compound words and methods for generating them. Next, it proposes a keyword extraction method for news web pages based on combined features: after the page text is segmented into words, candidate keywords are obtained by computing the weights of text features, compound words are obtained with a directed-graph-based compound word generation algorithm, and the final keywords are produced by deduplicating and merging the results. Finally, experiments on news web pages show that the proposed method extracts their keywords effectively.
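The pipeline outlined in the abstract (weight the segmented words, pick candidates, join candidates into compound words through a directed graph, then deduplicate and merge) can be illustrated with a minimal Python sketch. The abstract does not specify the features, weights, or graph construction used in the thesis, so everything here is an assumption made for illustration: the feature set (term frequency with a title boost), the adjacency-count graph, the thresholds, and all function and parameter names are hypothetical. The input is assumed to be already segmented into words.

```python
from collections import Counter, defaultdict


def score_words(title_tokens, body_sentences, title_boost=2.0):
    """Assumed feature combination: term frequency, boosted for title words."""
    tf = Counter(w for sent in body_sentences for w in sent)
    title_set = set(title_tokens)
    return {w: c * (title_boost if w in title_set else 1.0) for w, c in tf.items()}


def build_digraph(body_sentences):
    """Directed graph: edges[a][b] counts how often b immediately follows a."""
    edges = defaultdict(Counter)
    for sent in body_sentences:
        for a, b in zip(sent, sent[1:]):
            edges[a][b] += 1
    return edges


def compound_pairs(edges, candidates, min_count=2):
    """Pairs of candidate words joined by an edge seen at least min_count times."""
    pairs = {}
    for a in candidates:
        for b, n in edges.get(a, {}).items():
            if b in candidates and n >= min_count:
                pairs[(a, b)] = n
    return pairs


def extract_keywords(title_tokens, body_sentences, top_k=5, min_count=2):
    scores = score_words(title_tokens, body_sentences)
    # Candidate keywords: the top_k words by weight (ties broken by the word itself).
    candidates = set(sorted(scores, key=lambda w: (-scores[w], w))[:top_k])
    pairs = compound_pairs(build_digraph(body_sentences), candidates, min_count)
    # Deduplicate and merge: a single word already used in a compound is dropped.
    used = {w for pair in pairs for w in pair}
    singles = sorted(candidates - used, key=lambda w: (-scores[w], w))
    compounds = ["".join(p) for p in sorted(pairs, key=lambda p: (-pairs[p], p))]
    return compounds + singles


# Hypothetical pre-segmented input (a real system would run a Chinese word segmenter first).
title = ["关键词", "提取"]
body = [
    ["关键词", "提取", "方法"],
    ["新闻", "网页", "关键词", "提取"],
    ["新闻", "网页", "内容"],
]
keywords = extract_keywords(title, body)
print(keywords)
```

In this toy input, "关键词" and "提取" appear next to each other twice, as do "新闻" and "网页", so each pair is merged into a compound word and the individual words are dropped from the final list.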
[Degree-granting institution]: Beijing Forestry University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1

[Similar literature]