A Text Keyword Extraction and Analysis Platform Based on Complex Networks

Published: 2018-03-31 06:42

Topic: extraction　Focus: weighted text network　Source: Nanjing University of Posts and Telecommunications, master's thesis, 2017

[Abstract]: With the arrival of the information age and the rapid growth of the Internet, keywords, as a concise summary of a text's topic, have become an indispensable tool for users searching for information, and how to mine text keywords quickly and effectively has become an active research topic. Keyword extraction based on complex networks is one of the more recent approaches and has attracted considerable research interest. This thesis abstracts text data as a complex network for study and analysis, and builds a keyword extraction and analysis platform that extracts text keywords automatically in batches. The main contributions are as follows:
1. It surveys domestic and international research on keyword extraction, introduces the classical extraction methods used in different fields, and analyzes the limitations of each class of methods. It then examines existing complex-network-based keyword extraction algorithms, describes in detail the measures commonly used to rank node importance in complex networks, including common statistical parameters and related algorithms, and compares them.
2. Because word frequency matters to a text's topic, it introduces the concept of a "word frequency sharing weight" and, from it, a new method for constructing a weighted text network: the word frequency of a target node is distributed over its edges according to how much each neighboring node contributes to the target node's importance, which weights the network. This departs from most existing work, which weights an edge by the number of times the two words co-occur in the same sentence.
3. On top of the weighted text network, it introduces a position weight coefficient motivated by the characteristics of human language and, building on the PageRank algorithm, proposes LTWPR, a complex-network-based text keyword extraction algorithm. Multi-category keyword extraction experiments were run with LTWPR on a collected Sina news corpus, and the results were compared with two classical algorithms to verify the algorithm's accuracy and effectiveness. Several further analyses indicate that LTWPR performs well at mining text keywords and is suited to mining key nodes in large batches of text networks.
4. It develops a complex-network-based text keyword extraction and analysis platform that reads text data in batches and outputs text keywords in batches. The platform has a simple, friendly interface, is convenient to operate, and is highly extensible. It can process text data in batches, simulate various text keyword extraction algorithms, and compare their results with the keywords annotated by the authors of the texts. The platform integrates the research results of this thesis, supports fast and intuitive research on text keyword extraction, and has good engineering practicality.
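The pipeline described in contributions 2 to 4 can be illustrated with a minimal Python sketch. The abstract does not give the LTWPR formulas, so everything below is an assumption made for illustration and not the thesis's actual method: the edge weight splits a node's word frequency over its edges in proportion to sentence-level co-occurrence counts, the position weight is applied as the teleport distribution of a weighted PageRank, and the function names (`build_weighted_network`, `position_weighted_pagerank`, `extract_keywords`, `precision_recall`) are hypothetical. Input sentences are assumed to be already tokenized.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_weighted_network(sentences):
    """Return (tf, weights), where weights[j][i] is the weight of edge j -> i."""
    tf = Counter(w for s in sentences for w in s)
    cooc = defaultdict(Counter)
    for s in sentences:
        # Two distinct words are linked if they appear in the same sentence.
        for a, b in combinations(sorted(set(s)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    weights = defaultdict(dict)
    for i, neigh in cooc.items():
        total = sum(neigh.values())
        for j, c in neigh.items():
            # ASSUMPTION: node i's word frequency is shared among its edges
            # in proportion to how often each neighbour j co-occurs with i.
            weights[j][i] = tf[i] * c / total
    return tf, weights


def position_weighted_pagerank(weights, position, d=0.85, iters=100, tol=1e-9):
    """Weighted PageRank; `position` biases the random jump (an assumption)."""
    nodes = sorted(set(weights) | {i for out in weights.values() for i in out})
    z = sum(position.get(n, 1.0) for n in nodes)
    jump = {n: position.get(n, 1.0) / z for n in nodes}
    out_sum = {j: sum(out.values()) for j, out in weights.items() if out}
    rank = dict(jump)
    for _ in range(iters):
        # Rank held by nodes without outgoing edges is redistributed via jump.
        dangling = sum(rank[n] for n in nodes if n not in out_sum)
        new = {n: ((1 - d) + d * dangling) * jump[n] for n in nodes}
        for j, out in weights.items():
            for i, w in out.items():
                new[i] += d * rank[j] * w / out_sum[j]
        delta = sum(abs(new[n] - rank[n]) for n in nodes)
        rank = new
        if delta < tol:
            break
    return rank


def extract_keywords(sentences, position=None, k=3):
    """Return the k highest-ranked words (ties broken alphabetically)."""
    _, weights = build_weighted_network(sentences)
    rank = position_weighted_pagerank(weights, position or {})
    return sorted(rank, key=lambda n: (-rank[n], n))[:k]


def precision_recall(extracted, annotated):
    """Compare extracted keywords with author-annotated keywords."""
    hits = len(set(extracted) & set(annotated))
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(annotated) if annotated else 0.0
    return precision, recall


if __name__ == "__main__":
    toy = [["network", "keyword", "text"],
           ["network", "keyword"],
           ["network", "graph"]]
    # Hypothetical position weight: pretend "keyword" appears in the title.
    top = extract_keywords(toy, position={"keyword": 2.0}, k=2)
    print(top, precision_recall(top, ["network", "text"]))
```

Placing the position weight in the teleport term keeps the rank vector normalized without touching the edge weights. A variant that multiplies the position weight into the edge weights would be equally plausible given what the abstract says.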
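[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1;O157.5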
