A TextRank Keyword Extraction Method Fusing Multiple Features

Published: 2018-05-20 23:19

Topics: TextRank algorithm + keyword extraction; Source: Journal of Intelligence (《情报杂志》), 2017, Issue 08

[Abstract]: [Purpose/Significance] Keyword extraction is widely used in natural language processing, and extracting keywords quickly and accurately has become a key problem in text processing. Many keyword extraction methods exist, but their accuracy still leaves room for improvement. This paper therefore proposes a keyword extraction method that combines the internal structural information of a single document with the importance of each word both to that document and to the document collection as a whole. [Method/Process] First, the importance of a word to the document collection as a whole is computed from its average information entropy, and its importance within a single document is computed from its part-of-speech and position features. Next, the weights assigned to these three features are optimized by training a neural network, which fuses the features. Finally, the composite weight of each word, computed from the three features, is used to improve both the initial weights of the word nodes and the probability transition matrix of the TextRank model, and keywords are then extracted by iteration. [Results/Conclusions] The method combines information from the whole document collection with information from the single document itself, and its keyword extraction accuracy is noticeably higher than that of the traditional TextRank method and the TFIDF-TextRank method.
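The abstract does not give the paper's formulas, so the following Python sketch is only one plausible reading of the final step, not the authors' implementation. It assumes the three feature scores (entropy, part of speech, position) are already available per word, fuses them with fixed hypothetical coefficients in place of the neural-network-trained ones, and then uses the composite weight both as the initial node score and to bias the transition probabilities toward heavier neighbours. All function names, the window size, and the damping factor 0.85 are assumptions made for illustration.

```python
from collections import defaultdict


def combined_weights(features, coeffs):
    """Fuse per-word feature scores into one normalized composite weight.

    features: word -> (entropy_score, pos_score, position_score), all positive.
    coeffs:   three fusion coefficients (hypothetical; the paper learns them).
    """
    raw = {w: sum(c * f for c, f in zip(coeffs, fs)) for w, fs in features.items()}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()}


def cooccurrence_graph(tokens, window=3):
    """Undirected graph linking words that co-occur within `window` tokens."""
    graph = defaultdict(set)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if v != w:
                graph[w].add(v)
                graph[v].add(w)
    return graph


def weighted_textrank(graph, weights, d=0.85, iters=50):
    """Iterate TextRank with composite weights as initial scores.

    The probability of moving from v to a neighbour w is proportional to
    weights[w], and the teleport term also follows the composite weights.
    """
    score = {w: weights[w] for w in graph}
    for _ in range(iters):
        new = {}
        for w in graph:
            rank = 0.0
            for v in graph[w]:
                out = sum(weights[u] for u in graph[v])  # normalizer for v's row
                rank += score[v] * weights[w] / out
            new[w] = (1 - d) * weights[w] + d * rank
        score = new
    return score


if __name__ == "__main__":
    tokens = ["keyword", "extraction", "graph", "keyword",
              "ranking", "graph", "extraction"]
    # Made-up feature scores, for demonstration only.
    features = {
        "keyword": (0.9, 1.0, 1.0),
        "extraction": (0.7, 1.0, 0.8),
        "graph": (0.5, 1.0, 0.5),
        "ranking": (0.4, 0.8, 0.3),
    }
    weights = combined_weights(features, coeffs=(0.5, 0.3, 0.2))
    scores = weighted_textrank(cooccurrence_graph(tokens), weights)
    for word, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{word}\t{s:.4f}")
```

When every word in the graph has a weight and the weights sum to 1, each iteration preserves the total score, so the result can be read as a probability distribution over candidate keywords.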
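[Affiliations]: School of Computers, Guangdong University of Technology; School of Art and Design, Guangdong University of Technology
[Funding]: Guangdong Province and Ministry Industry-University-Research Special Fund, Enterprise Innovation Platform project "Research on a User Data Mining System for the Home Appliance Industry and Experiential Design Innovation Services" (No. 2013B090800042)
[Classification No.]: TP391.1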
