Chinese Keyword Extraction with a TextRank Algorithm Improved Using Basic-Level Categories

Published: 2019-06-21 05:04

[Abstract]: Automatic keyword extraction is a foundational technique for text classification, information retrieval, automatic summarization, and related tasks, and it has substantial practical value. Drawing on basic-level category theory, this thesis proposes an improvement to the TextRank keyword extraction algorithm and evaluates its extraction performance. The thesis has five parts. Part 1 is the introduction: it presents the background and significance of the topic, surveys the state of research on keyword extraction, briefly introduces the theories used in the study, including basic-level categories and language networks, and describes the source of the corpus. Part 2 explains why basic-level category theory is a reasonable basis for improving TextRank and presents the improved algorithm. The core of the improvement is a hierarchical lexicon built on basic-level category words. Each word in the lexicon has an attribute set that records its level in the hierarchy, its semantic relations, and its base weight. Part 3 describes in detail how the lexicon is constructed, which mainly involves two tasks: selecting the basic-level category words and determining the base weight of each word. Part 4 evaluates the improved algorithm. Three types of text (scientific papers, web news, and Weibo posts) serve as evaluation material, and keywords are extracted from them with both the original and the improved TextRank algorithm. The experimental results show that the improved algorithm achieves higher precision, recall, and F1 than the original algorithm. Part 5 is the conclusion, which summarizes the main content of the thesis and briefly discusses directions for further improving the algorithm.
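As a rough illustration of the kind of algorithm the abstract describes, the following is a minimal Python sketch of TextRank over a word co-occurrence graph. The abstract does not say how the lexicon's base weights enter the ranking; using them as a per-word prior in the teleport term is an assumption made here for illustration only, and the names `textrank`, `base_weight`, `window`, and `damping` are hypothetical rather than taken from the thesis.

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iters=50, base_weight=None):
    """Score words with TextRank on an undirected co-occurrence graph.

    tokens:      list of already segmented (and filtered) words.
    window:      each word is linked to the next `window` tokens.
    base_weight: optional dict word -> positive prior. This is a hypothetical
                 stand-in for the lexicon's base weights (an assumption, not
                 the thesis's method). Words not listed get a prior of 1.0.
    Scores are normalized priors plus propagated rank, so with no isolated
    words they sum to 1.
    """
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)

    words = sorted(set(tokens))
    if not words:
        return {}

    # Normalize the priors so they form a probability distribution.
    raw = {w: (base_weight or {}).get(w, 1.0) for w in words}
    total = sum(raw.values())
    prior = {w: p / total for w, p in raw.items()}

    score = dict(prior)
    for _ in range(iters):
        updated = {}
        for w in words:
            # Each neighbor shares its score equally among its own neighbors.
            rank = sum(score[u] / len(neighbors[u]) for u in neighbors[w])
            updated[w] = (1 - damping) * prior[w] + damping * rank
        score = updated
    return score


if __name__ == "__main__":
    tokens = ["keyword", "extraction", "keyword", "graph", "keyword", "rank"]
    scores = textrank(tokens, window=1)
    for word, value in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(word, round(value, 4))
```

With `base_weight=None` every word has the same prior and the function reduces to a normalized form of plain TextRank; passing a dictionary of weights biases the ranking toward the listed words.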
[Degree-granting institution]: Central China Normal University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: H136

[References]