Research on Text Keyword Extraction Techniques and Their Applications

Published: 2018-06-05 21:16

Topics: Uyghur + keyword extraction; Source: master's thesis, Xinjiang University, 2014

[Abstract]: With the arrival of the network era, online documents have proliferated and their number keeps growing sharply every day. Faced with information resources on this scale, extracting the key content effectively is very important. Keyword extraction matters for research on automatic text summarization, text classification, text clustering, and information retrieval. First, this thesis builds a text corpus for training and testing, 1000 documents in total (500 in the health category; the other 500 are non-health documents covering computing, education, economics, real estate, history, geography, and other topics). Second, it applies a TextRank-based keyword extraction method. Experiments show that the highest document classification accuracy obtained with this method is 75.5%, and that adding more keywords does not noticeably improve the classification results. To improve classification accuracy further, we propose a discriminative keyword extraction method based on TF/IDF, which obtains discriminative keywords by computing the between-class differences of the same word under different combinations of statistics. Experiments show that this method reaches a highest document classification accuracy of 98.5% (with 100 keywords). Although the TF/IDF-based discriminative method is very effective for document classification, it depends on collecting a large number of keywords and lacks a theoretical foundation, so it has certain limitations. The thesis therefore adopts SDA (sparse discriminant analysis), a method common in biotechnology. Experiments show that SDA reaches a document classification accuracy of 98% (with 90 keywords), achieving good classification on a small data set. To raise accuracy on small data sets further, we also study a keyword extraction method based on SparseSVM. With 10, 20, and 30 keywords, the SDA-based method obtains document classification accuracies of 88.5%, 90.5%, and 91.5%, while the SparseSVM-based method obtains 90%, 92%, and 95.5%. These results indicate that SparseSVM is more effective on small data sets. Finally, to verify the performance stability of these techniques, the thesis reports Uyghur text sentiment identification experiments based on all four methods, with satisfactory results.
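
The abstract names TextRank as its baseline keyword extractor but does not give the configuration used. As a rough illustration only, the following Python sketch ranks words by running a PageRank-style update over a word co-occurrence graph. The function name `textrank_keywords`, the window size, the damping factor, and the iteration count are all assumptions made here, not values from the thesis, and the input is assumed to be already tokenized (any Uyghur stemming or stop-word filtering would happen upstream).

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=5, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score on an undirected co-occurrence graph.

    All parameter defaults are illustrative assumptions, not thesis settings.
    """
    # Link each word to the distinct words that follow it within `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window + 1]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Fixed number of synchronous updates of the unweighted TextRank score.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


if __name__ == "__main__":
    sample = "health diet health exercise health sleep health doctor".split()
    print(textrank_keywords(sample, top_k=3))
```

In a pipeline like the one the abstract describes, the top-ranked words of each document would then serve as features for the document classifier; how the thesis builds that classifier is not stated here.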
[Degree-granting institution]: Xinjiang University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP391.1

[References]