Keyword Extraction Using TextRank Weighted by Word-Vector Clustering
Published: 2019-06-27 18:25
[Abstract]: [Objective] To incorporate the world knowledge contained in Wikipedia into the TextRank model in the form of word vectors, and thereby improve single-document keyword extraction. [Methods] A Word2Vec model is trained on Chinese Wikipedia data to produce word vectors. The word vectors of the nodes in the TextRank word graph are clustered, and the clustering is used to adjust the voting importance of the nodes within each cluster. Together with the coverage and position factors of each node, this yields the random jump probabilities between nodes, from which the transition matrix is built. Node importance scores are then obtained by iterative computation, and the top TopN words are selected as keywords. [Results] When TopN ≤ 7, the word-vector clustering weighting method outperforms all of the comparison methods; at TopN = 3 the F-value reaches its maximum, an incremental improvement of 3.374% over the previous best result; when TopN > 7, the results are similar to those of the position-weighting method. [Limitations] The cluster analysis increases the computational cost. [Conclusions] Word-vector clustering weighting can improve keyword extraction.
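The abstract does not give the formulas for the cluster, coverage, or position weights, so the following is only a minimal sketch of the general shape of a node-weighted TextRank, not the authors' method. It assumes each word already has a single combined weight (the hypothetical `node_weight` argument, standing in for whatever the clustering, coverage, and position factors would produce), and it performs no Word2Vec training or clustering itself. The function name, the `window` and `damping` parameters, and the example tokens are all illustrative assumptions.

```python
import numpy as np


def weighted_textrank(words, window=2, node_weight=None,
                      damping=0.85, tol=1e-8, max_iter=200):
    """Rank the distinct words of a token list with a node-weighted TextRank.

    `node_weight` is a hypothetical dict mapping a word to a positive weight;
    words missing from it get weight 1.0 (plain TextRank behaviour).
    """
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)

    # Undirected co-occurrence graph: link words that fall in the same window.
    adj = np.zeros((n, n))
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:
            if u != w:
                adj[index[w], index[u]] = adj[index[u], index[w]] = 1.0

    weights = np.array([1.0 if node_weight is None else node_weight.get(w, 1.0)
                        for w in vocab])

    # trans[i, j]: probability of jumping from node j to neighbour i,
    # proportional to the weight of the target node i.
    trans = adj * weights[:, None]
    col_sums = trans.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # isolated nodes keep an all-zero column
    trans = trans / col_sums

    # Power iteration with the usual damping term.
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1 - damping) / n + damping * (trans @ scores)
        converged = np.abs(new_scores - scores).sum() < tol
        scores = new_scores
        if converged:
            break

    return sorted(zip(vocab, scores), key=lambda pair: -pair[1])


# Illustrative input: a pre-segmented, stop-word-filtered token list.
tokens = ["keyword", "extraction", "graph", "keyword",
          "ranking", "graph", "keyword"]
top_n = 3
print(weighted_textrank(tokens)[:top_n])
print(weighted_textrank(tokens, node_weight={"ranking": 5.0})[:top_n])
```

Raising a word's weight makes its neighbours more likely to jump to it, which is one plausible way to let cluster membership or position influence the vote a node receives. In a fuller pipeline the weights might come from clustering word vectors, for example with a pretrained Word2Vec model, but that step is left out here.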
Article ID: 2507032
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2507032.html