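A Topic Word Extraction Algorithm Based on an Improved TF-IDF Algorithm and Co-occurring Words
Published: 2018-03-27 20:33
Topic: co-occurring words  Entry point: mutual information  Source: Journal of Nanjing University (Natural Science) (《南京大学学报(自然科学)》), 2017, Issue 06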
[Abstract]: Extracting the topic of a piece of information is a basic task for quickly locating user needs. Topic word extraction faces three main problems: computing word weights, measuring the relationships between words, and the curse of dimensionality. To compute word weights, the method first uses mutual information to identify co-occurring word pairs and combines this nonlinearly with word frequency, part of speech, and word position. It then builds a document-by-co-occurring-word matrix from the word weights and establishes a latent semantic analysis (LSA) model. Using the singular value decomposition (SVD) of the LSA model, the method maps the document-by-co-occurring-word matrix into the latent semantic space, which both reduces the dimensionality of the data and yields a low-dimensional document similarity matrix. Finally, k-means clustering is applied to the document similarity matrix, and within each cluster the few co-occurring word pairs with the largest word weights are selected as the topic words for that class of articles. Compared with topic word extraction based on TF-IDF (Term Frequency-Inverse Document Frequency) and on co-occurring words, the algorithm improves accuracy by 19% and 10%, respectively.
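The first step of the method, using mutual information to pick out co-occurring word pairs, can be illustrated with a minimal sketch. The abstract does not give the paper's exact formula, its threshold, or how the score is combined with word frequency, part of speech, and position, so everything below is an assumption: it uses standard pointwise mutual information computed over document-level co-occurrence, and the function name `pmi_pairs`, the `threshold` parameter, and the toy documents are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_pairs(docs, threshold=0.0):
    """Return {(w1, w2): PMI} for word pairs whose PMI exceeds `threshold`.

    Assumption: two words "co-occur" when they appear in the same document,
    and probabilities are estimated as document frequencies / number of docs.
    The paper's own definition may differ.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in docs:
        vocab = sorted(set(doc))  # sorted so each pair has one canonical order
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (a, b), n_ab in pair_df.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) )
        pmi = math.log2(n_ab * n_docs / (word_df[a] * word_df[b]))
        if pmi > threshold:
            scores[(a, b)] = pmi
    return scores


# Toy tokenized documents (hypothetical data, not from the paper).
docs = [
    ["topic", "word", "extraction"],
    ["topic", "word", "weight"],
    ["semantic", "analysis"],
    ["semantic", "analysis", "topic"],
]
pairs = pmi_pairs(docs)
for pair, score in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Pairs that appear together more often than their individual frequencies would predict get a positive score and are kept; pairs such as ("analysis", "topic"), which co-occur less than chance would suggest, are dropped. The later steps described in the abstract (weighting, SVD, and k-means) would operate on a document-by-pair matrix built from the surviving pairs and are not shown here.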
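[Author affiliations]: School of Computer Science and Technology, Shandong University of Finance and Economics; School of Software, Qufu Normal University; School of Computer Science, Shandong University
[Funding]: Humanities and Social Sciences Research Project of the Ministry of Education (15YJAZH042); Key Project of Teaching Reform Research for Undergraduate Universities in Shandong Province (2015Z058)
[Classification number]: TP391.1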
Article ID: 1673142
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1673142.html