Further Research on the Co-occurrence Latent Semantic Vector Space Model
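Published: 2018-01-26 05:58
Keywords: vector space model; CLSVSM; TCLSVSM; co-occurrence analysis; clustering. Source: Journal of Intelligence (《情报杂志》), 2017, Issue 12. Paper type: journal article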
[Abstract]: [Purpose/Significance] Representing documents as vectors is the first task in document clustering. The co-occurrence latent semantic vector space model (CLSVSM) uses co-occurrence analysis to mine the maximum latent semantic information between pairs of feature terms and so supplements the vector space model (VSM) with semantics; compared with VSM, it markedly improves clustering performance on Chinese documents. However, the model needs further study. Its applicability to clustering English documents remains to be tested. Can statistics other than the max statistic be used to build the model, and how well would the resulting models cluster? With large document collections the model's dimensionality is often high and computation is costly, so the model also needs to be optimized. [Method/Process] First, CLSVSM is applied to topic clustering of an English document set (data from Web of Science, abbreviated WOS) and the results are compared with those of VSM. Then three other commonly used statistics, min, ave and med, are used to build corresponding CLSVSM models, and the models built from all four statistics are compared on clustering Chinese and English documents. More importantly, we propose the truncated co-occurrence latent semantic vector space model (TCLSVSM) and test its clustering performance. [Result/Conclusion] Experiments show that CLSVSM is equally applicable to clustering English documents; that among the models built from the four statistics, CLSVSM-max gives the best clustering results on both Chinese and English documents; and that TCLSVSM not only preserves clustering performance but also significantly reduces computational cost.
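The abstract does not give the model's formulas, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' definition. It assumes binary term weights, a Jaccard-style co-occurrence strength between term pairs, and that a term absent from a document receives the max, min, ave or med of its co-occurrence strengths with the terms the document does contain. The `cutoff` parameter is a hypothetical stand-in for truncation: values below it are reset to zero to keep vectors sparse. All function and parameter names are illustrative.

```python
from statistics import mean, median

# Aggregation statistics named in the abstract: max, min, ave, med.
STATS = {"max": max, "min": min, "ave": mean, "med": median}


def cooccurrence_strength(docs, vocab):
    """Jaccard-style strength for every ordered term pair (assumed definition)."""
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    strength = {}
    for a in vocab:
        for b in vocab:
            if a == b:
                continue
            both = sum(1 for d in docs if a in d and b in d)
            union = df[a] + df[b] - both
            strength[(a, b)] = both / union if union else 0.0
    return strength


def clsvsm_vectors(docs, vocab, stat="max", cutoff=0.0):
    """Binary VSM vectors whose zero entries are filled from co-occurrence.

    Filled values below `cutoff` are reset to 0.0 (hypothetical truncation rule).
    """
    agg = STATS[stat]
    strength = cooccurrence_strength(docs, vocab)
    vectors = []
    for d in docs:
        present = [t for t in vocab if t in d]
        row = []
        for t in vocab:
            if t in d:
                row.append(1.0)  # term occurs: plain binary VSM weight
            elif present:
                value = agg([strength[(t, p)] for p in present])
                row.append(value if value >= cutoff else 0.0)
            else:
                row.append(0.0)  # empty document: nothing to infer from
        vectors.append(row)
    return vectors


# Toy corpus: each document is the set of feature terms it contains.
docs = [{"vector", "model"}, {"vector", "cluster"}, {"model", "cluster", "semantic"}]
vocab = ["vector", "model", "cluster", "semantic"]

for name in STATS:
    print(name, clsvsm_vectors(docs, vocab, stat=name)[0])
```

The resulting rows can be passed to any standard clustering routine. Raising `cutoff` trades some of the inferred semantic information for sparser vectors, which is the kind of cost reduction the abstract attributes to TCLSVSM; the abstract does not specify the truncation rule actually used.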
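[Author affiliations]: School of Mathematical Sciences, Shanxi University; Institute of Management and Decision, Shanxi University
[Funding]: One of the research outcomes of the National Natural Science Foundation of China project "Construction and Application of the Co-occurrence Latent Semantic Vector Space Model and Its Semantic Kernel" (No. 71503151) and the Shanxi Province Higher Education Innovative Talent Support Program project "Research on Deep Topic Clustering of Text Information Based on Latent Semantics" (No. 2016052006)
[Classification numbers]: G353.1; TP391.1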
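[Body snapshot]: 0 Introduction. The era of big data has made information resources unprecedentedly abundant, and the vast majority of them are text. How to process this information effectively is a key research problem in text mining, information retrieval and related fields. Text information resources differ from ordinary data resources: first, text data is semi-structured or unstructured; second, text data contains a large amount of semantic information. Traditional data mining algorithms cannot ...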
Article ID: 1464875
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1464875.html