Research on the K-medoids Clustering Algorithm and Its Application in Text Clustering

Published: 2018-05-01 16:37

Topic: K-medoids + text clustering; Source: Chongqing University of Technology, master's thesis, 2017

[Abstract]: Text clustering partitions a given collection of texts into clusters so that documents in different clusters have low similarity while documents in the same cluster have high similarity. As an unsupervised machine learning method, clustering needs no training process and no manual labeling of document categories in advance, so it offers a degree of automation and considerable flexibility. It has become an important means of summarizing, navigating, and effectively organizing text information, and it is attracting growing attention from researchers. Text clustering mainly represents documents with a vector space model based on TF-IDF statistics and involves several steps: text preprocessing, Chinese word segmentation, feature extraction, feature weight calculation, the clustering algorithm itself, and clustering performance evaluation. Of these, feature term weight calculation and the choice of clustering algorithm are the two key steps in vector-space-model-based text clustering, and both bear directly on clustering quality.

Traditional feature term weighting methods consider only term frequency and inverse document frequency and ignore the influence of a document's category on feature weights. In addition, practical applications may have no standard labeled dataset. To address both problems, this thesis proposes a new feature term weighting method that combines category information with semantic contribution degree. The method first introduces semantic contribution degree and combines it with fuzzy clustering to coarsely cluster a text collection that lacks category information, yielding a collection with category information. It then introduces category information entropy and combines it with semantic contribution degree to improve the traditional TF-IDF weighting, yielding a more effective weighting method. Tests on the Chinese text classification corpus from the Fudan University Chinese natural language processing open platform show that the new weighting method outperforms the traditional one.

The K-medoids clustering algorithm is sensitive to the choice of initial cluster centers, and a poor choice can cause the clustering to settle in a local optimum. To address this, the thesis proposes a K-medoids algorithm with radius-adaptive initial center selection. In each iteration the algorithm recomputes the radius from the distribution of the remaining sample points, thereby dynamically computing the neighborhood radius and local variance of each sample point, so that better initial centers are selected and better clustering is achieved. The algorithm was tested on synthetic datasets containing different proportions of random points and on UCI datasets of varying sizes, and its performance was evaluated with five common clustering evaluation indices. The results show that it performs markedly better than comparable algorithms. Finally, the improved text clustering algorithm is built into a text clustering system that demonstrates the entire pipeline, and the system's experimental results are compared.
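For orientation, the baseline pipeline that the abstract sets out to improve (TF-IDF document vectors followed by K-medoids) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the thesis's method: the abstract gives no formulas for the category information entropy or semantic contribution weighting, nor for the radius-adaptive initialization, so the sketch uses plain TF-IDF, cosine distance, and hand-supplied initial medoids. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (plain TF-IDF)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda k: dist[i][medoids[k]])
        for i in range(len(dist))
    ]


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update on a precomputed distance matrix.

    The initial medoids are supplied by the caller; choosing them well is
    exactly the problem the thesis targets and is not implemented here.
    """
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for k, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == k]
            if not members:  # keep the old medoid if its cluster emptied out
                new_medoids.append(old)
                continue
            # The new medoid is the member with the smallest total distance
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy corpus (already tokenized): two topics with no shared vocabulary.
docs = [
    ["cluster", "medoid", "distance"],
    ["cluster", "medoid", "center"],
    ["medoid", "distance", "center"],
    ["football", "match", "goal"],
    ["football", "goal", "league"],
    ["match", "league", "goal"],
]
vectors = tfidf_vectors(docs)
dist = [
    [0.0 if i == j else cosine_distance(u, v) for j, v in enumerate(vectors)]
    for i, u in enumerate(vectors)
]
medoids, labels = k_medoids(dist, initial_medoids=[0, 3])
print(medoids, labels)
```

Working from a precomputed distance matrix keeps the medoid update independent of how documents are represented, so a different weighting scheme or initialization rule could be swapped in without touching the clustering loop.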
[Degree-granting institution]: Chongqing University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1
