Research on Web Text Clustering Based on the Shuffled Frog Leaping Algorithm

Published: 2018-05-07 11:54

Topics: Web text clustering + shuffled frog leaping algorithm; Source: Jiangnan University, 2013 master's thesis

[Abstract]: With the rapid spread and continued development of Internet technology, the volume of text on web pages is growing explosively. Mining this information effectively has become a major challenge for computer science, and people urgently need to obtain the knowledge they care about from vast Web resources quickly, accurately, and effectively. Text clustering offers an effective route to the organized management and visualization of massive text collections. As a technical foundation for information filtering, information retrieval, search engines, text databases, and digital libraries, it has been widely applied and developed. Because Web text data is massive, high-dimensional, dynamic, and unpredictable, Web-based clustering has gradually become a new research hotspot. This thesis focuses on Web text clustering algorithms. K-means and FCM (fuzzy C-means) are partition-based clustering algorithms that are widely used for Web text clustering because they are simple, fast, and effective; in practice, however, they often become trapped in local minima and are sensitive to initial values. The thesis studies the application of the shuffled frog leaping algorithm (SFLA) to Web text clustering. Combining SFLA with K-means and with FCM alleviates, to some extent, the tendency of these two algorithms to fall into local minima and their sensitivity to initial values, and improves the convergence accuracy of both. The thesis first introduces the concept, characteristics, and application areas of text clustering, describes how several classical clustering methods are implemented, and analyzes their strengths and weaknesses. Second, it presents SFLA in detail and, to address shortcomings of the traditional algorithm, proposes an improved SFLA that optimizes the initial solutions by chaotic search, generates new individuals through a mutation operation, and uses a newly designed search strategy, effectively improving its optimization ability. Finally, the improved SFLA is combined with K-means and with FCM. In the SFLA-based K-means algorithm, the variance of the frog population's fitness determines when the K-means operation is applied, which suppresses premature convergence; its effectiveness is verified on UCI data sets and randomly generated data. In the SFLA-based FCM algorithm, the SFLA optimization process replaces FCM's gradient-descent-based iterative process, improving global search ability; comparative tests on a real corpus show that the improved algorithm achieves higher clustering accuracy and has an advantage in global optimization.
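To make the idea of searching for cluster centers with frog leaping concrete, here is a minimal sketch of the basic (unimproved) shuffled frog leaping scheme applied to a K-means-style objective. It is an illustration under stated assumptions, not the thesis's implementation: the function names (`sse`, `random_frog`, `leap`, `sfla_centers`), the parameter defaults, and the choice of the sum of squared errors as fitness are all hypothetical, and the chaotic initialization, mutation operation, new search strategy, and variance-based K-means timing described in the abstract are omitted because the abstract does not specify them.

```python
import random

def sse(centers, points):
    """Fitness (assumed): sum of squared distances to the nearest center; lower is better."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

def random_frog(points, k, rng):
    """A frog encodes k candidate centers, here sampled from the data points."""
    return [list(p) for p in rng.sample(points, k)]

def leap(worst, target, rng):
    """Move the worst frog a random fraction of the way toward a better frog."""
    r = rng.random()
    return [[w + r * (t - w) for w, t in zip(wc, tc)]
            for wc, tc in zip(worst, target)]

def sfla_centers(points, k, n_frogs=20, n_memeplexes=4,
                 local_steps=5, rounds=30, seed=0):
    """Basic SFLA loop: sort, partition into memeplexes, local search, shuffle."""
    rng = random.Random(seed)
    frogs = [random_frog(points, k, rng) for _ in range(n_frogs)]
    for _ in range(rounds):
        frogs.sort(key=lambda f: sse(f, points))  # best frog first
        global_best = frogs[0]
        # Deal frogs round-robin: frog i goes to memeplex i mod n_memeplexes.
        memeplexes = [frogs[i::n_memeplexes] for i in range(n_memeplexes)]
        for mem in memeplexes:
            for _ in range(local_steps):
                mem.sort(key=lambda f: sse(f, points))
                worst_fit = sse(mem[-1], points)
                # Try the memeplex best first, then the global best.
                for target in (mem[0], global_best):
                    cand = leap(mem[-1], target, rng)
                    if sse(cand, points) < worst_fit:
                        break
                else:
                    # Neither leap improved the worst frog: replace it at random.
                    cand = random_frog(points, k, rng)
                mem[-1] = cand
        # Shuffle step: merge the memeplexes back into one population.
        frogs = [f for mem in memeplexes for f in mem]
    return min(frogs, key=lambda f: sse(f, points))
```

Calling `sfla_centers(points, k=2)` on a list of coordinate tuples returns a list of `k` centers. Because only the worst frog in each memeplex is ever replaced, the best solution found so far is never discarded, so the returned centers could then seed an ordinary K-means or FCM run.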
[Degree-granting institution]: Jiangnan University
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP391.1
