Research on Supervised (Transfer) Clustering Techniques for Large-Scale Data Scenarios

Published: 2018-03-08 20:40

Topic: clustering algorithms　Entry point: fuzzy C-means　Source: Jiangnan University, 2017 doctoral dissertation　Document type: dissertation

[Abstract]: After more than 60 years of development, artificial intelligence has made great progress, and machine learning, one of its most active branches, has developed rapidly alongside it. Clustering, as an effective method and tool for data analysis, has long received wide attention and application in both academia and industry. However, with the continued advance of science and technology and the widespread use of computers, new problems and challenges keep emerging. Clustering in transfer scenarios and clustering in large-scale data scenarios are two prominent ones, and they are the focus of this research. In studying traditional clustering methods, we found that applying them directly to data in transfer scenarios or large-scale data scenarios often fails to achieve satisfactory clustering performance, and sometimes the algorithms cannot be run at all. The common challenges are: 1) In transfer scenarios, a newly established industry often has no accumulated data or too few collected samples, or the collected samples are contaminated because of factors such as unstable acquisition equipment. In such cases, directly applying traditional clustering algorithms often leads to unstable clustering performance or outright failure. 2) In large-scale data scenarios, the number of samples is large while the memory of the processing machine is limited, so the data cannot all be loaded at once, and traditional clustering algorithms therefore cannot be used to process and analyze the data.
To address the problems that traditional clustering algorithms face in these two emerging application scenarios, this research builds on classical fuzzy clustering algorithms and, taking transfer scenarios and large-scale data scenarios as entry points, modifies and restructures the relevant algorithms to meet the needs of the new scenarios. The main contents are organized as follows: (1) Chapters 2 to 4 focus on modifying and applying fuzzy clustering algorithms in transfer scenarios. Chapters 2 and 3 explore the modification and restructuring of classical fuzzy clustering algorithms; Chapter 4 discusses the use of knowledge transfer in a concrete image segmentation application. Specifically, Chapter 2 modifies the objective function of the fuzzy C-means (FCM) clustering algorithm and proposes a new clustering algorithm, PPKTFCM. The algorithm satisfies two rules simultaneously: minimizing the sum of distances between sample points and the historical cluster centers, and minimizing the change in memberships. Applying these two rules gives the new algorithm a knowledge transfer capability and thereby improves its clustering performance. Chapter 3 builds on the maximum entropy clustering algorithm (MECA) and adds two new constraint rules, one constraining the importance of memberships and one minimizing the change in cluster centers, yielding a new maximum-entropy-based knowledge transfer fuzzy clustering algorithm, MEKTFCA. The use of knowledge transfer improves its clustering performance when samples are insufficient or contaminated. Chapter 4 modifies the objective function of the classical FCM algorithm to produce a new objective function with an added regularization term that can absorb knowledge from spatial neighbors. Adding this regularization term improves the robustness of the new algorithm in image segmentation.
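The idea of steering FCM with historical cluster centers can be illustrated with a small sketch. The abstract does not give the PPKTFCM or MEKTFCA objective functions, so the code below is not either of those algorithms. It assumes a simpler, hypothetical objective: the standard FCM objective plus a penalty `lam * sum_i ||v_i - v_hist_i||^2` that pulls each center toward its historical counterpart. The names `fcm_memberships`, `transfer_fcm`, `V_hist`, and `lam` are all illustrative assumptions.

```python
import numpy as np

def fcm_memberships(X, V, m=2.0, eps=1e-12):
    """Standard FCM membership update; returns U with shape (c, n)."""
    # Squared Euclidean distance between every center and every sample.
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    # Normalize so the memberships of each sample sum to 1.
    return inv / inv.sum(axis=0, keepdims=True)

def transfer_fcm(X, V_hist, lam=1.0, m=2.0, n_iter=100, tol=1e-6):
    """FCM with an assumed penalty pulling centers toward V_hist.

    Assumed objective (not taken from the thesis):
        sum_{i,k} u_ik^m ||x_k - v_i||^2 + lam * sum_i ||v_i - v_hist_i||^2
    """
    V_hist = np.asarray(V_hist, dtype=float)
    V = V_hist.copy()  # start from the historical centers
    for _ in range(n_iter):
        Um = fcm_memberships(X, V, m) ** m
        # Setting the gradient with respect to v_i to zero gives a
        # weighted average of the data term and the historical center.
        V_new = (Um @ X + lam * V_hist) / (Um.sum(axis=1, keepdims=True) + lam)
        shift = np.abs(V_new - V).max()
        V = V_new
        if shift < tol:
            break
    return fcm_memberships(X, V, m), V
```

With `lam=0` this reduces to plain FCM initialized at the historical centers; a larger `lam` trusts the historical knowledge more, which is the kind of trade-off one would want when the target samples are few or contaminated.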
(2) Chapters 5 and 6 focus on modifying and restructuring fuzzy clustering algorithms for large-scale data scenarios. Chapter 5 draws on the operating principles of two classical incremental algorithms, the history-based online fuzzy C representative-point clustering algorithm (HOFCMD) and the online fuzzy C representative-point clustering algorithm (OFCMD), but remedies their shortcoming of representing each cluster with only a single representative point, and proposes MMFCA, an incremental multi-representative-point fuzzy clustering algorithm for large-scale data. The algorithm uses multiple representative points to enrich the information describing each cluster and, during clustering, takes into account the constraint relationships between pairs of historical clustering points, which improves the clustering performance of the proposed MMFCA algorithm. Chapter 6, inspired by the ideas of the OFCMD and FC-QR algorithms, proposes LS-FMMdC, a multi-representative-point fuzzy clustering algorithm for large-scale data with a threefold optimization mechanism: weighted representativeness, quadratic regularization, and pairwise constraints. This multiple optimization mechanism and the use of multiple representative points contribute to the improved clustering performance of the final LS-FMMdC algorithm. It should be noted that Chapters 5 and 6 focus on clustering in large-scale data scenarios. Large datasets are handled with a data chunking technique, and processing each chunk includes a mechanism that transfers the knowledge obtained from previous chunks to subsequent ones. These two chapters are therefore a combined study of large-scale data scenarios and transfer scenarios.
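The chunking mechanism described above, where knowledge from earlier chunks is carried into later ones, can be sketched in a few lines. This is not MMFCA or LS-FMMdC: the abstract does not specify their update rules, and both use multiple representative points per cluster plus pairwise constraints. The sketch below instead assumes a simpler single-pass scheme with one mean center per cluster, in which the centers from the previous chunk are added to the next chunk as weighted points. The names `weighted_fcm`, `chunked_fcm`, and `mass` are illustrative assumptions.

```python
import numpy as np

def memberships(X, V, m=2.0, eps=1e-12):
    """Standard FCM memberships, shape (c, n)."""
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def weighted_fcm(X, w, V, m=2.0, n_iter=100, tol=1e-6):
    """FCM in which sample k counts w[k] times in the center update."""
    for _ in range(n_iter):
        WUm = (memberships(X, V, m) ** m) * w
        V_new = (WUm @ X) / WUm.sum(axis=1, keepdims=True)
        shift = np.abs(V_new - V).max()
        V = V_new
        if shift < tol:
            break
    return V

def chunked_fcm(chunks, V0, m=2.0):
    """Cluster an iterable of chunks, holding only one chunk in memory."""
    V = np.asarray(V0, dtype=float).copy()
    mass = None  # accumulated weight represented by each center
    for X in chunks:
        if mass is None:
            data, w = X, np.ones(len(X))
        else:
            # Carry earlier knowledge forward: the previous centers join
            # the new chunk as points weighted by the data they summarize.
            data = np.vstack([V, X])
            w = np.concatenate([mass, np.ones(len(X))])
        V = weighted_fcm(data, w, V, m)
        mass = memberships(data, V, m) @ w
    return V, mass
```

Because `chunks` can be any iterable, it could be a generator that reads one block from disk at a time, so memory use is bounded by the chunk size rather than the dataset size. The total of `mass` equals the number of samples seen so far, which gives a quick sanity check on the bookkeeping.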

[Degree-granting institution]: Jiangnan University
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: TP311.13
