Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. In many domains where clustering is applied, some prior knowledge is available, either as labeled data (specifying the category to which an instance belongs) or as pairwise constraints on some of the instances (specifying whether two instances should be in the same cluster or in different clusters). The focus of our research is semi-supervised clustering, where we study how prior knowledge can be incorporated into clustering algorithms.

Semi-supervised clustering aims to improve clustering performance by using user supervision in the form of pairwise constraints. However, most current algorithms are passive in the sense that the pairwise constraints are provided beforehand and selected randomly. This may lead to the use of constraints that are redundant, unnecessary, or even harmful to the clustering results. For these reasons, we would like to optimize the selection of constraints for semi-supervised clustering. Moreover, semi-supervised clustering algorithms pose several challenges that must be addressed: dealing with multi-density data, handling the evolving patterns that characterize streaming data with dynamic distributions, processing data objects quickly and incrementally, and working within time and memory limitations.

This thesis makes three main contributions. In the first contribution, we consider batch-mode active learning for semi-supervised clustering algorithms, applied in an iterative manner. First, we select a batch of informative query instances such that the distribution represented by the selected query set together with the available labeled data is closest to the distribution represented by the unlabeled data. Then, we query them against the existing neighborhoods to determine which neighborhood they belong to.
Experimental comparisons with state-of-the-art methods on different real-world datasets demonstrate the effectiveness and efficiency of the proposed method.

In the second contribution, we address the problem of streaming data. Data stream mining is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. We propose an algorithm that extends Affinity Propagation (AP) to handle evolving data streams with dynamic distributions. We present a semi-supervised clustering technique (SSAPStream) that incorporates labeled exemplars into the AP algorithm to deal with changes in the data distribution, which require the stream model to be updated as soon as possible. The experimental results on synthetic and real datasets validate the effectiveness of our algorithm in handling dynamically evolving data streams. We also study the execution time and memory usage of SSAPStream, which are important efficiency factors for streaming algorithms.

The third contribution addresses the problem of clustering multi-density data with arbitrarily shaped clusters. Density-based clustering methods are among the most important because of their ability to detect arbitrarily shaped clusters. Existing methods are based on DBSCAN, a typical density-based clustering algorithm whose performance depends on two user-specified parameters (Eps and MinPts) that define a single density. Most existing methods are also unsupervised and cannot make use of a small amount of prior knowledge. We propose a semi-supervised clustering algorithm (called SemiDen) that discovers clusters of different densities and arbitrary shapes. The idea of the proposed algorithm is to partition the dataset into different density levels and compute the density parameters for each density level set. The pairwise constraints are then used to expand the clustering process based on the computed density parameters.
Evaluating the SemiDen algorithm on both synthetic and real datasets confirms that it gives better results than other semi-supervised and unsupervised density-based approaches.
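The idea of splitting a dataset into density levels and deriving density parameters per level can be illustrated with a small sketch. The abstract does not say how the levels are found, so everything below is an assumption made for illustration, not the thesis's algorithm: the function names `k_distance` and `density_levels`, the use of the distance to the k-th nearest neighbour as a density proxy, and the rule that a new level starts when the sorted k-distance grows by more than a factor `jump`. The constraint-guided expansion step is not shown.

```python
import math


def k_distance(points, k):
    """Distance from each point to its k-th nearest neighbour (brute force)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def density_levels(points, k=3, jump=2.0):
    """Group points into density levels and estimate (Eps, MinPts) per level.

    Assumption (not from the abstract): a new level starts where the sorted
    k-distance grows by more than a factor `jump` from one point to the next.
    """
    kd = k_distance(points, k)
    order = sorted(range(len(points)), key=lambda i: kd[i])
    levels = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if kd[prev] > 0 and kd[cur] / kd[prev] > jump:
            levels.append([])  # large jump in k-distance: open a sparser level
        levels[-1].append(cur)
    # Per-level parameters: Eps is the largest k-distance in the level, MinPts is k.
    return [
        {"eps": max(kd[i] for i in level), "min_pts": k, "members": level}
        for level in levels
    ]


if __name__ == "__main__":
    # Hypothetical toy data: a tight 3x3 grid and a loose 3x3 grid.
    dense = [(x * 0.1, y * 0.1) for x in range(3) for y in range(3)]
    sparse = [(10 + x, 10 + y) for x in range(3) for y in range(3)]
    for level in density_levels(dense + sparse, k=3):
        print(level["eps"], level["min_pts"], len(level["members"]))
```

Each returned level carries its own Eps and MinPts, which is the property a single global DBSCAN parameter pair lacks. A density-based expansion could then be run once per level with those values.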
【Degree-granting institution】: Beijing Institute of Technology
【Degree level】: Doctoral
【Degree year】: 2015
【Chinese Library Classification】: TP311.13
Walid Said Abdelhamid Atwa (阿特瓦); Semi-supervised clustering algorithms for stream and multi-density data[D]; Beijing Institute of Technology; 2015
Article ID: 2810488
Article link: https://www.wllwen.com/shoufeilunwen/xxkjbs/2810488.html