Research on Clustering Algorithms for Incomplete Data
Published: 2018-08-15 15:53
[Abstract]: Since the start of the twenty-first century, the connections among people, and between people and the physical world, have grown ever closer, and data is now generated everywhere. Yet while data volume has grown almost explosively, data quality has not improved correspondingly and cannot be adequately guaranteed: problems can arise when data is first acquired and again while it is exchanged and propagated, so the data finally obtained may have quality problems. Common clustering algorithms usually work properly only on high-quality data and tend to perform poorly when the quality of big data is problematic. The usual practice is therefore to clean the low-quality data first and only then run data mining operations such as clustering. However, cleaning large-scale data is often very expensive in time, and the result may still fall short: after a great deal of time spent on cleaning, the data may still contain quality problems that cannot be removed, so cleaning does not significantly improve the quality of the mining results. Clustering directly on weakly usable data offers a new approach to this problem: cluster without cleaning the data, or cluster data that has not been fully cleaned.
This thesis studies how to perform cluster analysis on incomplete data sets. It first analyzes the spatial structure of incomplete data in order to understand how incompleteness affects clustering, and on that basis designs a clustering algorithm for incomplete data based on fuzzy clustering. This algorithm treats the missing values as optimization variables in the clustering iteration and updates them repeatedly during the iteration, which completes the clustering of the incomplete data. The density-based algorithm for incomplete data characterizes two core requirements of clustering: a cluster center must be a point whose surrounding point density is high, and it must lie as far as possible from other high-density points. Once the cluster centers are determined, the remaining points are assigned to clusters according to a given strategy. The information-theoretic algorithm for incomplete data views clustering as a process in which a record's uncertainty about the clusters keeps changing: as attributes are added, the record's uncertainty about its class decreases, until the record can be assigned to the cluster with the least uncertainty. For incomplete data, the basic information-theoretic parameters and the information parameters of the clusters must first be estimated, and combining the two completes the clustering. The design of each algorithm is followed by an experimental analysis.
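The fuzzy-clustering idea described in the abstract, in which missing values are treated as optimization variables and updated in every iteration, can be illustrated with a small sketch. This is not the thesis's algorithm, whose details the abstract does not give. It assumes standard fuzzy c-means with an update rule for missing entries in the style of the optimal completion strategy. The names `fcm_ocs` and `sqdist`, the parameters `c`, `m`, and `iters`, the column-mean starting values, the farthest-point initialization, and the sample data are all assumptions made for illustration.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def fcm_ocs(rows, c, m=2.0, iters=50):
    """Fuzzy c-means on records whose missing entries are given as None.

    Returns (memberships, centers, completed_rows).
    Assumes every column has at least one observed value and m > 1.
    """
    n, d = len(rows), len(rows[0])
    missing = [(k, j) for k in range(n) for j in range(d) if rows[k][j] is None]

    # Start each missing entry at the mean of the observed values in its column.
    col_mean = []
    for j in range(d):
        observed = [r[j] for r in rows if r[j] is not None]
        col_mean.append(sum(observed) / len(observed))
    x = [[col_mean[j] if v is None else float(v) for j, v in enumerate(r)]
         for r in rows]

    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [list(x[0])]
    while len(centers) < c:
        far = max(x, key=lambda p: min(sqdist(p, v) for v in centers))
        centers.append(list(far))

    u = []
    for _ in range(iters):
        # 1) Memberships: u_ik is proportional to d_ik^(-2/(m-1)).
        u = []
        for k in range(n):
            inv = [max(sqdist(x[k], v), 1e-12) ** (-1.0 / (m - 1.0))
                   for v in centers]
            total = sum(inv)
            u.append([w / total for w in inv])

        # 2) Centers: membership-weighted means of the completed records.
        for i in range(c):
            w = [u[k][i] ** m for k in range(n)]
            total = sum(w)
            centers[i] = [sum(w[k] * x[k][j] for k in range(n)) / total
                          for j in range(d)]

        # 3) Missing entries: membership-weighted means of center coordinates.
        for k, j in missing:
            w = [u[k][i] ** m for i in range(c)]
            x[k][j] = sum(w[i] * centers[i][j] for i in range(c)) / sum(w)

    return u, centers, x


# Hypothetical sample: two well-separated groups, some entries missing.
rows = [
    [0.0, 0.2], [0.3, None], [0.1, 0.1], [None, 0.4],
    [10.0, 9.8], [9.7, None], [None, 10.2], [10.1, 10.0],
]
memberships, centers, completed = fcm_ocs(rows, c=2)
labels = [max(range(2), key=lambda i: u_k[i]) for u_k in memberships]
```

Each of the three steps minimizes the same weighted squared-distance objective over one group of variables while the others are held fixed, so the missing entries are estimated as part of the clustering itself rather than in a separate cleaning pass. Observed values are never modified; only the entries listed in `missing` change.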
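[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13
Article ID: 2184685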
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2184685.html