Research on Density-Grid-Based Data Stream Clustering and Concept Drift Detection Algorithms

Published: 2018-06-17 11:09

Topic: data mining + data stream; Reference: Beijing Jiaotong University, 2017 master's thesis

[Abstract]: Data stream clustering is a key data mining technique. In data stream clustering research, algorithm frameworks fall into two categories: the single-phase model and the two-phase scheme. Density-grid-based data stream clustering frameworks that adopt the two-phase scheme consist of an online processing phase and an offline processing phase. The online phase maps the incoming stream data onto a grid, and the offline phase clusters the grid data; this division reduces the difficulty of clustering data streams. In the offline phase, however, such frameworks have three defects: (1) classifying grids as sparse or dense by a fixed threshold does not suit unevenly distributed or multi-density data streams; (2) adjacent grids are connected into one cluster based on density alone, without considering the similarity between the data, and this omission harms clustering accuracy; (3) boundary points are not examined thoroughly enough, since some boundary points are noise while others may belong to a neighboring cluster.

The concept underlying a data stream also changes over time, a phenomenon known as concept drift. DCDA is a concept drift detection algorithm based on rough set theory and the sliding window technique. Its main idea is to compute the distance between two sliding windows and use it to decide whether concept drift has occurred. This algorithm has the following defects: (1) it applies only to categorical data; (2) it does not consider the case where one window contains multiple concepts; (3) it cannot determine a suitable sliding window size.

To address these problems, this thesis makes the following main contributions. First, to overcome the defects of DCDA concept drift detection, it proposes a density-grid-based concept drift detection framework for data streams (DCDD for short). The framework uses the grid technique, which makes it applicable to general data. To handle multiple concepts within a sliding window, the online phase creates a temporary density grid and a historical density grid, extends the DCDA detection model by assigning each grid a weight according to the arrival time of the data, and detects concept drift by computing the distance between the temporary and historical density grids. The offline phase trains on the extracted concept drift features, proposes a prediction model that predicts the amount of data belonging to a concept, and designs a variable-size sliding window based on the predicted amount. Experimental results show that the framework detects concept drift in far less time than the DCDA algorithm, and that its detection is more accurate and more effective.

Second, to overcome the defects of the density-grid-based data stream clustering framework, the thesis proposes a data stream clustering algorithm based on relative-density grids, together with a boundary detection algorithm. The main idea is to compute the similarity between adjacent grids and use it as a weight that influences whether adjacent grids are connected; the connection decision follows a difference model that takes into account density, centroid, and the similarity weight between adjacent grids. Finally, the boundary detection algorithm uses a membership function to assign cluster labels to data points in the sparse grids surrounding a cluster. Experimental results show that the algorithm suits data streams with multi-density distributions and achieves better clustering quality.
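The drift check described in the abstract, comparing a temporary density grid against a historical one, can be illustrated with a minimal Python sketch. This is not the thesis's DCDD implementation: the abstract does not give the distance measure or the weighting scheme, so the exponential time decay, the total variation distance, and the names `cell_size`, `decay`, and `threshold` are all assumptions made for illustration.

```python
from collections import defaultdict
import math


def cell_of(point, cell_size):
    """Map a numeric point to the integer index of its grid cell."""
    return tuple(math.floor(x / cell_size) for x in point)


def density_grid(points, cell_size, decay=1.0):
    """Build a sparse density grid from (timestamp, point) pairs.

    Assumption: each point contributes decay ** (t_last - t) to its cell,
    so older points weigh less. decay=1.0 disables the time weighting.
    """
    grid = defaultdict(float)
    if not points:
        return grid
    t_last = max(t for t, _ in points)
    for t, p in points:
        grid[cell_of(p, cell_size)] += decay ** (t_last - t)
    return grid


def grid_distance(a, b):
    """Total variation distance between two normalized grids, in [0, 1].

    Assumption: this stands in for the (unspecified) distance used in
    the thesis.
    """
    sum_a, sum_b = sum(a.values()), sum(b.values())
    if sum_a == 0 or sum_b == 0:
        return 0.0 if sum_a == sum_b else 1.0
    cells = set(a) | set(b)
    return 0.5 * sum(
        abs(a.get(c, 0.0) / sum_a - b.get(c, 0.0) / sum_b) for c in cells
    )


def drift_detected(historical, temporary, cell_size=1.0, decay=0.9,
                   threshold=0.5):
    """Flag drift when the two density grids are farther apart than threshold."""
    h = density_grid(historical, cell_size, decay)
    t = density_grid(temporary, cell_size, decay)
    return grid_distance(h, t) > threshold
```

With these choices, two windows whose points fall into the same cells are at distance 0, and two windows that share no cells are at distance 1. The threshold therefore sets how much of the density mass must move before drift is reported.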
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13

[References]