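Research on Concept Drift Detection and Ensemble Classification Based on Data Streams
Published: 2018-05-21 00:21
Thesis topics: data stream + synopsis structure; Source: Sichuan Normal University, master's thesis, 2017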
[Abstract]: Big data has driven a major transformation of the information age, affecting the economy, science and technology, and society. One form of big data is the massive real-time data stream. These streams hold great value, and how to mine and process them more effectively has become a research focus in data mining both in China and abroad. Data streams are ordered, real-time, high-speed, dynamic, and potentially unbounded, and handling them involves storage, processing, analysis, and application. Synopsis structures are a technique for coping with the potentially unbounded nature of data streams, but existing synopsis algorithms have two drawbacks: the relative reconstruction error between the reconstructed stream and the original stream is large, and their parameters are hard to tune. Concept drift detection addresses the dynamic nature of data streams, and ensemble classification is widely used for data stream classification because of its high accuracy and its ability to adapt to concept drift. However, concept drift detection and ensemble classification usually assume that stream labels are available in time, an assumption that rarely holds in practice. To address these problems, this thesis makes the following three contributions:

(1) It implements a simHash-based hierarchical forgetting synopsis structure for data streams (SH-HAS). The structure uses the simHash algorithm to obtain synopsis information and adjusts the SH-HAS structure dynamically, which solves the problem of large error between the reconstructed dataset and the original dataset. Experiments show that SH-HAS achieves a smaller relative reconstruction error.

(2) It improves the FKNNModel concept drift detection algorithm and proposes the MFKNNModel concept drift detection algorithm. MFKNNModel detects concept drift from changes in the spatial distribution of the data and uses the parallel computation of Spark Streaming to improve running efficiency, which addresses the manual intervention and computational efficiency problems of FKNNModel. Experiments show that, without manual intervention, MFKNNModel detects concept drift well and runs efficiently.

(3) It proposes an ensemble classification model for concept-drifting data streams (Ensemble Classifier Based on Concept-Drifting Data Stream, ECCDDS). Base classifiers are generated by horizontal ensembling, and their outputs are combined by weighted voting to produce the result of the ensemble classifier. The ECCDDS algorithm first builds the synopsis structure of the data stream, then applies the MFKNNModel concept drift detection algorithm, updates the ensemble classification model when concept drift occurs, and finally classifies the data. ECCDDS removes the assumption that stream labels are available in time. It thereby avoids the problem that arises when an ensemble classifier relies on classification accuracy for drift detection and model updating, namely that the class labels of subsequently arriving data are not available in time. The Spark Streaming stream-processing framework is used to address the computing resource and computational efficiency problems of ensemble classifiers. Experiments on real and synthetic datasets verify the effectiveness of the ECCDDS ensemble classification model.
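The weighted voting step that the abstract describes for combining base classifier outputs can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not say how the weights are computed, so the function name `weighted_vote`, the labels, and the weight values below are all hypothetical and serve only to show the mechanics of the vote.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def weighted_vote(predictions: Sequence[Hashable], weights: Sequence[float]) -> Hashable:
    """Return the label with the largest total weight across base classifiers."""
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    if not predictions:
        raise ValueError("at least one base classifier prediction is required")

    # Sum the weight of every base classifier that voted for each label.
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight

    # Dicts keep insertion order, so a tie goes to the label that was seen first.
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Three hypothetical base classifiers; the weights are illustrative values only.
    print(weighted_vote(["A", "B", "B"], [0.9, 0.3, 0.4]))  # prints A
```

In this example the single classifier voting "A" carries a weight of 0.9, which exceeds the combined weight of 0.7 behind "B", so the ensemble outputs "A" even though "B" has more votes. In a drift-aware ensemble the weights would presumably be refreshed whenever the model is updated, but that policy is an assumption here.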
[Degree-granting institution]: Sichuan Normal University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13
Article ID: 1916899
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1916899.html