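Research on Imbalanced Data Classification Techniques for Internet Applications
Published: 2018-06-28 22:07
Topics: Internet applications + imbalanced data; Source: National University of Defense Technology, doctoral dissertation, 2016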
[Abstract]: The rapid development of the Internet, and especially of Internet applications such as online news, e-mail, and e-commerce, has made information far easier to obtain, but it has also drowned people in a sea of information. Automatically classifying massive volumes of Internet application data can effectively improve the efficiency with which people obtain information, and in turn the efficiency of decision making. In many Internet application datasets, however, one or more classes contain markedly fewer examples than the others, producing so-called imbalanced data: reactionary news versus normal news, spam versus legitimate mail, abnormal versus normal transactions. Traditional classification methods and evaluation strategies, designed under the assumption of a uniform class distribution, usually optimize overall accuracy and tend to neglect the minority classes. In practice, though, the minority classes are often the ones that matter most: network supervision departments want to identify reactionary news, mail providers want to better identify spam, and e-commerce platforms want to detect abnormal transactions. The continuous arrival of Internet application data and the imbalance of its class distribution pose many difficulties and challenges for accurate classification, so research on imbalanced data classification for Internet applications has strong practical significance and social value.
Starting from the characteristics of Internet application data and the practical needs of the projects undertaken, and proceeding from simple to complex, this thesis designs processing algorithms for different types of Internet application data. It begins with the common two-class imbalanced case and, in light of its characteristics and practical requirements, studies noise filtering strategies and data resampling methods for preprocessing imbalanced data. It then extends this work to the multi-class setting (more than two classes, with each example belonging to exactly one), proposing an approach that combines decomposition strategies with data resampling. Next, it carries these results over to multi-label imbalanced data classification (unlike the multi-class case, one example may belong to several classes), designing a new ensemble learning framework and base classification algorithm. Finally, given that Internet application data arrives continuously, it studies a multi-window learning strategy on imbalanced data streams. The four contributions are as follows.
(1) Preprocessing of two-class imbalanced data. To handle noise that may be present in an imbalanced dataset, an improved noise filtering method based on IPF is proposed that minimizes the chance of minority-class examples being misjudged as noise during filtering. Then, reflecting the respective characteristics of minority-class and majority-class examples, a minority-class oversampling algorithm based on nearest-neighbor distribution and a majority-class undersampling algorithm based on distance ranking are designed. On this basis, and in response to practical requirements, an adaptive method for setting the sampling ratio between the minority and majority classes is designed, which reduces the impact of resampling on subsequent processing. Tests on a large number of real datasets verify the effectiveness of the proposed methods, with a particularly clear improvement in classification of the minority class.
(2) Multi-class imbalanced data classification. Addressing the multi-class nature of Internet application data, a divide-and-conquer learning strategy is proposed. The training data is first decomposed with the one-vs-all (OVA) method and several sub-classifiers are trained. At this stage every sub-classifier is trained on data from all classes, which ensures its adaptability. The one-vs-one (OVO) method is then used to further partition the examples of the candidate classes, and at this stage the class distribution of each resulting subset determines whether resampling is applied. Finally, finer-grained sub-classifiers are trained on the resampled subsets. In addition, to meet practical requirements, separate handling strategies are designed for sub-classifiers with discrete outputs and with continuous outputs. Following a theoretical analysis, the method is tested on several real datasets, and the results show that it effectively handles the imbalance present in multi-class data.
(3) Multi-label imbalanced data classification. Existing methods emphasize multi-label decomposition and give little consideration to imbalance in the label distribution. To address this, an ensemble learning framework for multi-label imbalanced data is proposed, together with a corresponding base classification algorithm. Built on AdaBoost, the framework incorporates the imbalance of the label distribution into the training of each sub-classifier. In addition, an improved algorithm for multi-label imbalanced data is designed on the basis of the multi-label neural network method BPMLL and used as the framework's base classifier. Tests of classification performance on several real-world application datasets demonstrate the effectiveness of the proposed method.
(4) Imbalanced data stream classification. Addressing the dynamic nature of Internet application data streams and the uncertainty in the arrival order of examples from each class, an ensemble learning method based on a multi-window mechanism is proposed. Based on the characteristics of imbalanced data streams, the method defines four windows that store, respectively, the current sliding-window data, the most recent minority-class examples, the selected sub-classifiers, and the historical window data corresponding to those sub-classifiers. A different update strategy is designed for each window. The class label of a new test example is determined by weighted majority voting. Tests on several synthetic and real datasets show that the method achieves better results with higher efficiency.
In summary, this thesis proposes effective solutions for the different classification needs of different types of data in Internet applications, and in particular for the class imbalance they exhibit, and verifies the proposed methods through experiments on real datasets from different domains and on synthetic datasets. The work has theoretical significance and practical value for advancing the classification of Internet application data.
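The multi-window mechanism described under (4) can be illustrated with a small sketch. The abstract does not specify the base learner, the window sizes, or the update rules, so everything concrete below is an assumption made for illustration: the nearest-centroid base learner, the accuracy-based member weights, the size limits, and the names `MultiWindowEnsemble`, `Member`, and `fit_centroids` are all hypothetical and are not the thesis's actual algorithm. The sketch only shows how four stores (current window, recent minority examples, retained sub-classifiers, and their historical window data) might cooperate with weighted majority voting.

```python
from collections import deque
from dataclasses import dataclass


def fit_centroids(samples):
    """Return {label: mean feature vector} for a list of (features, label)."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


@dataclass
class Member:
    """A retained sub-classifier together with its historical window data."""
    centroids: dict
    window: list
    weight: float = 1.0

    def predict(self, x):
        # Nearest-centroid rule: a stand-in base learner (assumption).
        def dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=dist)


class MultiWindowEnsemble:
    def __init__(self, minority_label, window_size=4, minority_size=8, max_members=3):
        self.minority_label = minority_label
        self.window_size = window_size
        self.max_members = max_members
        self.current = []                            # window 1: current sliding window
        self.minority = deque(maxlen=minority_size)  # window 2: recent minority examples
        self.members = []                            # windows 3 and 4: classifiers + their data

    def observe(self, x, y):
        """Consume one labeled example; train when the current window fills."""
        self.current.append((x, y))
        if y == self.minority_label:
            self.minority.append((x, y))
        if len(self.current) == self.window_size:
            self._train_on_window()

    def _train_on_window(self):
        window = self.current
        # Reweight existing members by accuracy on the newest window (assumption).
        for m in self.members:
            m.weight = sum(m.predict(x) == y for x, y in window) / len(window)
        # Add stored minority examples so the new member always sees that class.
        train = window + [s for s in self.minority if s not in window]
        self.members.append(Member(fit_centroids(train), window))
        # Keep only the highest-weighted members.
        self.members.sort(key=lambda m: m.weight, reverse=True)
        del self.members[self.max_members:]
        self.current = []

    def predict(self, x):
        """Weighted majority vote; returns None before any member is trained."""
        if not self.members:
            return None
        votes = {}
        for m in self.members:
            label = m.predict(x)
            votes[label] = votes.get(label, 0.0) + m.weight
        return max(votes, key=votes.get)


stream = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((5.0, 5.1), 1), ((0.1, 0.2), 0),
          ((0.3, 0.1), 0), ((0.0, 0.3), 0), ((0.2, 0.2), 0), ((0.1, 0.0), 0)]
model = MultiWindowEnsemble(minority_label=1)
for features, label in stream:
    model.observe(features, label)
print(model.predict((4.8, 5.2)))
```

In this toy stream the second window contains no minority examples at all, which is the situation the minority window is meant to cover: the stored minority example is added to the training set, so the second sub-classifier can still recognize that class.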
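[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctorate
[Year degree granted]: 2016
[Classification number]: TP393.09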
Article ID: 2079537
Article link: https://www.wllwen.com/jingjilunwen/dianzishangwulunwen/2079537.html