Research on Key Technologies of Online Machine Learning Based on Selective Ensemble

Published: 2018-11-18 10:01

[Abstract]: Machine learning has long played an important role in many fields. Its ultimate goal is to analyze and process data and to extract useful information and knowledge that can guide subsequent decisions. With the spread of the Internet, the means of acquiring data have multiplied and the volume of data has grown exponentially, which challenges traditional machine learning techniques. For Internet-based services such as online transactions, online advertising, financial analysis, and search engines, the ability to learn quickly and effectively from large-scale, long-running, continuous data is of great significance. Online machine learning is an important means of processing large amounts of data in a timely manner, and predictive ability and prediction efficiency have become the most important criteria for evaluating online learning methods. Incremental learning, the most important online machine learning strategy, can be divided into single-classifier incremental learning and ensemble-based incremental learning. Single-classifier methods are prone to overfitting and have relatively weak predictive ability. With ensemble-based methods, the target ensemble classifier usually keeps growing as the system continues to run, so prediction becomes more and more expensive. In batch machine learning, selective ensemble can effectively improve both the predictive ability and the prediction efficiency of an ensemble classifier. Focusing on supervised learning and classification, this thesis proposes applying selective ensemble techniques to ensemble-based incremental learning in order to improve the predictive ability and prediction efficiency of online learning. The thesis first proposes an online learning model that combines selective ensemble with incremental learning, and then studies the key technologies involved in depth. The main work and contributions are as follows:
1. An online learning model, EPIL, that combines selective ensemble with incremental learning. In response to practical needs in various fields and the shortcomings of current online learning techniques, the thesis proposes the EPIL model and describes several technical challenges it raises. For each incremental data set, EPIL learns a number of local base classifiers, uses local selection to remove the local base classifiers with poor predictive ability, and, when appropriate, uses global selection to remove global base classifiers that have become outdated. The resulting target ensemble classifier is small, predicts well, and retains good incremental learning ability.
2. A selective ensemble strategy and algorithm framework based on pattern mining. Studying the selective ensemble technique within EPIL, the thesis proposes a novel selective ensemble strategy based on pattern mining, builds a selective ensemble algorithm framework on that strategy, and analyzes the key technologies in the framework in detail. Under this strategy, the selective ensemble problem is described as the problem of mining a pattern from a transaction database, so that transaction processing and pattern mining techniques can be used to select base classifiers. This opens a new direction for research on selective ensemble methods.
3. Two selective ensemble algorithms based on coverage pattern mining. Building on the pattern-mining-based strategy, the thesis first introduces the concept of coverage pattern mining and then uses it to derive two selective ensemble algorithms: CPM-EP and PMEP. Both algorithms use the coverage pattern mining idea and the principle of majority voting to obtain candidate sub-patterns of various lengths, and both then use a greedy strategy to construct the target ensemble classifier. PMEP, however, builds an FP-Tree from the original transaction database and extracts the candidate sub-patterns from the FP-Tree, which avoids frequent operations on the transaction database and saves considerable overhead. Experimental results show that CPM-EP and PMEP select base classifiers quickly and produce target ensemble classifiers that are small and predict well. Of the two, PMEP has the shorter selection time. The experiments confirm that pattern mining is a highly effective selective ensemble strategy.
4. A Bagging-based ensemble incremental learning method. Studying how base classifiers are constructed within EPIL, and noting that traditional ensemble-based incremental learning methods adapt poorly to the structure of the base classifiers, the thesis proposes Bagging++, an ensemble incremental learning method built on Bagging, together with a Bagging-based method for constructing heterogeneous base classifiers. Experimental results show that Bagging++ adapts well to different base classifier algorithms, achieves good predictive ability, and clearly outperforms traditional algorithms. In addition, constructing heterogeneous base classifiers further improves the predictive performance of ensemble-based incremental learning.
5. Incremental learning techniques based on selective ensemble. The thesis studies the concrete methods for using selective ensemble in incremental learning within EPIL, chiefly the timing of base classifier selection and the choice of the validation sample set. For the Bagging++ algorithm, it then proposes LP-Bagging++, which is based on local selection, and MP-Bagging++, which combines local and global selection. Experimental results show that, because global selection can remove base classifiers that are no longer effective, it keeps the size of the target ensemble classifier under control and significantly improves the time and space efficiency of prediction while preserving predictive ability. A hybrid strategy that combines local and global selection is therefore better suited to current online learning needs.
6. Design and implementation of the ensemble learning development platform LibEP. Based on the preceding results, the thesis designs and implements LibEP, an open-source ensemble learning development platform. Its algorithms cover all major aspects of ensemble learning research, including sample manipulation methods, base classifier learning algorithms, ensemble learning algorithms, selective ensemble algorithms, incremental learning algorithms, and performance evaluation algorithms. LibEP has a simple interface, is easy to use, and can be conveniently integrated into user programs. The platform is implemented in standard C++, offers high runtime performance and good portability, and is easy to extend.
From the perspectives of models, algorithms, and experimental study, this thesis explores online learning technology that combines selective ensemble with incremental learning. As a next step, by combining this research with practical applications, the author will work to bring the technology to bear in machine learning applications that demand high performance and high efficiency.
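
The pattern-mining view of selective ensemble described in the abstract can be illustrated with a small sketch. It is not the thesis's CPM-EP or PMEP algorithm, whose details the abstract does not give. It assumes that each validation sample becomes a transaction holding the ids of the base classifiers that classified it correctly, that a candidate pattern (a set of classifiers) "covers" a sample when a strict majority of its members are correct on it, and that a simple greedy search grows the pattern one classifier at a time. The function names `build_transactions`, `coverage`, and `greedy_prune` and the toy data are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple


def build_transactions(predictions: Sequence[Sequence[int]],
                       labels: Sequence[int]) -> List[FrozenSet[int]]:
    """One transaction per validation sample: ids of classifiers that got it right."""
    return [
        frozenset(c for c, preds in enumerate(predictions) if preds[i] == labels[i])
        for i in range(len(labels))
    ]


def coverage(pattern: Set[int], transactions: Sequence[FrozenSet[int]]) -> int:
    """Count samples on which a strict majority of `pattern` is correct.

    Assumption: ties and pluralities are not counted, which is exact for
    binary problems and conservative for multi-class ones.
    """
    return sum(1 for t in transactions if 2 * len(pattern & t) > len(pattern))


def greedy_prune(transactions: Sequence[FrozenSet[int]],
                 n_classifiers: int,
                 max_size: int) -> Tuple[List[int], int]:
    """Grow a pattern greedily and return the best pattern seen at any size."""
    selected: Set[int] = set()
    best_pattern: Set[int] = set()
    best_cov = -1
    while len(selected) < min(max_size, n_classifiers):
        candidates = [c for c in range(n_classifiers) if c not in selected]
        # Pick the classifier that maximizes coverage; break ties by lowest id.
        nxt = max(candidates,
                  key=lambda c: (coverage(selected | {c}, transactions), -c))
        selected.add(nxt)
        cov = coverage(selected, transactions)
        if cov > best_cov:
            best_pattern, best_cov = set(selected), cov
    return sorted(best_pattern), best_cov


# Toy validation set: 6 samples, 4 base classifiers (made-up predictions).
labels = [1, 0, 1, 1, 0, 1]
predictions = [
    [1, 0, 1, 0, 1, 1],  # classifier 0: correct on samples 0, 1, 2, 5
    [1, 1, 0, 1, 0, 1],  # classifier 1: correct on samples 0, 3, 4, 5
    [0, 0, 1, 1, 0, 0],  # classifier 2: correct on samples 1, 2, 3, 4
    [0, 1, 0, 1, 1, 1],  # classifier 3: correct on samples 3, 5 only
]
transactions = build_transactions(predictions, labels)
selected, covered = greedy_prune(transactions, n_classifiers=4, max_size=4)
print(selected, covered)  # prints: [0, 1, 2] 6
```

In this toy run each of classifiers 0, 1, and 2 is correct on only four of the six samples, yet their majority vote is correct on all six, while adding the weak classifier 3 lowers coverage, so the pruned ensemble leaves it out.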
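[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree conferred]: 2010
[Classification number]: TP18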

[Citing documents]