Research on Ensemble Learning Algorithms Based on Feature Extraction

Published: 2018-07-04 10:42

Topic: ensemble learning + feature extraction; Source: Shandong Normal University, 2017 master's thesis

[Abstract]: Improving the generalization ability of learning systems has long been a central concern of machine learning research. A single classifier has unavoidable limitations, which cause improvements in its classification performance to hit a bottleneck. Ensemble learning, as a newer machine learning paradigm, applies several individual classifiers to the same prediction problem; the final result is decided jointly by all learners and combined according to some rule. Because the classifiers complement one another, ensemble learning greatly improves the generalization ability and classification performance of the system, and it is widely used in biomedicine, information science, and many other fields.

As Internet technology spreads into every area of social life, the data to be processed have become increasingly complex: imbalanced, high-dimensional, and noisy data are all common. Traditional ensemble learning methods perform well on well-formed data but have limited effect on complex data, so incorporating data processing methods into ensemble learning is particularly important. Feature extraction is one of the main tools of data analysis and processing and is widely used for dimensionality reduction and for removing noise and redundancy. Building on an in-depth study of ensemble learning algorithms, this thesis combines feature extraction and other data processing algorithms with ensemble learning and proposes improved ensemble learning algorithms, as follows.

Imbalanced data usually cause a classifier to perform poorly on minority-class samples. To reduce the imbalance ratio of a data set, the SMOTE oversampling algorithm can be used to preprocess the data. This thesis uses independent component analysis (ICA) to remove noise from the data and also applies SMOTE to balance the data, so that the processed data are better suited to ensemble learning algorithms.
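
The SMOTE and Bagging stages described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the ICA denoising step is omitted, the base learner is a nearest-centroid classifier chosen only for brevity, and the helper names (`smote`, `balance`, `bagging_fit`, `bagging_predict`), the toy data, and all parameter values are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label, k=3, seed=0):
    """Oversample the minority class of a binary data set to equal size."""
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = (len(y) - len(minority)) - len(minority)
    new_points = smote(minority, n_new, k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


def fit_centroids(X, y):
    """Assumed base learner: one mean vector (centroid) per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def bagging_fit(X, y, n_estimators=11, seed=0):
    """Train one base learner per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit_centroids([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Majority vote over the nearest-centroid predictions of all models."""
    votes = Counter(min(m, key=lambda c: sq_dist(m[c], x)) for m in models)
    return votes.most_common(1)[0][0]


# Toy imbalanced data set: 8 majority samples (0) and 3 minority samples (1).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [0.4, 0.1], [0.2, 0.4], [0.0, 0.3], [0.3, 0.0],
     [2.0, 2.0], [2.2, 2.1], [2.1, 2.4]]
y = [0] * 8 + [1] * 3

X_bal, y_bal = balance(X, y, minority_label=1, k=2)
models = bagging_fit(X_bal, y_bal)
print(Counter(y_bal), bagging_predict(models, [2.1, 2.2]))
```

In a fuller pipeline the features would first pass through an ICA transform, and the oversampling would be applied to the training split only, so that synthetic points never leak into the evaluation data.
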
Experimental results show that the proposed method significantly improves the classification performance of the Bagging ensemble learning algorithm on imbalanced data.

Every type of data has its own organization and structural information, and its attributes are interrelated. Analysis shows that the feature attributes of web spam data sets are both high-dimensional and highly correlated. To address the high dimensionality of, and correlation between, the content features and link features of spam pages, this thesis studies these feature attributes in depth and applies principal component analysis (PCA) separately to each group of correlated attributes rather than to the feature set as a whole. This reduces dimensionality while preserving, to some extent, the original attribute structure of the data set. Experimental results show that the proposed method performs well when applied to web spam classification.
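
The idea of running PCA per attribute group, rather than once over all features, can be illustrated with a short standard-library sketch. Several things here are assumptions rather than details from the thesis: each group is reduced to a single component, that component is found by power iteration, and the function names, the column grouping, and the toy data are hypothetical.

```python
import math
import random


def first_component(rows, n_iter=200, seed=0):
    """Return (mean, direction) of the first principal component of `rows`,
    found by power iteration on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    mean = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, mean)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # A random start avoids beginning exactly orthogonal to the top direction.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # the group has no variance at all
            break
        v = [x / norm for x in w]
    return mean, v


def grouped_pca(X, groups):
    """Project each attribute group onto its own first principal component.

    X is a list of rows; groups is a list of column-index lists.
    The result has one column per group.
    """
    columns = []
    for cols in groups:
        sub = [[row[c] for c in cols] for row in X]
        mean, v = first_component(sub)
        columns.append([sum((x - m) * w for x, m, w in zip(r, mean, v))
                        for r in sub])
    return [list(row) for row in zip(*columns)]


# Toy data: columns 0-1 stand in for correlated "content" features,
# columns 2-3 for correlated "link" features (hypothetical grouping).
X = [[1.0, 2.0, 3.0, -3.0],
     [2.0, 4.0, 1.0, -1.0],
     [3.0, 6.0, 4.0, -4.0],
     [4.0, 8.0, 2.0, -2.0]]

Z = grouped_pca(X, groups=[[0, 1], [2, 3]])
for row in Z:
    print([round(v, 3) for v in row])
```

Because each group is reduced independently, every output column can still be traced back to one attribute group, which is one way to read the claim that the original attribute structure is partly preserved.
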
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP181
