Research on Attribute Reduction Algorithms for Fuzzy Rough Sets under the MapReduce Framework

Published: 2018-06-13 03:36

Topic of this thesis: fuzzy rough sets + attribute reduction; Source: Southwest Jiaotong University, 2017 master's thesis

[Abstract]: In recent years, with the rapid development of the Internet, the volume of data to be processed has grown sharply, so how to extract knowledge from massive data has become a focus of attention and knowledge discovery has become an important research topic. Attribute reduction (feature selection) is one of the important methods for acquiring knowledge effectively while eliminating interfering factors. A data set (knowledge base) contains many different attributes, but not all of them are equally important. Some attributes matter more to decision making, some matter less, and some may be redundant or unnecessary. Because of this redundant information, people spend more time and space processing irrelevant information when acquiring knowledge. The purpose of attribute reduction is to remove this irrelevant information from the data set and to address problems in data processing such as overfitting and the curse of dimensionality. Attribute reduction is one of the important applications of rough set theory and has received wide attention and study. However, the classical rough set model cannot process numerical data directly; numerical data must first be discretized, which may cause information loss and affect knowledge acquisition. The fuzzy rough set model can process numerical data directly. To address some shortcomings of attribute reduction algorithms based on attribute dependency, this thesis combines particle swarm optimization (PSO) with fuzzy rough sets and, from a big-data perspective, uses the MapReduce framework to study parallel attribute reduction for fuzzy rough sets and robust fuzzy rough sets. The main work of this thesis is as follows:

1. Gaussian kernel fuzzy rough sets are combined with PSO to build a PSO-based attribute reduction algorithm for Gaussian kernel fuzzy rough sets. Owing to the properties of Gaussian kernel fuzzy rough sets, a heuristic attribute reduction algorithm based on attribute dependency may fail to find the best attribute combination, or may even fail to obtain a reduct. Combining the model with PSO overcomes this defect, and, by exploiting the properties of Gaussian kernel fuzzy rough sets, different choices of kernel parameter yield different reducts to meet classification requirements. Experiments on UCI public data sets show that the algorithm has good reduction performance. (Chapter 3)

2. Based on the Gaussian kernel fuzzy rough set model, the principles of computing fuzzy rough set approximations and attribute dependency in parallel are analyzed. A MapReduce-based parallel algorithm for computing the lower approximation and attribute dependency of Gaussian kernel fuzzy rough sets is given, followed by a parallel PSO-based attribute reduction algorithm for Gaussian kernel fuzzy rough sets. The algorithm exploits the characteristics of MapReduce: in the Map phase it directly computes, for the objects of each split, the minimum distance within that split to objects of different decision classes, rather than outputting the relation between every pair of objects, which reduces HDFS access. This makes it feasible to compute the fuzzy rough set lower approximation and attribute dependency on big data. Experiments on UCI public data sets and artificially generated data sets show that the algorithm has good parallel performance and reduction performance in a big-data environment. (Chapter 4)

3. On the robust fuzzy rough set model, a parallel attribute reduction algorithm for Gaussian kernel robust fuzzy rough sets is implemented with the MapReduce framework. The algorithm first computes, within each data split, the distances between every object and its k nearest objects of different decision classes, from which the k nearest such points of every object over the whole data set are obtained. It then uses the RNN operator to compute the lower approximation of each object, and finally computes the attribute dependency of all candidate reducts to obtain the attribute reduct. This strategy keeps the storage requirement low and reduces the time overhead of resource scheduling caused by repeated iterations on the Hadoop platform. The algorithm was tested on UCI public data sets, and its reduction performance and parallel performance under RNN operators with different parameters were analyzed. The experimental results show that the algorithm can reduce big data and overcomes the cases in which traditional models fail to obtain a reduct. The algorithm not only handles noisy data effectively but also has good parallel performance. (Chapter 5)
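The Map-phase strategy in item 2 (emit only each object's minimum distance to differently labeled objects, then take the global minimum) can be illustrated with a small single-process sketch. It is not the thesis's implementation: the function names, the kernel form exp(-d²/δ), the lower-approximation membership sqrt(1 - k²) commonly used for Gaussian kernel fuzzy rough sets, and the assumption that every mapper can read all objects are illustrative choices, and no Hadoop API is used.

```python
import math

def sq_dist(x, y, attrs):
    """Squared Euclidean distance restricted to the attribute indices in attrs."""
    return sum((x[a] - y[a]) ** 2 for a in attrs)

def map_split(objects, split, attrs):
    """Map step (sketch): for every object, emit the smallest squared distance
    to an object of a different decision class found inside this one split.
    Assumption: each mapper can read the full object list."""
    for oid, (x, label) in enumerate(objects):
        best = math.inf
        for y, other_label in split:
            if other_label != label:
                best = min(best, sq_dist(x, y, attrs))
        yield oid, best  # one value per object, not one per object pair

def reduce_min(pairs):
    """Reduce step (sketch): keep the global minimum distance per object id."""
    nearest = {}
    for oid, d2 in pairs:
        nearest[oid] = min(nearest.get(oid, math.inf), d2)
    return nearest

def dependency(nearest, delta):
    """Attribute dependency as the mean lower-approximation membership.
    Assumed forms: k = exp(-d^2 / delta), membership = sqrt(1 - k^2)."""
    total = 0.0
    for d2 in nearest.values():
        k = math.exp(-d2 / delta)
        total += math.sqrt(max(0.0, 1.0 - k * k))
    return total / len(nearest)

def run(objects, splits, attrs, delta):
    """Chain the map and reduce sketches for one candidate attribute subset."""
    pairs = (p for split in splits for p in map_split(objects, split, attrs))
    nearest = reduce_min(pairs)
    return nearest, dependency(nearest, delta)

# Hypothetical toy data: attribute 0 separates the classes, attribute 1 does not.
objects = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"),
           ((1.0, 0.0), "b"), ((1.0, 1.0), "b")]
splits = [objects[:2], objects[2:]]
nearest_a0, gamma_a0 = run(objects, splits, attrs=[0], delta=1.0)
nearest_a1, gamma_a1 = run(objects, splits, attrs=[1], delta=1.0)
```

Because the Gaussian kernel decreases monotonically with distance, the infimum in the lower approximation is reached at the nearest object of a different class, which is why a single minimum per object is enough. A PSO search over attribute subsets, as in item 1, could call `run` as its fitness evaluation for each candidate subset.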
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP18

[References]