Research on Classification Algorithms for Massive Inconsistent Data
Published: 2018-11-29 12:07
[Abstract]: In recent years, as the volume of real-world data has grown exponentially, inconsistent data has appeared more and more frequently. The traditional approach is to repair inconsistent data through manual correction. However, as the amount of inconsistent data keeps growing exponentially, manual correction becomes ever more time-consuming, and at that scale it inevitably introduces human errors, so the corrected data itself ends up containing mistakes. Manual correction is therefore no longer feasible. The core question of this thesis is how to perform feature selection and classification directly on inconsistent data, without any manual repair. The decision tree is a classifier with good performance: it tolerates erroneous and outlying records well, and the tree structure it builds is interpretable, so the data subsets produced by each split can be read off directly; this thesis therefore chooses it as the classification algorithm to improve. The mutual information algorithm measures the relevance between an individual feature and the target feature by computing a correlation score from their co-occurrence probabilities; this thesis chooses it as the feature-selection algorithm to improve. The thesis first improves the decision tree algorithm so that it can classify inconsistent data directly, and obtains good results. The constraints studied are functional dependencies. Because the pre-features and post-features of a dependency (its left- and right-hand sides) behave differently in inconsistent data, the improved algorithm is designed to treat the two kinds of features with different computations: the objective function of the decision tree is modified, changing how the features that appear in the constraints are evaluated when the data are partitioned, and the influence of a constrained feature on the classification result is measured from several aspects so that its influence factor can be adjusted and the node splits become more accurate. As the volume of inconsistent data grows exponentially, so does the dimensionality of its features. High-dimensional features make building a classification model time-consuming, while features only weakly related to the target feature contribute little to the model. This thesis therefore improves the mutual information algorithm so that it can assess feature importance on inconsistent data sets and select the features that most influence the target feature for model building; here, too, the features involved in functional dependencies are separated into pre-features and post-features, and the algorithm is adapted to how each kind behaves in inconsistent data. Comparative experiments show that the improved decision tree and mutual information algorithms achieve clearly better classification results than the baseline algorithms.
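For reference, the mutual information the abstract relies on is the standard quantity computed from co-occurrence probabilities: for a discrete feature F and the class feature C,

\[ I(F;C) \;=\; \sum_{f}\sum_{c} p(f,c)\,\log\frac{p(f,c)}{p(f)\,p(c)} . \]

The Python snippet below is a minimal illustrative sketch, not the thesis's published code. It shows the two ingredients the abstract combines: detecting tuples that violate a functional dependency (whose left-hand side plays the role of the pre-features and whose right-hand side the role of the post-features), and scoring each feature by its mutual information with the class label, discounted when the feature participates in inconsistent tuples. The function names, the discount rule, and the toy records are assumptions made for demonstration only; the abstract does not state the thesis's actual influence-factor formula.

```python
# Illustrative sketch only: the weighting scheme, function names and toy data
# below are assumptions, not the thesis's actual method.
from collections import Counter, defaultdict
from math import log2


def fd_violations(rows, lhs, rhs):
    """Indices of rows that violate the functional dependency lhs -> rhs."""
    rhs_seen = defaultdict(set)    # lhs value -> distinct rhs values observed
    members = defaultdict(list)    # lhs value -> indices of rows with that value
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in lhs)
        rhs_seen[key].add(tuple(row[a] for a in rhs))
        members[key].append(i)
    bad = set()
    for key, values in rhs_seen.items():
        if len(values) > 1:        # same antecedent, conflicting consequents
            bad.update(members[key])
    return bad


def mutual_information(xs, ys):
    """Mutual information of two discrete sequences, from co-occurrence counts."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def weighted_feature_scores(rows, features, target, fd):
    """Score features by MI with the target, discounted (an assumed rule) when
    the feature appears in the FD and the data violate that FD."""
    bad = fd_violations(rows, fd["lhs"], fd["rhs"])
    labels = [row[target] for row in rows]
    scores = {}
    for f in features:
        mi = mutual_information([row[f] for row in rows], labels)
        involved = f in fd["lhs"] or f in fd["rhs"]
        penalty = len(bad) / len(rows) if involved else 0.0
        scores[f] = mi * (1.0 - penalty)
    return scores


if __name__ == "__main__":
    # Toy records with an assumed FD zip -> city; the third row violates it.
    rows = [
        {"zip": "150001", "city": "Harbin",  "income": "high", "label": "yes"},
        {"zip": "150001", "city": "Harbin",  "income": "low",  "label": "no"},
        {"zip": "150001", "city": "Beijing", "income": "high", "label": "yes"},
        {"zip": "100000", "city": "Beijing", "income": "low",  "label": "no"},
    ]
    fd = {"lhs": ["zip"], "rhs": ["city"]}
    print(weighted_feature_scores(rows, ["zip", "city", "income"], "label", fd))
```

The same kind of discount could just as well be folded into a decision tree's split criterion rather than a feature-selection score, which is closer in spirit to the decision-tree variant the abstract describes.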
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year of degree conferral]: 2017
[CLC number]: TP311.13
Article ID: 2364954
Link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2364954.html