
Research on Several Preprocessing Methods in the Field of Data Mining

Published: 2018-06-06 18:28

  Topics: rough sets + discretization; Source: master's thesis, China University of Petroleum (Beijing), 2016


[Abstract]: Real-world data are often incomplete and inconsistent, and data preprocessing techniques were developed to improve the quality of data mining. This thesis introduces rough set theory and, on that basis, studies two problems. 1. Building on the traditional reduction method based on attribute dependency, a more precise concept of the enhanced positive region is defined. By partitioning the boundary region precisely, the enhanced dependency between each condition attribute and the decision attribute is determined, and a top-down heuristic search algorithm produces the reduction result. Experiments on UCI data sets show that, compared with classical methods, REPR reduces the attributes of a decision table more effectively. 2. First, the discretization problem is described formally and defined as an optimization problem. Second, based on the idea of information entropy, the modified information gain ratio (IIGR) and statistical similarity (SIS) are each defined as optimization objective functions for discretization, and the discretization constraints are given. Finally, a genetic algorithm is used to discretize the continuous attributes. In comparative experiments on UCI data sets, and in the statistical sense, the proposed method yields fewer discrete intervals, smaller decision trees built from the discretized data, and higher classification accuracy. This indicates that, with optimization as the guide, discretizing multiple continuous attributes in parallel while accounting for the relationships among attributes makes data discretization more effective.
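For orientation, the two classical quantities that the abstract takes as its starting points can be sketched in a few lines of Python. This is only a minimal illustration of the textbook definitions: the rough-set dependency degree gamma_C(D) = |POS_C(D)| / |U| and the information gain ratio of a single cut point. It is not the thesis's REPR, IIGR, or SIS, whose definitions the abstract does not give. The function names and the toy decision table are assumptions made for this example.

```python
from collections import defaultdict
from math import log2


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, cond, dec):
    """Classical dependency degree: share of rows in the positive region,
    i.e. rows whose condition class maps to a single decision value."""
    pos = 0
    for block in partition(rows, cond):
        if len({rows[i][dec] for i in block}) == 1:
            pos += len(block)
    return pos / len(rows)


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    n = len(labels)
    counts = defaultdict(int)
    for y in labels:
        counts[y] += 1
    return -sum(c / n * log2(c / n) for c in counts.values())


def gain_ratio(values, labels, cut):
    """Information gain ratio of splitting a continuous attribute at cut."""
    left = [y for v, y in zip(values, labels) if v <= cut]
    right = [y for v, y in zip(values, labels) if v > cut]
    if not left or not right:
        return 0.0  # the cut does not separate anything
    n = len(labels)
    cond_h = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return (entropy(labels) - cond_h) / split_info


# Hypothetical toy decision table: condition attributes a, b; decision d.
rows = [
    {"a": 0, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},
    {"a": 1, "b": 1, "d": "no"},
]
full = dependency(rows, ["a", "b"], "d")
only_a = dependency(rows, ["a"], "d")
best_cut = gain_ratio([1.0, 2.0, 3.0, 4.0], ["no", "no", "yes", "yes"], 2.5)
```

In this toy table, dropping attribute b lowers the dependency degree, which is the signal a dependency-based reduction uses to decide that b cannot be removed. A heuristic search or a genetic algorithm, as mentioned in the abstract, would call scoring functions of this kind many times while exploring attribute subsets or cut-point sets.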
[Degree-granting institution]: China University of Petroleum (Beijing)
[Degree level]: Master's
[Year awarded]: 2016
[Classification number]: TP311.13; TP18


Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1987699.html