An Attribute Reduction Algorithm Based on Dynamic Neighborhood Rough Sets for Text Classification

Published: 2018-05-18 13:40

Topic: rough set + neighborhood rough set model; Reference: Shandong University of Science and Technology, 2017 master's thesis

[Abstract]: As machine learning has advanced, computers have become far more capable of processing massive data. Massive data, however, contains a great deal of redundant and incomplete information, which severely degrades the performance of machine learning algorithms. To address this, researchers proposed data reduction: removing redundant information from the data while preserving its original classification ability. How to reduce massive data effectively while retaining as much useful information as possible is an important research direction in data mining and machine learning. In recent years rough set theory, an effective tool for analyzing imprecise, inconsistent, and incomplete data, has been widely applied in machine learning and many other fields. The neighborhood rough set model, an extension of rough sets, handles continuous data well and thereby avoids the information loss and the dependence on discretization methods that arise in classical rough sets. This thesis studies the neighborhood rough set model and the attribute reduction algorithms based on it. The main work is as follows. (1) To better determine neighborhood values suited to a particular data set and to improve the reduction result, the thesis combines the FCM (fuzzy c-means) algorithm with neighborhood rough sets and, using attribute significance as the heuristic, constructs a forward greedy attribute reduction algorithm based on a Canopy-FCM asymmetric dynamic neighborhood rough set model. The algorithm determines a specific neighborhood value for each attribute, so that neighborhood values are set entirely from the distribution of the data. This avoids the drawbacks of a single fixed global neighborhood value and selects the attributes that contribute most to decision-making more accurately. Experiments on public UCI data sets show that the algorithm retains fewer conditional attributes while improving classification accuracy. (2) The proposed attribute reduction algorithm is applied to Chinese text classification to extract key feature words and to reduce the effect of redundant words on classification. Experiments on the Chinese text corpus compiled by Li Ronglu show that the algorithm effectively reduces the number of text feature words, lowers the dimensionality of the text set, and improves the classification of text data, which makes it easier to capture key information accurately and gives the method some practical value.
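The forward greedy reduction described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the Canopy-FCM procedure for choosing neighborhood values is not specified in the abstract, so the hypothetical helper `per_attribute_radii` substitutes a simple stand-in (standard deviation divided by an assumed parameter `lam`). The function names, the per-attribute box-shaped neighborhood, and the stopping threshold `eps` are likewise assumptions made for illustration. The sketch only shows the general pattern: compute a dependency degree from neighborhoods that are pure in the decision class, then repeatedly add the attribute with the largest gain.

```python
import numpy as np

def per_attribute_radii(X, lam=2.0):
    # Stand-in heuristic, NOT the thesis's Canopy-FCM procedure:
    # one neighborhood radius per attribute, set to std / lam.
    return X.std(axis=0) / lam

def dependency(X, y, attrs, radii):
    """Fraction of samples whose neighborhood under `attrs` contains one class only."""
    if not attrs:
        return 0.0  # convention for the empty attribute set
    sub = X[:, attrs]
    r = radii[attrs]
    # diff[i, j, k] = |x_i[k] - x_j[k]| over the selected attributes
    diff = np.abs(sub[:, None, :] - sub[None, :, :])
    # x_j is a neighbor of x_i if it lies within the radius on every selected attribute
    neigh = np.all(diff <= r, axis=2)
    same = y[:, None] == y[None, :]
    # a sample is in the positive region if all its neighbors share its class
    consistent = np.all(~neigh | same, axis=1)
    return float(consistent.mean())

def forward_greedy_reduct(X, y, radii, eps=1e-6):
    """Add the attribute with the largest dependency gain until no gain exceeds eps."""
    remaining = list(range(X.shape[1]))
    reduct, gamma = [], 0.0
    while remaining:
        gains = [dependency(X, y, reduct + [a], radii) - gamma for a in remaining]
        best = int(np.argmax(gains))
        if gains[best] <= eps:
            break
        reduct.append(remaining.pop(best))
        gamma += gains[best]
    return reduct, gamma

# Toy data: attribute 0 separates the classes, attribute 1 does not.
X_demo = np.array([[0.0, 0.5], [0.1, 0.1], [0.2, 0.9],
                   [1.0, 0.4], [1.1, 0.8], [1.2, 0.2]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
radii_demo = per_attribute_radii(X_demo)
reduct_demo, gamma_demo = forward_greedy_reduct(X_demo, y_demo, radii_demo)
print(reduct_demo, gamma_demo)
```

On the toy data the sketch keeps only attribute 0, because its neighborhoods never mix classes, while attribute 1 adds no dependency gain. The pairwise comparison uses memory quadratic in the number of samples, so it suits small illustrations rather than large text collections.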
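[Degree-granting institution]: Shandong University of Science and Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP18; TP391.1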

[References]