代价敏感属性约简的归并算法研究

发布时间：2018-09-05 17:04

【摘要】：数据挖掘又称数据库中的知识发现(Knowledge Discover in Database, KDD),是目前人工智能和数据库领域研究的热点问题。在现实世界中,数据集中存储的属性有几十、几百、甚至上千种。这些属性中有很多是冗余的,它们会干扰数据挖掘的过程,也很大程度上会影响算法的效率。因此,人们提出了属性约简这一预处理技术。另一方面,现实世界中的行为或者事物都有各种代价,如测试代价、误分类代价、延迟代价等,涉及金钱、时间、人工等方面的开销。代价敏感学习致力于涉及各类代价的挖掘问题。当前被研究的代价敏感属性约简问题包括：最小测试代价属性约简问题、简单公共测试代价属性约简问题、最小测试时间代价问题等。人们将不同算法框架应用于这些具体问题。启发式算法的速度很快,但是由于它们常常会收敛于局部最优解,因此正确率不高。回溯算法虽然能够保证找到最优解,但是运行时间往往不能被接受。仿生算法也常常能找到最优解,不过其耗费的时间代价过大。最近还有学者提出半贪心算法,能够在一定时间内得到较好的结果。本文将分治算法与回溯算法相结合,提出一种归并算法,以改善回溯算法的不足。本文的归并算法包含三个关键技术：分组与合并、回溯算法、以及竞争机制。该算法先将属性随机分组,得到一些属性子集,组的大小g对算法的性能有很大的影响。在极端情况下,组的大小与属性数目相同的情况下,归并算法将退回为回溯算法。属性子集通过回溯算法得到属性子集的约简,然后将每对相邻的约简合并成一个新的属性子集。重复以上过程直到只剩一个属性子集,这个属性子集的约简就是原问题的约简。属性分组后,全局重要的属性可能在局部约简时被删除,导致归并算法得到的解不是全局最优解。因此我们采用竞争机制,运行归并算法p次,得到p个解,再从这p个解中选取最优解。本文将该算法运用于最小测试代价属性约简问题、简单公共测试代价属性约简问题以及最小测试时间代价问题这三个问题。并使用来自UCI(University of California-Irvine)数据库中的四个真实数据集对提出的归并算法进行实验。其中每种数据集使用了三种不同分布的测试代价。通过实验我们得知：竞争机制能有效提高结果的质量；对于不同问题,p值大于6之后算法结果趋于稳定；最优g值对于不同情况略有不同,在最小测试代价问题中为6,简单公共测试代价、最小测试时间代价问题中均为7。与现有的启发式算法、蚁群算法和回溯算法相比,归并算法在保持较高的正确率的情况下,能够大大缩短运行时间。因此,本文提出的归并算法是针对该类问题的一种有效并且高效的算法。
[Abstract]:Data mining, also known as knowledge discovery (Knowledge Discover in Database, KDD),) in database, is a hot issue in the field of artificial intelligence and database. In the real world, data sets store dozens, hundreds, or even thousands of attributes. Many of these attributes are redundant, which interfere with the process of data mining and greatly affect the efficiency of the algorithm. Therefore, the preprocessing technique of attribute reduction is proposed. On the other hand, behaviors or things in the real world have various costs, such as the cost of testing, the cost of misclassification, the cost of delay, the cost of money, time, labor, and so on. Cost-sensitive learning focuses on mining problems involving various costs. The cost sensitive attribute reduction problem which has been studied at present includes: the minimum test cost attribute reduction problem, the simple common test cost attribute reduction problem, the minimum test time cost problem and so on. Different algorithms are applied to these specific problems. The speed of heuristic algorithms is very fast, but the accuracy is not high because they often converge to local optimal solutions. Although the backtracking algorithm can guarantee to find the optimal solution, the running time is often unacceptable. Bionic algorithms can often find the optimal solution, but the cost of time is too large. Recently, some scholars have proposed a half-greedy algorithm, which can get better results in a certain time. In this paper, a merging algorithm is proposed to improve the deficiency of the backtracking algorithm by combining the partition-and-conquer algorithm with the backtracking algorithm. The merging algorithm consists of three key technologies: grouping and merging, backtracking algorithm, and competition mechanism. In this algorithm, the attributes are grouped randomly and some subsets of attributes are obtained. The size of the group g has a great influence on the performance of the algorithm. In extreme cases, when the group size is the same as the number of attributes, the merging algorithm is returned to the backtracking algorithm. The reduction of attribute subset is obtained by backtracking algorithm, and then each pair of adjacent reduction is merged into a new attribute subset. Repeat the above process until there is only a subset of attributes whose reduction is the reduction of the original problem. After the attributes are grouped, the globally important attributes may be deleted in the local reduction, resulting in the solution obtained by the merging algorithm is not the global optimal solution. So we use the competition mechanism, run the merging algorithm p times, obtain p solutions, and then select the optimal solution from the p solution. In this paper, the algorithm is applied to three problems: the minimum test cost attribute reduction problem, the simple common test cost attribute reduction problem and the minimum test time cost problem. Four real data sets from UCI (University of California-Irvine) database are used to test the proposed merging algorithm. Each data set uses three different distribution of test costs. Through experiments, we know that the competition mechanism can effectively improve the quality of the results, the algorithm results tend to be stable when the value of p is greater than 6 for different problems, and the optimal g value is slightly different for different cases. In the minimum test cost problem is 6, the simple common test cost and the minimum test time cost problem are all 7. 5%. Compared with the existing heuristic algorithm, ant colony algorithm and backtracking algorithm, the merging algorithm can greatly shorten the running time while maintaining a higher accuracy. Therefore, the merging algorithm proposed in this paper is an effective and efficient algorithm for this kind of problems.
【学位授予单位】：西南石油大学
【学位级别】：硕士
【学位授予年份】：2016
【分类号】：TP18;TP311.13

【参考文献】