面向不确定数据的频繁模式挖掘方法研究

发布时间：2018-05-13 19:23

本文选题：反单调性 + 频繁模式　；参考：《山东师范大学》2016年博士论文

【摘要】：大数据时代悄然到来,数据挖掘技术正面临着前所未有的机遇和挑战。作为数据挖掘领域的重要研究课题,频繁模式挖掘和关联规则发现受到了持续而广泛的关注,并且涌现了大量经典理论、高效算法和新兴应用领域。挖掘频繁项集,是关联规则发现中的关键技术和步骤,并决定了关联规则的总体性能,目前已广泛应用于市场销售、文本挖掘、公众健康等各个领域。在实际应用中,由于技术手段有限、测量设备误差、通讯开销限制和用户隐私保护等诸多因素的影响,获得的原始数据往往存在不确定性。同时,受到主客观条件的限制,频繁模式挖掘过程中也会带来一系列的不确定性,这些不确定性在挖掘过程中不断传播和积累,可能导致挖掘出的知识与真实结果之间存在较大差距甚至毫无意义。而传统的挖掘方法却未将这些因素考虑进去,只简单地认为挖掘出的知识一般都是有用的和确定的,致使传统的频繁模式挖掘方法在处理不确定数据时面临着得到的挖掘结果异常却难以解释的窘态。这显然是不科学和不妥当的。因此,针对不确定频繁模式挖掘的研究显得尤为重要,并日益受到广大研究人员的关注。本文主要针对两类典型的不确定性数据,即概率数据和容错数据,进行概率频繁模式挖掘和近似频繁模式挖掘的研究,并应用在中医药诊疗数据环境下,实现基于不确定数据的高效频繁模式挖掘。本文的主要工作和成果总结如下:1.针对概率数据中垂直格式的数据表示形式,提出了一种基于Eclat框架的概率频繁项集精确挖掘算法(UBEclat)。首先,对于采用垂直数据格式的概率数据,本文设计了一种适用于Eclat框架,旨在提高算法执行效率的双向排序策略,然后基于概率频度的定义,提出了采用分而治之方法的概率频繁项集精确挖掘算法。在基准数据集和真实数据集上的对比实验表明,UBEclat算法能够依据支持度的概率分布,准确挖掘出所有概率频繁项集。这为有效解决概率频繁项集的精确挖掘问题提供了新的思路。2.针对概率频繁项集精确挖掘算法执行效率较低,运行时间过长的问题,基于概率数据的可能性理论,提出了一种高效的概率频繁项集近似挖掘算法(NDUEclat)。结合Eclat框架和近似方法的优势,NDUEclat算法采用分而治之的方法,应用大数定律优化挖掘过程,改进了频繁项集挖掘的效率。在基准数据集和真实数据集上的多组对比实验也验证了该算法具有良好的挖掘性能。目前,这也是第一个基于支持度的概率分布,在垂直数据格式的概率数据中高效挖掘不确定频繁项集的近似算法。3.针对NP-hard类的容错频繁模式挖掘问题,提出了一种将容错数据库映射为事务信息系统,基于粗糙集理论挖掘近似频繁模式的新方法。依据挖掘出的频繁项目确定决策表中的决策属性;基于粗糙集理论中上近似和下近似概念,确定近似频繁模式的匹配程度。在基准数据集和真实数据集上进行的对比实验证实了该方法在挖掘的准确率指标上,比以往方法有更好的性能表现。显然,基于粗糙集理论的近似挖掘方法为有效提高近似频繁模式挖掘的准确性和适用性提供了新的思路。4.以减少敏感参数设置的影响、提高挖掘效率的同时保证实际挖掘结果的可用性为目的,研究了基于容错数据的粗糙集理论,提出了一种挖掘近似频繁闭模式的新模型。新模型主要由三部分组成:用聚类算法完成数据预处理;对同一类中的事务依据粗糙集理论进行属性约简生成核模式;将核模式作为初始种子构建等价类,用分而治之的方法挖掘近似频繁闭模式。在传统中医药数据集上的实验结果表明,该模型可以更精准地表达近似频繁模式,有利于实现基于中医诊疗应用的知识发现。综上所述,本文针对概率数据中如何提高频繁模式挖掘的效率、如何屏蔽容错数据中因数据表达不准确而对挖掘结果造成的影响以及如何确定容错率以获得有意义的挖掘结果等问题,从数据库的特点和数据的表示方式、模式挖掘的类型、具体挖掘技术的选择等几个不同的角度提出了相应的解决方案,并通过实验验证了它们的有效性。本文工作可以为今后面向不确定数据的频繁模式挖掘研究提供帮助。
[Abstract]:As the big data era is coming, data mining technology is facing unprecedented opportunities and challenges. As an important research topic in the field of data mining, frequent pattern mining and association rules discovery have been paid more and more attention, and a large number of classical theories, high efficiency algorithms and new application fields are emerging, and frequent itemsets are excavated. The key technologies and steps of association rules are found, and the overall performance of association rules is determined. It has been widely used in various fields, such as market sales, text mining, public health and so on. In practical applications, the influence of many factors, such as limited technical means, measurement equipment error, communication overhead limit and user privacy protection, has been obtained in practical application. The original data often have uncertainty. At the same time, limited by the subjective and objective conditions, the frequent pattern mining also brings a series of uncertainties. These uncertainties are constantly spread and accumulated during the mining process, which may lead to a large gap or even meaningless between the knowledge and the real results. The mining method does not take these factors into consideration, only simply thinks that the mining knowledge is generally useful and definite, which causes the traditional frequent pattern mining method to face the abnormal but difficult to explain when dealing with the uncertain data. This is obviously unscientific and inappropriate. The study of frequent pattern mining is particularly important and is increasingly concerned by the researchers. This paper mainly focuses on two types of typical uncertain data, probability data and fault-tolerant data, the research of probability frequent pattern mining and approximate frequent pattern mining, and is applied to the diagnosis and treatment data environment of traditional Chinese medicine. The main work and achievements of this paper are summarized as follows: 1. in view of the data representation of vertical format in probabilistic data, an accurate mining algorithm for probability frequent itemsets based on Eclat framework (UBEclat) is proposed. First, this paper designs the probability data for using the vertical data format. The Eclat framework is suitable for improving the efficiency of the algorithm. Then, based on the definition of probability frequency, a precise mining algorithm for probability frequent itemsets using a divide and conquer method is proposed. The comparison experiment on the datum data set and the real data set shows that the UBEclat algorithm can be based on the probability distribution of the support degree. In order to effectively solve the problem of accurate mining of probability frequent itemsets, it provides a new idea.2. for the problem of low execution efficiency and long running time for the accurate mining algorithm of probability frequent itemsets. Based on probability data possibility theory, an efficient approximate mining of probability frequent itemsets is proposed. Algorithm (NDUEclat). Combining the advantages of Eclat framework and approximate method, NDUEclat algorithm adopts a divide and conquer method, uses large number law to optimize the mining process and improves the efficiency of frequent itemsets mining. The multi group comparison experiments on datum dataset and real data set also verify that the algorithm has good mining performance. It is the first probability distribution based on the support degree. An approximate algorithm for mining the uncertain frequent itemsets in the probability data of the vertical data format.3. is used to solve the fault tolerant frequent pattern mining problem of the NP-hard class. A new method of mapping the fault tolerant database into a transaction information system based on the rough set theory is proposed. Method. The decision attributes in the decision table are determined according to the frequent items excavated, and the matching degree of the approximate frequent patterns is determined based on the upper and lower approximation concepts in the rough set theory. The comparison experiments on the datum data set and the real dataset confirm that the method has a better performance than the previous method on the accuracy index of the mining. It is obvious that the approximate mining method based on rough set theory provides a new idea to effectively improve the accuracy and applicability of the approximate frequent pattern mining,.4. to reduce the influence of the setting of sensitive parameters, improve the efficiency of mining and ensure the availability of the actual mining results, and study the rough set based on fault-tolerant data. In this paper, a new model for mining approximately frequent closed patterns is proposed. The new model is composed of three parts: data preprocessing by clustering algorithm, and attribute reduction in the same class of transactions based on rough set theory to generate the kernel model; the kernel model is used as the initial seed to construct the equivalent class, and a divide and conquer method is used to excavate the approximate frequency. The experimental results on traditional Chinese medicine data set show that the model can express approximate frequent patterns more accurately and be helpful to realize knowledge discovery based on the application of diagnosis and treatment of traditional Chinese medicine. In summary, this paper aims at how to improve the efficiency of frequent pattern mining in the probability data and how to shield the data from fault tolerance data to express inaccurate data. In order to determine the impact of the mining results and how to determine the fault tolerance rate to obtain meaningful mining results, the corresponding solutions are proposed from the characteristics of the database, the representation of data, the type of pattern mining, the selection of specific mining techniques and so on, and their effectiveness is verified by experiments. This work can help future research on frequent pattern mining for uncertain data.

【学位授予单位】：山东师范大学
【学位级别】：博士
【学位授予年份】：2016
【分类号】：TP311.13

【参考文献】