基于频繁模式树的关联规则算法研究

发布时间：2018-11-02 08:34

【摘要】： 数据挖掘是近年来迅速发展的信息处理技术,从大量的、不完全的、有噪声的、模糊的、随机的数据中,提取隐含在其中的、人们事先不知道的、但又是潜在有用的信息和知识的过程。它涉及数据库、人工智能、机器学习、模式识别、知识工程、面向对象、信息检索和可视化等一系列技术。关联规则挖掘作为数据挖掘领域的一个重要研究分支,它的任务是发现所有满足支持度阈值和置信度阈值的强关联规则。关联规则挖掘算法是关联规则挖掘研究的主要内容,迄今为止已经提出了许多高效的关联规则挖掘算法。本文对经典的Apriori和AprioriTid算法以及不产生候选集的FP-Growth算法进行了分析和研究。FP-Growth算法比Apriori算法在性能上有了很大提高,它仅需要扫描数据库两次,并且避免了产生大量的候选项集。但FP-Growth算法主要的瓶颈之一就是空间开销大。为了节省空间,提高频繁项的发现效率,本文对传统的频繁模式树和项头表进行了优化,采用动态构造哈希链地址的方法来构造项头表,FP-Tree的每个结点只存储该项在项头表中的地址,避免了在地址上出现空指针,节省了存储空间的开销,同时增加树结点的域实现了方便的双向遍历。此外还通过对事务数据库按一定的规则进行了划分,得到若干个数据库子集,然后分别对每个数据库子集进行数据挖掘,因而占用内存小,解决了内存无法装入频繁模式树的问题,使数据挖掘得以顺利进行。最后通过实验对基于频繁模式树的关联规则挖掘的优化算法与传统的频繁模式树的FP-Growth算法进行了比较,实验结果表明在挖掘大量数据信息时更有效。
[Abstract]:Data mining is a rapid development of information processing technology in recent years, from a large number of, incomplete, noisy, fuzzy, random data, extract hidden in them, people do not know in advance, But it is also a process of potentially useful information and knowledge. It involves a series of technologies, such as database, artificial intelligence, machine learning, pattern recognition, knowledge engineering, object oriented, information retrieval and visualization. As an important research branch in the field of data mining, the task of association rule mining is to find all strong association rules that satisfy the support threshold and confidence threshold. Association rule mining algorithm is the main content of association rule mining research. So far, many efficient association rules mining algorithms have been proposed. In this paper, the classical Apriori and AprioriTid algorithms and the FP-Growth algorithm without candidate set are analyzed and studied. The performance of the FP-Growth algorithm is greatly improved than that of the Apriori algorithm, and it only needs to scan the database twice. And avoid producing a large number of candidate items. However, one of the main bottlenecks of FP-Growth algorithm is the large space overhead. In order to save space and improve the efficiency of frequent item discovery, this paper optimizes the traditional frequent pattern tree and item header table, and constructs the item header table by dynamically constructing the hash chain address. Each node of FP-Tree only stores the address of the item in the item header table, avoids the null pointer on the address, saves the cost of storage space, and increases the domain of tree node to realize convenient two-way traversal. In addition, by dividing the transaction database according to certain rules, several subsets of the database are obtained, and then each subset of the database is mined separately, thus occupying less memory. It solves the problem that the memory can not load the frequent pattern tree, so that the data mining can proceed smoothly. Finally, the optimization algorithm of association rule mining based on frequent pattern tree is compared with the FP-Growth algorithm of traditional frequent pattern tree. The experimental results show that the algorithm is more effective in mining a large amount of data information.
【学位授予单位】：哈尔滨工程大学
【学位级别】：硕士
【学位授予年份】：2008
【分类号】：TP311.13

【参考文献】