基于约束的频繁模式挖掘方法以及应用研究

发布时间：2018-07-28 11:58

【摘要】：基于约束的频繁模式挖掘是数据挖掘研究中最基本问题之一,具有广泛的实际应用。然而,在这个研究领域中,仍然存在三个方面的挑战：(1)如何拓展新的应用?具体而言,除了模式的“支持度”,怎样设计一些新模式指标更好地去度量模式的兴趣度,以满足新应用的需求；(2)和模式支持度的反单调性不同,所提新模式指标的性质通常都比较复杂,比如它不满足单调性、反单调性、可转换性、简明性等。那么对一个模式,如何快速计算其所有父模式关于该指标的上/下界,并利用这个新模式指标的特性设计出高效算法；(3)通常,不同的应用,有不同新模式指标的提出,然后分别提出不同的模式上/下界的计算方法。那么有没有一种通用方法可以计算任一模式指标的上/下界?针对以上问题和挑战,本文开展了基于约束的频繁模式挖掘的方法及其应用研究,主要成果及贡献如下：首先,提出了一个基于模式挖掘的网页内容推荐方法。网页内容推荐就是从网页中找到重要的内容块组合推荐给用户,有着很多的应用(比如网页智能打印、移动设备上的电子阅读等)。目前有许多的方法试图去解决这个问题,但在这些方法中,要么就是针对于特定网页(比如新闻、博客类的网页),要么就是半自动化的(用户需要额外的操作去选择网页的内容块)。针对于任一类型的网页,如何全自动地提取网页中的有效内容,目前还没有得到很好地解决。为此,本文利用之前用户对相似网页的选择方式,将该问题形式化成一个模式挖掘推荐问题,提出了一个基于模式挖掘的网页内容推荐方法,可以为任一类型的网页提供更加准确的网页内容推荐。具体而言,推荐给用户的内容块组合(模式)不仅要频繁被其它用户选择,而且要越完整越好。鉴于此,本文提出了一个新的模式兴趣指标,即占有度,来衡量模式在其支持数据库上的完整度。结合模式的支持度和占有度,可以提供给用户更加准确、满意的网页内容推荐。最后,同基准方法比较,在真实的数据集上的实验结果表明所提方法能取得更加满意的推荐结果和运行效率。其次,提出了一个基于占有度的频繁模式挖掘通用高效算法。本章分别对占有度的定义、界估算方法以及应用三个层面进行深度扩展。具体而言,基于不同的加权平均(算术平均和调和平均),提出了两种不同的占有度定义,即算术占有度和调和占有度。与模式支持度的反单调性不同,占有度的性质即不满足单调性、反单调性,又不满足可转换性、简明性,那么对一个模式,如何快速计算其所有父模式关于占有度的一个上界?为此,对于每一种占有度定义,本文分别提出了三种上界：高效、最‘紧’和折中上界。高效上界对于单个结点计算比较高效,但是比较松散,需要搜索结点数比较多；最‘紧’上界得到的界比较紧凑,因而搜索很少的结点,但是计算单个结点比较耗时；为此,本文提出了一个折中上界,在松紧度和计算复杂度之间达到一个均衡,使算法整体性能达到最优。占有度的概念不仅对于事务数据库上的应用很重要(比如网页内容打印推荐),而且对于序列数据库中上的应用也非常重要(比如旅游餐景点推荐),为此,本文提出了一个通用算法DOFRA可以同时处理不同类型数据库上的应用。最后,在两个实际应用中验证了DOFRA的有效性,同时也在大量的合成数据中验证了DOFRA算法运行效率。最后,提出了一个通用模型可以高效估算任一模式指标的上／下界。基于约束模式挖掘不仅有助于捕捉更多的模式的语义信息,而且还可以利用约束的性质进一步地提高挖掘效率。在一些实际的应用驱动下,通常会提出一些新的模式指标去度量模式的兴趣度,然后分别估算所提模式指标的上／下界,缺少一个适合于任一模式指标的统一框架。为此,本文形式化了只考虑项标记的界估计问题,提出了一个通用模型可以高效解决这个问题。为了更加直观地展示所提通用框架的有效性,本文给出了两个非常典型的模式指标作为学习案例,即模式效用和模式占有度。除此之外,为满足不同的应用需求,本文把传统的基于SQL的模式指标,比如min, max, avg, var等,给扩展成了相对模式指标形式。最后,在真实和合成数据上的实验分析验证了该技术方案的通用性和有效性。
[Abstract]:Frequent pattern mining based on constraints is one of the most basic problems in the research of data mining and has a wide range of practical applications. However, there are still three challenges in this field: (1) how to expand the new application? Specifically, in addition to the "support" of the model, how to design some new model indicators to better measure it Mode of interest to meet the needs of the new application; (2) the anti mononality of the model support is different, and the properties of the proposed new model are usually more complex, such as it does not satisfy monotonicity, anti mono tonal, conversion, simplicity, etc. then, for a pattern, for example, how to quickly calculate all the upper / lower bounds of all its parent patterns on the index, And using the characteristics of this new model to design an efficient algorithm; (3) usually, different applications, with different new model indicators, and then put forward different model / lower bound calculation method. Then, is there a general method to calculate the upper / lower bounds of any pattern index? For the above problems and challenges, this paper develops The method and application of constraint based frequent pattern mining are summarized. The main achievements and contributions are as follows:
First, a web content recommendation method based on pattern mining is proposed. The recommendation of web content is to find important content block combinations from web pages to recommend users, and there are many applications (such as web page intelligent printing, electronic reading on mobile devices, etc.). There are many ways to solve this problem at present, but in these parties, there are many ways to solve this problem. In the law, either is for a specific web page (such as a web page for news, bloggers) or semi automated (users need additional operations to select the content blocks of a web page). For any type of web page, how to automatically extract effective content from a web page has not been well solved. The method of selecting the similar web page by the former user, makes the problem form a pattern mining recommendation problem, and proposes a web content recommendation method based on pattern mining, which can provide more accurate web content recommendation for any type of web page. Specifically, the content block combination (pattern) recommended to the user is not only frequent. Other users choose, and the more complete, the better. In view of this, this paper presents a new pattern of interest index, that is, the degree of possession, to measure the integrity of the pattern on its support database. Experimental results on real data sets show that the proposed method can achieve more satisfactory recommendation results and operational efficiency.
Secondly, a general efficient algorithm for mining frequent pattern mining based on occupancy is proposed. This chapter extends the definition of occupancy, the method of boundary estimation and the application of three levels. Specifically, two different definitions of occupancy are proposed based on the different weighted mean (arithmetic mean and harmonic mean), that is, the arithmetic occupancy. And harmonic possession. Unlike the anti mononality of the pattern support, the nature of possession is not satisfied with monotonicity, anti mononality, and is not satisfied with the convertability and simplicity. Then, how to quickly calculate the upper bound of all the parent patterns about the degree of possession for a pattern? For this, for each definition, three The upper bound is efficient, the most 'tight' and the upper bound. The high efficient upper bound is more efficient for single node computing, but it is looser, it needs to search a lot of nodes; the most tight upper bound is compact, so it searches for a few nodes, but the calculation of a single node is more time-consuming; for this reason, this paper puts forward a middle upper bound, A balance between the tightness and computational complexity makes the overall performance of the algorithm optimal. The concept of occupancy is not only important for the application on the transaction database (such as web page content printing recommendation), but also is very important for the application of the sequence database (such as a tourist attraction recommendation). For this reason, this paper proposes A universal algorithm DOFRA can process applications on different types of databases at the same time. Finally, the validity of DOFRA is verified in two practical applications, and the efficiency of the DOFRA algorithm is verified in a large number of synthetic data.
Finally, a general model is proposed to efficiently estimate the upper / lower bounds of any pattern index. Constraint based mining is not only helpful to capture more semantic information of the pattern, but also can further improve the mining efficiency by using the nature of constraints. The interest degree of the metric pattern is labeled, then the upper / lower bounds of the model indexes are estimated, and a unified framework suitable for any pattern index is lacking. Therefore, this paper formally considers the boundary estimation problem of only item markers, and proposes a general model to efficiently solve the problem. For the effectiveness of the framework, this paper gives two typical model indexes as learning cases, namely, pattern utility and pattern occupancy. In addition, in order to meet different application requirements, this paper extends the traditional SQL based pattern indicators, such as min, Max, AVG, VaR, and so on. The experimental analysis on the data shows the versatility and effectiveness of the proposed scheme.
【学位授予单位】：中国科学技术大学
【学位级别】：博士
【学位授予年份】：2014
【分类号】：TP311.13;TP393.092

【共引文献】