Research on Query Expansion Techniques Based on Association Rules

Published: 2018-08-30 12:07

[Abstract]: As the volume of information on the web grows rapidly, it remains difficult for users to find exactly the information they want through a search engine. Low recall and low precision have become problems that search engines urgently need to solve. To address them, this thesis follows Van Rijsbergen's view that retrieval performance can be improved by modifying the original query, and studies query expansion techniques based on association rules. The main contents are as follows:
1. The thesis first introduces its foundations in detail: data mining, association rules, and query expansion. It then analyzes existing association-rule-based query expansion techniques and points out their strengths and weaknesses. Their common weakness is that they pay no attention to the efficiency of the rule-mining algorithm or to whether the chosen mining algorithm is suitable; this weakness is the focus of the thesis.
2. To address this problem, the thesis proposes, for the first time, a query expansion algorithm based on maximal frequent itemset mining. The algorithm uses vector-space-model retrieval, performs word segmentation on the n documents returned by the initial retrieval, represents the resulting terms in a vertical data format, and obtains itemset support by intersection. It mines maximal frequent itemsets using a set-enumeration tree together with pruning strategies to obtain an expansion lexicon. The expansion terms are then combined with the initial query terms for a second retrieval. Experiments show that efficiency is improved compared with earlier algorithms.
3. The algorithm based on maximal frequent itemset mining assumes that the original query terms and the expansion terms are equally important, and does not consider their weights. In addition, mining only maximal frequent itemsets loses the support information of some frequent itemsets. To address these problems, the thesis proposes a query expansion algorithm based on frequent closed itemsets. It uses the HT-struct linked structure and a depth-first search strategy, combined with pruning techniques, to mine frequent closed itemsets, derive association rules, and obtain the expansion lexicon. It also weights each expansion term by the confidence of the corresponding rule. Experiments show that the algorithm's efficiency is improved and that it is feasible.
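The mining steps named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it enumerates all frequent itemsets depth-first and filters the maximal and closed ones afterwards, whereas the thesis prunes during the search and uses the HT-struct structure, which is not reproduced here. All function names, the toy documents, and the support threshold are hypothetical.

```python
def vertical_format(docs):
    """Map each term to the set of ids of the documents that contain it."""
    tidsets = {}
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            tidsets.setdefault(term, set()).add(doc_id)
    return tidsets


def mine_frequent(tidsets, min_support):
    """Depth-first walk of a set-enumeration tree.

    Support of an itemset is the size of the intersection of its
    terms' document-id sets. Returns {frozenset(itemset): support}.
    """
    frequent = {}
    items = sorted(t for t, tids in tidsets.items() if len(tids) >= min_support)

    def extend(prefix, prefix_tids, candidates):
        for i, item in enumerate(candidates):
            if prefix_tids is None:
                tids = tidsets[item]
            else:
                tids = prefix_tids & tidsets[item]
            if len(tids) < min_support:
                continue  # prune: supersets of an infrequent itemset are infrequent
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            extend(itemset, tids, candidates[i + 1:])

    extend(frozenset(), None, items)
    return frequent


def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return [s for s in frequent if not any(s < t for t in frequent)]


def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of equal support."""
    return {
        s: n for s, n in frequent.items()
        if not any(s < t and frequent[t] == n for t in frequent)
    }


def expansion_weights(frequent, query_terms):
    """Weight each candidate term by the confidence of the rule query -> term."""
    query = frozenset(query_terms)
    if query not in frequent:
        return {}
    weights = {}
    for itemset, support in frequent.items():
        if query < itemset and len(itemset) == len(query) + 1:
            (term,) = itemset - query
            weights[term] = support / frequent[query]
    return weights


# Hypothetical segmented documents returned by an initial retrieval.
docs = [
    ["query", "expansion", "association", "rule"],
    ["query", "expansion", "retrieval"],
    ["query", "association", "rule"],
    ["retrieval", "index"],
]
frequent = mine_frequent(vertical_format(docs), min_support=2)
maximal = maximal_itemsets(frequent)
closed = closed_itemsets(frequent)
weights = expansion_weights(frequent, ["query"])
print(maximal)
print(closed)
print(weights)
```

On this toy data the single-term itemset containing "query" is frequent but not maximal, so its support cannot be recovered from the maximal itemsets alone. It does appear among the closed itemsets, and that support is what the confidence-based weights rely on.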
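[Degree-granting institution]: PLA Information Engineering University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP311.13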

[References]