Research on Index Compression and Query Processing in Search Engines
Published: 2018-04-20 02:19
Topics: inverted index + index compression; Source: PhD dissertation, National University of Defense Technology, 2015
[Abstract]: With the rapid development of Internet technology and the continuing spread of Internet applications, Internet resources have become the largest, richest, and most widely used source of information. To retrieve the desired information effectively from such massive data, search engines have become the best technical means of giving users fast resource location. However, the ever-growing scale of web pages and query traffic confronts search engines with enormous performance challenges, and efficiently processing query requests has become one of the most important research problems in the field. Any gain in query performance can markedly reduce the hardware investment a system needs and improve the user's search experience, laying a solid foundation for running a search engine well. This dissertation therefore studies methods for improving search engine efficiency, attacking the performance bottlenecks of search engine systems from the combined perspective of inverted index compression and query processing. The main contributions are as follows:

(1) To improve the compression performance of search engine index data, this dissertation studies the classic Simple9 index compression algorithm. To address the sparse spacing of compressible-integer counts across Simple9's padding modes, it proposes a dense padding technique (Dense Padding) that lets a single 32-bit machine word hold more integers, raising Simple9's compression ratio. When the relative position of an outlier within one padding mode exceeds the compressible-integer count of the next mode, integer 0s are appended to the end of the sequence being compressed so that the sequence satisfies the current mode; during decompression, a carefully designed algorithm strips the extra 0s. In addition, to cut the excessive bit operations caused by Simple9's small per-word integer counts, it proposes a grouped Simple9 technique (GroupSimple9) that reduces the data reads, branch tests, shifts, and mapping-table lookups performed during compression and decompression, raising Simple9's compression and decompression speed. Experiments show that the proposed GroupSimple9 and DenseSimple9 algorithms deliver clear gains in all three respects: compression ratio, compression speed, and decompression speed (a baseline Simple9 sketch follows the abstract).

(2) To improve the query performance of search engine index data, this dissertation studies the popular MaxScore dynamic pruning algorithm. The current MaxScore algorithm selects candidate documents only from the essential lists (Essential Lists), while postings in the non-essential lists (Non-essential Lists) merely accumulate score contributions; it traverses the essential lists sequentially and applies random access only to the non-essential lists, so it suffers a serious candidate-invalidation problem. To address this, the dissertation first implements a hierarchical self-index structure (Hierarchical Self-index) to support random access into posting lists, then proposes a bidirectional check of each candidate document's maximum possible score, realizing fast skipping over the essential lists (Essential Lists Skipping, ELS; see the second sketch below). Pre-checking and randomly accessing candidates within the essential lists exposes scoring failures as early as possible and avoids useless work on candidates about to be invalidated. Experiments show that the proposed ELS-MaxScore algorithm greatly improves top-k query performance, reaching nearly twice the speed of MaxScore when the query is shorter than four terms.

(3) The preceding analysis suggests that one way to raise query performance is to concentrate candidate selection on the posting lists of the most important query terms. Building on an analysis of the WAND algorithm's weaknesses and of the strengths of Aggressive MaxScore, and to better exploit term importance, this dissertation proposes the Largest Score First (LSF) query algorithm, which processes the posting lists of the most important query terms first. Analysis shows that LSF resolves the random candidate selection and memory-consumption problems of the two exhaustive traversal schemes, document-at-a-time (DAAT) and term-at-a-time (TAAT). To further support efficient dynamic pruning, the dissertation exploits LSF's term-importance ordering to propose two exact dynamic pruning algorithms: LSF-based term omitting (Term Omitting, LSF_TO) and LSF-based partial document scoring (Partial Scoring, LSF_PS; the third sketch below illustrates the early-exit idea). Experiments on TREC GOV2 show that the two LSF-based dynamic pruning algorithms outperform the DAAT-based WAND and MaxScore pruning algorithms.
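To make the compression discussion concrete, below is a minimal sketch of the baseline Simple9 scheme the thesis builds on, assuming the standard layout of a 4-bit selector plus 28 payload bits per 32-bit word. The names `SELECTORS`, `s9_encode`, and `s9_decode` are illustrative, and the thesis's DenseSimple9 and GroupSimple9 variants are not reproduced here; only the tail zero-padding hints at the appended-0s trick that Dense Padding generalizes.

```python
# Mode table: (integers per word, bits per integer); 4 selector bits
# plus count*bits <= 28 payload bits fill one 32-bit word.
SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5),
             (4, 7), (3, 9), (2, 14), (1, 28)]

def s9_encode(nums):
    """Greedily pack non-negative integers (< 2**28) into 32-bit words."""
    words, i = [], 0
    while i < len(nums):
        for sel, (count, bits) in enumerate(SELECTORS):
            chunk = nums[i:i + count]
            # Pad a short tail with 0s so the word is always full; the
            # decoder truncates back to the original length. Appending
            # 0s to complete a mode is also the trick the thesis's
            # Dense Padding builds on.
            chunk += [0] * (count - len(chunk))
            if all(x < (1 << bits) for x in chunk):
                word = sel  # selector occupies the low 4 bits
                for j, x in enumerate(chunk):
                    word |= x << (4 + j * bits)
                words.append(word)
                i += count
                break
        else:
            raise ValueError("value does not fit in 28 bits")
    return words

def s9_decode(words, n):
    """Unpack the first n integers from the encoded words."""
    out = []
    for word in words:
        count, bits = SELECTORS[word & 0xF]
        payload, mask = word >> 4, (1 << bits) - 1
        out.extend((payload >> (j * bits)) & mask for j in range(count))
    return out[:n]  # drop any padding 0s from the last word

# Round-trip check on a small docID-gap sequence.
gaps = [3, 1, 0, 7, 2, 500, 1, 1]
assert s9_decode(s9_encode(gaps), len(gaps)) == gaps
```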
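For the second contribution, the sketch below shows the standard MaxScore partition of query terms into non-essential and essential lists, together with a bisect-based skip that stands in for the random access the thesis's hierarchical self-index provides over compressed posting lists. It sketches the published MaxScore baseline, not ELS-MaxScore itself; `split_essential` and `next_geq` are illustrative names.

```python
import bisect

def split_essential(term_ubs, threshold):
    """Standard MaxScore partition. With per-term score upper bounds
    sorted in ascending order, a document matching only a prefix of
    lists whose bounds sum to <= threshold can never enter the top-k;
    that prefix is 'non-essential', and new candidate documents are
    drawn from the remaining 'essential' lists only."""
    prefix = 0.0
    for i, ub in enumerate(term_ubs):
        prefix += ub
        if prefix > threshold:
            return i  # lists [0, i) non-essential, lists [i, n) essential
    return len(term_ubs)  # no document can beat the threshold

def next_geq(postings, start, doc_id):
    """Jump to the first posting with docID >= doc_id. A bisect over a
    sorted list stands in for the random access that a self-index over
    compressed posting lists would provide."""
    return bisect.bisect_left(postings, doc_id, start)

# With upper bounds (0.4, 0.9, 2.1, 3.0) and threshold 1.5, the first
# two lists are non-essential: 0.4 + 0.9 <= 1.5 < 0.4 + 0.9 + 2.1.
assert split_essential([0.4, 0.9, 2.1, 3.0], 1.5) == 2
assert next_geq([2, 5, 9, 14, 30], 0, 10) == 3  # postings[3] == 14
```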
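For the third contribution, the abstract does not spell out how LSF_PS works internally, so the sketch below only illustrates the generic early-exit idea that ordering posting lists by descending term importance enables: a candidate is abandoned as soon as its accumulated score plus the remaining upper bounds cannot beat the threshold. The `PostingList` interface is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class PostingList:
    """Hypothetical in-memory posting list: docID -> impact score, plus
    the precomputed score upper bound that LSF orders lists by."""
    impacts: dict = field(default_factory=dict)
    max_score: float = 0.0

    def score(self, doc):
        return self.impacts.get(doc, 0.0)  # 0.0 if doc is absent

def partial_score(doc, lists, threshold):
    """Score `doc` against lists sorted by descending max_score, and
    abandon it as soon as even the remaining upper bounds cannot lift
    it past `threshold` (the generic early-exit idea behind partial
    scoring; cf. the thesis's LSF_PS)."""
    remaining_ub = sum(pl.max_score for pl in lists)
    score = 0.0
    for pl in lists:
        remaining_ub -= pl.max_score
        score += pl.score(doc)
        if score + remaining_ub <= threshold:
            return None  # pruned: cannot enter the current top-k
    return score

# Traversing the most important list first makes the exit fire as early
# as possible for documents that match only low-impact terms.
lists = sorted([PostingList({1: 2.5, 7: 3.0}, max_score=3.0),
                PostingList({7: 0.5, 9: 0.25}, max_score=0.5)],
               key=lambda pl: pl.max_score, reverse=True)
assert partial_score(7, lists, threshold=2.0) == 3.5
assert partial_score(9, lists, threshold=2.0) is None
```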
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctorate
[Year conferred]: 2015
[CLC number]: TP391.3
[Similar literature]
Related doctoral dissertations (1 result)
1 Jiang Kun. Research on Index Compression and Query Processing in Search Engines [D]. National University of Defense Technology; 2015.
Article ID: 1775806
Link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1775806.html