基于分区的倒排索引压缩算法研究

发布时间：2018-11-29 09:38

【摘要】：倒排索引是目前应用最为广泛的全文索引技术,是现代搜索引擎的核心技术。现在互联网上文本数据呈现爆炸式增长,为这些文本数据构造的倒排索引也需要越来越多的存储空间,压缩倒排索引不仅可以节省磁盘空间,同时也可以减少查询时所需要的磁盘和内存访问次数。因此,倒排索引压缩的研究对于海量文本数据的全文检索有很重要的意义。压缩倒排索引主要是压缩倒排索引中的docID序列、frequency序列,这两类序列的压缩最终可以转化为对非负单调递增整数序列的压缩。基于分区的Elias-Fano(PEF)编码压缩算法提出分区二级数据结构,对整数序列进行分区块压缩,显示出良好的压缩性能,本文针对整数序列"分区"压缩的思想深入研究,主要工作如下:(1)证明了 Golomb-Rice编码的压缩性能优于Elias-Fano编码的压缩性能,并且对于给定n个元素、上界为u的非负单调递增整数序列,本文提出一种为该序列直接确定Golomb-Rice编码参数b的方法。(2)结合Golomb-Rice编码优秀的压缩性能和整数序列"分区"压缩的思想,提出了基于分区的Elias-Fano-Golomb-Rice(PEFGR)编码压缩算法和基于分区的Golomb-Rcie(PGR)编码压缩算法。PEFGR编码压缩算法将整数序列划分为若干区块,构造一种二级数据结构:第一级结构存储整数序列每个区块边界元素组成序列的Elias-Fano编码结果;第二级结构存储整数序列每个区块内部元素的Golomb-Rice编码结果。PGR编码压缩算法在PEFGR编码压缩算法的基础上,将其第一级结构存储的元素也改用压缩性能更好的Golomb-Rice编码,进一步提高算法的压缩性能。(3)实现了本文提出的PEFGR编码压缩算法和PGR编码压缩算法,同时实现了 Golomb-Rice编码压缩算法、Elias-Fano编码压缩算法、PEF编码压缩算法、Simple-16编码压缩算法、OptPFD编码压缩算法用于对比。实验数据集采用的是Clueweb12和wikipediaXML。实验结果验证了 Golomb-Rice编码的压缩性能优于Elias-Fano编码的压缩性能;实验结果表明出这些压缩算法中PGR编码压缩算法的压缩性能最优,PEFGR编码压缩算法次之。在docID方面,PGR编码压缩算法比PEF编码压缩算法平均每个整数至少节省0.3个bit,比Simple-16编码压缩算法、OptPFD编码压缩算法节省更多的bits。
[Abstract]:Inverted index is the most widely used full-text index technology and the core technology of modern search engine. Now text data on the Internet is growing explosively, and the inverted index for these text data also needs more and more storage space. Compression of inverted index can not only save disk space, It can also reduce the number of disk and memory access required when querying. Therefore, the research of inverted index compression is very important for full-text retrieval of massive text data. Compressed inverted index is mainly composed of docID sequence and frequency sequence in inverted index. The compression of these two kinds of sequences can be transformed into compression of non-negative monotone increasing integer sequence. Based on the partition Elias-Fano (PEF) coding compression algorithm, the partitioned two-level data structure is proposed, and the integer sequence is compressed into blocks, which shows good compression performance. In this paper, the idea of "partitioning" compression of integer sequences is deeply studied. The main works are as follows: (1) it is proved that the compression performance of Golomb-Rice coding is better than that of Elias-Fano coding, and for a given n elements, the upper bound is u, which is a nonnegative monotone increasing integer sequence. In this paper, a method for determining the Golomb-Rice coding parameter b for the sequence is proposed. (2) combining the excellent compression performance of Golomb-Rice coding and the idea of integer sequence "partition" compression, The Elias-Fano-Golomb-Rice (PEFGR) coding compression algorithm based on partition and Golomb-Rcie (PGR) coding compression algorithm based on partition are proposed. The integer sequence is divided into several blocks by PEFGR coding compression algorithm. A secondary data structure is constructed: the first level structure stores the Elias-Fano coding result of each block boundary element composition sequence of integer sequence; The second level structure stores the Golomb-Rice coding result of the element inside each block of integer sequence. Based on the PEFGR coding compression algorithm, the PGR coding algorithm also converts the elements stored in the first level structure to Golomb-Rice coding with better compression performance. Further improve the compression performance of the algorithm. (3) the implementation of the proposed PEFGR coding compression algorithm and PGR coding compression algorithm, at the same time the implementation of Golomb-Rice coding compression algorithm, Elias-Fano coding compression algorithm, PEF coding compression algorithm, Simple-16 coding compression algorithm, OptPFD coding compression algorithm for comparison. The experimental data set uses Clueweb12 and wikipediaXML. The experimental results show that the compression performance of Golomb-Rice coding is better than that of Elias-Fano coding, and the experimental results show that the compression performance of PGR coding algorithm is the best, followed by PEFGR coding compression algorithm. In terms of docID, the PGR coding compression algorithm saves at least 0.3 bit, per integer on average than the PEF coding compression algorithm, and the OptPFD coding compression algorithm saves more bits. than the Simple-16 coding compression algorithm.
【学位授予单位】：北京交通大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP391.3

【参考文献】