A Survey of Document Identifier Reordering Techniques for Inverted Indexes
Published: 2018-08-05 12:40
[Abstract]: As the core indexing technique for text search, the inverted index is widely used in search engines, desktop search, and digital libraries. An inverted index consists of a dictionary and the corresponding posting lists, which are usually compressed by storing the differences (gaps) between consecutive document identifiers and then applying integer coding. Studies show that these methods achieve a high compression ratio when the posting lists exhibit good local continuity. Research on integer coding keeps improving the coding algorithms to exploit this local continuity, whereas document identifier reordering is a technique that creates local continuity by reassigning document identifiers. Reordering significantly improves the index compression ratio. This paper surveys recent research results on document identifier reordering: it first introduces the basic principles of index compression, then describes document identifier reordering techniques in detail, analyzing and comparing the strengths and weaknesses of each method, and finally summarizes the techniques and discusses future directions.
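The effect described in the abstract can be illustrated with a minimal sketch. It assumes variable-byte coding as the integer code (the abstract does not name a specific code), and the posting list, the `new_id` mapping, and all function names are hypothetical, chosen only for illustration. It is not one of the reordering methods surveyed in the paper; it only shows why smaller gaps cost fewer bytes.

```python
def gaps(doc_ids):
    """Delta-encode a sorted posting list: keep each ID's distance to the previous one."""
    prev = 0
    out = []
    for d in doc_ids:
        out.append(d - prev)
        prev = d
    return out


def vbyte_encode(numbers):
    """Variable-byte code: 7 payload bits per byte, high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, continuation byte
            n >>= 7
        out.append(n | 0x80)      # final byte of this number
    return bytes(out)


def compressed_size(doc_ids):
    """Bytes needed for one posting list after gap storage and variable-byte coding."""
    return len(vbyte_encode(gaps(sorted(doc_ids))))


def remap(doc_ids, new_id):
    """Apply a document identifier reordering (a mapping old ID -> new ID)."""
    return sorted(new_id[d] for d in doc_ids)


# Hypothetical posting list whose documents are scattered across the ID space.
scattered = [3, 1000, 20000, 300000, 4000000]

# Hypothetical reordering that gives these similar documents consecutive IDs.
new_id = {3: 10, 1000: 11, 20000: 12, 300000: 13, 4000000: 14}
clustered = remap(scattered, new_id)

print(gaps(scattered), compressed_size(scattered))
print(gaps(clustered), compressed_size(clustered))
```

In a real index the mapping must be a permutation over all documents and is applied to every posting list at once, so a reordering that helps one term may hurt another; choosing a good global permutation is the problem the surveyed methods address.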
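[Author affiliations]: National Computer Network Emergency Response Technical Team/Coordination Center of China; Institute of Information Engineering, Chinese Academy of Sciences
[Funding]: National Key Basic Research Program of China (973 Program) (2011CB302605); National Key Technology R&D Program (2012BAH47B04)
[Classification number]: TP391.3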
Article ID: 2165825
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2165825.html