
Research on Distributed Indexing Technology Based on Compressed Full-Text Self-Indexes

Published: 2018-05-02 10:29

  Topic: distributed full-text index + compressed full-text self-index; Source: Hangzhou Dianzi University, master's thesis, 2015


[Abstract]: Distributed full-text retrieval is one of the core technologies of information processing and is widely used in competitive intelligence, information retrieval, search engines, and information filtering. Studying efficient distributed full-text indexing therefore has both theoretical and considerable commercial value. As the Internet spreads, data of every kind is produced ever faster and the total volume grows exponentially, so the index files built over that data keep growing as well. Traditional single-machine indexing systems can no longer meet the indexing needs of massive data, whereas a distributed indexing system can. The core techniques of a distributed indexing system cover distributed index construction, index querying, data allocation across the distributed index, and load balancing. This thesis applies the compressed full-text self-index, a text-processing technique that has become popular in recent years, to distributed indexing, and discusses query strategies for this index structure. The research covers the following:
(1) Mainstream distributed indexing systems mostly use an inverted index; running on a high-performance cluster, an inverted index can answer queries within milliseconds. Beyond the index itself, however, an inverted index must store additional information to support search-engine functions such as snippet extraction, ranking and positional information, and query caching, which lowers storage efficiency. This thesis applies the compressed full-text self-index, a current focus of text-indexing research, to a distributed indexing system. It proposes a wavelet-tree compression algorithm based on improved Huffman coding and combines it with the suffix array, yielding a compressed full-text self-index structure suited to distributed environments together with an efficient construction algorithm.
(2) An indexing system plays two main roles in a search engine: first, it builds an index over web documents according to certain rules so that they can be queried later; second, it searches the index files in response to a user's query, ranks the retrieved entries according to certain rules, and returns the results to the client. Based on the improved compressed full-text self-index structure, the thesis proposes a query-processing strategy for distributed environments.
(3) Drawing on the work above and on related research, the thesis proposes a distributed full-text indexing system architecture that supports distributed indexing of many kinds of unstructured data, and thereby the querying and indexing of massive unstructured data. The design of the index cluster, the query cluster, and the distributed file system is described in detail, and the efficiency of query processing in the system is then tested.
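The abstract does not spell out the thesis's data structures, so the following is only a minimal sketch of the general idea behind item (1): a Huffman-shaped wavelet tree built over the Burrows-Wheeler transform derived from a suffix array, used to count pattern occurrences by backward search. It assumes plain Huffman coding (not the thesis's "improved" variant, which is not described here), a naive suffix sort, and text that does not contain the `"\0"` sentinel; all names (`huffman_codes`, `HuffmanWaveletTree`, `build_index`, `count_occurrences`) are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Return {symbol: bit string} for a plain Huffman code over text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:                       # single-symbol alphabet
        return {next(iter(freq)): "0"}
    tie = count()                            # tie-breaker: the heap never compares dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]


class HuffmanWaveletTree:
    """Wavelet tree whose shape follows the Huffman code of the sequence."""

    def __init__(self, seq):
        self.codes = huffman_codes(seq)
        self.nodes = {}                      # code prefix -> prefix counts of 1-bits
        self._build(list(seq), "")

    def _build(self, seq, prefix):
        depth = len(prefix)
        # Leaf: the prefix is a complete code, so seq holds a single symbol.
        if not seq or len(self.codes[seq[0]]) == depth:
            return
        ones, left, right = [0], [], []
        for s in seq:
            bit = self.codes[s][depth]
            ones.append(ones[-1] + (bit == "1"))
            (right if bit == "1" else left).append(s)
        self.nodes[prefix] = ones
        self._build(left, prefix + "0")
        self._build(right, prefix + "1")

    def rank(self, symbol, i):
        """Number of occurrences of symbol among the first i elements."""
        code = self.codes.get(symbol)
        if code is None:
            return 0
        prefix = ""
        for bit in code:
            ones = self.nodes[prefix]
            i = ones[i] if bit == "1" else i - ones[i]
            prefix += bit
        return i


def build_index(text):
    """Build (wavelet tree over the BWT, C table, length); assumes no '\\0' in text."""
    text += "\0"                             # sentinel, smaller than any other character
    sa = sorted(range(len(text)), key=lambda k: text[k:])   # naive suffix array
    bwt = "".join(text[k - 1] for k in sa)
    c_table, total = {}, 0                   # c_table[c] = number of characters < c
    for c in sorted(set(text)):
        c_table[c] = total
        total += text.count(c)
    return HuffmanWaveletTree(bwt), c_table, len(text)


def count_occurrences(index, pattern):
    """Count occurrences of a non-empty pattern by backward search."""
    wt, c_table, n = index
    lo, hi = 0, n                            # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in c_table:
            return 0
        lo = c_table[c] + wt.rank(c, lo)
        hi = c_table[c] + wt.rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

In a distributed setting, one plausible arrangement (again an assumption, not the thesis's stated design) is for each index node to hold such a structure for its own shard of the documents and for a query node to sum the per-shard counts.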
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year conferred]: 2015
[Classification number]: TP391.3



