Research and Application of Full-Text Information Retrieval Technology Based on Lucene

Published: 2018-07-07 18:20

Topic: search engine + full-text retrieval; Reference: Jiangnan University, 2012 master's thesis

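[Summary]: With the rapid growth of online information resources, the Internet has gradually become a vast information space. While enjoying the convenience the Internet brings, people are also drowning in a sea of information, and how to extract latent, valuable information from this mass of online data has become an extremely important problem for many Internet users. Under this urgent demand, information retrieval technology and web search engines emerged and became important applications and research topics on the Internet.

Lucene is a full-text retrieval framework on which developers can conveniently build fast, specialized search engines. Although Lucene is powerful and flexibly configurable, as a retrieval framework alone it lacks an information collection module and so cannot provide complete search engine functionality. In addition, the Chinese tokenizer bundled with Lucene cannot segment Chinese words effectively. For these reasons, this thesis takes Lucene as its object of study.

The thesis first analyzes Lucene's overall architecture in depth, including the processes and principles of index building, index file retrieval, and result set ranking.

Next, it introduces web page collection techniques and the web crawler Heritrix, studies Heritrix's architecture and the working principles of its core components in depth, and makes three improvements to the crawler. First, to address complex and redundant downloaded content, URL links are filtered so that the crawler ignores pages that do not meet the conditions, which filters the downloaded content and reduces wasted storage. Second, to address the low crawl rate, part of the source code is modified to remove the restrictions that the robots protocol places on the crawling process, which improves crawl efficiency. Third, Heritrix's default host-name-based queue assignment policy produces overly long crawl queues and blocks some threads when crawling the pages of a single website; a new queue assignment policy based on the ELF hash algorithm is therefore designed to distribute URLs as evenly as possible across the queues, which increases crawl speed. Experiments show that these three optimizations achieve their intended goals.

The thesis describes four common Chinese word segmentation algorithms and three classic dictionary file organizations, compares them, and, after summarizing their respective strengths and weaknesses, designs and implements an improved Chinese tokenizer. The improved tokenizer uses a three-level index dictionary file organization. This organization combines the advantages of the table-based dictionary structure (simple implementation, small space footprint, easy maintenance and updating) with the high entry lookup efficiency of the tree-based dictionary structure, which effectively reduces dictionary size and enables fast entry lookup.

An improved forward maximum matching segmentation algorithm is designed and adopted. Its main process is as follows: traverse the sentence to be segmented from left to right; compute the hash value of the first character and match it in the first-level index; on success, append the next character to the prefix string, compute the length of the new string, and match that length in the second-level index; on success, compute the hash value of the new string and match it in the third-level index; on success, record the length of the string matched so far and continue with the next character, until the longest entry in the index for the current first character has been found. The improved algorithm uses character-by-character matching similar to a TRIE index tree, which eliminates the segmentation blind spots of the traditional forward maximum matching algorithm and avoids repeated fruitless binary searches, improving segmentation efficiency. Time complexity analysis and experiments show that the improved tokenizer increases both the speed and the accuracy of Chinese word segmentation.

Finally, drawing on the above theory, techniques, and algorithms, a full-text information retrieval system is implemented with J2EE architecture, meeting users' need to retrieve information quickly and accurately.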
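The queue assignment idea mentioned in the summary (hashing each URL with ELF hash so that pages from a single host spread over several queues) can be illustrated with a minimal Python sketch. This is not the thesis's code and does not use Heritrix's actual policy interface; the function names `elf_hash` and `queue_index` and the queue count of 7 are assumptions made for illustration only.

```python
def elf_hash(key: str) -> int:
    """Classic 32-bit ELF (PJW-style) string hash."""
    h = 0
    for byte in key.encode("utf-8"):
        h = ((h << 4) + byte) & 0xFFFFFFFF
        high = h & 0xF0000000        # isolate the top four bits
        if high:
            h ^= high >> 24          # fold them back into bits 4-7
        h &= ~high & 0xFFFFFFFF      # then clear the top four bits
    return h


def queue_index(url: str, queue_count: int = 7) -> int:
    """Map a URL to one of `queue_count` queues (hypothetical helper).

    The default of 7 is an assumption. The low four bits of this hash
    equal the low four bits of the last byte of the key, so a queue
    count of 2, 4, 8, or 16 would send every URL ending in the same
    character (for example ".html") to the same queue.
    """
    return elf_hash(url) % queue_count
```

Calling `queue_index` on each discovered URL of one site, instead of keying the queue on the host name, is what lets several worker threads drain different queues at once; how evenly real URLs spread depends on the queue count chosen.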
[Abstract]: With the rapid growth of online information resources, the Internet has become a vast information space, and people are submerged in an ocean of information even as they enjoy the Internet's convenience. How to obtain latent, valuable information from this mass of online data has become an extremely important problem for many Internet users. In response to this urgent need, information retrieval technology and web search engines emerged and have become important applications and research topics on the Internet.

Lucene is a full-text search framework on which developers can readily build specialized search engines. Although Lucene is powerful and flexible, as a search framework alone it lacks an information collection module and cannot provide complete search engine functionality. In addition, Lucene's bundled Chinese tokenizer cannot segment Chinese words effectively. Lucene is therefore selected as the research object.

The thesis first analyzes Lucene's overall architecture, including the processes and principles of index building, index file retrieval, and result set ranking.
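To make the three stages named above concrete (building an index, looking terms up in it, and ranking the hits), here is a minimal Python sketch of an in-memory inverted index. It is a teaching aid only: the class name `TinyIndex`, whitespace tokenization, and the simple TF-IDF score are assumptions, and none of it reflects Lucene's actual file formats, API, or scoring formula.

```python
import math
from collections import Counter, defaultdict


class TinyIndex:
    """Toy inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        # Indexing: split the text into terms and record their frequencies.
        self.doc_count += 1
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        # Retrieval: look up each query term's posting list.
        scores = defaultdict(float)
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Ranking: rarer terms weigh more (a simple TF-IDF variant).
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        # Highest score first; ties broken by document id.
        return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, after adding "lucene full text search" as document 1 and "lucene lucene index" as document 3, a query for "lucene" ranks document 3 ahead of document 1 because the term occurs twice there.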

The thesis describes four common Chinese word segmentation algorithms and three classical dictionary file organization methods, and compares them.
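Forward maximum matching is one of the dictionary-based segmentation methods in this family. As a rough illustration, the Python sketch below walks a small trie character by character and keeps the longest dictionary entry found at each position. It is not the thesis's three-level index dictionary; the trie layout, the `"$end"` marker, the function names, and the toy word lists are all assumptions.

```python
def build_trie(words):
    """Build a nested-dict trie; "$end" marks a complete dictionary entry."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True
    return root


def segment(text, trie):
    """Forward maximum matching: take the longest entry at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        node, longest = trie, 1      # fall back to a single character
        for offset in range(pos, len(text)):
            node = node.get(text[offset])
            if node is None:         # no entry continues with this character
                break
            if "$end" in node:       # remember the longest match so far
                longest = offset - pos + 1
        tokens.append(text[pos:pos + longest])
        pos += longest
    return tokens
```

With the toy dictionary 研究, 研究生, 生命, 起源, the input 研究生命起源 is split as 研究生 / 命 / 起源. This shows both the longest-match behavior and a well-known weakness of purely greedy forward matching, since the intended reading is 研究 / 生命 / 起源.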

Finally, based on the above theory, techniques, and algorithms, a full-text information retrieval system is implemented with J2EE architecture, meeting users' need to retrieve information quickly and accurately.
[Degree-granting institution]: Jiangnan University
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP391.3

[References]