
Research and Implementation of a Vertical Search Engine Based on Lucene

Published: 2018-05-03 00:09

  Topics: vertical search engine + Lucene; Source: Beijing University of Technology, master's thesis, 2016


[Abstract]: A vertical search engine is a web information retrieval tool aimed at a particular topic or industry. Its index data tends to be structured and its retrieval scope tends to be industry-specific, so it can locate documents relevant to a query quickly and precisely. This thesis studies vertical search engines built on the information retrieval library Lucene. Based on an in-depth study of Lucene's basic ranking algorithm and of currently popular retrieval models, it proposes an improved Lucene ranking algorithm that fuses position relevance with probability ranking. Based on an analysis of the basic working principles and architecture of vertical search engines, it builds a small vertical search engine system for the automobile domain, in which the improved Lucene ranking algorithm provides ranking support for the retrieval module. The main contributions are as follows. First, to reflect how the position of a feature term within a document affects the term's importance, a position-related query weighting algorithm is proposed. It uses the positions and frequencies of query terms in a document to improve the TF-IDF term-weight calculation and obtain position-related query term weights. Second, building on Lucene's basic ranking algorithm, an improved method that fuses position relevance and probability ranking is proposed. Considering the influence of query-term position features on the document relevance score, the position-related query weight is first incorporated into the scoring formula of the ranking algorithm. Then, following the probability ranking principle, a document probability ranking value based on the naive Bayes classification algorithm is incorporated into the scoring formula.
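The abstract does not give the scoring formulas, so the following is only a minimal sketch of the general idea of combining a position-weighted TF-IDF score with a per-document probability. The field boosts in `FIELD_BOOST`, the mixing weight `alpha`, and the `prior` value (standing in for a naive Bayes output) are all hypothetical and are not taken from the thesis or from the Lucene API.

```python
import math
from collections import Counter

# Hypothetical boosts: a term in the title counts more than one in the body.
FIELD_BOOST = {"title": 3.0, "body": 1.0}

def position_weighted_tf(term, doc):
    """Sum term occurrences per field, scaled by the assumed field boost."""
    return sum(FIELD_BOOST[field] * Counter(tokens)[term]
               for field, tokens in doc["fields"].items())

def idf(term, docs):
    """Smoothed inverse document frequency over the whole collection."""
    df = sum(1 for d in docs
             if any(term in tokens for tokens in d["fields"].values()))
    return math.log((len(docs) + 1) / (df + 1)) + 1.0

def fused_score(query_terms, doc, docs, alpha=0.8):
    """Linear mix of position-weighted TF-IDF and a document probability.

    doc["prior"] stands in for a naive Bayes relevance probability;
    alpha is a hypothetical mixing weight.
    """
    text_score = sum(math.sqrt(position_weighted_tf(t, doc)) * idf(t, docs)
                     for t in query_terms)
    return alpha * text_score + (1 - alpha) * doc["prior"]

docs = [
    {"id": "a", "prior": 0.5,
     "fields": {"title": ["sedan", "review"], "body": ["engine", "price"]}},
    {"id": "b", "prior": 0.5,
     "fields": {"title": ["engine", "guide"], "body": ["sedan", "price"]}},
]
ranked = sorted(docs, key=lambda d: fused_score(["engine"], d, docs), reverse=True)
print([d["id"] for d in ranked])  # ['b', 'a']: the title match outranks the body match
```

With equal priors, document `b` ranks first only because the query term appears in its title. Raising a document's `prior` raises its score independently of the text match, which is the role the probability ranking value plays in this sketch.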
Third, a small automobile vertical search engine is built, covering the collection of automobile product information, parsing of web documents, extraction of structured information, construction of index files, and retrieval of relevant documents. The improved Lucene ranking algorithm, which fuses position relevance and probability ranking, is used to rank the retrieval results. Fourth, experiments are designed to compare the search quality of the improved algorithm with that of Lucene's basic ranking algorithm. The results show that, compared with the basic algorithm, the improved algorithm raises retrieval precision considerably, while recall and F-measure remain fairly stable and both improve to varying degrees. The improved ranking algorithm effectively addresses the original algorithm's lack of query position relevance and of theoretical support, and improves retrieval precision. The algorithm is highly independent and reusable, and can provide ranking support for vertical search engines on different topics. The automobile vertical search engine system has a simple architecture and function interfaces, which makes it convenient to update and improve the functions of its modules later.
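The three evaluation measures named above can be computed for a single query as in the sketch below. It assumes the "F value" is the balanced F1 measure, which the abstract does not state. The document IDs are made up for illustration and are not experimental data from the thesis.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical result list and relevance judgments for one query.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
# prints: precision=0.50 recall=0.67 F1=0.57
```

Comparing two ranking algorithms would then mean running the same queries through both and averaging these values per algorithm.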
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.3


Article ID: 1836056


Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1836056.html

