A Personalized Search Engine System Based on Lucene

Published: 2018-07-13 15:35

[Abstract]: The rapid growth of the Internet has produced an explosion of knowledge, and users facing massive amounts of data must rely on retrieval tools to find the information they need. Demand shapes the market: search applications, with Google and Baidu as the leading examples, emerged in response and changed the Internet. Traditional retrieval technology is now quite mature in both theory and practice, and the open-source community has produced third-party API libraries such as Xapian and Lucene, as well as complete search solutions built on them. This thesis analyzes in detail the principles, components, and workflow of a search engine and introduces the theoretical basis of each module. It focuses on studying and improving the well-known API library Lucene, analyzing its module structure, file formats, indexing process, and result ranking. At present, mainstream search solutions do not support JavaScript, a compromise made in favor of the number of pages crawled and crawl speed. Fast JavaScript interpreters that have appeared in recent years make it possible to address this problem. This thesis introduces a script interpretation engine into the crawler module to improve the crawler's understanding of JavaScript. Following the principle of operator overloading in C++, the JavaScript operators involved in URL assignment are overloaded as set operations, which allows URLs to be extracted from scripts. Comparative experiments were run on an intranet and on the public Internet, and the reasons for success on the intranet and failure on the public Internet are summarized. Link analysis is an important parameter for measuring web page quality. This thesis introduces the PageRank algorithm into Lucene's original scoring formula to make page scores more accurate and thereby improve the quality of search results, and it proposes a simplified calculation method on top of the original power iteration that speeds up the computation. Lucene is well designed and exposes a large number of interfaces in each functional module to support user customization. Using these interfaces, the thesis carries out an engineering implementation and compares it experimentally with the original scoring formula. Finally, the thesis explores search engine personalization in practice. Traditional search engines are generally based on keyword matching, do not make full use of users' personal information, and lack personalization features. The thesis describes how user information is collected and how a user model is built and used; guided by this theory and taking engineering difficulty into account, it designs and implements a simple personalized search module. Experimental results show that the personalized module designed and implemented in the thesis is effective.
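The abstract does not give the modified scoring formula or the simplified PageRank computation, so the following is only a minimal sketch of the standard power iteration that the thesis takes as its starting point. The function names `pagerank` and `combined_score`, the damping factor 0.85, and the multiplicative blend with a `weight` parameter are all assumptions made for illustration; they are not the thesis's method and not part of the Lucene API.

```python
def pagerank(links, damping=0.85, tol=1e-9, max_iter=100):
    """Standard power iteration over a dict mapping page -> list of outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        # Stop once the total change between iterations is below the tolerance.
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


def combined_score(text_score, page_rank, weight=0.5):
    """Hypothetical blend: scale a text relevance score by a link-based boost."""
    return text_score * (1.0 + weight * page_rank)
```

In a setup like this, the ranks would be computed offline over the crawled link graph and stored with each document, so that only the cheap blend in `combined_score` runs at query time.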
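[Degree-granting institution]: China Ship Research and Development Academy (中国舰船研究院)
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3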
