Research and Implementation of a Book-Oriented Vertical Search Engine

Published: 2018-07-05 05:04

Topics: vertical search engine + Shark-Search; Source: Beijing University of Technology, master's thesis, 2014

【Abstract】: The Internet has become an important repository of information resources, and users rely on the retrieval services provided by search engines to find the information they want. Traditional general-purpose search engines meet users' basic search needs, but because they cover such a broad range, the results they return contain a large amount of information the user does not care about. Users must then filter the results further, and this extra step degrades the search experience. Vertical search engines address this shortcoming: compared with general-purpose search engines, they narrow the scope of retrieval to a single domain or topic on the web, which ensures at the data source that what users retrieve is what they care about. Vertical search engines also process cluttered web information, extracting the main parts and presenting them in a structured form so that users can quickly find the most important information.

The thesis first introduces the basic concepts and classification of search engines and then describes how they work. By comparing the working principles of general-purpose and vertical search engines, it introduces and analyzes the key technologies of vertical search engines, such as focused (topic-specific) web crawlers and topic similarity judgment. The main work of the thesis is as follows: (1) hyperlinks on the same topic are similar in URL structure, and based on this property the traditional content-based Shark-Search focused crawling algorithm is improved so that the structure of a child URL is taken into account when predicting its priority score; (2) page similarity computation under the vector space model is analyzed, and a secondary topic judgment method is proposed to obtain more high-quality topic-relevant pages; (3) based on how book metadata is distributed within web pages, a semi-automatic metadata extraction algorithm is designed using the parsing tool HTMLParser; (4) a prototype of a book-oriented vertical search engine is implemented with the full-text indexing toolkit Lucene, and the default ranking of Lucene search results is given a custom extension.

Finally, the focused crawling algorithm implemented in the thesis is evaluated experimentally. It performs well on well-structured sites where topic pages are relatively concentrated, because the similarity among URLs on the same topic is more pronounced on such sites. The prototype book-oriented vertical search system is also validated: compared with a general-purpose search engine it returns more precise results, and the custom extension to Lucene's default ranking yields a more reasonable ordering of results.
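The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea of blending a content-based relevance score with a URL-structure score when ranking a child link. It is not the thesis's algorithm and not the full Shark-Search scoring (inherited scores, decay, and link context are omitted). The function names, the prefix-based URL similarity measure, and the `url_weight` parameter are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Vector space model: cosine between raw term-frequency vectors."""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def url_structure_similarity(url_a, url_b):
    """Hypothetical measure: share of leading path segments two URLs have
    in common on the same host; 0.0 when the hosts differ."""
    a, b = urlparse(url_a), urlparse(url_b)
    if a.netloc != b.netloc:
        return 0.0
    seg_a = [s for s in a.path.split("/") if s]
    seg_b = [s for s in b.path.split("/") if s]
    if not seg_a and not seg_b:
        return 1.0
    shared = 0
    for x, y in zip(seg_a, seg_b):
        if x != y:
            break
        shared += 1
    return shared / max(len(seg_a), len(seg_b))


def child_priority(anchor_text, child_url, topic, relevant_urls, url_weight=0.3):
    """Assumed blend: weighted sum of anchor-text relevance and the best
    structural match against URLs already judged on-topic."""
    content = cosine_similarity(anchor_text, topic)
    structure = max(
        (url_structure_similarity(child_url, u) for u in relevant_urls),
        default=0.0,
    )
    return (1 - url_weight) * content + url_weight * structure
```

With this sketch, a link such as `http://example.com/book/456` ranks above `http://example.com/news/456` when `http://example.com/book/123` is already known to be on-topic and both links carry the same anchor text, which mirrors the intuition that same-topic pages on a well-structured site share URL patterns.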
【Degree-granting institution】: Beijing University of Technology
【Degree level】: Master's
【Year degree granted】: 2014
【Classification number】: TP391.3

【References】