Research and Implementation of a Vertical Search Engine for Electronic Information

Published: 2018-10-14 12:19
[Abstract]: With the rapid growth of the Internet, the volume of online information is increasing exponentially, and search engines have long held a central place among Internet applications. General-purpose search engines bring convenience, but they also make retrieval burdensome in two ways. First, a query returns an enormous number of results, and users must spend considerable time sifting through them to find what interests them. Second, general-purpose search engines do not take users' domain-specific needs into account and return results without differentiation, which makes retrieval inconvenient.

Vertical search engines, one direction in which search engines are developing, focus on search within a single domain and play an important role as industrial and social divisions of labor become increasingly fine-grained. Users have strong demand for information in a particular specialty, and vertical search engines are built to solve the retrieval problem for that class of specialized information. Relying mainly on techniques such as topic-focused crawling, they are more practical than general-purpose search engines for certain domain-specific problems.

After introducing search engines and vertical search engines, this thesis focuses on analyzing the Heritrix web crawler. Heritrix is customized to crawl topic-specific web content; the ELFHash algorithm is introduced so that Heritrix can crawl pages with multiple threads; and the restriction imposed by robots.txt is removed to increase Heritrix's crawl rate.

Lucene is used for indexing and retrieval. Based on an analysis of Lucene's basic architecture, its built-in Chinese word segmentation and ranking are modified. To meet the requirements of an electronic information search engine, a Chinese word segmentation algorithm is designed that combines a specialized electronic information dictionary with statistics, and Lucene's built-in ranking algorithm is modified so that search results better match users' needs. In addition, the content of downloaded web pages is analyzed and processed so that Lucene can index it.

Finally, experiments verify the differences and relative strengths and weaknesses of the vertical search engine compared with general-purpose search engines, the efficiency of the web crawler, and the effectiveness of the Chinese text analysis. The overall test demonstration shows that the system is reasonably reliable and practical, and it offers a useful reference for building vertical search engines.
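The abstract does not say how ELFHash is wired into Heritrix. One plausible reading is that each URL is hashed into one of a fixed number of queues, so that several worker threads can draw from different queues even when most URLs belong to the same host. The Java sketch below illustrates only that idea. The class name `ElfHashQueueKey`, the method `queueKey`, the queue count of 100, and the example URLs are all assumptions made for illustration; none of them comes from the thesis or from the Heritrix API.

```java
// Sketch: ELFHash-based queue assignment for crawl URLs.
// Class name, method names, NUM_QUEUES, and the sample URLs are
// illustrative assumptions, not taken from the thesis or from Heritrix.
final class ElfHashQueueKey {

    // Assumed number of queues to spread URLs over.
    static final int NUM_QUEUES = 100;

    private ElfHashQueueKey() {}

    /** Classic ELF hash over the characters of s, using 32-bit arithmetic. */
    static int elfHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = (hash << 4) + s.charAt(i);
            int high = hash & 0xF0000000;   // isolate the top four bits
            if (high != 0) {
                hash ^= high >>> 24;        // fold them back into the low bits
            }
            hash &= ~high;                  // clear the top four bits
        }
        return hash;                        // non-negative: top four bits are zero
    }

    /** Maps a URL to a queue key in the range "0" .. (numQueues - 1). */
    static String queueKey(String url, int numQueues) {
        if (numQueues <= 0) {
            throw new IllegalArgumentException("numQueues must be positive");
        }
        return String.valueOf(elfHash(url) % numQueues);
    }

    public static void main(String[] args) {
        // Hypothetical URLs from a single host.
        String[] urls = {
            "http://www.example.com/news/1.html",
            "http://www.example.com/news/2.html",
            "http://www.example.com/products/list.html"
        };
        for (String url : urls) {
            System.out.println(queueKey(url, NUM_QUEUES) + " <- " + url);
        }
    }
}
```

Because the key depends on the whole URL rather than on the host name alone, pages from one site can land in different queues. The trade-off is that per-host politeness is no longer enforced by the queue structure and would have to be handled elsewhere.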
[Degree-granting institution]: Xihua University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3



