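Research and Implementation of a Nutch-Based Vertical Search Engine System for the Medical Domain
Published: 2018-04-26 00:39
Topics: vertical search engine + topic crawler; Source: East China University of Technology (东华理工大学), 2015 master's thesis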
[Abstract]: With the rapid growth of the Internet in recent years, people have more and more ways to obtain information. This flood of information brings great convenience, but it also leaves people at a loss when faced with so much of it. Search engines have greatly eased the problem. However, as the number of web pages grows exponentially, general-purpose search engines find it increasingly hard to improve search efficiency, and vertical search engines, with their highly concentrated information and strong domain knowledge, have become a research hotspot. Vertical search platforms have appeared in many fields, yet the medical and health field, which is closely tied to people's lives and health, still lacks a good one. People learn about the prevention and treatment of diseases mostly from doctors, so the information channel is narrow, and high-quality medical resources are unevenly distributed because of geography, economic development, and other factors. A vertical search engine for the medical field would let people obtain medical information without leaving home, which would help alleviate weak health awareness and weak medical infrastructure in China.
Based on the open-source Nutch search framework, this thesis analyzes and designs the topic crawler module and the information retrieval module of a vertical search engine, and implements a vertical search engine for the medical domain. Building the topic crawler module is an active research topic. The thesis analyzes and experiments with the Fish-Search algorithm, one of the crawling strategies for topic crawlers. It evaluates the overall relevance of each page from both its links and its content and adopts an elastic threshold mechanism, so that medical web pages are crawled and downloaded while the "tunnel phenomenon" is limited. The fetched pages are parsed with web page parsing tools and page segmentation techniques, the extracted text is segmented into Chinese words, and an inverted index is built.
For the ranking of pages in retrieval, the thesis analyzes Lucene's scoring mechanism for search results, improves on the equal distribution of weight in PageRank's weight propagation, and adds a time feedback factor that reduces the inherent advantage of old pages. The improved PageRank algorithm is combined with the vector space model in Lucene to raise the topical relevance and authority of the results while suppressing "topic drift". Finally, the ranked result pages are returned to the user, completing the full workflow of the medical vertical search engine. With this system, users can obtain fairly authoritative medical information quickly and efficiently, which encourages healthy personal behavior and supports a more reasonable and healthy lifestyle.
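The crawling strategy described in the abstract can be illustrated with a small Fish-Search-style sketch in Python. The abstract does not give the thesis's formulas, so everything concrete here is an assumption made for illustration: the keyword set `MEDICAL_TERMS`, the use of anchor text as the link evidence, the `link_weight` value, the rule inside `elastic_threshold`, and the in-memory `TOY_PAGES` graph that stands in for real fetching through Nutch.

```python
from collections import deque

# Hypothetical topic vocabulary; a real system would use a medical lexicon.
MEDICAL_TERMS = {"disease", "treatment", "symptom", "diagnosis", "therapy"}


def content_score(text):
    """Fraction of whitespace-separated words that are medical terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in MEDICAL_TERMS for word in words) / len(words)


def combined_score(anchor_text, page_text, link_weight=0.4):
    """Blend link evidence (anchor text) with page content (assumed weighting)."""
    return (link_weight * content_score(anchor_text)
            + (1.0 - link_weight) * content_score(page_text))


def elastic_threshold(base, scores):
    """Assumed rule: average the base threshold with the mean score seen so far."""
    if not scores:
        return base
    return (base + sum(scores) / len(scores)) / 2.0


def fish_search(pages, seed, max_depth=2, base_threshold=0.2):
    """Breadth-first crawl over pages: url -> (text, [(anchor_text, target_url)]).

    A relevant page resets its children's depth budget to max_depth; an
    irrelevant page passes on its own budget minus one. Children whose
    budget would drop below zero are not followed.
    """
    queue = deque([(seed, "", max_depth)])
    visited, kept, scores = set(), [], []
    while queue:
        url, anchor, budget = queue.popleft()
        if url in visited or url not in pages:
            continue
        visited.add(url)
        text, links = pages[url]
        score = combined_score(anchor, text)
        relevant = score >= elastic_threshold(base_threshold, scores)
        scores.append(score)
        if relevant:
            kept.append(url)
        child_budget = max_depth if relevant else budget - 1
        if child_budget >= 0:
            for child_anchor, child_url in links:
                queue.append((child_url, child_anchor, child_budget))
    return kept


# Toy link graph: page "d" is relevant but reachable only through the
# irrelevant pages "b" and "c".
TOY_PAGES = {
    "seed": ("disease treatment symptom diagnosis",
             [("treatment", "a"), ("news", "b")]),
    "a": ("treatment therapy guide", []),
    "b": ("sports news today", [("more", "c")]),
    "c": ("weather report", [("symptom", "d")]),
    "d": ("symptom diagnosis", []),
}
```

Calling `fish_search(TOY_PAGES, "seed", max_depth=2)` lets the crawler pass through two irrelevant pages to reach `d`, while `max_depth=1` stops before it. In this sketch the depth budget is the knob that bounds how far the crawler follows chains of irrelevant pages.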
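The indexing step mentioned in the abstract (word segmentation followed by an inverted index) can be sketched as follows. This is a minimal illustration, not Lucene's index format. It assumes the text has already been segmented into space-separated terms, so no real Chinese segmenter is involved, and the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}.

    docs: doc_id -> text whose terms are already segmented and
    separated by spaces (an assumption of this sketch).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return sorted ids of documents containing every query term (boolean AND)."""
    postings = [set(index.get(term, {})) for term in query.split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Pre-segmented sample documents (hypertension / treatment / diabetes / symptom).
SAMPLE_DOCS = {
    1: "高血压 治疗 方法",
    2: "糖尿病 治疗",
    3: "高血压 症状",
}
SAMPLE_INDEX = build_inverted_index(SAMPLE_DOCS)
```

A query such as `search(SAMPLE_INDEX, "高血压 治疗")` intersects the posting lists of both terms, which is why the lookup cost depends on the length of those lists and not on the size of the whole collection.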
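The ranking idea in the abstract (PageRank without equal weight distribution, plus a time factor, combined with a vector space model score) can be sketched as below. The abstract does not state the thesis's actual formulas, so the following are assumptions made for illustration: distributing rank in proportion to each target's topic relevance, the half-life form of `freshness`, and the linear blend with weight `alpha` in `final_score`. The `vsm_score` argument stands in for a text similarity score such as the one Lucene computes.

```python
def weighted_pagerank(links, relevance, damping=0.85, iterations=50):
    """PageRank variant that splits a page's rank by target relevance.

    links: page -> list of pages it links to (every target is also a key).
    relevance: page -> topic relevance, assumed to be strictly positive.
    Standard PageRank would give each target an equal share instead.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if not targets:
                # Dangling page: spread its rank uniformly over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
                continue
            total = sum(relevance[target] for target in targets)
            for target in targets:
                share = relevance[target] / total
                new_rank[target] += damping * rank[page] * share
        rank = new_rank
    return rank


def freshness(age_days, half_life_days=365.0):
    """Assumed time factor: halves for every half_life_days of page age."""
    return 0.5 ** (age_days / half_life_days)


def final_score(vsm_score, rank, age_days, alpha=0.7):
    """Assumed blend of text similarity and time-discounted link authority."""
    return alpha * vsm_score + (1.0 - alpha) * rank * freshness(age_days)
```

With a hub page that links to one highly relevant page and one barely relevant page, the relevant target ends up with the larger rank, and of two pages with equal similarity and rank, the newer one gets the higher `final_score`. Weighting by target relevance is one plausible way to curb topic drift, since off-topic pages receive less of the propagated weight.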
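[Degree-granting institution]: East China University of Technology (东华理工大学)
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.3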
Article ID: 1803730
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1803730.html