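Design and Implementation of a Nutch-Based Vertical Search Engine for Security Vulnerabilities
Published: 2018-05-09 00:35
Topics: Nutch + vertical search engine; Source: master's thesis, Beijing University of Posts and Telecommunications, 2017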
[Abstract]: More and more people obtain information through the Internet, and given the sheer volume of online information they rely on search engines to find what they need quickly. Traditional search engines crawl the entire Internet. Their coverage is broad, but the results contain a large amount of information the user does not need, which makes for a poor user experience. A vertical search engine retrieves only information from a specific professional domain that the user cares about. Its scope is narrower, but its results are more precise and better match domain-specific retrieval needs.
Study, daily life, and nearly every other activity now depend on the Internet, while leaks of personal and corporate information have become commonplace, so Internet security is receiving growing attention. The large number of security vulnerabilities on the Internet is a major source of network security threats: problems such as host crashes caused by large-scale DDoS attacks on enterprises and leaks of users' personal information mostly stem from security vulnerabilities. Because the resulting risks are substantial, a vertical search engine for security vulnerability information is needed so that people can keep up with the latest vulnerabilities.
Based on a study of vertical search engine technology and the open-source search engine framework Nutch, this thesis designs and implements a Nutch-based vertical search engine system for security vulnerabilities. Its main functional modules are a web crawler, topic-specific information filtering, indexing, retrieval ranking, and a third-party Chinese word segmenter.
The main work of this thesis is as follows:
1. It reviews the development of search engines and the state of research on vertical search engines, examines the techniques behind each module of a vertical search engine, and studies the working principles and plugin mechanism of the open-source Nutch framework.
2. It focuses on the topic filtering module of the vertical search engine. The thesis introduces a classifier to categorize information and thereby restrict search to a specific domain. Because the naive Bayes classifier is inherently limited by its conditional independence assumption, the thesis studies the second-order AODE classifier and, building on it, implements an improved WAODE classification algorithm weighted by the mutual information between attribute variables and the class variable. The WAODE algorithm is then combined with Nutch's plugin mechanism to implement the topic filtering module.
3. It improves Nutch's retrieval ranking model by considering three factors: content relevance, page authority derived from hyperlink analysis, and a time factor. The resulting page ranking score model is validated experimentally.
4. It integrates the third-party Chinese word segmenter mmseg4j into Nutch to provide Chinese word segmentation.
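The abstract names the WAODE classifier but does not give its formulas. As a rough illustration only, the sketch below follows the commonly published form of AODE (averaged one-dependence estimators) in which each attribute takes a turn as "super-parent" and its term is weighted by the mutual information between that attribute and the class. Everything concrete here is an assumption rather than the thesis's implementation: the class name `WAODE`, the binary term-presence features, the Laplace smoothing, the omission of AODE's minimum-frequency threshold, and the toy training data.

```python
import math
from collections import Counter


class WAODE:
    """Weighted AODE over discrete attributes (illustrative sketch only)."""

    def fit(self, X, y):
        self.n = len(X)
        self.d = len(X[0])
        self.classes = sorted(set(y))
        # Distinct values seen per attribute, used for Laplace smoothing.
        self.values = [sorted({row[i] for row in X}) for i in range(self.d)]
        self.c_y = Counter(y)      # count(class)
        self.c_yi = Counter()      # count(class, i, x_i)
        self.c_yij = Counter()     # count(class, i, x_i, j, x_j)
        for row, label in zip(X, y):
            for i, xi in enumerate(row):
                self.c_yi[(label, i, xi)] += 1
                for j, xj in enumerate(row):
                    if j != i:
                        self.c_yij[(label, i, xi, j, xj)] += 1
        weights = [self._mutual_information(i) for i in range(self.d)]
        # Assumption: fall back to unweighted AODE if every weight is zero.
        self.weights = weights if sum(weights) > 0 else [1.0] * self.d
        return self

    def _mutual_information(self, i):
        """Empirical mutual information I(X_i; Y) between attribute i and the class."""
        mi = 0.0
        for v in self.values[i]:
            n_v = sum(self.c_yi[(c, i, v)] for c in self.classes)
            for c in self.classes:
                n_cv = self.c_yi[(c, i, v)]
                if n_cv:
                    mi += (n_cv / self.n) * math.log(n_cv * self.n / (n_v * self.c_y[c]))
        return max(mi, 0.0)  # guard against tiny negative rounding errors

    def predict_proba(self, row):
        scores = {}
        for c in self.classes:
            total = 0.0
            for i, xi in enumerate(row):
                n_ci = self.c_yi[(c, i, xi)]
                # Smoothed joint estimate of P(y = c, x_i).
                p = (n_ci + 1) / (self.n + len(self.classes) * len(self.values[i]))
                for j, xj in enumerate(row):
                    if j != i:
                        # Smoothed estimate of P(x_j | y = c, x_i).
                        p *= (self.c_yij[(c, i, xi, j, xj)] + 1) / (n_ci + len(self.values[j]))
                total += self.weights[i] * p
            scores[c] = total
        norm = sum(scores.values())
        return {c: s / norm for c, s in scores.items()}

    def predict(self, row):
        proba = self.predict_proba(row)
        return max(proba, key=proba.get)


# Hypothetical toy data: each row marks whether a page contains the terms
# ("overflow", "cve", "recipe"); the labels are invented for illustration.
X = [(1, 1, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0),
     (0, 0, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
y = ["vuln", "vuln", "vuln", "vuln", "other", "other", "other", "other"]

model = WAODE().fit(X, y)
print(model.predict((1, 1, 0)), model.predict_proba((1, 1, 0)))
```

In a topic filter of this kind, a page would be kept for indexing only when the predicted class is the target topic. How the thesis wires its classifier into Nutch's plugin mechanism is not described in the abstract, so that step is left out here.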
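[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.3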
Article ID: 1863811
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1863811.html