Research and Implementation of Web Page Deduplication in Search Engine Systems
Posted: 2018-10-30 13:37
【Abstract】: With the rapid development of computer hardware, software, and Internet technology, information on the Web has grown explosively, making the Internet the largest and most diverse collection of information resources in human history. When users look for information online, however, they usually know only a few search keywords rather than a specific URL, so they rely on search engines to find what they need. Search engines such as Baidu (focused on Chinese) and Google (covering many languages) save users time and are widely used. Yet some websites, driven by commercial interests and the wish to raise their click-through rates, reprint other sites' articles in bulk; good articles are also copied between blogs and forums, and breaking events or popular topics are reported and reprinted by many sites at once. As a result, search results often contain many pages with different URLs but identical content, which degrades the user experience: users must sift through near-identical results to find the information they need, and the duplicate pages inflate the index database. Removing duplicate web pages is therefore one way to improve the usability and efficiency of a search engine.

This thesis first implements main-content extraction based on the maximum text block algorithm over HTML tags, and on that basis proposes a page deduplication algorithm based on keywords and signatures. An experimental system was developed to validate the algorithm, and analysis of the experimental results demonstrates its effectiveness.

The main work of this thesis is as follows:
1. Theoretical study: the working principles and key technologies of search engines are analyzed, along with several classical deduplication algorithms, from text similarity detection to web page similarity detection.
2. Main-content extraction: web page deduplication is not the same as plain-text deduplication; the page's main content must first be extracted, stripping noise such as navigation bars, advertisements, and copyright notices. Building on the maximum text block algorithm based on HTML tags and taking various page types into account, an extraction algorithm is designed and implemented (an illustrative sketch follows this abstract).
3. Algorithm improvement: on top of the extracted main content, three classical deduplication approaches (signature-based, feature-sentence-based, and the KCC algorithm) are examined, and their strengths are combined into a deduplication algorithm based on keywords and signatures (sketched below). The algorithm is simple and efficient, can recognize pages that were slightly modified when reprinted, and improves deduplication accuracy.
4. Design and implementation: a simple single-machine search engine is built on the open-source framework Lucene, with the keyword-and-signature algorithm embedded in its deduplication module (see the Lucene sketch below). The system can crawl pages as needed, deduplicate them, index the remaining pages, and return relevant results for user queries.
5. Experimental analysis: with the proposed algorithm embedded in the search engine, a crawled data set of 900 pages containing duplicates is deduplicated; analysis of the results demonstrates the effectiveness of the improved algorithm.
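For illustration, the following is a minimal sketch of the kind of "maximum text block" content extraction referred to in point 2. It is not the thesis's actual algorithm (which also handles various page types); the class name, the block-level tag list, and the plain-text-length heuristic are assumptions made only for this example.

```java
import java.util.regex.Pattern;

/**
 * A minimal, illustrative sketch of "largest text block" content extraction.
 * Idea only: strip noise tags, cut the page into blocks at block-level tag
 * boundaries, and keep the block with the most plain text as the assumed
 * main content. Not the thesis's exact algorithm.
 */
public class MaxTextBlockExtractor {

    public static String extractMainContent(String html) {
        // Drop script/style/comment sections, which never hold main content.
        String cleaned = html
                .replaceAll("(?is)<script.*?</script>", " ")
                .replaceAll("(?is)<style.*?</style>", " ")
                .replaceAll("(?s)<!--.*?-->", " ");

        // Split at common block-level tags; each piece is a candidate block.
        String[] blocks = cleaned.split("(?i)</?(div|p|td|li|table|section|article)[^>]*>");

        String best = "";
        for (String block : blocks) {
            // Remove remaining inline tags and collapse whitespace.
            String text = block.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
            if (text.length() > best.length()) {
                best = text;   // keep the block with the largest amount of text
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String page = "<html><body><div>nav | home | about</div>"
                + "<div><p>This long paragraph is the real article body of the page,"
                + " so the extractor should return it.</p></div>"
                + "<div>copyright 2011</div></body></html>";
        System.out.println(extractMainContent(page));
    }
}
```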
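The keyword-and-signature deduplication idea of point 3 can be sketched as follows. The abstract does not spell out the exact feature selection or hashing scheme, so the whitespace tokenization, the TOP_K cutoff, and the choice of MD5 below are assumptions; a real implementation for Chinese pages would also need a word segmenter and a stop-word list.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * A hypothetical sketch of keyword-plus-signature duplicate detection.
 * Pages whose signatures collide are treated as (near-)duplicates.
 */
public class KeywordSignature {

    private static final int TOP_K = 10;   // assumed number of keywords per page

    public static String signature(String mainText) throws Exception {
        // Very rough tokenization over the extracted main text.
        Map<String, Integer> freq = new HashMap<>();
        for (String token : mainText.toLowerCase().split("\\W+")) {
            if (token.length() > 2) {
                freq.merge(token, 1, Integer::sum);
            }
        }

        // Rank by frequency (ties broken alphabetically for determinism),
        // keep the top TOP_K words, then sort them so that word-order changes
        // between reprints do not affect the signature.
        Comparator<Map.Entry<String, Integer>> byFreqThenWord =
                Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey());
        List<String> keywords = freq.entrySet().stream()
                .sorted(byFreqThenWord)
                .limit(TOP_K)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());

        // Hash the joined keyword list into a compact, fixed-length signature.
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(String.join("|", keywords).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String original = "Many sites reprint popular articles so search engines must remove duplicate pages.";
        String reprint  = "Search engines must remove duplicate pages: many sites reprint popular articles!";
        // Reordering and punctuation changes leave the keyword set, and hence
        // the signature, unchanged, so this prints "true".
        System.out.println(signature(original).equals(signature(reprint)));
    }
}
```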
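Finally, a rough sketch of how such a deduplication step could sit in front of a Lucene index, in the spirit of the single-machine system described in point 4. It is written against a modern Lucene API (assumed 8.x/9.x), not the 3.x releases current in 2011; DedupIndexer and indexIfNew are hypothetical names, and KeywordSignature is the class from the previous sketch.

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

/** Illustrative only: skip indexing any page whose content signature was already seen. */
public class DedupIndexer {

    private final Set<String> seenSignatures = new HashSet<>();
    private final IndexWriter writer;

    public DedupIndexer(Directory dir) throws Exception {
        writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    }

    /** Index a page only if its content signature has not been seen before. */
    public boolean indexIfNew(String url, String mainText) throws Exception {
        String sig = KeywordSignature.signature(mainText);
        if (!seenSignatures.add(sig)) {
            return false;                       // duplicate: skip indexing
        }
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));
        doc.add(new TextField("content", mainText, Field.Store.YES));
        writer.addDocument(doc);
        return true;
    }

    public void close() throws Exception {
        writer.close();
    }

    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        DedupIndexer indexer = new DedupIndexer(dir);
        indexer.indexIfNew("http://a.example/1", "search engines must remove duplicate pages");
        indexer.indexIfNew("http://b.example/1", "search engines must remove duplicate pages");  // skipped
        indexer.close();

        // Query the de-duplicated index: only one hit survives.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query q = new QueryParser("content", new StandardAnalyzer()).parse("duplicate pages");
        ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
        System.out.println("hits: " + hits.length);   // prints "hits: 1"
    }
}
```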
【Degree-granting institution】: Henan University (河南大学)
【Degree level】: Master's
【Year of award】: 2011
【CLC number】: TP393.092
Article No.: 2300151
【Cited by】
Related journal articles (1)
1. 程們森, 安俊秀. A feature-word-group based algorithm for identifying duplicate and near-duplicate news web pages [J]. 成都信息工程学院学报, 2012(04).
Related master's theses (1)
1. 张芳. Research on web page deduplication technology in a campus network search engine [D]. 内蒙古科技大学, 2012.
Link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/2300151.html