Design and Implementation of a MySQL-Based News Search Engine

Published: 2018-03-09 14:47

Topic: information retrieval  Focus: web crawler  Source: Fudan University, master's thesis, 2013  Document type: degree thesis

[Abstract]: With the rapid development of modern information technology, both the volume and the variety of information on the Internet are growing explosively. This brings great convenience to people's daily life, work, and study, but it also raises new problems: how to manage this massive body of information in a unified way, how to index such scattered resources, and how to retrieve exactly the information one needs from them. Search engines are the key technology for solving these problems. A traditional general-purpose search engine, however, collects every kind of information on the Web and serves users at every level, and this attempt to cover everything yields fewer and fewer breakthroughs as the volume of information grows. An ordinary user's interests, by contrast, are fairly concentrated in both depth and breadth. The concept of the specialized (vertical) search engine, aimed at a specific domain and specific needs, emerged in response. Unlike a general-purpose engine, a specialized search engine collects only Web information related to a given topic: rather than accepting every page it encounters, it analyzes whether a page's content is relevant to the topic and processes only the relevant pages further. As a result, specialized search engines improve markedly on general-purpose ones in both resource consumption and query accuracy.

The work in this thesis targets specialized search engines, with news as the search topic. Through theoretical study and hands-on practice with the key technologies of search engines, the research aims to deepen understanding of the field. The news search engine described here uses the Sina News website as the crawler's entry point and collects news pages from it in a targeted way. Page collection is performed by a dedicated news crawler: starting from the news homepage, it extracts the news link addresses, stores them in a queue of URLs waiting to be crawled, and traverses the website with a three-level depth-first search. The crawler then cleans the collected pages and extracts the useful information, and finally the indexer builds the core data structure of the search engine, the inverted index. Because a search engine ultimately serves ordinary users, designing a query interface with a good user experience for news search is also an essential task.

The thesis describes in detail the design and implementation of the web crawler, the cleaning of web pages and information extraction, and the construction of the index. These techniques are current research topics in natural language processing and artificial intelligence, and studying them deepened the author's professional skills. Starting from the simplest techniques, this news-oriented search engine implements, step by step, the key modules of the large and complex system that a search engine is. The experimental results show that the system achieves a certain level of accuracy and performs well.
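The pipeline outlined in the abstract (depth-limited crawl, page cleaning, inverted index, query) can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and it rests on several assumptions. The names `LinkAndTextParser`, `crawl`, `build_inverted_index`, `search`, and the toy `SITE` dictionary are all hypothetical. The "three-level" limit is read as "the homepage plus two levels of links". Pages come from an in-memory dictionary rather than from Sina News. The index lives in a Python dict, whereas the title suggests the thesis stores its data in MySQL. Tokenization uses a plain word regex, which would not segment Chinese text properly.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collects link targets and visible text, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def crawl(start_url, fetch, max_depth=3):
    """Depth-first traversal that visits pages at most max_depth levels deep."""
    pages = {}
    stack = [(start_url, 1)]  # a LIFO stack gives depth-first order
    while stack:
        url, depth = stack.pop()
        if url in pages:
            continue
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        pages[url] = " ".join(parser.text_parts)  # cleaned page text
        if depth < max_depth:
            # Push in reverse so the first link on the page is visited first.
            for link in reversed(parser.links):
                if link not in pages:
                    stack.append((link, depth + 1))
    return pages


def build_inverted_index(pages):
    """Maps each lowercased term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index


def search(index, query):
    """Returns the URLs that contain every term of the query (boolean AND)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Toy site standing in for a real news website (hypothetical data).
SITE = {
    "/": '<a href="/a">A</a><a href="/b">B</a><p>news home</p>',
    "/a": '<a href="/a1">A1</a><p>football match report</p>',
    "/b": '<p>stock market report</p><script>var x = "football";</script>',
    "/a1": '<a href="/a2">A2</a><p>football transfer rumour</p>',
    "/a2": "<p>too deep football</p>",
}

pages = crawl("/", SITE.get)
index = build_inverted_index(pages)
print(sorted(search(index, "football report")))
```

In this toy run the page `/a2` sits four levels deep and is never fetched, and the word "football" inside the `<script>` block of `/b` is not indexed because the parser discards script content. A real crawler would additionally need URL normalization, politeness delays, and a topic-relevance filter of the kind the abstract mentions.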
[Degree-granting institution]: Fudan University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3

[References]