Research on Web Information Search Technology Based on Meta-Search

Published: 2019-01-23 08:57

[Abstract]: With the spread and growth of the Internet, the volume of online information increases daily. This information includes not only text but also images, audio, and video. How to quickly and accurately filter and organize the information a user needs from these network resources has become an active research topic in information retrieval. Data mining, also known in artificial intelligence as knowledge discovery, analyzes existing data to find common patterns in massive data sets and presents the patterns it finds. Web information search technology is an extension of data mining to the Internet.

The earliest search engines were built by manual inclusion, with Yahoo as the representative example. In this approach, people collect, filter, and classify information from the Internet and then add the organized results to the site. Because manual maintenance is expensive and users differ in their knowledge backgrounds, this approach cannot meet users' varied needs. As data mining technology developed, automated search engines emerged. Such an engine uses web robot programs to follow the links among data on the Internet and crawl it, building an index of the information. It also gives users a retrieval platform on which they can search by keyword.

Search engines can be divided into full-text search engines, directory search engines, meta-search engines, and others. A meta-search engine is a further extension of the web search engine: from a single user interface, the user can choose to run a keyword query against several search engines. The defining feature of meta-search is that it calls other search engines independently and fuses information across engines, which meets the user's need to integrate information quickly. Compared with a traditional search engine, a meta-search engine can obtain more accurate and more comprehensive information.

This thesis systematically describes the principles and research status of Web information extraction technology and introduces its key steps. It focuses on the workflow and key technologies of search engines and studies meta-search in depth. The main work of the thesis is as follows:
(1) It describes the research background of Web information extraction technology, along with its classification and steps.
(2) It introduces Web information extraction models, the HTML language, and the DOM document object.
(3) It introduces the Struts, Spring, and Hibernate frameworks that make up SSH, and analyzes the structural information of websites.
(4) It summarizes the background, classification, and key technologies of search engines, and designs and implements a meta-search engine using AJAX, HTML Parser, and related technologies.
(5) It analyzes and compares the results returned by the search engines.
(6) It tests the search engine program.
Building on existing search engine technology, this research offers some reference for achieving better meta-search and for developing better web information retrieval tools.
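The cross-engine fusion that the abstract attributes to meta-search can be illustrated with a small sketch. The abstract does not say how the thesis merges results, so the code below substitutes reciprocal rank fusion, a common merging heuristic, purely as an assumption. The names `meta_search`, `normalize`, and `Engine`, the constant `k = 60`, and the model of each engine as a function returning a ranked list of URLs are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from typing import Callable, Dict, List
from urllib.parse import urlsplit

# Assumption: each underlying engine is modeled as a function that maps a
# query string to a ranked list of result URLs (best result first).
Engine = Callable[[str], List[str]]


def normalize(url: str) -> str:
    """Build a comparison key so trivially different URL forms deduplicate.

    The scheme is dropped, the host is lowercased, and a trailing slash on
    the path is removed. This is a deliberately simple, hypothetical rule.
    """
    parts = urlsplit(url.strip())
    key = parts.netloc.lower() + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def meta_search(query: str, engines: Dict[str, Engine], k: int = 60) -> List[str]:
    """Send one query to every engine and merge the ranked lists.

    Merging uses reciprocal rank fusion: a URL at rank r in one engine's
    list contributes 1 / (k + r) to its total score. This is a stand-in
    technique, not necessarily the method used in the thesis.
    """
    scores: Dict[str, float] = defaultdict(float)
    first_seen: Dict[str, str] = {}
    for engine in engines.values():
        for rank, url in enumerate(engine(query), start=1):
            key = normalize(url)
            scores[key] += 1.0 / (k + rank)
            # Remember the first spelling of the URL for display.
            first_seen.setdefault(key, url)
    # Highest fused score first; ties keep first-seen order (stable sort).
    ordered = sorted(scores, key=lambda key: -scores[key])
    return [first_seen[key] for key in ordered]
```

In this sketch, a URL returned by several engines accumulates score from each of them, so results that engines agree on tend to rise in the merged list. A real meta-search engine would replace the stub functions with HTTP requests and result-page parsing, which the sketch leaves out.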
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP311.52
