Research on Policy-Based Web Information Extraction Technology
Published: 2019-05-18 10:43
[Abstract]: Research on information collection predates the information age, but since its arrival information resources have received unprecedented attention, and in some application domains their collection is especially important. The rapid development of the Internet and the explosive growth of online information resources have made information far easier to exploit. At the same time, as these resources grow richer, the workload of collecting them increases day by day, and their disorder and dispersion across the web hinder the collection effort. Information extraction addresses this by gathering such information, formatting it, and storing it so that it can be conveniently queried and used.

This thesis targets the problem of web information extraction, taking web information acquisition and text information extraction techniques as its main research objects. Building on an in-depth analysis of web search principles and information extraction technology, it discusses, designs, and implements a web information extraction software system. The main work is as follows:

1. Web search principles and information extraction techniques are studied, and a method for extracting information from web pages is proposed. The method first uses a web crawler, as employed in web search, to fetch pages from the Internet, then analyzes the page content and, guided by user-configured extraction policies based on information formats, retrieves the information the user expects.

2. Web crawler technology is studied, including the working principles of URL de-duplication. The presentation of web pages, their transfer protocol (HTTP), and their markup language (HTML) are examined, and mature regular-expression text processing is applied to analyze and extract information expressed in hypertext markup. The operation of commercial search engines is analyzed, and a method for invoking them is proposed.

3. A policy-based web information extraction software system is designed and implemented. The software builds its extraction policies on regular expressions and extracts from page content whatever matches those policies; it provides a policy-configuration interface so that policies can be adjusted as needed; it implements a web crawler that starts fetching pages from a user-supplied seed URL; and it can invoke a search engine with user-supplied keywords, automatically fetch and parse the results, and use them as further starting points for crawling and extraction. Finally, functional and performance experiments verify whether the software meets the expected requirements, and the problems found are discussed together with proposed improvements.
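Point 1 of the abstract describes extraction policies built from regular expressions and applied to fetched page content. The following is a minimal sketch of that idea in Python (the thesis does not state its implementation language); the policy field names, the example URL, and the patterns are illustrative assumptions rather than the author's actual policy format.

```python
import re
import urllib.request

# A "policy" here is just a named set of regular expressions; the thesis
# software lets the user edit such policies through a configuration
# interface. The patterns below are illustrative only.
EXTRACTION_POLICY = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "title": re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL),
    "link":  re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE),
}

def fetch_page(url: str) -> str:
    """Download a page and decode it to text (decoding errors ignored)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def apply_policy(html: str, policy: dict) -> dict:
    """Run every pattern in the policy over the page and collect matches."""
    return {name: pattern.findall(html) for name, pattern in policy.items()}

if __name__ == "__main__":
    page = fetch_page("https://www.example.com/")   # placeholder start URL
    results = apply_policy(page, EXTRACTION_POLICY)
    for field, values in results.items():
        print(field, values[:5])   # show the first few matches per field
```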
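Point 2 mentions URL de-duplication as a key part of the crawler. A common way to realize it is to keep a set of already-visited URLs (often stored as hashes to bound memory use) and skip any URL seen before; the sketch below assumes that approach and does not claim to reproduce the thesis's actual de-duplication structure.

```python
import hashlib
import re
import urllib.request
from collections import deque

LINK_RE = re.compile(r'href="(https?://[^"#]+)"', re.IGNORECASE)

def url_fingerprint(url: str) -> str:
    """Hash the URL so the visited set stays compact even for long URLs."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def crawl(start_url: str, max_pages: int = 20):
    """Breadth-first crawl with a visited set for URL de-duplication."""
    queue = deque([start_url])
    visited = set()
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        fp = url_fingerprint(url)
        if fp in visited:          # de-duplication: skip URLs already fetched
            continue
        visited.add(fp)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except Exception:
            continue               # unreachable pages are simply skipped
        pages.append((url, html))
        queue.extend(LINK_RE.findall(html))   # enqueue outgoing links
    return pages
```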
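Point 3 describes seeding the crawl by querying a search engine with user keywords and harvesting the result links. The sketch below only illustrates that flow; the query endpoint and the result-link pattern are placeholders, since the abstract does not say which commercial search engine is invoked or how its result pages are structured.

```python
import re
import urllib.parse
import urllib.request

# Hypothetical query endpoint; a real deployment would substitute the URL
# template and result-link pattern of the chosen commercial search engine.
SEARCH_URL_TEMPLATE = "https://search.example.com/search?q={query}"
RESULT_LINK_RE = re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE)

def seed_urls_from_keywords(keywords: str, limit: int = 10) -> list:
    """Query the (placeholder) search engine and return result links that
    can be fed into the crawler as start URLs."""
    query = urllib.parse.quote_plus(keywords)
    url = SEARCH_URL_TEMPLATE.format(query=query)
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="ignore")
    return RESULT_LINK_RE.findall(html)[:limit]
```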
【Degree-granting institution】: 电子科技大学 (University of Electronic Science and Technology of China)
【Degree level】: Master's
【Year degree awarded】: 2013
【Classification number】: TP391.1
Article ID: 2479928
Link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2479928.html