
Research on Deep Web Data Source Discovery and Query Result Extraction

Published: 2018-05-30 13:23

  Topics: Deep Web + data source discovery; Source: Southwest Jiaotong University, master's thesis, 2013


[Abstract]: With the rapid development of Internet technology, the Web holds an ever-growing amount of valuable information. However, the information offered by individual sites varies enormously in both quantity and quality, which makes it hard for people to pick out high-quality information. Search engines classify, organize, and retrieve Web resources, greatly improving the efficiency with which people obtain valuable resources. Some data, however, resides in back-end databases and cannot be retrieved by traditional search engines; this portion of the Web is called the Deep Web. Deep Web data is highly structured, large in volume, and of high quality, so studying it is of considerable significance.

This thesis studies how to discover and extract Deep Web data. To make use of Deep Web information, the first problem is to discover Deep Web data sources. Second, a query submitted to a Deep Web source returns a page containing result data regions, and locating those regions automatically is a prerequisite for extracting information from them. To address these problems, the thesis makes three contributions: it studies and improves a data source discovery method; it adopts a new algorithm for comparing the structural similarity of web pages and, on that basis, identifies the data regions of a page; and it designs a framework for a Deep Web information integration system and implements the modules for data source discovery and result page information extraction.

The first part covers the discovery of Deep Web data sources and the improvement of the discovery method. The thesis designs a data source discovery framework. To decide whether a form is a query interface, it analyzes how query interfaces differ from other forms and applies a set of rules to make the decision. A data source is usually confined to a single domain, so finding data sources accurately requires determining whether a source is relevant to the target topic category. The thesis analyzes the weaknesses in feature selection of traditional data source classification methods and improves the feature selection strategy. Experiments show that the improved method effectively discovers topic-relevant data source sites.

The second part covers web page information extraction and the application of the new algorithm. By analyzing the characteristics of result pages returned by online databases, the thesis finds that the tag trees corresponding to the individual data regions are very similar in structure. It adopts a new web page structural similarity algorithm to locate the data regions. The algorithm represents the tags of a page as a tree, defines a special kind of subtree, and reduces the comparison of whole trees to the comparison of these special subtrees. Experiments show that the algorithm effectively reflects the degree of structural similarity between web pages. After the data regions have been located, the thesis uses page structure features and keywords to pick out the relevant records and extracts their information.

The last part covers the design of the Deep Web data integration framework and the implementation of its main modules. The thesis designs a Deep Web information integration framework and, building on the data source discovery method of Chapter 3 and the result page information extraction method of Chapter 4, implements the main modules of that framework.
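The idea of locating a data region through structural similarity of tag trees can be illustrated with a minimal Python sketch. The abstract does not define the thesis's "special subtree", so this sketch substitutes a simpler, commonly used comparison: each subtree is summarized as a multiset of root-to-node tag paths, and neighbouring sibling subtrees are compared by the overlap of those multisets. All names (`TagTreeBuilder`, `path_profile`, `find_data_region`), the `min_records` and `threshold` defaults, and the sample page are assumptions made for illustration, not the thesis's algorithm.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree; text content is deliberately ignored."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def parse(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def path_profile(node, prefix=""):
    """Multiset of root-to-node tag paths inside the subtree rooted at node."""
    path = f"{prefix}/{node.tag}"
    profile = Counter([path])
    for child in node.children:
        profile.update(path_profile(child, path))
    return profile


def similarity(a, b):
    """Dice-style overlap of two path profiles, a value in [0, 1]."""
    pa, pb = path_profile(a), path_profile(b)
    shared = sum((pa & pb).values())
    return 2 * shared / (sum(pa.values()) + sum(pb.values()))


def find_data_region(root, min_records=3, threshold=0.8):
    """Return the node whose children look most like repeated records, or None."""
    best, best_score = None, 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        kids = node.children
        if len(kids) < min_records:
            continue
        sims = [similarity(x, y) for x, y in zip(kids, kids[1:])]
        avg = sum(sims) / len(sims)
        # Weight by total subtree size so runs of trivial siblings score low.
        score = avg * sum(sum(path_profile(k).values()) for k in kids)
        if avg >= threshold and score > best_score:
            best, best_score = node, score
    return best


SAMPLE_PAGE = """
<html><body>
  <div class="nav"><a href="#">Home</a><a href="#">Help</a></div>
  <ul class="results">
    <li><a href="#">Title 1</a><span>Author 1</span></li>
    <li><a href="#">Title 2</a><span>Author 2</span></li>
    <li><a href="#">Title 3</a><span>Author 3</span></li>
  </ul>
  <p>Footer</p>
</body></html>
"""

if __name__ == "__main__":
    region = find_data_region(parse(SAMPLE_PAGE))
    print(region.tag, len(region.children))  # prints: ul 3
```

On the sample page, the three `li` records share an identical tag structure, so the `ul` element is selected, while the navigation bar and footer are skipped. A real result page would likely need a lower threshold and extra handling for optional fields within records.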
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3




Article ID: 1955450


Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1955450.html

