
Research on Data Fusion for Deep Web Data Integration

Published: 2018-02-11 18:49

Keywords: Deep Web data integration; Deep Web data source quality assessment; data fusion. Source: Shandong University, master's thesis, 2012. Thesis type: degree thesis.


【Abstract】: With the progress of Internet technology, the Web has come to contain an ever-growing wealth of information, making it a huge, widely distributed, global online information source. In recent years especially, large databases of every kind have been built to meet personal and commercial needs, and the Web has become an indispensable part of daily life. Data on the Web is disorganized and its information types are diverse. By the way its data is accessed, the Web can be divided into the Surface Web and the Deep Web. The Surface Web is the set of static pages reachable through hyperlinks and indexable by traditional search engines; the Deep Web consists of online databases whose content cannot be indexed by traditional search engines and is instead hidden behind query interfaces. Studies show that the Deep Web features a large volume of data, comprehensive domain coverage, strong topicality, and a high degree of structure. To make full use of these valuable resources for further analysis and mining, Deep Web data integration is urgently needed.
In every domain, the amount of Deep Web information is exploding, and both the kinds of data sources and the types of information are becoming more diverse. However, this information is not always trustworthy, and different sources often provide heterogeneous, conflicting data. Extracting the correct information that people actually need from this mass of data has become a major challenge for information integration. Data fusion is therefore needed to separate the true from the false and obtain high-quality data that can support analysis and decision making.
Data fusion has attracted growing attention, and many researchers have contributed to the field. Nevertheless, two problems remain open: (1) the quality of Deep Web data sources is uneven, and so is the quality of the values they provide; values supplied by higher-quality sources tend to deserve higher confidence, so each source should be assessed for quality before fusion and the assessment fed into the truth-discovery process; (2) there is as yet no mature, standard method for data fusion, so several factors, such as source accuracy, dependence between sources, and implication between values, must be considered jointly to resolve data conflicts and discover the true values.
Targeting Deep Web data integration, this thesis investigates Deep Web data source quality assessment and truth-discovery methods. The main work and contributions are summarized as follows:
1. A quality assessment model for Deep Web data sources. Sources on the Deep Web differ greatly, and sources of different quality tend to provide data of different quality. Most existing data fusion research does not assess source quality explicitly; instead, it assigns every source the same initial quality and refines it through an iterative algorithm. To support better fusion, we propose a method that assesses Deep Web source quality before fusion. Tailored to the characteristics of data fusion, the method selects multiple factors along three dimensions (data quality, interface page quality, and service quality) as assessment criteria, quantifies each factor, and produces a unified quality score for each source, which is then used in the subsequent fusion. Experiments show that the model assesses source quality fairly accurately, and that applying its results in the fusion process noticeably improves fusion.
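As an illustration only, here is a minimal Python sketch of this style of multi-factor, multi-dimension source scoring. The three dimensions follow the abstract; the individual factors, their normalizations, and the weights are hypothetical stand-ins, since the abstract does not reproduce the thesis's actual factor set or weighting.

```python
# Sketch of multi-dimension source quality scoring. Dimension names follow
# the abstract; the concrete factors and weights below are assumptions.
from dataclasses import dataclass

@dataclass
class SourceMetrics:
    # Data-quality factors (each normalized to [0, 1]).
    completeness: float        # fraction of non-null attribute values
    consistency: float         # agreement with schema constraints
    # Interface-page-quality factor.
    attribute_coverage: float  # fraction of domain attributes queryable
    # Service-quality factors.
    availability: float        # fraction of successful probe queries
    response_speed: float      # 1 / (1 + mean response time in seconds)

# Hypothetical dimension weights and per-dimension aggregators.
WEIGHTS = {
    "data":      (0.5, lambda m: 0.5 * m.completeness + 0.5 * m.consistency),
    "interface": (0.2, lambda m: m.attribute_coverage),
    "service":   (0.3, lambda m: 0.5 * m.availability + 0.5 * m.response_speed),
}

def source_quality(m: SourceMetrics) -> float:
    """Weighted sum over the three dimensions; returns a score in [0, 1]."""
    return sum(w * f(m) for w, f in WEIGHTS.values())

if __name__ == "__main__":
    s = SourceMetrics(0.9, 0.8, 0.7, 0.95, 0.6)
    print(f"quality score: {source_quality(s):.3f}")
```

A weighted linear combination is the simplest way to collapse heterogeneous factors into one comparable score per source; the thesis's actual quantification of each factor may well differ.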
2. A truth-discovery method for Deep Web data integration. Data volumes on the Deep Web are surging in every domain, and so is the amount of conflicting data, which makes it crucial to discover the correct values people need among the conflicts. Drawing on our research background (data integration for market intelligence), we propose a data fusion model for Deep Web data integration. The model jointly considers source accuracy, dependence between sources, and implication between different values to find the true values among conflicting data. Because these factors interact, we compute them iteratively, refining their values until the results converge, and we feed the source quality assessment results into the model. Experiments show that the proposed truth-discovery model is more effective.
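For concreteness, below is a minimal Python sketch of the iterative scheme such a model rests on, reduced to accuracy-weighted voting: value confidence and source accuracy are recomputed in turn until the scores converge, with the quality-assessment results standing in as the initial accuracies. The source-dependence and value-implication terms the thesis incorporates are omitted here, so this shows the iteration pattern, not the thesis's full model.

```python
# Sketch of iterative truth discovery via accuracy-weighted voting.
from collections import defaultdict

def truth_discovery(claims, init_accuracy, iters=50, tol=1e-6):
    """claims: iterable of (source, object, value) triples.
    init_accuracy: dict mapping every claiming source to a prior
    accuracy in (0, 1), e.g. its quality-assessment score."""
    accuracy = dict(init_accuracy)
    by_object = defaultdict(lambda: defaultdict(set))  # object -> value -> sources
    by_source = defaultdict(list)                      # source -> [(object, value)]
    for s, o, v in claims:
        by_object[o][v].add(s)
        by_source[s].append((o, v))

    conf = {}
    for _ in range(iters):
        # Step 1: value confidence = accuracy-weighted vote, normalized per object.
        for o, votes in by_object.items():
            scores = {v: sum(accuracy[s] for s in srcs) for v, srcs in votes.items()}
            total = sum(scores.values())
            for v, sc in scores.items():
                conf[(o, v)] = sc / total
        # Step 2: source accuracy = mean confidence of the values it claims.
        new_accuracy = {s: sum(conf[(o, v)] for o, v in cl) / len(cl)
                        for s, cl in by_source.items()}
        converged = max(abs(new_accuracy[s] - accuracy[s]) for s in new_accuracy) < tol
        accuracy = new_accuracy
        if converged:
            break

    # Pick the highest-confidence value for each object as its truth.
    truths = {o: max(votes, key=lambda v: conf[(o, v)]) for o, votes in by_object.items()}
    return truths, accuracy

if __name__ == "__main__":
    claims = [("s1", "book1", "Knuth"), ("s2", "book1", "Knuth"),
              ("s3", "book1", "Kunth")]  # s3 provides a conflicting value
    truths, acc = truth_discovery(claims, {"s1": 0.9, "s2": 0.8, "s3": 0.4})
    print(truths)  # {'book1': 'Knuth'}
```

The two steps feed each other, which is why the computation must iterate: better accuracy estimates sharpen the vote, and sharper votes re-estimate accuracy, until a fixed point is reached.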

【Degree-granting institution】: Shandong University
【Degree level】: Master's
【Year conferred】: 2012
【CLC number】: TP202; TP311.13





Link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1503748.html

