Research on Sampling-Based Deep Web Data Source Selection Methods

Published: 2018-09-13 15:36
[Abstract]: With the rapid growth of information on the Internet, the Web holds a vast amount of information for people to use. Deep Web databases, however, are invisible to users: the information they contain can only be retrieved through specific query interfaces. To make full use of the rich and valuable information in the Deep Web and to improve query efficiency, building Deep Web data integration systems has become an active research topic. Selecting Deep Web databases is an important step in the query processing module of such an integration system. This thesis addresses Deep Web data source selection and focuses on three aspects: obtaining data source features through sampling, evaluating sampling quality, and ranking and selecting data sources by an overall score computed from chosen evaluation indicators. First, building on the sampling-based random walk method and addressing the lack of research on keyword attributes, the thesis analyzes the problem of attribute classification during sampling and proposes an extension that introduces keyword attributes and classifies them. Because existing work also neglects categorical attributes with tree-structured features, the thesis proposes the concept of tree-structured categorical attributes and gives a method for handling them during sampling. Second, building on the original random walk sampling method, the thesis saves sampling paths so that each newly generated path can be scanned and compared against the existing ones. On this basis it proposes an improved random walk algorithm that avoids submitting repeated queries for attribute values that share part of a path, which further improves sampling efficiency.
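The path-saving idea in the second contribution can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the `HiddenDatabase` class, its top-k interface, the `underflow`/`overflow`/`valid` status names, and the function `random_walk_sample` are all hypothetical stand-ins, and the sketch omits any correction for sampling bias. It only shows how caching each path prefix keeps walks that share a prefix from resubmitting the same query.

```python
import random


class HiddenDatabase:
    """Toy stand-in (hypothetical) for a Deep Web source with a top-k query form."""

    def __init__(self, rows, k):
        self.rows = rows
        self.k = k
        self.queries_submitted = 0  # counts queries that actually reach the source

    def query(self, conditions):
        """Run a conjunctive query given as (attribute, value) pairs."""
        self.queries_submitted += 1
        matches = [r for r in self.rows if all(r[a] == v for a, v in conditions)]
        if not matches:
            return "underflow", []
        if len(matches) > self.k:
            return "overflow", matches[: self.k]
        return "valid", matches


def random_walk_sample(db, domains, n_samples, rng, max_walks=10_000):
    """Draw up to n_samples rows by random walks over attribute values.

    Every path prefix that has been queried is stored in path_cache, so a
    later walk that shares the prefix reuses the stored answer instead of
    submitting the query again.
    """
    path_cache = {}  # path prefix -> (status, rows)
    attrs = list(domains)
    samples = []
    for _ in range(max_walks):
        if len(samples) >= n_samples:
            break
        path = ()
        for attr in attrs:
            path += ((attr, rng.choice(domains[attr])),)
            if path not in path_cache:
                path_cache[path] = db.query(path)
            status, rows = path_cache[path]
            if status == "underflow":
                break  # dead end: start a new walk
            if status == "valid":
                samples.append(rng.choice(rows))
                break
            # overflow: narrow the query with the next attribute
    return samples, path_cache
```

With this structure the number of queries sent to the source equals the number of distinct path prefixes visited, however many walks are performed.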
Third, the sampling evaluation system takes into account the consistency of information content between the sample and the data source: a text similarity measure over textual content is introduced into sampling quality evaluation and combined with the sample-set-to-data-source ratio method for measuring sample bias, which further improves the evaluation of sampling quality. Fourth, based on the sample set obtained by sampling, data source quality is assessed with five indicators: authority, domain relevance, accuracy, redundancy, and timeliness. A quantification method and formula are given for each indicator. In computing the accuracy indicator, the similarity calculation is improved by adding a semantic similarity component to the Hamming-distance similarity measure. A comprehensive evaluation over the five indicators yields an overall score for each data source, and the sources are ranked and selected by that score. Experiments show that the proposed method substantially improves on problems in previous methods, performs well in both sampling quality and sampling efficiency, and evaluates sample set quality more reliably and effectively.
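The final ranking step can be sketched as a weighted sum over the five indicators. The abstract does not state the thesis's formulas or weights, so everything concrete here is an assumption made for illustration: the equal weights in `WEIGHTS`, the normalization of every indicator to [0, 1], the inversion of redundancy so that fewer duplicates score higher, and the names `SourceQuality`, `overall_score`, and `rank_sources`.

```python
from dataclasses import dataclass

INDICATORS = ("authority", "relevance", "accuracy", "redundancy", "timeliness")

# Hypothetical equal weights; the abstract does not give the actual weighting.
WEIGHTS = {name: 0.2 for name in INDICATORS}


@dataclass(frozen=True)
class SourceQuality:
    """Indicator values for one data source, each assumed to lie in [0, 1]."""
    name: str
    authority: float
    relevance: float   # domain relevance
    accuracy: float
    redundancy: float  # assumed share of duplicate records; lower is better
    timeliness: float


def overall_score(quality, weights=WEIGHTS):
    """Combine the five indicators into a single score in [0, 1]."""
    total = 0.0
    for indicator in INDICATORS:
        value = getattr(quality, indicator)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{indicator} must be in [0, 1], got {value}")
        if indicator == "redundancy":
            value = 1.0 - value  # invert so that less redundancy scores higher
        total += weights[indicator] * value
    return total


def rank_sources(sources, top_n=None):
    """Sort sources by overall score, best first, and optionally keep top_n."""
    ranked = sorted(sources, key=overall_score, reverse=True)
    return ranked if top_n is None else ranked[:top_n]
```

Calling `rank_sources(candidates, top_n=3)` on a list of `SourceQuality` values would return the three highest-scoring sources under these assumed weights; different weights can be passed to `overall_score` to reflect a different emphasis among the indicators.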
[Degree-granting institution]: Shanghai Normal University
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP393.09; TP311.13



