Top-k Query Processing over Uncertain Data Streams

Published: 2019-03-16 10:13
[Abstract]: Uncertain data is pervasive across the information society, in fields including finance, the military, location-based services, healthcare, and meteorology. With the rapid spread of the mobile Internet and the continual emergence of new data acquisition technologies, the volume of uncertain data is growing sharply. As a result, uncertain data management has attracted attention from researchers in both academia and industry. Data uncertainty arises in relational data, semi-structured data, data streams, and multidimensional data. This thesis studies Top-k query processing over uncertain data streams. An uncertain data stream is a massive sequence of uncertain tuples arriving at high speed. The main processing challenges are: (1) the arrival rate is very high, so tuples must be processed promptly; (2) the data is potentially unbounded, so it is often impossible to hold all of it in memory; (3) because probabilities are involved, efficient optimized algorithms are needed to reduce the computation cost. Although academia has accumulated many research results, existing methods still have limitations in specific scenarios, so new techniques for managing uncertain data streams are urgently needed. This thesis proposes a new approximate query algorithm for uncertain data streams that handles ER-Topk and TTk queries. In addition, to improve both stream throughput and query response time, we design a general query processing framework for uncertain data streams. The work of this thesis covers the following aspects:

Approximate query algorithm for massive data streams: The algorithm addresses the excessive storage consumption that current approaches incur when processing ER-Topk and TTk queries over uncertain data streams. It filters arriving uncertain tuples effectively, reducing the processing load while keeping precision under control, and thereby improves overall system performance.

Real-time uncertain data stream processing framework: Building on the approximate algorithm, we propose a batch-oriented stream processing framework for ER-Topk and TTk queries. The framework uses parallel processing to sustain high throughput over continuously and rapidly arriving data.

Data stream error detection: Uncertain data streams often contain erroneous information caused by a variety of factors. To keep erroneous data from seriously distorting query results, this thesis proposes an error detection method that identifies anomalies by analyzing the characteristics of the data.

Validation of the framework: The proposed approximate algorithm and framework target ER-Topk and TTk queries over uncertain data streams. To verify their data throughput, reliability, and query response speed, this thesis designs several experimental strategies and evaluates the algorithm and framework on both synthetic and real data.
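The abstract does not define ER-Topk. As a minimal sketch, assume it refers to ranking tuples by expected rank across possible worlds, and assume a tuple-level uncertainty model in which each tuple exists independently with a given probability and an absent tuple is ranked after every present tuple. Under those assumptions, a tuple with existence probability p has expected rank p * H + (1 - p) * (S - p), where H is the total probability of the strictly higher-scoring tuples and S is the total probability of all tuples. The function names and the (id, score, probability) layout below are illustrative. This is an exact computation over a single in-memory batch, not the approximate streaming algorithm the thesis proposes.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (tuple id, score, existence probability)
UncertainTuple = Tuple[str, float, float]


def expected_ranks(batch: List[UncertainTuple]) -> Dict[str, float]:
    """Expected rank of each tuple (0 is best), assuming independent tuples."""
    total = sum(p for _, _, p in batch)  # S: expected number of present tuples
    ordered = sorted(batch, key=lambda t: t[1], reverse=True)
    ranks: Dict[str, float] = {}
    higher = 0.0  # H: probability mass of strictly higher-scoring tuples
    i = 0
    while i < len(ordered):
        # Group tuples with equal scores so ties do not outrank each other.
        j = i
        while j < len(ordered) and ordered[j][1] == ordered[i][1]:
            j += 1
        for tid, _, p in ordered[i:j]:
            # Present: rank is the number of present higher-scoring tuples.
            # Absent: rank is the number of other tuples that are present.
            ranks[tid] = p * higher + (1.0 - p) * (total - p)
        higher += sum(p for _, _, p in ordered[i:j])
        i = j
    return ranks


def er_topk(batch: List[UncertainTuple], k: int) -> List[str]:
    """Ids of the k tuples with the smallest expected rank."""
    ranks = expected_ranks(batch)
    return sorted(ranks, key=ranks.get)[:k]


batch = [("a", 10.0, 0.5), ("b", 8.0, 1.0), ("c", 6.0, 0.9)]
print(er_topk(batch, 2))
```

In this batch, tuple b outranks the higher-scoring tuple a because a exists with probability only 0.5, which illustrates why probabilities add cost beyond a plain score sort.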
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13



Article link: https://www.wllwen.com/shoufeilunwen/xixikjs/2441163.html

