数据时效性的理论和算法研究
发布时间:2018-06-21 21:43
本文选题:数据质量 + 数据可用性 ; 参考:《哈尔滨工业大学》2016年博士论文
【摘要】:随着大数据时代的到来,数据的可用性受到广泛的关注。真实世界会随着时间的流逝迅速变化,进而导致数据库中的数据过时失效。当前已有统计表明过时数据会对企业决策和国民生活造成众多不良影响,且会引起其他维度的可用性下降,如引起数据不一致、不精确、不完整等,因此确保数据的时效性至关重要。当前数据可用性领域对于时效性的研究仍然不成体系,数据时效性研究面临极大挑战。首先,很多数据库中都没有精确可用的时间戳,这使得数据集合在给定时刻的时效性,即绝对时效性,很难判定。其次,不同的查询或应用场景对时效性有不同的要求,在一些情境下绝对时效性可能无法判定,这使得数据相对于查询或者用户的时效性判定尤为重要。第三,在得到数据库的时效性判定结果之后,必须进一步给出数据时效性的修复方法,当前数据可用性领域的研究并没有给出可以直接用于修复时效性的数据修复方法。第四,在仅有一个数据源的情况下,完全地修复一个数据库是非常困难,甚至不可行的。因为不同数据源包含的数据不同,所以往往要需要根据现有知识,将来自其他数据源的数据和目标数据源的最新值整合起来才能得到完整的目标数据表最新值。为了有效地应对上述挑战,本文尝试给出一系列理论和算法,解决了数据时效性的一些关键问题,主要研究内容可以概括如下。(1)本文研究了数据绝对时效性的表达原理及判定算法。为了克服当前基于时间戳和基于规则的两类时效性判定方法的局限性,形式化地定义了不确定时效规则及相应的数据时效性模型。该规则和模型可以表达不确定的领域知识,定量地判定数据时效性,且能够判定数据在特定时刻是否过时。在此基础上,本文首先研究了不确定时效规则的基础问题,如公理化、可满足、蕴含等问题;然后给出了定量地判定数据时效性的模型,分别定义了数据项、元组、数据集合的时效性;接着,将数据项间的时序关系构建成时序图,并基于时序图给出了多项式时间的时效性判定算法;最后在真实数据上的实验验证了算法的有效性。(2)本文研究了数据相对时效性表达原理及判定算法。在数据的绝对时效性无法判定,或判定结果不能有效地表达用户需求的情况下,可以利用一些冗余记录和时效约规则来实现数据相对时效性的判定。本文借助冗余记录和时效规则研究数据相对时效性判定问题,建立了相对时效性的判定模型并提出了相关求解算法。本文首先定义了查询相关时效性,将查询归结为最新值查询和时效序列查询两类,对每类查询,设计了查询结果的时效性判定方法,并将每类查询作为一个整体,给出了数据集合相对于一类查询的平均时效性判定方法;然后,将用户按查询偏好分为3类,研究了用户相关时效性;最后在真实数据和虚拟数据上分别进行了实验,验证了算法的有效性,分析了各参数对算法的影响。(3)本文研究了基于规则的数据时效性错误修复模型及修复算法。将数据库中的过时数据修复为最新值是提高数据质量的关键步骤。当前主要有基于规则和基于统计两类数据修复方法:基于规则的修复方法难以表达数据中某些复杂的关联关系,而基于统计的方法需要学习较复杂的条件概率分布,且难以直接应用数据语义相关的领域知识。为了克服上述两类方法的缺点,本文提出一类新的修复规则,将规则和统计的方法结合起来修复过时数据,该规则一方面能够通过规则模式表达领域知识,另一方面还能够使用其特有的分布表来描述数据随时间变化的统计信息。首先,本文研究了静态数据上的最小规则模式生成问题,证明了静态数据上的规则模式生成问题是NP-难的,并给出了两个解决该问题的多项式时间近似算法。接着,本文研究了动态数据上的最小规则模式生成问题,给出算法可在数据动态变化的情况下迅速更新现有的规则模式集合,最好情况下,只需O(1)时间即可完成更新。同时,本文还给出了静态数据上的分布表学习算法和数据动态变化情况下的分布表更新算法。然后,本文研究了不同修复代价约束条件下的最优修复计划产生问题,证明了在修复预算为正无穷时,该问题在多项式时间内可解,否则该问题是NP-难的,并给出了上述两种情况下该问题的解决方法。最后本文通过真实和虚拟数据集合上的实验证明了上述方法的有效性。(4)本文研究了基于查询的数据时效性错误修复问题。在数据集成或Web环境下,许多数据表被分散地存储在不同地方,这些数据表之间往往存在着部分数据重叠的情况,但不同数据源的更新频率不尽相同。如果我们向某数据源请求一个数据表或发出一个查询,往往会因为数据源更新不及时而无法得到目标数据表的最新数据。为了将目标数据表修复为最新值,需根据数据库中的时序约束和参照完整性约束构造一个合取查询,使得该查询的结果恰由目标数据表对应的最新值构成,称为时效保持查询。本文研究在给定数据库时序关系和参照完整性约束的情况下时效保持查询构造问题。首先,本文给出了时效保持查询的形式化定义,使用该查询可以给出目标数据表的最新值。接着,本文定义了模式时效图,用于表达数据库中不同数据表之间的时序约束和参照完整性约束,并将时效保持查询等价的表达为图中的一个终点树。然后,本文形式化了最小时效保持查询生成问题,证明了最小化时效保持查询是一个NP-难问题,并分别给出了不同情况下的最小化时效保持查询算法;最后,本文通过实验验证了所提模型和算法的有效性。
[Abstract]:With the advent of the big data age, the availability of data has been widely concerned. The real world will change rapidly over time, resulting in data outdated data in the database. The current statistics show that outdated data will cause many undesirable effects on enterprise decision-making and national life, and will cause other dimensions to be available. Drop, such as causing data inconsistency, inaccuracy, incomplete and so on, so it is essential to ensure timeliness of data. The current data availability field is still not a system for timeliness, and data aging research is facing great challenges. First, many databases have no precise time stamps, which makes the data set at a time time. Engraved timeliness, namely absolute timeliness, is difficult to determine. Secondly, different queries or application scenarios have different requirements for timeliness, and in some situations the absolute timeliness may not be judged, which makes the data relative to the query or user's timeliness determination is particularly important. Third, after getting the results of the timeliness of the database, It is necessary to further give a method of data aging repair. Research in the field of current data availability does not give a data repair method that can be directly used to repair the timeliness. Fourth, it is very difficult and even infeasible to completely repair a database in the case of only one data source. The data is different, so it is often necessary to integrate the latest data from other data sources and target data sources to get the latest value of the target data table. In order to cope with the challenges mentioned above, a series of theories and algorithms are given to solve the key problems of data aging. The main research contents can be summarized as follows. (1) this paper studies the principle of absolute timeliness of data and the algorithm of decision. In order to overcome the limitations of the two kinds of time stamp based time stamp and rule based time limitation method, we formally define the uncertain Aging Rule and the corresponding data aging model. To express uncertain domain knowledge, determine data timeliness quantificationally and determine whether data is out of date at a specific time. On the basis of this, this paper first studies the basic problems of uncertain aging rules, such as axiom, satisfaction and implication, and then gives a model to determine the timeliness of data quantificationally, and defines the number of data. According to the item, the data set is timeliness of the data set; then, the time series relation between the data items is constructed into a time series graph, and the time timeliness determination algorithm of polynomial time is given based on the time series graph. Finally, the validity of the algorithm is verified on the real data. (2) the data relative timeliness expression principle and the decision algorithm are studied in this paper. When the absolute timeliness is unable to be judged, or when the result can not effectively express the user's demand, some redundant records and time limitation rules can be used to determine the relative timeliness of data. In this paper, the relative timeliness determination of data is studied with the aid of redundant records and aging rules, and a relative timeliness determination model is established. In this paper, the correlation algorithm is proposed. Firstly, the validity of query is defined, and the query is reduced to the two classes of the latest value query and the time series query. For each class of queries, the timeliness determination method of the query results is designed, and each class of queries is taken as a whole, and the average timeliness judgment of the data set relative to a class of queries is given. Then, the user is divided into 3 categories according to the query preference, and the user related timeliness is studied. Finally, experiments are carried out on real and virtual data to verify the effectiveness of the algorithm and analyze the influence of the parameters on the algorithm. (3) this paper studies the rule based data aging error repair model and the repair algorithm. It is the key step to improve the quality of data, according to the outdated data in the library. At present, there are two types of data repair methods based on rule and Statistics: rule based repair methods are difficult to express some complex relationships in data, and the statistical method needs to learn more complex conditional probability distribution, and it is difficult. In order to overcome the shortcomings of the two kinds of methods, a new kind of repair rule is proposed in this paper, which combines rules and statistical methods to repair outdated data. On the one hand, the rule can express domain knowledge in a regular pattern, and on the other hand it can be described with its unique distribution table. In this paper, the minimum rule pattern generation problem on static data is studied. It is proved that the rule pattern generation problem on static data is NP- difficult, and two polynomial time approximation algorithms for solving the problem are given. Then, this paper studies the minimum rule pattern generation on dynamic data. The algorithm can quickly update the existing rule pattern set in the case of dynamic change of data. In the best case, it only needs O (1) time to complete the update. At the same time, the distribution table learning algorithm on static data and the distribution table updating algorithm under the dynamic change of data are also given. The problem of optimal repair plan generation under the complex cost constraint proves that the problem can be solved in polynomial time when the repair budget is positive infinity, otherwise the problem is NP- difficult, and the solution of the problem under the two circumstances is given. Finally, this paper proves the above method through the experiments on the real and virtual data sets. (4) 4. In this paper, we study the problem of time dependent error repair based on query. In data integration or Web environment, many data tables are stored in different places. There are often partial data overlaps between these data tables, but the update frequency of different data sources is not the same. If we are to a data source Request a data table or issue a query, often because the data source is not updated in time and can not get the latest data of the target data table. In order to repair the target data table to the latest value, a conjunctive query is constructed based on the time series constraint and the reference integrity constraint in the database, so that the result of the query is exactly the target data. The latest value composition of the table is called the timeliness retention query. This paper studies the query construction problem in the case of a given database timing relationship and reference integrity constraints. First, a formal definition of the time retention query is given in this paper. The query can be used to give the latest value of the target data table. Pattern aging diagram is used to express time series constraints and reference integrity constraints between different data tables in a database, and to express the time preserving query equivalence as an end tree in the graph. Then, this paper formalizes the minimum aging preserving query generation problem. It is proved that the minimum aging maintenance query is a NP- difficult problem, and respectively Finally, the effectiveness of the proposed model and algorithm is verified by experiments.
【学位授予单位】:哈尔滨工业大学
【学位级别】:博士
【学位授予年份】:2016
【分类号】:TP311.13
,
本文编号:2050107
本文链接:https://www.wllwen.com/shoufeilunwen/xxkjbs/2050107.html