缺失数据查询处理技术研究

发布时间：2018-05-14 18:59

本文选题：缺失数据 + 不完整/确定数据　；参考：《浙江大学》2017年博士论文

【摘要】：随着计算机、互联网、通信以及定位技术的快速发展,数据呈爆炸式增长。这些数据在形态上具有海量、高维、多源、异构、不确定/不完整等特征,使得当前相关技术很难支撑人们进行复杂而多样的智能处理需求。缺失数据具有不确定/不完整特征,并广泛存在于科学研究和社会生活的各个领域,如通信、交通和经济等领域。查询是计算机科学的基本问题,存在于目前几乎所有的计算机应用领域。如何高效、智能地查询数据,挖掘有价值的信息,服务于社会,已成为当今信息技术的重大挑战。然而,现有的查询处理技术大多只关注完整数据。由于缺失数据的性质与完整数据的性质存在很大差异,因而缺失数据查询不能(直接)有效地采用传统的完整数据查询处理方法。为此,本文对缺失数据查询处理技术进行了深入研究,主要包括:1)不完整数据查询:相似查询在地理信息系统和多媒体检索等方面有着广阔的应用前景。现有大多的相似查询处理方法只针对完整数据,但在许多的实际应用中,数据集中的对象可能存在缺失信息(如数据对象的某些属性值缺失等);这些对象之间的邻近关系就不能用传统的相似度计算方式衡量,而需要寻找适用于缺失数据对象的新颖的邻近关系度量方式。本文开发了两种有效的不完整数据索引方法,并提出了精确和近似的查询算法,以支持不完整数据k最近邻查询问题。此外,偏好查询在决策支持、个性化服务、以及推荐系统等方面具有重要的应用价值。例如,在电影推荐系统中,针对含有缺失信息的电影评分数据集,系统可以利用Top-k支配查询选出前k个观众评分最好的电影以推荐给电影爱好者。本文提出了面向不完整数据Top-kk支配查询的skyband剪枝、上界值剪枝、位图索引技术以及压缩技术,并设计了多种有效的查询算法,以支持不完整数据Top-k支配查询问题。同时,本文分析了位图索引压缩技术对查询效率和索引空间成本的影响,并进一步探讨了查询效率和索引空间成本的权衡问题。2)不确定图查询:尽管已有许多专家学者致力于不确定图查询处理技术研究,并取得了大量可喜的研究成果,但距离满足不断出现的、复杂而多样的查询需求还有一定的差距,仍待进一步深入研究。例如,不确定图反k最近邻查询在资源优化配置、本质蛋白质识别等场景具有重大的应用前景。然而,尚未有学者对不确定图反k最近邻查询进行研究。所以,本文探讨了不确定图反k最近邻查询问题。本文引入了图规模剪枝、等价图剪枝和概率界剪枝以缩小图规模、减少可能图数量和缩小候选数据节点数量。本文除了提出精确算法来处理不确定图反k最近邻查询,还给出了一种新颖的抽样算法。该算法使用了图规模剪枝和自适应分层抽样方法。该抽样方法被证明是无偏的且其方差不比随机抽样方法差。与已有算法相比,本文所提出的精确算法效率平均提高了五个数量级;而且,抽样算法比精确算法效率至少提高了三个数量级。3)概率查询质量优化:在许多的实际应用中,数据带有不确定性,使得基于这些不确定数据的查询(即概率查询)往往返回一组不确定的查询结果。换句话说,概率查询返回的结果包含大量噪声,这使得查询质量下降。为此,本文提出一种新颖的概率查询质量优化框架,旨在给定的预算内选择出一组最影响查询质量的不确定数据对象去清洗,以最大化查询质量。本文结合一个新开发的ASI结构加速查询质量计算,并利用两个有效的剪枝策略提出分支界限法、贪心算法和抽样算法以支持数据对象选择问题。ASI结构被证实比未使用该结构的质量计算算法性能提升了两到三个数量级。4)缺失数据查询处理系统:基于上述的研究成果,本文开发了一个基于不完整数据偏好查询的餐馆推荐系统。该系统考虑了餐厅的不完整评级信息,并在PostgreSQL数据库中集成了不完整数据skyline查询和Top-k支配查询,主要支持三个功能模块:友好便捷的查询提交模块、灵活实用的结果解释模块、以及增量式的数据集即时交互模块。
[Abstract]:With the rapid development of computer, Internet, communication and positioning technology, the data are increasing explosively. These data have the characteristics of massive, high dimension, multi source, heterogeneous, uncertain / incomplete and so on, which make it difficult to support people to carry out complex and multiple intelligent processing requirements. Missing data is uncertain / incomplete. The whole feature is widely used in all fields of scientific research and social life, such as communications, transportation and economic fields. Inquiry is the basic problem of computer science. It exists in almost all computer applications. How to efficiently, intelligently inquire data, excavate valuable information and serve the society, has become today's information technology. However, most of the existing query processing techniques pay attention to the complete data. Because the nature of the missing data is very different from the nature of the complete data, the missing data query can not (directly) effectively use the traditional complete data query processing method. In this paper, the missing data query processing technology is carried out in this paper. In-depth research includes: 1) incomplete data query: similar query has a broad application prospect in geographic information system and multimedia retrieval. Most of the existing similar query processing methods only aim at the complete data, but in many practical applications, there may be missing information in the objects of the data set (such as a certain data object. " The proximity relations between these objects can not be measured by the traditional similarity calculation method, but we need to find a novel proximity relationship measure suitable for missing data objects. This paper developed two effective methods of incomplete data indexing, and proposed an accurate and approximate query algorithm to support not. Complete data k nearest neighbor query. In addition, preference query has important application value in decision support, personalized service, and recommendation system. For example, in the movie recommendation system, the system can use Top-k dominating query to select the best movies of the former K audience for the movie scoring data set with missing information. This paper proposes skyband pruning, upper bound pruning, bitmap index technology and compression technology for incomplete data Top-kk dominating query, and designs a variety of effective query algorithms to support incomplete data Top-k dominating query problem. At the same time, this paper analyzes the query efficiency of bitmap index compression technology. The impact of rate and index space cost, and further discussion of query efficiency and index space cost tradeoff problem.2) uncertain graph query: Although many experts and scholars have committed to the research of uncertain graph query processing technology, and have achieved a lot of gratifying research results, but the distance meets the continuous, complex and diverse query needs. There is still a certain gap, which still needs to be further studied. For example, the scenario of uncertain graph inverse k nearest neighbor query in resource optimization and essential protein recognition has a great application prospect. However, no scholars have studied the inverse k nearest neighbor query for uncertain graphs. Therefore, this paper discusses the problem of the inverse k nearest neighbor query. In this paper, we introduce a graph scale pruning, equivalent graph pruning and probability boundary pruning to reduce the size of the graph, reduce the number of possible graphs and reduce the number of candidate data nodes. In addition to an accurate algorithm to deal with the inverse k nearest neighbor query, a novel sampling algorithm is given. This algorithm uses the scale pruning and adaptive segmentation of the graph. The sampling method is proved to be unbiased and its variance is not worse than the random sampling method. Compared with the existing algorithms, the efficiency of the proposed algorithm is improved by five orders of magnitude; moreover, the sampling algorithm improves at least three orders of magnitude.3 compared with the exact algorithm efficiency: in many practical cases In application, the data with uncertainty makes the query based on these uncertain data often return a set of uncertain query results. In other words, the result of the return of the probability query contains a lot of noise, which makes the query quality decline. Therefore, a novel probabilistic query quality optimization framework is proposed. In the fixed budget, we select a set of uncertain data objects which most influence the quality of query to maximize the quality of query. This paper combines a newly developed ASI structure to accelerate query quality calculation, and uses two effective pruning strategies to put forward the branch boundary method, greedy algorithm and sampling algorithm to support the.ASI node of the data object selection problem. The structure is proved to improve the missing data query processing system by two to three orders of magnitude.4 than the quality calculation algorithm that does not use the structure. Based on the above research results, this paper develops a restaurant recommendation system based on incomplete data preference query. The system takes into account the incomplete rating information of the dining hall and is in the PostgreSQL data. The library is integrated with incomplete data skyline query and Top-k dominating query. It mainly supports three functional modules: friendly and convenient query submission module, flexible and practical result interpretation module, and incremental data set instant interaction module.

【学位授予单位】：浙江大学
【学位级别】：博士
【学位授予年份】：2017
【分类号】：TP311.13

【参考文献】