Research on Indexing Techniques for Continuous Uncertain XML

Published: 2018-05-29 06:58

Topic: continuous uncertain data + XML; Source: Inner Mongolia University of Science and Technology, 2015 master's thesis

[Abstract]: With the rapid development of network technology, XML has become a mainstream data format and the de facto standard for data exchange and representation on the Internet. In practice, data uncertainty is pervasive, and traditional deterministic data can no longer describe the real world accurately. As uncertain data has become better understood and data acquisition and processing techniques have matured, uncertain data has found wide use in logistics, industry, finance, the military, and other fields. Uncertainty in a database essentially reflects the state of the real world: a monitored pressure, a temperature, or the position of a moving object changes continuously. The uncertainty of a data item can be represented in an XML document as a probability value or as a probability distribution. For continuous uncertain data, a range of possible values together with a probability density function (pdf) is stored in place of a single value. A probabilistic threshold range query takes a probability threshold and a value range, and returns the results that fall within the query range with a probability exceeding the threshold. Because results qualify by meeting the probability value specified in the query, the result set is broader than that of a traditional query, and the query is more precise and more informative. As user query requirements grow and diversify, building XML indexes efficiently faces serious challenges, and XML indexing has become an active research topic. Much of the data in real applications follows continuous distributions. Building on a study of existing XML indexes, this thesis proposes the RLPI index for probabilistic threshold range queries, which applies to arbitrary continuous uncertain XML data.
First, Dewey encoding is extended into a prefix encoding, PEDewey, that adds handling for the distribution nodes IND and MUX in uncertain XML. Second, in the RLPI path index, index entries that share the same reverse label path are clustered and stored together, which saves space. In the RLPI value index, arbitrary continuous uncertain data is preprocessed and combined with corresponding filtering strategies to discard nodes irrelevant to the query, which reduces the number of pdf computations and so speeds up queries. Because evaluating the pdf of continuous uncertain data is time-consuming, an optimized algorithm, the CUXI index tree, is proposed to improve query speed further. Borrowing the R-tree idea of recursively building an index tree over spatial data from the top down, the algorithm builds the index tree by clustering continuous uncertain XML data and stores precomputed information in the tree nodes. This information is used to filter out elements irrelevant to a probabilistic threshold range query, reducing the number of elements that must be processed and improving query speed. The experiments vary document size, query case, and probability threshold, and measure algorithm performance by comparing query response times. Analysis of the experimental results shows that the proposed RLPI index algorithm and CUXI index tree algorithm are efficient.
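To make the notion of a probabilistic threshold range query concrete, the following is a minimal sketch and not the RLPI or CUXI algorithm of the thesis. It assumes each uncertain value is a Gaussian pdf restricted to a stored range [low, high], that a result qualifies when its probability is at least the threshold, and that the only filter is a check on whether the stored range overlaps the query range. All names (UncertainValue, probability_in_range, threshold_range_query) are hypothetical and chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class UncertainValue:
    """A continuous uncertain value: a Gaussian pdf restricted to [low, high].

    The Gaussian shape is an assumption of this sketch, not of the thesis.
    """
    node_id: str  # hypothetical identifier, e.g. a Dewey-style code
    mean: float
    std: float
    low: float
    high: float


def normal_cdf(x: float, mean: float, std: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def probability_in_range(v: UncertainValue, a: float, b: float) -> float:
    """Probability that v lies in [a, b], renormalised over [v.low, v.high]."""
    lo, hi = max(a, v.low), min(b, v.high)
    if lo >= hi:
        return 0.0
    total = normal_cdf(v.high, v.mean, v.std) - normal_cdf(v.low, v.mean, v.std)
    part = normal_cdf(hi, v.mean, v.std) - normal_cdf(lo, v.mean, v.std)
    return part / total


def threshold_range_query(values, a: float, b: float, tau: float):
    """Return (node_id, probability) pairs whose probability in [a, b] is >= tau."""
    results = []
    for v in values:
        # Cheap filter: a stored range disjoint from the query range has
        # probability 0, so the pdf never needs to be evaluated.
        if v.high <= a or v.low >= b:
            continue
        p = probability_in_range(v, a, b)
        if p >= tau:
            results.append((v.node_id, p))
    return results
```

Calling threshold_range_query(values, 18.0, 22.0, 0.5) on a list of UncertainValue objects returns only those elements whose stored range overlaps [18, 22] and whose probability mass inside that range reaches 0.5. Elements pruned by the overlap check incur no pdf evaluation, which is the kind of saving the filtering strategies described above aim for.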
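[Degree-granting institution]: Inner Mongolia University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP311.13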
