基于动态散列的嵌入式数据库混合索引的研究

发布时间：2019-03-13 11:42

【摘要】：随着嵌入式技术和电子技术的不断发展，微处理器与闪存等嵌入式系统关键部件的性能在不断的提升，同时其价格在逐步下降，这使得嵌入式设备可以以较低的成本处理更加复杂的任务。然而简单的文件式数据管理已经无法满足以数据处理为核心任务的嵌入式系统对数据管理的要求，此时嵌入式数据库系统应运而生。本文将开源数据库SQLite作为嵌入式数据库研究的实例对象，从混合索引机制方面入手，结合动态散列与树型索引的特点提出了一种高效的混合索引机制，并且实施相应的仿真实验以验证该混合索引机制在检索与插入操作方面的高效性。在研究工作中，首先对SQLite的体系结构、索引组织方式进行了分析，并对比较典型的动态散列、树型索引机制和混合索引机制的结构及相关操作算法进行了研究。在结合T树与红黑树算法特点的基础上，提出了一个非严格平衡的树型结构——TC树，该树型索引机制可以解决T树因严格的平衡条件引起的频繁调整操作导致的操作性能降低的问题。另外，结合线性散列与可扩展散列目录扩展算法的特点，提出了一种平均每次分裂只扩展一个目录单位的增量式动态散列算法——IDH，该动态散列的目录扩展数目与发生溢出的桶的位置相关且溢出的桶可以及时分裂。此外，，将用于数据桶定位的IDH与用于组织桶中数据的TC树相组合形成了一种新的混合索引机制——IDH-TC。通过测试实验，验证了该混合索引机制在数据随机情况下检索与插入操作的高效性。其检索操作的时间复杂度几乎为O(1)，而插入操作的性能受存储利用率的影响呈现周期性缓增骤降的波动趋势；对于TC树，当节点容量对操作性能的影响达到最优时，该树与T树相比，插入操作性能明显较好，检索和删除操作性能相当；线性散列、可扩展散列和增量式动态散列采用桶溢出的分裂方式与采用存储利用率的分裂方式相比，随着数据量的增大，目录增长速度较快，溢出桶数目较少，存储利用率较低。采用存储利用率作为分裂条件时，三种数据偏斜性分布对线性散列与可扩展散列的目录增长情况相同。采用桶溢出情况时，对于线性散列，数据分布越靠后端，目录增长越慢。对于可扩展散列，数分布于前端与后端时目录增长速度相当并均明显快于数据分布于中端时的目录增长速度。对于增量式动态散列，数据分布越靠后端，目录增长越快。综上所述，本文提出的混合索引机制IDH-TC是一种高效的嵌入式数据库索引机制。如果应用到嵌入式数据库中，势必可以较好的改善其检索、插入等操作的性能。
[Abstract]:With the development of embedded technology and electronic technology, the performance of the key components of embedded system, such as microprocessor and flash memory, is constantly improving, and its price is gradually decreasing. This allows embedded devices to handle more complex tasks at a lower cost. However, simple file data management has been unable to meet the data management requirements of the embedded system with the core task of data processing, so the embedded database system emerges as the times require. In this paper, the open source database SQLite is regarded as an example object of embedded database research. From the aspect of hybrid indexing mechanism, an efficient hybrid indexing mechanism is proposed, which combines the characteristics of dynamic hash and tree indexing. Simulation experiments are carried out to verify the efficiency of the hybrid indexing mechanism in retrieval and insertion operations. In the research work, firstly, the architecture and index organization mode of SQLite are analyzed, and the typical dynamic hash, tree index mechanism and mixed index mechanism structure and related operation algorithm are studied. On the basis of combining the characteristics of T-tree and red-black tree algorithm, a non-strictly balanced tree structure, TC tree, is proposed. The tree indexing mechanism can solve the problem that the operation performance of the T-tree is degraded by the frequent adjustment operation caused by the strict balance conditions. In addition, according to the characteristics of linear hash and extensible hash directory expansion algorithm, an incremental dynamic hash algorithm, IDH, which extends only one directory unit per split, is proposed in this paper. The number of directory extensions in this dynamic hash is dependent on the location of the overflow bucket and the overflow bucket can split in time. In addition, combining the IDH for bucket location with the TC tree for organizing data in the bucket forms a new hybrid indexing mechanism-IDH-TC.. The experimental results show that the hybrid indexing mechanism is efficient in data retrieval and insertion under random data conditions. The time complexity of the retrieval operation is almost O (1), while the performance of the insert operation is affected by the storage utilization. For the TC tree, when the node capacity has the best effect on the operation performance, the insertion operation performance of the tree is obviously better than that of the T-tree, and the retrieval and deletion performance of the tree is the same as that of the T-tree. Linear hash, scalable hash and incremental dynamic hash adopt bucket overflow splitting mode compared with storage utilization split mode, with the increase of data volume, directory growth speed is faster, and the number of overflow buckets is smaller. Storage utilization is low. When the storage utilization ratio is used as the splitting condition, the growth of the three kinds of data skewness distributions to the linear hash is the same as that of the scalable hash. In the case of bucket overflow, for linear hash, the more backend the data distribution, the slower the directory growth. For scalable hashes, the directory growth rate when the number is distributed at the front end and the back end is similar and significantly faster than the directory growth rate when the data is distributed at the middle end. For incremental dynamic hashing, the more data is distributed back-end, the faster the directory grows. To sum up, the hybrid indexing mechanism IDH-TC proposed in this paper is an efficient index mechanism for embedded database. If it is applied to embedded database, it can improve the performance of retrieval, insert and so on.
【学位授予单位】：太原科技大学
【学位级别】：硕士
【学位授予年份】：2013
【分类号】：TP368.1;TP311.13

【参考文献】