
Research on Data Mining Algorithms Based on Compressed Databases

Published: 2018-09-13 14:23
[Abstract]: With economic growth and advances in science and technology, large volumes of data have accumulated across every industry. Scientific and statistical databases store many important kinds of data, such as scientific experiment results, geographic surveys, census records, and economic activity. These data are typically static: once entered into the database they rarely change and are retained permanently. As a result, such databases tend to be massive, and applying the query, computation, and analysis methods designed for traditional databases to them incurs an unacceptably large I/O cost. Compressing massive databases has therefore become an important research direction. Researchers in the database community have proposed many database compression algorithms, but there has been little work on data mining and analysis over compressed databases. This thesis studies how to mine data efficiently on compressed databases and covers four aspects. Based on the static, sparse, clustered, and repetitive nature of scientific and statistical databases, it proposes a new Block-based database compression algorithm and analyzes its compression ratio theoretically. Comparative experiments against other database compression algorithms show that it achieves a high compression ratio on scientific and statistical databases. For association rule mining, the thesis proposes the CApriori algorithm, which runs directly on a database compressed with the Block-based method. It gives a theoretical analysis of CApriori's running-time improvement over Apriori, and experiments confirm that CApriori has better time performance on compressed scientific and statistical databases.
For clustering, the thesis proposes the C-Kmeans algorithm, an improved K-means algorithm that runs directly on the compressed database. Because the running time of K-means is linear in the number of data records, most of that time is spent on I/O transfer; by reading the compressed database directly and mining it, C-Kmeans saves a substantial amount of time. Existing frequent pattern mining algorithms that use a vertical data layout for transaction databases perform a large number of tidset intersections, which produce many intermediate results and therefore require frequent reads and writes to external storage. To address this problem, the thesis proposes the CONVTV compression algorithm, which stores vertical data in two different formats and achieves a high compression ratio on most data sets.
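The tidset intersections mentioned in the abstract can be illustrated with a minimal sketch. This is not CONVTV or any other algorithm from the thesis, whose details the abstract does not give; it only shows the general vertical-layout idea, in which each item maps to the set of transaction IDs (its tidset) containing it, and the support of an itemset is the size of the intersection of its items' tidsets. The toy transactions and the function names `to_vertical` and `frequent_pairs` are assumptions made for illustration.

```python
from itertools import combinations


def to_vertical(transactions):
    """Convert a horizontal database (tid -> items) to a vertical one (item -> tidset)."""
    vertical = {}
    for tid, items in transactions.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def frequent_pairs(vertical, min_support):
    """Return {(a, b): tidset} for item pairs whose shared tidset meets min_support."""
    result = {}
    for a, b in combinations(sorted(vertical), 2):
        tids = vertical[a] & vertical[b]  # one tidset intersection per candidate pair
        if len(tids) >= min_support:
            result[(a, b)] = tids  # intermediate result kept for the next level
    return result


# Toy data, invented for illustration only.
transactions = {
    1: {"bread", "milk"},
    2: {"bread", "butter", "milk"},
    3: {"butter", "milk"},
    4: {"bread", "butter"},
}
vertical = to_vertical(transactions)
pairs = frequent_pairs(vertical, min_support=2)
```

Every candidate itemset yields its own tidset that must be kept for the next level of the search. On a large database these intermediate sets may no longer fit in memory, which is the external-storage cost the abstract attributes to vertical-layout miners.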
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13



