Research on a File Access Behavior Prediction Method Based on a Nearest-Neighbor Decision Tree
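Published: 2018-03-23 01:24
Topic: large-scale storage systems. Focus: metadata. Source: Huazhong University of Science and Technology, master's thesis, 2012. Document type: degree thesis.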
[Abstract]: The rapid growth of data keeps increasing storage demand. The number of files in storage systems keeps rising, file types vary widely, and file management is becoming more complex. At the same time, new storage media are continually added to storage systems and different media are used side by side, so the need for classified file storage management keeps growing. An important prerequisite for such management is an accurate prediction of each file's future access behavior, that is, its access frequency. Existing storage systems do not effectively provide file access behavior prediction, so they struggle to meet the needs of classified file storage management.
This thesis designs and implements a new classification-based method for predicting file access. It classifies and predicts the future access behavior of files and can find the K files most similar to any given file. This helps a storage system anticipate future file access behavior, optimize the physical layout of files, and support decisions in components such as the file cache.
The main idea of the prediction system is to build a classification model from the static metadata of files and their early access records, and to use it to predict future access behavior. The system first builds a decision partition tree from the metadata, then builds a K-nearest-neighbor (KNN) classification model at each leaf node of the tree, and uses this hybrid model for prediction. The decision partition tree is a height-balanced multiway tree. Its main role is to partition the original training set according to file metadata, which both removes noisy data and saves time in the later classification step. A new file is passed through the decision partition tree and assigned to the corresponding subset. A max-heap scan over that subset then finds the K files most similar to it, and these K files vote to decide its future access behavior.
Experiments on data extracted from the record files of a real file system show that the system predicts the future access frequency of files with an accuracy of up to 90%, and that its classification time is nearly 20 times shorter than that of the traditional KNN algorithm.
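The partition-then-vote procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes that the decision partition tree splits on categorical metadata values (here the hypothetical attributes `ext` and `owner`), so each leaf is identified by a path of attribute values. It also assumes that similarity is squared Euclidean distance over hypothetical numeric features such as file size and early access count. The class name `PartitionedKNN` and the labels `hot` and `cold` are invented for illustration.

```python
import heapq
from collections import Counter, defaultdict


class PartitionedKNN:
    """Metadata partition plus per-leaf KNN voting (illustrative sketch only)."""

    def __init__(self, split_keys, k=3):
        self.split_keys = split_keys      # metadata attributes used for splitting (assumed)
        self.k = k
        self.leaves = defaultdict(list)   # leaf path -> list of (features, label)

    def _leaf(self, meta):
        # One tree level per metadata attribute; the tuple of values is the leaf path.
        return tuple(meta[key] for key in self.split_keys)

    def fit(self, records):
        # Each record is (metadata dict, numeric feature tuple, class label).
        for meta, features, label in records:
            self.leaves[self._leaf(meta)].append((features, label))

    def nearest(self, meta, features):
        # Scan only the files in the matching leaf, keeping the k closest
        # in a bounded max-heap (heapq is a min-heap, so distances are negated).
        heap = []
        for idx, (other, label) in enumerate(self.leaves.get(self._leaf(meta), [])):
            dist = sum((a - b) ** 2 for a, b in zip(features, other))
            item = (-dist, idx, label)
            if len(heap) < self.k:
                heapq.heappush(heap, item)
            elif item[0] > heap[0][0]:    # closer than the current farthest neighbor
                heapq.heapreplace(heap, item)
        return sorted((-neg, label) for neg, _, label in heap)

    def predict(self, meta, features):
        neighbors = self.nearest(meta, features)
        if not neighbors:
            return None                   # no training file reached this leaf
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]


# Hypothetical training data: (metadata, (size in MB, early access count), label)
records = [
    ({"ext": "log", "owner": "root"}, (1.0, 5.0), "hot"),
    ({"ext": "log", "owner": "root"}, (1.2, 4.0), "hot"),
    ({"ext": "log", "owner": "root"}, (9.0, 0.0), "cold"),
    ({"ext": "tar", "owner": "alice"}, (8.0, 0.0), "cold"),
]
model = PartitionedKNN(["ext", "owner"], k=3)
model.fit(records)
prediction = model.predict({"ext": "log", "owner": "root"}, (1.1, 4.5))
```

Because only the files in the matching leaf are scanned, each query compares against a small subset instead of the whole training set, which is the source of the time saving the abstract reports. The bounded heap keeps memory at K entries per query.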
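[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP333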
Article ID: 1651282
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1651282.html