Research on a Hadoop-Based Algorithm for Extracting Feature Kernel Data

Published: 2018-08-25 13:22

[Abstract]: Society has entered, and will long remain in, the era of big data. Massive data is characterized by the four Vs: Volume, Variety, Velocity (the speed at which it must be processed), and Veracity. Although data volumes are large, the data often carries redundant information, and what people actually care about are the effective features it contains. If the data is viewed as a large matrix, that matrix is sparse in most cases and can be mapped to a lower-dimensional space, which we call the data feature space. Projecting the original data onto this space yields the feature kernel data, which usually retains the main information of the original data. After defining feature kernel data and feature spaces in terms of a bound on the information loss rate, our goal is to find the optimal feature kernel data and the optimal feature space. To this end, based on the characteristics of high-dimensional big data, this thesis uses the Hadoop distributed computing framework to propose several methods for mining the principal components of data. It also proposes techniques that address shortcomings encountered when using Hadoop, reducing memory usage and improving file access efficiency.

The thesis first presents the preliminaries and mathematical definitions, which provide the theoretical basis and the evaluation criteria for the algorithms that follow. It then introduces a new vector data structure suited to Hadoop for the distributed setting and, on top of it, defines the workflow and data format used by the sending and receiving ends when data moves between nodes. Next, a data preprocessing module converts the input into a form the system can recognize; a tridiagonal matrix is then obtained and decomposed with the QR algorithm to extract the eigen-information. Finally, the eigenvectors are lightly transformed to obtain a new projection space, and projecting the original data onto that space yields the kernel data set.

The implementation frequently operates on vectors. Although the vectors have very high dimension, once the matrix is split by rows each block of vectors occupies only kilobytes of space. The Hadoop Distributed File System (HDFS) assigns a fixed data block size to every file stored in it, which in practice leads to excessive NameNode memory usage and poor file access efficiency. To address Hadoop's weakness in handling massive numbers of small files, we propose a technique for optimizing HDFS. The basic idea is to merge small files into a large file that fits one block and then build an index over it. Furthermore, a name-based index can effectively improve file access efficiency. Experimental results show that the proposed strategy can effectively mine the kernel data set of the original data.
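The small-file idea described in the abstract (merge small files into a block-sized large file, then look them up through a name-based index) can be illustrated with a minimal in-memory sketch. This is not the thesis's implementation and it does not call any Hadoop API: the class `MergedStore`, its methods, the file names, and the toy 64-byte block size are all assumptions made for illustration (real HDFS blocks are commonly 128 MB).

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 64  # bytes; hypothetical toy value, far smaller than a real HDFS block


@dataclass
class MergedStore:
    """Hypothetical store: packs small files into block-sized containers
    and records where each file lives in a name-based index."""
    block_size: int = BLOCK_SIZE
    containers: list = field(default_factory=list)  # one bytearray per merged file
    index: dict = field(default_factory=dict)       # name -> (container_id, offset, length)

    def put(self, name: str, data: bytes) -> None:
        if len(data) > self.block_size:
            raise ValueError("file is larger than one block; store it on its own")
        if name in self.index:
            raise KeyError(f"duplicate name: {name}")
        # Open a new container when the current one cannot hold this file.
        if not self.containers or len(self.containers[-1]) + len(data) > self.block_size:
            self.containers.append(bytearray())
        container_id = len(self.containers) - 1
        offset = len(self.containers[container_id])
        self.containers[container_id].extend(data)
        self.index[name] = (container_id, offset, len(data))

    def get(self, name: str) -> bytes:
        # A single dictionary lookup by name locates the bytes directly.
        container_id, offset, length = self.index[name]
        return bytes(self.containers[container_id][offset:offset + length])


# Ten 16-byte "row vector" files are packed four to a 64-byte container.
store = MergedStore()
for i in range(10):
    store.put(f"row_{i}.vec", f"vector-{i}".encode() * 2)
```

In this sketch the metadata cost scales with the number of containers rather than the number of small files, which is the effect the abstract attributes to merging; how the thesis lays out its index on HDFS is not specified in the source.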
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13

[References]