Research on a Feature Kernel Data Extraction Algorithm Based on Hadoop
[Abstract]: Society has entered the era of big data. Massive data exhibit the "4 V" characteristics: Volume, Variety, Velocity, and Veracity. Although today's data sets are very large, they usually carry considerable redundant information. If a data set is viewed as a large matrix, that matrix is in most cases sparse and can be mapped into a lower-dimensional space, called the data's feature space. Projecting the original data into this space yields the feature kernel data, which typically carry the main information of the original data. After defining the feature kernel data and the feature space whose information loss rate stays below a given bound, our aim is to find the best feature kernel data and the optimal feature space.

Accordingly, for high-dimensional big data, this thesis proposes methods that mine the principal components of the data using the Hadoop distributed computing framework, together with techniques that address shortcomings encountered when using Hadoop; these techniques effectively reduce memory usage and improve file-access efficiency. The thesis first presents the preparatory knowledge and mathematical definitions, which provide the theoretical support and evaluation criteria for the algorithms that follow. Next, a new vector data structure adapted to Hadoop is designed for the distributed environment, and the workflow and data formats of the sending and receiving sides between nodes are defined. The data-preprocessing module then converts the input into a form the system can recognize, after which a tridiagonal matrix is obtained and its eigen-decomposition is computed with the QR algorithm to extract the spectral information. Finally, the eigenvectors are transformed into a new projection space, and the kernel data set is obtained by projecting the original data into that space.

Throughout the implementation, the algorithm mostly operates on vectors; although each vector has very high dimension, after the matrix is partitioned by rows each vector occupies only kilobytes of space. The distributed file system allocates a fixed-size data block for every file it stores, so storing many small files over-occupies NameNode memory and makes file access inefficient. To address Hadoop's weakness in handling large numbers of small files, we propose an HDFS optimization technique whose basic idea is to merge small files into large files that fit a block and to build indexes over them; a name-based index further improves file-access efficiency. Experimental results show that the proposed strategy can effectively mine the core data set of the raw data.
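The abstract paraphrases the loss-rate definition without stating it. A standard formalization consistent with the PCA-style construction described above (an assumption on our part, not necessarily the thesis's exact definition) measures the share of total variance discarded when only the top k of n eigenvalues λ_1 ≥ ··· ≥ λ_n of the data's covariance matrix are kept:

    loss(k) = 1 - (λ_1 + ··· + λ_k) / (λ_1 + ··· + λ_n)

Under this reading, the optimal feature space is spanned by the eigenvectors of the smallest k for which loss(k) stays below the given bound.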
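The abstract does not reproduce the new vector data structure itself. For orientation, the standard Hadoop pattern for shipping a dense row vector between nodes is a custom Writable; the class below is a hypothetical minimal sketch of that pattern, not the thesis's actual structure:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    /** Hypothetical dense-vector Writable; illustrates the standard Hadoop
     *  serialization pattern, not the thesis's actual data structure. */
    public class VectorWritable implements Writable {
        private double[] values = new double[0];

        public VectorWritable() {}                 // no-arg ctor required by Hadoop reflection
        public VectorWritable(double[] values) { this.values = values; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(values.length);           // length prefix, then raw doubles
            for (double v : values) out.writeDouble(v);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            values = new double[in.readInt()];
            for (int i = 0; i < values.length; i++) values[i] = in.readDouble();
        }

        public double[] get() { return values; }
    }

Serializing a length prefix followed by raw doubles keeps each row-partitioned vector compact, which matches the abstract's observation that each vector occupies only kilobytes once the matrix is divided by rows.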
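To make the eigen-decomposition step concrete, here is a minimal self-contained sketch of the QR algorithm on a symmetric tridiagonal matrix (the unshifted textbook form, stored densely for clarity; the thesis's distributed implementation is certainly more refined). Each step factors T = QR with Givens rotations and forms T ← RQ; the diagonal converges to the eigenvalues:

    import java.util.Arrays;

    /** Sketch: unshifted QR iteration on a symmetric tridiagonal matrix. */
    public final class TridiagonalQR {

        /** diag: main diagonal (length n), sub: subdiagonal (length n-1). */
        public static double[] eigenvalues(double[] diag, double[] sub,
                                           int maxIter, double tol) {
            int n = diag.length;
            double[][] t = new double[n][n];
            for (int i = 0; i < n; i++) {
                t[i][i] = diag[i];
                if (i + 1 < n) { t[i][i + 1] = sub[i]; t[i + 1][i] = sub[i]; }
            }
            for (int iter = 0; iter < maxIter && !converged(t, tol); iter++) {
                qrStep(t);
            }
            double[] eig = new double[n];
            for (int i = 0; i < n; i++) eig[i] = t[i][i];
            Arrays.sort(eig);
            return eig;
        }

        /** One QR step: zero the subdiagonal with Givens rotations
         *  (T <- Q^T T = R), then apply the same rotations on the right
         *  (T <- R Q), which preserves symmetry and tridiagonal form. */
        private static void qrStep(double[][] t) {
            int n = t.length;
            double[] cs = new double[n - 1], sn = new double[n - 1];
            for (int k = 0; k < n - 1; k++) {
                double a = t[k][k], b = t[k + 1][k], r = Math.hypot(a, b);
                cs[k] = (r == 0) ? 1 : a / r;
                sn[k] = (r == 0) ? 0 : b / r;
                for (int j = 0; j < n; j++) {      // rotate rows k and k+1
                    double x = t[k][j], y = t[k + 1][j];
                    t[k][j]     =  cs[k] * x + sn[k] * y;
                    t[k + 1][j] = -sn[k] * x + cs[k] * y;
                }
            }
            for (int k = 0; k < n - 1; k++) {
                for (int i = 0; i < n; i++) {      // rotate columns k and k+1
                    double x = t[i][k], y = t[i][k + 1];
                    t[i][k]     =  cs[k] * x + sn[k] * y;
                    t[i][k + 1] = -sn[k] * x + cs[k] * y;
                }
            }
        }

        private static boolean converged(double[][] t, double tol) {
            for (int i = 0; i + 1 < t.length; i++)
                if (Math.abs(t[i + 1][i]) > tol) return false;
            return true;
        }
    }

For example, eigenvalues(new double[]{2, 2, 2}, new double[]{1, 1}, 500, 1e-10) converges to approximately {2 - √2, 2, 2 + √2}, the spectrum of the 3×3 tridiagonal matrix with 2 on the diagonal and 1 off it.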
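The final projection step is embarrassingly parallel, since each row of the original matrix can be projected independently. The mapper below is a hypothetical sketch of that step, assuming rows arrive as comma-separated text and that the small k×d eigenvector matrix V has been serialized into the job configuration under a made-up key "projection.eigenvectors" (semicolon-separated rows); the thesis's actual job layout and data format are its own:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /** Sketch: emits the kernel row y = V x for each original row x. */
    public class ProjectionMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        private double[][] v;   // k x d projection matrix, one eigenvector per row

        @Override
        protected void setup(Context context) {
            v = loadEigenvectors(context.getConfiguration());
        }

        @Override
        protected void map(LongWritable rowId, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",");
            double[] x = new double[parts.length];
            for (int j = 0; j < parts.length; j++) x[j] = Double.parseDouble(parts[j]);

            StringBuilder out = new StringBuilder();
            for (int i = 0; i < v.length; i++) {   // y_i = <v_i, x>
                double dot = 0;
                for (int j = 0; j < x.length; j++) dot += v[i][j] * x[j];
                if (i > 0) out.append(',');
                out.append(dot);
            }
            context.write(rowId, new Text(out.toString()));
        }

        /** Hypothetical: V stored in the configuration as
         *  semicolon-separated rows of comma-separated values. */
        private double[][] loadEigenvectors(Configuration conf) {
            String[] rows = conf.get("projection.eigenvectors").split(";");
            double[][] m = new double[rows.length][];
            for (int i = 0; i < rows.length; i++) {
                String[] parts = rows[i].split(",");
                m[i] = new double[parts.length];
                for (int j = 0; j < parts.length; j++)
                    m[i][j] = Double.parseDouble(parts[j]);
            }
            return m;
        }
    }

Serializing V through the configuration only makes sense because k×d stays small after dimension reduction; for larger V the distributed cache would be the usual choice.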
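The small-file merging idea can be illustrated with Hadoop's SequenceFile, keyed by the original file name so that a name-based lookup remains possible; this is a common pattern for the NameNode small-file problem, not the thesis's exact index structure, and the class name SmallFileMerger is our own:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    /** Sketch: packs every small file under srcDir into one SequenceFile,
     *  so HDFS tracks a single large file instead of many tiny ones. */
    public class SmallFileMerger {
        public static void merge(Configuration conf, Path srcDir, Path dest)
                throws Exception {
            FileSystem fs = FileSystem.get(conf);
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(dest),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (FileStatus st : fs.listStatus(srcDir)) {
                    if (st.isDirectory()) continue;
                    byte[] buf = new byte[(int) st.getLen()];
                    try (FSDataInputStream in = fs.open(st.getPath())) {
                        in.readFully(0, buf);
                    }
                    // Key = original file name, so a name-based index
                    // can later map names to records in the merged file.
                    writer.append(new Text(st.getPath().getName()),
                                  new BytesWritable(buf));
                }
            }
        }
    }

Because the NameNode keeps an in-memory record per file and per block, collapsing thousands of small files into one block-sized container directly reduces NameNode memory pressure, which is the problem the abstract identifies.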
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year of degree conferral]: 2017
[CLC classification number]: TP311.13