基于FPGA的矩阵奇异值分解加速方案的设计与实现

发布时间：2018-05-25 17:26

本文选题：奇异值分解 + 现场可编程逻辑门阵列　；参考：《北京交通大学》2017年硕士论文

【摘要】：奇异值分解(singular value decomposition)是数值计算学科中的一个重要组成,并且在诸如无线通信领域的大规模MIMO、图像处理领域的特征提取及主成分分析、机器学习领域的数据压缩、词义索引和大数据领域的数据相关性分析中都发挥着至关重要的作用。奇异值分解算法是计算复杂度相对较高的矩阵分解算法,而且随着数据处理规模的不断增加,无论在通信方向的大规模MIMO中,还是对于矩阵维度及数据量都更加庞大的图像及数据挖掘等研究与应用场景中,对于奇异值分解的运算速度都有越来越高的需求,因此对矩阵奇异值分解的加速方案实现具有很高的研究与应用价值。本文重点研究了基于单边Jacobi方法的矩阵奇异值分解,该算法具有相对精度高、分解速度快的特点,是一种非常适合并行化和大规模矩阵计算的一种旋转运算方法。对于Jacobi算法而言,旋转变换和列对排序对分解的速度有决定性作用,本文对不同的矩阵列对索引方式进行了研究,并将两种序列生成方式,循环序列和指环序列应用到硬件设计当中。其中指环序列的列对排序方式,不仅利于并行化实现,而且可以得到有序排列奇异值矩阵,并对算法的收敛速度也有积极的促进作用。针对实时性、低延迟需求,本文提出了基于片上存储的循环序列单边Jacobi变换算法硬件架构,其性能相比于相同算法的MATLAB方案和GPU方案有很明显加速效果,保持了相当的数值精度。在此基础上,设计实现了一种基于片上存储以及指环序列方式的并行化硬件加速方案,相比于循环序列方式,实测加速比达到2.95倍。其次,针对大规模、高吞吐率的图像处理以及数据挖掘等应用场景,为解决片内存储容量与硬件设计复杂的问题,提出了基于片外存储器和指环序列的单边Jacobi算法的并行架构设计,并且基于性能与资源的关系,提出了其在并行化硬件设计上性能与资源的平衡策略。
[Abstract]:Singular value decomposition (singular value decomposition) is an important component of numerical computing, and it plays an important role in large scale MIMO in the field of wireless communications, feature extraction and principal component analysis in the field of image processing, data compression in machine learning, word meaning index and data correlation analysis in large data fields. The singular value decomposition algorithm is a matrix decomposition algorithm with relatively high computational complexity, and as the scale of data processing is increasing, the singular values are in the large-scale MIMO of the communication direction, or in the research and application scenarios, such as the matrix dimension and the data mining, which are more large in the matrix dimension and the data amount. The computing speed of decomposition is higher and higher, so the acceleration scheme of matrix singular value decomposition has high research and application value. This paper focuses on the singular value decomposition of matrix based on single side Jacobi method. This algorithm has the characteristics of high relative precision and fast decomposition speed, which is very suitable for parallelization and large scale. A rotation operation method of scale matrix calculation. For Jacobi algorithm, the rotation transformation and column pair sorting have a decisive effect on the speed of decomposition. In this paper, the index mode of different matrix columns is studied, and two kinds of sequence generation, cyclic sequence and ring sequence are applied to the hardware design. The sequence method is not only conducive to parallel implementation, but also can get an orderly array of singular value matrices, and it also has a positive effect on the convergence speed of the algorithm. In view of real time and low delay demand, this paper proposes a hardware architecture of the single side Jacobi transform algorithm based on the memory on chip. Its performance is compared to the same algorithm. The MATLAB scheme and the GPU scheme have obvious acceleration effect and maintain a considerable numerical accuracy. On this basis, a parallel hardware acceleration scheme based on the on-chip storage and the ring sequence is designed and implemented. Compared with the cyclic sequence, the measured acceleration ratio is 2.95 times. Secondly, for large-scale, high throughput images. In order to solve the problem of complex memory storage capacity and hardware design, the parallel architecture design of single side Jacobi algorithm based on external memory and ring sequence is proposed. Based on the relationship between performance and resources, the balance strategy of performance and resources in the design of parallel hard pieces is proposed.
【学位授予单位】：北京交通大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP301.6;TN791

【参考文献】