CUDA-Based Optimization and Parallelization of Inter-Frame Prediction
Published: 2019-06-10 09:52
[Abstract]: As the main module of the H.264/AVC coding framework, the inter-frame prediction module improves compression efficiency through multi-frame prediction, sub-pixel motion estimation, and rate-distortion-optimized mode decision, but these techniques also make the module time-consuming and resource-intensive. Meanwhile, the continuing development of CUDA (Compute Unified Device Architecture), the GPU-based parallel programming framework, has turned the GPU into another programmable execution unit of the computer, and in scientific computing the GPU's throughput now far exceeds that of the CPU. Accelerating the inter-frame prediction module on the CUDA platform to raise overall coding efficiency has therefore become a hot topic in both multimedia technology and high-performance computing.

Statistics collected over video sequences of various resolutions and frame rates show that, during inter-frame prediction, motion vectors follow consistent trends in both the local and the global domain, and that the motion vectors of coding blocks under different partition modes are strongly correlated. Based on these observations and the characteristics of the CUDA platform, the serial inter-frame prediction module is optimized at both the framework level and the core-algorithm level:

(1) On the CUDA platform, the inter-frame prediction module is decomposed into sub-modules such as interpolation filtering, motion estimation, and multi-mode motion vector synthesis (a minimal motion-estimation kernel sketch follows the abstract).

(2) To overcome the blind search mechanism of the traditional full-search algorithm and the multi-branch control flow of fast search algorithms, which prevents full utilization of the CUDA platform's compute resources, a motion-trend-oriented adaptive iterative search algorithm is proposed and implemented (see the search-loop sketch below).

(3) To reduce the per-thread computational load, exploit neighborhood motion information, and avoid the low concurrency caused by data dependence, a pre-search mechanism based on domain partitioning and double sampling is proposed and implemented.

(4) Exploiting the inter-layer correlation of motion vectors, an optimal motion vector merging mechanism across coding-block layers is proposed and implemented (see the merging sketch below).

Experimental results show that, compared with the full-search algorithm, the motion-trend-oriented iterative search achieves a 70-80x speedup while keeping the SNR degradation below 0.5 dB; compared with fast search algorithms it is 3-4x faster and achieves a higher compression ratio; and compared with existing CUDA-based motion estimation algorithms it improves coding efficiency by about 20%.
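The sketch below is a minimal illustration, not the thesis code, of how the motion estimation sub-module from item (1) can map onto CUDA: each thread computes the SAD of one candidate motion vector for a 16x16 macroblock, so all candidates of a block are evaluated in parallel. The frame pointers, the Candidate structure, and the kernel parameters are hypothetical names introduced only for this sketch.

```cuda
// Hypothetical sketch: one thread = one candidate motion vector for one
// 16x16 macroblock. Bounds checking of candidates against the frame border
// is assumed to be done when the candidate list is built on the host.
#include <cuda_runtime.h>

struct Candidate { int dx, dy; };                 // candidate motion vector (integer pel)

__global__ void sad16x16Kernel(const unsigned char* cur,  // current luma frame
                               const unsigned char* ref,  // reference luma frame
                               int stride,                // line stride of both frames
                               int mbX, int mbY,          // top-left corner of the macroblock
                               const Candidate* cands,    // candidate MVs to evaluate
                               int numCands,
                               int* sadOut)               // one SAD per candidate
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numCands) return;

    int rx = mbX + cands[c].dx;                   // reference block origin for this candidate
    int ry = mbY + cands[c].dy;

    int sad = 0;
    for (int y = 0; y < 16; ++y) {
        const unsigned char* curRow = cur + (mbY + y) * stride + mbX;
        const unsigned char* refRow = ref + (ry  + y) * stride + rx;
        for (int x = 0; x < 16; ++x) {
            int d = (int)curRow[x] - (int)refRow[x];
            sad += d < 0 ? -d : d;                // sum of absolute differences
        }
    }
    sadOut[c] = sad;                              // a reduction (or the host) picks the minimum
}
```

A launch such as `sad16x16Kernel<<<(numCands + 255) / 256, 256>>>(...)` evaluates every candidate of one macroblock concurrently; following the decomposition in item (1), sub-pixel interpolation would be handled by a separate interpolation-filtering kernel rather than inside this loop.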
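The abstract does not give the exact procedure of the motion-trend-oriented adaptive iterative search in item (2). The host-side sketch below only illustrates the general shape such a search could take, under the assumption that it starts from a trend-based prediction, refines around the current best position, and stops when no neighbor improves the cost; the cost callback, step schedule, and stopping rule are all assumptions.

```cuda
// Hedged sketch of an adaptive iterative search: begin at a motion-trend
// prediction, test a small cross pattern around the current center, move
// the center toward the best candidate, and stop when the center itself
// wins at the finest step or the iteration budget is exhausted.
#include <functional>

struct Mv { int x, y; };

Mv iterativeTrendSearch(Mv predicted,                        // trend-based start point
                        int maxIter, int searchRange,
                        const std::function<int(Mv)>& cost)  // e.g. SAD plus an MV-rate term
{
    Mv center = predicted;
    int bestCost = cost(center);
    int step = 2;                                            // assumed initial step size

    for (int it = 0; it < maxIter; ++it) {
        static const int dx[4] = { -1, 1, 0, 0 };
        static const int dy[4] = { 0, 0, -1, 1 };
        Mv bestMv = center;

        for (int k = 0; k < 4; ++k) {
            Mv cand = { center.x + dx[k] * step, center.y + dy[k] * step };
            if (cand.x < -searchRange || cand.x > searchRange ||
                cand.y < -searchRange || cand.y > searchRange)
                continue;                                    // stay inside the search window
            int c = cost(cand);
            if (c < bestCost) { bestCost = c; bestMv = cand; }
        }

        if (bestMv.x == center.x && bestMv.y == center.y) {
            if (step == 1) break;                            // converged at the finest step
            step = 1;                                        // otherwise refine locally
        } else {
            center = bestMv;                                 // follow the motion trend
        }
    }
    return center;
}
```

In a CUDA setting the cost evaluations inside one iteration are independent, which is what lets such a search keep the device busy where the branch-heavy fast-search algorithms mentioned in the abstract cannot.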
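Item (4) describes merging optimal motion vectors across coding-block layers. One common reading, sketched below under that assumption, is that SADs computed once for the smallest sub-blocks are summed to obtain the cost of larger partitions that share the same candidate motion vector, so higher layers need no extra pixel-level work. The SAD-array layout and the function name are illustrative only.

```cuda
// Hedged sketch of merging lower-layer results into a larger partition.
// sad8x8 is laid out sub-block-major: sad8x8[b * numCands + c] is the SAD of
// 8x8 sub-block b (0..3 in raster order) for candidate motion vector c,
// assuming every sub-block was evaluated against the same candidate list.
#include <climits>

void mergeBestMv16x16(const int* sad8x8, int numCands,
                      int* bestCand, int* bestSad)
{
    *bestSad  = INT_MAX;
    *bestCand = -1;
    for (int c = 0; c < numCands; ++c) {
        // The 16x16 cost for a shared candidate MV is simply the sum of the
        // four 8x8 costs; no pixel differences are recomputed.
        int sad = sad8x8[0 * numCands + c] + sad8x8[1 * numCands + c]
                + sad8x8[2 * numCands + c] + sad8x8[3 * numCands + c];
        if (sad < *bestSad) { *bestSad = sad; *bestCand = c; }
    }
}
```

With the same layout, 16x8 and 8x16 partitions would combine the corresponding pairs of 8x8 costs, so the multi-mode motion vector synthesis sub-module can rank all partition modes from one pass of low-layer SAD computation.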
Document No.: 2496375
Link: https://www.wllwen.com/kejilunwen/wltx/2496375.html