A GPU-Based Method for Fast Summary Generation
Published: 2018-11-11 11:07
[Abstract]: As an important part of how a search engine presents its final results, query-based summarization is the method most commonly used by modern search engines. It shows the user the fragments of each result document that are most relevant to the search terms, which makes the results more intuitive and better targeted. Computing the summary of a single document for a given query is a lightweight task, but today's search engines must handle a massive volume of query requests, and every result document on the result page of every request needs its own summary generated from the query terms. Query-based summary computation is therefore one of the more computationally expensive parts of a modern search engine system. To improve the performance and cost efficiency of summary generation under heavy load, this thesis proposes a high-performance parallel processing method based on a hybrid CPU-GPU (Graphics Processing Unit) system.
A summary generation algorithm suited to GPU processing is proposed. The algorithm segments documents with a sliding window in order to avoid the problem, caused by traditional truncation-based segmentation, of highly relevant fragments being cut apart. The algorithm also adopts a new quantitative formula for evaluating the relevance of a fragment to the query terms.
Based on an analysis of the runtime characteristics of the CPU-GPU hybrid system, the algorithm is further improved. In addition to parallelizing the work inside a single summary generation task, parallelism across tasks is implemented, and a three-stage pipeline system is designed to support this approach. To implement the three-stage pipeline, an asynchronous execution framework named JobFlow is designed; it adopts a service-based programming model and supports highly modular and parallel program design.
Several experiments were carried out to tune the system and to evaluate its performance and cost effectiveness. The results show that, compared with the baseline summary generation algorithm, the Highlighter component of Lucene, the GPU pipeline system achieves a relatively high speedup while also reducing system cost.
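The abstract does not give the thesis's segmentation parameters or its relevance formula, so the following is only a minimal CPU-side sketch of the general idea of sliding-window segmentation versus truncation. The names `sliding_windows`, `score`, and `best_snippet`, the whitespace tokenization, and the scoring rule (number of distinct query terms in a window) are all illustrative assumptions, not the author's method.

```python
from typing import List, Set, Tuple


def sliding_windows(tokens: List[str], size: int, step: int) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges of fixed-size windows.

    step == size gives truncation-style, non-overlapping segments;
    step < size gives overlapping (sliding) windows.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if len(tokens) <= size:
        return [(0, len(tokens))]
    starts = list(range(0, len(tokens) - size + 1, step))
    # Add a final window if the tail of the document is not yet covered.
    if starts[-1] + size < len(tokens):
        starts.append(len(tokens) - size)
    return [(s, s + size) for s in starts]


def score(window: List[str], query: Set[str]) -> int:
    """Placeholder relevance (an assumption): distinct query terms in the window."""
    return len(query.intersection(window))


def best_snippet(text: str, query_terms: List[str], size: int = 4, step: int = 1) -> str:
    """Pick the highest-scoring window; ties go to the earliest window."""
    tokens = text.lower().split()
    query = {t.lower() for t in query_terms}
    ranges = sliding_windows(tokens, size, step)
    start, end = max(ranges, key=lambda r: score(tokens[r[0]:r[1]], query))
    return " ".join(tokens[start:end])


if __name__ == "__main__":
    doc = "a b c gpu summary d e f"
    # Truncation (step == size) splits the two query terms across segments.
    print(best_snippet(doc, ["gpu", "summary"], size=4, step=4))
    # Sliding windows (step < size) find a window containing both terms.
    print(best_snippet(doc, ["gpu", "summary"], size=4, step=1))
```

In this sketch each window is scored independently of the others, which is the kind of structure that lends itself to data-parallel execution; how the thesis actually maps windows to GPU threads is not stated in the abstract.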
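[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3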
Article ID: 2324651
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2324651.html