Design and Implementation of a Hadoop-Based Analysis Platform for Massive Search Logs

Published: 2018-11-06 17:21

[Abstract]: Since the end of the 20th century, the growth of the Internet industry and the accelerating digitization of human activity have made information exchange increasingly frequent, and effective information retrieval has become one of the difficulties people face. Search engine technology helped people find their way out of the information maze, made effective retrieval possible, and greatly changed how people work and live. Research on search engines is no longer confined to the technology itself; the behavior of network users is attracting growing attention, because studying that behavior systematically and in depth helps capture users' explicit needs directly and uncover their implicit ones. Another challenge tied to networking and informatization is the processing of massive data. It tests the storage model of traditional database servers and places severe demands on server CPU and I/O performance. Hadoop/Hive is a well-suited method and toolset for this class of problem.

Against this background, and drawing on an extensive literature review and a detailed analysis of how search engine logs are generated and of their common models, this thesis designs an analysis platform for processing massive search logs. The platform consists of four parts: a data collection and preprocessing module, a data storage module, a data analysis module, and a cluster management module. A set of algorithms based on user behavior pattern mining is designed to analyze and process search engine logs, and the platform monitoring module implements monitoring and management of the cluster. The work follows the data mining process, uses the massive-data analysis tool Hadoop as the experimental platform, adopts the MapReduce (map/reduce) programming model, and processes the logs with Hive, which offers a simple and practical SQL-like interface, and with the HBase massive-data store. The mining patterns are distributed across the servers for association matching and the partial results are then merged, which relieves the network and server performance bottleneck and reflects the advantages of asynchronous mining and asynchronous data reduction.

Finally, the platform is validated in an experimental environment. The data are three search engine log samples provided by Sogou Labs (sample data, single-day data, and monthly data). Using these samples, user retrieval behavior is analyzed in detail from several aspects, including user query topics, user click counts and URL rank, and user sessions. The platform's performance is also optimized, and system running times before and after optimization are compared. The experimental results indicate that the log analysis platform designed in this thesis is stable and effective.
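
The map/reduce programming model named in the abstract can be illustrated with a minimal single-process sketch that counts how often each query appears in a search log. It is not the thesis's implementation: the four-field tab-separated record layout, the sample lines, and the function names `map_record`, `shuffle`, `reduce_counts`, and `count_queries` are all assumptions made for illustration, and a real job would run on a Hadoop cluster rather than in one Python process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout (an assumption, not taken from the thesis):
#   user_id <TAB> query <TAB> rank <TAB> clicked_url
SAMPLE_LOG = [
    "u1\thadoop\t1\thttp://example.com/a",
    "u2\thive\t2\thttp://example.com/b",
    "u1\thadoop\t3\thttp://example.com/c",
    "malformed line without tabs",
]


def map_record(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (query, 1) for each well-formed log line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return  # skip malformed records instead of failing the whole job
    yield fields[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the partial counts for one query."""
    return key, sum(values)


def count_queries(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the given log lines."""
    pairs = (pair for line in lines for pair in map_record(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(count_queries(SAMPLE_LOG))
```

Because each map call looks at one line only and each reduce call at one key only, the same two functions could in principle be distributed across many servers, which is the property the abstract relies on when it describes splitting the mining work and merging the partial results.
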
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3
