Design and Implementation of a Hadoop-Based Platform for Analyzing Massive Search Logs
[Abstract]: Since the end of the 20th century, the growth of the Internet industry and the increasing digitization of human activity have made information exchange ever more frequent, and effective information retrieval has become a problem that people face daily. Search engine technology helps people find their way through this maze of information, makes effective retrieval possible, and has greatly changed the way people work and live. Research on search engines is no longer confined to the engines themselves; the behavior of network users is receiving more and more attention, because systematic, in-depth study of that behavior helps capture users' explicit needs and uncover their implicit ones. A second challenge that comes with networking and digitization is how to handle massive data. It tests the storage model of the traditional database server and places severe demands on the server's CPU and I/O performance. Hadoop and Hive are currently well suited to this kind of problem. Against this background, and drawing on a large body of literature together with a detailed analysis of how search engine logs are generated and of the common models for them, this thesis designs an analysis platform for massive search logs. The platform consists of four parts: a data preprocessing module, a data storage module, a data analysis module, and a cluster management module. A set of algorithms based on user behavior pattern mining is designed to analyze and process the search engine logs, and the cluster management module implements monitoring and management of the cluster.
The work follows the standard data mining workflow. It uses Hadoop, a tool for massive data analysis, as the experimental platform, adopts the MapReduce (map/reduce) programming model, and processes the massive logs with Hive, whose SQL-like queries are simple and practical, and with the HBase massive-data store. The mining patterns are distributed to the individual servers for association matching, and the partial results are then merged. This relieves the network and server performance bottlenecks and shows the advantages of asynchronous mining and asynchronous data reduction. Finally, the platform is verified in an experimental environment. The data are three search engine log samples provided by Sogou Labs (a sample data set, one day of data, and one month of data). User search behavior is analyzed in detail in terms of user click counts, URL ranking, and user session analysis. The performance of the platform is also optimized, and the system running times before and after optimization are compared. The experimental data show that the log analysis platform designed in this thesis is stable and effective.
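As a rough illustration of the kind of map/reduce analysis the abstract describes (click counts per URL, URL ranking, and session analysis), here is a minimal single-process sketch. It is not the thesis's implementation. The tab-separated record layout, the sample records, the helper names (`map_clicks`, `reduce_clicks`, `run_mapreduce`, `count_sessions`), and the 30-minute session gap are all assumptions made for this example and do not reflect the actual Sogou Labs log format or the author's algorithms.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical record layout (an assumption, not the real Sogou Labs format):
# "HH:MM:SS<TAB>user_id<TAB>query<TAB>clicked_url"
SAMPLE_LOG = [
    "00:00:01\tu1\thadoop\thttp://a.example/",
    "00:00:05\tu2\thive\thttp://b.example/",
    "00:00:09\tu1\thadoop\thttp://b.example/",
    "00:45:00\tu1\thbase\thttp://a.example/",
    "00:45:10\tu2\thive\thttp://a.example/",
]


def map_clicks(line):
    """Map step: emit (url, 1) for each click record."""
    _time, _user, _query, url = line.split("\t")
    yield url, 1


def reduce_clicks(url, counts):
    """Reduce step: sum the counts emitted for one URL."""
    return url, sum(counts)


def run_mapreduce(lines, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=itemgetter(0))  # shuffle: bring equal keys together
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def to_seconds(hms):
    """Convert "HH:MM:SS" to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def count_sessions(lines, gap_seconds=1800):
    """Count sessions per user; a new session starts after an idle gap.

    The 30-minute default is an assumed heuristic, not a value from the thesis.
    """
    times = defaultdict(list)
    for line in lines:
        timestamp, user, _query, _url = line.split("\t")
        times[user].append(to_seconds(timestamp))
    sessions = {}
    for user, stamps in times.items():
        stamps.sort()
        breaks = sum(1 for a, b in zip(stamps, stamps[1:]) if b - a > gap_seconds)
        sessions[user] = 1 + breaks
    return sessions


# Rank URLs by click count (descending), breaking ties by URL.
url_ranking = sorted(
    run_mapreduce(SAMPLE_LOG, map_clicks, reduce_clicks),
    key=lambda kv: (-kv[1], kv[0]),
)
session_counts = count_sessions(SAMPLE_LOG)
```

On a real cluster the sort-and-group step would be performed by the framework's shuffle phase across machines, and the mapper and reducer would run as distributed tasks; the sketch only shows the shape of the computation.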
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3
Article ID: 2314952
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2314952.html