Design of a Distributed Network User Behavior Analysis System

Published: 2018-03-30 20:12

Topic: user behavior analysis　Focus: data mining　Source: Beijing University of Posts and Telecommunications, 2014 master's thesis

[Abstract]: With the spread of mobile terminals, the population of network users has again grown rapidly. Their frequent access behavior has accumulated massive amounts of data containing useful information, which can guide network services and network security or help website builders with site structure, so the analysis of network users' access behavior has become a research hotspot. Because individual users differ, the behavior of a single user often shows no characteristic pattern, but patterns emerge when users are considered as groups. If the characteristic patterns of user groups can be captured accurately and the users divided into groups accordingly, Internet application and service providers can offer personalized services and high-value-added business recommendations tailored to each group, maximizing the benefit to both network users and service providers. This thesis designs a high-performance distributed network user behavior analysis system for dividing users into groups. First, web page content is crawled, page keywords are extracted using word segmentation with TF-IDF weighting, and page vectors are constructed; at the same time, the context of user visits is obtained from the web server, and a data preprocessing module removes redundancy to produce a unique, low-redundancy data source. Second, clustering methods from data mining are studied in detail and improved, and the algorithm is parallelized on MapReduce, the distributed processing framework of Hadoop, making it better suited to the massive data sets found in practice; the performance gain from MapReduce parallelization is verified. Next, the framework of the distributed user behavior analysis system is designed, comprising a data collection module, a data preprocessing module, a text clustering module, and a knowledge result set module, and the main functions of each module are implemented. The system is tested and evaluated against existing system performance metrics. Finally, the thesis summarizes its characteristics and shortcomings and offers an outlook on future work.
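The pipeline outlined in the abstract (TF-IDF page vectors followed by clustering expressed as map and reduce steps) can be illustrated with a minimal sketch. Several things here are assumptions, not taken from the thesis: the abstract does not name the clustering algorithm, so plain k-means is used purely for illustration; the map/reduce phases are emulated in ordinary Python instead of running on Hadoop; and all function names and the sample pages are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Turn tokenized pages into dense TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    # "+ 1.0" keeps terms that occur in every page from dropping to weight zero
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        length = max(len(d), 1)  # guard against empty pages
        vectors.append([tf[t] / length * idf[t] for t in vocab])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(vector, centroids):
    """Map step: emit (index of nearest centroid, (vector, 1))."""
    idx = min(range(len(centroids)), key=lambda i: sq_dist(vector, centroids[i]))
    return idx, (vector, 1)


def reduce_mean(pairs):
    """Reduce step: average all vectors emitted under one centroid index."""
    total, count = None, 0
    for vec, c in pairs:
        total = list(vec) if total is None else [a + b for a, b in zip(total, vec)]
        count += c
    return [x / count for x in total]


def kmeans_iteration(vectors, centroids):
    """One k-means round: map every vector, group by key, reduce each group."""
    groups = defaultdict(list)
    for v in vectors:
        key, value = map_assign(v, centroids)
        groups[key].append(value)
    # A centroid that attracted no vectors is left where it was.
    return [reduce_mean(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


# Hypothetical, already-tokenized pages: two about clusters, two about football.
pages = [
    ["hadoop", "mapreduce", "cluster"],
    ["hadoop", "cluster", "node"],
    ["football", "match", "goal"],
    ["football", "goal", "league"],
]
vocab, vecs = tfidf_vectors(pages)
centroids = [vecs[0], vecs[2]]  # naive seeding, chosen for the demo only
for _ in range(5):
    centroids = kmeans_iteration(vecs, centroids)
labels = [map_assign(v, centroids)[0] for v in vecs]
print(labels)
```

In an actual MapReduce job the grouping by key would be done by the framework's shuffle phase, and each iteration would typically be a separate job that reads the previous round's centroids; the sketch only mirrors that data flow on a single machine.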
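[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.092;TP391.1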

[References]