User Feature Analysis Based on Spark

Published: 2018-10-29 16:22

【Abstract】: In recent years, the rapid growth of the Internet has created a rich and convenient online environment, and people increasingly communicate, trade, and seek entertainment online. The Internet is now flooded with massive volumes of user data, more and more people recognize the value hidden in big data, and a worldwide wave of big data research has followed. This has drawn many researchers in China and abroad into big data mining and has produced a body of research on analyzing and mining users' online behavior data. A big data computing platform does not require ultra-high-performance servers; it can be built from ordinary PCs, and such a cluster often delivers better computing performance than an ultra-high-performance server. Distributed computing platforms, represented by Spark, have emerged and developed rapidly in recent years because they compute in memory and can offer massive storage together with very high computing capacity. Handling the analysis and mining of very large data sets with cloud computing can greatly improve computing speed and the effectiveness of user classification. Combining Spark-style distributed computing platforms with classification mining on massive user data sets is therefore a research direction with both scientific value and application potential. This thesis studies user feature analysis based on Spark and an improved TF-IDF algorithm. The main work is as follows: 1. Spark and its related technologies are studied, along with the process of building a Spark cluster. Using the naive Bayes classification algorithm on the Spark in-memory computing framework, the thesis analyzes which videos users watched and how many times, builds classification models for user gender and age range, and describes the architecture of the overall analysis system. 2. The basic classification algorithm does not weight feature items, so it cannot reflect the value of each individual feature. To address this, traditional TF-IDF weights are applied in a further experiment, and the classification results are compared with those of the basic algorithm. 3. The shortcomings of the traditional TF-IDF weighting method are identified: it considers only the value of the feature item itself and does not capture the correlation between feature items and categories. To solve this problem, a TFC-IDFC weighting method based on the correlation between feature items and categories is proposed, the process of optimizing the classification model is described in detail, and classification results are obtained experimentally. 4. The improved weighting method is compared with the basic classification algorithm and with traditional TF-IDF weighting. Measured by accuracy and F1 score, the results show that the proposed TFC-IDFC weights, which account for the correlation between feature items and categories, give the classification model better classification ability.
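As a rough illustration of the second step in the abstract (weighting viewing counts with traditional TF-IDF before naive Bayes classification), the following is a minimal single-machine sketch in plain Python. It is not the thesis's Spark implementation and it does not implement TFC-IDFC, whose formula the abstract does not give. The viewing data, the category names, the labels "A" and "B", the smoothed IDF variant, and the function names `fit_idf`, `weight`, `train_nb`, and `predict` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def fit_idf(docs):
    """Smoothed IDF per feature: log((1 + n) / (1 + df)) + 1 (an assumed, common variant)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # each user counts once per video category
    return {v: math.log((1 + n) / (1 + df[v])) + 1 for v in df}

def weight(doc, idf):
    """Turn raw view counts into TF-IDF weights; features unseen in training are dropped."""
    total = sum(doc.values())
    return {v: (c / total) * idf[v] for v, c in doc.items() if v in idf}

def train_nb(vectors, labels, alpha=1.0):
    """Multinomial naive Bayes over weighted features with Laplace smoothing alpha."""
    vocab = {v for vec in vectors for v in vec}
    class_count = Counter(labels)
    feat_sum = defaultdict(Counter)
    for vec, y in zip(vectors, labels):
        for v, w in vec.items():
            feat_sum[y][v] += w
    model = {}
    for y, n_y in class_count.items():
        denom = sum(feat_sum[y].values()) + alpha * len(vocab)
        log_prior = math.log(n_y / len(labels))
        log_lik = {v: math.log((feat_sum[y][v] + alpha) / denom) for v in vocab}
        model[y] = (log_prior, log_lik)
    return model

def predict(model, vec):
    """Return the class with the highest log posterior score."""
    def score(y):
        log_prior, log_lik = model[y]
        return log_prior + sum(w * log_lik[v] for v, w in vec.items() if v in log_lik)
    return max(model, key=score)

# Hypothetical data: view counts per video category for four users.
users = [
    {"drama": 5, "news": 1},
    {"drama": 3, "variety": 2},
    {"sports": 4, "news": 2},
    {"sports": 6, "games": 1},
]
labels = ["A", "A", "B", "B"]  # placeholder class labels, e.g. two demographic groups

idf = fit_idf(users)
model = train_nb([weight(u, idf) for u in users], labels)
print(predict(model, weight({"drama": 2}, idf)))
print(predict(model, weight({"sports": 3}, idf)))
```

Dropping the `weight` step and training on raw counts would correspond to the unweighted baseline the abstract compares against. On a Spark cluster the same pipeline would be expressed with the platform's distributed feature and classifier APIs rather than in-memory dictionaries.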
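【Degree-granting institution】: Tianjin Polytechnic University (天津工业大学)
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: TP311.13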

【References】