Research on User Behavior Feature Extraction and Modeling Based on Distributed Processing
Published: 2019-04-24 11:56
[Abstract]: With the rapid growth of the Internet industry and the continued build-out and upgrading of operators' infrastructure and services, the data generated by users accessing the Internet is increasingly rich. Advances in distributed data processing, combined with data mining and machine learning, have made feature extraction and behavior-preference research on Internet users an active field. As the data pipe, operators hold network access traffic records covering the entire network, so processing, mining, and analyzing the DPI (deep packet inspection) data they collect has great potential for characterizing user behavior preferences comprehensively. Against this background, this thesis studies fixed-line broadband DPI data collected in one city by a domestic operator, and uses distributed processing and data mining methods to extract Internet user behavior features from users' traffic records. Traditional analysis of operator traffic data mostly starts from the traffic distribution characteristics of different services and describes users' habits of using different kinds of applications at different times of day. This thesis instead starts from the URL in each DPI record and extracts online behavior features from three angles: the categories of websites users visit, sequential pattern features, and online product browsing, with modeling and experimental analysis for each.
First, the thesis uses web crawlers to build a website classification tag library from navigation sites and categorized directory sites, identifies the operating system running on each terminal, and studies the interest features of user groups based on website tags through statistical analysis and clustering. Second, it applies sequential pattern mining to users' access across multiple websites network-wide, builds a sequence model of website visits, and finds the frequent sequential patterns in the temporal order of users' website visits over a full day. Finally, it studies the traffic generated by visits to e-commerce websites separately and, with the help of web crawlers, refines users' interest preferences to three levels: product, brand, and category. It extracts users' online product browsing preferences through frequent itemset mining and association analysis, and examines them through modeling and experiments.
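As a rough illustration of the second step (frequent sequential patterns across websites), the following is a minimal single-machine sketch, not the thesis's actual method. It assumes a hypothetical record layout of `(user_id, timestamp, url)` and, for brevity, counts only contiguous host-to-host paths rather than running a full sequential pattern mining algorithm or a distributed job. The function names `host_sequences` and `frequent_paths` are invented for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_sequences(records):
    """Group (user_id, timestamp, url) records into per-user, time-ordered host sequences.

    The record layout is an assumption made for this sketch.
    """
    by_user = defaultdict(list)
    for user_id, _ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        host = urlsplit(url).hostname
        # Skip URLs without a parseable host and collapse consecutive hits on one host.
        if host and (not by_user[user_id] or by_user[user_id][-1] != host):
            by_user[user_id].append(host)
    return dict(by_user)


def frequent_paths(sequences, length=2, min_support=2):
    """Return contiguous host paths of `length` seen for at least `min_support` users."""
    support = defaultdict(int)
    for seq in sequences.values():
        # Using a set means each user contributes at most once per path.
        paths = {tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)}
        for path in paths:
            support[path] += 1
    return {path: count for path, count in support.items() if count >= min_support}
```

Support here is the number of distinct users whose visit sequence contains a path, so a single heavy user cannot make a pattern look frequent on their own.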
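[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13; TP393.092
Article ID: 2464425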
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2464425.html