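Research on Comment Spam Detection Methods Based on Human Dynamics
Published: 2018-05-09 19:28
Thesis topics: comment spam detection + spam user detection; Source: Southwest Petroleum University, 2017 master's thesis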
[Abstract]: With the continual emergence of e-commerce, mobile Internet, and online social media platforms, people can shop, make friends, and find entertainment online, and the Internet has become an inseparable part of everyday life. The comment features of these platforms let users express their views freely and, in the process, have gradually turned them from pure consumers of online information into contributors of it, so that user-generated content now fills the online world. Spam hidden in this content seriously affects people's daily lives. Getting computers to identify spam content and its producers automatically and efficiently within this mass of information is a highly challenging problem and one of the active topics in text mining and natural language processing. Building on existing research and the needs of Internet public opinion analysis, this thesis takes news comments and user data from the NetEase news portal as its research object and proposes a comment spam detection method based on ideas from human dynamics. The thesis approaches the method from two perspectives, the comment publisher and the comment itself, and extracts a sample feature space for model construction from each. To extract publisher features, it first analyzes how the behavior patterns of spam users and normal users on the site differ and, based on that analysis, computes statistics of each publisher's individual behavior. These include basic behavioral data, such as the total numbers of comments, replies, favorites, and subscriptions and the daily average number of comments, as well as patterns in comment posting behavior, such as the mean and variance of the time intervals between posted comments.
In addition, the thesis models and analyzes four kinds of interaction between commenters: replying, following, commenting on the same news item, and posting similar comments. From the resulting network models, it extracts commenters' interaction features using six methods for computing network topology features. Finally, it computes the IV value of the comment text and combines it with comment-related attributes to build the comment feature space. On the basis of the commenter and comment feature spaces, the thesis designs four groups of experiments, trains models on different feature subsets with the GBDT and SVM machine learning algorithms, and compares the experimental results to obtain the optimal feature subset for comment spam detection. The results show that the method based on human dynamics behavior patterns can effectively identify spam users on online platforms, with particularly high accuracy for spam users that exhibit machine-like behavior. Moreover, comment spam detection models that incorporate user behavior features improve markedly in both precision and recall.
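The publisher-side behavioral statistics named in the abstract (comment count, daily average comments, and the mean and variance of the intervals between comments) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `interval_features`, the dictionary keys, the use of population variance, and the choice to average over days on which the user actually commented are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import mean, pvariance


def interval_features(timestamps):
    """Compute simple posting-behavior features for one user.

    `timestamps` is a list of datetime objects, one per comment.
    Feature names and definitions are hypothetical, not the thesis's.
    """
    ts = sorted(timestamps)
    # Gaps between consecutive comments, in seconds.
    gaps = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
    # Assumption: "daily average" is taken over days with at least one comment.
    active_days = len({t.date() for t in ts})
    return {
        "comment_count": len(ts),
        "daily_avg_comments": len(ts) / active_days if active_days else 0.0,
        "interval_mean": mean(gaps) if gaps else 0.0,
        # Population variance of the gaps; 0.0 when fewer than two comments.
        "interval_variance": pvariance(gaps) if gaps else 0.0,
    }


# A user posting every 10 seconds, the kind of regular pattern a script might produce.
bot_like = [datetime(2017, 1, 1, 12, 0, s) for s in (0, 10, 20, 30)]
features = interval_features(bot_like)
```

Perfectly regular posting yields an interval variance of zero, which gives an intuition for why interval statistics may help separate machine-like accounts from human ones. Feature vectors of this kind would then be passed to a classifier such as the GBDT or SVM models mentioned in the abstract.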
[Degree-granting institution]: Southwest Petroleum University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1