Design and Implementation of a Social Network Abnormal User Identification System Based on Attribute Reduction

Published: 2018-04-16 06:01

Topic of this thesis: Weibo users + feature extraction; Source: master's thesis, Beijing University of Posts and Telecommunications, 2016

[Abstract]: Weibo has become an important social media platform on which large numbers of users post and spread information. Abnormal users seriously degrade the Weibo environment, so identifying the types of Weibo users is of great significance. Taking Sina Weibo as an example, this thesis selects a sample of Weibo users as research subjects, analyzes and extracts user features, selects among those features through attribute reduction, and analyzes the user data with statistical methods and with classification methods from data mining. With the C4.5 decision tree as the starting point and several other classification methods for comparison, a classifier is trained on historical data and used to predict the class of new samples with high accuracy. Attribute reduction is then added on top of the C4.5 decision tree classifier, which has the effect of pruning the tree and further improves prediction, raising classification accuracy to 92.68%. Beyond user features, the thesis also studies the text of posts, using the naive Bayes method to classify Weibo content. After Chinese word segmentation, word-frequency counting, and stop-word removal, each post is represented with the vector space model, and the naive Bayes classification experiment is carried out in Weka. Because the stop-word removal stage also removes stop words specific to Weibo, classification accuracy reaches 88.65%, a good result. Finally, building on the theory of Weibo user classification and Weibo text classification, a Weibo user identification system is designed and implemented. It can analyze Weibo user data and can process user data both in batch and online to determine user type, which is of practical significance.
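The abstract does not say which attribute reduction algorithm the thesis uses. As a rough illustration only, the sketch below assumes a rough-set style reduction: it greedily keeps the attributes that most increase the share of users whose class is fully determined by the selected attributes, and drops the rest before a classifier such as C4.5 would be trained. The feature names (`followers`, `posts_per_day`, `verified`), the toy rows, and the helper functions are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, labels, attrs):
    """Fraction of rows lying in equivalence classes that carry a single label."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def greedy_reduct(rows, labels, all_attrs):
    """Add attributes one at a time until the subset explains the labels
    as well as the full attribute set does."""
    target = dependency(rows, labels, all_attrs)
    chosen = []
    while dependency(rows, labels, chosen) < target:
        candidates = [a for a in all_attrs if a not in chosen]
        best = max(candidates,
                   key=lambda a: dependency(rows, labels, chosen + [a]))
        chosen.append(best)
    return chosen


# Hypothetical, already-discretized user features (not data from the thesis).
rows = [
    {"followers": "low", "posts_per_day": "high", "verified": "no"},
    {"followers": "low", "posts_per_day": "low", "verified": "no"},
    {"followers": "high", "posts_per_day": "high", "verified": "yes"},
    {"followers": "high", "posts_per_day": "low", "verified": "no"},
    {"followers": "low", "posts_per_day": "high", "verified": "yes"},
    {"followers": "high", "posts_per_day": "low", "verified": "yes"},
]
labels = ["abnormal", "normal", "normal", "normal", "abnormal", "normal"]
attrs = ["followers", "posts_per_day", "verified"]

reduct = greedy_reduct(rows, labels, attrs)
print(reduct)  # ['followers', 'posts_per_day']
```

In this toy data the `verified` attribute adds nothing once the other two are known, so it is dropped. Training a decision tree on the reduced attribute set is one plausible reading of how reduction could act like pruning, though the thesis may implement it differently.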
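[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP393.092;TP391.1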

[References]