Cluster Analysis of Weibo Comment Information

Published: 2018-03-08 13:07

Topic: Weibo comment analysis; Focus: Chinese word segmentation; Source: Anhui University, 2017 master's thesis; Document type: degree thesis

[Abstract]: As a social platform for sharing and exchanging information, microblogging has developed rapidly and been widely adopted in China since Sina launched its Weibo platform in 2009. As of September 30, 2016, Sina Weibo had 297 million monthly active users. Weibo is characterized by simple and fast interaction, the ability to spread information anytime and anywhere, a low barrier to publishing, and fission-like propagation. As a news release platform, a place where news originates, and an information exchange platform, Weibo plays an increasingly important role in everyday online activities such as learning about, publishing, and exchanging information. At the same time, Weibo messages are short, enormous in number, and complex in content, so traditional data mining methods face many challenges when analyzing them. To address this, this thesis applies text clustering to the large volume of user comments on trending Weibo events, taking the characteristics of Weibo comments into account, and works out a feasible, text-clustering-based method for processing Weibo comment information. The aim is to group comments with similar content into clusters so as to understand the different views society holds on trending events, to support effective public opinion analysis and detection, and, for specific events, to help decision makers better understand public opinion and inform decisions and reform.

The main work is as follows. First, the thesis analyzes the characteristics of Weibo text, surveys common text analysis methods, and describes cluster analysis techniques, including the definition of clustering, its forms, and similarity measures. Second, based on the characteristics of Weibo information and how it is processed, it analyzes the steps for clustering Weibo comments: text preprocessing, text representation, and cluster analysis. For preprocessing it discusses Chinese word segmentation, stop-word filtering, and text denoising; for representation it discusses several text representation methods and feature-term weighting schemes; for clustering it analyzes different approaches and describes several algorithms. These discussions determine the specific methods adopted in the thesis.

Text denoising is carried out in R, and the jiebaR package is used for Chinese word segmentation, stop-word filtering, and other preprocessing. After comparing several text representation methods, the thesis represents Weibo comment text with the vector space model. For clustering, it uses the widely used k-means algorithm; because k-means is sensitive to initial points and outliers and requires k to be set manually, the k-medoids algorithm is added. k-medoids is similar to k-means but robust to outliers, and with the pamk function in R, k does not need to be set manually. In implementing the algorithms, the thesis analyzes how different values of k and different initial points affect the clustering results and discusses ways to implement k-medoids and k-means in R. The Weibo comments are visualized with word clouds and term networks.

The thesis collects more than 4,000 comments on a Weibo post published by CCTV News on April 26 about the launch of China's first domestically built aircraft carrier. After data preprocessing and text representation, term clustering and document clustering are carried out on the structured data. The experiments show that the choice of random seed has little effect on the clustering results, and because the data set is not large, there is no noticeable difference in algorithm running time. When hierarchical clustering is used to cluster the feature terms, Ward's method (sum of squared deviations) and complete linkage (maximum distance) give better results. k-medoids clustering yields an optimal number of clusters of 2, with an average silhouette width of 0.69, indicating that the two clusters are well separated. Because the thesis uses dictionary-based word segmentation and the vector space model, the semantic links between feature terms are weak, which makes the clustering results not sufficiently reasonable.
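The pipeline summarized in the abstract (vector space representation, k-medoids clustering, and choosing the number of clusters by average silhouette width) can be illustrated with a minimal sketch. It is written in Python with the standard library only, not in the R used in the thesis, and it is not the author's code. The tokenized comments are hypothetical English stand-ins for segmented Weibo comments, the TF-IDF weighting and cosine distance are assumed choices, the clustering loop is a simple alternating k-medoids rather than the full PAM procedure, and picking k by the largest average silhouette width is only similar in spirit to what pamk does.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Vector space model: each document becomes a vector of TF-IDF weights."""
    vocab = sorted({term for doc in docs for term in doc})
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] / len(doc) * math.log(n / doc_freq[t]) for t in vocab])
    return vectors


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 if nu == 0 or nv == 0 else 1.0 - dot / (nu * nv)


def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix (simplified, not full PAM)."""
    medoids = list(range(k))  # assumption: start from the first k points
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            # Keep the old medoid if a cluster happens to end up empty.
            members = [i for i, lab in enumerate(labels) if lab == c] or [medoids[c]]
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids), medoids


def average_silhouette(dist, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    n = len(dist)
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        others = set(labels) - {labels[i]}
        if not own or not others:
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
                for c in others)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


# Hypothetical, already-segmented comments on two themes (pride vs. hardware).
docs = [
    ["proud", "motherland", "strong"],
    ["catapult", "deck", "engine"],
    ["proud", "motherland"],
    ["catapult", "deck"],
    ["strong", "motherland", "proud", "salute"],
    ["engine", "deck", "catapult", "radar"],
]
vectors = tfidf_vectors(docs)
dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]

# Try several k and keep the one with the largest average silhouette width.
results = {}
for k in range(2, 5):
    labels_k, _ = k_medoids(dist, k)
    results[k] = (average_silhouette(dist, labels_k), labels_k)
best_k = max(results, key=lambda k: results[k][0])
best_asw, labels = results[best_k]
print(best_k, labels)
```

On this toy data the two themes share no vocabulary, so the sketch selects k = 2 and separates the comments by theme. Real comment data is far noisier, and the weak semantic links between terms that the abstract mentions would show up as lower silhouette widths.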

[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1
