
Research on Social Network Data Mining Based on Natural Language Processing

Published: 2018-06-25 23:14

Topic: Weibo + word segmentation; Source: master's thesis, North China Electric Power University, 2017


【Abstract】: Weibo is a highly popular social platform on which users share and exchange information in real time through short text and multimedia posts. Although individual posts are short, the data accumulated over time carries rich information such as users' personalized characteristics. This user data holds considerable social-information value, so mining Weibo user data is important for social network development and social information analysis. The core task of social network data mining is to analyze and mine users' massive short texts on Weibo to extract information such as their personalized characteristics. The first step is to collect large volumes of Weibo data from the network and store it in a specific format; the collected posts are then segmented into words and converted into feature representations, and finally data mining methods perform user identification and user-type analysis. This thesis uses web crawler technology to design a user data crawling system based on simulated login, providing a way to obtain large amounts of user Weibo data from the network. Given the structure of the user data, a JSON-based NoSQL database is used for storage. To address the difficulty that existing segmentation methods have in discovering new words, a Chinese word segmentation method is proposed that fuses dictionary matching with statistical tagging: it builds on dictionary matching, incorporates a CRF tagging algorithm, and trains iteratively during segmentation to give the algorithm a self-learning capability. By fusing the matching and tagging methods and selecting segmentation results according to Chinese semantic regularities, the method effectively improves segmentation accuracy and out-of-vocabulary word discovery. Experiments on the test corpus show that the proposed method raises the F value by 9.6% over the maximum forward matching algorithm and by 2.9% over the CRF tagging algorithm, better meeting practical application needs. Current Weibo data mining mainly uses one-hot feature representations, which cannot express contextual semantics. This thesis instead adopts a word2vec-based user feature representation, which injects contextual information and reduces the dimensionality of user feature vectors, improving the computational efficiency of subsequent data mining algorithms. Analysis of Weibo user data shows that some spam users introduce noise into data mining, so an SVM-based spam-user identification model is designed, reaching an F value of 0.94 on the test set. Finally, users are partitioned into communities with K-means clustering according to the content they follow; because the number of communities is uncertain, the optimal number of cluster centers is computed with the DB-index, improving between-cluster separation and within-cluster similarity of the clustering results.
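The abstract describes storing crawled user data as JSON documents in a NoSQL database. A minimal sketch of what one such record might look like before loading into a document store; the field names and values here are illustrative assumptions, not the thesis's actual schema:

```python
# Illustrative JSON-style record for one crawled Weibo user (field
# names are hypothetical). One document per line is a common layout
# for bulk-loading into JSON-based NoSQL stores.
import json

record = {
    "user_id": "u_0001",
    "posts": [{"time": "2017-01-01T12:00:00", "text": "示例微博文本"}],
    "followers": 42,
    "following": 10,
}
line = json.dumps(record, ensure_ascii=False)  # serialize as one line
restored = json.loads(line)                    # round-trips losslessly
print(restored["followers"])  # 42
```

Schema-free storage like this lets records with different post counts or extra fields coexist, which suits the heterogeneous data a crawler collects.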
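The maximum forward matching algorithm that the thesis uses as its dictionary-based baseline can be sketched in a few lines. The toy dictionary and maximum word length below are illustrative assumptions, not taken from the thesis:

```python
# Maximum forward matching: greedily take the longest dictionary word
# starting at the current position; fall back to a single character
# when nothing matches. This is the baseline the proposed
# dictionary + CRF method is compared against.

def forward_max_match(text, dictionary, max_len=4):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

vocab = {"自然", "语言", "自然语言", "处理", "数据", "挖掘", "数据挖掘"}
print(forward_max_match("自然语言处理数据挖掘", vocab))
# → ['自然语言', '处理', '数据挖掘']
```

The greedy strategy fails on out-of-vocabulary words, which is exactly the weakness the thesis's fused dictionary/CRF approach targets.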
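The word2vec-based user representation the abstract describes is typically built by pooling the vectors of the words a user posts. A hedged sketch with a toy embedding table standing in for a trained word2vec model (gensim's `Word2Vec` would supply real vectors in practice):

```python
# Represent a user as the mean of the word2vec vectors of words in
# their posts. Unknown words are skipped; the tiny embedding table
# below is a stand-in assumption, not a trained model.
import numpy as np

def user_vector(words, embeddings, dim=4):
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

emb = {"微博": np.array([1.0, 0.0, 0.0, 0.0]),
       "分词": np.array([0.0, 1.0, 0.0, 0.0])}
v = user_vector(["微博", "分词", "unknown"], emb)
print(v)  # element-wise mean of the two known word vectors
```

Unlike one-hot encoding, this keeps the dimension fixed at the embedding size (dense and small) rather than the vocabulary size, which is the efficiency gain the abstract claims.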
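The SVM-based spam-user identification step could look like the sketch below. The features (posts per day, following count, follower count) and the synthetic data are assumptions for illustration; the thesis's actual features and its reported F value of 0.94 come from real Weibo data:

```python
# Hedged sketch of an SVM spam-user classifier. Synthetic assumption:
# spam users post a lot, follow many accounts, and have few followers.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
normal = rng.normal([10, 200, 150], 5, size=(100, 3))  # genuine users
spam = rng.normal([80, 300, 5], 5, size=(100, 3))      # spam users
X = np.vstack([normal, spam])
y = np.array([0] * 100 + [1] * 100)

# Train on alternating samples, evaluate on the rest.
clf = SVC(kernel="rbf", gamma="scale").fit(X[::2], y[::2])
pred = clf.predict(X[1::2])
print(round(f1_score(y[1::2], pred), 2))  # near 1.0 on this easy split
```

Real Weibo features overlap far more than this synthetic split, which is why the reported F value sits at 0.94 rather than 1.0.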
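The final community-partitioning step, K-means with the DB-index (Davies-Bouldin index, where lower is better) selecting the number of clusters, can be sketched with scikit-learn. The synthetic 2-D points below stand in for real user feature vectors:

```python
# Choose the K-means cluster count by minimizing the Davies-Bouldin
# index, as the abstract describes. Three well-separated synthetic
# blobs stand in for user feature vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)  # lowest DB-index wins
print(best_k)  # → 3
```

A low DB-index means clusters are compact relative to the distance between their centers, matching the abstract's goal of high between-cluster separation and within-cluster similarity.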
【Degree-granting institution】: North China Electric Power University
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP391.1







