Research on Weibo-Based User Occupation Extraction

Published: 2018-03-25 19:18

Topic: user occupation　Focus: Weibo　Source: University of Science and Technology of China, master's thesis, 2017

[Abstract]: With the rapid development of information technology, the Internet has become deeply integrated into people's daily lives. Weibo, one of the main applications of the Internet era, plays an important role in knowledge sharing and information dissemination. As an emerging social networking tool, Weibo has a large user base, abundant data, and fast information propagation, which makes it possible to extract commercially valuable information from the platform, such as users' occupations and ages. Such data is valuable for advertising and personalized recommendation, so user information extraction from Weibo has become an active research direction in Internet information extraction.

This thesis studies the extraction of user occupation information on the Weibo platform. The main challenge is that Weibo does not provide occupation verification for ordinary users, while existing occupation extraction work relies on traditional feature extraction methods that are complex and time-consuming. This calls for new, efficient algorithms designed for Weibo user occupation extraction. To this end, the thesis approaches the problem from two angles: a method that combines word vectors with an occupation dictionary, and a method based on multilayer neural network models. The main work and contributions are as follows:

(1) A feature-engineering-based method for Weibo user occupation extraction is proposed. Most existing work on this problem stops at refining the extracted user features to improve accuracy, which is labor-intensive and hard to implement. This thesis extracts an occupation-related dictionary with an iterative word-similarity method, uses the dictionary to filter out redundant words, and then represents each user sample by the column-wise sum of the word vectors of all words that remain after cleaning. This not only removes redundant features but also strengthens the expressive power of the features, which reduces the extraction workload and improves extraction performance. Experiments on a real Weibo dataset show that the dictionary-filtering method reaches 87.12% accuracy, a 9% improvement over traditional feature extraction methods.

(2) Multilayer neural network models are applied to Weibo user occupation extraction, and the performance of the MLP, CNN, LSTM, and FastText models is compared experimentally, then discussed and analyzed. As the number of Weibo users grows rapidly and the range of occupations keeps expanding, an incomplete dictionary cannot accurately capture user feature information, and Weibo data is heavily affected by noise. Therefore, in applying multilayer neural network models, the thesis also proposes a domain-preference-based denoising algorithm for Weibo data and, on that basis, applies the multilayer neural network model FastText to Weibo user occupation extraction. Experiments show that the domain-preference-based denoising algorithm improves classification accuracy by nearly 5%.
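
The dictionary-based representation described in contribution (1) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not state which similarity measure, threshold, or embeddings were used, so cosine similarity, the threshold value, the toy two-dimensional vectors, and the function names `expand_dictionary` and `user_vector` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_dictionary(seeds, vectors, threshold=0.9, max_rounds=5):
    """Grow a seed dictionary iteratively (hypothetical helper).

    In each round, add every word whose vector is at least `threshold`
    similar to some word already in the dictionary; stop when a round
    adds nothing or after `max_rounds` rounds.
    """
    dictionary = {w for w in seeds if w in vectors}
    for _ in range(max_rounds):
        added = {
            w for w, vec in vectors.items()
            if w not in dictionary
            and any(cosine(vec, vectors[d]) >= threshold for d in dictionary)
        }
        if not added:
            break
        dictionary |= added
    return dictionary

def user_vector(tokens, dictionary, vectors, dim):
    """Represent a user by the per-dimension sum of the word vectors
    of the tokens that survive dictionary filtering (hypothetical helper)."""
    total = [0.0] * dim
    for tok in tokens:
        if tok in dictionary and tok in vectors:
            for i, x in enumerate(vectors[tok]):
                total[i] += x
    return total

# Toy 2-dimensional embeddings, made up for illustration only.
VECTORS = {
    "doctor": [1.0, 0.0],
    "nurse": [0.9, 0.1],
    "hospital": [0.8, 0.3],
    "lunch": [0.0, 1.0],
}

# "nurse" joins in the first round; "hospital" joins in the second round
# because it is close to "nurse" even though it is below the threshold
# relative to the seed word "doctor".
dictionary = expand_dictionary({"doctor"}, VECTORS, threshold=0.95)

# "lunch" is filtered out; the remaining vectors are summed.
features = user_vector(["nurse", "lunch", "doctor", "lunch"], dictionary, VECTORS, dim=2)
```

In this sketch the resulting `features` vector would be passed to any standard classifier. The iterative expansion shows why a single similarity pass is not equivalent: words only indirectly related to the seed terms are picked up in later rounds.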
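[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092; TP391.1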

[Similar literature]