Research on Weibo Sentiment Analysis Methods Based on Semi-Supervised Learning

Published: 2018-05-25 16:06

Topic: Weibo + sentiment analysis; Source: Shandong University of Finance and Economics, 2014 master's thesis

[Abstract]: The rapid growth of Weibo has left its platform with a large body of text that holds much valuable information, including business information, social networks, and users' opinions and sentiment. The short length of Weibo posts makes text analysis challenging, and inherent characteristics of Chinese text further reduce analysis performance. To address these characteristics, this thesis applies semi-supervised learning to sentiment classification of Weibo text, combining language resources with annotated sets to train and optimize sentiment classifiers. Sentiment classification comprises two tasks: identifying sentiment polarity (e.g., positive or negative) and identifying the emotion category (e.g., happiness or anger). The main work of this thesis is as follows:

1) Weibo information extraction. Weibo data are collected through the API provided by the Weibo operator. Using trending topics and verified users as entry points, topic-related posts, user posts, and their comment texts are gathered.

2) Semi-supervised learning. Building on existing annotated sets, active learning is used to label the sentiment polarity and emotion category of Weibo texts, which reduces annotation cost. The annotated data set is then used for supervised learning with maximum entropy, neural network, and support vector machine models; the different models are optimized, and their errors and learning curves are analyzed.

3) Feature extraction. Features are extracted using existing language resources and open-source software, such as the Affective Lexicon Ontology (情感词汇本体) and the Tongyici Cilin synonym thesaurus (同义词词林). Basic features include the terms in the text, parts of speech, and Cilin codes. Because the text feature space is high-dimensional, PCA is used to reduce its dimensionality. During model optimization, the accuracy of different feature-space combinations and models is compared.

Part of the feature extraction pipeline, such as natural language processing and Weibo information processing, runs on a distributed computing framework to improve runtime efficiency. Sentiment polarity analysis reaches an accuracy of 0.7, which has some practical value. Multi-category sentiment analysis is less accurate, at 0.34: owing to insufficient annotated corpora and the complexity of emotional expression in text, higher-frequency categories such as like and disgust are classified relatively well, while surprise, fear, and similar categories are classified poorly. The results of sentiment analysis can be applied to public opinion monitoring, market research, and social computing, and have some commercial value. Building on this analysis, the structure and temporal dynamics of online networks can be combined to analyze information propagation and audiences and to uncover users' behavior patterns and regularities. Combined with user characteristics, one can further infer users' true emotional and psychological states when they post information or take similar actions; this is called affective computing and is the ultimate goal of sentiment analysis.
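The abstract says active learning is used to cut annotation cost but does not give the procedure. As a rough illustration only, the sketch below shows one common variant, pool-based uncertainty sampling, with a small binary maximum-entropy (logistic regression) model written in plain Python. The vocabulary, toy texts, `toy_oracle` annotator, and all function names are hypothetical and not taken from the thesis, whose actual query strategy, features, and models may differ.

```python
import math

def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary (whitespace tokens)."""
    tokens = text.split()
    return [float(tokens.count(w)) for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=200, lr=0.5):
    """Binary maximum-entropy (logistic regression) fit by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    """Probability of the positive class for feature vector x."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit(labeled, vocab):
    X = [featurize(t, vocab) for t, _ in labeled]
    y = [label for _, label in labeled]
    return train_logreg(X, y)

def active_learning(labeled, pool, oracle, vocab, rounds):
    """Each round: retrain, query the pool text whose prediction is closest
    to 0.5 (the most uncertain one), and add the oracle's label for it."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        model = fit(labeled, vocab)
        query = min(pool, key=lambda t: abs(predict_proba(model, featurize(t, vocab)) - 0.5))
        pool.remove(query)
        labeled.append((query, oracle(query)))
    return fit(labeled, vocab), labeled

# Hypothetical toy data: English tokens stand in for segmented Weibo text.
VOCAB = ["good", "great", "love", "bad", "awful", "hate"]
POSITIVE = {"good", "great", "love"}

def toy_oracle(text):
    """Stand-in for a human annotator: 1 = positive, 0 = negative."""
    tokens = text.split()
    positives = sum(t in POSITIVE for t in tokens)
    return 1 if positives > len(tokens) - positives else 0

seed = [("good great", 1), ("bad awful", 0)]
pool = ["love love", "hate hate", "great love", "awful hate", "good love", "bad hate"]

model, labeled = active_learning(seed, pool, toy_oracle, VOCAB, rounds=4)
print(len(labeled), "labeled examples after querying")
print(predict_proba(model, featurize("love great", VOCAB)))
print(predict_proba(model, featurize("hate awful", VOCAB)))
```

Only the queried texts are sent to the annotator, which is where the saving comes from: texts the current model already classifies confidently are left unlabeled. A multi-class version for emotion categories would need a different uncertainty measure, such as the entropy of the predicted distribution.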
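[Degree-granting institution]: Shandong University of Finance and Economics
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092;TP391.1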
