Research on Chinese Sentiment Orientation Classification Based on a Sentiment Character Set

Published: 2018-12-25 14:36

[Abstract]: Sentiment orientation classification generally refers to automatically classifying the sentiment polarity of a text (positive, negative, neutral, and so on). In the big-data era it is mainly used to gauge the public's attitude toward an event, person, or group. Traditional approaches are very time-consuming and quite limited. Today, collecting the vast amount of information on the Internet makes it faster and easier to obtain other people's opinions, and opinions drawn from such large volumes of information are often more reliable. This thesis first reviews Chinese sentiment orientation classification based on sentiment lexicons and runs a conventional classification experiment using the ICTCLAS word segmenter and the HowNet sentiment lexicon. Analysis of the results shows that whichever segmentation tool or sentiment lexicon is used, it introduces some unpredictable interference into the classification results; in particular, different sentiment lexicons differ greatly in reliability and in the categories they cover. To address this, the thesis proposes the concept of a "sentiment character set": these characters are independent of the application category and require no Chinese word segmentation. The first goal is therefore to find such a set, made up of characters that by themselves influence the sentiment orientation of the words they form, or that themselves carry a strong sentiment orientation. Two versions of the sentiment character set are mined from two different sources and evaluated separately, with differing results; the better-performing version is then used as the basis for the following improvements to the computation of the sentiment orientation value. Because there is no segmentation step, commonly used "negation characters" and "degree characters" are compiled from the common negation words and degree words, and their effect on sentiment characters is incorporated into the experimental algorithm. In character-set-based classification, a sentence's sentiment orientation value is computed from the sentiment value of each character, with all characters treated as fully independent. Since splitting certain special phrases into characters can distort the sentence's sentiment orientation, forward maximum matching is used to recognize these phrases. Finally, by finding associations between characters, the information entropy of consecutive characters of the same type is reduced, which further improves accuracy. The highest accuracy is nearly 20% higher than that of the traditional word-based approach.
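The character-level scoring described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the lexicon entries, weights, and function names (`forward_max_match`, `sentence_score`, `classify`) are all hypothetical, and the entropy-based refinement is omitted. The sketch assumes that a negation character flips the sign of the next sentiment-bearing token, that a degree character scales it, and that special phrases are recognized by forward maximum matching before characters are scored one by one.

```python
# Illustrative lexicons only; these entries and weights are assumptions,
# not data from the thesis.
SENTIMENT_CHARS = {"好": 1.0, "喜": 1.0, "美": 1.0, "差": -1.0, "坏": -1.0, "烂": -1.0}
NEGATION_CHARS = {"不", "没", "非"}
DEGREE_CHARS = {"很": 1.5, "太": 2.0, "最": 2.0, "稍": 0.5}
# Phrases whose sentiment would be misjudged if split into characters.
SPECIAL_PHRASES = {"差不多": 0.0, "了不起": 1.0}
PUNCTUATION = set("，。！？；,.!?;")


def forward_max_match(text, phrases, max_len):
    """Scan left to right, taking the longest phrase found at each position;
    fall back to a single character when no phrase matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in phrases:
                tokens.append(text[i:i + size])
                i += size
                break
        else:  # no phrase of length >= 2 starts here
            tokens.append(text[i])
            i += 1
    return tokens


def sentence_score(text):
    """Sum per-token sentiment values, applying pending negation and degree
    modifiers to the next sentiment-bearing token (a simplification)."""
    max_len = max(len(p) for p in SPECIAL_PHRASES)
    score, negate, weight = 0.0, False, 1.0
    for tok in forward_max_match(text, SPECIAL_PHRASES, max_len):
        if tok in SPECIAL_PHRASES:
            value = SPECIAL_PHRASES[tok]
        elif tok in SENTIMENT_CHARS:
            value = SENTIMENT_CHARS[tok]
        elif tok in NEGATION_CHARS:
            negate = not negate
            continue
        elif tok in DEGREE_CHARS:
            weight *= DEGREE_CHARS[tok]
            continue
        else:
            if tok in PUNCTUATION:  # modifiers do not cross clause boundaries
                negate, weight = False, 1.0
            continue
        score += (-value if negate else value) * weight
        negate, weight = False, 1.0
    return score


def classify(text):
    """Map the sentence score to a polarity label."""
    score = sentence_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `classify("服务不太好")` returns `"negative"`, because 不 flips and 太 doubles the value of 好. `classify("差不多")` returns `"neutral"`, because the phrase is matched as a whole instead of being scored through the negative character 差.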
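[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1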

[References]