A Data Stream Classification Algorithm Based on K-Associated Graphs and Its Application to Weibo Sentiment Analysis

Published: 2018-05-26 15:30

Topic: Weibo + data stream; Source: Zhengzhou University, master's thesis, 2014

[Abstract]: With the arrival of the golden age of information, people are increasingly aware of the importance of data, yet mining useful information from such large volumes of data is becoming ever harder. The rise of Weibo in particular produces a large amount of text every day. Because these posts are short and each carries little information, they are commonly called short text streams. Short text streams contain a wealth of opinions: product reviews are valuable to both sellers and buyers, and comments on trending events help government departments understand public attitudes toward those events. How to mine useful knowledge from short text streams is therefore a question of wide interest, and this demand has made data stream mining an active and difficult research topic in recent years.

Building on a survey of established data stream classification algorithms, this thesis proposes a data stream classification algorithm based on K-associated graphs, the K-associated Graphs Based Classifier (KGBC). The algorithm first represents each data chunk as a K-associated graph, which captures both the similarity relations among data instances and the purity of each subgraph. Then, using the partition of the data chunk produced by the K-associated graph optimization algorithm, it selects from the pool of base classifiers those whose concepts are similar to that of the chunk currently being classified. Finally, it combines the selected base classifiers into an ensemble and classifies the test data, using concept similarity as the weight of each base classifier. Because the algorithm does not retrain a classifier every time a new data chunk arrives, it saves time. Experiments show that KGBC achieves relatively good prediction accuracy.

The second contribution of the thesis is sentiment analysis of short text streams. The key to this task is determining the sentiment orientation of a text message, and the prerequisite for that is a sentiment lexicon suited to Weibo text. The thesis therefore proposes a dependency-syntax-based algorithm for extracting Weibo sentiment words: it summarizes a set of templates from the positions where sentiment words commonly occur in dependency structures, and uses these templates to identify new sentiment words appearing online automatically. Given the diversity of expression in Chinese Weibo posts, the thesis uses sentiment words, parts of speech, contextual relations, and topic features from Weibo text as features for sentiment classification. Experiments comparing KGBC with traditional sentiment classification algorithms verify the effectiveness of KGBC for sentiment classification of short text streams.
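The abstract does not define the K-associated graph, subgraph purity, or concept similarity. The sketch below is an illustration under assumptions, not the thesis's implementation. It follows a definition common in the K-associated graph literature: vertex i gets a directed edge to j when j is one of i's K nearest neighbours and shares i's class label, and the purity of a connected component is its average degree divided by 2K. The function names (`k_associated_graph`, `components`, `purity`, `weighted_vote`), the Euclidean distance, and the toy data are all hypothetical. The weights passed to `weighted_vote` stand in for concept similarities, which are treated here as given numbers because the abstract does not say how they are computed.

```python
import math
from collections import defaultdict


def k_associated_graph(points, labels, k):
    """Directed edges (i, j): j is among i's k nearest neighbours
    (Euclidean distance, an assumption) and has the same label as i."""
    edges = []
    n = len(points)
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            if labels[i] == labels[j]:
                edges.append((i, j))
    return edges


def components(n, edges):
    """Connected components, ignoring edge direction (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for v in range(n):
        groups[find(v)].append(v)
    return list(groups.values())


def purity(component, edges, k):
    """Average degree of the component divided by 2k.
    Each directed edge adds one to the degree of both endpoints."""
    members = set(component)
    degree_sum = 2 * sum(1 for i, _ in edges if i in members)
    return degree_sum / len(component) / (2 * k)


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight. The weights stand
    in for concept similarities, which are assumed to be given."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Toy data chunk: two well-separated clusters, one class each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
edges = k_associated_graph(points, labels, k=2)
for comp in components(len(points), edges):
    print(sorted(comp), purity(comp, edges, k=2))
```

Under this definition, a component in which every vertex finds all K of its nearest neighbours in its own class has purity 1, and mixing between classes lowers the value. That is one way a graph can express the "purity of each subgraph" that the abstract mentions.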
[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092; TP311.13
