Research on Temporal Events in Personal Weibo
Published: 2019-03-15 14:44
[Abstract]: Weibo, as an emerging social media service, permeates and influences people's lives in many ways and has become an important platform for sharing information and exchanging emotions. Most personal Weibo content records the author's life experiences, professional interests, and discussions of trending topics, so Weibo data has become a carrier of personal history and emotion. Because posting is real-time and convenient, sometimes taking only seconds, personal Weibo has gradually replaced the diary, producing hour-by-hour or even minute-by-minute records. Over time the accumulated data becomes very large, and the only way to learn about a blogger is to browse their historical posts one by one, which wastes time. Understanding a blogger's activity quickly and accurately has therefore become a pressing problem, and Weibo classification is proposed to address it. In the classification process, the precision of Weibo similarity determines classification accuracy, so the focus of this thesis is how to improve the precision of Weibo similarity.

Because personal Weibo data are large in total volume while each post is short and highly informal, traditional classification methods and information-extraction algorithms face limitations when applied to them. Considering that a single post is short, contains few effective features, and is colloquial, this thesis expands the text's feature words with similar words to reduce the chance of losing features, and proposes a combined similarity algorithm based on improved Jaccard similarity and cosine similarity. First, the collected Weibo data are filtered to remove texts carrying no information as well as irrelevant links and images, and the Chinese Academy of Sciences' lexical analysis system ICTCLAS is used for word segmentation, part-of-speech tagging, and filtering of stop words and emoticons. Second, an improved TF-IDF algorithm is used to extract feature words, and an LDA (Latent Dirichlet Allocation) topic model is used to build similar-word templates, improving the precision of Weibo similarity: the feature-selection function CHI measures how important each feature word is to each category, feature words are required to be roughly uniformly distributed across that category's texts, and TF-IDF values are then computed to extract the feature words. Then, on the basis of the extracted feature words and the constructed similar-word templates, Jaccard similarity and cosine similarity are combined to compute the overall similarity between personal Weibo posts; this overcomes the shortcoming of traditional methods based only on word co-occurrence and measures the similarity of two posts more deeply and comprehensively, drawing on both similar-word features and individual numerical features. Finally, a K-Means temporal event classification algorithm groups the personal Weibo data so that posts on the same topic are placed in the same set.

Experimental results show that the proposed combined similarity algorithm is more precise than traditional similarity algorithms and, to a certain extent, improves the accuracy of temporal event classification for personal Weibo.
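To make the combined similarity concrete, here is a minimal, illustrative Python sketch (not code from the thesis): it blends a similar-word-aware Jaccard score over feature-word sets with a cosine score over TF-IDF-weighted vectors. The synonym groups stand in for the LDA-derived similar-word templates described above, and all function names, the toy data, and the weighting parameter `alpha` are assumptions.

```python
# Minimal sketch of a combined Jaccard + cosine similarity for short posts.
# Names, data, and the weighting parameter `alpha` are illustrative assumptions.
from collections import Counter
import math

def jaccard_with_templates(words_a, words_b, synonym_groups):
    """Jaccard similarity where words in the same 'similar word' group
    (e.g. derived from LDA topics) are treated as a match."""
    def canonical(word):
        for gid, group in enumerate(synonym_groups):
            if word in group:
                return f"group_{gid}"   # map the word to its group id
        return word                     # otherwise keep the word itself
    set_a = {canonical(w) for w in words_a}
    set_b = {canonical(w) for w in words_b}
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def cosine_tfidf(words_a, words_b, idf):
    """Cosine similarity over simple TF-IDF weights (idf: word -> weight)."""
    tf_a, tf_b = Counter(words_a), Counter(words_b)
    vocab = set(tf_a) | set(tf_b)
    va = {w: tf_a[w] * idf.get(w, 1.0) for w in vocab}
    vb = {w: tf_b[w] * idf.get(w, 1.0) for w in vocab}
    dot = sum(va[w] * vb[w] for w in vocab)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def combined_similarity(words_a, words_b, synonym_groups, idf, alpha=0.5):
    """Weighted blend of the two scores; alpha is an assumed parameter."""
    return (alpha * jaccard_with_templates(words_a, words_b, synonym_groups)
            + (1 - alpha) * cosine_tfidf(words_a, words_b, idf))

# Toy usage with already-segmented posts and a hand-made similar-word group.
posts = [["考研", "复习", "图书馆"], ["研究生", "备考", "图书馆"]]
groups = [{"考研", "研究生", "备考", "复习"}]
idf = {"图书馆": 1.2}
print(combined_similarity(posts[0], posts[1], groups, idf))
```

In a full pipeline along the lines of the abstract, the token lists would come from ICTCLAS segmentation after stop-word and emoticon filtering, the groups from an LDA topic model, the weights from the CHI-adjusted TF-IDF step, and the resulting pairwise similarities would feed a K-Means-style grouping of posts into temporal events.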
[Degree-granting institution]: Inner Mongolia University of Science and Technology
[Degree level]: Master's
[Year conferred]: 2014
[CLC number]: TP393.092
Article ID: 2440722
Article link: https://www.wllwen.com/guanlilunwen/ydhl/2440722.html