Research and Implementation of Key Technologies for Sentiment Analysis of Short Weibo Texts
Published: 2018-08-27 18:12
【Abstract】: With the rise of social networks and the arrival of the Weibo self-media era, hundreds of millions of posts are generated on the Internet every day. This massive volume of Weibo text data contains rich, multi-dimensional, multi-level, and diverse information about individuals, society, enterprises, and government. Analyzing post content, monitoring online public opinion, and identifying the sentiment orientation expressed in posts therefore have significant theoretical and practical value. This thesis collects massive Weibo data by simulating user login, applies natural language processing techniques such as word segmentation, part-of-speech tagging, and topic word extraction, and, combining a sentiment lexicon with a Weibo corpus, builds a vector space model and dynamically adjusts parameters such as the weights of sentiment influence factors to perform sentiment analysis on the data. The work of this thesis is as follows. First, massive Weibo data is collected using simulated-browser technology combined with HttpWatch 8.5 packet-capture analysis. Second, a Chinese word segmenter, SkyLightAnalyzer, is designed and implemented based on a hidden Markov model and an N-gram language model; its main functions include word segmentation, part-of-speech tagging, word sense disambiguation, and out-of-vocabulary word recognition. Third, on top of this segmenter, topic word extraction and sentiment unit extraction for Weibo posts are implemented with an algorithm that combines statistics and rules. Fourth, a weighting algorithm based on the vector space model with dynamically adjusted sentiment influence factors is proposed, and a sentiment orientation analysis method based on blogger personalization modeling and content analysis is designed and implemented. Experiments and practical use demonstrate the effectiveness of the proposed algorithms. The thesis also discusses remaining shortcomings and plans for future work.
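The abstract names a hidden Markov model combined with an N-gram language model as the basis of the SkyLightAnalyzer segmenter, but gives no implementation details. The sketch below is only a minimal illustration of the general HMM segmentation technique, assuming character-level BMES tagging decoded with Viterbi; the probability tables, function names, and parameters are hypothetical placeholders and are not taken from the thesis.

```python
# Minimal sketch of HMM-based Chinese word segmentation with BMES character tags
# and Viterbi decoding. Illustrative only: the state set, probability tables,
# and helper names are assumptions, not the thesis's SkyLightAnalyzer code.
import math

STATES = ("B", "M", "E", "S")  # begin / middle / end of a word, single-char word


def viterbi(sentence, start_p, trans_p, emit_p, min_p=1e-12):
    """Return the most likely BMES tag sequence for a character string."""
    if not sentence:
        return []
    # log-probabilities for the first character under each state
    V = [{s: math.log(start_p.get(s, min_p)) +
             math.log(emit_p[s].get(sentence[0], min_p)) for s in STATES}]
    path = {s: [s] for s in STATES}
    for ch in sentence[1:]:
        V.append({})
        new_path = {}
        for s in STATES:
            # best previous state leading into state s for this character
            prob, prev = max(
                (V[-2][p] + math.log(trans_p[p].get(s, min_p)) +
                 math.log(emit_p[s].get(ch, min_p)), p)
                for p in STATES)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(STATES, key=lambda s: V[-1][s])
    return path[best]


def tags_to_words(sentence, tags):
    """Cut the sentence into words according to the BMES tags."""
    words, buf = [], ""
    for ch, t in zip(sentence, tags):
        buf += ch
        if t in ("E", "S"):  # a word ends at E or at a single-character word
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words
```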
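The vector space model with dynamically adjusted sentiment influence factor weights is likewise described only at a high level. The following sketch shows one plausible way such weighting could work, assuming negation and degree-adverb factors applied to extracted sentiment units plus a per-blogger bias term; the factor sets, weight values, and thresholds are invented for illustration and do not reproduce the thesis's algorithm.

```python
# Minimal sketch of sentiment scoring over extracted sentiment units with
# dynamically adjusted influence factors. The cue lists, weights, blogger bias,
# and threshold below are hypothetical placeholders, not the thesis's scheme.
NEGATIONS = {"不", "没", "没有", "别"}                 # assumed negation cues
DEGREE_ADVERBS = {"很": 1.5, "非常": 2.0, "有点": 0.7}  # assumed intensifier weights


def score_sentiment_unit(unit, lexicon):
    """Score one sentiment unit: (modifiers, sentiment word) -> signed strength."""
    modifiers, word = unit
    base = lexicon.get(word, 0.0)              # polarity strength from the sentiment lexicon
    weight = 1.0
    for m in modifiers:
        if m in NEGATIONS:
            weight *= -1.0                      # negation flips polarity
        weight *= DEGREE_ADVERBS.get(m, 1.0)    # degree adverbs rescale strength
    return base * weight


def classify_post(units, lexicon, blogger_bias=0.0, threshold=0.1):
    """Aggregate unit scores plus a per-blogger prior into a polarity label."""
    total = sum(score_sentiment_unit(u, lexicon) for u in units) + blogger_bias
    if total > threshold:
        return "positive", total
    if total < -threshold:
        return "negative", total
    return "neutral", total


# Example with a toy lexicon and two sentiment units extracted from a post.
lexicon = {"喜欢": 1.0, "满意": 0.8}
print(classify_post([(["非常"], "喜欢"), (["不"], "满意")], lexicon))
```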
【Degree-granting institution】: 河北科技大学 (Hebei University of Science and Technology)
【Degree level】: Master's
【Year awarded】: 2014
【CLC number】: TP391.1; TP393.092
Document ID: 2208046
Link: https://www.wllwen.com/guanlilunwen/ydhl/2208046.html