Research on Machine-Learning-Based Extraction and Analysis of Person Relationship Information from Weibo
Topic: Weibo + person relationship extraction; Source: Beijing University of Posts and Telecommunications, master's thesis, 2017
[Abstract]: With the rapid development of Internet technology, social network research has become increasingly important for tasks such as public opinion monitoring and business analysis, and has therefore become a research hotspot. Weibo is the largest social networking community in China, and its corpus differs considerably from that of traditional media. Based on the characteristics of Weibo, this thesis studies how to improve the performance of person relationship extraction, and how to improve the prediction of person relationship strength, in the setting of social interaction among Weibo users. The main work and results are as follows:

(1) The thesis studies the characteristics of the Weibo corpus and analyzes the strengths and weaknesses of traditional person relationship extraction algorithms. To address the insufficient ability of traditional methods to recognize ambiguous (fuzzy) samples, it proposes the SVMDT_RFC algorithm model. The model improves the random forest algorithm by introducing SVM decision trees: it uses an SVM decision tree node-splitting algorithm that maximizes the classification margin, and a random forest voting algorithm weighted by the classification margin, to improve relationship extraction for ambiguous samples. Experiments comparing SVMDT_RFC with SVM and random forest show that the method improves the accuracy of person relationship extraction on ambiguous samples, with a more pronounced improvement on medium-length and long texts.

(2) The thesis studies traditional person relationship modeling methods. To address the problem that traditional models do not faithfully reproduce real-life relationships, and taking advantage of the rich sentiment information in Weibo text, it designs a person relationship modeling scheme that introduces sentiment intensity features. The scheme combines user attribute features with behavior features and analyzes users' sentiment intensity by constructing a sentiment lexicon and an emoticon lexicon, among other means. Introducing sentiment features into the model enables multi-dimensional modeling of person relationships, which simulates real relationships more accurately and improves the fidelity and effectiveness of the model.

(3) Building on this relationship model, the thesis proposes a person relationship strength prediction scheme based on a multilayer perceptron. First, ten-fold cross-validation experiments comparing it with a decision tree model and a maximum entropy model show that the proposed scheme improves the accuracy of relationship strength prediction. Second, a comparison of the traditional relationship model with the proposed one shows that introducing sentiment features improves prediction accuracy, which demonstrates the effectiveness of the proposed model. Finally, the scheme addresses the inaccuracy of traditional prediction schemes that can output only two results, strong and weak: it predicts relationship strength at multiple quantified levels, giving finer-grained and more accurate predictions. Comparing the results with and without sentiment features across relationships of different strength levels shows that multi-level quantified strength prediction supports deeper analysis and study of person relationships.

The thesis is organized as follows. Chapter 1 introduces the background of the topic and its significance for research on the Weibo network, and reviews the state of research on person relationship extraction and on relationship strength prediction. Chapter 2 first describes the workflow of a person relationship extraction system and the problems involved, then analyzes the problems in current extraction schemes. Chapter 3 analyzes how person relationships are modeled and the shortcomings of such models, and closes with a brief introduction to the relevant algorithms used. Chapter 4 first analyzes the problem with current extraction schemes, namely their insufficient ability to handle ambiguous samples; by introducing SVM decision trees to improve the random forest algorithm, it proposes a technical scheme for Weibo person relationship extraction based on the SVMDT_RFC algorithm. Chapter 5 addresses the problem that relationship extraction yields the relationship type but not its strength: it first presents a relationship model that introduces sentiment features and combines them with attribute and behavior features, then proposes a multilayer-perceptron-based relationship strength prediction scheme with multi-level quantified output. Chapter 6 summarizes the thesis, points out some shortcomings of the current work, and suggests directions for future improvement.
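The abstract does not give the formula behind the margin-weighted voting step of SVMDT_RFC, so the following is only a minimal Python sketch under stated assumptions. Each tree is assumed to report a predicted relation label together with the signed distance to the hyperplane of the SVM node that made its final decision, and the absolute value of that distance is assumed to serve as the vote weight. The function names `margin_weighted_vote` and `majority_vote` and the example labels are hypothetical and are not taken from the thesis.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# One tree's output for a sample: (predicted relation label, signed SVM margin).
# Assumption: the margin comes from the SVM node that decided the label.
TreeVote = Tuple[str, float]


def majority_vote(votes: List[TreeVote]) -> str:
    """Plain random-forest voting: every tree counts once, margins are ignored."""
    if not votes:
        raise ValueError("at least one tree vote is required")
    counts = Counter(label for label, _ in votes)
    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(counts), key=lambda label: counts[label])


def margin_weighted_vote(votes: List[TreeVote]) -> str:
    """Margin-weighted voting: each tree's vote is scaled by |margin|.

    Assumption: a larger distance to the separating hyperplane means a more
    confident tree. The thesis may define the weight differently.
    """
    if not votes:
        raise ValueError("at least one tree vote is required")
    scores: Dict[str, float] = defaultdict(float)
    for label, margin in votes:
        scores[label] += abs(margin)
    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(scores), key=lambda label: scores[label])


if __name__ == "__main__":
    # Two low-confidence trees say "colleague", one confident tree says "family".
    sample_votes = [("colleague", 0.1), ("colleague", 0.2), ("family", 1.5)]
    print(majority_vote(sample_votes))
    print(margin_weighted_vote(sample_votes))
```

In this toy case two low-margin trees outvote one confident tree under plain majority voting, whereas margin weighting lets the confident tree prevail. That is one plausible reading of why weighting by classification margin could help on ambiguous samples.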
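[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1; TP393.092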
[Citation]: Zhou Ge (周舸); Research on Machine-Learning-Based Extraction and Analysis of Person Relationship Information from Weibo [D]; Beijing University of Posts and Telecommunications; 2017
Article ID: 1921160
Article link: https://www.wllwen.com/guanlilunwen/ydhl/1921160.html