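Social Media Topic Mining and Literature Impact Prediction Analysis Based on Topic Models
Posted: 2018-04-08 14:11
Topic of this thesis: topic model. Entry point: social media. Source: Southwest University, 2017 master's thesis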
[Abstract]: The maturing of Web 2.0 and Internet technology has made user-generated content a new way of using the Internet. Users are now both consumers and creators of Internet resources, which has transformed how people interact with the Internet. People tend to share original, personalized advice on online platforms, and opinion leaders and experts are likewise willing to share professional content and contribute their knowledge to their fields. For example, ordinary users share their daily lives on social media platforms such as Twitter, while experts publish scientific literature on academic platforms for others to study and read. Both kinds of content are text, yet they differ greatly in text mining methods and in the applications explored. The research challenge they share is how to find, efficiently and accurately, the information that different users need in massive amounts of data. The main work of this thesis is to use topic models for topic mining of short social media texts and for predicting the future impact of scientific literature. The main idea of a topic model is to use the latent topics of the text to uncover the relationships between documents and topics and between topics and words, or to use those relationships to guide the model's results. Constructing a suitable topic model for each scenario serves different purposes. Because Twitter texts are short, sparse, and informally worded, traditional models such as LDA and PLSA cannot analyze topics effectively in this setting. Notably, compared with traditional impact evaluation methods based on citation statistics, applying the semantic analysis of topic models to predicting the future impact of literature is a novel and challenging idea. Given the shortcomings of traditional methods, the particularities of the different application scenarios, and the effectiveness of topic models for text mining, this thesis focuses on two studies: (1) topic mining analysis of short social media texts, and (2) literature impact prediction based on topic-semantic analysis. Working with short social media texts and long literature texts respectively, the thesis uses the time and hashtag attributes of Twitter to improve and extend the LDA model and, for literature, defines feature words/phrases by reading the literature and combines an article's novelty with its importance as estimated by LDA to predict impact. To study topic mining in the short-text setting of social media, the thesis proposes a new topic model, HTTM, which uses the time and hashtag information in Twitter messages (tweets) in turn to add a new "hashtag-time" level to traditional LDA, in order to improve topic expressiveness, tweet clustering, and the modeling of topic evolution over time. Experimental results demonstrate the effectiveness of HTTM in these respects. For literature impact prediction, the thesis proposes the TTRM model to predict the future impact of literature. Using an article's feature words/word pairs as the link, the model derives novelty from the publication time and importance from the article's content. The importance modeling makes novel use of a topic model to analyze how important an article is within the current literature collection. Experiments on a literature dataset confirm the effectiveness of TTRM in literature ranking and in fitting impact predictions. In comparisons with the citation-based PageRank model and with the MRR-ranking model, which uses TF-IDF to model article importance, TTRM shows some advantage in both literature ranking and impact prediction. The results also confirm our hypothesis that certain words in an article's content contribute to its novelty and help in discovering new literature.
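The abstract names citation-based PageRank as one of the baselines but does not describe how it was configured. As a rough illustration only, the following is a minimal sketch of PageRank by power iteration over a citation graph. The function name `pagerank`, the damping factor of 0.85, the iteration count, and the toy graph are all assumptions made for this example, not details taken from the thesis.

```python
def pagerank(citations, damping=0.85, iterations=100):
    """Power-iteration PageRank over a citation graph.

    `citations` maps each paper id to the list of paper ids it cites.
    The damping factor and iteration count are illustrative defaults.
    """
    # Collect every paper that cites or is cited.
    papers = set(citations)
    for cited_list in citations.values():
        papers.update(cited_list)
    papers = sorted(papers)
    n = len(papers)

    # Start from a uniform distribution.
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Papers that cite nothing spread their rank uniformly.
        dangling = sum(rank[p] for p in papers if not citations.get(p))
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in papers
        }
        # Each citing paper splits its rank evenly among the papers it cites.
        for citing, cited_list in citations.items():
            if not cited_list:
                continue
            share = damping * rank[citing] / len(cited_list)
            for cited in cited_list:
                new_rank[cited] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: A and B cite C, and C cites D.
toy_citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
scores = pagerank(toy_citations)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph the papers that have not yet been cited (A and B) receive the lowest scores whatever their content. That is one plausible reason a content-based signal, such as the topic-model importance described in the abstract, could help with newly published papers.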
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1