
Discovery and Analysis of Hot Topics in the Financial Domain Based on Weibo

Published: 2018-05-17 22:04

  Thesis topic: TF-IWF-IDF + Word2Vec; Source: master's thesis, Beijing University of Posts and Telecommunications, 2016


[Abstract]: Weibo is a social platform that combines social entertainment, news, and information publishing, and it has a very large user base; the number of users who trade stocks and manage investments online has also grown sharply. Weibo generates a large volume of data every day, spanning many industries and a wide range of subjects. Its information is comparatively timely and authoritative, which is an important reason why stock investors and wealth-management users pay close attention to it. How to discover, from this mass of Weibo data, the financial hot topics that these users care about has become a focus for securities firms and wealth-management companies. This thesis addresses that problem: extracting hot topics in the financial domain from Weibo. It first introduces the techniques related to topic detection and tracking, together with the clustering algorithms used for topic discovery. It then analyzes these clustering algorithms, selects Single-Pass as the text clustering algorithm, and proposes an improved version. To address the problem that the IDF term in TF-IDF is a fixed value that cannot change dynamically with the data set, it proposes an incremental TF-IWF-IDF weighting method based on part of speech and position. Because traditional feature vectors ignore the semantics and context of feature terms, it further proposes a Word2Vec-based incremental TF-IWF-IDF feature-vector representation. To address the shortcomings of Single-Pass, it proposes a secondary clustering algorithm based on multiple topic centers. In comparative experiments on Weibo data, hot-topic discovery with the proposed method improves on the unmodified Single-Pass algorithm by roughly 10%. Finally, the thesis designs and implements a prototype system for financial hot topics based on this clustering algorithm: starting from an analysis of the functional requirements, it describes the system architecture and the design and implementation of each functional module in detail, and presents screenshots of the prototype.
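For readers unfamiliar with the baseline that the abstract starts from, the following is a minimal sketch of plain Single-Pass clustering, not the thesis's improved algorithm. It makes several assumptions that do not come from the source: documents are already tokenized, terms are weighted by raw counts rather than TF-IWF-IDF, each cluster keeps a single centroid, and the similarity threshold of 0.5 is arbitrary. The names `cosine` and `single_pass` are illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(docs, threshold=0.5):
    """Cluster tokenized documents in one pass over the input.

    Each document joins the most similar existing cluster when the
    similarity reaches the threshold; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    centroids = []  # summed term counts per cluster (assumed weighting)
    members = []    # document indices per cluster
    for i, tokens in enumerate(docs):
        vec = Counter(tokens)
        best, best_sim = -1, 0.0
        for c, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        if best >= 0 and best_sim >= threshold:
            # Cosine ignores scale, so a running sum acts like a mean centroid.
            centroids[best].update(vec)
            members[best].append(i)
        else:
            centroids.append(vec)
            members.append([i])
    return members


# Toy, made-up posts used only to exercise the function.
docs = [
    ["stock", "market", "rally"],
    ["stock", "market", "crash"],
    ["bank", "interest", "rate"],
    ["interest", "rate", "cut", "bank"],
]
clusters = single_pass(docs, threshold=0.5)
```

Because each document is compared only with the clusters that exist when it arrives, the result depends on input order and on the threshold. These are commonly cited weaknesses of Single-Pass, though the abstract does not say which specific problems its secondary clustering step targets.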
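[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1; TP393.092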



