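Research and Implementation of Automatic Detection and Summarization Algorithms for Microblog News Events Based on Heterogeneous Networks
Published: 2018-06-29 23:13
Topics: heterogeneous information network + cross-modal fusion; Source: Southwest Jiaotong University, 2017 master's thesis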
[Abstract]: Microblogging platforms now play an important role in spreading information in real time. However, because microblog data is large in scale, highly real-time, and unstructured, common data mining methods are no longer suitable for processing it. Traditional microblog event detection and summarization methods ignore the rich visual and social information available on these platforms. To overcome this shortcoming and to help people quickly grasp the essential meaning of large volumes of microblog posts, this thesis takes about one million data records covering multiple hot topics on the well-known social networking site Twitter as its main research object and studies cross-modal microblog event detection and summarization. Considering multiple features, including textual, visual, social, and temporal ones, it proposes an event detection and summarization framework based on heterogeneous networks. First, in the data preprocessing stage, strict filtering patterns are defined to remove meaningless posts and images. Next, in the event detection stage, a heterogeneous network is used to model the heterogeneous nature of microblog data, a late multimodal fusion entity-similarity model combines the heterogeneous features of the Twitter data, and an approximate similarity algorithm generates a homogeneous graph from the fused features. An improved DBSCAN algorithm is then applied to this homogeneous similarity graph, incorporating a probabilistic model to address the problem of sub-topic fragmentation, and the resulting clusters are ranked by the popularity and novelty of their sub-topics. Finally, textual and visual summaries are generated for each topic. The contributions of this thesis are as follows: 1. Multimodal information is used to build a dynamic heterogeneous information network, addressing the inability of traditional methods to exploit the rich auxiliary information in microblogs. An AFF function fuses the multimodal features, taking their semantic similarity and spatio-temporal proximity into account to distinguish events. The heterogeneous network is then converted into a homogeneous network, which retains the key information while simplifying the structure for subsequent detection and summarization. 2. To improve the diversity of detection and summarization and to reduce topic fragmentation, the HRDBSCAN algorithm is proposed for the clustering stage; it builds on the original clustering algorithm and uses probabilistic and statistical methods to merge similar clusters. In the summarization stage, the sub-topic summary results are clustered again to ensure that each sub-topic appears only once in the summary. 3. Experiments are conducted on a Twitter dataset containing several real events, and the results demonstrate the novelty and superiority of the proposed framework compared with existing methods.
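The detection stage described in the abstract (fuse per-modality similarities into one homogeneous similarity graph, then run density-based clustering on it) can be illustrated with a small sketch. This is not the thesis's implementation: a simple weighted average stands in for the AFF fusion function, whose definition the abstract does not give, and plain DBSCAN stands in for HRDBSCAN, so the probabilistic cluster-merging step is omitted. The function names, the two similarity matrices, the weights, and the thresholds `min_sim` and `min_pts` are all hypothetical values chosen for illustration.

```python
from collections import deque


def fuse_similarities(sim_by_modality, weights):
    """Late fusion: weighted average of per-modality similarity matrices.

    Assumption: a weighted average replaces the thesis's AFF function.
    """
    total = sum(weights.values())
    n = len(next(iter(sim_by_modality.values())))
    fused = [[0.0] * n for _ in range(n)]
    for name, sim in sim_by_modality.items():
        w = weights[name] / total
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * sim[i][j]
    return fused


def dbscan_on_similarity(sim, min_sim, min_pts):
    """Plain DBSCAN on a similarity matrix; returns labels, -1 means noise."""
    n = len(sim)
    # Two posts are neighbors when their fused similarity reaches min_sim.
    neighbors = [[j for j in range(n) if sim[i][j] >= min_sim] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


# Hypothetical similarities for five posts (symmetric, 1.0 on the diagonal).
text_sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 1.0],
]
time_sim = [
    [1.0, 0.8, 0.6, 0.3, 0.2],
    [0.8, 1.0, 0.8, 0.4, 0.3],
    [0.6, 0.8, 1.0, 0.5, 0.4],
    [0.3, 0.4, 0.5, 1.0, 0.9],
    [0.2, 0.3, 0.4, 0.9, 1.0],
]

fused = fuse_similarities(
    {"text": text_sim, "time": time_sim},
    {"text": 0.7, "time": 0.3},  # illustrative weights, not from the thesis
)
labels = dbscan_on_similarity(fused, min_sim=0.6, min_pts=2)
print(labels)
```

With these made-up inputs, the first three posts and the last two posts fall into separate clusters. In a real pipeline the visual and social modalities would contribute further matrices to the same fusion step, and a merging pass over similar clusters would follow the clustering.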
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP393.092
Article ID: 2083771
Article link: https://www.wllwen.com/guanlilunwen/ydhl/2083771.html