A Knowledge-Transfer Clustering Method for Short Texts Aided by Long Texts

Published: 2018-11-22 11:22

[Abstract]: With the rapid growth of the Internet, and especially the emergence of applications such as microblogs and online advertising, more and more short texts appear online, so understanding short texts has become an important task. Most traditional text mining methods were designed for long texts. Because short texts have sparse representations, most existing techniques cannot be applied to them effectively. To understand short texts better, we observe that topically related long texts can usually be found and used as auxiliary data. This thesis describes a novel short text clustering method that helps cluster short texts by transferring knowledge from auxiliary long text data. Most previous work on improving short text clustering ignored the semantic and topical inconsistencies between the short texts and the auxiliary long texts. To resolve these inconsistencies between the target data and the auxiliary data, we propose a new topic model, Dual Latent Dirichlet Allocation (DLDA). The model learns topics from the long text and short text data jointly. To handle the linguistic inconsistency between long and short texts, we design two model variants that treat long and short texts differently. One variant controls the topic selection of each document collection by adjusting the prior on the document-topic distribution; the other automatically controls which topics different documents select by modifying the assumptions of the document generation process. Large-scale clustering experiments on advertising and microblog (Twitter) data show that our method achieves better short text clustering results than current mainstream methods. They also show that accounting for the differences between the target short text dataset and the auxiliary long text dataset can substantially improve short text clustering.
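The first variant mentioned in the abstract, steering topic selection through the prior on the document-topic distribution, can be illustrated with a small generative sketch. This is not the thesis's implementation. It assumes, beyond what the abstract states, that the topics are split into a "target" group and an "auxiliary" group and that short and long documents receive mirrored asymmetric Dirichlet priors. The names `make_prior`, `generate_document`, `strong`, and `weak`, and all numeric values, are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def make_prior(n_target, n_aux, is_short, strong=1.0, weak=0.1):
    """Build an asymmetric document-topic prior (illustrative assumption).

    Short documents put most prior mass on the target topics,
    long documents put most prior mass on the auxiliary topics.
    """
    if is_short:
        return [strong] * n_target + [weak] * n_aux
    return [weak] * n_target + [strong] * n_aux


def generate_document(n_words, topic_word, alpha, rng):
    """LDA-style generation: theta ~ Dir(alpha), then a topic and a word per token."""
    theta = sample_dirichlet(alpha, rng)
    topics = rng.choices(range(len(theta)), weights=theta, k=n_words)
    words = [
        rng.choices(range(len(topic_word[z])), weights=topic_word[z], k=1)[0]
        for z in topics
    ]
    return topics, words


rng = random.Random(0)
n_target, n_aux, vocab_size = 3, 3, 20

# One word distribution per topic, shared by both corpora so that
# knowledge learned from long texts can inform the short texts.
topic_word = [
    sample_dirichlet([0.1] * vocab_size, rng) for _ in range(n_target + n_aux)
]

short_topics, short_words = generate_document(
    8, topic_word, make_prior(n_target, n_aux, is_short=True), rng
)
long_topics, long_words = generate_document(
    200, topic_word, make_prior(n_target, n_aux, is_short=False), rng
)
```

Because both priors keep a small non-zero weight on the other group, a short document can still draw on a topic that is mostly supported by long texts. That is the sense in which such a prior would control, rather than forbid, cross-corpus topic sharing.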
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1
