Research on Microblog Retrieval and Microblog Filtering Based on Temporal Characteristics

Published: 2019-01-10 19:16

【Abstract】: With the rapid development of social media and the mobile Internet, techniques for processing short-text streams, of which microblogs are the representative example, have become increasingly important. Given the massive volume of microblogs and the diverse information needs of a large user base, microblog retrieval and microblog filtering have become indispensable components of microblog services. In recent years, the temporal characteristics of microblogs have attracted researchers' attention. Studies show that these characteristics offer a new way to improve microblog retrieval performance, and time-based retrieval has gradually become a research hotspot in microblog retrieval. This thesis focuses on using temporal characteristics to improve the performance of microblog retrieval and microblog filtering. It studies query modeling, document modeling, query-document relevance computation, and filtering models, aiming to use the temporal characteristics of microblogs to alleviate the difficulties that short text poses for content-based microblog retrieval, and to use the ranking information and temporal characteristics of historical microblogs to improve filtering performance. The specific contributions are as follows.
(1) To address the shortness of microblog queries, a query model based on the temporal distribution of words is proposed. The thesis first analyzes how expansion terms and query terms are distributed over time. Building on a definition of, and an estimation method for, the temporal distribution of a word, it gives a measure of the similarity between the temporal distributions of query terms and expansion terms, and uses this similarity as their relatedness to select expansion terms and re-estimate the query model. The method expands queries using temporal information rather than content, which avoids the weakness of content-based query expansion, namely that expansion terms cannot be estimated accurately because microblog content is short.
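The query expansion idea in contribution (1) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual method: the abstract does not specify the estimator or the similarity measure, so the smoothed per-bin counts, the Jensen-Shannon similarity, the linear interpolation weight `alpha`, and all function names here are hypothetical choices.

```python
import math
from collections import Counter, defaultdict


def temporal_distribution(posts, num_bins, smoothing=0.1):
    """Estimate P(time bin | word) from (time_bin, tokens) pairs.

    Assumption: additive smoothing over a fixed number of time bins.
    """
    counts = defaultdict(Counter)
    for time_bin, tokens in posts:
        for word in set(tokens):  # count each post once per word
            counts[word][time_bin] += 1
    dists = {}
    for word, bins in counts.items():
        total = sum(bins.values()) + smoothing * num_bins
        dists[word] = [(bins[b] + smoothing) / total for b in range(num_bins)]
    return dists


def js_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (base 2); lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))


def expand_query(query_terms, posts, num_bins, k=2, alpha=0.3):
    """Pick k expansion terms by temporal similarity and re-estimate the query model."""
    terms = list(dict.fromkeys(query_terms))  # de-duplicate, keep order
    uniform = {t: 1.0 / len(terms) for t in terms}
    dists = temporal_distribution(posts, num_bins)
    known = [t for t in terms if t in dists]
    if not known:
        return uniform
    # Temporal profile of the query: mean of its terms' distributions.
    q_dist = [sum(dists[t][b] for t in known) / len(known) for b in range(num_bins)]
    scores = {w: js_similarity(q_dist, d) for w, d in dists.items() if w not in terms}
    top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    norm = sum(s for _, s in top)
    if not top or norm == 0:
        return uniform
    # Interpolate the original query model with the expansion terms.
    model = {t: (1 - alpha) / len(terms) for t in terms}
    for word, score in top:
        model[word] = alpha * score / norm
    return model
```

In this sketch a candidate term is chosen because its posting volume rises and falls together with the query terms, regardless of whether it co-occurs with them in the same short post, which is the property the abstract attributes to time-based expansion.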
(2) To address the shortness of microblog content, a time-based microblog document model is proposed. The model attempts to estimate the weights of expansion terms from the distribution of a word over microblogs posted during its burst period and over temporally neighboring microblogs, and a machine-learning-based method for selecting expansion terms is proposed. A document expansion model is built on this basis and used to estimate a more accurate document model. To reduce the time complexity of the time-based document model, two optimized temporal document models are proposed, which lessen the system overhead introduced by document expansion.
(3) To address the effect of short text on computing the relevance between microblogs and queries, temporal characteristics are introduced into microblog retrieval, so that retrieval considers, in addition to content relevance, several kinds of temporal relevance between a microblog and a query, making the ranking better match the temporal characteristics of relevant microblogs. Specifically, under the classical language-model retrieval framework, three methods that use temporal relations to optimize retrieval results are given; under the learning-to-rank framework, a time-sensitive learning-to-rank algorithm is proposed and a time-sensitive loss function is designed, improving microblog retrieval performance.
(4) To address the poor performance of traditional classification models in real-time microblog filtering, a real-time filtering model based on historical microblog information is proposed, which effectively combines a retrieval model with a classification model.
Specifically, the thesis proposes a framework for real-time microblog filtering based on historical microblogs: the ranking information and temporal-neighbor information of historical microblogs are applied in the retrieval model to construct prior knowledge, and this prior knowledge is used to dynamically adjust the decision boundary of the classification model. Further, taking the language model and the logistic regression model as an example, an instance of the framework is implemented and methods for estimating its parameters are given.
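One way to picture how a retrieval-derived prior could shift a classifier's decision boundary is sketched below. It is a hypothetical illustration, not the framework's actual formulation: the abstract does not give the prior's definition or the adjustment rule, so the windowed top-N prior, the log-odds shift added to the logistic regression score, and every name and parameter here are assumptions.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def temporal_prior(post_time, history, window, top_n, smoothing=1.0):
    """Prior relevance of a new post from its temporal neighbors.

    history: list of (timestamp, rank) pairs for historical posts, where
    rank is the post's position in the retrieval model's ranked list
    (1 = best). Assumption: a neighbor counts as evidence of relevance
    when it was ranked within the top_n results.
    """
    neighbours = [rank for ts, rank in history if abs(post_time - ts) <= window]
    hits = sum(1 for rank in neighbours if rank <= top_n)
    return (hits + smoothing) / (len(neighbours) + 2 * smoothing)


def filter_post(features, weights, bias, prior):
    """Logistic regression decision with the boundary shifted by the prior.

    A prior of 0.5 leaves the classifier unchanged; a higher prior moves
    the boundary so that borderline posts are accepted.
    """
    score = sum(w * x for w, x in zip(weights, features)) + bias
    prob = sigmoid(score + logit(prior))
    return prob >= 0.5, prob
```

With this kind of adjustment, the same borderline post would be delivered when it arrives near a cluster of highly ranked historical posts and dropped when it arrives in a quiet period, which is one reading of "dynamically adjusting" the classifier.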
【Degree-granting institution】: Harbin Institute of Technology
【Degree level】: Doctoral
【Year degree conferred】: 2016
【Classification number】: TP391.3; TP393.092
