Research on Key Technologies of Microblog Search

Published: 2018-07-28 07:39

[Abstract]: Microblogs have quickly become an important source of real-time information. Microblog search faces two key problems: computing the relevance between a query and microblog messages, and organizing the search results. Relevance computation measures how similar a message is to the query in content and semantics. Result organization arranges messages concisely and in order, overcoming redundancy and non-standard writing, mainly through classification and summarization. Taking Twitter as an example, this thesis explores several important problems in microblog search: relevance computation, query result classification, summarization, and contrastive topic summarization. For relevance computation, two message ranking models are proposed, one based on learning to rank and one based on a recurrent neural network language model. Compared with the relevance ranking algorithms used in current microblog search services, the former significantly improves the relevance of the ranked message list, while the latter narrows the gap in computing semantic relatedness between messages and improves the coverage of query results. The learning-to-rank model systematically examines the roles of text relevance features, microblog writing features, and author authority features in microblog relevance computation. The ranking model based on the recurrent neural network language model introduces semantic similarity into relevance computation, computing lexical semantic similarity between messages at the granularity of word vectors. For search result classification, a collective classification model based on relations among messages is proposed, and a topic taxonomy is defined for microblogs. Compared with a feature-based baseline, the model improves accuracy and F-score by 5.38% and 4.74%, respectively. The model applies two kinds of shared-topic relations between messages to three graph-based collective classification models. It considers both local features and the class distribution of related messages and classifies a batch of microblog messages jointly, which reduces the impact of data sparsity and greatly improves the classifier's precision and recall; the iterative classification algorithm using the shared-hashtag (#hashtag) relation performs best. For search result summarization, a timeline-based, self-reinforcing summarization model built on associative interaction is proposed. Compared with a graph-based baseline, the model improves ROUGE-1 by 14% on average. Given the search results for a query, the model divides them chronologically into several subtopics, computes the importance of each message from its text content, its author's social influence, and its text quality, and then ranks and extracts messages by importance and diversity to generate the summary. Experiments show that the author's social influence and text quality effectively improve the measurement of text importance. For contrastive topic summarization, an optimization-based contrastive topic summarization model built on relations among messages is proposed. Compared with a baseline based on content similarity, the model improves contrastive attribute coverage and comparative message-pair accuracy by 14.7% and 11.6%, respectively. The model makes full use of similarity relations and three kinds of shared-topic relations between messages and applies the PageRank algorithm and the SimRank method to maximize the internal contrast and topic representativeness of message pairs, generating a summary of the commonalities and differences in the search results of the contrasted queries.
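To illustrate the extraction step mentioned in the abstract (ranking and extracting messages by importance and diversity), here is a minimal sketch. It is not the thesis's model: it assumes each message already carries an importance score, and the greedy selection rule, the `trade_off` weight, the token-level Jaccard similarity, and the names `jaccard` and `select_summary` are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed, simple similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_summary(
    messages: List[Tuple[str, float]], k: int, trade_off: float = 0.7
) -> List[str]:
    """Greedily pick up to k messages, trading importance against redundancy.

    messages: (text, importance) pairs; importance is assumed precomputed.
    trade_off: hypothetical weight; higher values favor importance over diversity.
    """
    selected: List[str] = []
    candidates = list(messages)
    while candidates and len(selected) < k:

        def score(item: Tuple[str, float]) -> float:
            text, importance = item
            # Redundancy is the highest similarity to any already selected message.
            redundancy = max((jaccard(text, s) for s in selected), default=0.0)
            return trade_off * importance - (1 - trade_off) * redundancy

        best = max(candidates, key=score)
        selected.append(best[0])
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    demo = [
        ("earthquake hits city center", 0.9),
        ("earthquake hits city center today", 0.85),
        ("rescue teams arrive at the scene", 0.6),
    ]
    print(select_summary(demo, k=2))
```

In this toy input the second message is nearly a duplicate of the first, so the diversity penalty lets the lower-scored but novel third message enter the summary instead. In the thesis, importance is derived from text content, author social influence, and text quality; here it is simply given as a number.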
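[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctoral
[Year degree conferred]: 2014
[Classification number]: TP393.092;TP391.1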

[References]