Research on Optimizing Microblog Retrieval Results

Published: 2018-10-18 14:38

[Abstract]: With the rapid growth of the Internet, information is now produced and disseminated far faster than before. In this era of information explosion, quickly and effectively retrieving information of interest from massive volumes of data poses a major challenge for search engines. Microblogging, a form of social networking that has emerged in recent years, has become part of everyday life; microblog content includes authoritative news and trending topics as well as everyday and entertainment posts published by hundreds of millions of ordinary users. Microblog retrieval has therefore long been an active research topic. This thesis first introduces the relevant information retrieval techniques and analyzes the advantages of learning to rank (LTR) models and the evaluation measures for information retrieval systems. Based on this survey, it optimizes microblog retrieval results in two respects: relevance and diversity. For relevance, it designs and implements an architecture in which a GBDT model is trained on non-semantic features and then fused with an LTR model, with word vectors trained by a neural network introduced as additional features. On a Twitter dataset, this improves both MAP and P@30. For diversity, it implements k-means clustering that uses sentence vectors trained by a neural network as features, which verifies the effectiveness of the sentence-vector training. In addition, the Simhash deduplication algorithm is used to remove near-duplicate tweets, achieving a better F1 score than clustering. The topic is based on the 2014 TREC microblog retrieval evaluation task, for which the thesis proposes new ideas and solutions. Finally, the thesis describes the design and implementation process for the task and analyzes the evaluation results.
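The two relevance measures named in the abstract, P@30 and MAP, can be illustrated with a minimal Python sketch. This is not the thesis's evaluation code: the function names, the tweet IDs, and the toy relevance judgments below are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 30) -> float:
    """P@k: fraction of the top-k ranked items that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """AP: mean of the precision values at each rank where a relevant item appears,
    normalized by the total number of relevant items for the query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """MAP: average of AP over all queries in the run (assumed non-empty)."""
    ap_values = [average_precision(ranked, qrels.get(q, set()))
                 for q, ranked in runs.items()]
    return sum(ap_values) / len(ap_values)


# Toy data (hypothetical tweet IDs and judgments).
runs = {
    "q1": ["t1", "t2", "t3", "t4"],
    "q2": ["a", "b"],
}
qrels = {
    "q1": {"t1", "t3", "t9"},  # t9 is relevant but was never retrieved
    "q2": {"b"},
}

print(precision_at_k(runs["q1"], qrels["q1"], k=2))
print(average_precision(runs["q1"], qrels["q1"]))
print(mean_average_precision(runs, qrels))
```

Note that AP divides by the number of relevant items for the query, not by the number retrieved, so a relevant tweet that never appears in the ranking (such as `t9` above) lowers the score.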
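[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP393.092; TP391.1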
