Research on Weibo Search Based on Federated Retrieval

Published: 2018-08-02 08:24

[Abstract]: With the arrival of the Web 2.0 era, the number of Internet applications keeps growing, users participate more and more actively online, and the networks people use are evolving into social networks. Microblogging (Weibo) is one of the most typical social-network applications; its concise content and ease of posting attract a growing number of users. As the number of Weibo users rises, the content they generate on the platform grows exponentially. However, search over Weibo content still relies on the traditional centralized retrieval model, which causes several problems. First, because the volume of Weibo data is very large, searching all posts directly is time-consuming and degrades the user's search experience. Second, Weibo covers a very wide range of topics, so centralized retrieval may yield low accuracy. Finally, centralized retrieval can use only one retrieval model, whereas federated retrieval can apply a different retrieval model to each data set and is therefore more flexible.

Federated retrieval is an important branch of information retrieval research. It searches different data sets in a distributed manner, addressing the low efficiency and low accuracy of centralized retrieval. It first estimates the relevance of each data set to the query, then sends the query to the more relevant data sets, and finally merges the retrieval results and returns them to the user. Because every queried data set is relatively relevant to the query, result accuracy tends to be higher than with centralized retrieval, and the problem of a data set being too large to search effectively is also addressed.

Building on these advantages, this thesis proposes a Weibo search technique based on federated retrieval. The technique applies the federated retrieval idea to Weibo search and, considering the particular nature of Weibo text, incorporates an authority factor for Weibo authors so that document ranking scores are computed more precisely. Experiments on a real Weibo data set show that the proposed method improves the accuracy of Weibo search. The main work of the thesis is as follows:

(1) Development of a Weibo search framework based on federated retrieval. The research focuses on applying federated retrieval to search over Weibo data. First, federated data sets suited to Weibo search are built and a description is generated for each data set. Next, a data set selection method uses these descriptions to compute a matching score between the query and each data set, ranks the data sets by relevance, and selects several of the most relevant ones. The query is then sent to the selected data sets for search. Finally, the results returned by the different data sets are merged into a single result list, which is returned to the user.

(2) A result merging algorithm that incorporates Weibo author authority. Taking the characteristics of Weibo into account and building on previous research, the thesis proposes a result merging method that incorporates the authority of Weibo authors. Experimental results show that, compared with earlier result merging methods, the proposed method effectively improves the accuracy of search results.

(3) Design of a Weibo search system based on federated retrieval. Building on the preceding two chapters, a prototype Weibo search system based on federated retrieval is designed and implemented. The system comprises three main functional modules: Weibo index building, ordinary search, and federated retrieval. The thesis concludes with a demonstration of the system.
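The four-stage pipeline described in the abstract (data set descriptions, data set selection, per-data-set search, and authority-aware result merging) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the term-frequency descriptions, the matching score, the whitespace tokenizer, the `author_authority` field, and the linear blend controlled by `alpha` are hypothetical stand-ins for the actual selection and merging methods, which the abstract does not specify.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Post:
    post_id: str
    text: str
    author_authority: float  # hypothetical authority score in [0, 1]


def tokenize(text):
    # Assumption: whitespace tokenization; real Weibo text needs word segmentation.
    return text.lower().split()


def describe(collection):
    """Data set description: term frequencies over all posts in the data set."""
    counts = Counter()
    for post in collection:
        counts.update(tokenize(post.text))
    return counts


def select_collections(query, descriptions, k):
    """Rank data sets by the share of their tokens matching the query; keep top k."""
    terms = tokenize(query)
    scores = {}
    for name, counts in descriptions.items():
        total = sum(counts.values()) or 1
        scores[name] = sum(counts[t] for t in terms) / total
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked[:k] if scores[name] > 0]


def search(query, collection):
    """Per-data-set retrieval: score each post by query-term occurrences."""
    terms = tokenize(query)
    results = []
    for post in collection:
        tokens = tokenize(post.text)
        score = sum(tokens.count(t) for t in terms)
        if score > 0:
            results.append((post, float(score)))
    return results


def merge(results_by_collection, alpha=0.7):
    """Merge result lists, blending normalized relevance with author authority."""
    merged = []
    for results in results_by_collection:
        if not results:
            continue
        top = max(score for _, score in results)  # normalize within each data set
        for post, score in results:
            final = alpha * (score / top) + (1 - alpha) * post.author_authority
            merged.append((post.post_id, final))
    return sorted(merged, key=lambda item: item[1], reverse=True)


def federated_search(query, collections, k=2, alpha=0.7):
    """Describe, select, search, and merge; returns (post_id, score) pairs."""
    descriptions = {name: describe(posts) for name, posts in collections.items()}
    selected = select_collections(query, descriptions, k)
    per_collection = [search(query, collections[name]) for name in selected]
    return merge(per_collection, alpha)
```

In this sketch, lowering `alpha` gives author authority more weight relative to text relevance, so a post from a high-authority author can outrank a post that matches the query more strongly. In a real system the descriptions would be built once at indexing time rather than on every query.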
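[Degree-granting institution]: Hunan University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.3;TP393.092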
