A Query Recommendation Algorithm Based on Bipartite Graphs

Published: 2018-04-28 21:03

Topics: weighted bipartite graph + query recommendation; Source: Anhui University, master's thesis, 2014

[Abstract]: The Internet has become the world's largest knowledge base, and the volume of information available online grows daily. Faced with information at this scale, users often struggle to find what they need quickly and accurately. Search engines help people retrieve information from massive data and have become one of the most important, even indispensable, tools for accessing online information. Interaction with a search engine, however, still mainly consists of the user typing query keywords that express an information need and the engine returning a result list. Because queries are usually short and query terms are often ambiguous or polysemous, the search engine cannot accurately infer the user's real search intent. Against this background, query recommendation has been widely adopted by search engines, both to help them understand the user's real intent more accurately and to help users formulate better queries.

This thesis studies a query recommendation algorithm based on a bipartite graph. The Sogou query log serves as the experimental data set; after analysis and preprocessing, 310,000 historical user click records are extracted as experimental data. The edge-weight formula of the bipartite graph takes into account both the rank of the clicked URL in the result list returned by the search engine and the order in which the user clicked that URL. The edge weights are computed following the TF-IDF idea, which yields a weighted Query-URL bipartite graph. Each query is represented by a vector built from the set of URLs clicked for it, the similarity between any two distinct queries is computed with cosine similarity, and a query relation network describing the relatedness between queries is then constructed.

To recommend N candidate queries for an input query, the neighbors of the input query's node in the query relation network first form the initial candidate set H. If H contains at least N queries, the N candidates with the highest relevance scores with respect to the input query are recommended directly. If H contains fewer than N queries, the nodes within h hops that are indirectly connected to the input query node are also added to H, the queries in H are clustered with the k-means algorithm, and the cluster containing the input query is ranked so that the N candidates with the highest relevance scores are recommended. Experimental results show that the algorithm achieves good recommendation quality and has some practical value.
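
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the edge-weight formula, so the click contribution `1 / (rank * order)` and the smoothed IDF term are assumptions, as are the sample log `CLICKS`, all function names, and the rule that scores an indirectly connected query by the product of similarities along the first path that reaches it. The k-means clustering step is omitted; the sketch only expands the candidate set by hops and ranks it.

```python
import math
from collections import defaultdict

# Hypothetical click log: (query, clicked_url, rank_in_results, click_order).
CLICKS = [
    ("python tutorial", "docs.example/py", 1, 1),
    ("python tutorial", "learn.example/py", 3, 2),
    ("learn python", "learn.example/py", 1, 1),
    ("learn python", "course.example/py", 2, 2),
    ("python course", "course.example/py", 1, 1),
    ("cheap flights", "air.example/deals", 1, 1),
]

def build_weights(clicks):
    """Return {query: {url: weight}} for the weighted Query-URL bipartite graph."""
    tf = defaultdict(lambda: defaultdict(float))
    url_queries = defaultdict(set)
    for query, url, rank, order in clicks:
        # Assumed contribution: top-ranked, early clicks count the most.
        tf[query][url] += 1.0 / (rank * order)
        url_queries[url].add(query)
    n_queries = len(tf)
    # Assumed smoothed IDF: URLs clicked for many queries are less distinctive.
    return {
        q: {u: t * math.log(1 + n_queries / len(url_queries[u]))
            for u, t in urls.items()}
        for q, urls in tf.items()
    }

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def build_query_graph(vectors):
    """Connect every pair of queries whose cosine similarity is positive."""
    graph = {q: {} for q in vectors}
    queries = list(vectors)
    for i, a in enumerate(queries):
        for b in queries[i + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if sim > 0:
                graph[a][b] = sim
                graph[b][a] = sim
    return graph

def recommend(graph, query, n, max_hops=2):
    """Return up to n candidate queries, best score first."""
    scores = dict(graph.get(query, {}))  # direct neighbors: candidate set H
    if len(scores) < n:
        # Too few neighbors: add nodes reachable within max_hops.
        frontier = list(scores.items())
        for _ in range(max_hops - 1):
            next_frontier = []
            for node, score in frontier:
                for neighbor, sim in graph[node].items():
                    if neighbor != query and neighbor not in scores:
                        # Assumed score: product of similarities on this path.
                        scores[neighbor] = score * sim
                        next_frontier.append((neighbor, score * sim))
            frontier = next_frontier
    return sorted(scores, key=scores.get, reverse=True)[:n]

vectors = build_weights(CLICKS)
graph = build_query_graph(vectors)
print(recommend(graph, "python tutorial", n=2))
# prints ['learn python', 'python course']
```

In this toy log, "python tutorial" has only one direct neighbor, so asking for two recommendations triggers the hop expansion and pulls in "python course" through "learn python". A fuller version would cluster the expanded set before ranking, as the abstract describes.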
[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP391.3
