Research on Query Reformulation Based on User Intent Recognition
Published: 2018-07-23 09:05
[Abstract]: Search engines help users retrieve the information they need from the web and greatly ease information anxiety. However, user queries are short and often ambiguous, and search engines based on keyword matching cannot handle polysemy. Query reformulation is one way to identify the real user intent behind a query. Existing query reformulation techniques have weaknesses: session segmentation methods are flawed, and candidate queries generated from session co-occurrence information tend to drift away from the intent of the original query, so the user intents identified through query reformulation overlap.
After an in-depth study of the theory and techniques of query reformulation, this thesis builds a query reformulation model for user intent recognition on the AOL query log. The model addresses three questions: how to segment user intent, how to identify the query reformulations that express the intent of the original query, and how to cluster the identified intents. Because existing algorithms have shortcomings, the thesis focuses on improving the model, which has three parts: session segmentation of the query log, computation of query reformulations for the original query, and clustering of query reformulations.
In the first part, to address the shortcomings of lexical similarity, click similarity between queries is added as a feature. In the second part, to address inaccurate query reformulations, the relationship between the original query and each candidate query is computed by additionally considering the time distance between queries and the click similarity between queries. In the third part, a query reformulation clustering method is proposed to deal with the overlap among the identified user intents.
Clustering raises two further problems: sparsity of the query reformulation vectors and inaccurate transition probabilities. To address sparsity, a Query-URL graph is built from the query reformulations and clicked URLs in each session, and a Markov random walk with absorbing states is used to model the graph. To address inaccurate transition probabilities, a CF-IQF model, defined by analogy with TF-IDF, combines three factors (URL, rank number, and sequence number) to compute the edge weights of the graph. The absorbing-state distribution is then computed to build the query reformulation vectors, and clustering is performed using the cosine similarity of these vectors together with the complete-link algorithm. Comparative experiments on each part of the model indicate that the proposed algorithms have certain advantages.
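The clustering stage described in the abstract (absorbing random walk on a Query-URL graph, then cosine similarity with complete-link clustering) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. It assumes URLs are the absorbing states and queries the transient states, uses raw counts as edge weights in place of the CF-IQF weights (whose exact formula the abstract does not give), and approximates the absorption distribution by fixed-point iteration rather than matrix inversion. All query strings, URLs, counts, function names, and the 0.3 threshold are hypothetical.

```python
from math import sqrt


def absorption_distribution(q_to_q, q_to_url, n_iter=200):
    """Approximate B = (I - Q)^-1 R by iterating B <- R + Q B.

    q_to_q[q][p]   : weight of the edge from query q to query p (transient).
    q_to_url[q][u] : weight of the edge from query q to URL u (absorbing).
    Assumes every query has at least one outgoing edge and can reach a URL.
    """
    queries = list(q_to_url)
    urls = sorted({u for row in q_to_url.values() for u in row})
    Q, R = {}, {}
    for q in queries:
        # Normalise outgoing weights so they form transition probabilities.
        total = sum(q_to_q.get(q, {}).values()) + sum(q_to_url[q].values())
        Q[q] = {p: w / total for p, w in q_to_q.get(q, {}).items()}
        R[q] = {u: w / total for u, w in q_to_url[q].items()}
    B = {q: dict(R[q]) for q in queries}
    for _ in range(n_iter):
        new_B = {}
        for q in queries:
            row = dict(R[q])  # absorbed directly from q
            for p, prob in Q[q].items():  # or after stepping to query p
                for u, b in B[p].items():
                    row[u] = row.get(u, 0.0) + prob * b
            new_B[q] = row
        B = new_B
    # One dense vector per query, indexed by the sorted URL list.
    return {q: [B[q].get(u, 0.0) for u in urls] for q in queries}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def complete_link(vectors, threshold):
    """Agglomerative clustering; cluster similarity is the minimum pairwise
    cosine similarity (complete link). Stops when no pair reaches threshold."""
    clusters = [[q] for q in vectors]

    def sim(c1, c2):
        return min(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        best, i, j = max(
            (sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


# Hypothetical toy graph: no two queries share a clicked URL.
Q_TO_URL = {
    "jaguar car": {"cars.example": 3},
    "jaguar price": {"dealer.example": 2},
    "jaguar animal": {"zoo.example": 2},
    "jaguar habitat": {"wildlife.example": 1},
}
Q_TO_Q = {
    "jaguar car": {"jaguar price": 1},
    "jaguar price": {"jaguar car": 2},
    "jaguar animal": {"jaguar habitat": 2},
    "jaguar habitat": {"jaguar animal": 1},
}

vectors = absorption_distribution(Q_TO_Q, Q_TO_URL)
clusters = complete_link(vectors, threshold=0.3)
print(clusters)
```

In this toy graph the raw click vectors are mutually orthogonal because no URL is shared. After the random walk, queries linked through reformulations get overlapping absorption distributions, so the two car queries and the two animal queries fall into separate clusters. This is the sparsity effect the abstract attributes to the Query-URL graph model.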
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.3
Article ID: 2138904
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2138904.html