A Method for Mining User Search Intent from Search Logs and Recommending Related Search Terms
Published: 2018-07-23 19:17
[Abstract]: With the rapid growth of the Internet, users face ever larger volumes of data, and for now search engines are the only practical way to find the data that meets a given need. In practice, however, most users are at a loss when a search engine returns thousands of results, many of which are noise unrelated to their search intent. In addition, traditional search engines return results as a one-dimensional linear list, which further lowers query efficiency. Improving user search efficiency is therefore receiving growing attention, and many researchers have proposed methods that start from either the search result documents or the search logs. This thesis studies how to improve user search efficiency on top of existing search engine resources, and implements a method that compensates for shortcomings of existing systems. The method starts from the search logs: the log information is processed and extracted to obtain the relevant datasets. Seed search terms are then constructed and used to extract, from those datasets, candidate terms that satisfy search intents at different levels; effective feature data are extracted from these candidates and used to train a binary classification model. For a user query, the classification model first selects the related search terms to recommend, and similar texts are then merged using short-text similarity computation and related techniques. Finally, the user receives related search terms covering different intents together with more sensibly structured search documents. Experiments show that the method extracts related search terms that match expectations and thereby effectively improves search efficiency.
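[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3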
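The abstract mentions merging similar candidate terms with short-text similarity but does not say which measure is used. As a minimal sketch only, the following assumes character-bigram Jaccard similarity and a greedy single-pass grouping; the function names, the threshold of 0.5, and the sample terms are all hypothetical and are not taken from the thesis.

```python
def char_bigrams(text):
    """Return the set of character bigrams of a short text.

    Whitespace is removed and the text is lowercased first. Character
    n-grams are an assumption here; they avoid the need for a word
    segmenter, which matters for Chinese queries.
    """
    s = "".join(text.split()).lower()
    if len(s) < 2:
        return {s} if s else set()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard similarity of the bigram sets of two short texts."""
    ga, gb = char_bigrams(a), char_bigrams(b)
    if not ga and not gb:
        return 1.0  # two empty texts are treated as identical
    return len(ga & gb) / len(ga | gb)


def merge_similar(terms, threshold=0.5):
    """Group terms greedily in a single pass.

    Each term joins the first group whose representative (its first
    member) has similarity >= threshold; otherwise it starts a new group.
    """
    groups = []
    for term in terms:
        for group in groups:
            if jaccard(term, group[0]) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups


if __name__ == "__main__":
    # Hypothetical candidate terms for illustration only.
    candidates = ["iphone price", "iphone prices", "iphone review", "apple pie recipe"]
    for group in merge_similar(candidates):
        print(group)
```

In this sketch only the first term of each group would be shown to the user as the recommendation for that intent. The greedy pass depends on input order, so a real system might prefer a proper clustering step.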
Article No.: 2140393
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2140393.html