Research on Algorithms for Identifying the Strength of Users' Video Retrieval Intent

Published: 2018-05-15 01:22

Topics: short text classification + information retrieval; Source: Zhejiang University, 2015 master's thesis

[Abstract]: With the explosive growth of data, users find it increasingly hard to choose among the information available to them. Search engines are an important means of obtaining information, and as smart devices become widespread, mobile search plays an increasingly important role. The limited display space of mobile devices means that users must be given information that is as precise and useful as possible, so the user's retrieval intent has to be identified more accurately in order to provide more precise service and improve the user experience. However, users typically express an information need as a short query string, generally 3-4 words long, whose description is relatively vague and highly ambiguous, so the user's actual need is often identified inaccurately. This thesis uses the rich data resources of a search engine, together with users' interaction results, to analyze and address the problem of identifying the strength of a user's video retrieval intent. The technique is applied in general search and video retrieval systems: by analyzing the user's query string it identifies how strong the video intent is, so that more precise results can be presented to the user in a friendly way. First, the thesis expands the user's query string with the titles of the results displayed by the search engine and of the results the user clicked, and, because the text of the different classes in this task overlaps heavily, proposes a new text feature selection method based on entropy and term frequency. Second, it designs and extracts five groups of features (text, video domain-name statistics, the types of results returned by the search engine, semantic information from a deep language model, and session statistics) and runs experiments on these features and their combinations, which verify the effectiveness of the approach. Inspired by the deep learning language model word2vec, it proposes Host2vec, a word-vector representation of site domain names, bringing deep language models into the problem of retrieval intent strength identification. Finally, it analyzes and mines how the strength of users' video retrieval intent changes over time.
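The abstract names a feature selection method based on entropy and term frequency but does not give its formula. The sketch below is one plausible reading, not the thesis's actual method: a term scores higher when it is frequent and when its occurrences are concentrated in one class (low class entropy). The function name `entropy_tf_scores`, the scoring formula, the class labels `strong` and `weak`, and the toy queries are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy_tf_scores(docs):
    """Score terms by frequency and class concentration (illustrative only).

    docs: list of (label, tokens) pairs.
    Assumed score: log(1 + term frequency) * (1 - normalized class entropy).
    A term spread evenly over all classes scores 0; a frequent term seen
    in a single class scores highest.
    """
    term_class_counts = defaultdict(Counter)
    labels = set()
    for label, tokens in docs:
        labels.add(label)
        for token in tokens:
            term_class_counts[token][label] += 1

    # Maximum possible entropy for this number of classes (guard for 1 class).
    max_entropy = math.log2(len(labels)) if len(labels) > 1 else 1.0

    scores = {}
    for term, counts in term_class_counts.items():
        total = sum(counts.values())
        # Shannon entropy of the term's distribution over classes.
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
        scores[term] = math.log1p(total) * (1.0 - entropy / max_entropy)
    return scores


# Hypothetical expanded queries labeled with video-intent strength.
docs = [
    ("strong", ["watch", "episode", "online", "video"]),
    ("strong", ["watch", "full", "episode"]),
    ("weak", ["episode", "review", "video"]),
    ("weak", ["review", "news"]),
]

scores = entropy_tf_scores(docs)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.4f}")
```

In this toy data, `video` appears once in each class and so carries no class signal under the assumed score, while `watch` appears only in the `strong` class and ranks near the top. Heavy overlap between classes, which the abstract cites as the motivation, is exactly the case where a frequency-only ranking would keep uninformative terms such as `video`.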
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year conferred]: 2015
[Classification number]: TP391.41

[References]