Research on Query Expansion Techniques in Personalized Intelligent Search Engines
Published: 2019-03-25 12:09
[Abstract]: As the Internet keeps growing, the volume of online information increases daily, and users facing this mass of information demand ever higher recall, precision, and personalization from search engines. Query expansion is a key technique in personalized intelligent search engines: it expands the user's query before the engine retrieves results, which effectively improves both recall and precision. First, we expand the user's query keywords at the word-sense level. Using the Tongyici Cilin synonym thesaurus and the HowNet knowledge base, we compute word similarity, find the words most similar to each query keyword, and expand the keyword with these synonyms and near-synonyms, improving recall and precision. Second, we expand the user's query questions at the semantic level. This has two parts. On the one hand, we extract and expand the question's keywords: the question undergoes redundancy removal, Chinese word segmentation, part-of-speech tagging, and stop-word removal to extract the keyword or keyword set that carries the user's core meaning, and the extracted keywords are then expanded in the same way as query keywords. On the other hand, we expand the question with words that commonly occur in answers: we build a question classification scheme and classify the user's question; in parallel, we use a question-answer corpus to count the words that frequently appear in answers to each question type and generate a table of common answer words; the question is then expanded with the common answer words for its category. The words obtained from both parts are used together to expand the user's question.
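The two-part question expansion described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the `SIMILARITY` table is a hypothetical stand-in for similarity scores computed from Tongyici Cilin and HowNet, whitespace tokenization stands in for Chinese word segmentation (part-of-speech tagging is omitted), `classify_question` is a rule-based placeholder for a real question classifier, and `ANSWER_WORDS` is a made-up table of common answer words. It is not the thesis's implementation.

```python
from typing import Dict, List

# Hypothetical similarity scores; a real system would compute these
# from Tongyici Cilin and HowNet.
SIMILARITY: Dict[str, Dict[str, float]] = {
    "computer": {"pc": 0.92, "laptop": 0.81, "machine": 0.55},
    "price": {"cost": 0.90, "fee": 0.74},
}
STOP_WORDS = {"the", "a", "of", "is", "what", "how", "much", "who", "where"}
# Hypothetical table: question category -> words frequent in answers.
ANSWER_WORDS: Dict[str, List[str]] = {
    "quantity": ["yuan", "total"],
    "person": ["founder", "born"],
    "location": ["located", "city"],
}


def expand_keyword(word: str, threshold: float = 0.8, top_n: int = 2) -> List[str]:
    """Return the word followed by its most similar words above the threshold."""
    scored = sorted(SIMILARITY.get(word, {}).items(), key=lambda kv: -kv[1])
    return [word] + [w for w, s in scored if s >= threshold][:top_n]


def extract_keywords(question: str) -> List[str]:
    """Tokenize on whitespace (assumption) and drop stop words."""
    tokens = question.lower().rstrip("?").split()
    return [t for t in tokens if t not in STOP_WORDS]


def classify_question(question: str) -> str:
    """Rule-based placeholder for a trained question classifier."""
    q = question.lower()
    if q.startswith(("how much", "how many")):
        return "quantity"
    if q.startswith("who"):
        return "person"
    if q.startswith("where"):
        return "location"
    return "other"


def expand_question(question: str) -> List[str]:
    """Combine keyword expansion with common answer words for the category."""
    terms: List[str] = []
    for keyword in extract_keywords(question):
        terms.extend(expand_keyword(keyword))
    terms.extend(ANSWER_WORDS.get(classify_question(question), []))
    return list(dict.fromkeys(terms))  # de-duplicate, keep first-seen order


print(expand_question("How much is the price of a computer?"))
# → ['price', 'cost', 'computer', 'pc', 'laptop', 'yuan', 'total']
```

The similarity threshold and `top_n` cap are illustrative knobs: a low threshold raises recall at the cost of precision, which is the trade-off the abstract's "most similar words" criterion is meant to control.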
Then we analyze the user's browsing behavior to mine user interests. We collect the URLs in the user's IE Favorites and browsing history, fetch the corresponding web pages, extract their main text, and segment it into Chinese words to build a document set. A TF-IDF-based vector space model turns the document set into a set of vectors, which we cluster; analyzing the clustering results yields words that represent the user's interests. Finally, query expansion and user-interest extraction are applied in a personalized intelligent search engine: the user's query is first expanded, the expanded query is passed to the engine's retrieval module, and the results it returns are ranked and displayed by how well they match the user's interests.
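The interest-mining and re-ranking pipeline can be sketched as follows, under stated assumptions. The abstract does not name the clustering algorithm or the ranking score, so this sketch substitutes a simple single-pass "leader" clustering on cosine similarity and a term-overlap score; the smoothed IDF formula, the thresholds, and all function names are likewise illustrative choices rather than the thesis's method. Documents are assumed to be already segmented into word lists.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def tfidf_vectors(docs: List[List[str]]) -> List[Vector]:
    """Build sparse TF-IDF vectors using a smoothed IDF (an assumption)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def leader_cluster(vectors: List[Vector], threshold: float = 0.2) -> List[List[int]]:
    """Single-pass clustering: join the first cluster whose leader is similar."""
    clusters: List[List[int]] = []
    for i, v in enumerate(vectors):
        for members in clusters:
            if cosine(v, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def interest_terms(vectors: List[Vector], clusters: List[List[int]], top_k: int = 2) -> List[str]:
    """Take the highest-weighted terms of each cluster as interest words."""
    terms: List[str] = []
    for members in clusters:
        total: Counter = Counter()
        for i in members:
            total.update(vectors[i])  # adds the TF-IDF weights per term
        terms.extend(t for t, _ in total.most_common(top_k))
    return terms


def rerank(results: List[str], interests: List[str]) -> List[str]:
    """Order results by the fraction of their tokens that are interest words."""
    interest_set = set(interests)

    def score(text: str) -> float:
        tokens = text.lower().split()
        return sum(t in interest_set for t in tokens) / len(tokens) if tokens else 0.0

    return sorted(results, key=score, reverse=True)


docs = [d.split() for d in [
    "python code python tutorial", "python code library",
    "football match score", "football league match",
]]
vecs = tfidf_vectors(docs)
clusters = leader_cluster(vecs)
interests = interest_terms(vecs, clusters)
print(clusters)  # → [[0, 1], [2, 3]]
print(rerank(["stock market news", "python code examples", "weather today"], interests)[0])
# → python code examples
```

Leader clustering is order-dependent and was chosen only because it is short and deterministic; a hierarchical or k-means clusterer could be swapped in without changing the rest of the pipeline.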
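[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3
Article ID: 2446966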
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2446966.html