Research and Design of a Personalized Retrieval System Based on Classification Technology

Published: 2018-07-14 07:44

[Abstract]: With the rapid development of the Internet and network information technology, network resources are growing exponentially. Traditional general-purpose search engines return results that depend only on the query keywords, yet different users who submit the same query may have different goals and expect different results. A search tool that provides more precise results tailored to the individual user is therefore needed, and this thesis proposes a user-centered, classification-based personalized search engine. Building on a fairly comprehensive analysis of technologies related to personalized information retrieval, the thesis examines the techniques commonly used in personalized search engines and the main techniques search engines use to understand a user's search intent, and it builds a model of the retrieval system from users' browsing and query logs. It introduces automatic text classification, presents several common text representation models, and uses WEKA and LibSVM to classify text automatically. Based on text classification, it proposes a ranking algorithm that shows as many categories as possible in the retrieval results, so that users interested in as many different categories as possible can find information on their topic. In addition, user behavior features, namely the user's click-through rate for each topic category and the average time spent on pages of each category, are used to modify Lucene's score field and thereby change Lucene's own ranking scores for documents. Experiments show that, once these behavior features are taken into account, users with different interests who query the same terms retrieve different result pages.
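The behavior-based re-ranking described above can be illustrated with a minimal sketch. The abstract does not give the scoring formula or say how Lucene's score is modified, so everything here is an assumption made for illustration: the names `Hit`, `UserProfile`, `interest` and `rerank`, the weight `alpha`, and the choice to multiply the engine's base score by `1 + interest`. It does not use the Lucene API.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    doc_id: int
    category: str      # topic category assigned by the text classifier
    base_score: float  # relevance score from the underlying engine


@dataclass
class UserProfile:
    # category -> share of this user's clicks (0..1)
    click_rate: dict = field(default_factory=dict)
    # category -> average visit time on pages of that category, in seconds
    avg_dwell: dict = field(default_factory=dict)


def interest(profile, category, alpha=0.5):
    """Hypothetical interest weight in [0, 1] combining the two behavior features."""
    max_dwell = max(profile.avg_dwell.values(), default=0.0)
    dwell = profile.avg_dwell.get(category, 0.0) / max_dwell if max_dwell > 0 else 0.0
    return alpha * profile.click_rate.get(category, 0.0) + (1 - alpha) * dwell


def rerank(hits, profile, alpha=0.5):
    """Sort hits by base score boosted by the user's interest in each hit's category."""
    return sorted(
        hits,
        key=lambda h: h.base_score * (1.0 + interest(profile, h.category, alpha)),
        reverse=True,
    )
```

With the same hit list, a profile dominated by one category pulls that category's documents upward, which is the effect the experiments in the abstract report: the same query yields different orderings for users with different interests.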
Because a large share of search keywords are repeated, the 80/20 rule applies: 20% of the search terms account for 80% of all searches. When a user submits a query made up of a set of keywords, the system checks whether a record for that query exists in the cache. If not, it passes the query to the searcher, stores the combined sequence of document numbers returned by the searcher in a file, and saves the offset of that sequence within the file in the cache. If the record already exists, the offset of the stored record is read from the cache. The thesis then covers the design and implementation of the system prototype: it first presents the complete system architecture, then describes the main modules in detail, including the retrieval module, the result ranking module, and the query cache module, and analyzes several of the system's main data structures. The system is then tested and analyzed to verify its feasibility. Finally, the thesis summarizes the work and outlines plans for future work; it also points out some shortcomings of the system and proposes improvements to its overall architecture.
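The query cache described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `QueryCache`, the `search` callable standing in for the searcher, the binary record layout (a count followed by unsigned 32-bit document numbers), and the key normalization that treats a query as an unordered set of lowercase keywords are all hypothetical and not taken from the thesis.

```python
import os
import struct


class QueryCache:
    """In-memory map from query to the file offset of its stored document numbers."""

    def __init__(self, path, search):
        self.path = path
        self.search = search      # assumed callable: query string -> list of doc numbers
        self.offsets = {}         # normalized query -> byte offset in the results file
        open(path, "ab").close()  # make sure the results file exists

    @staticmethod
    def _key(query):
        # Assumption: keyword order and case do not matter for cache lookups.
        return " ".join(sorted(query.lower().split()))

    def get(self, query):
        key = self._key(query)
        if key not in self.offsets:
            # Cache miss: run the search, append the result, remember its offset.
            doc_ids = list(self.search(query))
            with open(self.path, "ab") as f:
                f.seek(0, os.SEEK_END)
                self.offsets[key] = f.tell()
                f.write(struct.pack("<I", len(doc_ids)))
                f.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
            return doc_ids
        # Cache hit: read the stored sequence back from the saved offset.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[key])
            (count,) = struct.unpack("<I", f.read(4))
            return list(struct.unpack(f"<{count}I", f.read(4 * count)))
```

Keeping only offsets in memory means the cache stays small even when result lists are long, at the cost of one file read per hit.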
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3

[References]