Research and Design of a Personalized Search Engine System Based on Automatic Summarization and User Feedback

Published: 2018-12-30 16:14

[Abstract]: In today's era of information explosion, search engines have become an effective tool for discovering and inferring knowledge from large volumes of data. However, traditional search engine systems return identical results for the same query regardless of which user issues it, and users increasingly expect results of higher accuracy. This thesis therefore introduces automatic summarization and user feedback techniques into the traditional search engine system in order to improve its precision. By analyzing the system model of the traditional search engine MG (Managing Gigabytes), it studies and designs a relatively complete personalized search engine system. Following requirements analysis, the system is divided into nine modules: document processing, clustering, user query processing, user classification, system feedback, similarity calculation, ranking, result display, and system evaluation. The system first performs cluster analysis on users to extract their interest models; then, based on user feedback, it adjusts personalization parameters when computing the similarity between the query vector and the document vectors, so that query results become more precise. The feature term reduction algorithm for documents is also improved: each document is first summarized automatically, a feature term set is extracted from the summary, the feature terms are ranked by their contribution to the document category, and finally, with precision preserved, completeness is sacrificed in exchange for fast convergence of the feature terms. The system further combines a minimal perfect hash function with large-memory storage, which reduces the storage space of the inverted file dictionary and speeds up reads of the inverted file index. Finally, a min-heap data structure is used to optimize the space required to rank massive document collections.
Theoretical analysis and experiments show that, compared with the MG search engine system, the improved feature term reduction algorithm is somewhat more time-efficient; the storage space of the inverted file index dictionary is reduced by nearly half; the improved document ranking algorithm has lower space complexity; and the improved similarity calculation algorithm raises the personalized precision of queries to some extent with respect to individual users' interests.
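
The abstract does not give the similarity formula or the ranking procedure, so the following is only a minimal sketch of how a personalization parameter and a min-heap might fit together. The linear blend in `personalized_score`, the parameter `alpha`, and the function names are assumptions made for illustration, not the thesis's actual method. The sketch keeps only the `k` best documents in a min-heap, so ranking needs O(k) extra space instead of sorting the full collection.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def personalized_score(query, doc, interest, alpha=0.7):
    # Hypothetical blend (an assumption, not taken from the thesis):
    # alpha weights the query match, (1 - alpha) the match with the user's interest model.
    return alpha * cosine(query, doc) + (1.0 - alpha) * cosine(interest, doc)


def top_k(docs, query, interest, k, alpha=0.7):
    """Return the k highest-scoring (score, doc_id) pairs, best first."""
    if k <= 0:
        return []
    heap = []  # min-heap; heap[0] is always the weakest result kept so far
    for doc_id, vec in docs.items():
        item = (personalized_score(query, vec, interest, alpha), doc_id)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Evict the current weakest result and keep the new, better one.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    docs = {
        "d1": {"search": 1.0},
        "d2": {"search": 1.0, "sports": 1.0},
        "d3": {"cooking": 1.0},
    }
    query = {"search": 1.0}
    interest = {"sports": 1.0}  # toy interest model for one user
    for score, doc_id in top_k(docs, query, interest, k=2):
        print(doc_id, round(score, 4))
```

With `alpha=1.0` the interest model is ignored and the ranking reduces to plain query-document cosine similarity, which corresponds to the non-personalized behavior the abstract criticizes.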
[Degree-granting institution]: Tianjin University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.3
