Design and Implementation of a Vertical Search Engine Combined with a User Interest Model

Published: 2018-05-08 09:01

Topic: user interest model + vertical search engine; Reference: Beijing University of Posts and Telecommunications, 2017 master's thesis

[Abstract]: In recent years, the Internet's influence on the public has continued to deepen. While users enjoy the convenience that rich and varied information brings, they also suffer from information overload. When users cannot quickly locate valuable resources within a large volume of information, the utilization of that information drops, which amounts to a "waste of resources". General-purpose search engines can no longer meet the deeper needs of a fixed user group: their coverage and precision are low, the returned content is not precise enough, and too many results are irrelevant. To address these problems and improve the user experience during search, this thesis designs and implements a vertical search engine combined with a user interest model, and builds an API that is integrated into a C++ project, providing users with specialized knowledge retrieval in the telecommunications domain. The different behaviors users exhibit while searching are collected and classified; an updated user interest model based on mixed behaviors computes the interest degree, yielding a more reliable score for each page and personalized retrieval results for the user. The specific work is as follows. First, the thesis identifies the key problems the system is expected to solve, introduces the workflow of a search engine and the key technologies involved in development, and analyzes in particular an approach to deduplicating web page links. Second, it describes in detail the analysis and modeling of the user interest model, focusing on how user data is collected in a Python environment and on the criteria for classifying user behavior. On this basis, the author proposes a user interest model based on mixed behaviors that treats the user's reading time as a special signal: when the reading time is abnormal, other behaviors are used to characterize the user's interest degree.
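The abstract does not give the formula behind the mixed-behavior interest model, so the following is only a minimal sketch of the general idea it describes: use reading time as the main signal, and fall back to other behaviors when the reading time looks abnormal. Every name, threshold, weight, and behavior label here (`PageVisit`, `MIN_READ_SECONDS`, `BEHAVIOR_WEIGHTS`, and so on) is a hypothetical choice made for illustration and is not taken from the thesis.

```python
from dataclasses import dataclass

# All thresholds and weights below are illustrative assumptions,
# not values taken from the thesis.
MIN_READ_SECONDS = 3.0      # shorter visits are treated as abnormal (e.g. a mis-click)
MAX_READ_SECONDS = 1800.0   # longer visits are treated as abnormal (e.g. an idle tab)
SATURATION_SECONDS = 300.0  # reading time at which the score is capped at 1.0

# Hypothetical weights for behaviors other than reading time.
BEHAVIOR_WEIGHTS = {"bookmark": 0.5, "copy": 0.3, "revisit": 0.2}


@dataclass
class PageVisit:
    """One visit to a result page: reading time plus any other observed behaviors."""
    read_seconds: float
    behaviors: frozenset = frozenset()


def interest_degree(visit: PageVisit) -> float:
    """Return an interest degree in [0, 1] for a single page visit."""
    if MIN_READ_SECONDS <= visit.read_seconds <= MAX_READ_SECONDS:
        # Normal reading time: score grows linearly and saturates at 1.0.
        return min(visit.read_seconds / SATURATION_SECONDS, 1.0)
    # Abnormal reading time: ignore it and score the other behaviors instead.
    score = sum(BEHAVIOR_WEIGHTS.get(name, 0.0) for name in visit.behaviors)
    return min(score, 1.0)
```

Under these assumptions, a 150-second visit scores 0.5 from reading time alone, while a two-hour visit is treated as abnormal and scored only from behaviors such as bookmarking or copying text.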
Third, the thesis presents the overall architecture of the system, which is built around three core modules: web crawling, indexing and retrieval, and page display. The vertical search engine is developed with the Python-based open source crawler framework Scrapy, the BeautifulSoup web page parsing library, the Whoosh indexing and retrieval library, and the Flask framework. During development, the thesis identifies excessive memory consumption in Scrapy's built-in URL deduplication method and improves on it with a Bloom filter. Based on practical experience, two strategies are devised to prevent the crawler from being banned. Because Whoosh's Chinese word segmentation is unsatisfactory, the open source jieba segmentation component is used to improve it. Finally, the prototype system was tracked and tested for 32 days and evaluated on four measures: recall, precision, response time, and dead link ratio. Test conclusions were drawn from collected user evaluations and feedback.
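To illustrate why a Bloom filter reduces memory use for URL deduplication, here is a minimal standard-library sketch. It is not the thesis's implementation: the class name, the bit-array size, the number of hash functions, and the `seen_before` helper are all assumptions chosen for illustration. The filter stores only a fixed-size bit array rather than every URL or fingerprint, at the cost of occasional false positives (a new URL wrongly reported as seen).

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        # Sizes are illustrative assumptions; 1 << 20 bits occupies 128 KiB.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def seen_before(url: str, seen: BloomFilter) -> bool:
    """Return True if url was probably seen already; otherwise record it and return False."""
    if url in seen:
        return True
    seen.add(url)
    return False
```

In a Scrapy project, logic of this kind would typically live in a custom duplicate filter selected through the `DUPEFILTER_CLASS` setting; how the thesis wires it in is not stated in the abstract.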
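[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3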

[Similar literature]