Research on Search Strategies and Algorithm Design for Topic-Specific Search Engines

Published: 2018-05-03 03:37

Topics: search engine + focused crawler; Source: Lanzhou University, master's thesis, 2017

[Abstract]: Site search is becoming increasingly common in Internet applications. A website that wants to grow must offer rich content, and whatever users want to find, whether the latest material or older items (for example, a news report seen some time ago that no longer appears on the home page because it is no longer current), can be located with a search engine. Search engines give users fast access to resources and let people obtain information from the Internet more effectively, almost without leaving home, so the quality of a search engine directly shapes people's online experience. This thesis analyzes mainstream search strategies and algorithms, examines in depth the classification, technical architecture, and working principles of search engines, and studies the design and modeling of a focused (topic-specific) crawler system. It incorporates machine learning algorithms on top of existing techniques, discusses in detail the ideas behind feature selection algorithms for documents, and describes the current mainstream improved TF-IDF algorithms. Using Python 2.7 as the development platform, a focused crawler system based on Context Graph is designed and implemented. Finally, taking major Chinese automobile websites as an example, "automobile" is set as the topic term for classified crawling, and the system's performance is evaluated by recall, precision, and F1 score. The experimental results indicate that the algorithm designed in this thesis performs well in topic-term classification of documents and in web crawling efficiency.
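The abstract names recall, precision, and F1 as the evaluation metrics but does not show how they are computed. The following is a minimal sketch of the standard definitions, not the thesis's own evaluation code. The function name `precision_recall_f1`, the page identifiers, and the sample sets are hypothetical and chosen only for illustration.

```python
from __future__ import division  # true division on Python 2.7 as well as Python 3


def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for a crawl.

    retrieved: pages the crawler classified as on-topic (hypothetical IDs)
    relevant:  pages that are actually on-topic (ground truth)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # correctly retrieved on-topic pages

    # Precision: share of retrieved pages that are truly on-topic.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of on-topic pages that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0

    # F1: harmonic mean of precision and recall (0 when both are 0).
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only: four crawled pages, three truly on-topic pages.
crawled = ["a", "b", "c", "d"]
on_topic = ["a", "b", "e"]
p, r, f1 = precision_recall_f1(crawled, on_topic)
print(p, r, f1)
```

With this sample, two of the four crawled pages are on-topic, so precision is 2/4 and recall is 2/3; F1 combines the two so that neither metric alone can dominate the score.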
[Degree-granting institution]: Lanzhou University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3

[References]