Research on a Distributed Search Engine Based on MapReduce

Published: 2018-06-29 00:05

Topic: search engine + MapReduce; Source: Lanzhou University of Technology, master's thesis, 2013

[Abstract]: With the explosive growth of online resources, search engines have become an important tool for Internet users to obtain information. Traditional search engines mostly use a centralized architecture in which the search system is deployed on a single server. This places high demands on server performance and limits system stability and scalability. They also rely on keyword matching, so users cannot retrieve information quickly and accurately from massive data sets, and information coverage, result relevance, and accuracy fall short of users' rising expectations. In recent years distributed computing theory has been widely studied, and search engines based on distributed computing have emerged in response. They overcome the shortcomings of centralized search engines by adding servers to handle large data volumes, and they incorporate personalized user search models and semantic analysis, which has made them an active research topic in data mining and intelligent information processing. Building on a study of search engine principles and structure and of distributed computing technology, this thesis investigates the model framework, data processing flow, ranking algorithm optimization, and topic-focused crawler of a MapReduce-based distributed search engine. The main work is as follows:

(1) By studying the distributed file system (HDFS), the working principle of the MapReduce programming model is analyzed. To address the load imbalance and performance bottleneck caused by the single-NameNode control structure in the original architecture, a structure controlled by multiple NameNode nodes is proposed. During MapReduce processing, key values in the intermediate results that are too scattered or too concentrated cause data imbalance, which makes Reduce-side jobs run too long or fail. This thesis introduces a data balancing mechanism after the Map phase, which improves system performance and lowers the failure rate.

(2) The PageRank algorithm distributes weight evenly across links and does not consider topic relevance between pages. By introducing topic relevance and timeliness mechanisms, this thesis lets the algorithm account for both the topic relevance between links and the timeliness of pages. PageRank also generates a large volume of intermediate iteration data when computing page weights, which degrades its performance. This thesis partitions the network with a block-structure-based algorithm, effectively reducing the data produced by intermediate iterations and improving the algorithm's performance.

(3) By adopting a feature selection method based on term-frequency differences and an improved TF-IDF formula, the Context Graph crawler search strategy is improved. The approach takes into account how text from different parts of a web page affects feature selection, as well as the between-class and within-class weights of each feature term, which improves the crawling efficiency of the topic-focused crawler.
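The idea in item (2), replacing PageRank's even split of weight with a split that reflects topic relevance, can be illustrated with a small sketch. The abstract does not give the thesis's actual formula, so everything here is an assumption: the function name `topic_weighted_pagerank`, the `relevance` scores, and the rule that a page divides its rank among its out-links in proportion to each target's relevance are all hypothetical. Timeliness and the block-structure partitioning are not modeled.

```python
from typing import Dict, List


def topic_weighted_pagerank(
    links: Dict[str, List[str]],
    relevance: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power iteration in which each page splits its rank among its
    out-links in proportion to the target's topic relevance score
    (a hypothetical weighting), instead of evenly as in classic PageRank."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with the teleportation share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = links.get(src, [])
            total = sum(relevance.get(t, 0.0) for t in targets)
            if not targets or total == 0.0:
                # Dangling page, or no relevant target: spread rank uniformly.
                for p in pages:
                    new_rank[p] += damping * rank[src] / n
                continue
            for t in targets:
                share = relevance.get(t, 0.0) / total
                new_rank[t] += damping * rank[src] * share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy graph with made-up relevance scores.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    scores = {"a": 0.5, "b": 0.9, "c": 0.1}
    for page, value in topic_weighted_pagerank(graph, scores).items():
        print(page, round(value, 4))
```

Setting every relevance score to the same value reduces this sketch to the classic even split, which makes it easy to compare the two behaviors on the same graph.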
[Degree-granting institution]: Lanzhou University of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
