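Development of a Campus Search Engine and Its Traffic Measurement
Published: 2018-06-08 02:16
Topics: search engine + Lucene; Source: Beijing University of Posts and Telecommunications, 2012 master's thesis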
[Abstract]: A search engine is often a user's first stop on the Internet; it helps users pick out the information they actually care about from a vast number of web pages. Although search engine technology is relatively mature, the core technology remains in the hands of a few large companies with a monopoly position. These companies offer retrieval over the entire Internet, whereas some companies and institutions also want a search tool for their own intranet: such a tool is more targeted, gives better search results, and helps prevent information leakage. The intranet of the author's university holds a wealth of information, yet the campus currently has no search-engine-like tool to organize it, which causes considerable inconvenience for campus users.

To make it easier for faculty and students to find on-campus network resources, this thesis develops a campus search engine that indexes campus web pages and returns good search results for their queries. The engine is built on the frameworks of the open-source software Lucene and Nutch. Based on the characteristics of campus web pages and the specific requirements of the campus, the thesis proposes and implements a new web page dataset update algorithm, a deduplication algorithm, a ranking algorithm, and others, and re-customizes many modules. The result is a campus search engine named "Changyou" (畅邮); test results show that Changyou provides users with reasonably satisfactory service. The implementation of its ranking algorithm and other components is highly extensible, so it can be improved step by step as requirements change.

In addition, because search workloads are computationally heavy and a single-machine implementation is too slow, the campus search engine is deployed on a Hadoop distributed platform. As more and more companies and institutions run their business on Hadoop, research on Hadoop has attracted wide attention. However, there is currently almost no traffic measurement work on data centers running Hadoop, and this lack hinders research on Hadoop and data centers. Using the Hadoop cluster that runs Changyou, this thesis measures the traffic characteristics of a data center running Hadoop. Based on the inherent characteristics of data center networks, it proposes a targeted measurement method and develops a software tool named HADE specifically for processing and analyzing network data. Finally, the thesis presents the measured traffic characteristics and analyzes them, providing a useful basis for researchers working on Hadoop and data centers.
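The abstract names a deduplication algorithm for campus web pages but gives no details of it. As a rough illustration of what such a step can look like, and not the thesis's actual algorithm, the sketch below drops pages whose normalized text hashes to the same value. The function names, the URL-to-text input format, and the choice of MD5 are all assumptions made for this example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text: str) -> str:
    """Return the MD5 hex digest of the normalized page text."""
    return hashlib.md5(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(pages: dict[str, str]) -> dict[str, str]:
    """Keep the first URL seen for each distinct content fingerprint.

    `pages` maps URL -> extracted page text (hypothetical input format).
    """
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for url, text in pages.items():
        fp = fingerprint(text)
        if fp not in seen:  # first page with this content wins
            seen.add(fp)
            unique[url] = text
    return unique
```

A hash of the full text only catches pages that are identical after normalization; catching near-duplicates (for example, the same notice with a different footer) would need a similarity measure instead, which this sketch does not attempt.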
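[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP391.3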
Article ID: 1993945
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1993945.html