Design and Implementation of Hadoop-Based Large-Scale Chinese Website Clustering

Published: 2018-11-04 19:11

[Abstract]: Text clustering is an important research topic in data mining. It is widely used in statistics, finance, biology, medicine, information retrieval, and document classification, and popular applications include website navigation bars, paper similarity detection, and user recommendation. With the rapid spread of the Internet, the number of Chinese websites has grown enormously, and people obtain ever more information from web pages. Because different users have different needs and standards, the data is diverse and subject to varying quality requirements. Extracting the needed information from web pages quickly and efficiently has therefore become a major challenge, and research on text clustering offers a good way to address it. Precisely because the data is massive and diverse, traditional clustering analysis often fails to achieve satisfactory time and space performance when clustering text. With the rise of cloud computing, a growing number of researchers have studied and applied distributed parallel frameworks for clustering. Hadoop is a distributed system infrastructure developed by the Apache Foundation, with two core components: HDFS and MapReduce. HDFS provides storage for massive data, while MapReduce performs the computation over that data in parallel. This thesis designs a clustering analysis system for Chinese websites on the Hadoop platform. Its main work is as follows.
1. It introduces the ideas behind commonly used classical clustering algorithms and the related theory, and describes in detail the full text clustering workflow and the common similarity measures.
2. It examines the two core frameworks and key technologies of the Hadoop platform, explains how they relate to each other and how they operate, and describes their advantages over running clustering experiments in a traditional single-machine environment.
3. It builds a Hadoop distributed environment, configures the Eclipse development tool, and implements a program based on the k-means clustering algorithm to run system tests on Chinese website page data. The experiment successfully partitioned all of the pages into clusters. Analysis of the experimental results demonstrates Hadoop's strong computing capability on large-scale data and shows that, up to a point, computing capability increases as nodes are added to the cluster.
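To illustrate how k-means fits the map and reduce pattern described in the abstract, the following is a minimal sketch in plain Python. It is not the thesis's implementation and uses no Hadoop APIs. The function names (`map_phase`, `reduce_phase`, `kmeans`), the use of squared Euclidean distance, and the small numeric vectors standing in for page feature vectors are all assumptions made for illustration.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(points, centroids):
    """Map step: emit (index of nearest centroid, point) for every point."""
    for point in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: squared_distance(point, centroids[i]),
        )
        yield nearest, point


def reduce_phase(pairs, centroids):
    """Reduce step: average the points grouped under each key to get new centroids."""
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    # A centroid that received no points keeps its previous position.
    new_centroids = list(centroids)
    for key, members in groups.items():
        dim = len(members[0])
        new_centroids[key] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return new_centroids


def kmeans(points, centroids, max_iters=20):
    """Repeat map and reduce until the centroids stop changing or max_iters is hit."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors; real page vectors would be much larger.
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

In this sketch the assignment step is independent for each point, which is why it can be spread across many mappers, while the averaging step only needs the points that share a key. On a real cluster each iteration would typically run as a separate job with the current centroids made available to every mapper; how the thesis handles this is not stated in the abstract.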
[Degree-granting institution]: Central China Normal University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13;TP393.092

[Similar literature]