Research on Web Page Clustering Methods Based on the Hadoop Platform

Published: 2018-03-01 06:04

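Keywords: Normalized Cuts; Multiclass spectral clustering; web page clustering; Hadoop; MapReduce. Source: South China University of Technology, 2012 master's thesis. Document type: degree thesis.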

[Abstract]: Web pages are the main form in which information exists on the Internet; people publish and look up information through them. As the information age advances, the number of web pages has grown explosively. Among hundreds of millions of pages, how can knowledge be mined more effectively? How can spam be identified quickly? How can data be categorized with less effort? Data mining is a powerful tool for these problems, and web page clustering is one of its techniques. Clustering partitions web pages by semantics in an unsupervised or semi-supervised manner, and it has many practical applications. Search engines can use web page clustering to offer users more related information. Clustering search results gives users a navigation aid: guided by the cluster labels, they can go directly to the content they want. Web page clustering can also help distinguish spam pages. It has therefore long been a research focus in data mining, yet many problems deserve further study.

The web page clustering problem can be divided into several sub-problems: denoising of pages, content extraction, definition of similarity, dimensionality reduction, application of clustering algorithms, determination of the number of clusters, and generation of cluster labels. Each has been studied by earlier researchers, but room for improvement remains. This thesis studies the application of clustering algorithms: it applies the Multiclass spectral clustering algorithm to web page clustering and to search result clustering, and implements a web search engine that can cluster its search results. The search engine system integrates multiple clustering methods, including the Multiclass spectral clustering algorithm and the Normalized Cuts algorithm.

Although web page clustering based on spectral clustering achieves good clustering quality, the algorithm uses an N×N matrix (where N is the number of objects being clustered) to represent the pairwise similarity between objects. As the number of objects grows, the matrix grows faster (quadratically in N) until it no longer fits in memory, so spectral clustering loses scalability. This thesis therefore studies ways to improve the scalability of spectral clustering, proposes a method that uses the MapReduce mechanism of the Hadoop platform to scale the Normalized Cuts algorithm, and implements a web page clustering method on the Hadoop platform. The method is scalable and runs in parallel, which solves the problem that a single machine cannot hold the entire similarity matrix in memory.
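
The scalability argument in the abstract can be made concrete with a small sketch. The code below is not the thesis's implementation and does not use the Hadoop API; it is plain Python that imitates a map phase and a reduce phase, assuming the similarity matrix W is stored one sparse row per record. All names (`dense_matrix_bytes`, `map_degree`, `map_normalized_matvec`, `run_job`) and the toy similarity values are hypothetical. It shows two row-wise steps that Normalized Cuts style methods rely on: the degree d_i = sum_j W_ij, and one multiplication by the normalized matrix D^(-1/2) W D^(-1/2), where each mapper needs only its own row plus two length-N vectors.

```python
from collections import defaultdict
from math import sqrt


def dense_matrix_bytes(n, bytes_per_entry=8):
    """Bytes needed to hold a dense n-by-n similarity matrix in memory."""
    return n * n * bytes_per_entry


def map_degree(row_id, row):
    """Map step: emit the degree (row sum) of one row of W."""
    yield row_id, sum(row.values())


def map_normalized_matvec(row_id, row, degree, vec):
    """Map step: emit entry row_id of (D^-1/2 W D^-1/2) * vec.

    Assumption: the length-N vectors `degree` and `vec` are small enough
    to be shared with every mapper; only W itself is partitioned by row.
    """
    total = sum(
        w / sqrt(degree[row_id] * degree[j]) * vec[j] for j, w in row.items()
    )
    yield row_id, total


def reduce_sum(pairs):
    """Reduce step: sum all values emitted under the same key."""
    out = defaultdict(float)
    for key, value in pairs:
        out[key] += value
    return dict(out)


def run_job(rows, mapper, *extra):
    """Imitate one map/reduce job: apply the mapper per row, then reduce."""
    pairs = []
    for row_id, row in rows.items():
        pairs.extend(mapper(row_id, row, *extra))
    return reduce_sum(pairs)


# Toy symmetric similarity matrix for four pages, stored as sparse rows
# (hypothetical values; pages 0-1 and pages 2-3 are the similar pairs).
rows = {
    0: {0: 1.0, 1: 0.9, 2: 0.1},
    1: {0: 0.9, 1: 1.0},
    2: {0: 0.1, 2: 1.0, 3: 0.8},
    3: {2: 0.8, 3: 1.0},
}

degree = run_job(rows, map_degree)
start = {i: sqrt(d) for i, d in degree.items()}
result = run_job(rows, map_normalized_matvec, degree, start)
```

At 8 bytes per entry, `dense_matrix_bytes(1_000_000)` is 8 × 10^12 bytes, about 8 TB for one million pages, which illustrates why a single machine cannot hold the dense matrix. In the toy run, `start` is the vector D^(1/2)·1, a known eigenvector of the normalized matrix with eigenvalue 1, so `result` should equal `start` up to rounding; this is a convenient sanity check for a row-partitioned multiplication of this kind.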
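[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.092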

[References]