Research and Implementation of Application Visualization Based on Hadoop

Published: 2018-04-30 16:41

Topic: Hadoop + LDA topic model; Source: Beijing University of Posts and Telecommunications, 2015 master's thesis

[Abstract]: With the rapid development of the Internet, the volume of data it generates has grown explosively. The Internet has quickly moved from an era of information scarcity to one in which there is so much information that useful content is hard to pick out. Text, as one carrier of information, remains a primary way for people to obtain information from the Internet. Extracting information of interest from massive Internet text is therefore an important task in big data mining. Text mining is widely applied in online public opinion analysis, marketing, commercial recommendation and other areas, so research on text mining technology has broad market prospects. At the same time, the emergence of massive data means that traditional servers can no longer handle its storage and computation, and distributed systems have become the mainstream platform for processing it. How to apply traditional serial data processing methods effectively on distributed systems has thus become a major question in distributed systems research. Addressing the core problem of massive text mining, this thesis studies text clustering on the Hadoop platform, using a distributed platform to improve the efficiency of text clustering and to increase the volume of data it can process. The main contributions are:
1. Based on the characteristics of the Hadoop platform, a parallel algorithm for a distributed LDA topic model on the MapReduce architecture is implemented and improved. This removes the limit that hardware resources place on the amount of data a single-machine LDA can process. Experimental results show that the distributed LDA topic model has a clear advantage in running time when processing massive data.
2. A visual Hadoop cluster management platform is designed and implemented, which simplifies the management of a Hadoop cluster. The platform also includes a user permission control module, which improves its security.
3. Using idle computers in the laboratory, a Hadoop cluster of 25 ordinary PCs was built, and the visual management platform and the parallel LDA topic model algorithm were verified and tested on it. The system runs stably and reliably.
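The abstract does not describe how the thesis's parallel LDA algorithm or its improvements work. As an illustration only, the sketch below shows one scheme commonly described in the literature, approximate distributed LDA with collapsed Gibbs sampling, arranged as a map step and a reduce step. It simulates the partitions in a single Python process rather than on Hadoop, and every function name, parameter and default value in it is an assumption made for this example, not taken from the thesis.

```python
import random
from collections import Counter

def init_state(docs, num_topics, rng):
    """Give every token a random topic and build the global count tables."""
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    topic_word, topic_total = Counter(), Counter()
    for doc, zs in zip(docs, assignments):
        for w, z in zip(doc, zs):
            topic_word[(z, w)] += 1
            topic_total[z] += 1
    return assignments, topic_word, topic_total

def map_partition(docs, assignments, topic_word, topic_total,
                  num_topics, vocab_size, alpha, beta, rng):
    """Map step (hypothetical): Gibbs-sample one partition of documents against
    a local copy of the global counts; return new assignments and count deltas."""
    local_tw, local_tt = Counter(topic_word), Counter(topic_total)
    delta, new_assignments = Counter(), []
    for doc, zs in zip(docs, assignments):
        doc_topic = Counter(zs)
        new_zs = []
        for w, old in zip(doc, zs):
            # Take the token out of the counts before resampling its topic.
            doc_topic[old] -= 1
            local_tw[(old, w)] -= 1
            local_tt[old] -= 1
            weights = [(doc_topic[k] + alpha) * (local_tw[(k, w)] + beta)
                       / (local_tt[k] + vocab_size * beta)
                       for k in range(num_topics)]
            new = rng.choices(range(num_topics), weights=weights)[0]
            doc_topic[new] += 1
            local_tw[(new, w)] += 1
            local_tt[new] += 1
            if new != old:
                delta[(old, w)] -= 1
                delta[(new, w)] += 1
            new_zs.append(new)
        new_assignments.append(new_zs)
    return new_assignments, delta

def reduce_deltas(topic_word, deltas):
    """Reduce step (hypothetical): add all partition deltas to the global counts."""
    merged = Counter(topic_word)
    for d in deltas:
        for key, change in d.items():
            merged[key] += change
    topic_total = Counter()
    for (k, _w), c in merged.items():
        topic_total[k] += c
    return merged, topic_total

def run(docs, num_topics=2, num_partitions=2, iterations=20,
        alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    assignments, tw, tt = init_state(docs, num_topics, rng)
    parts = [range(i, len(docs), num_partitions) for i in range(num_partitions)]
    for _ in range(iterations):
        deltas = []
        for idx in parts:  # on a cluster, each partition would be its own map task
            new_zs, delta = map_partition(
                [docs[i] for i in idx], [assignments[i] for i in idx],
                tw, tt, num_topics, vocab_size, alpha, beta, rng)
            for i, zs in zip(idx, new_zs):
                assignments[i] = zs
            deltas.append(delta)
        tw, tt = reduce_deltas(tw, deltas)
    return assignments, tw, tt

docs = [["hadoop", "hdfs", "mapreduce"], ["topic", "lda", "gibbs"],
        ["hadoop", "mapreduce", "cluster"], ["lda", "topic", "model"]]
assignments, topic_word, topic_total = run(docs)
```

Within one iteration every partition samples against the same, slightly stale global counts, which is the approximation that lets the map tasks run independently. The reduce step then restores counts that match the new topic assignments exactly.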
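[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP391.1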
