
Text-Based Massive Data Mining in a Distributed Environment

Published: 2018-06-28 15:40

Topic: big data + data mining; Source: Master's thesis, Shanghai Jiao Tong University, 2013


[Abstract]: Data mining has long been a research hotspot in computer science. In recent years, with the spread of Web 2.0 applications and the development of cloud computing, the Internet has entered the era of big data, and the ways in which data are generated, transmitted, stored, accessed, and processed have changed markedly. Traditional data mining methods face serious challenges in this era of heterogeneous data sources and rapidly expanding data volumes. This thesis proposes a complete text-based data mining method for distributed environments, covering the entire pipeline for massive text data from extraction and preprocessing through data warehouse construction to data mining. The method is validated by applying it to the Weibo user recommendation problem, with good results.

Data mining in the broad sense usually comprises two parts: building a data warehouse and performing the mining itself. The objects of data mining are typically large-scale data from multiple heterogeneous sources; for reasons of data consistency and access efficiency, a unified management system is needed to integrate and maintain the data, namely a data warehouse. Building a data warehouse involves extracting, transforming, and loading data, i.e. the ETL process. Traditional data warehouse design follows RDBMS principles: the data types and structures of all sources are integrated into a single unified schema, including table structures, foreign keys, and so on. The advantage of this approach is that the ACID properties of the data are guaranteed. Under big data conditions, however, data sources are complex and highly heterogeneous and data volumes grow rapidly, which poses new challenges to the scalability, flexibility, and efficiency of RDBMS-based data warehouses.

On top of the data warehouse, traditional data mining has developed a fairly mature body of algorithms, with classification, clustering, association, and prediction as typical examples, plus techniques such as machine learning and neural networks arising from intersections with other disciplines. These data mining workloads share some distinctive characteristics: data are written once and read frequently, computation is intensive, and updates are rare. Given these characteristics, the ACID guarantees of RDBMS-based designs not only yield little benefit but actually become a performance constraint.

To address these problems, this thesis proposes a scheme for text-based data warehouse construction and data mining in a distributed environment. First, for warehouse construction, it presents a method for rapidly building a data warehouse in a distributed environment, using MapReduce to carry out the entire ETL process; it also abandons the RDBMS in favor of a NoSQL database cluster as the foundation of the warehouse, ensuring the system's scalability and runtime efficiency. Second, borrowing ideas from search engines, it proposes a MongoDB + Lucene + MapReduce data mining solution for text data, using parallel access to improve the efficiency of accessing massive text data in a distributed environment, and evaluating the information content of text by computing TF-IDF values rather than by traditional lexical and syntactic analysis. Finally, the complete method is applied to a data mining problem with Web 2.0 characteristics, Weibo user recommendation, which verifies its feasibility and achieves good results.
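The MapReduce-driven ETL stage described in the abstract can be sketched in miniature as follows. This is a pure-Python simulation of the map, shuffle, and reduce phases, not the thesis's actual Hadoop implementation; the tab-separated input format, field names, and cleaning rules are illustrative assumptions:

```python
from collections import defaultdict

def map_phase(raw_line):
    # Map: parse one raw record, clean it, and emit (key, value) pairs.
    # The "user<TAB>text" input format is an illustrative assumption.
    user, _, text = raw_line.partition("\t")
    text = text.strip().lower()
    if not user or not text:
        return []  # drop malformed records during extraction
    return [(user, text)]

def reduce_phase(user, texts):
    # Reduce: merge all cleaned texts of one user into a single document,
    # standing in for a record loaded into the NoSQL warehouse.
    return {"user": user, "docs": texts, "n_docs": len(texts)}

def run_etl(raw_lines):
    # Shuffle: group mapped pairs by key, as the MapReduce runtime would.
    grouped = defaultdict(list)
    for line in raw_lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    return [reduce_phase(u, texts) for u, texts in sorted(grouped.items())]

raw = [
    "alice\tBig Data is hot ",
    "bob\tI like MapReduce",
    "alice\tNoSQL scales well",
    "\tbroken record",
]
for record in run_etl(raw):
    print(record)
```

In a real deployment the map and reduce functions would run as distributed Hadoop tasks and the reducer output would be written to the NoSQL cluster; the single-process version above only illustrates the data flow.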
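The TF-IDF weighting used to evaluate text information content, and its application to user recommendation, can be illustrated with a small self-contained sketch. The TF and IDF formulas are the standard textbook definitions; the toy corpus and the cosine-similarity recommendation step are illustrative assumptions rather than the thesis's exact algorithm:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: one token list per user.
    # TF(t, d) = count(t in d) / len(d)
    # IDF(t)   = log(N / df(t)), df(t) = number of docs containing t
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        )
    return vectors

def cosine(u, v):
    # Cosine similarity between two sparse TF-IDF vectors (dicts).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

users = ["alice", "bob", "carol"]
posts = [
    "big data mining with mapreduce".split(),
    "data mining and clustering algorithms".split(),
    "cooking pasta recipes".split(),
]
vecs = tfidf_vectors(posts)

# Recommend to alice the other user with the most similar posts.
sims = {users[i]: cosine(vecs[0], vecs[i]) for i in range(1, len(users))}
best = max(sims, key=sims.get)
print(best)
```

Terms that occur in every document receive an IDF of log(1) = 0 and thus contribute nothing, which is exactly the "information content" filtering the abstract describes: frequent, uninformative words are suppressed without any lexical or syntactic analysis.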
【Degree-granting institution】: Shanghai Jiao Tong University
【Degree level】: Master's
【Year conferred】: 2013
【CLC number】: TP311.13






Link to this article: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2078500.html


