Research on a Parallel Text Classification Algorithm Based on Random Forest and Spark

Published: 2018-11-25 09:04

[Abstract]: Text classification arises widely in applications such as search engines and information retrieval. In an era of widespread information technology, effectively classifying text in big data is an important topic in data mining research. This thesis studies the application of the random forest algorithm to massive-scale text classification. Random forest is an ensemble algorithm that can handle massive data effectively. By introducing randomness, the random forest classifier achieves good classification performance while largely resolving the overfitting problem of the decision tree algorithm. While sampling to build decision trees, however, random forest may generate poor random subspaces, which leaves the corresponding trees with weak classification ability. To address this, the thesis adopts a random forest algorithm based on rough set theory to adjust the classification ability of these trees. In addition, a weighted voting method is applied according to the classification ability of each tree in the forest. Experiments show that the rough-set-based random forest outperforms KNN, naive Bayes, decision trees, and the traditional random forest on most datasets. MapReduce is currently the most widely used parallel computing framework for big data, and parallel text classification algorithms under MapReduce have received considerable attention. Its drawback is that intermediate results are stored on HDFS during parallel computation, which incurs heavy I/O overhead, among other costs. Spark, by contrast, is a parallel framework based on in-memory computation: it does not write intermediate results directly to disk during execution (data is partially cached to disk only when memory is insufficient), so its execution efficiency is comparatively better. This thesis studies the application of random forest and Spark to massive-scale text classification and makes a brief comparison with parallel text classification under MapReduce. Experiments show that parallel text classification on Spark has good parallel performance and outperforms its MapReduce counterpart. Finally, to make the cluster easier to use, a parallel text classification system based on a B/S (browser/server) architecture is designed for remote job submission, cluster monitoring, and data download.
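The weighted voting mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how each tree's weight is computed, so the example below simply assumes every tree carries a numeric score reflecting its classification ability (for instance, an accuracy estimate). The names `weighted_vote`, `tree_predictions`, and `tree_weights` are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def weighted_vote(predictions: Sequence[Hashable], weights: Sequence[float]) -> Hashable:
    """Return the label whose supporting trees have the largest total weight."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per tree prediction")
    if not predictions:
        raise ValueError("need at least one prediction")
    totals: Dict[Hashable, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # On a tie, the label that was seen first wins (dict insertion order).
    return max(totals, key=totals.get)


# Hypothetical example: three trees classify one document.
tree_predictions = ["sports", "finance", "finance"]
tree_weights = [0.9, 0.3, 0.4]  # assumed per-tree ability scores, not from the thesis
print(weighted_vote(tree_predictions, tree_weights))  # prints "sports"
```

With equal weights this reduces to ordinary majority voting, which would pick "finance" here. The weighted variant lets one strong tree outvote two weak ones, which is the intuition behind weighting trees by classification ability.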
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1
