Design and Implementation of a Spark-Based Classification Method for Very Large Text Collections

Published: 2018-05-06 16:19

Topics: big data + text classification; Source: master's thesis, Beijing Jiaotong University, 2017

[Abstract]: The rapid development of Internet technology has produced massive amounts of online text data. Most of this data, however, has never been processed or classified, which has allowed undesirable phenomena such as spam email and pushed advertising to spread. People find it hard to extract useful information from this mass of data and waste a great deal of time and effort dealing with junk information. Classifying massive text data efficiently is therefore of significant theoretical and practical value.

The thesis first analyzes the problems of traditional text classification algorithms. (1) Feature-vector extraction is slow and inefficient. The feature space of massive data is open-ended and effectively unbounded, yet traditional text representation algorithms extract features offline in batch mode. This is computationally inefficient, consumes large amounts of memory, and can even cause serious failures such as memory overflow. (2) Traditional classifiers are poorly suited to big-data computing frameworks. Massive data is usually processed with distributed parallel computing, but traditional classification algorithms such as SVM and the naive Bayes classifier are not well suited to that setting. In addition, deep learning algorithms, although widely used in semantic recognition, bring little improvement when applied to text classification systems while requiring long model-training times, so the benefit is not obvious.

To address these problems, the thesis studies two aspects, text representation and classifier design. The main contributions are as follows. (1) For text representation, it proposes an online domain-partitioned feature selection algorithm for streaming data (the OFFS algorithm). The algorithm improves on the vector space model, extracts features from streaming data in real time, and generates text vectors quickly, which resolves the low efficiency and high memory consumption of traditional feature extraction algorithms. (2) For classifier design, it designs the OFFS-BP neural network text classifier, which combines a BP neural network with the OFFS algorithm. The classifier is adapted to distributed parallel computing environments, reduces model-training time, and balances computational efficiency against classification accuracy. (3) It implements the OFFS-BP neural network classifier on the Spark platform. The OFFS algorithm is implemented with the Spark Streaming sub-framework, the BP neural network classifier is implemented with the Spark MLlib sub-framework, and the two are connected seamlessly through RDD, Spark's programming model. Experiments on several datasets show that the proposed OFFS-BP neural network classifier is better suited to big data, takes less computation time, and classifies more efficiently.
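
The abstract does not give the internals of the OFFS algorithm, so the following is not the thesis's method. It is a minimal sketch of the general idea the abstract describes, turning documents into fixed-size vectors one at a time as they arrive, using the well-known hashing trick as a stand-in technique. The names `NUM_FEATURES`, `bucket`, `vectorize`, and `stream_vectors` are hypothetical, and whitespace tokenization is an assumption made only to keep the example short.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable, Iterator

# Hypothetical fixed vector dimension (assumption). Memory use stays bounded
# because the dimension never grows, no matter how many distinct terms appear.
NUM_FEATURES = 16


def bucket(token: str, num_features: int = NUM_FEATURES) -> int:
    """Map a token to a bucket index; crc32 gives the same result on every run."""
    return zlib.crc32(token.encode("utf-8")) % num_features


def vectorize(text: str, num_features: int = NUM_FEATURES) -> Dict[int, float]:
    """Turn one document into a sparse term-frequency vector {bucket: frequency}."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    if not tokens:
        return {}
    counts = Counter(bucket(t, num_features) for t in tokens)
    total = float(len(tokens))
    return {index: count / total for index, count in counts.items()}


def stream_vectors(
    docs: Iterable[str], num_features: int = NUM_FEATURES
) -> Iterator[Dict[int, float]]:
    """Lazily vectorize documents one by one, without building a vocabulary first."""
    for doc in docs:
        yield vectorize(doc, num_features)


if __name__ == "__main__":
    incoming = iter(["spark spark streaming", "text classification on spark"])
    for vector in stream_vectors(incoming):
        print(vector)
```

Because each document is vectorized independently and no global vocabulary is kept, a function like `vectorize` could in principle be applied per record inside a streaming job. How the thesis actually selects features per domain is not stated in the abstract and is not modeled here.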
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1;TP18

[References]