Research on Text Classification Algorithms Based on the MapReduce Model

Published: 2018-10-18 16:41

[Abstract]: As networks continue to grow and the volume of information keeps increasing, text classification in a centralized environment can no longer meet current needs, so large-scale data processing in distributed environments has become a focus of the IT industry. Applications such as advertising placement and information retrieval require text classification over large-scale data, which makes large-scale text classification in cloud computing environments an important research topic. Working on the Hadoop platform and building on the inverted index tree structure designed in this thesis, this work studies text classification algorithms and their incremental variants. The main results, contributions, and innovations are as follows:
1. To meet the computation-speed needs of feature selection methods and of text classification algorithms such as KNN and Bayes, and to account for the sparse distribution of text vector dimensions, the thesis proposes an inverted index tree structure and parallelizes it on a cloud platform.
2. Combining the inverted index tree structure with text classification algorithms, the thesis gives a construction algorithm and pruning strategy for inverted index trees over massive data, together with an incremental inverted index tree algorithm and its parallel design.
3. Based on the inverted index tree structure, the thesis designs a K-means incremental classification algorithm and gives its parallel design on the Hadoop platform.
4. Based on the inverted index tree structure, the thesis proposes a naive Bayes classification algorithm for the Hadoop cloud computing platform and gives three improved variants, weighted by TF-IDF, by mutual information, and by expected cross entropy respectively. It also gives a local naive Bayes text classification algorithm based on the inverted index tree.
5. A Hadoop cluster was built for experimental analysis to verify the classification accuracy, recall, and classification performance of the inverted index tree structure and the improved text classification algorithms.
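The abstract does not specify the layout of the thesis's inverted index tree, so the following is only a rough illustration of the underlying idea: a minimal sketch of building a plain (flat) inverted index in the map/shuffle/reduce style, simulated in a single Python process rather than on Hadoop. The function names (`map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`), the whitespace tokenizer, and the sample documents are all assumptions made for illustration, not the author's design.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per token.

    Lowercasing and whitespace splitting are simplifying assumptions.
    """
    for term in text.lower().split():
        yield term, doc_id


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as a MapReduce
    framework would do between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce step: turn the list of doc ids for one term into a
    postings map of doc_id -> term frequency."""
    postings = defaultdict(int)
    for doc_id in doc_ids:
        postings[doc_id] += 1
    return term, dict(postings)


def build_inverted_index(docs):
    """Run map, shuffle, and reduce over a dict of doc_id -> text."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    grouped = shuffle(pairs)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


# Hypothetical sample documents, used only to exercise the sketch.
docs = {
    "d1": "hadoop text classification",
    "d2": "text index text",
}
index = build_inverted_index(docs)
print(index["text"])  # {'d1': 1, 'd2': 2}
```

Because each term's postings store only the documents that contain it, the index stays compact when document vectors are sparse, which is the property the abstract cites as motivation. On a real cluster the map and reduce functions would run as distributed tasks and the framework would perform the grouping.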
[Degree-Granting Institution]: Liaoning University
[Degree Level]: Master's
[Year Degree Awarded]: 2013
[Classification Number]: TP391.1

[References]