Research on Text Classification Algorithms Based on the MapReduce Model
[Abstract]: As networks grow and the volume of information increases, text classification in a centralized environment can no longer meet current needs, and large-scale data processing in distributed environments has become a focus of the IT industry. Fields such as advertising and information retrieval need to classify data at this scale, which makes text classification of large-scale data in cloud computing environments an active research topic. This thesis designs an inverted index tree structure and, on the Hadoop platform, uses it to study text classification algorithms and their incremental variants. The main results, contributions, and innovations are as follows:
1. To speed up feature selection and the KNN and Bayes text classification algorithms, and to handle the sparse distribution of text vector dimensions, an inverted index tree structure is proposed and parallelized on the cloud platform.
2. Combining the inverted index tree structure with text classification algorithms, a construction algorithm and a pruning strategy for the inverted index tree on massive data are presented, together with an incremental inverted index tree algorithm and its parallel design.
3. Based on the inverted index tree structure, a K-means incremental classification algorithm is designed, and its parallel design on the Hadoop platform is given.
4. Based on the inverted index tree structure, a naive Bayesian classification algorithm for the Hadoop cloud computing platform is proposed, along with three improved variants weighted by TFIDF, by mutual information, and by expected cross-entropy. A local naive Bayesian text classification algorithm based on the inverted index tree is also presented.
A Hadoop cluster was built for experiments that evaluate the classification accuracy, recall, and performance of the inverted index tree structure and the improved text classification algorithms.
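The abstract does not describe the inverted index tree itself, so the following is only a minimal sketch of the general idea it builds on: constructing a plain (flat) inverted index in map, shuffle, and reduce steps, then reading a TF-IDF weight from it. It runs in a single process rather than on Hadoop, and every name in it (`map_doc`, `shuffle`, `reduce_term`, `build_index`, `tfidf`) is a hypothetical helper introduced for illustration, not part of the thesis or of any Hadoop API.

```python
from collections import Counter, defaultdict
import math


def map_doc(doc_id, text):
    """Map step: emit (term, (doc_id, term_frequency)) pairs for one document."""
    counts = Counter(text.lower().split())
    return [(term, (doc_id, tf)) for term, tf in counts.items()]


def shuffle(pairs):
    """Shuffle step: group emitted values by term, as a MapReduce framework would."""
    grouped = defaultdict(list)
    for term, posting in pairs:
        grouped[term].append(posting)
    return grouped


def reduce_term(term, postings):
    """Reduce step: produce the sorted postings list for one term."""
    return term, sorted(postings)


def build_index(docs):
    """Run the three steps in-process over a {doc_id: text} mapping."""
    pairs = [p for doc_id, text in docs.items() for p in map_doc(doc_id, text)]
    return dict(reduce_term(t, ps) for t, ps in shuffle(pairs).items())


def tfidf(index, term, doc_id, n_docs):
    """TF-IDF weight tf * log(N / df), read directly from the postings list."""
    postings = index.get(term, [])
    tf = dict(postings).get(doc_id, 0)
    if tf == 0:
        return 0.0
    return tf * math.log(n_docs / len(postings))


# Toy corpus (assumed data, for illustration only).
docs = {
    "d1": "hadoop cluster text text",
    "d2": "text classification",
    "d3": "hadoop mapreduce",
}
index = build_index(docs)
weight = tfidf(index, "text", "d1", len(docs))
```

Because each postings list already holds the document frequency (its length) and the per-document term frequencies, weights such as TF-IDF can be computed without rescanning the corpus, which is one plausible reason an index of this kind helps feature selection and weighted Bayes classifiers.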
[Degree-granting institution]: Liaoning University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1