Research and Implementation of Text Classification Based on Deep Learning Theory and SVM
Published: 2019-05-17 13:35
[Abstract]: The rapid development of Internet technology has produced massive amounts of data. Every day, millions of users turn to the Internet for information that is valuable and meaningful to them, and helping each of them find the knowledge and skills they want quickly and accurately in this mass of data has become an active research topic. To address this problem, researchers collect, analyze, mine, and classify data in order to make information retrieval more efficient. The core work of this thesis is to combine deep-learning feature extraction with a support vector machine (SVM) to mine, classify, and analyze large volumes of text and thereby obtain the essential features of the text. Traditional text classification algorithms rely on statistical measures such as expected cross entropy, information gain, and mutual information, and obtain the feature set by applying a threshold. When the training set is large, these methods tend to suffer from ill-defined feature terms and loss of feature information. To address these problems, and taking the characteristics of the available data into account, this thesis combines two deep learning methods with an SVM to design a classifier for text classification. The main research content and contributions are as follows:
1. The research status and significance of existing text classification techniques in China and abroad are introduced, the importance of text classification is explained, and the work undertaken in this thesis is outlined.
2. Traditional classification techniques are studied in three parts: text preprocessing, text feature extraction, and text classification. The Bayesian, KNN, and SVM classification algorithms are then described, and the applicable scope, strengths, and weaknesses of the three algorithms are analyzed.
3. The relevant theory of deep learning is introduced. A sparse autoencoder is proposed to map the raw data into a high-dimensional space, and a deep belief network then projects the sparse autoencoder's output to obtain abstract text features. The process of extracting text features by combining the sparse autoencoder with the deep belief network is studied.
4. Combining deep learning with an improved multi-class SVM method, a classifier composed of a sparse autoencoder, a deep belief network, and an SVM is designed to classify text. Finally, experiments are designed to test the proposed method and to compare and analyze it against traditional text classification methods. Classification accuracy is tested under different parameter settings.
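The abstract describes the traditional baseline only in words: score each term with a statistic such as information gain, then keep the terms whose score passes a threshold. The following is a minimal sketch of that idea, not the thesis's implementation. The function names, the toy corpus, and the threshold value of 0.5 are all illustrative assumptions, and documents are assumed to be already tokenized into sets of terms.

```python
import math
from collections import Counter


def entropy(label_list):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(label_list)
    if total == 0:
        return 0.0
    counts = Counter(label_list)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    n = len(docs)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    # Class entropy that remains once we know whether the term is present.
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_features(docs, labels, threshold):
    """Keep the terms whose information gain is at least `threshold`."""
    vocab = sorted(set().union(*docs))
    return [t for t in vocab if information_gain(docs, labels, t) >= threshold]


# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"goal", "match"},
    {"match", "team"},
    {"stock", "market"},
    {"market", "team"},
]
labels = ["sports", "sports", "finance", "finance"]

print(select_features(docs, labels, threshold=0.5))  # ['market', 'match']
```

In this toy corpus "team" appears equally often in both classes, so its information gain is zero and it is discarded. "goal" and "stock" are perfectly class-specific but rare, so they score about 0.31 and fall below the threshold. This illustrates how a fixed cut-off can discard informative terms, which is the kind of feature-information loss the abstract attributes to threshold-based selection.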
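[Degree-granting institution]: Jiangsu University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1
Article ID: 2479130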
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2479130.html