Research and Implementation of Text Classification Based on Deep Learning Theory and SVM

Published: 2019-05-17 13:35

[Abstract]: The rapid development of Internet technology has produced massive amounts of data. Every day, millions of users turn to the Internet for information that is valuable and meaningful to them, and how to let each of them obtain the knowledge and skills they want from this mass of data quickly and accurately has become an active research topic. To address it, researchers acquire, analyze, mine and categorize data in order to make information retrieval more efficient. The core work of this thesis is to combine deep-learning-based feature extraction with a support vector machine (SVM) to mine, classify and analyze large volumes of text, ultimately obtaining the essential features of the text. Traditional text classification algorithms rely on statistical measures such as expected cross entropy, information gain and mutual information, and obtain the feature set by applying a threshold. When the training set is large, this approach tends to produce ill-defined feature terms and to lose feature information. To address these problems, and taking the characteristics of the available data into account, this thesis proposes a classifier that combines two deep learning methods with an SVM to perform text classification. The main research content and contributions are as follows:
1. The current state and significance of text classification research in China and abroad are reviewed, the importance of text classification is explained, and the work undertaken in this thesis is outlined.
2. Traditional classification techniques are examined in three parts: text preprocessing, text feature extraction and text classification. The naive Bayes, KNN and SVM classification algorithms are then described, and the applicable scope, advantages and disadvantages of each are analyzed.
3. The relevant theory of deep learning is introduced. A sparse autoencoder is proposed to map the raw data into a high-dimensional space, and a deep belief network then projects the autoencoder's output to obtain abstract text features. The process of extracting text features by combining the sparse autoencoder with the deep belief network is studied.
4. Combining deep learning with an improved multi-class SVM method, a text classifier is designed that chains a sparse autoencoder, a deep belief network and an SVM. Experiments are designed to test the proposed method and to compare and analyze it against traditional text classification methods. Classification accuracy is evaluated under varied parameter settings.
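The threshold-based feature selection that the abstract attributes to traditional methods can be illustrated with a small sketch. It is not taken from the thesis: the function names (`entropy`, `information_gain`, `select_features`), the toy corpus and the threshold of 0.5 are all assumptions chosen for illustration. It scores each term by information gain and keeps only the terms at or above the threshold.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a collection of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t).

    `docs` is a list of term sets, `labels` the class of each document.
    """
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    gain = entropy(Counter(labels).values())
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:  # skip an empty side of the split
            gain -= (size / n) * entropy(part.values())
    return gain


def select_features(docs, labels, threshold):
    """Keep every vocabulary term whose information gain reaches the threshold."""
    vocab = set().union(*docs)
    scores = {t: information_gain(docs, labels, t) for t in vocab}
    return {t: s for t, s in scores.items() if s >= threshold}


# Hypothetical toy corpus: each document is reduced to its set of terms.
docs = [
    {"goal", "match", "team"},
    {"team", "score", "match"},
    {"stock", "market", "team"},
    {"market", "price", "stock"},
]
labels = ["sports", "sports", "finance", "finance"]

selected = select_features(docs, labels, threshold=0.5)
print(sorted(selected))
```

On this toy corpus, terms that appear in every document of one class and in no document of the other score highest, while a term such as "team" that appears in both classes scores lower and is dropped. The hard cut-off is the step the abstract identifies as a source of lost feature information on large training sets.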
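[Degree-granting institution]: Jiangsu University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1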
