Research on Text Classification Based on the SVM Algorithm

Published: 2018-05-23 16:02

Topic: text classification + SVM; source: Jilin University, master's thesis, 2017

[Abstract]: With advances in science and technology, every field pays increasing attention to data, while researchers have grown more sensitive to data and better able to apply it; together these trends have brought about the era of big data. However, the Internet carries not only the useful resources people need but also a large amount of harmful content that disrupts normal work and misleads the public. Even the useful resources are disorganized, which causes information overload and leaves users with a sense of inefficiency. Systematically processing and accurately classifying data, so that it becomes usable information with a specific purpose, is therefore a goal researchers pursue. In its early stage, this thesis surveys existing research on text classification, both domestic and international. It then concentrates on how the SVM method solves the binary text classification problem and extends this to the multi-class problem. The support vector machine (SVM) is a machine learning method grounded in statistical learning theory that performs well in many fields, including text classification and image classification. A classifier needs reliable input data to classify efficiently. On this basis, this thesis chooses to modify the text representation stage. Text must be transformed before a computer can process it, and there are many ways to do so, such as converting words into vectors or, most simply, into a binary format. Taking both the semantics and the frequency of words into account, this thesis adopts the doc2vec algorithm as its text representation method.
The thesis is therefore organized as follows. First, after reviewing the state of research on text classification and its overall development, it analyzes the purpose and ideas of the experiment and lays out the theoretical framework and experimental procedure. This covers information preprocessing, which consists of feature representation and feature extraction, followed by an introduction to several classical classification algorithms with emphasis on the basic principles of the support vector machine. Next, it introduces the main ideas of deep learning, the word2vec algorithm, and the doc2vec algorithm developed from it, and compares the word vector models to select the one used in the experiment. Finally, with the theoretical foundations in place, it combines theory with practice to design an SVM-based classification model for Chinese news text. In this model, the output of doc2vec serves as the input to a multiple-kernel SVM: several kernel matrices are computed from the experimental corpus, and SPG-GMKL is then used for training and classification. The experimental results demonstrate the advantages and practicality of the multiple-kernel SVM.
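The kernel-matrix step of the model described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy vectors stand in for doc2vec output, the choice of a linear and an RBF kernel is an assumption, and the kernel weights are fixed by hand, whereas a multiple kernel learning solver such as SPG-GMKL would learn them during training. All function and variable names are hypothetical.

```python
import math


def linear_kernel(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(vectors, kernel):
    """Gram matrix with entries K[i][j] = kernel(vectors[i], vectors[j])."""
    return [[kernel(u, v) for v in vectors] for u in vectors]


def combine_kernels(matrices, weights):
    """Weighted sum of kernel matrices, K = sum_k d_k * K_k with d_k >= 0."""
    if len(matrices) != len(weights):
        raise ValueError("need exactly one weight per kernel matrix")
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights must be non-negative")
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 3-dimensional document vectors standing in for doc2vec output.
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]

K_linear = kernel_matrix(doc_vectors, linear_kernel)
K_rbf = kernel_matrix(doc_vectors, rbf_kernel)

# Assumed fixed weights; an MKL solver would optimize these instead.
weights = [0.3, 0.7]
K_combined = combine_kernels([K_linear, K_rbf], weights)
```

A combined matrix of this kind could then be passed to any SVM solver that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Restricting the weights to non-negative values keeps the combination a valid (positive semidefinite) kernel whenever each base kernel is one.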
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1
