Research on Semi-Supervised Short Text Classification Based on Co-Training

Published: 2019-06-10 06:06

[Abstract]: With the rapid development of the Internet, the volume of information is growing exponentially. People can easily obtain large amounts of information online, and that information plays an important role in guiding their behavior. Short text is an important information carrier on the Internet. Early on, the information contained in short texts was obtained directly by manual labeling, but manual labeling requires many trained specialists, consumes considerable labor and resources, and can cover only a small number of texts. Because the number of texts on the Internet is enormous, manual labeling cannot meet the need to classify text at Internet scale. Using machine learning to label unlabeled samples is gradually becoming a trend in Internet text processing, and improving labeling efficiency has become a focus of current research. Compared with manual labeling, labeling unlabeled samples with machine learning offers high accuracy and stable algorithmic behavior. Semi-supervised co-training is currently an important text classification method in machine learning. This thesis studies semi-supervised short text classification based on co-training and covers the following aspects:

1. It analyzes the short text classification problem and presents a system model for semi-supervised short text classification based on co-training. The model consists of three functional modules: preprocessing, training, and testing. The preprocessing module converts unstructured short texts into a structured data set through a series of steps: removing format markup, word segmentation, stop-word removal, feature extraction, term-frequency counting, and text vectorization. The training module, on the one hand, constructs classifiers according to the diversity principle and uses them to label unlabeled samples; on the other hand, it co-trains the classifiers on the training set so that they are progressively refined. The testing module evaluates the classifiers on the test set to verify the feasibility and effectiveness of the co-training method.

2. Building on semi-supervised co-training, it presents a short text classification method and improves both the feature extraction method and the co-training method.
(1) Improved feature extraction. Because short texts contain few words, an adjacency matrix over the words in a short text is built from the semantic relations between words. An undirected graph is then constructed by computing similarities from the adjacency matrix, a feature degree is computed from the adjacency degree of the graph, and the words with a high feature degree are extracted as features. Compared with traditional methods, this approach also takes the semantic similarity between words into account, which helps classify short texts effectively.
(2) Improved co-training algorithm. To label unlabeled samples, the classifiers are trained through a "mutual aid" scheme among multiple classifiers. In a binary classification problem, if all three classifiers assign the same label to an unlabeled sample, the label is considered to have high confidence and the sample is added to the labeled sample set. If the labels differ, two of the three classifiers must still agree, because there are only two classes, and their shared label is used to train the third classifier. The classifiers are retrained repeatedly during the labeling process, ultimately yielding classifiers with better performance.

3. Comparative experiments on short texts collected from Internet websites verify the effectiveness of the co-training-based semi-supervised short text classification method. Short-text posts collected from major websites such as Sina, Sohu, and NetEase serve as the data set. The improved method is compared with traditional short text classification methods and evaluated by accuracy, recall, and F1 score, verifying its feasibility and effectiveness.

In summary, this thesis constructs a semi-supervised short text classification model based on co-training, presents the corresponding classification method, improves both short-text feature extraction and semi-supervised co-training, and compares the improved methods with traditional methods experimentally. The experimental results show that the proposed method effectively improves the efficiency of short text classification.
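The three-classifier "mutual aid" labeling rule described in the abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names `mutual_aid_vote` and `route_samples`, the 0/1 label encoding, and the use of precomputed vote lists in place of real classifiers are all hypothetical, and the retraining step itself is omitted.

```python
from collections import Counter
from typing import Dict, List, Tuple


def mutual_aid_vote(votes: List[int]) -> Tuple[str, int, int]:
    """Decide what to do with one unlabeled sample given three binary votes.

    Returns (action, label, target):
      - ("add_to_labeled", label, -1) when all three classifiers agree;
      - ("teach", label, i) when classifier i disagrees with the other two,
        so the majority label becomes a training example for classifier i.
    """
    if len(votes) != 3 or any(v not in (0, 1) for v in votes):
        raise ValueError("expected three binary votes (0 or 1)")
    label, count = Counter(votes).most_common(1)[0]
    if count == 3:
        return ("add_to_labeled", label, -1)
    # With two classes and three voters, exactly one voter is in the minority.
    dissenter = next(i for i, v in enumerate(votes) if v != label)
    return ("teach", label, dissenter)


def route_samples(
    samples: List[str], all_votes: List[List[int]]
) -> Tuple[List[Tuple[str, int]], Dict[int, List[Tuple[str, int]]]]:
    """Split samples into the shared labeled set and per-classifier teaching sets."""
    labeled: List[Tuple[str, int]] = []
    teaching: Dict[int, List[Tuple[str, int]]] = {0: [], 1: [], 2: []}
    for sample, votes in zip(samples, all_votes):
        action, label, target = mutual_aid_vote(votes)
        if action == "add_to_labeled":
            labeled.append((sample, label))
        else:
            teaching[target].append((sample, label))
    return labeled, teaching


# Hypothetical votes from three classifiers on three unlabeled posts.
labeled, teaching = route_samples(
    ["post a", "post b", "post c"],
    [[1, 1, 1], [0, 1, 0], [1, 0, 0]],
)
```

In a full loop, each classifier would presumably be retrained on the labeled set plus its own teaching set, and the votes recomputed, until no unlabeled samples remain or the labels stop changing; the abstract does not specify the stopping condition.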
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1
