Research on Automatic Identification of Chinese Spam Short Text

Published: 2018-05-14 00:09

Topics: short text + SMS; source: Zhengzhou University, master's thesis, 2017

[Abstract]: Social platforms and instant-messaging tools for exchanging information are increasingly widespread. These platforms currently rely mainly on short text as the carrier for spreading and exchanging information; its convenience, speed, and efficiency suit today's information-driven, fast-paced life. Short text here mainly refers to brief texts of limited length, such as SMS messages, Weibo posts, product reviews, and forum posts. Such text often contains a large amount of spam that violates the usage rules of the platform, for example spam SMS, advertising Weibo posts, and fake reviews. Short texts are brief and limited in word count, come from a wide range of authors, and are often written irregularly. Binary classification to identify the spam among them therefore faces three problems: (1) the data are noisy; (2) the training set is imbalanced; (3) representing short texts directly with a vocabulary-based vector space model yields feature vectors that are overly sparse and high-dimensional.

To address these three problems, this thesis carries out the following work. 1) It proposes a preprocessing method suited to short text that normalizes the data, mainly through "typo correction", "traditional-character conversion", "letter case conversion", and "unified representation of same-type information", which reduces the noise in the dataset to some extent. 2) It extracts features from multiple angles, covering the editing grammar and wording characteristics of the content as well as non-content structural attributes, which avoids the overly sparse, high-dimensional feature vectors produced by a vocabulary-based vector space model. 3) It proposes a "Random Forest + Adaboost" ensemble classification method that uses random forest as the base classifier of the Adaboost algorithm, in order to reduce the impact of data noise and data imbalance.

Because SMS messages and product reviews are quite similar in content, the thesis takes both as its research objects and applies the proposed method to spam short-text identification. Finally, experiments on a large SMS dataset provided by China Mobile and on the review dataset of COAE 2015 Task 4 show that the proposed method is effective and that the "Random Forest + Adaboost" ensemble algorithm has some advantages over other classification algorithms.
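The preprocessing step named in the abstract (character conversion, case conversion, and unified representation of same-type information) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function name `normalize`, the placeholder tokens `<url>`, `<phone>`, and `<num>`, the regular expressions, and the four-character `TRAD_TO_SIMP` table are all assumptions made for illustration. The sketch also assumes the conversion direction is traditional to simplified, and it omits typo correction.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny traditional-to-simplified table.
# A real system would need a full conversion dictionary.
TRAD_TO_SIMP = str.maketrans({"發": "发", "獎": "奖", "顧": "顾", "費": "费"})

# Illustrative patterns (assumptions, not taken from the thesis).
URL_RE = re.compile(r"(?:https?://|www\.)[^\s\u4e00-\u9fff]+")
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")  # 11-digit mobile numbers
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def normalize(text: str) -> str:
    """Normalize one short text before feature extraction."""
    # Fold full-width letters, digits, and punctuation to half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Convert the traditional characters covered by the toy table.
    text = text.translate(TRAD_TO_SIMP)
    # Unify letter case.
    text = text.lower()
    # Replace same-type information with one shared token per type.
    # URLs go first so digits inside them are not rewritten separately.
    text = URL_RE.sub("<url>", text)
    text = PHONE_RE.sub("<phone>", text)
    text = NUMBER_RE.sub("<num>", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize("中獎免費 WWW.Example.COM 电话１３８１２３４５６７８ 仅需99元"))
```

Mapping every URL, phone number, and number to a shared token means that messages differing only in those details produce the same normalized text, which is one plausible reading of how unified representation reduces noise.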
[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1

[Similar literature]