Feature Expansion Methods for Short Text Classification

Published: 2018-08-31 20:35

[Abstract]: In recent years, network applications such as Facebook, QQ, Twitter, and Sina Weibo have emerged in rapid succession, bringing with them large volumes of text. Much of the text these applications produce is short; we refer to it as short text. The volume of short text data is extremely large, and research on short text has important uses in many areas, including recommendation systems for social networks, Internet information security, web data mining, topic detection and tracking, discovery of new Internet words, and monitoring of online public opinion.
This thesis studies feature expansion for short text classification. Short texts are brief, have sparse features, and are strongly affected by noise. Traditional statistical text classification algorithms are based on the bag-of-words paradigm, and because of these characteristics they perform relatively poorly on short text. To address this, the thesis designs and implements a search-engine-based feature expansion method: the short text is used to retrieve related information from the web, the retrieved information is used to expand the short text, and a suitable text classifier is then applied to the expanded text. Three commonly used fully supervised classifiers serve as the main classifiers, and a semi-supervised classifier is also tried on the short text classification problem.
Feature expansion methods of this kind share a common problem: the retrieved web information often contains ambiguous content, which is clearly unsuitable for feature expansion. To solve this, the thesis proposes a graph-based constraint method for feature expansion, which iteratively filters the expansion information for a short text until only high-quality information remains. The thesis also proposes a keyword extraction algorithm for short text that combines statistical information, semantic information, and the position and order in which keywords appear. The system uses this algorithm to extract reliable keywords, which serve as queries for retrieving web information.
The experimental data are a Sina Weibo corpus. The experiments implement the feature expansion method, the keyword extraction algorithm, and the expansion constraint method, and combine them with several classifiers to build a Chinese short text classification system. Several groups of comparative results were obtained on this platform. The final results show that the proposed feature expansion method and the method for removing noise from the expansion effectively improve short text classification performance and achieve the intended goal.
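The abstract does not spell out how the graph-based constraint works, so the following Python sketch is only an illustration of the general idea under stated assumptions. It treats the short text and each retrieved snippet as nodes, uses bag-of-words cosine similarity as the edge weight, and repeatedly drops the snippet with the lowest mean similarity to the remaining nodes. The function names (`cosine`, `filter_snippets`, `expand`), the scoring rule, and the `threshold` parameter are all hypothetical and are not taken from the thesis.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_snippets(short_text, snippets, threshold=0.2):
    """Iteratively drop the least-connected snippet (assumed scoring rule).

    short_text and each snippet are lists of tokens. Nodes are the short
    text plus the snippets; edge weights are cosine similarities.
    `threshold` is a hypothetical cut-off, not a value from the thesis.
    """
    query = Counter(short_text)
    vecs = [Counter(s) for s in snippets]
    kept = list(range(len(snippets)))
    while kept:
        # Score each surviving snippet by its mean edge weight to the
        # short text and to every other surviving snippet.
        scores = {}
        for i in kept:
            others = [vecs[j] for j in kept if j != i] + [query]
            scores[i] = sum(cosine(vecs[i], o) for o in others) / len(others)
        worst = min(kept, key=scores.get)
        if scores[worst] >= threshold:
            break  # every remaining snippet is sufficiently connected
        kept.remove(worst)
    return [snippets[i] for i in kept]


def expand(short_text, snippets, threshold=0.2):
    """Add the tokens of the surviving snippets to the short text's features."""
    expanded = Counter(short_text)
    for snippet in filter_snippets(short_text, snippets, threshold):
        expanded += Counter(snippet)
    return expanded
```

For example, given the tokens `["apple", "iphone", "release"]` and three snippets, two about phones and one about fruit recipes, a threshold of 0.3 would remove the fruit snippet before expansion, so the ambiguous sense of "apple" would not leak into the feature vector. In a real system the tokens would come from a Chinese word segmenter and the snippets from search results, both of which are outside this sketch.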
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.1

[References]