Research on Text Representation Models and Feature Selection Algorithms

Published: 2018-05-15 01:20

Topics: text classification + feature selection; Source: master's thesis, University of Science and Technology of China, 2017

[Abstract]: Text classification is an effective means of processing unstructured information and has been widely studied and applied in fields such as machine learning and information retrieval. However, because text features are high-dimensional and highly sparse, the accuracy and speed of text classification depend heavily on the choice of feature selection method and text representation model. This thesis studies both text feature selection and text representation models. The main contributions are as follows.

(1) Traditional statistics-based feature selection methods do not consider the semantics of features. To address this, the thesis proposes feature selection methods based on LDA word vectors and Word2vec word vectors, which learn the semantic concepts of features from topics and from word context, respectively. After feature selection, the corpus is classified using the vector space model. Experiments on the Fudan corpus show that word-vector-based feature selection improves classification to some extent over traditional feature selection. Moreover, word-vector-based feature selection is unsupervised and requires no labeled dataset.

(2) The LDA (Latent Dirichlet Allocation) model performs no selection on its input features, so the input contains many words that contribute nothing to topic expression, which degrades topic quality. To address this, the thesis proposes a genetic-algorithm-based text feature selection method that first uses a genetic algorithm to reduce the dimensionality of the original feature space, so that LDA can assign topics over a more meaningful feature space. Classification experiments on the Fudan corpus show improved classification performance. The proposed genetic algorithm is also adaptive: the feature selection ratio does not need to be fixed in advance. Some of the topics generated by LDA are junk topics, that is, collections of unrelated feature words. At present, meaningful topics are found mainly by manual inspection. The only existing automatic topic ranking method is TSR (Topic Significance Ranking). TSR involves many steps and considers only the distance between a topic and junk topics, not the relationships among topics. For ranking topics by importance, the thesis proposes a maximum junk-topic distance, minimum similarity ranking method. Experiments show that the proposed method is simple and efficient and can identify meaningful topics.

(3) The LF-LDA (latent feature LDA) model incorporates a word vector training model and outperforms LDA in text classification. Building on LF-LDA, the thesis proposes a text representation model that combines LF-LDA with Word2vec, representing a text by the distances between the topic vectors generated by LF-LDA and the document vector represented with Word2vec. In addition, it proposes a topic-vector-based text representation model that represents a document as a weighted combination of the topic vectors generated by LF-LDA. Experiments on the StackOverflow short-text dataset show that the text representation model combining LF-LDA with Word2vec classifies better than LF-LDA alone and better than the text representation model combining LDA with Word2vec. The topic-vector-based text representation model performs about as well as LF-LDA.
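
The distance-based representation described in item (3) can be illustrated with a minimal sketch. It assumes that the document vector is the average of the document's Word2vec word vectors and that the distance is cosine distance. The abstract specifies neither choice, and the function names and toy vectors below are hypothetical rather than taken from the thesis.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def represent_document(tokens, word_vectors, topic_vectors):
    """Represent a document as its distances to each topic vector.

    Assumption: the document vector is the mean of the word vectors of
    the tokens that have one; tokens without a vector are skipped.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token in the document has a word vector")
    doc_vec = mean_vector(known)
    # One feature per topic: the distance from the document to that topic.
    return [cosine_distance(doc_vec, topic) for topic in topic_vectors]


# Toy 2-dimensional vectors standing in for trained Word2vec word
# vectors and LF-LDA topic vectors (hypothetical values).
word_vectors = {
    "python": [1.0, 0.0],
    "list": [1.0, 0.0],
    "query": [0.0, 1.0],
}
topic_vectors = [[1.0, 0.0], [0.0, 1.0]]

features = represent_document(["python", "list", "unknown"], word_vectors, topic_vectors)
```

The resulting `features` list has one entry per topic and could be passed to any standard classifier. In practice the word and topic vectors would come from trained models and would need to share the same embedding space for the distances to be meaningful.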
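[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1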
