Research on a Short Text Classification Method Based on Topic Similarity

Published: 2018-04-16 09:13

Topics: short text + topic similarity; Source: Central China Normal University, master's thesis, 2017

[Abstract]: With the widespread use of the Internet, and especially with the emergence of new media such as WeChat, Weibo, and question-answering systems, the Internet produces a massive volume of short text every day. These short texts are brief, carry little content, use non-standard wording, are enormous in number, and are semi-structured data. Applying long-text processing methods directly to short-text mining rarely yields satisfactory results. How to mine the information hidden in short texts accurately, efficiently, and in real time has therefore become an active topic in Chinese information processing and text mining.

Because short texts are brief, carry little content, are very numerous, and have weak explicit semantics, short text classification faces sparse features, heavy noise, and strong context dependence. Classification methods based on search engines depend heavily on the search engine, and methods based on large-scale corpora depend on an external corpus. Building on an analysis of the characteristics of short text and the shortcomings of current classification methods, this thesis starts from two problems, the sparsity of the feature matrix used to model short texts and their strong context dependence, and explores classifying short texts by judging their similarity at the topic level.

First, the thesis surveys the literature, reviews the theory and methods of Chinese text classification, and focuses on short text classification methods. The analysis of traditional VSM-based short text classification shows that the feature matrix built for short texts is sparse and high-dimensional, which hinders accurate classification; a classification algorithm based on topic similarity is therefore designed. Drawing on the theory and methods of topic mining, it uses the LDA probabilistic model to estimate the topic probability distribution vector of each short text. Second, the traditional KNN algorithm is computationally expensive during classification, and the cost grows further on large short-text collections. Exploiting the strengths of locality-sensitive hashing (LSH) for the approximate nearest neighbor (ANN) problem, the thesis builds a KNN classifier based on an improved LSH, which classifies short texts quickly at the topic level. Finally, the thesis argues theoretically that this classifier can, to some extent, improve classification quality and reduce classification time. Using the classifier and the classification method, models are built in a Linux environment and classification is implemented in MATLAB; a comparison experiment against a VSM-based classification method is designed, and the results indicate that the proposed topic-similarity-based method achieves better overall classification performance.
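
The abstract does not give the details of the improved LSH or of the similarity measure, so the following is only a minimal sketch of the general idea: a KNN vote restricted to candidates retrieved by hashing topic vectors. It assumes the topic probability vectors have already been estimated by LDA (the toy vectors and labels here are hypothetical), uses random-hyperplane LSH with cosine similarity as a stand-in for the thesis's scheme, and is written in Python rather than the MATLAB used in the thesis. The names `LshKnn`, `make_hyperplanes`, `signature`, and `cosine` are illustrative, not taken from the thesis.

```python
import math
import random
from collections import Counter, defaultdict


def make_hyperplanes(num_bits, dim, rng):
    """Draw num_bits random Gaussian hyperplanes in dim dimensions."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, planes):
    """Sign of vec against each hyperplane; the tuple is the bucket key."""
    return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in planes)


def cosine(a, b):
    """Cosine similarity (assumed here; the abstract names no measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LshKnn:
    """KNN classifier that only ranks candidates sharing an LSH bucket."""

    def __init__(self, dim, num_tables=8, num_bits=4, k=3, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Each table has its own hyperplanes and its own bucket map.
        self.tables = [
            (make_hyperplanes(num_bits, dim, rng), defaultdict(list))
            for _ in range(num_tables)
        ]
        self.items = []  # (topic_vector, label) pairs

    def fit(self, vectors, labels):
        for vec, label in zip(vectors, labels):
            idx = len(self.items)
            self.items.append((vec, label))
            for planes, buckets in self.tables:
                buckets[signature(vec, planes)].append(idx)
        return self

    def predict(self, vec):
        # Union of the query's buckets across all tables.
        candidates = set()
        for planes, buckets in self.tables:
            candidates.update(buckets.get(signature(vec, planes), ()))
        if not candidates:
            candidates = range(len(self.items))  # fall back to an exact scan
        ranked = sorted(
            candidates, key=lambda i: cosine(vec, self.items[i][0]), reverse=True
        )
        votes = Counter(self.items[i][1] for i in ranked[: self.k])
        return votes.most_common(1)[0][0]


# Hypothetical 3-topic distributions standing in for LDA output.
topic_vectors = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
labels = ["sports", "sports", "finance", "tech"]
classifier = LshKnn(dim=3, k=1).fit(topic_vectors, labels)
print(classifier.predict([0.75, 0.15, 0.10]))
```

Only the vectors that land in the same bucket as the query are compared, which is where the time saving over an exhaustive KNN scan would come from; more tables raise recall at the cost of more candidates, and more bits per table do the opposite.
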
[Degree-granting institution]: Central China Normal University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1

[References]