Short-Text Topic Mining Based on W-BTM and Its Application to Text Classification

Published: 2018-01-20 12:58

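Keywords: W-BTM model; topic mining; short text; text classification  Source: Shanxi University of Finance and Economics, master's thesis, 2017  Type: degree thesis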

[Abstract]: With the rapid rise of the Internet, social networking sites, and e-commerce, unstructured information, typified by text, has emerged in large volumes. Mining valuable information from it is increasingly important, yet complex semantics make that value increasingly hard to extract. Short texts in particular are sparse and incomplete, which poses a major new challenge for text mining, and research on text mining has therefore gradually shifted toward short texts. BTM is a topic mining model designed for short texts, and it handles their sparsity and incompleteness considerably better than other topic models. However, existing text mining models, BTM included, have no dedicated parameters for handling stop words; they simply load a stop-word list during preprocessing and delete the listed words. Corpora differ from one another, so applying the same stop-word list to all of them is not well founded. For each corpus, the stop words that reflect its own textual characteristics should be identified.
Given these characteristics of short texts and of stop-word handling, this thesis uses the difference coefficient as a weight model to represent the weight of each word in the text, and then introduces that weight as a parameter of BTM to form the W-BTM model, in order to eliminate the influence of short texts and stop words on topic mining. Model parameters are estimated with Gibbs sampling, which draws samples of the latent variables to estimate the posterior parameters. Finally, the model is applied to book-description data from Dangdang: a support vector machine classifies the result matrix produced by W-BTM, and the classification results of different models are compared to demonstrate the advantage of W-BTM.
W-BTM searches the whole corpus for word pairs (biterms) on the premise that the weight of each word in the pair, that is, its difference coefficient, is already known. Under this premise a word pair carries more meaning: it no longer denotes only two words that co-occur in a document but also reflects the nature of each word, namely whether it is a stop word. This removes the effect that a poorly chosen stop-word list has on the accuracy of text mining. To verify the effectiveness and soundness of W-BTM, text classification experiments were run with LDA and BTM as baselines, and the results were evaluated from the perspectives of both topic mining and text classification. The experiments show that W-BTM classifies better than LDA and BTM.
The contributions of this thesis are as follows: (1) For stop-word handling, the traditional approach of choosing a stop-word list and deleting the listed words outright is replaced with a weight model, making the text mining results more sound and accurate. (2) The weight model is combined with BTM to form a new topic model, W-BTM, which can be used for short-text classification, addresses the sparsity of short texts, and closes the gap left by stop-word handling during preprocessing. (3) W-BTM is applied to the classification of Dangdang book descriptions, giving the model practical significance. Handling the class imbalance in the data, applying W-BTM, and classifying the document-topic matrix with a support vector machine together verify the effectiveness of W-BTM; comparing its classification results with those of LDA and BTM verifies its advantage.
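
The abstract does not give the weight formula or say how the weights enter BTM, so the following is only a minimal sketch of the idea of weighted word pairs, not the thesis's implementation. It assumes that the "difference coefficient" can be read as the coefficient of variation (standard deviation divided by mean) of a word's per-document counts, and that a pair's weight is the product of its two word weights. Both choices, along with the function names `word_weights` and `weighted_biterms` and the toy corpus, are hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def word_weights(docs):
    """Weight each word by the coefficient of variation of its per-document counts.

    ASSUMPTION: "difference coefficient" is interpreted as std / mean; the
    abstract does not state the actual formula.
    """
    counts = [Counter(doc) for doc in docs]
    vocab = sorted({w for doc in docs for w in doc})
    weights = {}
    for w in vocab:
        freqs = [c[w] for c in counts]
        # The mean is positive because every vocabulary word occurs at least once.
        weights[w] = pstdev(freqs) / mean(freqs)
    return weights


def weighted_biterms(doc, weights):
    """Map each unordered pair of distinct words in one short text to a pair weight.

    ASSUMPTION: the pair weight is the product of the two word weights, so a pair
    containing an evenly spread, stop-word-like term gets a weight near zero.
    """
    return {
        (a, b): weights[a] * weights[b]
        for a, b in combinations(sorted(set(doc)), 2)
    }


# Toy corpus of tokenized short texts (hypothetical data).
docs = [
    ["the", "python", "book"],
    ["the", "history", "book"],
    ["the", "python", "code"],
]
weights = word_weights(docs)
pairs = weighted_biterms(docs[0], weights)
for pair, weight in pairs.items():
    print(pair, round(weight, 3))
```

In this toy corpus "the" occurs once in every text, so its count has zero variance and its weight is 0. Every pair that contains it is therefore suppressed without consulting a fixed stop-word list, which is the behaviour the abstract attributes to the weight model.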
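[Degree-granting institution]: Shanxi University of Finance and Economics
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1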

[References]