Distributed Topic Feature Extraction Based on Restricted Boltzmann Machines

Published: 2018-06-23 14:29

Topics: text data + probabilistic topic model; Reference: Computer Engineering and Applications (《计算机工程与应用》), 2017, No. 23

[Abstract]: With the arrival of the big data era, mining and analyzing topic features effectively from massive text data has become a research focus. Latent Dirichlet Allocation (LDA), a classical probabilistic topic model, is widely used because of its strong text analysis ability. However, such models mostly take the form of directed graphs containing latent topic variables, which limits how documents can be represented. A distributed representation instead treats the semantics of a document as spread across multiple topics and obtained as the product of multiple topic features. In addition, traditional unsupervised feature extraction models cannot handle document data with class labels effectively. Building on the Restricted Boltzmann Machine (RBM) and the distributed nature of text topics, this paper therefore proposes NRBM, an RBM-based distributed topic feature extraction model. As a typical semi-supervised model, NRBM can make effective use of the multi-label information in documents. Comparative experiments against the standard LDA topic model demonstrate the superiority of NRBM.
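The abstract does not give the equations of NRBM, so the following is only a minimal sketch of a plain binary RBM trained with one-step contrastive divergence (CD-1), not the authors' model. It illustrates the general idea of a distributed representation: every hidden unit contributes to the representation of every document, and an RBM can be viewed as a product of such per-unit factors. The class name `TinyRBM`, the toy corpus, and all hyperparameters are illustrative assumptions, and the label information that NRBM is said to use is not modeled here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Binary RBM: visible units = word presence, hidden units = topic features.

    Hypothetical illustration only; this is not the NRBM model from the paper.
    """

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.nv, self.nh = n_visible, n_hidden
        # Small random weights; biases start at zero.
        self.W = [[self.rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b = [0.0] * n_visible  # visible biases
        self.c = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        # p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i * W_ij)
        return [sigmoid(self.c[j] + sum(v[i] * self.W[i][j] for i in range(self.nv)))
                for j in range(self.nh)]

    def visible_probs(self, h):
        # p(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij * h_j)
        return [sigmoid(self.b[i] + sum(self.W[i][j] * h[j] for j in range(self.nh)))
                for i in range(self.nv)]

    def sample(self, probs):
        return [1.0 if self.rng.random() < p else 0.0 for p in probs]

    def cd1_step(self, v0, lr=0.1):
        """One CD-1 parameter update for a single document vector v0."""
        ph0 = self.hidden_probs(v0)
        h0 = self.sample(ph0)
        pv1 = self.visible_probs(h0)   # reconstruction, kept as probabilities
        ph1 = self.hidden_probs(pv1)
        for i in range(self.nv):
            for j in range(self.nh):
                # Positive phase minus negative phase.
                self.W[i][j] += lr * (v0[i] * ph0[j] - pv1[i] * ph1[j])
            self.b[i] += lr * (v0[i] - pv1[i])
        for j in range(self.nh):
            self.c[j] += lr * (ph0[j] - ph1[j])


# Toy corpus over a 6-word vocabulary (assumed data, for illustration only).
docs = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]

rbm = TinyRBM(n_visible=6, n_hidden=2)
for _ in range(200):
    for d in docs:
        rbm.cd1_step(d)

# Each document is now described by activation probabilities of all hidden units.
features = [rbm.hidden_probs(d) for d in docs]
```

In this sketch, `features` holds one vector of hidden-unit activation probabilities per document; a downstream classifier could consume these vectors in place of raw word counts.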
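[Affiliation]: School of Management Science and Engineering, Anhui University of Technology
[Funding]: National Natural Science Foundation of China (No. 71172219); Provincial Key Project of the Natural Science Research Program of Anhui Province (No. KJ2011Z039)
[Classification number]: TP391.1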

[Similar literature]