
An Online LDA Algorithm Based on a Dynamic Vocabulary

Published: 2018-11-07 11:33
[Abstract]: Most existing online latent Dirichlet allocation (LDA) algorithms rely on a fixed vocabulary. In practice, the vocabulary often fails to match the corpus being processed, which limits the usefulness of the model. To address this, the proposed method works within the belief propagation (BP) framework, lets the topic-word distribution follow a Dirichlet process, and rederives the update formulas so that the vocabulary is empty before the model runs and newly discovered words are added to it continuously during processing. Experiments show that the new dynamic-vocabulary algorithm not only fits the vocabulary more closely to the corpus, but also outperforms LDA models based on a fixed vocabulary on two metrics: perplexity and the mutual information index.
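[Affiliation]: School of Computer Science and Technology, Soochow University
[Funding]: Supported by the National Natural Science Foundation of China (61373092, 61572339, 61272449) and the Key Project of the Jiangsu Province Science and Technology Support Program (BE2014005)
[CLC number]: TP391.1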
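
The abstract describes a vocabulary that is empty before the model runs and grows as new words are found. The sketch below illustrates only that bookkeeping, under stated assumptions: it does not implement the belief propagation updates or the Dirichlet process prior from the paper. The names `DynamicVocabulary`, `process_minibatch`, and `topic_word_counts` are hypothetical and are not taken from the paper.

```python
class DynamicVocabulary:
    """Word-to-index map that starts empty and grows as new words appear."""

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = []

    def lookup_or_add(self, word):
        """Return (index, is_new); an unseen word gets the next free index."""
        if word in self.word_to_id:
            return self.word_to_id[word], False
        index = len(self.id_to_word)
        self.word_to_id[word] = index
        self.id_to_word.append(word)
        return index, True

    def __len__(self):
        return len(self.id_to_word)


def process_minibatch(docs, vocab, topic_word_counts):
    """Encode one mini-batch of tokenized documents as word indices.

    Each time a new word is added to the vocabulary, every topic's count
    row is widened by one zero entry (an assumed initialization; the
    paper's prior would determine the real starting value).
    """
    encoded = []
    for doc in docs:
        ids = []
        for word in doc:
            index, is_new = vocab.lookup_or_add(word)
            if is_new:
                for row in topic_word_counts:
                    row.append(0.0)
            ids.append(index)
        encoded.append(ids)
    return encoded


# Usage: two mini-batches arriving in sequence, with two topics.
num_topics = 2
vocab = DynamicVocabulary()
topic_word_counts = [[] for _ in range(num_topics)]
batch1 = process_minibatch([["topic", "model", "topic"]], vocab, topic_word_counts)
batch2 = process_minibatch([["model", "vocabulary"]], vocab, topic_word_counts)
```

In this sketch the second mini-batch reuses the index already assigned to "model" and appends a new index for "vocabulary", so the topic-word table always has one column per word seen so far.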



