Research on a Subtopic Partitioning Method for News Documents Based on Full-Covering Granular Computing

Published: 2018-05-12 07:33

  Topic: full-covering granular computing + topic model; Source: Master's thesis, Taiyuan University of Technology, 2017


【Abstract】: In today's era of information explosion, the volume of information is expanding rapidly, and news from every direction floods into daily life. Faced with such massive data, users confront a serious challenge in quickly and accurately locating the news topics that interest them. For large numbers of news events, how to organize and classify them by topic, so that information on related topics can be aggregated automatically, has become an important research problem in natural language processing. Topic detection and partitioning techniques have emerged in response, aiming at the effective organization, search, and structuring of heterogeneous text collections. Full-covering granular computing is a new approach to information processing and data mining that offers a fresh way to mine large-scale data containing uncertain or incomplete information. It comprises full-covering theory together with granulation and granule operations, and it provides a new solution for subtopic partitioning. The main contributions of this thesis are: 1. An LDA (Latent Dirichlet Allocation) topic model is applied to a large news corpus for semantic analysis, extracting the latent topics of each news document and yielding the document-topic θ matrix; an appropriate threshold on the probabilities in the θ matrix is chosen through repeated experiments, converting the document-topic matrix into a full-covering model; on this basis, covering reduction is used to delete redundant covering elements and obtain a minimal covering. 2. From the standpoint of set theory, a derived-partition algorithm DP (Derived Partition) for full-covering granular computing is proposed; its theoretical basis is discussed, the concrete procedure is presented, and its time complexity is analyzed; the structure and workflow of the algorithm are then optimized, and extensive experiments confirm that the improvements indeed raise its performance; finally, the algorithm is further illustrated with a worked example. 3. On the basis of the LDA topic model and the derived-partition algorithm, a subtopic partitioning method for news documents based on full-covering granular computing is designed; comparative experiments on the Sogou news corpus against three traditional baseline methods, including the VSM method and the classical Single-Pass method, verify the applicability, feasibility, and extensibility of the method from several angles, showing that the proposed algorithm performs subtopic partitioning well.
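The pipeline described in the abstract (threshold the LDA θ matrix into a covering, reduce redundant covering elements, then derive a partition) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the θ values and threshold are invented, the reduction rule (drop a block that equals the union of other blocks contained in it) is one standard covering-reduction criterion, and the DP step is interpreted here as grouping documents that belong to exactly the same covering blocks.

```python
# Illustrative sketch of the covering-based subtopic pipeline.
# All numbers are hypothetical; theta would normally come from a trained LDA model.

theta = [                      # 5 documents x 3 topics (document-topic probabilities)
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.75, 0.15],
    [0.05, 0.15, 0.80],
]
threshold = 0.15               # chosen experimentally in the thesis; illustrative here

# Step 1: threshold theta into a covering -- topic j covers doc i if theta[i][j] >= t.
covering = []
for j in range(len(theta[0])):
    block = frozenset(i for i in range(len(theta)) if theta[i][j] >= threshold)
    if block:
        covering.append(block)

# Step 2: covering reduction -- delete any block that is exactly the union
# of other blocks contained in it (it carries no extra information).
def reduce_covering(cover):
    kept = list(cover)
    for block in list(kept):
        subsets = [b for b in kept if b != block and b <= block]
        union = set().union(*subsets) if subsets else set()
        if union == set(block):
            kept.remove(block)
    return kept

# Step 3: derived partition -- documents are equivalent iff they lie in
# exactly the same covering blocks (one reading of the DP algorithm).
def derived_partition(universe, cover):
    groups = {}
    for x in universe:
        signature = frozenset(i for i, b in enumerate(cover) if x in b)
        groups.setdefault(signature, set()).add(x)
    return list(groups.values())

reduced = reduce_covering(covering)
parts = derived_partition(range(len(theta)), reduced)
print(parts)   # docs 0-1, doc 2, and docs 3-4 fall into separate subtopic cells
```

With these toy values the covering blocks are {0,1}, {0,1,2,3,4}, and {3,4}; none is reducible, and the derived partition separates the three groups of documents by their membership signatures.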
【Degree-granting institution】: Taiyuan University of Technology
【Degree level】: Master's
【Year conferred】: 2017
【Classification number】: TP391.1



