Research on Patent Text Classification and Evolution Based on the LDA Model

Published: 2018-10-23 12:03
[Abstract]: Patent documents are carriers of technical intelligence. Their text contains a large amount of technical information, which makes them one of the best sources of such intelligence. With China's rapid development, the number of patent applications filed in China has risen year by year, and by 2016 China had ranked first in the world in patent applications for the fifth consecutive year. Developing information mining techniques for this massive body of patent literature has therefore become a shared research focus for government and industry. The LDA model is a typical probabilistic topic model that is widely used in natural language processing, data mining, and artificial intelligence to analyze text classification and evolution. Probabilistic topic models, however, have rarely been applied to patent text. Building on the existing framework for patent text information mining, this thesis uses the LDA model to study the classification and evolution of patent text. The specific contents are as follows.

(1) Several traditional probabilistic topic models are briefly surveyed. The LDA model used in this thesis is then described in detail, including its underlying probability distributions and parameter inference algorithms. Finally, typical classification algorithms and evolution analysis methods for patent text are reviewed.

(2) To address the problems of the vector space model (VSM) text representation used in traditional automatic patent text classification, a patent text classification method based on the LDA model is proposed. The method models the patent corpus with LDA and extracts the document-topic and topic-word matrices, which reduces dimensionality and captures semantic relations between documents. It introduces a class-topic matrix to extend each class with topic semantics, builds a hierarchical classification from topic similarity, and applies KNN classification at the subclass level. Experimental results: compared with KNN patent text classification based on the vector space representation, the method achieves higher classification evaluation scores.

(3) A probabilistic topic model is used to study the topic evolution of patent literature and to identify trends in patent technology. The LDA model is fitted to patent text within each time window, perplexity is used to determine the optimal number of topics, topic vectors are extracted according to the structural characteristics of patent text, JS divergence measures the relatedness between topics, and IPC classification codes are introduced to compute technical topic strength. Together these steps support an evolution analysis along three dimensions: topic strength, topic content, and technical topic strength. Experimental results show that the method analyzes the evolution patterns and trends of patent technology over time reasonably well. It can mine the topics of patent literature in depth and help practitioners understand how patent technology evolves.
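The abstract names two computations over LDA output: KNN classification on topic representations and JS divergence between topics in different time windows. The following is a minimal sketch of both, not the thesis's implementation. It assumes the topic-word and document-topic distributions have already been produced by some LDA library. The function names (`js_divergence`, `link_topics`, `knn_classify`), the toy distributions, and the use of JS divergence as the KNN distance are illustrative assumptions; the abstract does not state which distance the KNN step uses.

```python
import math
from collections import Counter


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    # Mixture distribution; it is positive wherever p or q is positive.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def link_topics(prev_topics, next_topics):
    """For each topic in the earlier window, find the closest topic in the later one.

    Returns a list of (earlier_index, later_index, divergence) tuples.
    """
    links = []
    for i, p in enumerate(prev_topics):
        j, d = min(
            ((j, js_divergence(p, q)) for j, q in enumerate(next_topics)),
            key=lambda pair: pair[1],
        )
        links.append((i, j, d))
    return links


def knn_classify(query, labeled_docs, k=3):
    """Majority vote among the k labeled documents nearest to `query`.

    Each labeled document is a (document_topic_vector, label) pair.
    Assumption: JS divergence serves as the distance between topic vectors.
    """
    nearest = sorted(labeled_docs, key=lambda doc: js_divergence(query, doc[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy topic-word distributions for two adjacent time windows (made-up numbers).
window_a = [[0.7, 0.2, 0.1, 0.0], [0.1, 0.1, 0.3, 0.5]]
window_b = [[0.0, 0.1, 0.4, 0.5], [0.6, 0.3, 0.1, 0.0]]
links = link_topics(window_a, window_b)

# Toy document-topic vectors with illustrative class labels.
training = [
    ([0.8, 0.1, 0.1], "G06F"),
    ([0.7, 0.2, 0.1], "G06F"),
    ([0.1, 0.1, 0.8], "H04L"),
    ([0.2, 0.1, 0.7], "H04L"),
    ([0.1, 0.2, 0.7], "H04L"),
]
predicted = knn_classify([0.75, 0.15, 0.10], training, k=3)

print(links)
print(predicted)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), so a fixed threshold on it can decide whether a topic in one window continues into the next or should be treated as newly emerged.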
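[Degree-granting institution]: Jiangxi University of Science and Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1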


Article ID: 2289173


Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2289173.html
