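Research on Patent Analysis Based on Internet Data
Topics: technology life cycle + quantitative patent analysis; Source: Harbin Institute of Technology, master's thesis, 2017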
[Abstract]: The internet has made data volumes grow rapidly, and massive amounts of patent data keep pouring into daily life. Enterprises today need relevant patent intelligence to formulate more precise development strategies, yet some of the information hidden in patent documents is not fully used: traditional analysis methods based on manual statistics overlook it, and patent analysis reports contain little beyond manually compiled statistics. This work therefore surveys the current state of patent information analysis in China and, on the basis of statistical analysis of the data, calculates the changes in technology development parameters. It also mines usable information latent in patent documents, focusing on patent topic extraction and automatic classification of patent documents. To address the thin, monotonous content of traditional patent analysis reports and their lack of automated writing, the study further aims to enrich report content and to implement an automatic report-writing system. To obtain more relevant patent data and improve patent retrieval performance, the effect of query term expansion on retrieval results was investigated. Expansion term sets obtained from a dictionary and from the Baidu platform give comprehensive but insufficiently precise results, whereas relevance feedback behaves the opposite way. Weighing the strengths and weaknesses of each method, a query expansion method combining the dictionary with relevance feedback is proposed, which improves both recall and precision to some extent. With patent data collected by a web crawler, and to improve on predicting technology maturity solely from technology development parameters, a new measure is added: the degree of technological innovation. Its calculation incorporates text similarity analysis, and it is computed over classifications of the dataset from different perspectives. To examine the trend in annual patent application counts, time-series forecasting algorithms are applied to the resulting data series; exponential smoothing and ARMA perform relatively well, and the experiments verify that the technology life-cycle factor does affect the prediction of the series. The IPC number of a patent is not the only way to obtain its topic: applying text topic extraction algorithms to the patent document collection yields more targeted and finer-grained technical topic keywords. Three algorithms, TextRank, LDA, and TF-IDF, are applied to the collected dataset and evaluated by how well their output reflects the topic. TextRank scores 0.63, higher than LDA's 0.55, but it depends too heavily on a single document. Tuning the initial number of topics for LDA shows that perplexity is lowest when the number is set to 4. For automatic classification of patent documents, results on broad categories are all at or below 0.7, while results on fine-grained categories improve markedly: the lowest metric value is close to 0.7, and the R value of kNN reaches 0.88. Building on these results, and to bring them closer to practical use, the work discusses the implementation of a patent analysis system that assists users in writing patent analysis reports.
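The abstract names exponential smoothing as one of the forecasting methods that worked well on the annual application counts, but gives no formula or parameters. As a rough illustration only, the sketch below shows simple (single) exponential smoothing, where each smoothed level is `alpha * value + (1 - alpha) * previous_level` and the last level serves as the one-step-ahead forecast. The function name, the value of `alpha`, and the yearly counts are assumptions made for this example; they are not taken from the thesis, which may have used a different variant.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    The final level is the one-step-ahead forecast under simple
    exponential smoothing. This is a generic textbook form, not
    necessarily the variant used in the thesis.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]          # initialise the level with the first observation
    levels = [level]
    for value in series[1:]:
        # Blend the new observation with the previous smoothed level.
        level = alpha * value + (1.0 - alpha) * level
        levels.append(level)
    return levels


# Hypothetical yearly application counts, for illustration only.
yearly_counts = [10, 20, 30, 40]
levels = simple_exponential_smoothing(yearly_counts, alpha=0.5)
forecast_next_year = levels[-1]
print(forecast_next_year)  # 31.25
```

A larger `alpha` makes the forecast follow recent years more closely, while a smaller one smooths out year-to-year fluctuations. Because this simple form lags behind a steadily rising series, trend-aware variants or ARMA models are common alternatives.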
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1
Article ID: 1877254
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1877254.html