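Research on a Multi-Strategy Term Extraction Method for Academic Papers
Published: 2018-07-08 12:23
Topics: multi-strategy + term extraction; Source: Huazhong University of Science and Technology, 2016 master's thesis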
[Abstract]: Extracting terms quickly and accurately is an important problem in natural language processing. Term extraction for academic papers can effectively promote scientific development and the dissemination of research results. In an academic paper, terms have different distribution characteristics in different text blocks, such as the title, keywords, and abstract. Traditional term extraction methods ignore this positional information, so a method that takes term position into account is needed to make up for their shortcomings. This thesis proposes TEM, a multi-strategy term extraction method for academic papers. First, according to the different characteristics of titles, abstracts, and keywords, it applies three candidate term extraction strategies, based respectively on a boundary marker set, on Chinese term formation rules, and on keywords. Next, it analyzes the candidate extraction results and error types and introduces a dictionary of term counterexample rules to improve the results. It then filters the candidates with the K-near-frequency substring merging algorithm. Finally, it uses the positional information of terms to build a composite scoring model, in which the analytic hierarchy process (AHP) determines the weights of the three dimensions (title, abstract, and keywords), and the correct terms are obtained by ranking the final scores. In addition, for single-word terms, category frequency (CF) is introduced on top of the TF-IDF algorithm to improve filtering. The experiments test the effect of varying K on substring merging and compare term extraction results before and after introducing CF and positional information. The results show that, compared with traditional methods, TF-IDF-CF improves precision and recall by 5.73% and 8.43%, respectively; TEM-SW improves precision and recall by 7.85% and 11.54%, and TEM-MW by 11.62% and 9.71%, giving better extraction of terms from academic papers.
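The position-weighted scoring step can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formulas, so everything here is an assumption: the pairwise comparison matrix PAIRWISE, the per-position candidate scores, the weighted-sum form of composite_score, and the row geometric-mean approximation used in ahp_weights are hypothetical choices for illustration, not the author's model.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method.

    pairwise[i][j] states how much more important dimension i is than
    dimension j; the matrix is assumed to be reciprocal.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def composite_score(position_scores, weights):
    """Weighted sum of per-position scores (assumed form, not from the thesis)."""
    return sum(w * s for w, s in zip(weights, position_scores))


# Hypothetical pairwise judgments; dimension order: title, abstract, keywords.
PAIRWISE = [
    [1.0, 3.0, 1 / 2],   # title compared with title, abstract, keywords
    [1 / 3, 1.0, 1 / 4], # abstract
    [2.0, 4.0, 1.0],     # keywords
]

weights = ahp_weights(PAIRWISE)

# Hypothetical candidates with scores in [0, 1] for (title, abstract, keywords).
candidates = {
    "term extraction": (1.0, 0.6, 1.0),
    "substring merging": (0.0, 0.8, 0.0),
    "this paper": (0.0, 0.3, 0.0),
}

# Rank candidates by composite score, highest first.
ranked = sorted(
    candidates,
    key=lambda term: composite_score(candidates[term], weights),
    reverse=True,
)
print(ranked)
```

With these made-up judgments the keywords dimension receives the largest weight, so a candidate that appears in the title and the keywords outranks one that appears only in the abstract. A full AHP treatment would also check the consistency ratio of the comparison matrix, which this sketch omits.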
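[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1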
Article ID: 2107420
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2107420.html