Structural Function Recognition of Academic Text: Application to Automatic Keyword Extraction
Published: 2018-09-11 09:19
[Abstract]: Most current research on automatic keyword extraction relies on statistical information about candidate words, such as term frequency and document frequency, and tends to ignore the internal structure of the academic text in which the candidates occur, which leads to poor extraction performance. This paper treats an academic text as a set of five structural function domains and proposes a multi-feature extraction method that incorporates structural function features of the text, identifying the structural function of each part from its section headings. The method is implemented on a collection of computational linguistics papers using two approaches: SVM binary classification and the LambdaMART learning-to-rank algorithm. Experimental results show that the proposed combined features substantially improve keyword extraction over the baseline features. In the classification experiment in particular, the relative improvement in accuracy reaches 10.75%, which demonstrates the importance of structural function features of academic text for automatic keyword extraction.
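The idea of identifying structural function from section headings and turning it into features can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the abstract does not name the five domains, so the labels and heading cues in `DOMAIN_CUES` are hypothetical, as are the function names `identify_domain` and `candidate_features`. The sketch uses simple substring rules, not the authors' recognition method.

```python
import re
from collections import Counter

# Hypothetical domain labels and heading cues. The abstract only states that
# there are five structural function domains; these names are assumptions.
DOMAIN_CUES = {
    "introduction": ("introduction", "background"),
    "related_work": ("related work", "literature review"),
    "method": ("method", "approach", "model"),
    "experiment": ("experiment", "evaluation", "result"),
    "conclusion": ("conclusion", "discussion", "future work"),
}
DOMAINS = list(DOMAIN_CUES)


def identify_domain(heading):
    """Map a section heading to a domain by substring match, or None if no cue matches."""
    text = heading.lower()
    for domain, cues in DOMAIN_CUES.items():
        if any(cue in text for cue in cues):
            return domain
    return None


def candidate_features(sections, candidate):
    """Return [total frequency, then the frequency inside each of the five domains]."""
    pattern = re.compile(r"\b" + re.escape(candidate.lower()) + r"\b")
    per_domain = Counter()
    total = 0
    for heading, body in sections:
        count = len(pattern.findall(body.lower()))
        total += count
        domain = identify_domain(heading)
        if domain is not None:
            per_domain[domain] += count
    return [total] + [per_domain[d] for d in DOMAINS]


# Toy document: a list of (section heading, section text) pairs.
sections = [
    ("1 Introduction", "Keyword extraction matters. Keyword extraction is hard."),
    ("3 Proposed Method", "We use an SVM for keyword extraction."),
    ("5 Conclusion", "The SVM works."),
]
features = candidate_features(sections, "keyword extraction")
```

In a pipeline like the one the abstract describes, vectors of this kind would be combined with the baseline statistical features and passed to a classifier or a ranker; that step is omitted here.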
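[Affiliation]: Information Retrieval and Knowledge Mining Laboratory, School of Information Management, Wuhan University
[Funding]: National Natural Science Foundation of China, General Program, "Lexical-Function-Oriented Semantic Recognition of Academic Text and Knowledge Graph Construction" (71473183); National Natural Science Foundation of China, General Program, "Research on Citation Recommendation for Academic Literature Based on Multi-Semantic Information Fusion" (71673211)
[Classification number]: TP391.1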
Article ID: 2236274
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2236274.html