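Research on the Feature Analysis and Identification of Cited Text Spans in Literature
Published: 2017-12-28 03:35
Keywords: Research on the Feature Analysis and Identification of Cited Text Spans in Literature  Source: 《数据分析与知识发现》 (Data Analysis and Knowledge Discovery), 2017, Issue 11  Article type: journal article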
[Abstract]: [Objective] To analyze the characteristics of cited text spans in scientific and technical literature, and to compare how well different identification methods perform. [Methods] Using the cited-span annotation data from the CL-SciSumm 2016 shared task as an example, we examine the length, position, and importance of cited spans, and analyze how they correlate in length and position with the corresponding citation contexts. We then compare similarity algorithms based on the bag-of-words model, a topic model, and the WordNet semantic dictionary on the task of identifying cited spans. [Results] 96% of the annotated cited spans are shorter than three sentences, and they appear more often in the front part of the article and in the front part of a section. The mean TextRank weight of cited spans is significantly higher than that of other spans. Cited spans and citation contexts are significantly correlated in length, but show no clear correlation in position. Whether judged by MMR or by sentence-level and word-level matching, the bag-of-words method outperforms the semantic-dictionary method, which in turn clearly outperforms the topic-model method. [Limitations] The analysis of the concept and properties of cited spans remains at the theoretical level, and both the feature analysis and the comparison of identification methods were carried out only on the CL-SciSumm 2016 cited-span annotation data. [Conclusion] Scientific and technical literature uses standardized, rigorous wording, so lexical features play a key role in identifying cited spans.
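The abstract does not give implementation details, so the following is only a minimal sketch of what a bag-of-words similarity approach to cited-span identification might look like, not the authors' method. It assumes English text, a naive regex tokenizer, raw term counts, and cosine similarity; the function names (`bag_of_words`, `cosine_similarity`, `rank_cited_candidates`) and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens (naive tokenizer, an assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors; 0.0 if either is empty."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_cited_candidates(citation_context, reference_sentences):
    """Return (sentence index, score) pairs for the cited paper's sentences, best match first."""
    query = bag_of_words(citation_context)
    scored = [
        (i, cosine_similarity(query, bag_of_words(sentence)))
        for i, sentence in enumerate(reference_sentences)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: one citation context and three sentences from the cited paper.
citation_context = "Their graph-based ranking model is widely used for sentence extraction."
reference_sentences = [
    "We propose a graph-based ranking model for sentence extraction.",
    "The corpus was collected from news articles.",
    "Results are reported in the last section.",
]

ranking = rank_cited_candidates(citation_context, reference_sentences)
best_index, best_score = ranking[0]
print(best_index, round(best_score, 3))
```

Since the abstract reports that most annotated cited spans are shorter than three sentences, a system along these lines would plausibly keep only the top one or two ranked sentences as the predicted span; that cutoff is an assumption here, not something the abstract specifies.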
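[Author affiliations]: Center for Studies of Information Resources, Wuhan University; School of Information Management, Central China Normal University
[Classification number]: G353.1
[Body excerpt]: 1 Introduction. The number of times a document is cited reflects, to some extent, its contribution to and influence on the academic community. However, citation counts only indicate the influence and value of the document as a whole; only a deeper analysis of citing behavior can reveal which parts of a cited document actually influence the field. As the full text of academic papers becomes easier to obtain, the identification and extraction of citation context,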
Article ID: 1344441
Article link: https://www.wllwen.com/tushudanganlunwen/1344441.html