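A Keyword Extraction Method for Sparse Geographical Entity Relations
Posted: 2018-04-01 22:27
Topic: geographic information retrieval  Focus: geographical entity relations  Source: Journal of Geo-information Science (《地球信息科学学报》), 2016, Issue 11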
[Abstract]: Extracting the geographical entity relations embedded in web text requires keyword extraction methods that are timely and robust. Compared with supervised learning, unsupervised methods can capture the dynamically changing features of text and discover newly emerging relation types, and have therefore attracted considerable attention. Among them, frequency-based keyword extraction methods have been widely studied. However, geographical entity relations are sparsely distributed in web text, so frequency-based methods are difficult to apply directly to keyword extraction for these relations. To address this problem, this paper proposes a context-enhanced keyword extraction method built on publicly accessible web resources. First, using an online encyclopedia and an open synonym dictionary, enhanced contexts are created through context merging and semantic fusion, which reduces the sparsity of words in a context. Next, the Domain Frequency and Entropy frequency statistics methods are used to automatically build a large-scale corpus from the enhanced contexts. Then, lexical features are selected from this corpus and their weights are computed, in order to widen the differences among the words in a context. Finally, the selected lexical features are used to measure the importance of each word in an enhanced context, and the word with the largest weight is taken as the keyword describing the geographical entity relation. Experiments were carried out on large-scale real web text. The results show that, for keyword identification of geographical entity relations, the proposed method achieves an average precision of 85.5%, which is 41% and 36% higher than the Domain Frequency and Entropy methods, respectively; for identification of newly added keywords, its precision reaches 60.3%. The context-enhanced method handles the sparse distribution of geographical entity relations effectively and can support the extraction of geographical entity relations embedded in web text.
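The pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy contexts, the synonym table, the function names, and the particular formulas for domain frequency and entropy are not taken from the paper, and the final score (count times domain frequency) stands in for the paper's lexical-feature weighting, which the abstract does not specify.

```python
import math
from collections import Counter

# Hypothetical toy data: relation type -> tokenized contexts (stopwords removed).
CONTEXTS = {
    "flows_into": [["yangtze", "flows", "sea"],
                   ["stream", "empties", "lake"],
                   ["creek", "drains", "river"]],
    "located_in": [["city", "located", "province"],
                   ["town", "lies", "county"],
                   ["river", "situated", "valley"]],
}

# Hypothetical synonym dictionary: variant -> canonical form.
SYNONYMS = {"empties": "flows", "drains": "flows",
            "lies": "located", "situated": "located"}


def enhance(contexts, synonyms):
    """Merge all contexts of a relation type and fold synonyms together."""
    return {
        rel: Counter(synonyms.get(tok, tok) for ctx in ctxs for tok in ctx)
        for rel, ctxs in contexts.items()
    }


def domain_frequency(word, rel, merged):
    """Share of a word's occurrences that fall in one relation type (assumed form)."""
    total = sum(counts[word] for counts in merged.values())
    return merged[rel][word] / total if total else 0.0


def entropy(word, merged):
    """Shannon entropy (bits) of a word's distribution over relation types."""
    counts = [c[word] for c in merged.values() if c[word] > 0]
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts)


def top_keyword(rel, merged):
    """Pick the word with the largest illustrative score: count * domain frequency."""
    scores = {w: n * domain_frequency(w, rel, merged)
              for w, n in merged[rel].items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    merged = enhance(CONTEXTS, SYNONYMS)
    for rel in merged:
        print(rel, top_keyword(rel, merged))
```

Without the synonym folding in `enhance`, every word in the toy `flows_into` contexts occurs exactly once, so a frequency-based score cannot separate them; after folding, `flows` occurs three times and stands out. This is the sparsity effect that the enhanced contexts are meant to counter.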
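[Affiliations]: State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Key Laboratory of Virtual Geographic Environment (Ministry of Education), Nanjing Normal University
[Funding]: National "863" Program project (2013AA120305); National Natural Science Foundation of China projects (41401460, 41271408, 41601421)
[Classification]: TP391.1; P209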
Article ID: 1697595
Article link: https://www.wllwen.com/kejilunwen/dizhicehuilunwen/1697595.html