Research on Semantic Error Detection Methods for Chinese Text

Published: 2018-03-20 01:35

Topic: semantic errors; Entry point: knowledge base; Source: Chinese Journal of Computers (《计算机学报》), 2017, Issue 04; Paper type: journal article

[Abstract]: Semantic error detection has long been a difficult problem in automatic error checking of Chinese text. To address semantic errors in Chinese text, this paper proposes a semantic error detection model based on a semantic collocation knowledge base and evidence theory, and discusses the construction of a three-layer semantic collocation knowledge base and the semantic error detection algorithm built on that knowledge base and evidence theory. The knowledge base is constructed in two main steps: (1) a word collocation rule set is built from the content-word collocation frames in the 《现代汉语实词搭配词典》 (Dictionary of Modern Chinese Content Word Collocations), and word collocations are extracted from the training corpus and filtered by mutual information and co-occurrence frequency to form a word collocation knowledge base; (2) sememe information for words is extracted from HowNet to generate word-sememe and sememe-sememe collocation knowledge bases, which are filtered a second time by aggregation degree. On the basis of the three-layer knowledge base, the algorithm first searches the knowledge base top-down to identify semantic collocations that may be erroneous. It then takes the mutual information (MI) and aggregation degree (PD) of a semantic collocation as evidence, establishes the evidence trust assignment functions statistically, and performs uncertainty reasoning that combines evidence conflict handling with a weighted-assignment D-S rule to obtain the semantic collocation association strength of words, which is used to decide whether a semantic error is present. Experimental results show that the F-Score of the proposed error-checking model and algorithm is 14.02% higher than the best value reported in other literature.
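The two quantitative ideas named in the abstract, scoring a collocation by mutual information and fusing two pieces of evidence with a D-S rule, can be illustrated with a minimal sketch. The abstract does not give the paper's formulas for PD, for the trust assignment functions, or for the weighted D-S rule with conflict handling. The code below therefore substitutes the standard pointwise mutual information and the classic, unweighted Dempster rule of combination. The function names `pmi` and `dempster_combine`, the two-hypothesis frame, and all numeric values are hypothetical and chosen only for illustration.

```python
import math
from itertools import product


def pmi(pair_count, left_count, right_count, total_pairs):
    """Pointwise mutual information (in bits) of a word pair.

    Assumption: probabilities are plain relative frequencies over the
    same total number of observed collocation pairs.
    """
    p_xy = pair_count / total_pairs
    p_x = left_count / total_pairs
    p_y = right_count / total_pairs
    return math.log2(p_xy / (p_x * p_y))


def dempster_combine(m1, m2):
    """Combine two mass functions with the classic Dempster rule.

    Each mass function maps a frozenset of hypotheses to a mass, and
    its masses sum to 1. This is NOT the paper's weighted rule.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to contradictory hypotheses.
            conflict += mass_a * mass_b
    if conflict > 1.0 or math.isclose(conflict, 1.0):
        raise ValueError("total conflict: Dempster's rule is undefined")
    # Renormalize the non-conflicting mass so it sums to 1.
    return {h: v / (1.0 - conflict) for h, v in combined.items()}


# Frame of discernment: the collocation is acceptable or erroneous.
OK = frozenset({"ok"})
ERR = frozenset({"error"})
EITHER = OK | ERR  # mass left uncommitted

# Hypothetical masses derived from the MI evidence and the PD evidence.
m_mi = {OK: 0.6, ERR: 0.1, EITHER: 0.3}
m_pd = {OK: 0.5, ERR: 0.2, EITHER: 0.3}

fused = dempster_combine(m_mi, m_pd)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

In this sketch the two sources mostly agree, so the fused mass on "ok" ends up higher than either input assigns to it. A detector along the lines of the abstract would compare such a fused value against a threshold to flag a collocation as a likely semantic error. That threshold is not stated in the abstract and is not assumed here.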
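[Author affiliation]: Institute of Intelligent Information Processing, Beijing Information Science and Technology University
[Funding]: Supported by the National Natural Science Foundation of China (61070119, 61370139) and the Beijing Municipal Universities Innovation Team Building and Teacher Career Development Program (IDHT20130519)
[Classification number]: TP391.1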
