
Sentence-Level Entity Relation Extraction Incorporating Thai Language Features

Published: 2018-05-15 10:56

Topics: Thai sentence segmentation + named entity recognition; Source: Kunming University of Science and Technology, 2017 master's thesis


[Abstract]: Entity relation extraction from Thai sentences is an important part of Thai natural language processing, and its performance directly affects downstream applications such as event extraction, knowledge base construction, and search engines. Several characteristics of Thai make automatic processing harder: word formation is complex, modal particles are used frequently, and punctuation is rarely written, which leaves sentence boundaries ambiguous. Combining Thai language features with statistical machine learning models, this thesis studies Thai sentence segmentation, named entity recognition in Thai sentences, and the extraction of subordinate entity relations from Thai sentences. It reports results in the following three areas.

(1) In Thai text, sentences are usually separated only by a plain space at the end of the sentence, and Thai text also contains many spaces that do not end a sentence, so sentence boundaries are ambiguous. The thesis first analyzes and summarizes a set of practical grammar rules related to Thai sentence boundaries. It then recasts sentence segmentation as a classification problem over the spaces in Thai text: a maximum entropy classifier is trained on the contextual features of each space and used to decide which class the space belongs to. Finally, the constructed grammar rule base is used to correct the classifier's output. Compared with a purely rule-based approach, this method avoids writing rules for a large body of complex Thai grammar and builds rules only for the main knowledge relevant to sentence boundary detection, while the maximum entropy model makes better use of the contextual features of spaces in the input Thai chunks or paragraphs. The method achieves good and stable performance on Thai sentence segmentation and lays the foundation for named entity recognition in Thai sentences.

(2) Named entity recognition in Thai sentences is recast as a labeling task over the word sequence of each sentence. Using the contextual linguistic features of the words in a sentence, the thesis builds a hidden Markov model and a conditional random field model on a Thai entity recognition training corpus and evaluates each sequence labeling model on a Thai test corpus. The experimental results confirm that sequence labeling is effective for Thai named entity recognition and lay the foundation for entity relation extraction from Thai sentences.

(3) Building on named entity recognition, the extraction of subordinate entity relations from Thai sentences is recast as a classification problem over the entity relation triples in a sentence. Because no Thai corpus of subordinate entity relations was available, the thesis first constructs a Thai entity relation corpus from sentence-aligned Chinese-Thai parallel sentence pairs and a Chinese-Thai dictionary. It then trains a maximum entropy classifier on the contextual features around the Thai entity words and uses it to identify the subordinate relation type of each candidate entity relation triple, which yields the subordinate entity relations in the sentence. Experiments verify the effectiveness of the method for extracting subordinate entity relations from Thai sentences.
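The space-classification idea in contribution (1) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: a binary maximum entropy model is equivalent to logistic regression, so it is fitted here with plain stochastic gradient ascent. The feature templates in `space_features`, the single rule in `NON_FINAL`, and the romanized placeholder chunks are all assumptions made up for the example; the abstract does not specify the actual features, rules, or data.

```python
import math
from collections import defaultdict

def space_features(chunks, i):
    """Context features for the space after chunks[i] (hypothetical templates)."""
    return [f"prev={chunks[i]}", f"next={chunks[i + 1]}"]

def train_maxent(samples, epochs=100, lr=0.5):
    """Fit a binary maximum entropy model (logistic regression) by SGD."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in samples:
            score = sum(weights[f] for f in feats)
            p = 1.0 / (1.0 + math.exp(-score))
            for f in feats:
                weights[f] += lr * (label - p)  # gradient of the log-likelihood
    return weights

def boundary_prob(weights, feats):
    """Probability that a space is a sentence boundary."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical correction rule: a space right after a conjunction
# is never accepted as a sentence boundary.
NON_FINAL = {"and"}

def segment(chunks, weights, threshold=0.5):
    """Classify each space, apply the rule correction, and split into sentences."""
    sentences, current = [], []
    for i, chunk in enumerate(chunks):
        current.append(chunk)
        if i == len(chunks) - 1:
            is_boundary = True  # end of text always closes a sentence
        else:
            p = boundary_prob(weights, space_features(chunks, i))
            is_boundary = p >= threshold and chunk not in NON_FINAL
        if is_boundary:
            sentences.append(current)
            current = []
    return sentences

# Placeholder chunks standing in for space-delimited Thai chunks (not real data).
train_chunks = ["he", "went", "home", "krab", "she", "and",
                "he", "ate", "krab", "we", "sat", "na"]
train_labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]  # one label per space
samples = [(space_features(train_chunks, i), y) for i, y in enumerate(train_labels)]
weights = train_maxent(samples)
result = segment(["we", "ate", "krab", "she", "sat", "na"], weights)
print(result)
```

In this sketch the rule acts only as a veto on the classifier's positive decisions; a rule base could equally force a boundary that the classifier missed.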
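The sequence labeling view in contribution (2) can be illustrated with a small hidden Markov model tagger. The sketch below covers only the HMM variant (the conditional random field is not shown), estimates probabilities by counting with add-one smoothing, and decodes with the Viterbi algorithm. The BIO tag set, the smoothing choice, and the toy romanized sentences are assumptions for illustration and are not taken from the thesis.

```python
import math
from collections import Counter

def train_hmm(tagged_sentences):
    """Collect transition and emission counts from (word, tag) sequences."""
    trans, emit = Counter(), Counter()
    tag_count, prev_count = Counter(), Counter()
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo tag
        for word, tag in sentence:
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    # +1 reserves probability mass for words never seen in training
    return trans, emit, tag_count, prev_count, sorted(tag_count), len(vocab) + 1

def viterbi(words, model):
    """Return the most probable tag sequence under the smoothed HMM."""
    trans, emit, tag_count, prev_count, tags, vocab_size = model

    def log_trans(prev, tag):
        return math.log((trans[(prev, tag)] + 1) / (prev_count[prev] + len(tags)))

    def log_emit(tag, word):
        return math.log((emit[(tag, word)] + 1) / (tag_count[tag] + vocab_size))

    # best[tag] = (log score of the best path ending in tag, that path)
    best = {t: (log_trans("<s>", t) + log_emit(t, words[0]), [t]) for t in tags}
    for word in words[1:]:
        step = {}
        for t in tags:
            score, path = max(
                ((s + log_trans(p, t), pth) for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            step[t] = (score + log_emit(t, word), path + [t])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

# Toy training data with hypothetical BIO entity tags (not from the thesis corpus).
train = [
    [("somchai", "B-PER"), ("lives", "O"), ("in", "O"), ("bangkok", "B-LOC")],
    [("malee", "B-PER"), ("works", "O"), ("in", "O"), ("chiangmai", "B-LOC")],
]
model = train_hmm(train)
predicted = viterbi(["malee", "lives", "in", "bangkok"], model)
print(predicted)
```

A conditional random field would replace the per-tag emission counts with weighted feature functions over the whole word context, which is where the contextual features mentioned in the abstract would enter.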
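For contribution (3), the step that precedes classification is turning a tagged sentence into candidate entity pairs with context features. The sketch below shows one plausible way to do this; the classifier itself is omitted. The BIO tag scheme, the feature templates (entity types, distance, words between and around the entities), the window size, and the example sentence are all hypothetical and are not described in the abstract.

```python
from itertools import combinations

def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def candidate_features(tokens, e1, e2, window=2):
    """Context features for an entity pair (hypothetical templates)."""
    feats = [f"types={e1[2]}-{e2[2]}", f"dist={e2[0] - e1[1]}"]
    feats += [f"between={t}" for t in tokens[e1[1]:e2[0]]]
    feats += [f"left={t}" for t in tokens[max(0, e1[0] - window):e1[0]]]
    feats += [f"right={t}" for t in tokens[e2[1]:e2[1] + window]]
    return feats

def candidates(tokens, tags):
    """Yield ((entity1, entity2), features) for every entity pair in a sentence."""
    for e1, e2 in combinations(extract_entities(tags), 2):
        pair = (" ".join(tokens[e1[0]:e1[1]]), " ".join(tokens[e2[0]:e2[1]]))
        yield pair, candidate_features(tokens, e1, e2)

# Placeholder sentence with hypothetical tags (not from the thesis corpus).
tokens = ["somchai", "works", "at", "acme", "in", "bangkok"]
tags = ["B-PER", "O", "O", "B-ORG", "O", "B-LOC"]
pairs = list(candidates(tokens, tags))
for pair, feats in pairs:
    print(pair, feats)
```

Each feature list would then be passed to a trained classifier, which assigns a relation type (or a "no relation" label) to complete the triple.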
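[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1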



