Research and Implementation of a Domain-Adaptive Chinese Word Segmentation System
Published: 2018-01-26 22:01
Keywords: Chinese word segmentation; multi-model; character tagging; domain adaptation; feature embedding. Source: Shenyang Aerospace University, 2017 master's thesis. Thesis type: degree thesis.
【Abstract】: Chinese word segmentation is the process of splitting a continuous character sequence into a reasonable word sequence according to a given standard. As one of the most basic steps in natural language processing, it is a key stage that applications such as information retrieval, knowledge acquisition, and machine translation must handle, so research on Chinese word segmentation has significant theoretical and practical value. This thesis proposes a character-based multi-model segmentation method that uses a neural network architecture to build a separate model for each character. Because Chinese characters themselves carry semantic information, different characters have different meanings and functions in different contexts, so each character follows its own word-formation regularities. Unlike existing character-tagging segmentation methods, the proposed method can effectively distinguish the influence of each feature on each character to be segmented, and thereby learn character-specific word-formation rules. Compared with a single-model method, a CRF method, and previous work, the character-based multi-model method achieves better segmentation results, with F-scores of 93.4% and 95.5% on the simplified-Chinese PKU and MSR corpora provided by the SIGHAN Bakeoff 2005. Building on this method, the thesis then proposes a character-based domain-adaptive segmentation method for the domain adaptation task. Since the character models are mutually independent, when the model set is updated, character models that transfer well are retained while those that transfer poorly are retrained. This addresses the difficulty of sharing large-scale segmented data and the need to retrain from scratch on mixed source- and target-domain data; when segmenting text from the target domain, domain adaptation is achieved through the models' adaptive capacity. Because feature embeddings effectively alleviate feature sparsity, this thesis uses feature embeddings to represent the input features. Experimental results show that the proposed segmentation method clearly improves domain adaptability. Finally, a domain-adaptive Chinese word segmentation system is designed and implemented. The system can segment input sentences or texts using the existing base models, supports adding domain-specific dictionaries, and can update the base models with training data from the target domain to obtain better segmentation results in that domain.
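The per-character multi-model idea and the selective update for domain adaptation described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: `CharModel`, `embed`, `adapt`, the 3-character context window, and the accuracy threshold are hypothetical names and simplifications, and the random linear scorers stand in for the thesis's trained neural networks. Only the BMES decoding and the keep-vs-retrain selection logic are meant literally.

```python
import numpy as np

TAGS = ["B", "M", "E", "S"]  # Begin / Middle / End of a word / Single-character word
EMB_DIM = 8
rng = np.random.default_rng(0)
embeddings = {}  # feature embedding table: context character -> dense vector

def embed(ch):
    # Feature embedding: map a sparse character feature to a dense vector,
    # which is how the thesis addresses feature sparsity.
    if ch not in embeddings:
        embeddings[ch] = rng.normal(size=EMB_DIM)
    return embeddings[ch]

class CharModel:
    """A tiny tag scorer owned by ONE character (the 'multi-model' idea):
    each character learns its own word-formation regularities."""
    def __init__(self):
        # One weight row per tag over a 3-character context window (untrained here).
        self.W = rng.normal(size=(len(TAGS), 3 * EMB_DIM))

    def predict(self, left, cur, right):
        x = np.concatenate([embed(left), embed(cur), embed(right)])
        return TAGS[int(np.argmax(self.W @ x))]

models = {}  # character -> its own independent model

def tag_sentence(sent):
    padded = "#" + sent + "#"  # pad so every character has a left/right context
    return [models.setdefault(ch, CharModel()).predict(padded[i], ch, padded[i + 2])
            for i, ch in enumerate(sent)]

def tags_to_words(sent, tags):
    # Decode a BMES tag sequence back into a word sequence.
    words, cur = [], ""
    for ch, t in zip(sent, tags):
        cur += ch
        if t in ("E", "S"):
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words

def adapt(models, dev_examples, threshold=0.9):
    # Domain adaptation: evaluate each character model on target-domain dev data;
    # models above the threshold transfer well and are kept as-is, the rest are
    # flagged for retraining on target-domain data.
    weak = []
    for ch, model in models.items():
        hits = total = 0
        for sent, gold in dev_examples:
            padded = "#" + sent + "#"
            for i, c in enumerate(sent):
                if c == ch:
                    total += 1
                    hits += model.predict(padded[i], c, padded[i + 2]) == gold[i]
        if total and hits / total < threshold:
            weak.append(ch)
    return weak  # only these character models need target-domain retraining
```

Because each character owns an independent model, `adapt` can retrain only the weak ones, which is why the approach avoids retraining everything on mixed source- and target-domain data.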
【Degree-granting institution】: Shenyang Aerospace University
【Degree level】: Master's
【Year conferred】: 2017
【Classification number】: TP391.1