Research on Lexical Analysis Based on Neural Networks
Published: 2018-03-08 01:22
Topic: Chinese word segmentation  Focus: part-of-speech tagging  Source: Nanjing University, 2017 master's thesis  Thesis type: degree thesis
【Abstract】: Lexical analysis is a fundamental task in natural language processing, comprising two basic subtasks: Chinese word segmentation and part-of-speech (POS) tagging. Word segmentation converts a Chinese character sequence into a word sequence; almost all Chinese text-analysis tasks depend on it. POS tagging assigns a part-of-speech category to each word in a sentence; for higher-level tasks such as syntactic and semantic analysis, POS tags help resolve ambiguity and alleviate the sparsity of word features. Although lexical analysis is basic, it has very broad demand and application prospects and remains an active topic in NLP. Early Chinese word segmentation, constrained by limited computing resources and the lack of annotated corpora, generally relied on dictionary-based rule methods. As computing power grew and annotated corpora appeared, segmentation techniques gradually shifted from rules to machine learning, among which character tagging is currently the most widely used approach. Since the rise of deep learning, some researchers have applied neural networks to segmentation and made progress; POS tagging has followed a similar research path. In this thesis, we first address the limitation that traditional character-tagging segmentation models extract only local features from a fixed window and cannot capture long-distance dependencies: we replace the original feature-extraction module with a bidirectional long short-term memory (BiLSTM) network, which both preserves long-distance information and simplifies feature extraction. Second, we design a greedy model and a structured model on top of the BiLSTM. Finally, to address the mismatch between general-purpose embeddings and specific tasks, we design task-specific embedding models for segmentation and for POS tagging. Experimental results show that the BiLSTM-based segmentation model matches traditional models, and that the simple, fast greedy model performs comparably to the structured model. With character embeddings pretrained by the WCC (Word-context Character Embedding) model, the system achieves state-of-the-art or comparable performance on standard datasets and also performs well in domain-transfer experiments. For POS tagging, word embeddings pretrained by the PCS (POS Sensitive Embedding) model improve the tagger, and the PCS model can quickly exploit heterogeneous data to boost performance.
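The character-tagging formulation described above casts segmentation as labeling each character and then reading words off the label sequence. A minimal sketch, assuming the common BMES tag scheme (the thesis does not specify its exact tag set here):

```python
def bmes_to_words(chars, tags):
    """Convert a BMES character-tag sequence into a word sequence.

    B = begin and M = middle and E = end of a multi-character word;
    S = a single-character word.
    """
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "B":
            if buf:                 # defensively flush an unterminated word
                words.append(buf)
            buf = ch
        elif tag == "M":
            buf += ch
        elif tag == "E":
            words.append(buf + ch)
            buf = ""
        else:                       # "S"
            if buf:
                words.append(buf)
                buf = ""
            words.append(ch)
    if buf:                         # trailing unterminated word
        words.append(buf)
    return words

# "南京大学" tagged as one four-character word, "很" and "好" as singletons
print(bmes_to_words(list("南京大学很好"), ["B", "M", "M", "E", "S", "S"]))
# → ['南京大学', '很', '好']
```

In the neural models described in the abstract, the per-character tags would come from a classifier over BiLSTM hidden states rather than being given directly.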
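The difference between the greedy model and the structured model mentioned in the abstract can be illustrated with toy decoding code: greedy decoding picks each character's best tag independently, while structured (Viterbi-style) decoding scores whole tag sequences under transition constraints. The scores and the legal-transition table below are invented for illustration; in the thesis, emission scores would come from the BiLSTM, and the structured model's actual parameterization may differ.

```python
import math

TAGS = ["B", "M", "E", "S"]

# Legal BMES transitions get score 0; anything else is forbidden (-inf).
LEGAL = {(a, b): 0.0 for a, b in
         [("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
          ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")]}

def greedy_decode(emissions):
    """Pick the best tag at each position independently (the 'greedy model')."""
    return [max(TAGS, key=lambda t: scores[t]) for scores in emissions]

def viterbi_decode(emissions, transition):
    """Structured decoding: best tag *sequence* under emission + transition scores."""
    # best[t] = (score of best path ending in tag t, that path)
    best = {t: (emissions[0][t], [t]) for t in TAGS}
    for scores in emissions[1:]:
        new_best = {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[p][0] +
                       transition.get((p, t), -math.inf))
            s, path = best[prev]
            step = transition.get((prev, t), -math.inf)
            new_best[t] = (s + step + scores[t], path + [t])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]

# Per-character tag scores for a 3-character input (illustrative numbers only).
emissions = [
    {"B": 0.6,  "M": 0.1,  "E": 0.1, "S": 0.5},
    {"B": 0.1,  "M": 0.5,  "E": 0.4, "S": 0.6},
    {"B": 0.05, "M": 0.05, "E": 0.7, "S": 0.3},
]
print(greedy_decode(emissions))          # ['B', 'S', 'E'] — locally best, but B→S is illegal
print(viterbi_decode(emissions, LEGAL))  # ['B', 'M', 'E'] — best legal sequence
```

The example shows why the comparison in the abstract is interesting: greedy decoding can emit inconsistent tag sequences, yet the thesis reports that the greedy model still performs comparably to the structured one.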
【Degree-granting institution】: Nanjing University
【Degree level】: Master's
【Year granted】: 2017
【Classification number】: TP391.1;TP183
Document ID: 1581807
Link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1581807.html