Research and Implementation of a Neural Network-Based Chinese Lexical Analysis System

Published: 2018-08-14 08:48

[Abstract]: This thesis aims to build a neural-network-based Chinese lexical analysis system covering Chinese word segmentation, part-of-speech tagging, and named entity recognition. The work has two main parts: identifying concrete models suited to Chinese lexical analysis tasks, and implementing them well. We first give a brief, multi-perspective introduction to each task in Chinese lexical analysis and then survey existing lexical analysis systems. Next, we decompose the neural network architecture used for sequence labeling into an input layer, a representation learning layer, and a label prediction layer, and describe each layer in turn. Taking experiments as the starting point, we then examine how different input features and model structures perform on each task and determine a suitable neural network model for each. The resulting models differ in structure, but all use a bidirectional LSTM for representation learning and incorporate information from handcrafted features or unlabeled data. Finally, we describe the code structure of the system implementation and evaluate the system's speed. The thesis makes two main contributions. The first is determining, through experiments, the specific neural network structure suited to each Chinese lexical analysis task. We take LTP as the baseline model and use the datasets used by LTP as the experimental data. On Chinese word segmentation, our model's F1 score exceeds LTP's by 0.33 and 0.48 percentage points on the development and test sets, respectively. On part-of-speech tagging, the best model's accuracy exceeds the baseline by 0.2 percentage points on the development set and 0.22 points on the test set. On named entity recognition, the selected model improves F1 over LTP by 2.57 and 0.57 percentage points on the development and test sets, respectively. The second contribution lies in the implementation: we implemented the above neural network models with a clear code structure, yielding a usable Chinese lexical analysis system.
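As an illustration of the word segmentation task and the F1 metric reported in the abstract, the following minimal sketch shows how segmentation can be cast as per-character sequence labeling and scored with word-level F1. The B/M/E/S tag scheme, the function names, and the example sentence are assumptions made for illustration; the abstract does not specify the tag set or the evaluation script used in the thesis.

```python
from typing import List, Set, Tuple


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character B/M/E/S tags.

    B = word begin, M = word middle, E = word end, S = single-character word.
    (Assumed tag scheme; the thesis abstract does not name one.)
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from a character string and its B/M/E/S tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends without E or S
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a word list to the set of (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold: List[str], pred: List[str]) -> float:
    """Word-level F1: a predicted word counts only if both boundaries match."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentence, not taken from the thesis data.
gold = ["我们", "研究", "中文", "分词"]
pred = ["我们", "研究", "中", "文", "分词"]
gold_tags = words_to_tags(gold)
f1 = segmentation_f1(gold, pred)
```

In this sketch a tagger such as a bidirectional LSTM would predict one tag per character, `tags_to_words` would turn the predicted tags back into words, and `segmentation_f1` would compare the result against the gold segmentation.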
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1;TP183

[References]