Recognition and Classification of Chinese Internet-Derived Entities Based on Deep Learning

Published: 2019-04-16 08:40

[Abstract]: With the explosive growth of content on the Internet, Chinese text online is full of non-standard expressions such as near-homophones, abbreviations, and synonyms. Because Chinese is flexible in how words are formed and used, many of the key subject terms in a text appear as derived words of these kinds; such terms are called Internet-derived entities. These entities are complex, change quickly, and are hard to recognize, and they are often substituted for the original words to evade government monitoring of online public opinion. As a result, they create difficulties for both natural language processing and public opinion monitoring. Researchers in China and abroad have studied the recognition of specific categories of derived entities extensively, but no work so far has examined the overall data distribution of Internet-derived entities. Moreover, new derived entities keep appearing in large numbers, which places new demands on recognition techniques.

The main work of this thesis is as follows:
1) For each category of derived entity, existing recognition methods from China and abroad are surveyed and compared, and the development trends of mainstream recognition models in recent years are analyzed. Based on this analysis, the strengths and weaknesses of each method in practical use are identified. Taking the characteristics of the problem into account, the thesis proposes a new approach: recognizing Chinese Internet-derived entities with deep learning.
2) Two neural network architectures for recognizing Chinese Internet-derived entities are proposed: the sliding window method and the sentence convolution method. Both address the problem that sentences of non-uniform length cannot be fed directly into a neural network. word2vec is used to obtain the model's input vectors, and a stacked autoencoder encodes hand-crafted feature vectors to form a composite input that further improves recognition. Special activation functions and training algorithms speed up training and further optimize the model structure.
3) Extensive comparative experiments are carried out on a purpose-built corpus. Because no open corpus was available, the corpus was crawled with the Scrapy framework (252.3 MB in size) and annotated manually. A large number of derived-entity recognition tests were run on this corpus, and the models' results were compared across entity categories. The experiments show that both proposed models handle Internet-derived entity recognition effectively, with F1 scores of 78.6% and 76.9% respectively. Each model has its own strengths on particular entity categories, and both outperform traditional models on this corpus. By studying how different parameters and methods affect the results, the thesis also derives more general parameter-tuning guidance for the models as a reference for other researchers.

In practice, the deep-learning-based neural network entity recognition models proposed in this thesis are well suited to the task of recognizing Chinese Internet-derived entities. They achieve good recognition performance across all categories of derived entities at once and can meet the new demands of Chinese Internet-derived entity recognition in the big-data setting.
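The abstract does not describe how the sliding window method is implemented, so the following is only a minimal sketch of the general idea: turning sentences of any length into fixed-size inputs. It assumes one window per token, centered on that token, with padding at the sentence edges. The function name `sliding_windows`, the `PAD` token, and the window size are all hypothetical and are not taken from the thesis.

```python
from typing import List, Tuple

PAD = "<PAD>"  # hypothetical padding token, not from the thesis


def sliding_windows(tokens: List[str], size: int = 5) -> List[Tuple[str, ...]]:
    """Return one fixed-size window per token, centered on that token.

    Every window has exactly `size` elements regardless of sentence
    length, so each one can be mapped to a fixed-width network input.
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd number")
    half = size // 2
    # Pad both ends so the first and last tokens also get full windows.
    padded = [PAD] * half + list(tokens) + [PAD] * half
    return [tuple(padded[i:i + size]) for i in range(len(tokens))]


# Two sentences of different lengths yield windows of the same width.
short_windows = sliding_windows(["微博", "上", "的", "缩略语"], size=3)
long_windows = sliding_windows(list("abcdefgh"), size=3)
```

In a full pipeline, each token in a window would presumably be replaced by its word2vec vector and the concatenated vectors passed to a classifier that labels the center token; that step is omitted here because the abstract gives no details about it.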
[Degree-granting institution]: Wuhan University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1
