Research on New Word Discovery and Named Entity Recognition Based on Word Vector Representation

Published: 2018-12-19 20:27

[Abstract]: In data mining, the analysis of structured data is relatively mature, but mining unstructured data still faces many challenges. Text is a very important kind of unstructured data, and mining it raises a further series of problems, including Chinese word segmentation, named entity recognition, entity relation extraction, semantic understanding, and sentiment analysis. Among these, word segmentation is the basic step of almost all Chinese text mining. However, people constantly create new words, and no lexicon can include them all; such new words cause segmentation errors, which in turn cause named entities to be tagged incorrectly. New word recognition has therefore become a difficulty and a bottleneck in text mining.
In recent years, word vector representations obtained by training language models with neural networks or deep learning have proved able to capture the semantic relationships between words well. Inspired by this, this thesis applies such word vector representations to Chinese new word discovery and proposes an unsupervised new word discovery method that combines word vectors with n-grams. First, a neural network language model is trained to map words into a high-dimensional space, and the word vectors produced by the Skip-gram and CBOW models are compared in terms of their effect on the new word results; the Skip-gram model achieves better results. Second, if several adjacent words frequently occur together in different word sequences, there must be some relationship among them. Inspired by association rule algorithms, the thesis designs an efficient n-gram mining algorithm and treats the mined n-grams as candidate strings for new words. Next, the trained word vectors are used to prune the candidate strings and remove noise, yielding the new word results. The thesis also designs the pruning algorithm and compares the effect of different vector similarity measures on the final results, finding that pruning with cosine similarity works best. The method is also compared with other new word discovery methods, which confirms its effectiveness. Finally, building on the new word results, a conditional random field is used to classify them, thereby recognizing named entity words.
The main contributions of this thesis are: (1) it introduces neural-network-trained word vectors into Chinese new word recognition and, by combining word vectors with n-grams, proposes a new unsupervised new word recognition method; (2) on top of new word discovery, it uses a conditional random field to classify the new words and identify the named entity words among them, offering a new practical approach to named entity recognition.
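The abstract does not spell out the mining or pruning algorithms, so the following Python sketch is only an illustration of the general idea, not the thesis's method. It assumes an Apriori-style rule (an n-gram is counted only if both of its (n-1)-gram sub-sequences are already frequent) and a simple pruning criterion (keep a candidate only if every pair of adjacent words has cosine similarity at or above a threshold). The function names, the `min_count` and `threshold` parameters, and the toy vectors are all hypothetical.

```python
from collections import Counter
from math import sqrt


def mine_frequent_ngrams(sentences, min_count=2, max_n=3):
    """Return {ngram_tuple: count} for frequent n-grams with 2 <= n <= max_n.

    Assumed Apriori-style pruning: an n-gram is only counted when its
    prefix and suffix (n-1)-grams were frequent at the previous level.
    """
    unigram_counts = Counter((w,) for s in sentences for w in s)
    prev = {g for g, c in unigram_counts.items() if c >= min_count}
    frequent = {}
    for n in range(2, max_n + 1):
        counts = Counter()
        for s in sentences:
            for i in range(len(s) - n + 1):
                gram = tuple(s[i:i + n])
                if gram[:-1] in prev and gram[1:] in prev:
                    counts[gram] += 1
        prev = {g for g, c in counts.items() if c >= min_count}
        if not prev:
            break  # no frequent n-grams at this level, so none at higher levels
        for g in prev:
            frequent[g] = counts[g]
    return frequent


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prune_candidates(candidates, vectors, threshold=0.5):
    """Keep candidates whose adjacent words are all similar enough.

    This criterion is an assumption made for illustration only.
    """
    kept = []
    for gram in candidates:
        if any(w not in vectors for w in gram):
            continue  # no vector available for some word: drop the candidate
        sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
        if min(sims) >= threshold:
            kept.append(gram)
    return kept


# Toy data: in practice the tokens would come from a Chinese word segmenter
# and the vectors from a trained Skip-gram or CBOW model.
sentences = [
    ["deep", "learning", "model"],
    ["deep", "learning", "method"],
    ["learning", "method"],
]
vectors = {"deep": [1.0, 0.0], "learning": [0.9, 0.1], "method": [0.0, 1.0]}

candidates = mine_frequent_ngrams(sentences, min_count=2, max_n=3)
new_words = prune_candidates(sorted(candidates), vectors, threshold=0.5)
```

With this toy data, both bigrams are frequent enough to become candidates, but only the pair whose vectors point in nearly the same direction survives pruning. Real thresholds and similarity measures would need to be tuned on a corpus.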
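[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1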
