Research on Phrase Recognition Based on Probability Analysis and Rule Constraints

Published: 2018-04-02 16:01

Topic: natural language processing　Focus: phrase recognition　Source: Kunming University of Science and Technology, master's thesis, 2017

[Abstract]: In the era of big data, massive data sets of all kinds depend to some degree on natural language processing (NLP) techniques to uncover the valuable information they contain, so the continued growth of NLP is an inevitable trend. Phrase recognition is an important subfield of applied basic research in NLP and belongs to shallow parsing. The "divide and conquer" strategy of shallow parsing greatly helps disambiguation in full syntactic parsing, which makes phrase extraction a valuable research topic. Building on existing work on phrases, this thesis proposes a new phrase recognition model. Its main contents are as follows. (1) For phrases in natural language generally, the thesis explains in theory how a fairly general and uncomplicated model can achieve phrase recognition, namely by fusing probability analysis with rule constraints, and it introduces the concept of combination degree to describe this approach. (2) In the experiments, English verb phrases serve as the worked example. The main problem addressed is the recognition and extraction of three kinds of phrase: binary non-nested verb phrases, binary nested verb phrases, and ternary verb phrases. The implementation combines corpus training, combination-degree analysis, similarity computation, data smoothing, rule constraints, and the assistance of a simulated phrase dictionary to recognize and extract verb phrases. The system is implemented in Java, and testing and analysis are carried out on a Java Web test platform. (3) Across the experimental results, with probability analysis and rule constraints fused, the system's best recognition result is an accuracy of 88% and a recall of 90%, which indicates that the proposed phrase recognition framework is effective and feasible. In summary, the thesis makes three main contributions: (1) it combines probability analysis with appropriate rules, introduces the concept of combination degree, and explores phrase recognition in general natural language; (2) it applies word similarity computation to the data sparsity problem; (3) the system supports a dynamic corpus.
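The fusion of probability analysis and rule constraints described in the abstract can be illustrated with a minimal Java sketch. The abstract does not give the thesis's actual formula for combination degree, so this sketch substitutes add-one smoothed pointwise mutual information (PMI) as a stand-in. The class name, the method names, the particle list used as the rule, and the toy corpus are all hypothetical and are not taken from the thesis.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: smoothed PMI stands in for the thesis's
// "combination degree", whose actual formula the abstract does not give.
class CombinationDegreeSketch {
    // Hypothetical rule constraint: the second word must be a known particle.
    private static final Set<String> PARTICLES =
            new HashSet<>(Arrays.asList("up", "off", "out", "down", "away"));

    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private long totalTokens = 0;
    private long totalBigrams = 0;

    // Corpus training: count words and adjacent word pairs in one tokenized sentence.
    void train(List<String> sentence) {
        for (int i = 0; i < sentence.size(); i++) {
            unigramCounts.merge(sentence.get(i), 1, Integer::sum);
            totalTokens++;
            if (i + 1 < sentence.size()) {
                String pair = sentence.get(i) + " " + sentence.get(i + 1);
                bigramCounts.merge(pair, 1, Integer::sum);
                totalBigrams++;
            }
        }
    }

    // Add-one (Laplace) smoothed PMI of an adjacent pair. A higher value means
    // the two words co-occur more often than chance would predict.
    double combinationDegree(String first, String second) {
        double vocab = unigramCounts.size();
        double pPair = (bigramCounts.getOrDefault(first + " " + second, 0) + 1.0)
                / (totalBigrams + vocab * vocab);
        double pFirst = (unigramCounts.getOrDefault(first, 0) + 1.0) / (totalTokens + vocab);
        double pSecond = (unigramCounts.getOrDefault(second, 0) + 1.0) / (totalTokens + vocab);
        return Math.log(pPair / (pFirst * pSecond));
    }

    // Fusion step: a candidate must satisfy the rule AND exceed the score threshold.
    boolean isVerbPhrase(String first, String second, double threshold) {
        return PARTICLES.contains(second) && combinationDegree(first, second) > threshold;
    }

    public static void main(String[] args) {
        CombinationDegreeSketch model = new CombinationDegreeSketch();
        model.train(Arrays.asList("they", "give", "up", "the", "plan"));
        model.train(Arrays.asList("we", "give", "up", "the", "game"));
        model.train(Arrays.asList("you", "give", "up", "too", "soon"));
        model.train(Arrays.asList("the", "plan", "is", "good"));
        System.out.println("give up:  " + model.isVerbPhrase("give", "up", 0.0));
        System.out.println("the plan: " + model.isVerbPhrase("the", "plan", 0.0));
    }
}
```

On this toy corpus, "give up" passes both the statistical check and the rule, while "the plan" also has a positive score but is rejected by the particle rule. That is the kind of false positive a rule constraint is meant to filter. A real system would presumably rely on part-of-speech information and a far larger corpus than this sketch assumes.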
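[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1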

[References]