Research on Word and Phrase Paraphrase Extraction Based on Context Analysis

Published: 2018-05-23 21:12

Topics: paraphrase extraction + context information; Source: Harbin Institute of Technology, master's thesis, 2017


[Abstract]: In everyday life, people often use different wording to express the same information; this is the phenomenon of paraphrase. The existence of paraphrase makes many natural language processing tasks more complex and difficult. Word and phrase paraphrase extraction obtains, from a corpus, words and phrases that express the same meaning. The extracted paraphrase resources have important applications in natural language processing tasks such as question answering, information retrieval, machine translation, and text generation, and can improve the performance of the corresponding systems. This thesis on context-based word and phrase paraphrase extraction covers three topics: a word-level paraphrase extraction method based on context analysis, a phrase-level paraphrase extraction method based on the pivot method, and a phrase-level paraphrase extraction method based on context analysis.

First, the thesis proposes a word paraphrase extraction method based on context analysis. Current work on word paraphrase extraction mainly uses the pivot method to extract word paraphrases from bilingual parallel corpora. This thesis adopts the idea of the pivot method but uses online translation resources for Chinese words to extract candidate word paraphrases, which avoids the wrong paraphrases caused by alignment errors in bilingual parallel corpora. Word vectors are learned from word contexts. The paraphrase score learned by a feed-forward neural network is combined with the similarity score between word vectors to give the final score of a word paraphrase, and this final score is used to rank and filter the word paraphrase resources. Filtering with context and related information reduces the wrong paraphrases caused by polysemy in the foreign-language translations. Manual evaluation of the extracted resources shows that the word paraphrases obtained by this method are of higher quality than those obtained by the traditional pivot method.

Second, building on the commonly used pivot-based approach to phrase paraphrase extraction, the thesis addresses the wrong phrase paraphrases that this approach produces because of bilingual alignment errors and polysemous foreign-language translations. The extracted phrase paraphrase resources are filtered using translation probability and context information, respectively. Experiments show that filtering candidate phrase paraphrases with context information substantially improves the quality of the extracted resources.

Finally, the thesis proposes a phrase paraphrase extraction method based on context analysis. The method uses a two-layer BiLSTM-CRF model to segment a Chinese monolingual corpus into phrases, then uses a deep learning model to learn vector representations of the phrases, and extracts phrases whose vectors have high cosine similarity as candidate phrase paraphrases. These candidates are filtered using the English translations of their words. A method for learning phrase context vectors is also proposed, and the similarity of phrase context vectors is used to rank the candidate phrase paraphrases. Experiments show that neural network models can learn semantic vector representations of phrases, and that after filtering and ranking, the quality of the phrase paraphrase resources is far higher than that of phrase paraphrases extracted with the pivot method.
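The word-level idea in the abstract (pivot-style candidate extraction from translation resources, followed by context-vector filtering) can be illustrated with a small sketch. This is not the thesis's implementation: the function names, the toy translation dictionary, the three-dimensional vectors, and the 0.5 threshold are all assumptions made for illustration, and the feed-forward paraphrase score that the thesis combines with vector similarity is left out, so ranking here uses cosine similarity alone.

```python
import math
from collections import defaultdict


def pivot_candidates(translations):
    """Pair up words that share at least one pivot-language translation.

    translations: dict mapping a source word to a set of pivot translations
    (hypothetical stand-in for an online translation resource).
    Returns a dict mapping each word to its set of candidate paraphrases.
    """
    by_pivot = defaultdict(set)
    for word, pivots in translations.items():
        for pivot in pivots:
            by_pivot[pivot].add(word)
    candidates = defaultdict(set)
    for words in by_pivot.values():
        for word in words:
            candidates[word] |= words - {word}
    return dict(candidates)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_paraphrases(word, candidates, vectors, threshold=0.5):
    """Drop candidates below the similarity threshold, then sort by score."""
    scored = [
        (cand, cosine(vectors[word], vectors[cand]))
        for cand in candidates.get(word, ())
        if cand in vectors
    ]
    kept = [(cand, score) for cand, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed): "bank" is a polysemous pivot that wrongly links
# 银行 (financial bank) and 河岸 (river bank).
translations = {
    "快乐": {"happy", "joyful"},
    "高兴": {"happy", "glad"},
    "银行": {"bank"},
    "河岸": {"bank", "riverside"},
}
# Toy context vectors (assumed); real ones would be learned from a corpus.
vectors = {
    "快乐": [0.9, 0.1, 0.0],
    "高兴": [0.8, 0.2, 0.1],
    "银行": [0.0, 0.9, 0.1],
    "河岸": [0.1, 0.0, 0.9],
}

candidates = pivot_candidates(translations)
happy_ranked = rank_paraphrases("快乐", candidates, vectors)
bank_ranked = rank_paraphrases("银行", candidates, vectors)
```

In this toy setup the pivot step proposes 河岸 as a paraphrase of 银行 because both translate to "bank", and the context-vector filter rejects the pair because their vectors are nearly orthogonal. That is the kind of polysemy error the abstract says context filtering is meant to reduce.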
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1



