Research on Methods for Constructing a Chinese-Burmese Bilingual Topic Model Fusing Multiple Features
Published: 2018-12-29 18:33
[Abstract]: Chinese-Burmese parallel corpora are a foundational resource for research on Chinese-Burmese machine translation, cross-language retrieval, parallel sentence-pair extraction, and bilingual entity extraction. Cross-lingual topic models, a basic tool for multilingual document analysis, can measure the relatedness of documents in different languages at the semantic level, which supports both the acquisition of Chinese-Burmese comparable documents and the construction of parallel corpora. Studying how to build a Chinese-Burmese bilingual topic model is therefore important for acquiring Chinese-Burmese comparable documents. Starting from corpus construction, and with the goal of obtaining comparable corpora through topic models, this thesis studies the construction of bilingual topic models. The main results are as follows:
(1) The construction of a Chinese-Burmese bilingual parallel corpus is described in detail. Chinese-Burmese bilingual text is scarce, and no public, authoritative Chinese-Burmese text collection exists in China or abroad. Building a Chinese-Burmese bilingual topic model requires a certain amount of parallel documents as training data, and the quality of those documents affects the subsequent topic-model research. The thesis details how Chinese-Burmese bilingual text was obtained from web pages, electronic magazines, and the WeChat platform: web text was collected automatically with a crawler, while electronic magazine and WeChat content was collected manually. The resources were then consolidated into a Chinese-Burmese bilingual parallel corpus, and the corresponding data storage method is described.
(2) A Chinese-Burmese bilingual topic model that fuses contextual features is proposed. The model builds on the bilingual LDA topic model and incorporates the contextual features of the text. Bilingual LDA exploits the association between parallel texts, namely that parallel texts share the same document-topic distribution matrix, while the contextual features address the bag-of-words model's disregard for text structure. In essence, the fused model reduces the negative influence of high-frequency words on the document-topic distribution. Experimental results show that the proposed model yields better document-topic distributions.
(3) A Chinese-Burmese bilingual topic model that fuses semantic expansion is proposed. Building on the context-feature model, it further incorporates a Chinese-Burmese semantic expansion dictionary. The dictionary is parsed and processed to build a Chinese-Burmese semantic expansion set. Words are weighted by their contextual features, a threshold is set, and for each word whose weight exceeds the threshold, the corresponding Burmese text is expanded using the expansion set. This semantic expansion addresses the problem that a single word can have multiple expressions in Burmese. The contextual and semantic-expansion features are fused into a single bilingual LDA model, and a comparison of experimental results shows that the resulting multi-feature bilingual topic model outperforms the models used in the comparison experiments.
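The weighting and threshold-based expansion step described in (3) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not give the context-weight formula, so `context_weights` uses a stand-in (distinct neighbours per occurrence, which happens to damp very frequent words), and the function names, the `window` and `threshold` parameters, and the placeholder tokens such as `zh_computer` and `my_computer_v2` are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def context_weights(tokens, window=2):
    """Stand-in context weight (an assumption, not the thesis's formula).

    For each word: number of distinct neighbouring words seen within
    `window` positions, divided by (occurrences * 2 * window). Words that
    occur very often with a repetitive context receive a lower weight.
    """
    counts = Counter(tokens)
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                neighbours[word].add(tokens[j])
    return {w: len(neighbours[w]) / (counts[w] * 2 * window) for w in counts}


def expand_burmese(zh_tokens, my_tokens, expansion_set, weights, threshold):
    """Append Burmese variants for Chinese words whose weight exceeds threshold.

    `expansion_set` maps a Chinese word to a list of Burmese expressions
    (a hypothetical stand-in for the parsed semantic expansion dictionary).
    """
    expanded = list(my_tokens)
    for word in dict.fromkeys(zh_tokens):  # unique words, original order
        if weights.get(word, 0.0) > threshold:
            for variant in expansion_set.get(word, ()):
                if variant not in expanded:
                    expanded.append(variant)
    return expanded


# Placeholder tokens only; these are not real Chinese or Burmese words.
zh_doc = ["zh_the", "zh_computer", "zh_the", "zh_network", "zh_the"]
my_doc = ["my_computer_v1", "my_network_v1"]
expansion = {
    "zh_computer": ["my_computer_v1", "my_computer_v2"],
    "zh_the": ["my_the_v2"],
}

weights = context_weights(zh_doc, window=1)
print(expand_burmese(zh_doc, my_doc, expansion, weights, threshold=0.4))
```

The expanded Burmese token list would then replace the original Burmese side of the document pair before training the bilingual LDA model, so that alternative expressions of the same word contribute to the same shared topic distribution.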
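[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1
Article ID: 2395219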
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2395219.html