
Research on a Cross-Language Document Learning-to-Rank Method Based on Bilingual Document Similarity

Published: 2018-03-30 22:33

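  Topic: information retrieval. Focus: bilingual document similarity. Source: Kunming University of Science and Technology, 2017 master's thesis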


[Abstract]: Cross-language information retrieval is an active research topic and plays an important role in areas such as cross-language document analysis and cross-language news acquisition. Current work on cross-language information retrieval concentrates on query-translation and document-translation methods, which depend heavily on statistical machine translation and suffer from training corpora that are hard to obtain and from low translation accuracy. Meanwhile, research on learning to rank for information retrieval focuses on monolingual document ranking, and learning to rank for cross-language documents has received little attention. This thesis proposes a cross-language learning-to-rank model based on bilingual document similarity: a ranking function is trained with machine learning, and the similarity between bilingual documents is incorporated as a factor when ranking cross-language documents. In building the model, the thesis addresses two main problems:

1. A method for computing the similarity between bilingual documents. Because documents in different languages are difficult to represent in a single unified space, the thesis proposes a similarity computation based on bilingual word embeddings. Keywords are first extracted from the bilingual documents and then mapped into the same semantic space, and the distances between these keywords are used to represent the similarity between the documents. Experimental results show that the proposed method computes the similarity between bilingual documents well.

2. A cross-language learning-to-rank model based on bilingual document similarity. Because pointwise and pairwise loss functions cannot accurately represent ranking loss, the thesis builds the model from a listwise loss function, the cross-entropy between probability distributions, and a ranking function based on an artificial neural network. It proposes ranking cross-language documents uniformly by incorporating bilingual document similarity as a feature, with the similarity serving as the basis for scoring documents in the target language. Experimental results show that the proposed model ranks well on both the English-Chinese and English-Vietnamese corpora.
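
The two ideas in the abstract, keyword-level similarity in a shared embedding space and a listwise cross-entropy loss, can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function names are hypothetical, keyword similarity is aggregated as the average best cosine match, and the listwise loss takes the top-one (ListNet-style) form. In the thesis the similarity is fed as a feature into a neural ranking function, which this sketch leaves out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def doc_similarity(src_keywords, tgt_keywords):
    """Similarity of two documents given their keyword vectors, which are
    assumed to already live in one shared bilingual embedding space.

    Assumed aggregation (not taken from the thesis): for each source
    keyword, take its best cosine match among the target keywords,
    then average those best matches.
    """
    if not src_keywords or not tgt_keywords:
        return 0.0
    best = [max(cosine(s, t) for t in tgt_keywords) for s in src_keywords]
    return sum(best) / len(best)


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_total = math.log(sum(math.exp(s - m) for s in scores))
    return [s - m - log_total for s in scores]


def listwise_cross_entropy(true_scores, predicted_scores):
    """Top-one listwise loss: cross-entropy between the softmax of the
    ground-truth relevance scores and the softmax of the predicted scores
    for one query's candidate list."""
    p_true = [math.exp(x) for x in log_softmax(true_scores)]
    log_p_pred = log_softmax(predicted_scores)
    return -sum(p * lq for p, lq in zip(p_true, log_p_pred))


# Toy, made-up 2-D keyword vectors for one source document and three
# candidate target-language documents.
query_doc = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    [[1.0, 0.0], [0.0, 1.0]],  # same keyword directions
    [[1.0, 1.0]],              # partially related
    [[-1.0, 0.0]],             # unrelated
]
sims = [doc_similarity(query_doc, c) for c in candidates]

# Using the similarities directly as predicted scores is a simplification.
loss = listwise_cross_entropy([2.0, 1.0, 0.0], sims)
```

Because the loss compares whole score distributions over the candidate list, it is smallest when the predicted scores reproduce the ground-truth distribution, which is the property that distinguishes it from pointwise and pairwise losses.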
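[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3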





Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1687979.html

