
Research and Implementation of Cross-language Semantically Similar Word Extraction

Published: 2019-06-20 19:34
【Abstract】: As computer applications continue to advance, people increasingly expect computers to reason about and handle problems the way humans do. Natural language processing, one of the most important components of this effort, enables computers to better understand human needs. Semantic similarity measures how close a set of documents, phrases, or words are in meaning; cross-language word semantic similarity measures the degree to which words from two (or more) different languages express the same meaning by pointing to similar semantic concepts. As an important part of information processing, and with the advance of big-data technology, it holds great application and research value in artificial intelligence, natural language processing, information retrieval, and other fields. Chinese and English, two of the world's most important languages, are widely used in economics, culture, trade, and education, so this thesis takes them as its main research objects. Current approaches to computing cross-language word semantic similarity fall into three categories: (1) methods based on semantic knowledge-base rules; (2) methods based on corpus statistics; and (3) hybrid methods. A semantic knowledge base contains manually curated semantic knowledge that together forms a complex semantic network, and rule-based methods exploit this knowledge fully. Corpus-statistical methods gather sufficient data through manual entry, web crawling, crowdsourcing, and other channels, then apply probability, statistics, and machine learning to compute the semantic similarity between words of two different languages; however, uneven word distribution in the corpus biases the resulting similarity scores. Hybrid methods, which combine a knowledge base with corpus statistics, compensate well for these shortcomings and have attracted growing research attention. Building on the Chinese Concept Dictionary (CCD) and the WordNet knowledge base, we first construct the CSWE (Chinese Semantic Similar Words Extraction) and ESWE (English Semantic Similar Words Extraction) models and apply them to extracting semantically similar words in Chinese and English, respectively. Experiments show that CSWE and ESWE preserve the correctness of the extracted words while, as the dataset grows, taking less time to extract the top-k similar words. We then extend CSWE and ESWE into the CLSWE (Cross-language Semantic Similar Words Extraction) model for cross-language extraction. To demonstrate its performance, we validate it on WordSim353 and RW, two datasets that differ in size and vocabulary. We first extract the 77 words common to WordSim353 and RW as the English correctness-verification dataset, then pass these 77 words through multiple rounds of translation to determine their best-matching Chinese counterparts, yielding the Chinese correctness-verification dataset. Experiments show that, compared with the baseline models, the proposed CLSWE model preserves correctness in cross-language similar-word extraction while extracting the top-k words most similar to a query word in less time as the dataset grows. Overall, the proposed cross-language semantically similar word extraction models achieve good experimental performance relative to the baselines.
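The knowledge-base rule method described in the abstract walks hand-curated hypernym relations rather than corpus counts. A minimal sketch of the idea, using a hypothetical toy taxonomy (invented for illustration, not CCD or WordNet data) and the classic Wu-Palmer formula:

```python
# Toy hypernym taxonomy: each entry maps a word to its parent concept.
# This data is an assumption for illustration only.
PARENT = {
    "dog": "canine", "wolf": "canine",
    "cat": "feline", "feline": "mammal",
    "canine": "mammal", "mammal": "animal",
    "bird": "animal", "animal": "entity",
}

def path_to_root(word):
    """Return the hypernym chain from word up to the root."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(word):
    """Depth counted from the root, including the node itself."""
    return len(path_to_root(word))

def lcs(w1, w2):
    """Least common subsumer: the deepest hypernym shared by both words."""
    ancestors1 = set(path_to_root(w1))
    for node in path_to_root(w2):  # walk upward from w2
        if node in ancestors1:
            return node
    return None

def wup_similarity(w1, w2):
    """Wu-Palmer similarity: 2*depth(lcs) / (depth(w1) + depth(w2))."""
    common = lcs(w1, w2)
    if common is None:
        return 0.0
    return 2.0 * depth(common) / (depth(w1) + depth(w2))
```

Identical words score 1.0, and the score falls as the nearest shared hypernym moves toward the root: here "dog"/"wolf" (shared parent "canine") outscores "dog"/"cat" (shared at "mammal"), which outscores "dog"/"bird" (shared only at "animal").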
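The corpus-statistics side and the cross-language step can be sketched in the same spirit: represent each word by its co-occurrence vector, bridge the two languages with a seed translation dictionary, and rank candidates by cosine similarity. The corpora and dictionary below are toy assumptions, not the thesis's actual CLSWE model or data:

```python
from collections import Counter
from math import sqrt

# Toy parallel-topic corpora and seed dictionary (assumptions for illustration).
EN_CORPUS = [
    "the dog chased the cat",
    "the cat drank milk",
    "the dog drank water",
]
ZH_CORPUS = [  # toy Chinese corpus, pre-segmented into words
    "狗 追 猫", "猫 喝 牛奶", "狗 喝 水",
]
SEED_DICT = {"chased": "追", "drank": "喝", "milk": "牛奶", "water": "水"}

def context_vectors(sentences):
    """Map each word to a Counter of the words co-occurring in its sentence."""
    vecs = {}
    for sent in sentences:
        words = sent.split()
        for w in words:
            vec = vecs.setdefault(w, Counter())
            for c in words:
                if c != w:
                    vec[c] += 1
    return vecs

def cosine(v1, v2):
    dot = sum(v1[k] * v2[k] for k in v1)
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def top_k_cross_language(query_en, k=2):
    """Rank Chinese words by similarity to an English query word."""
    en_vecs = context_vectors(EN_CORPUS)
    zh_vecs = context_vectors(ZH_CORPUS)
    # Project the English context vector into Chinese via the seed dictionary;
    # contexts without a translation are simply dropped.
    projected = Counter()
    for ctx, count in en_vecs[query_en].items():
        if ctx in SEED_DICT:
            projected[SEED_DICT[ctx]] += count
    scored = [(w, cosine(projected, v)) for w, v in zh_vecs.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]
```

With this toy data, the top-ranked Chinese candidate for the query "dog" is 狗 and for "cat" is 猫, because their translated context words (追, 喝, 水, 牛奶) overlap most with those candidates' own contexts. The sparse-seed-dictionary projection also shows where the distribution-bias problem mentioned in the abstract enters: any context word missing from the dictionary contributes nothing to the score.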
【Degree-granting institution】: Nanjing Normal University
【Degree level】: Master's
【Year conferred】: 2016
【CLC classification】: TP391.1

【Similar literature】

Related journal articles (top 10)

1 胡艳波; 崔新春; 路青. A bibliometric analysis of domestic semantic similarity research, 2002–2011 [J]. 情报科学, 2013(07).

2 王家琴; 李仁发; 李仲生; 唐剑波. An ontology-based method for concept semantic similarity [J]. 计算机工程, 2007(11).

3 刘俊. Applying semantic-similarity-based keyword generation to enterprise search engine marketing [J]. 电脑知识与技术, 2008(14).

4 宗裕朋; 吴刚. A context-based semantic similarity algorithm [J]. 微计算机信息, 2008(30).

5 刘春辰; 刘大有; 王生生; 赵静滨; 王兆丹. An improved semantic similarity computation model and its application [J]. 吉林大学学报(工学版), 2009(01).

6 徐猛; 刘宗田; 周文. An applied study of HowNet-based semantic similarity computation [J]. 微计算机信息, 2010(03).

7 孙海霞; 钱庆; 成颖. A survey of ontology-based semantic similarity computation methods [J]. 现代图书情报技术, 2010(01).

8 魏椺; 向阳; 陈千. A hybrid method for computing semantic similarity between terms [J]. 计算机应用, 2010(06).

9 马续补; 郭菊娥. Diagnosing enterprise fact topics based on HowNet semantic similarity [J]. 情报杂志, 2010(05).

10 魏凯斌; 冉延平; 余牛. Research and analysis of semantic similarity computation methods [J]. 计算机技术与发展, 2010(07).

Related conference papers (top 10)

1 关毅; 王晓龙. Statistics-based computation of semantic similarity between Chinese words [A]. Language Computing and Content-based Text Processing: Proceedings of the 7th National Joint Conference on Computational Linguistics [C]. 2003.

2 李月雷; 师瑞峰; 林丽冰; 周一民. A method for computing the semantic similarity of Chinese sentences [A]. Proceedings of the 2008 China Information Technology and Application Academic Forum (I) [C]. 2008.

3 冯新元; 魏建国; 路文焕; 党建武. HowNet-based word semantic similarity computation incorporating domain knowledge [A]. Proceedings of the 12th National Conference on Man-Machine Speech Communication (NCMMSC'2013) [C]. 2013.

4 章成志. Word semantic similarity computation and its applications [A]. Proceedings of the 1st National Conference on Information Retrieval and Content Security (NCIRCS2004) [C]. 2004.

5 刘寒磊; 关毅; 徐永东. Maximal marginal relevance based on semantic similarity in multi-document summarization [A]. Proceedings of the 8th National Joint Conference on Computational Linguistics (JSCL-2005) [C]. 2005.

6 石静; 邱立坤; 王菲; 吴云芳. An ensemble method for acquiring similar words [A]. Frontiers of Chinese Computational Linguistics Research (2009–2011) [C]. 2011.


Article ID: 2503476



Article link: https://www.wllwen.com/jingjilunwen/jiliangjingjilunwen/2503476.html


