
Research and Implementation of Cross-language Semantically Similar Word Extraction

Published: 2019-06-20 19:34
【Abstract】: As computer applications continue to advance, people increasingly expect computers to reason about and handle problems the way humans do. Natural language processing, one of the most important components of this effort, enables computers to better understand human needs. Semantic similarity measures how close a set of documents, phrases, or words are in meaning; cross-language word semantic similarity measures the degree to which words from two (or more) different languages express the same meaning by pointing to similar semantic concepts. As an important part of information processing, and with the advance of big-data technology, it holds great application and research value in artificial intelligence, natural language processing, information retrieval, and other fields. Chinese and English, two of the world's most important languages, are widely used in economics, culture, trade, and education, so this thesis takes them as its main research objects. Current approaches to computing cross-language word semantic similarity fall into three categories: (1) methods based on semantic knowledge-base rules; (2) methods based on corpus statistics; and (3) hybrid methods. A semantic knowledge base contains manually curated semantic knowledge that together forms a complex semantic network, and rule-based methods exploit this knowledge fully. Corpus-statistical methods gather sufficient data through manual entry, web crawling, crowdsourcing, and other channels, then apply probability, statistics, and machine learning to compute the semantic similarity between words of two different languages; however, uneven word distribution in the corpus biases the resulting similarity scores. Hybrid methods, which combine a knowledge base with corpus statistics, compensate well for these shortcomings and have attracted growing research attention. Building on the Chinese Concept Dictionary (CCD) and the WordNet knowledge base, we first construct the CSWE (Chinese Semantic Similar Words Extraction) and ESWE (English Semantic Similar Words Extraction) models and apply them to extracting semantically similar words in Chinese and English, respectively. Experiments show that CSWE and ESWE preserve the correctness of the extracted words while, as the dataset grows, taking less time to extract the top-k similar words. We then extend CSWE and ESWE into the CLSWE (Cross-language Semantic Similar Words Extraction) model for cross-language extraction. To demonstrate its performance, we validate it on WordSim353 and RW, two datasets that differ in size and vocabulary. We first extract the 77 words common to WordSim353 and RW as the English correctness-verification dataset, then pass these 77 words through multiple rounds of translation to determine their best-matching Chinese counterparts, yielding the Chinese correctness-verification dataset. Experiments show that, compared with the baseline models, the proposed CLSWE model preserves correctness in cross-language similar-word extraction while extracting the top-k words most similar to a query word in less time as the dataset grows. Overall, the proposed cross-language semantically similar word extraction models achieve good experimental performance relative to the baselines.
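The knowledge-base rule method described in the abstract walks hand-curated hypernym relations rather than corpus counts. A minimal sketch of the idea, using a hypothetical toy taxonomy (invented for illustration, not CCD or WordNet data) and the classic Wu-Palmer formula:

```python
# Toy hypernym taxonomy: each entry maps a word to its parent concept.
# This data is an assumption for illustration only.
PARENT = {
    "dog": "canine", "wolf": "canine",
    "cat": "feline", "feline": "mammal",
    "canine": "mammal", "mammal": "animal",
    "bird": "animal", "animal": "entity",
}

def path_to_root(word):
    """Return the hypernym chain from word up to the root."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(word):
    """Depth counted from the root, including the node itself."""
    return len(path_to_root(word))

def lcs(w1, w2):
    """Least common subsumer: the deepest hypernym shared by both words."""
    ancestors1 = set(path_to_root(w1))
    for node in path_to_root(w2):  # walk upward from w2
        if node in ancestors1:
            return node
    return None

def wup_similarity(w1, w2):
    """Wu-Palmer similarity: 2*depth(lcs) / (depth(w1) + depth(w2))."""
    common = lcs(w1, w2)
    if common is None:
        return 0.0
    return 2.0 * depth(common) / (depth(w1) + depth(w2))
```

Identical words score 1.0, and the score falls as the nearest shared hypernym moves toward the root: here "dog"/"wolf" (shared parent "canine") outscores "dog"/"cat" (shared at "mammal"), which outscores "dog"/"bird" (shared only at "animal").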
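The corpus-statistics side and the cross-language step can be sketched in the same spirit: represent each word by its co-occurrence vector, bridge the two languages with a seed translation dictionary, and rank candidates by cosine similarity. The corpora and dictionary below are toy assumptions, not the thesis's actual CLSWE model or data:

```python
from collections import Counter
from math import sqrt

# Toy parallel-topic corpora and seed dictionary (assumptions for illustration).
EN_CORPUS = [
    "the dog chased the cat",
    "the cat drank milk",
    "the dog drank water",
]
ZH_CORPUS = [  # toy Chinese corpus, pre-segmented into words
    "狗 追 猫", "猫 喝 牛奶", "狗 喝 水",
]
SEED_DICT = {"chased": "追", "drank": "喝", "milk": "牛奶", "water": "水"}

def context_vectors(sentences):
    """Map each word to a Counter of the words co-occurring in its sentence."""
    vecs = {}
    for sent in sentences:
        words = sent.split()
        for w in words:
            vec = vecs.setdefault(w, Counter())
            for c in words:
                if c != w:
                    vec[c] += 1
    return vecs

def cosine(v1, v2):
    dot = sum(v1[k] * v2[k] for k in v1)
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def top_k_cross_language(query_en, k=2):
    """Rank Chinese words by similarity to an English query word."""
    en_vecs = context_vectors(EN_CORPUS)
    zh_vecs = context_vectors(ZH_CORPUS)
    # Project the English context vector into Chinese via the seed dictionary;
    # contexts without a translation are simply dropped.
    projected = Counter()
    for ctx, count in en_vecs[query_en].items():
        if ctx in SEED_DICT:
            projected[SEED_DICT[ctx]] += count
    scored = [(w, cosine(projected, v)) for w, v in zh_vecs.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]
```

With this toy data, the top-ranked Chinese candidate for the query "dog" is 狗 and for "cat" is 猫, because their translated context words (追, 喝, 水, 牛奶) overlap most with those candidates' own contexts. The sparse-seed-dictionary projection also shows where the distribution-bias problem mentioned in the abstract enters: any context word missing from the dictionary contributes nothing to the score.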
【Degree-granting institution】: Nanjing Normal University
【Degree level】: Master's
【Year conferred】: 2016
【CLC classification】: TP391.1

【Similar literature】

Related journal articles (top 10)

1 胡艳波; 崔新春; 路青. A bibliometric analysis of domestic semantic similarity research, 2002–2011 [J]. 情报科学, 2013(07).

2 王家琴; 李仁发; 李仲生; 唐剑波. An ontology-based method for concept semantic similarity [J]. 计算机工程, 2007(11).

3 刘俊. Applying semantic-similarity-based keyword generation to enterprise search engine marketing [J]. 电脑知识与技术, 2008(14).

4 宗裕朋; 吴刚. A context-based semantic similarity algorithm [J]. 微计算机信息, 2008(30).

5 刘春辰; 刘大有; 王生生; 赵静滨; 王兆丹. An improved semantic similarity computation model and its application [J]. 吉林大学学报(工学版), 2009(01).

6 徐猛; 刘宗田; 周文. An applied study of HowNet-based semantic similarity computation [J]. 微计算机信息, 2010(03).

7 孙海霞; 钱庆; 成颖. A survey of ontology-based semantic similarity computation methods [J]. 现代图书情报技术, 2010(01).

8 魏椺; 向阳; 陈千. A hybrid method for computing semantic similarity between terms [J]. 计算机应用, 2010(06).

9 马续补; 郭菊娥. Diagnosing enterprise fact topics based on HowNet semantic similarity [J]. 情报杂志, 2010(05).

10 魏凯斌; 冉延平; 余牛. Research and analysis of semantic similarity computation methods [J]. 计算机技术与发展, 2010(07).

Related conference papers (top 10)

1 关毅; 王晓龙. Statistics-based computation of semantic similarity between Chinese words [A]. Language Computing and Content-based Text Processing: Proceedings of the 7th National Joint Conference on Computational Linguistics [C]. 2003.

2 李月雷; 师瑞峰; 林丽冰; 周一民. A method for computing the semantic similarity of Chinese sentences [A]. Proceedings of the 2008 China Information Technology and Application Academic Forum (I) [C]. 2008.

3 冯新元; 魏建国; 路文焕; 党建武. HowNet-based word semantic similarity computation incorporating domain knowledge [A]. Proceedings of the 12th National Conference on Man-Machine Speech Communication (NCMMSC'2013) [C]. 2013.

4 章成志. Word semantic similarity computation and its applications [A]. Proceedings of the 1st National Conference on Information Retrieval and Content Security (NCIRCS2004) [C]. 2004.

5 刘寒磊; 关毅; 徐永东. Maximal marginal relevance based on semantic similarity in multi-document summarization [A]. Proceedings of the 8th National Joint Conference on Computational Linguistics (JSCL-2005) [C]. 2005.

6 石静; 邱立坤; 王菲; 吴云芳. An ensemble method for acquiring similar words [A]. Frontiers of Chinese Computational Linguistics Research (2009–2011) [C]. 2011.


Article ID: 2503476



Article link: https://www.wllwen.com/jingjilunwen/jiliangjingjilunwen/2503476.html


