Research on Concept-Related Word Groups Based on Chinese Wikipedia
Published: 2018-07-30 06:38
【Abstract】: With the rapid growth of the Internet, demand for information keeps rising while the volume of information explodes, which makes collecting and finding information increasingly difficult. Finding accurate and comprehensive information within a limited time is a major challenge for search technology research, and adding semantic knowledge to search engine systems is one important way to improve query efficiency.
Words are the smallest unit of semantic representation, but polysemy, aliases, and other complications mean that a single word is often ambiguous on its own, and some traditional word relatedness measures do not handle word sense disambiguation well. Traditional approaches fall roughly into two kinds. The first applies statistical methods to large-scale corpora, but corpora that are both large enough and accurate are scarce in practice. The second relies on manually built knowledge systems, which have their own problems, such as small scale and high maintenance cost.
Given these shortcomings and the demand for semantic knowledge in natural language processing, this thesis focuses on word relatedness computation and on mining concept-related word groups. The specific work is as follows.
First, after cleaning and processing Chinese Wikipedia resources, an improved WLVM method is used to build a dataset of relatedness scores between words. The dataset is evaluated and analyzed, and related word groups are compiled for a number of concepts. A concept's word group can serve as a semantic representation of that concept and can also be applied widely elsewhere in natural language processing, for example in text expansion and knowledge base construction.
Second, a word relatedness computation method is proposed. Building on an analysis of earlier relatedness methods and a comparison of large-scale corpora, manually built knowledge systems, and Wikipedia, the thesis proposes a method for computing semantic relatedness between words that combines links, the category system, text resources, anchor text, and other semantic knowledge, and that disambiguates the relatedness results. In the experiments, the method is used to compute word relatedness over text resources and over links and the category system, and comparison with several other methods demonstrates its effectiveness.
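As an illustration of the link-based relatedness idea the abstract refers to, here is a minimal sketch of a basic WLVM-style measure (WLVM is commonly expanded as the Wikipedia Link Vector Model). It is not the improved variant developed in the thesis, whose details the abstract does not give. The sketch assumes each article is represented by a vector over its outgoing link targets, each target is weighted by an inverse-frequency term, and relatedness is the cosine of two such vectors. The toy link graph and all function names are hypothetical.

```python
import math

# Hypothetical toy link graph: article title -> titles it links to.
LINKS = {
    "Search engine": ["Information retrieval", "Internet", "Index"],
    "Information retrieval": ["Search engine", "Index", "Semantics"],
    "Panda": ["China", "Bamboo"],
    "Internet": ["Search engine"],
}

def inlink_counts(links):
    """Count how many distinct articles link to each target."""
    counts = {}
    for targets in links.values():
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    return counts

def link_vector(article, links, counts):
    """Weight each outgoing link by log(total articles / articles linking to the target)."""
    total = len(links)
    vector = {}
    for target in links.get(article, []):
        weight = math.log(total / counts[target])
        # Repeated links to the same target accumulate weight.
        vector[target] = vector.get(target, 0.0) + weight
    return vector

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def relatedness(a, b, links=LINKS):
    """Relatedness of two articles as the cosine of their link vectors."""
    counts = inlink_counts(links)
    return cosine(link_vector(a, links, counts), link_vector(b, links, counts))

def related_group(concept, k, links=LINKS):
    """Return the k articles most related to `concept`, highest score first."""
    scored = [(other, relatedness(concept, other, links))
              for other in links if other != concept]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Calling `related_group("Search engine", 1)` on this toy graph ranks "Information retrieval" first, because the two articles share the link target "Index". A real word group would be built from a full Wikipedia link dump and, as the abstract notes, would also draw on categories, text, and anchor text.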
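【Degree-granting institution】: Central China Normal University
【Degree level】: Master's
【Year degree granted】: 2012
【Classification number】: TP391.1
Article ID: 2154149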
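【Citing documents】
Related master's theses (top 2)
1 Luo Chao; Research on Document Ranking Methods Based on the LDA Model [D]; Central China Normal University; 2013
2 Liu Qiang; Research on Query-Oriented Expansion Filtering and Weight Calculation [D]; Central China Normal University; 2013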
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2154149.html