Research on Search Engine Input Error Correction Based on Statistical Language Models

Published: 2018-07-21 15:04

[Abstract]: With the rapid growth of information technology, search engines play an increasingly important role on the Internet, and a growing user base places ever higher demands on them. Input error correction is an important auxiliary feature of search engines and is already widely deployed, so research on it matters for the further development of search engines. Error correction is also one of the major research topics in natural language processing, and work on Chinese text started later than work on English. Current approaches fall into two main groups: dictionary-based methods and methods based on statistical models. Dictionary-based correction is limited by the size and content of the dictionary, whereas statistical methods analyze the internal relationships of the language from a large number of examples and need no dedicated dictionary. Statistical models used for error correction include those based on mutual-information probability, N-gram models, and Chinese decision methods based on combination degree. This thesis proposes a method that relies entirely on analyzing contextual statistical information. To demonstrate its feasibility, a distributed search engine platform is built on Nutch and Hadoop for experimental validation.

The main work of the thesis is as follows. To build a sound search engine platform, it first introduces the mainstream indexing mechanism, the inverted index. It analyzes the performance model and compression techniques of the inverted index, compares its performance with that of ordinary indexes, and calculates the time and space complexity of building an inverted index. This leads to Lucene, a toolkit for building search engines that makes good use of the inverted index, and to Nutch, the search engine built on Lucene. Because the experiments require large volumes of data, a distributed platform is used, and the distributed search engine built with Nutch and Hadoop is described in detail. Because theoretical research on the Chinese language has its limitations, correcting search engine input requires building an N-gram language model from a Chinese corpus. The thesis analyzes this model in detail, determines the parameters it requires, and applies smoothing to address data sparsity. With a large corpus, the keywords corrected by the N-gram model may include identical results, so TF-IDF is used to compute weights for the preliminary results and filter them to obtain the best result set.
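The abstract does not give the model's order, smoothing method, or corpus, so the following is only a minimal sketch of the general idea of ranking candidate corrections with a smoothed N-gram model. It assumes a character-level bigram model with add-k smoothing and a tiny made-up corpus; the function names (`train_bigram`, `log_prob`, `best_candidate`) and the start marker `^` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

START = "^"  # hypothetical start-of-query marker, not from the thesis


def train_bigram(corpus):
    """Count character bigrams and their left contexts over a list of strings."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        chars = [START] + list(sentence)
        contexts.update(chars[:-1])            # every character that has a successor
        bigrams.update(zip(chars, chars[1:]))  # adjacent character pairs
        vocab.update(sentence)
    # One extra vocabulary slot stands in for characters never seen in training.
    return contexts, bigrams, len(vocab) + 1


def log_prob(text, contexts, bigrams, vocab_size, k=1.0):
    """Add-k smoothed log-probability of `text` under the bigram model."""
    chars = [START] + list(text)
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        numerator = bigrams[(prev, cur)] + k
        denominator = contexts[prev] + k * vocab_size
        total += math.log(numerator / denominator)
    return total


def best_candidate(candidates, contexts, bigrams, vocab_size):
    """Return the candidate string that the model scores highest."""
    return max(candidates, key=lambda c: log_prob(c, contexts, bigrams, vocab_size))


# Toy corpus and candidates, invented purely for illustration.
corpus = ["搜索引擎", "搜索技术", "搜索引擎优化", "语言模型"]
contexts, bigrams, vocab_size = train_bigram(corpus)
print(best_candidate(["搜素引擎", "搜索引擎"], contexts, bigrams, vocab_size))
```

Smoothing is what keeps the misspelled candidate comparable at all: without it, any bigram absent from the corpus would have probability zero and its logarithm would be undefined. The TF-IDF weighting that the thesis applies to the preliminary results is not covered by this sketch.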
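[Degree-granting institution]: Jiangsu University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.3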
