Research on Full-Text Search Based on Natural Language Understanding

Published: 2018-02-27 12:28

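Keywords: natural language understanding; inverted index; full-text search; Chinese word segmentation; local index
Source: Hubei University, master's thesis, 2013
Document type: degree thesis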

[Abstract]: As network technology develops, the volume of information on the network keeps growing, and retrieving the information that meets a given need from this mass of data efficiently, quickly, and accurately has become a central concern. Traditional information retrieval matches information mechanically on keywords alone; a growing number of researchers now combine natural language processing with search engine technology to explore intelligent search engines. This thesis studies full-text search, one of the mainstream information retrieval techniques. To handle unstructured text, full-text search treats the entire content of each document as its object: text processing extracts indexable plain text, the text is segmented into words, and the words are indexed to form an index library. When a user submits a query, the entered keywords are processed and matched against the index terms in the index library, and the information that satisfies the query is retrieved from it. On top of full-text search, this thesis adds a Chinese word segmentation layer based on natural language understanding. The specific research content and results are as follows.
(1) The key principles and processing mechanisms of full-text search and natural language understanding are analyzed. On this theoretical basis, and using the SS (Struts+Spring) framework, a prototype full-text search system is developed that applies full-segmentation Chinese word segmentation based on natural language understanding. The prototype takes the complete content of a range of typical unstructured document types and performs text preprocessing, Chinese word segmentation, index library construction, and retrieval against the index library.
(2) For document libraries that contain a small amount of information, the prototype builds the index library and retrieves information with relatively high efficiency and accuracy. When a document library contains a very large amount of information, however, preprocessing the full content of every document and then segmenting and indexing it can be expected to cost a great deal of time and space. To address this weakness, the thesis proposes building a local index over document content. Extending the completed prototype, it compares the two document processing mechanisms experimentally and concludes that local indexing of document content is well worth studying in the field of information retrieval.
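The pipeline described in the abstract (segment the text, build an inverted index, segment the query, match it against the index) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the toy dictionary, the function names, and the `prefix_chars` parameter are all hypothetical. Full segmentation is simplified here to "emit every dictionary word found at any position", and "local index" is read, as an assumption, as indexing only a leading portion of each document; the thesis may define both differently.

```python
from collections import defaultdict

# Hypothetical toy dictionary; a real system would load a large lexicon.
DICTIONARY = {"全文", "搜索", "全文搜索", "中文", "分词", "中文分词",
              "索引", "倒排", "倒排索引"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def full_segment(text, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Return every dictionary word that occurs in text, in scan order.

    Simplified full segmentation: overlapping candidates are all kept,
    and no single best segmentation path is chosen.
    """
    words = []
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            candidate = text[start:end]
            if candidate in dictionary:
                words.append(candidate)
    return words


def build_index(docs, prefix_chars=None):
    """Build an inverted index mapping each term to the set of doc ids.

    prefix_chars=None indexes the whole text. An integer indexes only the
    first prefix_chars characters (an assumed reading of "local index").
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        portion = text if prefix_chars is None else text[:prefix_chars]
        for term in full_segment(portion):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Segment the query and return ids of docs containing every term."""
    terms = full_segment(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "全文搜索依赖倒排索引",
    2: "中文分词是建立索引的前提",
}
full_index = build_index(docs)                   # index all content
local_index = build_index(docs, prefix_chars=4)  # index first 4 characters

print(sorted(search(full_index, "索引")))    # [1, 2]
print(sorted(search(local_index, "索引")))   # []
print(len(full_index), len(local_index))     # 9 6
```

In this toy run the local index holds fewer terms than the full index but misses matches that occur later in a document, which is the space-versus-recall trade-off the comparison in the abstract is concerned with.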
[Degree-granting institution]: Hubei University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3

[References]