
Design and Implementation of a Heterogeneous Data Joint Retrieval System

Published: 2018-10-26 13:06
[Abstract]: As computers and networks have become widespread, more and more enterprises, government agencies, and schools use computers to process documents, and their day-to-day management inevitably produces large volumes of electronic documents. Retrieving the information a user needs from such a collection quickly and accurately has become a major challenge. The enterprise studied here faces this problem: it currently manages documents through a directory hierarchy and has no retrieval system covering all of its documents, so employees spend a great deal of time looking for a given piece of information, and what they find is often incomplete. The enterprise therefore urgently needs a search engine that covers all of its documents and serves the needs of different users. Based on this requirement, the project studies the index construction and search mechanisms of a heterogeneous data joint retrieval system. The system offers several retrieval modes for the user's convenience, including retrieval by document type, by publisher, and by publication date. Because the enterprise's data volume is large and retrieval results must be accurate, the index construction and retrieval processes, as well as the Paoding (庖丁解牛) Chinese word segmenter, were extensively optimized. The system is developed in Java and is built mainly on Lucene, a Java-based full-text indexing toolkit. In view of the enterprise's large data volume and future system upgrades, it uses the GreenPlum database, which is designed for large-scale data processing. The project uses the SSH (Struts-Spring-Hibernate) framework, the POI and PDFBox toolkits for document parsing, and the Paoding segmenter for Chinese word segmentation. The development tool is MyEclipse 10. The system runs well, and in terms of retrieval efficiency and retrieval quality it largely meets the original design requirements.
[Degree-granting institution]: Northeastern University (东北大学)
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3


Article ID: 2295815


Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2295815.html

