Design and Implementation of a Massive Web Page Retrieval System for Petroleum Enterprises

Published: 2018-09-12 12:22

[Abstract]: As integrated exploration in oilfield enterprises continues to expand, the demand for paperless-office electronic documents and web pages used for production, operations, scientific research data analysis, and unified organization and management keeps growing. The number of documents multiplies every year, and the volume of stored documents has become very large. Existing search engines do not provide a good specialized index for these electronic documents. A customized enterprise-level document retrieval system allows document information to be found and retrieved quickly, and addresses the low efficiency and inaccurate results of earlier information retrieval. According to recent Internet surveys, the Internet now holds hundreds of millions of websites' worth of information, and Google, the world's largest search engine, has indexed more than 8 billion web pages. The web page extraction system of a search engine, also known as the crawler, is one of its main modules, and the crawler's speed and the quality of the pages it fetches largely determine the search efficiency of the engine. To make the crawler meet the needs of enterprise data collection and to reduce the unnecessary data duplication caused by collecting the same information repeatedly, this thesis addresses the shortcomings of current enterprise massive web page retrieval and, based on the specific business requirements of oilfield enterprises, proposes a new information retrieval method that introduces a multi-Field approach into the structure of the retrieval system. In addition, given the hardware conditions of the enterprise LAN, the system adopts a Lucene-based lexical analysis method to analyze page data and efficiently extract the plain body text of web pages. Finally, the system's integrity is verified, its performance is analyzed, and the system is tested. The test results show that the system meets the requirements of enterprise massive web page retrieval and has certain advantages in reliability, practicality, stability, speed, and security.
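The abstract names three ideas without showing them: avoiding repeated collection of the same page, extracting the plain body text of a page, and indexing with multiple fields. The following is only a minimal Python sketch of how these ideas might fit together, not the thesis's Lucene-based implementation. The class names `TextExtractor` and `MultiFieldIndex`, the `title` and `body` field names, the whitespace tokenizer, and the trailing-slash URL normalization are all assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects title text and visible body text, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.title, self.body = [], []
        self._stack = []  # open <title>/<script>/<style> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "script", "style"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        if current == "title":
            self.title.append(data)
        elif current is None:  # not inside script/style
            self.body.append(data)


class MultiFieldIndex:
    """Toy inverted index keyed by (field, term); hypothetical, not Lucene's API."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of page URLs
        self.seen = set()                 # URLs already collected

    def add_page(self, url, html):
        key = url.rstrip("/")  # assumed, deliberately simplistic normalization
        if key in self.seen:   # skip pages that were already collected
            return False
        self.seen.add(key)
        parser = TextExtractor()
        parser.feed(html)
        fields = {"title": " ".join(parser.title), "body": " ".join(parser.body)}
        for field, text in fields.items():
            for term in text.lower().split():  # whitespace tokenizer (assumption)
                self.postings[(field, term)].add(key)
        return True

    def search(self, field, term):
        return sorted(self.postings.get((field, term.lower()), set()))
```

Keeping a separate posting list per field lets a query be restricted to, for example, page titles only. A whitespace tokenizer would not segment Chinese text, so a real system would need a proper analyzer at that step.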
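[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3;TP393.092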

[Similar literature]