Research on Methods for Acquiring Chemical Substance Information Based on Web Mining Technology

Published: 2018-06-01 10:31

Topic: Web mining + focused crawler; Reference: Northwest A&F University, 2012 master's thesis

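【Abstract】: With the growth of the Internet, online information resources are increasing daily, and conventional means of acquiring information suffer from low accuracy and low efficiency. Taking commonly used chemical substance websites as its object of study, this paper investigates techniques and methods for acquiring information from web pages quickly and efficiently, so that a chemical substance environmental safety database can be updated automatically. First, vertical search engine technology is used to filter and fetch relevant chemical substance web pages and to analyze their structure, and a technique suited to each page's degree of structure is applied. Second, a sorting algorithm, a global schema, and related methods are used to integrate the heterogeneous data found on chemical substance websites. In addition, to keep extraction from dynamic information sources continuous and timely, task segmentation, a failure retry mechanism, and dynamic update checking are proposed. The main research content and conclusions are as follows: (1) Dynamic acquisition of chemical substance information from the web. The main task is to obtain the CasNo (the chemical substance's registry number), name, physical and chemical properties, and similar information. Depending on the page type, pages are fetched either with focused crawler technology or by simulating manual browsing; the tree structure of each page is analyzed, wrapper technology extracts the attribute information of the chemical substance, and regular expressions extract structured information from unstructured data; listener technology schedules the tasks for each chemical substance website, ensuring that online information is acquired automatically and that the data is updated in a timely way. (2) Integration of heterogeneous chemical substance data. To address data heterogeneity across chemical substance web pages, this paper first determines the integration scope from the attributes relevant to chemical substance environmental safety and designs a common data model, CompoundsDTO, as the global schema. A sorting algorithm is then used to analyze the dynamically acquired data. Finally, the processed data is mapped onto the global schema, which integrates the heterogeneous data and effectively eliminates structural and semantic conflicts between the heterogeneous data sources. (3) Design and development of a chemical substance environmental safety data management system. Building on the chemical substance environmental safety database, the system was designed and developed using the dynamic acquisition technique and the heterogeneous data integration technique. It extracts chemical substance information from the Internet automatically and in a timely way, stores the structurally unified, standardized data in the database using dynamic update detection, and supports updating and querying the database.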
[Abstract]: With the growth of the Internet, online information resources are increasing daily, and conventional means of acquiring information suffer from low accuracy and low efficiency. Taking commonly used chemical substance websites as its object of study, this paper investigates techniques and methods for acquiring information from web pages quickly and efficiently, so that a chemical substance environmental safety database can be updated automatically. First, vertical search engine technology is used to filter and fetch relevant chemical substance web pages and to analyze their structure, and a technique suited to each page's degree of structure is applied.
Second, a sorting algorithm, a global schema, and related methods are used to integrate the heterogeneous data found on chemical substance websites. In addition, to keep extraction from dynamic information sources continuous and timely, task segmentation, a failure retry mechanism, and dynamic update checking are proposed. The main research content and conclusions are as follows:

(1) Dynamic acquisition of chemical substance information from the web. The main task is to obtain the CasNo (the chemical substance's registry number), name, physical and chemical properties, and similar information. Depending on the page type, pages are fetched either with focused crawler technology or by simulating manual browsing; the tree structure of each page is analyzed, wrapper technology extracts the attribute information of the chemical substance, and regular expressions extract structured information from unstructured data; listener technology schedules the tasks for each chemical substance website, ensuring that online information is acquired automatically and that the data is updated in a timely way.
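The regular-expression step can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `is_valid_cas` and `extract_cas_numbers` are hypothetical, and the only outside fact relied on is the public CAS Registry Number layout (2 to 7 digits, 2 digits, and 1 check digit, where the check digit is the weighted digit sum modulo 10).

```python
import re

# CAS Registry Number layout: 2-7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(cas: str) -> bool:
    """Return True if `cas` has the CAS layout and a correct check digit."""
    match = CAS_PATTERN.fullmatch(cas)
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


def extract_cas_numbers(text: str) -> list[str]:
    """Pull distinct, checksum-valid CAS numbers out of free text, in order."""
    found = []
    for match in CAS_PATTERN.finditer(text):
        cas = match.group(0)
        if is_valid_cas(cas) and cas not in found:
            found.append(cas)
    return found
```

Validating the check digit after matching filters out look-alike strings such as part numbers, which a pattern match alone would accept.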

(2) Integration of heterogeneous chemical substance data. To address data heterogeneity across chemical substance web pages, this paper first determines the integration scope from the attributes relevant to chemical substance environmental safety and designs a common data model, CompoundsDTO, as the global schema. A sorting algorithm is then used to analyze the dynamically acquired data. Finally, the processed data is mapped onto the global schema, which integrates the heterogeneous data and effectively eliminates structural and semantic conflicts between the heterogeneous data sources.
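A rough sketch of mapping source-specific records onto a global schema follows. The abstract names CompoundsDTO but does not list its fields or say how the sorting algorithm is applied, so everything here is an assumption: the fields `cas_no`, `name`, and `properties`, the source names `site_a` and `site_b`, their field names, and the choice to sort by CAS number so that records for the same substance can be merged.

```python
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class CompoundsDTO:
    """Hypothetical global schema; the real CompoundsDTO fields are not given."""
    cas_no: str
    name: str = ""
    properties: dict = field(default_factory=dict)


# Hypothetical per-source field names mapped to the global schema's names.
# This table is what resolves naming (semantic) conflicts between sources.
FIELD_MAP = {
    "site_a": {"CAS": "cas_no", "Name": "name",
               "MP": "melting_point", "BP": "boiling_point"},
    "site_b": {"cas_number": "cas_no", "title": "name",
               "melting point": "melting_point"},
}


def normalize(source: str, record: dict) -> dict:
    """Rename a raw record's known fields to global names; drop the rest."""
    mapping = FIELD_MAP[source]
    return {mapping[k]: v.strip() for k, v in record.items() if k in mapping}


def integrate(records):
    """Merge (source, raw_record) pairs into one CompoundsDTO per CAS number."""
    rows = [normalize(source, raw) for source, raw in records]
    rows = [row for row in rows if row.get("cas_no")]
    # Sorting puts records for the same substance next to each other;
    # the sort is stable, so earlier sources win when values conflict.
    rows.sort(key=lambda row: row["cas_no"])
    merged = []
    for cas_no, group in groupby(rows, key=lambda row: row["cas_no"]):
        dto = CompoundsDTO(cas_no=cas_no)
        for row in group:
            for key, value in row.items():
                if key == "cas_no" or not value:
                    continue
                if key == "name":
                    dto.name = dto.name or value
                else:
                    dto.properties.setdefault(key, value)
        merged.append(dto)
    return merged
```

Keeping the field mapping in a table rather than in code means a new source site only needs a new `FIELD_MAP` entry.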

(3) Design and development of a chemical substance environmental safety data management system. Building on the chemical substance environmental safety database, the system was designed and developed using the dynamic acquisition technique and the heterogeneous data integration technique. It extracts chemical substance information from the Internet automatically and in a timely way, stores the structurally unified, standardized data in the database using dynamic update detection, and supports updating and querying the database.
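The failure retry mechanism and dynamic update detection mentioned in the abstract might look something like the sketch below. The abstract does not describe how either is implemented, so this is one plausible approach rather than the author's: a fixed number of retry attempts, and change detection by comparing a content hash. The names `with_retry`, `fingerprint`, and `upsert_if_changed` are hypothetical, and a plain dict stands in for the database.

```python
import hashlib
import time


def with_retry(task, attempts=3, delay=0.0):
    """Run `task()` up to `attempts` times, re-raising the last failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return task()
        except Exception as exc:  # a real crawler would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error


def fingerprint(record: dict) -> str:
    """Hash a record independently of the order of its keys."""
    canonical = repr(sorted(record.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def upsert_if_changed(store: dict, cas_no: str, record: dict) -> str:
    """Write `record` only if it is new or differs from the stored version."""
    new_hash = fingerprint(record)
    existing = store.get(cas_no)
    if existing is None:
        status = "inserted"
    elif existing["hash"] == new_hash:
        return "unchanged"
    else:
        status = "updated"
    store[cas_no] = {"hash": new_hash, "data": dict(record)}
    return status
```

Comparing hashes lets a scheduled crawl skip database writes for substances whose extracted data has not changed since the previous run.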
【Degree-granting institution】: Northwest A&F University
【Degree level】: Master's
【Year degree granted】: 2012
【Classification number】: TP311.13
