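Research on Heterogeneous Data Integration for the Interactome
Published: 2018-04-24 19:07
Topics: data integration + heterogeneous database systems; Source: Peking Union Medical College, doctoral dissertation, 2011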
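【Abstract】: A key goal of post-genome biomedicine is the comprehensive and systematic study of all molecules in living cells and of the interactions among them. A key step toward understanding cellular systems is to map the physical interaction networks involving DNA, RNA, proteins, and small chemical molecules, so as to build an interactome network for a given species that is as complete and accurate as possible. Researchers have obtained large amounts of valuable interactome data through high-throughput experiments, computational prediction, and literature mining. To manage and use these data, they have also built many interactome databases. However, existing interactome databases are isolated from one another, forming so-called "information islands" that prevent data sharing and more effective use. To better manage and more effectively use existing interactome data, these independent databases need to be integrated into a coherent whole. This is important for raising the overall level of knowledge in interactome research and for a deeper, more comprehensive understanding of the field. Data integration has become one of the important directions of interactome research.

This study built an interactome data warehouse, InteractomeDW. InteractomeDW comprises four parts: a collection of interactome databases, a biological entity mapping database, a collection of biological ontology and controlled vocabulary databases, and a biological annotation database. InteractomeDW stores 62,779,056 interaction records, covering 51 interactome data sources, 9 auxiliary data sources, 5 interactome data types (protein interactions, domain interactions, intermolecular interactions, complexes, and pathways), 2,426 species, 170 interaction detection methods, 44 interaction types, and 85,212 references. To our knowledge, InteractomeDW is larger than the data warehouses built in existing related studies.

This study is the first to propose WM, a new heterogeneous data integration method that fuses the data warehouse approach with the mediation approach. WM manages data in the data warehouse manner to ensure the availability of data sources and to improve query efficiency and data quality. All interactome databases to be integrated are stored on a local server, which maximizes the availability of the data sources. Local storage also markedly improves the system's query efficiency and responsiveness. The data cleaning function of the interactome data warehouse can detect, correct, or delete corrupted, incomplete, or inaccurate dirty data in all interactome databases, which improves the quality of the integrated data. WM carries out the integration itself in the mediation manner to improve extensibility and maintainability. In WM, the scope of integration can be extended simply by registering a new data source in the mapping table of the mediator module and building a corresponding wrapper. Extending the system this way has no effect on its other parts, which makes the integration pluggable. This loosely coupled, flexible approach greatly improves maintainability. WM combines the advantages of both methods, balances efficiency and flexibility, and provides an infrastructure and solution for interactome data integration.

Using the WM method, this study built IMbase, a web-based heterogeneous interactome data integration system with high reliability, high data quality, high query efficiency, and strong extensibility. The purpose of IMbase is to let biologists access heterogeneous interactome databases transparently and use their data more effectively. IMbase is a basic platform for sharing and using interactome data. It provides biologists with interactome data integration, interaction network analysis and inference, and a corresponding Web Service development interface, which can help them generate interaction hypotheses and achieve knowledge discovery. IMbase integrates interactome-related data vertically. By summarizing and organizing existing data in a timely way, this enables broader data sharing within interactome research and raises the overall level of knowledge in the field. Building on this vertical integration, horizontal integration of data across fields and disciplines can then be pursued for more valuable knowledge discovery. To our knowledge, IMbase is the interactome data integration system with the largest data scale and the most complete functionality to date. Users can access IMbase free of charge at http://122.70.220.98/imbase/index.gr.

This study applied IMbase to research on mouse neural tube defects (NTDs). Using differentially expressed genes screened by expression profiling microarrays as baits, IMbase was used to obtain the genes corresponding to biological entities that interact with these differentially expressed genes and to construct the corresponding interaction networks. This study also built MouseNTDs, a database of known mouse NTD candidate genes. Potential NTD candidate genes were screened against MouseNTDs to determine which had already been recognized as mouse NTD candidate genes and which had not. Finally, by examining the annotation and pathway information of the screened potential candidates, this study proposed hypotheses about mouse NTD candidate genes, offering possible directions for further molecular biology experiments.

The main innovations of this study are:
1. A new heterogeneous data integration method, WM, is proposed. WM combines the advantages of the data warehouse and mediation approaches, balances the efficiency and flexibility of data integration, and provides an infrastructure and solution for heterogeneous interactome data integration.
2. An interactome data warehouse, InteractomeDW, was built. InteractomeDW stores more than 62 million (62,779,056) interaction records, covering 51 interactome data sources, 9 auxiliary data sources, 5 interactome data types (protein interactions, domain interactions, intermolecular interactions, complexes, and pathways), 2,426 species, 170 interaction detection methods, 44 interaction types, and 85,212 references.
3. A biological entity mapping database, BEM, was built. BEM is integrated from 5 related data sources and stores more than 180 million (180,328,282) non-redundant mapping records, covering 4 entity types (genes, proteins, small-molecule substances, and compounds). It supports entity mapping among 90 commonly used biomedical databases.
4. Using the WM method, a web-based heterogeneous interactome data integration system, IMbase, was built. IMbase is a computing platform for sharing and using interactome data. It provides services such as interactome data integration, interaction network analysis and inference, and biological entity mapping, and can help researchers generate interaction hypotheses and achieve knowledge discovery.
5. IMbase offers access not only through a web application but also through Web Services, giving software developers a programming interface and enabling software reuse and interoperability.
6. IMbase was applied to research on mouse neural tube defects (NTDs). By constructing and analyzing interaction networks related to potential mouse NTD candidate genes, hypotheses about mouse NTD candidate genes were proposed, providing reference directions for further molecular biology experiments.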
[Abstract]: One of the key objectives of post-genome biomedical research is to conduct a comprehensive and systematic study of all the molecules in living cells and their interactions. A key step in understanding cellular systems is to map the physical interaction networks involving DNA, RNA, proteins, and small chemical molecules, so as to form an interactome network that is as complete and accurate as possible for a given species. Researchers have obtained large amounts of valuable interactome data through high-throughput experiments, computational prediction, and literature mining, and have established many interactome databases to manage and use these data. However, existing interactome databases are isolated from one another and cannot share data. To better manage and use the existing interactome data, it is important to integrate these mutually independent databases.
This study established an interactome data warehouse, InteractomeDW. InteractomeDW includes four parts: a collection of interactome databases, a biological entity mapping database, a collection of biological ontology and controlled vocabulary databases, and a biological annotation database. InteractomeDW stores 62,779,056 interaction records, involving 51 interactome data sources, 9 auxiliary data sources, 5 interactome data types (protein interactions, domain interactions, intermolecular interactions, complexes, and pathways), 2,426 species, 170 interaction detection methods, 44 interaction types, and 85,212 references. As far as we know, InteractomeDW is larger than the data warehouses established in existing related research.
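This study is the first to propose WM, a new heterogeneous data integration method that combines the data warehouse approach with the mediation approach. WM manages data in the data warehouse manner to ensure the availability of data sources and to improve query efficiency and data quality, and it performs the integration itself in the mediation manner to improve extensibility and maintainability.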
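The abstract describes the mediation side of WM as pluggable: a new data source is added by registering it in the mediator's mapping table and building a wrapper for it, with no changes elsewhere. The following Python sketch illustrates that idea only. It is not the thesis's implementation; the `Mediator` and `Interaction` classes, the source names, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Interaction:
    """One record in the mediator's common schema (hypothetical)."""
    source: str
    entity_a: str
    entity_b: str


# A wrapper converts one source's locally stored rows into Interaction records.
Wrapper = Callable[[], Iterable[Interaction]]


class Mediator:
    """Keeps the mapping table from source name to wrapper."""

    def __init__(self) -> None:
        self._wrappers: Dict[str, Wrapper] = {}

    def register(self, name: str, wrapper: Wrapper) -> None:
        # Adding a source only touches this table, not the other wrappers.
        self._wrappers[name] = wrapper

    def query(self, entity: str) -> List[Interaction]:
        """Return interactions from every registered source that involve `entity`."""
        hits: List[Interaction] = []
        for wrapper in self._wrappers.values():
            for record in wrapper():
                if entity in (record.entity_a, record.entity_b):
                    hits.append(record)
        return hits


# Two made-up local tables with different layouts (assumed data).
SOURCE_A_ROWS = [("P1", "P2"), ("P2", "P3")]
SOURCE_B_ROWS = [{"left": "P1", "right": "P4"}]


def wrap_source_a() -> Iterable[Interaction]:
    for a, b in SOURCE_A_ROWS:
        yield Interaction("source_a", a, b)


def wrap_source_b() -> Iterable[Interaction]:
    for row in SOURCE_B_ROWS:
        yield Interaction("source_b", row["left"], row["right"])


mediator = Mediator()
mediator.register("source_a", wrap_source_a)
mediator.register("source_b", wrap_source_b)
results = mediator.query("P1")
```

Because both tables live in local memory here, the sketch also mirrors the warehouse side of WM in a loose sense: queries never leave the local store, and only the wrappers know each source's layout.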
Using the WM method, this study built IMbase, a web-based heterogeneous interactome data integration system with high reliability, high data quality, high query efficiency, and strong extensibility. The purpose of IMbase is to enable biologists to access heterogeneous interactome databases transparently and to use their data more effectively. IMbase is a basic platform for sharing and using interactome data. It provides biologists with functions such as interactome data integration, interaction network analysis and inference, and a corresponding Web Service development interface, which can help them generate interaction hypotheses and achieve knowledge discovery. IMbase integrates interactome-related data vertically. In this way, broader data sharing within interactome research can be achieved by summarizing and organizing existing data in a timely manner. Building on this, horizontal integration of data across fields and disciplines can be pursued for more valuable knowledge discovery. As far as we know, IMbase is the interactome data integration system with the largest data scale and the most complete functionality to date.
In this study, the IMbase system was applied to the study of mouse neural tube defects (NTDs).
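The abstract describes the NTD application as a bait-and-screen workflow: differentially expressed genes serve as baits, the genes of their interaction partners are collected, and those partners are checked against the MouseNTDs database of known candidates. A minimal Python sketch of that screening step follows. The gene names, the interaction map, and the `screen_candidates` function are placeholders invented for illustration, not data or code from the thesis.

```python
from typing import Dict, Iterable, Set

# Hypothetical interaction map: bait gene -> genes of its interaction partners.
INTERACTIONS: Dict[str, Set[str]] = {
    "bait1": {"geneA", "geneB"},
    "bait2": {"geneB", "geneC"},
}

# Hypothetical stand-in for the list of known mouse NTD candidate genes.
KNOWN_NTD_GENES: Set[str] = {"geneB", "geneX"}


def screen_candidates(
    baits: Iterable[str],
    interactions: Dict[str, Set[str]],
    known: Set[str],
) -> Dict[str, Set[str]]:
    """Split the baits' interaction partners into known and not-yet-known candidates."""
    bait_set = set(baits)
    partners: Set[str] = set()
    for bait in bait_set:
        partners |= interactions.get(bait, set())
    partners -= bait_set  # a bait is not its own candidate
    return {"known": partners & known, "novel": partners - known}


result = screen_candidates(["bait1", "bait2"], INTERACTIONS, KNOWN_NTD_GENES)
```

In this sketch the "novel" set corresponds to partners not yet recognized as candidates, which the study then examines through annotation and pathway information before proposing hypotheses.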
The main innovations of this study are:
1. A new heterogeneous data integration method, WM, is proposed. WM combines the advantages of the data warehouse and mediation approaches, balances the efficiency and flexibility of data integration, and provides an infrastructure and solution for heterogeneous interactome data integration.
2. An interactome data warehouse, InteractomeDW, was established. It stores more than 62 million (62,779,056) interaction records, involving 51 interactome data sources, 9 auxiliary data sources, 5 interactome data types (protein interactions, domain interactions, intermolecular interactions, complexes, and pathways), 2,426 species, 170 interaction detection methods, 44 interaction types, and 85,212 references.
3. A biological entity mapping database, BEM, was established. BEM is integrated from five related data sources. It stores more than 180 million (180,328,282) non-redundant mapping records, involving 4 entity types (genes, proteins, small-molecule substances, and compounds), and supports entity mapping among 90 commonly used biomedical databases.
4. Using the WM method, a web-based heterogeneous interactome data integration system, IMbase, was constructed. IMbase is a computing platform for sharing and using interactome data. It provides services such as interactome data integration, interaction network analysis and inference, and biological entity mapping, and can help researchers generate interaction hypotheses and achieve knowledge discovery.
5. IMbase provides access not only through a web application but also through Web Services, giving software developers a programming interface and enabling software reuse and interoperability.
6. IMbase was applied to the study of mouse neural tube defects (NTDs). By constructing and analyzing interaction networks related to potential mouse NTD candidate genes, hypotheses about mouse NTD candidate genes were put forward, providing reference directions for further molecular biology experiments.
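Item 3 above describes entity mapping between biomedical databases. One reason such a mapping matters for integration is that the same interaction may appear in two sources under different identifiers. The Python sketch below shows how a mapping table could let those records be merged. The table layout, database names, and identifiers are invented for illustration and do not reflect BEM's actual schema.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple

# Hypothetical mapping table: (database, identifier) -> canonical entity ID.
ID_MAP: Dict[Tuple[str, str], str] = {
    ("db_x", "X100"): "E1",
    ("db_y", "Y-7"): "E1",
    ("db_x", "X200"): "E2",
    ("db_y", "Y-9"): "E2",
}


def to_canonical(db: str, identifier: str) -> Optional[str]:
    """Look up the canonical ID, or None when the identifier is unmapped."""
    return ID_MAP.get((db, identifier))


def merge_interactions(
    records: Iterable[Tuple[str, str, str]],
) -> Set[FrozenSet[str]]:
    """Collapse (db, id_a, id_b) records into unordered canonical pairs."""
    merged: Set[FrozenSet[str]] = set()
    for db, id_a, id_b in records:
        a, b = to_canonical(db, id_a), to_canonical(db, id_b)
        if a is None or b is None:
            continue  # skip records with an unmapped participant
        merged.add(frozenset((a, b)))
    return merged


records = [
    ("db_x", "X100", "X200"),   # E1-E2 as reported by db_x
    ("db_y", "Y-9", "Y-7"),     # the same pair in db_y, reversed order
    ("db_y", "Y-7", "Y-404"),   # second participant has no mapping
]
pairs = merge_interactions(records)
```

Here the first two records collapse into a single canonical pair, and the third is dropped rather than guessed at. Skipping unmapped identifiers is a design choice of this sketch, not something the abstract specifies.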
【Degree-granting institution】: Peking Union Medical College
【Degree level】: Doctorate
【Year degree granted】: 2011
【Classification number】: R346
Article ID: 1797855
Article link: https://www.wllwen.com/xiyixuelunwen/1797855.html