Research on Cleaning and Integration Techniques for Biomedical Linked Data
[Abstract]: In recent years, the rapid development of Semantic Web technology has facilitated the integration and presentation of massive data. Because the biomedical field produces large volumes of data across many sub-fields, the need to clean and integrate the RDF datasets published by various organizations has become increasingly prominent. Many previous efforts have used Semantic Web standards and technologies to build linked data networks for massive biomedical data. However, biomedical datasets published with Semantic Web technology usually provide cross-references to other datasets, and these references often contain errors or fail to fully express the link relationships between datasets. The integrated data must be retrieved with SPARQL queries, which hinders its use by users outside the Semantic Web field (such as biomedical professionals). Different datasets also use different ontologies, which makes it difficult to integrate the results of cross-dataset queries. This thesis analyzes the linked data of biomedical datasets and studies data cleaning and data integration techniques to address these problems. Data cleaning analyzes and validates the data and corrects duplicate, erroneous, and missing data. Semantic Web data integration involves ontology matching, entity linking, and related techniques. Ontology matching unifies the classes and properties of different datasets, and entity linking connects the records in different datasets that refer to the same entity. The main contributions of this thesis are as follows:
1. Based on the Bio2RDF project, mainstream biomedical linked data is surveyed and analyzed. Three kinds of link graphs are constructed (dataset links, entity links, and terminology links), and the relationships among them are analyzed. The dataset link graph is found to exhibit the small-world phenomenon, the degree distribution of entity links does not strictly follow a power law, and there is considerable overlap between different datasets. In addition, a standard test set is constructed for evaluating entity linking methods. The results indicate that link analysis can be applied to the analysis of biomedical datasets.
2. The selected datasets are cleaned: string detection, machine learning, and other methods are used to fill in missing data, correct erroneous data, and eliminate duplicate data introduced by automatic conversion and manual entry. In addition, based on the symmetry and transitivity of entity links, missing links between datasets are completed and erroneous links are corrected, improving both data quality and link quality.
3. The cleaned datasets are integrated into BioSearch, an ontology-based federated search engine over the datasets, and ontology matching is used to support federated queries across datasets. The system gives users a simple and efficient interface for querying and retrieving data. Experimental results show that the federated query and semantic query interfaces defined in this thesis are more efficient than two existing linked data search engines. The facet filtering and entity browsing functions implemented in BioSearch are also shown to improve the user experience.
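The abstract does not say how missing links are derived from symmetry and transitivity, so the following is only a minimal sketch of one common approach, not the thesis's method. It treats entity links as undirected edges (symmetry), groups linked entities with a union-find structure (transitivity), and reports every pair inside a group that has no explicit link. The function name `infer_missing_links` and the identifiers `dsA:e1`, `dsB:e1`, `dsC:e1` are hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def find(parent, x):
    """Return the representative of x's group, shortening paths as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def infer_missing_links(links):
    """Return links implied by symmetry and transitivity but absent from `links`.

    `links` is an iterable of (entity, entity) pairs; direction is ignored.
    """
    links = list(links)
    parent = {}
    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    # Collect the members of each group of mutually linked entities.
    clusters = {}
    for node in parent:
        clusters.setdefault(find(parent, node), []).append(node)

    # A frozenset ignores direction, which encodes symmetry.
    known = {frozenset(pair) for pair in links}
    missing = set()
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            if frozenset((a, b)) not in known:
                missing.add((a, b))
    return sorted(missing)


# Hypothetical identifiers: the same entity as published in three datasets.
example_links = [("dsA:e1", "dsB:e1"), ("dsB:e1", "dsC:e1")]
print(infer_missing_links(example_links))  # [('dsA:e1', 'dsC:e1')]
```

One caveat with this kind of closure is that a single erroneous link merges two unrelated groups and propagates the error to every inferred pair, so in practice it would need to be combined with a link-validation step such as the error correction the abstract mentions.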
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13
Article ID: 2368234
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2368234.html