Research and Design of a Python-Based Web Crawler for Gene Expression Data

Published: 2018-05-18 04:30

Topic: GEO database + web crawler; Source: master's thesis, Shanxi Medical University, 2017

[Abstract]: Objective: Taking the Gene Expression Omnibus (GEO), the open gene expression database created by NCBI, as an example, this study develops a crawler program to address the problems posed by the rapidly growing volume of high-throughput gene expression experimental data. The aim is to mine and process this information without being overwhelmed by its sheer volume, to improve utilization of the database, to reduce the waste of biomedical information resources, to provide medical workers with comprehensive gene expression data, and to promote the development of clinical bioinformatics.
Methods: (1) Literature analysis: literature on web crawler systems, web page crawling techniques, and the GEO database was reviewed to understand the current state of web crawler systems, the strategies used in web page crawling, and the development of GEO. This provided a theoretical reference and practical experience for designing a web crawler system dedicated to retrieving RNA-related data from GEO. (2) Programming language: the crawler was written in Python. (3) Database technology: MySQL was used to store the gene expression data retrieved by the crawler.
Results: (1) A crawler program was successfully developed and put into operation. (2) The crawler retrieved all gene expression data in the GEO database, 71,032 items in total, and saved them in a MySQL database.
Conclusion: The crawler automatically retrieves gene expression data from the GEO database, removing the tedium of manual downloading and making large-scale data download practical. It efficiently extracts useful information and biological knowledge from the massive holdings of the database, helps clinical researchers browse biomedical literature, and supports batch downloading of data resources, which greatly facilitates the lookup and reuse of biological research information. The retrieved data can substantially advance basic medical research and are also significant for the prevention and treatment of human diseases and for gene mapping.
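The abstract names the pipeline (a Python crawler that retrieves GEO records and stores them in MySQL) but gives no implementation details. The following is only a minimal sketch of that kind of pipeline, not the thesis author's code. It assumes records are requested one accession at a time from GEO's `acc.cgi` endpoint in SOFT text form, and it uses Python's built-in `sqlite3` as a stand-in for MySQL so the example stays self-contained. The function names, the table name `geo_series`, and the choice of the `Series_title` and `Series_sample_organism` fields are all illustrative assumptions.

```python
import sqlite3
import urllib.parse
import urllib.request

# GEO accession display endpoint (real URL; the parameter choice below is an assumption).
GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def build_url(accession):
    """Build a URL asking for a brief SOFT-format text record for one accession."""
    query = urllib.parse.urlencode(
        {"acc": accession, "targ": "self", "form": "text", "view": "brief"}
    )
    return f"{GEO_ACC_URL}?{query}"


def fetch_soft(accession, timeout=30):
    """Download the SOFT text for one accession (requires network access)."""
    with urllib.request.urlopen(build_url(accession), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_soft(text):
    """Collect '!key = value' attribute lines; repeated keys are joined with '; '."""
    record = {}
    for line in text.splitlines():
        if not line.startswith("!") or " = " not in line:
            continue
        key, value = line[1:].split(" = ", 1)
        key, value = key.strip(), value.strip()
        record[key] = f"{record[key]}; {value}" if key in record else value
    return record


def save_record(conn, accession, record):
    """Store one parsed record; sqlite3 is used here in place of MySQL (assumption)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS geo_series ("
        "accession TEXT PRIMARY KEY, title TEXT, organism TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO geo_series VALUES (?, ?, ?)",
        (
            accession,
            record.get("Series_title"),
            record.get("Series_sample_organism"),
        ),
    )
    conn.commit()
```

A run for a single series would call `save_record(conn, "GSE1", parse_soft(fetch_soft("GSE1")))` with `conn = sqlite3.connect("geo.db")`. A crawler covering the whole database would loop over accessions, pause between requests, and handle failed downloads, none of which is shown here.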
[Degree-granting institution]: Shanxi Medical University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q811.4

[References]