
Deep Learning-Based QSAR Classification of Compounds and Prediction of the Organic Carbon Adsorption Coefficient

Published: 2018-04-20 23:02

  Topics: machine learning + deep learning; Source: Xinjiang University, master's thesis, 2017


[Abstract]: With the rapid development and widespread use of computer technology and the geometric growth of the big data industry, research on quantitative structure-activity/property relationships (QSAR/QSPR) of compounds has also advanced rapidly and reached a higher level. From its initial applications in biology, it has gradually spread to pharmaceutical science, environmental science, medicinal chemistry, drug design, medicine, and many other fields. Its aim is to use computational and statistical methods to study the relationship between the structural parameters of compounds and their physicochemical properties and biological activities, and thereby to understand the microstructure of compounds at the molecular level. Because the field is broad, its objects of study include the biological activity of compounds, drug toxicity, and the rate at which drugs are absorbed in the human body. It matters especially in environmental chemistry, because the large number of organic compounds entering the environment poses a serious hazard to natural ecosystems and to humans.

QSAR modeling has so far generally relied on shallow machine learning methods, such as heuristic methods, multiple linear regression, radial basis function neural networks, back-propagation neural networks, and support vector machines. What these models have in common is that they are suited to settings with few samples and problems that are not especially complex, which limits their ability to generalize to complex problems and massive data. In recent years deep learning, a branch of machine learning, has been widely applied in many fields and has produced a series of satisfactory results. In the big data era in particular, deep learning is needed to handle many problems that shallow machine learning models cannot solve.

This thesis takes oral bioavailability, inhibition of the CYP450 1A2 enzyme, and logKoc as its research objects and, building on deep learning algorithms, establishes deep learning-based QSAR classification models and a logKoc prediction model. The work consists of three parts.

The first part studies oral bioavailability. 2D and 3D molecular features generated by molecular computation software serve as input to a stacked autoencoder, which learns molecular features automatically, and a softmax layer performs the oral bioavailability classification. The model is compared with shallow models (support vector machine and artificial neural network) to verify the effectiveness of the stacked autoencoder for oral bioavailability classification.

The second part is a CYP450 1A2 inhibition classification model based on a deep belief network (DBN). The experiments use 13,000 compounds as the data set, with molecular structures represented by PubChem and MACCS molecular fingerprints. The DBN's semi-supervised learning learns more essential feature representations from the preprocessed features, avoiding manual feature extraction, and the model classifies compounds by CYP450 1A2 inhibition.

The third part is a deep learning method based on the undirected graph recursive neural network (UGRNN). The molecular structure of each compound is first represented as an undirected graph, and a recursive neural network then extracts features from the molecular graph to predict logKoc. In addition, the Pearson correlation coefficient method is used to identify the lipid-water partition coefficient (logP) as an additional input to the model (abbreviated UGRNN+logP), which further improves prediction accuracy.
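The abstract says only that the Pearson correlation coefficient method was used to pick logP as an extra input. As a minimal sketch of how such a screening step might look, the Python below ranks candidate descriptors by the absolute value of their Pearson correlation with the target. The function names `pearson` and `rank_descriptors`, the descriptor name `mol_weight`, and all numeric values are hypothetical illustrations, not data or code from the thesis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rank_descriptors(descriptors, target):
    """Return (name, r) pairs sorted by |r| against the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in descriptors.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up toy values for illustration only; NOT measurements from the thesis.
toy_descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mol_weight": [120.0, 95.0, 180.0, 140.0, 160.0],
}
toy_log_koc = [1.2, 1.9, 3.1, 3.8, 5.0]

ranking = rank_descriptors(toy_descriptors, toy_log_koc)
best_name, best_r = ranking[0]
print(f"most correlated descriptor: {best_name} (r = {best_r:.3f})")
```

In a setup like this, the top-ranked descriptor would be appended to the features learned from the molecular graph before the final regression layer. How the thesis actually combines the two inputs is not stated in the abstract.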
[Degree-granting institution]: Xinjiang University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: O621.1;TP18



