Development and Application of SAGE, a Statistical Analysis Software Package for Genetic Epidemiology
[Abstract]:Background and research objectives
Genetic epidemiology is a frontier discipline that has developed rapidly in recent years. It studies the genetic and environmental factors that affect the distribution of diseases in different populations and proposes reasonable preventive measures. Its theoretical basis is population genetics and epidemiology. It mainly applies epidemiological methods for collecting and processing population data, together with the experimental methods of molecular genetics and the principles and methods of biostatistics, to explore the individual effects of genetic and environmental factors on disease as well as their combined effects. With the discovery of polymorphic sequence markers during human genome sequencing, the search for disease genes is accelerating, and the study of polygenic diseases has long been a focus of attention.
To date, an effective research framework has been established for single-gene diseases that follow Mendelian inheritance, and nearly one thousand disease-causing genes have been cloned. Polygenic diseases, however, are complex traits that show some familial clustering but do not fully follow Mendel's laws of inheritance. Mapping and genetic analysis of their susceptibility genes therefore still face many problems, and in recent years this has become both a difficulty and a hot spot in medical genetics and gene research. Linkage and linkage disequilibrium analysis have become important methods for gene mapping. However, because genetic data are voluminous, structurally complex and complicated to analyze, general statistical methods and software cannot make full use of the information in the data, and their analytical power is limited.
Specialized programs exist: for example, FASTLINK, LINKAGE, VITESSE, GENEHUNTER, MERLIN and MELINK are available for parametric linkage analysis, while GENEHUNTER, MERLIN and MELINK are available for non-parametric linkage analysis. With its huge population and abundant demographic data, China is a rich resource for studying human genetic information. At present, however, statistics and genetics are not well integrated, which leaves geneticists with many problems in information collection and data analysis, such as what kind of data to collect, what sample size is needed and which statistical genetic method to use. Regrettably, the information cannot be fully utilized, and much of it is wasted.
Because the phenotype and genotype of polygenic diseases do not correspond strictly one to one, a variety of analytical methods must be used when analyzing the data. This increasingly exposes the limitations of specialized genetic analysis programs. Moreover, foreign software is generally in English, so geneticists spend a great deal of manpower and material resources learning these programs. A powerful, comprehensive statistical genetics package is therefore urgently needed, and the genetic epidemiology package SAGE (Statistical Analysis for Genetic Epidemiology) meets this need. SAGE was created by the Human Genetic Analysis Resource (HGAR), which is based in the Department of Epidemiology and Biostatistics of Case Western Reserve University (CWRU) in Cleveland, USA, and was funded by the US Public Health Service and the National Center for Research Resources of the NIH. The software was developed in 1987 by the team of R. C. Elston, a well-known statistical geneticist. It has been updated continuously since then, from the initial version 1.0 to the current version 5.3.0. Its functions keep expanding, and its role in genetic epidemiological analysis is receiving more and more attention.
Research methods
Using the five example files supplied with SAGE as the original data files, each functional module is analyzed in detail. SAGE has one custom module and 18 function modules, which are presented in 18 chapters.
Chapter 1: Overview of SAGE. Describes the input and output files, the running environment and the characteristics of the basic functional modules of SAGE. Users should pay attention to the system requirements when installing the software.
Chapter 2: Creating, editing and sorting SAGE data files. Mainly introduces three methods of creating data files, as well as importing, exporting and renaming projects.
Chapter 3: User-defined functional module. Mainly introduces how to create genome data files and new variables, with the emphasis on creating new variables.
Chapter 4: General descriptive statistics in SAGE (PEDINFO). Mainly introduces the function, principle and operation of PEDINFO and explains its results, with the emphasis on interpreting the results. Each of the following 14 chapters covers the function, principle, operating procedure and main output of one module.
Chapter 5: Detection of Mendelian inconsistencies (MARKERINFO). Mainly used to detect non-Mendelian inheritance in pedigree marker data, helping users find inconsistent data. A prerequisite is understanding Mendel's laws of inheritance.
Chapter 6: Reclassification of relative pairs (RELTEST). Relative pairs are reclassified using genome-wide multilocus scan data, based mainly on the principle of identity-by-descent (IBD) allele sharing. The emphasis is on understanding IBD and identity by state (IBS) and on interpreting the results.
Chapter 7: Allele frequency estimation (FREQ). Estimates allele frequencies from individuals in known pedigree structures and generates marker locus description files. The resulting locus files can be used by GENIBD, MLOD and other SAGE programs. The main functions of this module are to output locus files and kinship coefficients.
Chapter 8: Allelic association or quantitative-trait transmission disequilibrium test (ASSOC). Mainly used for estimation in pedigree data. Covariates can be derived from marker phenotypes, and familial residual correlations or heritability can be estimated.
Chapter 9: Familial correlation analysis (FCOR). Mainly used to estimate multivariate correlations for all types of relative pairs in a pedigree, together with their asymptotic standard errors.
Chapter 10: Commingling analysis and complex segregation analysis (SEGREG). Mainly used to fit and select segregation models while allowing for the familial correlations present in the data. The trait can be continuous, binary, or binary with age-related onset. The module produces a penetrance file for model-based linkage analysis. The emphasis is on selecting suitable models for different traits.
Chapter 11: GENIBD. This module mainly uses a variety of algorithms to calculate, for pedigree data, single-point and multipoint IBD allele-sharing distributions. The emphasis is on choosing different models for different data.
Chapter 12: Age-of-onset analysis (AGEON). Applies to the simultaneous comparison of age-of-onset distribution data between affected and unaffected pairs, allowing for covariate adjustment of the mean, variance or skewness of the distribution.
Chapter 13: Haplotype analysis (DECIPHER). Mainly used to obtain maximum likelihood estimates of haplotype frequencies for autosomes or the X chromosome in a population.
Chapter 14: Model-based two-point linkage analysis (LODLINK). Mainly used to calculate two-point LOD scores between a model-based main trait and marker loci. The main trait may be any marker or other trait that follows Mendelian transmission. The emphasis is on naming the main trait and on the penetrance file generated by the SEGREG program.
Chapter 15: Model-based multipoint linkage analysis (MLOD). Mainly used to perform model-based multipoint linkage analysis on small or large pedigrees. The emphasis is on generating the genome data file and identifying the main trait.
Chapter 16: Sib-pair linkage analysis (SIBPAL). It can use IBD allele-sharing information at a single point or at multiple points. Binary and continuous traits can be analyzed with multiple loci simultaneously, including epistatic interactions and covariate effects. The emphasis is on the settings required for different traits.
Chapter 17: LOD-score linkage analysis of affected sib pairs (LODPAL). The program is based on LOD scores of affected sib pairs. Currently, the general conditional logistic regression model is implemented. Attention should be paid to the setting of effectiveness.
Chapter 18: Transmission disequilibrium test (TDT). The TDT in the program is based on the basic transmission disequilibrium model. It is used to test for linkage between a marker locus and a disease locus when linkage disequilibrium is known to be present. The disease trait is a binary variable. A prerequisite is mastering the principle of the TDT.
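As an illustration of the principle behind the last chapter, the classic TDT can be sketched in a few lines. This is a minimal sketch of the textbook McNemar-type statistic for a biallelic marker, not the implementation used in SAGE; the function names and the input layout (one record per parent, giving the parental genotype and the allele passed to an affected child) are assumptions made for the example.

```python
import math

def count_transmissions(parent_records, allele):
    """Count transmissions from heterozygous parents of affected children.

    parent_records: iterable of ((a1, a2), sent) pairs, where (a1, a2) is the
    parental genotype and `sent` is the allele passed to the affected child
    (hypothetical layout chosen for this sketch).
    Returns (b, c): times `allele` was transmitted / not transmitted.
    """
    b = c = 0
    for (a1, a2), sent in parent_records:
        # Only parents heterozygous for the allele of interest are informative.
        if a1 == a2 or allele not in (a1, a2):
            continue
        if sent == allele:
            b += 1
        else:
            c += 1
    return b, c

def tdt_statistic(b, c):
    """Return the TDT chi-square (1 df) and its p-value."""
    total = b + c
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (b - c) ** 2 / total
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

records = [
    (("A", "a"), "A"), (("A", "a"), "A"), (("a", "A"), "A"),
    (("A", "a"), "a"),
    (("A", "A"), "A"), (("a", "a"), "a"),  # homozygous: ignored
]
b, c = count_transmissions(records, "A")
chi2, p_value = tdt_statistic(b, c)
```

Under the null hypothesis of no linkage, a heterozygous parent transmits either allele with probability 1/2, so a large imbalance between the two counts is evidence against the null.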
Results
With this work, geneticists can make full use of their data in statistical genetic analysis, saving manpower and material resources. Learning the software can also guide geneticists in collecting genetic data and help them use those data as fully as possible, thereby speeding up the development of genetic epidemiology.
[Degree-granting institution]: Southern Medical University
[Degree level]: Master's
[Year degree granted]: 2007
[Classification number]: TP311.52; R181.3
Article ID: 2179580
Article link: https://www.wllwen.com/yixuelunwen/liuxingb/2179580.html