Improving GWAS discovery and genomic prediction accuracy in biobank data
Authors: Etienne J Orliac, Daniel Trejo Banos, Sven E Ojavee, Kristi Läll, Reedik Mägi, Peter M Visscher, Matthew R Robinson
Abstract: Genetically informed, deep-phenotyped biobanks are an important research resource and it is imperative that the most powerful, versatile, and efficient analysis approaches are used. Here, we apply our recently developed Bayesian grouped mixture of regressions model (GMRM) in the UK and Estonian Biobanks and obtain the highest genomic prediction accuracy reported to date across 21 heritable traits. When compared to other approaches, GMRM accuracy was greater than annotation prediction models run in the LDAK or LDPred-funct software by 15% (SE 7%) and 14% (SE 2%), respectively, and was 18% (SE 3%) greater than a baseline BayesR model without single-nucleotide polymorphism (SNP) markers grouped into minor allele frequency–linkage disequilibrium (MAF-LD) annotation categories. For height, the prediction accuracy R² was 47% in a UK Biobank holdout sample, which was 76% of the estimated SNP heritability (h²_SNP). We then extend our GMRM prediction model to provide mixed-linear model association (MLMA) SNP marker estimates for genome-wide association (GWAS) discovery, which increased the independent loci detected to 16,162 in unrelated UK Biobank individuals, compared to 10,550 from BoltLMM and 10,095 from Regenie, a 62 and 65% increase, respectively. The average χ² value of the leading markers increased by 15.24 (SE 0.41) for every 1% increase in prediction accuracy gained over a baseline BayesR model across the traits. Thus, we show that modeling genetic associations accounting for MAF and LD differences among SNP markers, and incorporating prior knowledge of genomic function, is important for both genomic prediction and discovery in large-scale individual-level studies.

As biobank datasets increase in size, it is important to understand the factors limiting the prediction of phenotype from genotype. Alongside others, we have recently shown that genomic prediction accuracy can be improved through the use of random-effects models that incorporate prior knowledge of genomic annotations and allow for differences in the variance explained by single-nucleotide polymorphism (SNP) markers, depending upon their linkage disequilibrium (LD) and their minor allele frequency (MAF) (1–8). These improvements in prediction accuracy should also translate into greater genome-wide association study (GWAS) discovery power. Mixed-linear models of association (MLMA) are commonly applied in GWASs in a two-step approach, where a random-effects model is first used to estimate leave-one-chromosome-out (LOCO) genetic values, and these are then used in a second marginal regression coefficient estimation step. Theory suggests that the test statistics obtained in the MLMA second step depend upon the accuracy of the LOCO genomic predictors produced from the first step. Current MLMA implementations use a blocked ridge regression model (9), a Restricted Maximum Likelihood (REML) genomic relationship model (10), or a Bayesian spike-and-slab model (11) within the first step.

Here, we improve the computational implementation of our recently developed Bayesian grouped mixture of regressions model (GMRM), which estimates genetic marker effects jointly, but with independent marker inclusion probabilities and independent h²_SNP parameters across LD, MAF, and functional annotation groups (Materials and Methods). This allows us to apply the model to 21 traits in the UK Biobank to test for prediction accuracy improvements over existing approaches. We then extend the model to provide MLMA SNP marker association estimates to test whether improved prediction accuracy translates to improved GWAS discovery compared to existing MLMA approaches.
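
The two-step MLMA procedure described in the introduction can be illustrated with a minimal sketch. It is not the GMRM implementation: step 1 uses a plain ridge regression (a single shrinkage parameter, no MAF-LD or annotation groups) as a stand-in for the random-effects model, and no test-statistic calibration is applied. The function name `loco_mlma`, its parameters, and the assumption that genotype columns are already standardized are all hypothetical choices made for illustration.

```python
import numpy as np


def loco_mlma(X, y, chrom, ridge_lambda=1.0):
    """Illustrative two-step LOCO association scan (not the authors' GMRM code).

    X            : (n, p) genotype matrix, columns assumed centered and scaled
    y            : (n,) phenotype vector
    chrom        : (p,) chromosome label for each SNP column
    ridge_lambda : ridge penalty used in step 1 (hypothetical default)

    Returns a (p,) array of per-SNP chi-square statistics.
    """
    n, p = X.shape
    y = y - y.mean()
    chi2 = np.empty(p)

    for c in np.unique(chrom):
        on_chrom = chrom == c

        # Step 1: predict the phenotype from SNPs on all OTHER chromosomes.
        # Dual-form ridge regression: g = K (K + lambda I)^-1 y, K = X_off X_off'.
        X_off = X[:, ~on_chrom]
        K = X_off @ X_off.T
        g_loco = K @ np.linalg.solve(K + ridge_lambda * np.eye(n), y)

        # Remove the LOCO genetic value so the left-out chromosome is tested
        # against a residual phenotype.
        resid = y - g_loco

        # Step 2: marginal (one SNP at a time) regression on the residual.
        Xc = X[:, on_chrom]
        xtx = np.einsum("ij,ij->j", Xc, Xc)
        beta = Xc.T @ resid / xtx
        rss = ((resid[:, None] - Xc * beta) ** 2).sum(axis=0)
        se2 = rss / (n - 2) / xtx
        chi2[on_chrom] = beta ** 2 / se2

    return chi2
```

In this sketch, a more accurate step-1 predictor leaves less unexplained variance in `resid`, which shrinks the standard errors in step 2. That is the dependence between LOCO prediction accuracy and test statistics that the text refers to.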
Keywords: genomic prediction, association study, Bayesian penalized regression