Similar articles
20 similar articles found
1.
Variable selection is growing in importance with the advent of high throughput genotyping methods requiring analysis of hundreds to thousands of single nucleotide polymorphisms (SNPs) and the increased interest in using these genetic studies to better understand common, complex diseases. Up to now, the standard approach has been to analyze the genotypes for each SNP individually to look for an association with a disease. Alternatively, combinations of SNPs or haplotypes are analyzed for association. Another added complication in studying complex diseases or phenotypes is that genetic risk for the disease is often due to multiple SNPs in various locations on the chromosome with small individual effects that may have a collectively large effect on the phenotype. Hence, multi-locus SNP models, as opposed to single SNP models, may better capture the true underlying genotypic-phenotypic relationship. Thus, innovative methods for determining which SNPs to include in the model are needed. The goal of this article is to describe several methods currently available for variable and model selection using Bayesian approaches and to illustrate their application for genetic association studies using both real and simulated candidate gene data for a complex disease. In particular, Bayesian model averaging (BMA), stochastic search variable selection (SSVS), and Bayesian variable selection (BVS) using a reversible jump Markov chain Monte Carlo (MCMC) for candidate gene association studies are illustrated using a study of age-related macular degeneration (AMD) and simulated data.

2.
Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Study of gene-environment (G×E) interactions is important for elucidating the disease etiology. Existing Bayesian methods for G×E interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. Many studies have shown the advantages of penalization methods in detecting G×E interactions in “large p, small n” settings. However, Bayesian variable selection, which can provide fresh insight into G×E study, has not been widely examined. We propose a novel and powerful semiparametric Bayesian variable selection model that can investigate linear and nonlinear G×E interactions simultaneously. Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from the main-effects-only case within the Bayesian framework. Spike-and-slab priors are incorporated on both individual and group levels to identify the sparse main and interaction effects. The proposed method conducts Bayesian variable selection more efficiently than existing methods. Simulation shows that the proposed model outperforms competing alternatives in terms of both identification and prediction. The proposed Bayesian method leads to the identification of main and interaction effects with important implications in a high-throughput profiling study with high-dimensional SNP data.

3.
In the last decade, numerous genome‐wide linkage and association studies of complex diseases have been completed. The critical question remains of how to best use this potentially valuable information to improve study design and statistical analysis in current and future genetic association studies. With genetic effect size for complex diseases being relatively small, the use of all available information is essential to untangle the genetic architecture of complex diseases. One promising approach to incorporating prior knowledge from linkage scans, or other information, is to up‐ or down‐weight P‐values resulting from a genetic association study in either a frequentist or Bayesian manner. As an alternative to these methods, we propose a fully Bayesian mixture model to incorporate previous knowledge into on‐going association analysis. In this approach, both the data and previous information collectively inform the association analysis, in contrast to modifying the association results (P‐values) to conform to the prior knowledge. By using a Bayesian framework, one has flexibility in modeling, and is able to comprehensively assess the impact of model specification on posterior inferences. We illustrate the use of this method through a genome‐wide linkage study of colorectal cancer, and a genome‐wide association study of colorectal polyps. Genet. Epidemiol. 34:418–426, 2010. © 2010 Wiley‐Liss, Inc.

4.
A Bayesian toolkit for genetic association studies   Cited by: 3 (self-citations: 0; citations by others: 3)
We present a range of modelling components designed to facilitate Bayesian analysis of genetic-association-study data. A key feature of our approach is the ability to combine different submodels together, almost arbitrarily, for dealing with the complexities of real data. In particular, we propose various techniques for selecting the "best" subset of genetic predictors for a specific phenotype (or set of phenotypes). At the same time, we may control for complex, non-linear relationships between phenotypes and additional (non-genetic) covariates as well as accounting for any residual correlation that exists among multiple phenotypes. Both of these additional modelling components are shown to potentially aid in detecting the underlying genetic signal. We may also account for uncertainty regarding missing genotype data. Indeed, at the heart of our approach is a novel method for reconstructing unobserved haplotypes and/or inferring the values of missing genotypes. This can be deployed independently or, alternatively, it can be fully integrated into arbitrary genotype- or haplotype-based association models such that the missing data and the association model are "estimated" simultaneously. The impact of such simultaneous analysis on inferences drawn from the association model is shown to be potentially significant. Our modelling components are packaged as an "add-on" interface to the widely used WinBUGS software, which allows Markov chain Monte Carlo analysis of a wide range of statistical models. We illustrate their use with a series of increasingly complex analyses conducted on simulated data based on a real pharmacogenetic example.

5.
Several methods have been proposed to allow functional genomic information to inform prior distributions in Bayesian fine-mapping case–control association studies. None of these methods allow the inclusion of partially observed functional genomic information. We use functional significance (FS) scores that combine information across multiple bioinformatics sources to inform our effect size prior distributions. These scores are not available for all single-nucleotide polymorphisms (SNPs) but by partitioning SNPs into naturally occurring FS score groups, we show how missing FS scores can easily be accommodated via finite mixtures of elicited priors. Most current approaches adopt a formal Bayesian variable selection approach and either limit the number of causal SNPs allowed or use approximations to avoid the need to explore the vast parameter space. We focus instead on achieving differential shrinkage of the effect sizes through prior scale mixtures of normals and use marginal posterior probability intervals to select candidate causal SNPs. We show via a simulation study how this approach can improve localisation of the causal SNPs compared to existing multi-SNP fine-mapping methods. We also apply our approach to fine-mapping a region around the CASP8 gene using the iCOGS consortium breast cancer SNP data.

6.
Recent advances in next-generation sequencing technologies facilitate the detection of rare variants, making it possible to uncover the roles of rare variants in complex diseases. As any single rare variant contains little variation, association analysis of rare variants requires statistical methods that can effectively combine the information across variants and estimate their overall effect. In this study, we propose a novel Bayesian generalized linear model for analyzing multiple rare variants within a gene or genomic region in genetic association studies. Our model can deal with complicated situations that have not been fully addressed by existing methods, including issues of disparate effects and nonfunctional variants. Our method jointly models the overall effect and the weights of multiple rare variants and estimates them from the data. This approach assigns different weights to different variants based on their contributions to the phenotype, yielding an effective summary of the information across variants. We evaluate the proposed method and compare its performance to existing methods on extensive simulated data. The results show that the proposed method performs well under all situations and is more powerful than existing approaches.

7.
The default causal single-nucleotide polymorphism (SNP) effect size prior in Bayesian fine-mapping studies is usually the Normal distribution. This choice is often based on computational convenience, rather than evidence that it is the most suitable prior distribution. The choice of prior is important because previous studies have shown considerable sensitivity of causal SNP Bayes factors to the form of the prior. In some well-studied diseases there are now considerable numbers of genome-wide association study (GWAS) top hits along with estimates of the number of yet-to-be-discovered causal SNPs. We show how the effect sizes of the top hits and estimates of the number of yet-to-be-discovered causal SNPs can be used to choose between the Laplace and Normal priors, to estimate the prior parameters and to quantify the uncertainty in this estimation. The methodology can readily be applied to other priors. We show that the top hits available from breast cancer GWAS provide overwhelming support for the Laplace over the Normal prior, which has important consequences for variant prioritisation. The work in this paper enables practitioners to derive more objective priors than are currently being used and could lead to prioritisation of different variants.

8.
Studies of gene‐trait associations for complex diseases often involve multiple traits that may vary by genotype groups or patterns. Such traits are usually manifestations of lower‐dimensional latent factors or disease syndromes. We illustrate the use of a variance components factor (VCF) model to model the association between multiple traits and genotype groups as well as any other existing patient‐level covariates. This model characterizes the correlations between traits as underlying latent factors that can be used in clinical decision‐making. We apply it within the Bayesian framework and provide a straightforward implementation using the WinBUGS software. The VCF model is illustrated with simulated data and an example that comprises changes in plasma lipid measurements of patients who were treated with statins to lower low‐density lipoprotein cholesterol, and polymorphisms from the apolipoprotein‐E gene. The simulation shows that this model clearly characterizes existing multiple trait manifestations across genotype groups where individuals' group assignments are fully observed or can be deduced from the observed data. It also allows one to investigate covariate by genotype group interactions that may explain the variability in the traits. The flexibility to characterize such multiple trait manifestations makes the VCF model more desirable than the univariate variance components model, which is applied to each trait separately. The Bayesian framework offers a flexible approach that allows one to incorporate prior information. Genet. Epidemiol. 34: 529–536, 2010. © 2010 Wiley‐Liss, Inc.

9.
Zheng G, Meyer M, Li W, Yang Y. Statistics in Medicine 2008;27(24):5054-5075
To test for genetic association between a marker and a complex disease using a case-control design, Cochran-Armitage trend tests (CATTs) and Pearson's chi-square test are often employed. Both tests are genotype-based. Song and Elston (Statist. Med. 2006; 25:105-126) introduced the Hardy-Weinberg disequilibrium trend test and combined it with CATT to test for association. Compared to using a single statistic to test for case-control genetic association (referred to as single-phase analysis), two-phase analysis is a new strategy in that it employs two test statistics in one analysis framework, each statistic using all available case-control data. Two such two-phase analysis procedures were studied, in which Hardy-Weinberg equilibrium (HWE) in the population is a key assumption, although the procedures are robust to moderate departure from HWE. Our goal in this article is to study a new two-phase procedure and compare all three two-phase analyses and common single-phase procedures by extensive simulation studies. For illustration, the results are applied to real data from two case-control studies. On the basis of the results, we conclude that with an appropriate choice of significance level for the analysis in phase 1, some two-phase analyses could be more powerful than commonly used test statistics.
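
For readers unfamiliar with the Cochran-Armitage trend test named in this abstract, a minimal sketch of the single-phase statistic for a 2 x 3 case-control genotype table may help. It is not the two-phase procedure studied by the authors. The function name `cochran_armitage_trend`, the default additive scores (0, 1, 2), and the example counts are assumptions made here for illustration only.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0.0, 1.0, 2.0)):
    """Return (z, two-sided p) for a case-control genotype table.

    cases / controls: counts per genotype (e.g. aa, Aa, AA).
    weights: genotype scores; (0, 1, 2) is the additive coding (an assumption).
    """
    if not (len(cases) == len(controls) == len(weights)):
        raise ValueError("cases, controls and weights must have equal length")

    col_totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n_total = n_cases + n_controls

    sum_xr = sum(x * r for x, r in zip(weights, cases))
    sum_xn = sum(x * n for x, n in zip(weights, col_totals))
    sum_x2n = sum(x * x * n for x, n in zip(weights, col_totals))

    # Score statistic: U = sum_i x_i * (S * r_i - R * s_i) = N * sum(x r) - R * sum(x n)
    u = n_total * sum_xr - n_cases * sum_xn
    # Null variance of U: (R * S / N) * (N * sum(x^2 n) - (sum(x n))^2)
    var_u = n_cases * n_controls * (n_total * sum_x2n - sum_xn ** 2) / n_total
    if var_u <= 0:
        raise ValueError("degenerate table: variance of the statistic is zero")

    z = u / math.sqrt(var_u)
    # Two-sided p-value from the standard normal approximation.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    # Made-up counts for illustration; not data from any study listed here.
    z, p = cochran_armitage_trend([10, 20, 30], [30, 20, 10])
    print(f"z = {z:.4f}, p = {p:.3g}")
```

Other score vectors, such as (0, 0, 1) for a recessive model or (0, 1, 1) for a dominant model, can be passed through `weights`.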

10.
For a dense set of genetic markers such as single nucleotide polymorphisms (SNPs) in high linkage disequilibrium within a small candidate region, a haplotype-based approach for testing association between a disease phenotype and the set of markers is attractive in reducing the data complexity and increasing the statistical power. However, due to the unknown status of the underlying disease variant, a comprehensive association test may require consideration of various combinations of the SNPs, which often leads to severe multiple testing problems. In this paper, we propose a latent variable approach to test for association of multiple tightly linked SNPs in case-control studies. First, we introduce a latent variable into the penetrance model to characterize a putative disease susceptibility locus (DSL) that may consist of a marker allele, a haplotype from a subset of the markers, or an allele at a putative locus between the markers. Next, by using a retrospective likelihood to adjust for case-control sampling ascertainment and to appropriately handle the Hardy-Weinberg equilibrium constraint, we develop an expectation-maximization (EM)-based algorithm to fit the penetrance model and estimate the joint haplotype frequencies of the DSL and markers simultaneously. With the latent variable to describe a flexible role of the DSL, the likelihood ratio statistic can then provide a joint association test for the set of markers without requiring an adjustment for testing of multiple haplotypes. Our simulation results also reveal that the latent variable approach may have improved power under certain scenarios compared with classical haplotype association methods.

11.
Genome-wide association studies (GWAS) routinely apply principal component analysis (PCA) to infer population structure within a sample to correct for confounding due to ancestry. GWAS implementation of PCA uses tens of thousands of single-nucleotide polymorphisms (SNPs) to infer structure, despite the fact that only a small fraction of such SNPs provides useful information on ancestry. The identification of this reduced set of ancestry-informative markers (AIMs) from a GWAS has practical value; for example, researchers can genotype the AIM set to correct for potential confounding due to ancestry in follow-up studies that utilize custom SNP or sequencing technology. We propose a novel technique to identify AIMs from genome-wide SNP data using sparse PCA. The procedure uses penalized regression methods to identify those SNPs in a genome-wide panel that significantly contribute to the principal components while encouraging SNPs that provide negligible loadings to vanish from the analysis. We found that sparse PCA leads to negligible loss of ancestry information compared to traditional PCA analysis of genome-wide SNP data. We further demonstrate the value of sparse PCA for AIM selection using real data from the International HapMap Project and a genomewide study of inflammatory bowel disease. We have implemented our approach in open-source R software for public use.

12.
Errors in genotyping can greatly affect family-based association studies. If a Mendelian inconsistency is detected, the family is usually removed from the analysis. This reduces power, and may introduce bias. In addition, a large proportion of genotyping errors remain undetected, and these also reduce power. We present a Bayesian framework for performing association studies with SNP data on samples of trios consisting of parents with an affected offspring, while allowing for the presence of both detectable and undetectable genotyping errors. This framework also allows for the inclusion of missing genotypes. Associations between the SNP and disease were modelled in terms of the genotypic relative risks. The performances of the analysis methods were investigated under a variety of models for disease association and genotype error, looking at both power to detect association and precision of genotypic relative risk estimates. As expected, power to detect association decreased as genotyping error probability increased. Importantly, however, analyses allowing for genotyping error had similar power to standard analyses when applied to data without genotyping error. Furthermore, allowing for genotyping error yielded relative risk estimates that were approximately unbiased, together with 95% credible intervals giving approximately correct coverage. The methods were also applied to a real dataset: a sample of schizophrenia cases and their parents genotyped at SNPs in the dysbindin gene. The analysis methods presented here require no prior information on the genotyping error probabilities, and may be fitted in WinBUGS.

13.
A Bayesian model-based method for multilocus association analysis of quantitative and qualitative (binary) traits is presented. The method selects a trait-associated subset of markers among candidates, and is equally applicable for analyzing wide chromosomal segments (genome scans) and small candidate regions. The method can be applied in situations involving missing genotype data. The number of trait loci, their marker positions, and the magnitudes of their gene effects (strengths of association) are all estimated simultaneously. The inference of parameters is based on their posterior distributions, which are obtained through Markov chain Monte Carlo simulations. The strengths of the approach are: 1) flexible use of oligogenic models with an unknown number of loci, 2) performing the estimation of association jointly with model selection, and 3) avoidance of the multiple testing problem, which typically complicates the approaches based on association testing. The performance of the method was tested and compared to the multilocus conditional search procedure by analyzing two simulated data sets. We also applied the method to cystic fibrosis haplotype data (two-locus haplotypes), where gene position has already been identified. The method is implemented as a software package, which is freely available for research purposes under the name BAMA.

14.
As part of their practice, policymakers have to make economic evaluations using clinical trial data. Recent interest has been expressed in determining how cost-effectiveness analysis can be undertaken in a regression framework. In this respect, published research basically provides a general method for prognostic factor adjustment in the presence of imbalance, emphasizing sub-group analysis. In this paper, we present an alternative method from a Bayesian approach. We propose the use of covariates in Bayesian health technology assessment in order to reduce uncertainty about the effect of treatments. Using simulated data, we show its advantages by comparison with another published method that does not adjust for covariates.

15.
Genome‐wide association studies (GWAS) are now routinely imputed for untyped single nucleotide polymorphisms (SNPs) based on various powerful statistical algorithms for imputation trained on reference datasets. The use of predicted allele counts for imputed SNPs as the dosage variable is known to produce a valid score test for genetic association. In this paper, we investigate how to best handle imputed SNPs in various modern complex tests for genetic associations incorporating gene–environment interactions. We focus on case‐control association studies where inference for an underlying logistic regression model can be performed using alternative methods that rely to varying degrees on an assumption of gene–environment independence in the underlying population. As increasingly large‐scale GWAS are being performed through consortium efforts where it is preferable to share only summary‐level information across studies, we also describe simple mechanisms for implementing score tests based on standard meta‐analysis of “one‐step” maximum‐likelihood estimates across studies. Applications of the methods in simulation studies and a dataset from a GWAS of lung cancer illustrate the ability of the proposed methods to maintain type‐I error rates for the underlying testing procedures. For analysis of imputed SNPs, similar to typed SNPs, the retrospective methods can lead to considerable efficiency gain for modeling of gene–environment interactions under the assumption of gene–environment independence. Methods are made available for public use through the CGEN R software package.

16.
Selecting the best design for genetic association studies requires careful deliberation; different study designs can be used to scan for different genetic effects, and each design has its own set of strengths and limitations. A variety of family and unrelated control configurations are amenable to genetic association analyses, including the case-control design, case-parent triads, and case-parent triads in combination with unrelated controls or control-parent triads. Ultimately, the goal is to choose the design that achieves the highest statistical power using the lowest cost. For given parameter values and genotyped individuals, designs can be compared directly by computing the power. However, a more informative and general design comparison can be achieved by studying the relative efficiency, defined as the ratio of variances of two different parameter estimators, corresponding to two separate designs. Using log-linear modeling, we derive the relative efficiency from the asymptotic variance of the parameter estimators and relate it to the concept of Pitman efficiency. The relative efficiency takes into account the fact that different designs impose different costs relative to the number of genotyped individuals. We show that while optimal efficiency for analyses of regular autosomal effects is achieved using the standard case-control design, the case-parent triad design without unrelated controls is efficient when searching for parent-of-origin effects. Due to the potential loss of efficiency, maternal genes should generally not be adjusted for in an initial genome-wide association study scan of offspring genes but instead checked post hoc. The relative efficiency calculations are implemented in our R package Haplin.
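
The relative efficiency described in this abstract, a ratio of the variances of two parameter estimators from two designs, can be written out as a notational sketch. The symbols $D_1$, $D_2$ and $\hat{\beta}$ are assumptions introduced here, and the orientation of the ratio is a convention the abstract does not state. The cost adjustment the abstract mentions is not shown because its form is not given.

```latex
\mathrm{RE}(D_1, D_2)
  \;=\;
  \frac{\operatorname{Var}\bigl(\hat{\beta}_{D_2}\bigr)}
       {\operatorname{Var}\bigl(\hat{\beta}_{D_1}\bigr)}
```

Under this assumed orientation, a value above 1 would mean that design $D_1$ estimates the parameter with smaller variance than design $D_2$.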

17.
In this paper we examine alternative measurement models for fitting data from health surveys. We show why a testlet‐based latent trait model that includes covariate information, embedded within a fully Bayesian framework, can allow multiple simultaneous inferences and aid interpretation. We illustrate our approach with a survey of breast cancer survivors that reveals how the attitudes of those patients change after diagnosis toward a focus on appreciating the here‐and‐now, and away from consideration of longer‐term goals. Using the covariate information, we also show the extent to which individual‐level variables such as race, age and Tamoxifen treatment are related to a patient's change in attitude. The major contribution of this research is to demonstrate the use of a hierarchical Bayesian IRT model with covariates in this application area; hence a novel case study, and one that is certainly closely aligned with but distinct from the educational testing applications that have made IRT the dominant test scoring model. Copyright © 2010 John Wiley & Sons, Ltd.

18.
The goal of this paper is to present an implementation of stochastic search variable selection (SSVS) for a multilevel model from item response theory (IRT). As experimental settings get more complex and models are required to integrate multiple (and sometimes massive) sources of information, a model that can jointly summarize and select the most relevant characteristics can provide better interpretation and a deeper insight into the problem. A multilevel IRT model recently proposed in the literature for modeling multifactorial diseases is extended to perform variable selection in the presence of thousands of covariates using SSVS. We derive conditional distributions required for such a task as well as an acceptance‐rejection step that allows for the SSVS in high dimensional settings using a Markov Chain Monte Carlo algorithm. We validate the variable selection procedure through simulation studies, and illustrate its application on a study with genetic markers associated with the metabolic syndrome.

19.
The identification of gene–gene and gene–environment interaction in human traits and diseases is an active area of research that generates high expectations and, most often, leads to great disappointment. This is partly explained by a misunderstanding of the inherent characteristics of standard regression‐based interaction analyses. Here, I revisit and untangle major theoretical aspects of interaction tests in the special case of linear regression; in particular, I discuss variable coding schemes, interpretation of effect estimates, statistical power, and estimation of variance explained with regard to various hypothetical interaction patterns. Linking these components, it appears first that the simplest biological interaction models—in which the magnitude of a genetic effect depends on a common exposure—are among the most difficult to identify. Second, I highlight the demerit of the current strategy to evaluate the contribution of interaction effects to the variance of quantitative outcomes and argue for the use of new approaches to overcome this issue. Finally, I explore the advantages and limitations of multivariate interaction models, when testing for interaction between multiple SNPs and/or multiple exposures, over univariate approaches. Together, these new insights can be leveraged for future method development and to improve our understanding of the genetic architecture of multifactorial traits.

20.
Evaluating the association of multiple genetic variants with a trait of interest by use of kernel-based methods has made a significant impact on how genetic association analyses are conducted. An advantage of kernel methods is that they tend to be robust when the genetic variants have effects that are a mixture of positive and negative effects, as well as when there is a small fraction of causal variants. Another advantage is that kernel methods fit within the framework of mixed models, providing flexible ways to adjust for additional covariates that influence traits. Herein, we review the basic ideas behind the use of kernel methods for genetic association analysis as well as recent methodological advancements for different types of traits, multivariate traits, pedigree data, and longitudinal data. Finally, we discuss opportunities for future research.
