Similar literature
 20 similar articles found
1.
A genome‐wide association study (GWAS) correlates marker and trait variation in a study sample. Each subject is genotyped at a multitude of SNPs (single nucleotide polymorphisms) spanning the genome. Here, we assume that subjects are randomly collected unrelated individuals and that trait values are normally distributed or can be transformed to normality. Over the past decade, geneticists have been remarkably successful in applying GWAS analysis to hundreds of traits. The massive amount of data produced in these studies presents unique computational challenges. Penalized regression with the ℓ1 penalty (LASSO) or the minimax concave penalty (MCP) is capable of selecting a handful of associated SNPs from millions of potential SNPs. Unfortunately, model selection can be corrupted by false positives and false negatives, obscuring the genetic underpinning of a trait. Here, we compare LASSO and MCP penalized regression to iterative hard thresholding (IHT). On GWAS regression data, IHT is better at model selection and comparable in speed to both methods of penalized regression. This conclusion holds for both simulated and real GWAS data. IHT fosters parallelization and scales well in problems with large numbers of causal markers. Our parallel implementation of IHT accommodates SNP genotype compression and exploits multiple CPU cores and graphics processing units (GPUs). This allows statistical geneticists to leverage commodity desktop computers in GWAS analysis and to avoid supercomputing. Availability: Source code is freely available at https://github.com/klkeys/IHT.jl .
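As a rough illustration of the idea behind iterative hard thresholding, the sketch below alternates a gradient step on the least-squares loss with a projection that keeps only the k largest coefficients. It is a textbook-style toy in plain Python, not the IHT.jl implementation; the function names, the fixed step size, and the tiny orthogonal design matrix are all assumptions made for the example.

```python
def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and set the rest to zero."""
    order = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    keep = set(order[:k])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]


def iht(X, y, k, step, iters=50):
    """Approximately minimise 0.5 * ||y - X b||^2 with at most k nonzero coefficients."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        # Residuals and gradient of the least-squares loss at the current beta.
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        grad = [-sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        # Gradient step followed by projection onto k-sparse vectors.
        beta = hard_threshold([beta[j] - step * grad[j] for j in range(p)], k)
    return beta


# Hypothetical toy data: four "markers" with orthogonal columns (X^T X = 4 I),
# and a trait driven by markers 0 and 2 only.
X = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]
y = [-1.0, -1.0, 5.0, 5.0]           # equals X times [2, 0, -3, 0]
beta_hat = iht(X, y, k=2, step=0.25)  # step = 1 / 4 matches X^T X = 4 I
```

With real genotype matrices the columns are correlated, so the step size is normally chosen adaptively and recovery of the true support is not guaranteed; the fixed step here works only because the toy design is orthogonal.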

2.
Significance testing one SNP at a time has proven useful for identifying genomic regions that harbor variants affecting human disease. But after an initial genome scan has identified a "hit region" of association, single-locus approaches can falter. Local linkage disequilibrium (LD) can make both the number of underlying true signals and their identities ambiguous. Simultaneous modeling of multiple loci should help. However, it is typically applied ad hoc: conditioning on the top SNPs, with limited exploration of the model space and no assessment of how sensitive model choice was to sampling variability. Formal alternatives exist but are seldom used. Bayesian variable selection is coherent but requires specifying a full joint model, including priors on parameters and the model space. Penalized regression methods (e.g., LASSO) appear promising but require calibration, and, once calibrated, lead to a choice of SNPs that can be misleadingly decisive. We present a general method for characterizing uncertainty in model choice that is tailored to reprioritizing SNPs within a hit region under strong LD. Our method, LASSO local automatic regularization resample model averaging (LLARRMA), combines LASSO shrinkage with resample model averaging and multiple imputation, estimating for each SNP the probability that it would be included in a multi-SNP model in alternative realizations of the data. We apply LLARRMA to simulations based on case-control genome-wide association study data, and find that when there are several causal loci and strong LD, LLARRMA identifies a set of candidates that is enriched for true signals relative to single-locus analysis and to the recently proposed method of Stability Selection.

3.
The genetic case-control association study of unrelated subjects is a leading method to identify single nucleotide polymorphisms (SNPs) and SNP haplotypes that modulate the risk of complex diseases. Association studies often genotype several SNPs in a number of candidate genes; we propose a two-stage approach to address the inherent statistical multiple comparisons problem. In the first stage, each gene's association with disease is summarized by a single p-value that controls a familywise error rate. In the second stage, summary p-values are adjusted for multiplicity using a false discovery rate (FDR) controlling procedure. For the first stage, we consider marginal and joint tests of SNPs and haplotypes within genes, and we construct an omnibus test that combines SNP and haplotype analysis. Simulation studies show that when disease susceptibility is conferred by a SNP, and all common SNPs in a gene are genotyped, marginal analysis of SNPs using the Simes test has similar or higher power than marginal or joint haplotype analysis. Conversely, haplotype analysis can be more powerful when disease susceptibility is conferred by a haplotype. The omnibus test tracks the more powerful of the two approaches, which is generally unknown. Multiple testing balances the desire for statistical power against the implicit costs of false positive results, which up to now appear to be common in the literature.
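The two-stage scheme described above can be sketched as follows: a Simes combined p-value summarizes each gene, and the gene-level p-values are then screened with an FDR procedure. The abstract does not name the FDR procedure, so the Benjamini-Hochberg step-up rule is an assumption here, as are the gene names and the p-values, which are invented for illustration.

```python
def simes(pvalues):
    """Simes combined p-value: the minimum over i of m * p_(i) / i."""
    m = len(pvalues)
    return min(m * p / rank for rank, p in enumerate(sorted(pvalues), start=1))


def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Remember the largest rank whose p-value falls under its threshold.
        if pvalues[i] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical per-SNP p-values grouped by candidate gene.
genes = {
    "geneA": [0.001, 0.2, 0.6],
    "geneB": [0.3, 0.5],
    "geneC": [0.04, 0.01, 0.9],
}
names = list(genes)
gene_p = [simes(genes[g]) for g in names]            # stage 1: one p-value per gene
significant = [names[i] for i in benjamini_hochberg(gene_p, q=0.05)]  # stage 2
```

Summarizing each gene first means the second stage corrects for the number of genes rather than the number of SNPs, which is where the approach gets its power.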

4.
The much-anticipated fixed-array, genome-wide SNP genotyping technologies make large-scale genome-wide association scans now possible for large numbers of subjects. In this paper we reconsider the problem (Satagopan and Elston [2003] Genet Epidemiol 25:149-157) of optimizing a two-stage genotyping design to deal with important new issues that are relevant when studies are expanded from candidate gene size to a genome-wide scale. We investigate how the basic two-stage genotyping approach, in which all markers are genotyped in an initial group of subjects (stage I) and only the promising markers are genotyped in additional subjects (stage II), can be used to reduce genotyping cost in a genome-wide case-control association study even after allowing for much higher per genotype costs using specially designed assays in stage II, compared to the fixed array of SNPs used in stage I. In addition, we consider the problem of using measured SNPs to make (imperfect) prediction of unmeasured SNPs for association tests of all SNPs (measured or unmeasured) genome wide and the utility of expanding genotyping densities in stage II in the regions where significant associations were detected in stage I. Under a set of reasonable but conservative assumptions, we derive optimal two-stage design configurations (sample sizes and the thresholds of significance in both stages) with these optimal designs depending both on the total number of markers tested and upon the ratios of cost in stage II versus stage I. In addition we show how existing software for power and sample size calculations can be used for the purpose of designing two-stage studies, for a wide range of assumptions about the number of markers genotyped and the costs of genotyping in each stage of the study.

5.
In this article, we develop a powerful test for identifying single nucleotide polymorphism (SNP)-sets that are predictive of survival with data from genome-wide association studies. We first group typed SNPs into SNP-sets based on genomic features and then apply a score test to assess the overall effect of each SNP-set on the survival outcome through a kernel machine Cox regression framework. This approach uses genetic information from all SNPs in the SNP-set simultaneously and accounts for linkage disequilibrium (LD), leading to a powerful test with reduced degrees of freedom when the typed SNPs are in LD with each other. This type of test also has the advantage of capturing the potentially nonlinear effects of the SNPs, SNP-SNP interactions (epistasis), and the joint effects of multiple causal variants. By simulating SNP data based on the LD structure of real genes from the HapMap project, we demonstrate that our proposed test is more powerful than the standard single SNP minimum P-value-based test for association studies with censored survival outcomes. We illustrate the proposed test with a real data application.

6.
Qin H  Zhu X 《Genetic epidemiology》2012,36(3):235-243
When dense markers are available, one can interrogate almost every common variant across the genome via imputation and single nucleotide polymorphism (SNP) tests, which has become routine in current genome-wide association studies (GWASs). As a complement, admixture mapping exploits the long-range linkage disequilibrium (LD) generated by admixture between genetically distinct ancestral populations. It is then questionable whether admixture mapping analysis is still necessary in detecting the disease-associated variants in admixed populations. We argue that admixture mapping is able to reduce the burden of massive comparisons in GWASs; it therefore can be a powerful tool to locate the disease variants with substantial allele frequency differences between ancestral populations. In this report we studied a two-stage approach, where candidate regions are defined by conducting admixture mapping at stage 1, and single-SNP association tests follow at stage 2 within the candidate regions defined at stage 1. We first established the genome-wide significance levels corresponding to the criteria to define the candidate regions at stage 1 by simulations. We next compared the power of the two-stage approach with direct association analysis. Our simulations suggest that the two-stage approach can be more powerful than the standard genome-wide association analysis when the allele frequency difference of a causal variant in ancestral populations is larger than 0.4. Our conclusion is consistent with a theoretical prediction by Risch and Tang ([2006] Am J Hum Genet 79:S254). Surprisingly, our study also suggests that power can be improved when we use less strict criteria to define the candidate regions at stage 1.

7.
8.
Hsu L  Aragaki C  Quiaoit F  Wang X  Xu X  Zhao LP 《Genetic epidemiology》1999,17(Z1):S621-S626
A genome-wide scan of a simulated data set for fictitious disease genes was conducted using both semiparametric and nonparametric methods. The semiparametric model-based method, which tests for linkage/linkage disequilibrium separately and together, correctly identified all three underlying disease loci along with two false positives through the linkage analysis. However, the nonparametric model-free method, which tests combined linkage/linkage disequilibrium, failed to yield any results due to the lack of linkage disequilibrium information in the data.

9.
10.
In spite of the tremendous success of genome-wide association studies (GWAS) in identifying genetic variants associated with complex traits and common diseases, many more are yet to be discovered. Hence, it is always desirable to improve the statistical power of GWAS. In parallel with the intensive efforts to integrate GWAS with functional annotations or other omic data, we propose leveraging other published GWAS summary data to boost statistical power for a new/focus GWAS; the traits of the published GWAS may or may not be genetically correlated with the target trait of the new GWAS. Building on weighted hypothesis testing with a solid theoretical foundation, we develop a novel and effective method to construct single-nucleotide polymorphism (SNP)-specific weights based on 22 published GWAS data sets with various traits, detecting sometimes dramatically increased numbers of significant SNPs and independent loci as compared to the standard/unweighted analysis. For example, by integrating a schizophrenia GWAS summary data set with 19 other GWAS summary data sets of nonschizophrenia traits, our new method identified 1,585 genome-wide significant SNPs mapping to 15 linkage disequilibrium-independent loci, largely exceeding 818 significant SNPs in 13 independent loci identified by the standard/unweighted analysis; furthermore, using a later and larger schizophrenia GWAS summary data set as the validation data, 1,423 (out of 1,585) significant SNPs identified by the weighted analysis, compared to 705 (out of 818) by the unweighted analysis, were confirmed, while all 15 and 13 independent loci were also confirmed. Similar conclusions were reached with lipids and Alzheimer's disease (AD) traits. We conclude that the proposed approach is simple and cost-effective to improve GWAS power.
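To make the notion of weighted hypothesis testing concrete, the sketch below applies a weighted Bonferroni rule: each SNP is tested at level alpha * w_i / m, with the weights rescaled to average one so the overall error budget is unchanged. The abstract does not describe how its SNP-specific weights are built from other GWAS, so the weights here are simply made up, and the function name is hypothetical.

```python
def weighted_bonferroni(pvalues, weights, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha * w_i / m, with weights rescaled to mean 1."""
    m = len(pvalues)
    mean_w = sum(weights) / m
    scaled = [w / mean_w for w in weights]   # keeps the total error budget at alpha
    return [p <= alpha * w / m for p, w in zip(pvalues, scaled)]


# Hypothetical p-values for four SNPs and prior weights favouring the first SNP.
pvals = [0.02, 0.02, 0.5, 0.5]
weights = [3.0, 0.5, 0.25, 0.25]

weighted = weighted_bonferroni(pvals, weights)                 # [True, False, False, False]
unweighted = weighted_bonferroni(pvals, [1.0, 1.0, 1.0, 1.0])  # [False, False, False, False]
```

Up-weighting one SNP necessarily down-weights others, so the gain depends on the weights being informative; the first SNP passes only because its threshold rose from 0.0125 to 0.0375.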

11.
Genomewide association studies (GWAS) sometimes identify loci at which both the number and identities of the underlying causal variants are ambiguous. In such cases, statistical methods that model effects of multiple single‐nucleotide polymorphisms (SNPs) simultaneously can help disentangle the observed patterns of association and provide information about how those SNPs could be prioritized for follow‐up studies. Current multi‐SNP methods, however, tend to assume that SNP effects are well captured by additive genetics; yet when genetic dominance is present, this assumption translates to reduced power and faulty prioritizations. We describe a statistical procedure for prioritizing SNPs at GWAS loci that efficiently models both additive and dominance effects. Our method, LLARRMA‐dawg, combines a group LASSO procedure for sparse modeling of multiple SNP effects with a resampling procedure based on fractional observation weights. It estimates for each SNP the robustness of association with the phenotype both to sampling variation and to competing explanations from other SNPs. In producing an SNP prioritization that best identifies underlying true signals, we show the following: our method easily outperforms a single‐marker analysis; when additive‐only signals are present, our joint model for additive and dominance is equivalent to or only slightly less powerful than modeling additive‐only effects; and when dominance signals are present, even in combination with substantial additive effects, our joint model is unequivocally more powerful than a model assuming additivity. We also describe how performance can be improved through calibrated randomized penalization, and discuss how dominance in ungenotyped SNPs can be incorporated through either heterozygote dosage or multiple imputation.

12.
The impact of erroneous genotypes having passed standard quality control (QC) can be severe in genome-wide association studies, genotype imputation, and estimation of heritability and prediction of genetic risk based on single nucleotide polymorphisms (SNPs). To detect such genotyping errors, a simple two-locus QC method, based on the difference in test statistic of association between single SNPs and pairs of SNPs, was developed and applied. The proposed approach could detect many problematic SNPs with statistical significance even when standard single SNP QC analyses fail to detect them in real data. Depending on the data set used, the number of erroneous SNPs that were not filtered out by standard single SNP QC but detected by the proposed approach varied from a few hundred to thousands. Using simulated data, it was shown that the proposed method was powerful and performed better than other tested existing methods. The power of the proposed approach to detect erroneous genotypes was ~80% for a 3% error rate per SNP. This novel QC approach is easy to implement and computationally efficient, and can lead to a better quality of genotypes for subsequent genotype-phenotype investigations.

13.
Genome-wide association studies (GWAS) routinely apply principal component analysis (PCA) to infer population structure within a sample to correct for confounding due to ancestry. GWAS implementation of PCA uses tens of thousands of single-nucleotide polymorphisms (SNPs) to infer structure, despite the fact that only a small fraction of such SNPs provides useful information on ancestry. The identification of this reduced set of ancestry-informative markers (AIMs) from a GWAS has practical value; for example, researchers can genotype the AIM set to correct for potential confounding due to ancestry in follow-up studies that utilize custom SNP or sequencing technology. We propose a novel technique to identify AIMs from genome-wide SNP data using sparse PCA. The procedure uses penalized regression methods to identify those SNPs in a genome-wide panel that significantly contribute to the principal components while encouraging SNPs that provide negligible loadings to vanish from the analysis. We found that sparse PCA leads to negligible loss of ancestry information compared to traditional PCA analysis of genome-wide SNP data. We further demonstrate the value of sparse PCA for AIM selection using real data from the International HapMap Project and a genomewide study of inflammatory bowel disease. We have implemented our approach in open-source R software for public use.

14.
A central goal of medical genetics is to accurately predict complex disease from genotypes. Here, we present a comprehensive analysis of simulated and real data using lasso and elastic‐net penalized support‐vector machine models, a mixed‐effects linear model, a polygenic score, and unpenalized logistic regression. In simulation, the sparse penalized models achieved lower false‐positive rates and higher precision than the other methods for detecting causal SNPs. The common practice of prefiltering SNP lists for subsequent penalized modeling was examined and shown to substantially reduce the ability to recover the causal SNPs. Using genome‐wide SNP profiles across eight complex diseases within cross‐validation, lasso and elastic‐net models achieved substantially better predictive ability in celiac disease, type 1 diabetes, and Crohn's disease, and had equivalent predictive ability in the rest, with the results in celiac disease strongly replicating between independent datasets. We investigated the effect of linkage disequilibrium on the predictive models, showing that the penalized methods leverage this information to their advantage, compared with methods that assume SNP independence. Our findings show that sparse penalized approaches are robust across different disease architectures, producing as good as or better phenotype predictions and variance explained. This has fundamental ramifications for the selection and future development of methods to genetically predict human disease.

15.
We have developed a single nucleotide polymorphism (SNP) association scan statistic that takes into account the complex distribution of the human genome variation in the identification of chromosomal regions with significant SNP associations. This scan statistic has wide applicability for genetic analysis, whether to identify important chromosomal regions associated with common diseases based on whole-genome SNP association studies or to identify disease susceptibility genes based on dense SNP positional candidate studies. To illustrate this method, we analyzed patterns of SNP associations on chromosome 19 in a large cohort study. Among 2,944 SNPs, we found seven regions that contained clusters of significantly associated SNPs. The average width of these regions was 35 kb with a range of 10-72 kb. We compared the scan statistic results to Fisher's product method using a sliding window approach, and detected 22 regions with significant clusters of SNP associations. The average width of these regions was 131 kb with a range of 10.1-615 kb. Given that the distances between SNPs are not taken into consideration in the sliding window approach, it is likely that a large fraction of these regions represents false positives. However, all seven regions detected by the scan statistic were also detected by the sliding window approach. The linkage disequilibrium (LD) patterns within the seven regions were highly variable indicating that the clusters of SNP associations were not due to LD alone. The scan statistic developed here can be used to make gene-based or region-based SNP inferences about disease association.
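For readers unfamiliar with the comparison method, a sliding-window version of Fisher's product method might look like the sketch below. It combines the p-values in each window through the statistic -2 * sum(ln p_i), which follows a chi-square distribution with 2k degrees of freedom when the k tests are independent. The window width, the function names, and the example p-values are assumptions; the scan statistic proposed in the abstract is not reproduced here.

```python
import math


def fisher_combined(pvalues):
    """Combined p-value for Fisher's product method, assuming independent tests."""
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)   # half of the chi-square statistic
    # Chi-square survival function with 2k degrees of freedom has a closed form:
    # exp(-half) * sum_{i=0}^{k-1} half**i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def sliding_fisher(pvalues, width):
    """Combined p-value for every window of `width` consecutive SNPs."""
    return [fisher_combined(pvalues[s:s + width])
            for s in range(len(pvalues) - width + 1)]


# Hypothetical per-SNP p-values ordered by chromosomal position.
snp_p = [0.6, 0.05, 0.05, 0.7, 0.9]
window_p = sliding_fisher(snp_p, width=2)
```

Because the windows are defined by SNP count rather than physical distance and neighbouring SNPs are rarely independent, such a scan can flag spurious regions, which is consistent with the concern raised in the abstract.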

16.
We develop regression methodology to identify subsets of single nucleotide polymorphisms (SNPs) within candidate genes related to quantitative traits and apply our methods to the simulated Genetic Analysis Workshop (GAW) 12 data set. In the data set we find 694 SNP loci with minor allele frequencies of at least 0.01. We assume an additive causal model between these SNPs and all five quantitative traits. After initial screening using one‐way analysis of variance, we employ a computationally efficient simulated annealing algorithm to select among all possible subsets of SNP loci, using a generalization of Mallows’ Cp as our optimality criterion. The simple transition kernel we develop evaluates new subsets in O(1) time, requiring just three arithmetic operations to calculate the proposed RSS based on the Gauss‐Jordan pivot. We identify an SNP locus located at 6–5782 related to traits 2 and 3 and several sites on gene 2 related to trait 5 using a subsample of 1,000 individuals and the full data set (n = 8,250) for comparison. © 2001 Wiley‐Liss, Inc.

17.
Genome‐wide association studies allow detection of non‐genotyped disease‐causing variants through testing of nearby genotyped SNPs. This approach may fail when there are no genotyped SNPs in strong LD with the causal variant. Several genotyped SNPs in weak LD with the causal variant may, however, when considered together, provide equivalent information. This observation motivates popular but computationally intensive approaches based on imputation or haplotyping. Here we present a new method and accompanying software designed for this scenario. Our approach proceeds by selecting, for each genotyped “anchor” SNP, a nearby genotyped “partner” SNP, chosen via a specific algorithm we have developed. These two SNPs are used as predictors in linear or logistic regression analysis to generate a final significance test. In simulations, our method captures much of the signal captured by imputation, while taking a fraction of the time and disc space, and generating a smaller number of false‐positives. We apply our method to a case/control study of severe malaria genotyped using the Affymetrix 500K array. Previous analysis showed that fine‐scale sequencing of a Gambian reference panel in the region of the known causal locus, followed by imputation, increased the signal of association to genome‐wide significance levels. Our method also increases the signal of association from to . Our method thus, in some cases, eliminates the need for more complex methods such as sequencing and imputation, and provides a useful additional test that may be used to identify genetic regions of interest.

18.
In cancer research, high‐throughput profiling studies have been extensively conducted, searching for genes/single nucleotide polymorphisms (SNPs) associated with prognosis. Despite seemingly significant differences, different subtypes of the same cancer (or different types of cancers) may share common susceptibility genes. In this study, we analyze prognosis data on multiple subtypes of the same cancer but note that the proposed approach is directly applicable to the analysis of data on multiple types of cancers. We describe the genetic basis of multiple subtypes using the heterogeneity model that allows overlapping but different sets of susceptibility genes/SNPs for different subtypes. An accelerated failure time (AFT) model is adopted to describe prognosis. We develop a regularized gradient descent approach that conducts gene‐level analysis and identifies genes that contain important SNPs associated with prognosis. The proposed approach belongs to the family of gradient descent approaches, is intuitively reasonable, and has affordable computational cost. Simulation study shows that when prognosis‐associated SNPs are clustered in a small number of genes, the proposed approach outperforms alternatives with significantly more true positives and fewer false positives. We analyze an NHL (non‐Hodgkin lymphoma) prognosis study with SNP measurements and identify genes associated with the three major subtypes of NHL, namely, DLBCL, FL, and CLL/SLL. The proposed approach identifies genes different from those identified by alternative approaches and has the best prediction performance.

19.
We develop a Bayesian multi‐SNP Markov chain Monte Carlo approach that allows published functional significance scores to objectively inform single nucleotide polymorphism (SNP) prior effect sizes in expression quantitative trait locus (eQTL) studies. We developed the Normal Gamma prior to allow the inclusion of functional information. We partition SNPs into predefined functional groups and select prior distributions that fit the group‐specific observed functional significance scores. We test our method on two simulated datasets and previously analysed human eQTL data containing validated causal SNPs. In our simulations the modified Normal Gamma always performs at least as well as, and generally outperforms, the other methods considered. When analysing the human eQTL data, we placed all SNPs into their actual functional group. The ranks of the four validated causal SNPs analysed using the modified Normal Gamma increase dramatically compared to those of the other methods considered. Using our new method, three of the four validated SNPs are ranked in the top 1% of SNPs and the other is in the top 2%. For the standard Normal Gamma, the best of the other methods, the four validated SNPs had ranks in the top 1%, 4%, 20% and 59%. Crucially, these substantive improvements in the ranks make it highly likely that most, if not all, of these validated SNPs would have been flagged for follow‐up using our new method, whereas at least two of them would certainly not have been using the current approaches.

20.
In this paper we propose a Bayesian modeling approach to the analysis of genome-wide association studies based on single nucleotide polymorphism (SNP) data. Our latent seed model combines various aspects of k-means clustering, hidden Markov models (HMMs) and logistic regression into a fully Bayesian model. It is fitted using the Markov chain Monte Carlo stochastic simulation method, with Metropolis-Hastings update steps. The approach is flexible, both in allowing different types of genetic models, and because it can be easily extended while remaining computationally feasible due to the use of fast algorithms for HMMs. It allows for inference primarily on the location of the causal locus and also on other parameters of interest. The latent seed model is used here to analyze three data sets, using both synthetic and real disease phenotypes with real SNP data, and shows promising results. Our method is able to correctly identify the causal locus in examples where single SNP analysis is both successful and unsuccessful at identifying the causal SNP.
