Similar Articles
20 similar articles found.
1.
Neighboring common polymorphisms are often correlated (in linkage disequilibrium (LD)) as a result of shared ancestry. An association between a polymorphism and a disease trait may therefore be the indirect result of a correlated functional variant, and identifying the true causal variant(s) from an initial disease association is a major challenge in genetic association studies. Here, we present a method to estimate the sample size needed to discriminate between a functional variant of a given allele frequency and effect size, and other correlated variants. The sample size required to conduct such fine‐scale mapping is typically 1–4 times larger than required to detect the initial association. Association studies in populations with different LD patterns can substantially improve the power to isolate the causal variant. An online tool to perform these calculations is available at http://moya.srl.cam.ac.uk/ocac/FineMappingPowerCalculator.html. Genet. Epidemiol. 34:463–468, 2010. © 2010 Wiley‐Liss, Inc.
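The abstract refers to the sample size required to detect the initial association; a minimal sketch of that baseline calculation is below, assuming a per-allele odds-ratio model and a 1:1 case-control design tested by a two-proportion comparison of allele frequencies. It is not the authors' fine-mapping calculator; the allele frequency, odds ratio, and error rates are illustrative choices, and the paper's 1-4x fine-mapping multiplier would be applied on top of the number it returns.

```python
# Baseline sample size to detect a per-allele case-control association,
# via a two-proportion z-test on allele counts (generic sketch only).
from scipy.stats import norm

def cases_needed(p0, odds_ratio, alpha=5e-8, power=0.8):
    """Cases (= controls) needed to detect a per-allele odds ratio when the
    risk-allele frequency in controls is p0, assuming a 1:1 design."""
    # Risk-allele frequency on case chromosomes under a multiplicative model.
    p1 = odds_ratio * p0 / (1.0 - p0 + odds_ratio * p0)
    z_a, z_b = norm.ppf(1.0 - alpha / 2.0), norm.ppf(power)
    p_bar = (p0 + p1) / 2.0
    num = (z_a * (2.0 * p_bar * (1.0 - p_bar)) ** 0.5 +
           z_b * (p0 * (1.0 - p0) + p1 * (1.0 - p1)) ** 0.5) ** 2
    alleles_per_group = num / (p1 - p0) ** 2
    return alleles_per_group / 2.0          # two alleles genotyped per person

print(round(cases_needed(p0=0.3, odds_ratio=1.2)))   # ~5,500 cases with these inputs
```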

2.
Meta-analysis has become a key component of well-designed genetic association studies due to the boost in statistical power achieved by combining results across multiple samples of individuals and the need to validate observed associations in independent studies. Meta-analyses of genetic association studies based on multiple SNPs and traits are subject to the same multiple testing issues as single-sample studies, but it is often difficult to adjust accurately for the multiple tests. Procedures such as Bonferroni may control the type-I error rate but will generally provide an overly harsh correction if SNPs or traits are correlated. Depending on study design, availability of individual-level data, and computational requirements, permutation testing may not be feasible in a meta-analysis framework. In this article, we present methods for adjusting for multiple correlated tests under several study designs commonly employed in meta-analyses of genetic association tests. Our methods are applicable to both prospective meta-analyses in which several samples of individuals are analyzed with the intent to combine results, and retrospective meta-analyses, in which results from published studies are combined, including situations in which (1) individual-level data are unavailable, and (2) different sets of SNPs are genotyped in different studies due to random missingness or two-stage design. We show through simulation that our methods accurately control the rate of type I error and achieve improved power over multiple testing adjustments that do not account for correlation between SNPs or traits.
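When individual-level data are unavailable but the correlation among the test statistics can be estimated (for example from LD or from phenotype correlations), one generic alternative to permutation is to approximate the joint null of the Z statistics as multivariate normal and obtain the family-wise threshold by simulation. The sketch below shows that generic device, not the specific procedures developed in the paper; the toy correlation matrix and simulation count are assumptions.

```python
# Family-wise p-value threshold for correlated tests, approximated by
# simulating the vector of test statistics from a multivariate normal null.
import numpy as np
from scipy.stats import norm

def mvn_threshold(corr, fwer=0.05, n_sim=100_000, seed=0):
    """Per-test p-value threshold controlling the family-wise error rate,
    given the correlation matrix of the asymptotically normal statistics."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_sim)
    max_abs = np.abs(z).max(axis=1)              # most extreme statistic per replicate
    z_crit = np.quantile(max_abs, 1.0 - fwer)    # (1 - fwer) quantile of the maximum
    return 2.0 * norm.sf(z_crit)                 # equivalent two-sided p threshold

# Toy example: 10 tests with pairwise correlation 0.6.
m = 10
corr = 0.6 * np.ones((m, m)) + 0.4 * np.eye(m)
print(mvn_threshold(corr))    # less harsh than the Bonferroni cut of 0.05 / 10
```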

3.
The interpretation of the results of large association studies encompassing much or all of the human genome faces the fundamental statistical problem that a correspondingly large number of single nucleotide polymorphism (SNP) markers will be spuriously flagged as significant. A common method of dealing with these false positives is to apply a more stringent significance level to the individual tests for association of each marker. Any such adjustment for multiple testing is ultimately based on a more or less precise estimate for the actual overall type I error probability. We estimate this probability for association tests for correlated markers and show that it depends in a nonlinear way on the significance level for the individual tests. Existing multiple-testing corrections do not take this dependence of the effective number of tests into account and therefore yield widely overestimated results. We demonstrate a simple correction for multiple testing, which can easily be calculated from the pairwise correlation and gives far more realistic estimates for the effective number of tests than previous formulae. The calculation is considerably faster than with other methods and hence applicable on a genome-wide scale. The efficacy of our method is shown on a constructed example with highly correlated markers as well as on real data sets, including a full genome scan where a conservative estimate only 8% above the permutation estimate is obtained in about 1% of the computation time. As the calculation is based on pairwise correlations between markers, it can be performed at the stage of study design using public databases.
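For reference, a widely used effective-number-of-tests estimate that is also computed from the pairwise marker correlation matrix is the eigenvalue-based approach in the style of Nyholt and of Li and Ji. The sketch below implements the Li-Ji variant as a point of comparison; it is not the correction proposed in this paper, and the block-correlated toy matrix is an assumption.

```python
# Effective number of independent tests from the eigenvalues of the pairwise
# marker correlation matrix (Li & Ji, 2005 style), shown for reference.
import numpy as np

def m_eff_li_ji(r):
    """r: (markers x markers) pairwise correlation matrix."""
    lam = np.abs(np.linalg.eigvalsh(r))
    # Each eigenvalue contributes I(lam >= 1) plus its fractional part.
    return float(np.sum((lam >= 1).astype(float) + (lam - np.floor(lam))))

# Toy example: 100 markers in two tightly correlated blocks (r = 0.9 within blocks).
block = 0.9 * np.ones((50, 50)) + 0.1 * np.eye(50)
r = np.block([[block, np.zeros((50, 50))], [np.zeros((50, 50)), block]])
m_eff = m_eff_li_ji(r)
print(m_eff, 0.05 / m_eff)    # far fewer than 100 effective tests; milder threshold than 0.05/100
```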

4.
Even in large-scale genome-wide association studies (GWASs), only a fraction of the true associations are detected at the genome-wide significance level. When few or no associations reach the significance threshold, one strategy is to follow up on the most promising candidates, i.e. the single nucleotide polymorphisms (SNPs) with the smallest association-test P-values, by genotyping them in additional studies. In this communication, we propose an overall test for GWASs that analyzes the SNPs with the most promising P-values simultaneously and therefore allows an early assessment of whether following up the selected SNPs is likely to be worthwhile. We theoretically derive the properties of the proposed overall test under the null hypothesis and assess its power based on simulation studies. An application to a GWAS for chronic obstructive pulmonary disease suggests that there are true association signals among the top SNPs and that an additional follow-up study is promising.

5.
Next‐generation DNA sequencing technologies are facilitating large‐scale association studies of rare genetic variants. The depth of the sequence read coverage is an important experimental variable in the next‐generation technologies and it is a major determinant of the quality of genotype calls generated from sequence data. When case and control samples are sequenced separately or in different proportions across batches, they are unlikely to be matched on sequencing read depth and a differential misclassification of genotypes can result, causing confounding and an increased false‐positive rate. Data from Pilot Study 3 of the 1000 Genomes project was used to demonstrate that a difference between the mean sequencing read depth of case and control samples can result in false‐positive association for rare and uncommon variants, even when the mean coverage depth exceeds 30× in both groups. The degree of the confounding and inflation in the false‐positive rate depended on the extent to which the mean depth was different in the case and control groups. A logistic regression model was used to test for association between case‐control status and the cumulative number of alleles in a collapsed set of rare and uncommon variants. Including each individual's mean sequence read depth across the variant sites in the logistic regression model nearly eliminated the confounding effect and the inflated false‐positive rate. Furthermore, accounting for the potential error by modeling the probability of the heterozygote genotype calls in the regression analysis had a relatively minor but beneficial effect on the statistical results. Genet. Epidemiol. 35: 261‐268, 2011. © 2011 Wiley‐Liss, Inc.
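A minimal sketch of the regression the abstract describes: case-control status modeled on the collapsed rare-allele count with each individual's mean read depth across the variant sites included as a covariate. The data below are simulated under the confounding scenario (cases sequenced deeper, no true burden effect), and all variable names are illustrative; in practice the counts and depths come from the genotype calls and sequencing summaries.

```python
# Collapsed rare-variant test with mean sequencing depth as a covariate,
# absorbing depth-related differential genotype misclassification.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
case = np.repeat([1, 0], n // 2)
# Cases sequenced at higher mean depth than controls; no true association.
mean_depth = np.where(case == 1, rng.normal(40, 5, n), rng.normal(30, 5, n))
# Heterozygote calls are missed more often at low depth, so the *observed*
# burden drifts with depth even though the true burden is independent of status.
true_burden = rng.poisson(1.0, n)
call_rate = 1.0 - np.exp(-mean_depth / 15.0)
observed_burden = rng.binomial(true_burden, call_rate)

X = sm.add_constant(np.column_stack([observed_burden, mean_depth]))
fit = sm.Logit(case, X).fit(disp=0)
print(fit.pvalues)   # burden coefficient is near null once depth is adjusted for
# Dropping mean_depth from X reproduces the inflated false-positive association.
```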

6.
A major challenge in genome‐wide association studies (GWASs) is to derive the multiple testing threshold when hypothesis tests are conducted using a large number of single nucleotide polymorphisms. Permutation tests are considered the gold standard in multiple testing adjustment in genetic association studies. However, they are computationally intensive, especially for GWASs, and can be impractical if a large number of random shuffles are used to ensure accuracy. Many researchers have developed approximation algorithms to relieve the computing burden imposed by permutation. One particularly attractive alternative to permutation is to calculate the effective number of independent tests, Meff, which has been shown to be promising in genetic association studies. In this study, we compare recently developed Meff methods and validate them by the permutation test with 10,000 random shuffles using two real GWAS data sets: an Illumina 1M BeadChip and an Affymetrix GeneChip® Human Mapping 500K Array Set. Our results show that the simpleM method produces the best approximation of the permutation threshold, and it does so in the shortest amount of time. We also show that Meff is indeed valid on a genome‐wide scale in these data sets based on statistical theory and significance tests. The significance thresholds derived can provide practical guidelines for other studies using similar population samples and genotyping platforms. Genet. Epidemiol. 34:100–105, 2010. © 2009 Wiley‐Liss, Inc.
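A sketch of the simpleM idea as it is usually described: eigen-decompose the SNP correlation matrix, take Meff as the number of leading eigenvalues needed to explain a fixed fraction of the total variance (99.5% is the commonly cited default), and use Meff in place of the raw SNP count in a Bonferroni correction. The cutoff and the toy genotype matrix below are assumptions for illustration, not a reproduction of the study's data.

```python
# simpleM-style effective number of tests: the count of leading eigenvalues of
# the SNP correlation matrix needed to explain >= 99.5% of its total variance.
import numpy as np

def simple_m(genotypes, var_explained=0.995):
    """genotypes: (individuals x SNPs) matrix of 0/1/2 allele counts."""
    r = np.corrcoef(genotypes, rowvar=False)
    lam = np.sort(np.linalg.eigvalsh(r))[::-1]       # eigenvalues, descending
    cum = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cum, var_explained) + 1)

# Toy data: 100 SNPs, each with a near-duplicate in almost perfect LD (200 total).
rng = np.random.default_rng(2)
base = rng.binomial(2, 0.3, size=(1000, 100))
dup = base.copy()
flip = rng.random(base.shape) < 0.02                  # 2% of calls re-drawn
dup[flip] = rng.binomial(2, 0.3, size=flip.sum())
geno = np.hstack([base, dup])

m_eff = simple_m(geno)
print(m_eff, 0.05 / m_eff)    # Meff noticeably below 200; milder than 0.05/200
```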

7.
Improving power in genome-wide association studies: weights tip the scale
The potential of genome-wide association analysis can only be realized when such analyses have the power to detect signals despite the detrimental effect of multiple testing on power. We develop a weighted multiple testing procedure that facilitates the input of prior information in the form of groupings of tests. For each group a weight is estimated from the observed test statistics within the group. Differentially weighting groups improves the power to detect signals in likely groupings. The advantage of the grouped-weighting concept, over fixed weights based on prior information, is that it often leads to an increase in power even if many of the groupings are not correlated with the signal. Being data dependent, the procedure is remarkably robust to poor choices in groupings. Power is typically improved if one (or more) of the groups clusters multiple tests with signals, yet little power is lost when the groupings are totally random. If there is no apparent signal in a group, relative to a group that appears to have several tests with signals, the former group will be down-weighted relative to the latter. If no groups show apparent signals, then the weights will be approximately equal. The only restriction on the procedure is that the number of groups be small, relative to the total number of tests performed.
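A minimal sketch of the grouped-weighting idea in generic form: estimate a weight for each group from its observed statistics, normalize the weights to average one across all tests, and apply a weighted Bonferroni rule. The weight estimator used here (each group's mean excess of z-squared over its null expectation, floored at a small constant) is an illustrative stand-in, not the estimator derived in the paper.

```python
# Weighted Bonferroni with group weights estimated from the observed statistics.
import numpy as np
from scipy.stats import norm

def grouped_weight_reject(z, groups, alpha=0.05, floor=0.01):
    """Boolean rejection vector from a weighted Bonferroni rule whose
    data-estimated group weights average to one over all tests."""
    z, groups = np.asarray(z, float), np.asarray(groups)
    m = z.size
    p = 2.0 * norm.sf(np.abs(z))
    labels = np.unique(groups)
    raw = np.array([max(np.mean(z[groups == lab] ** 2) - 1.0, floor)
                    for lab in labels])          # each group's apparent signal
    w = raw[np.searchsorted(labels, groups)]
    w *= m / w.sum()                             # weights now average to one
    return p <= w * alpha / m

# Toy example: 5 groups of 2,000 tests; 50 modest signals, all in group 0.
rng = np.random.default_rng(3)
groups = np.repeat(np.arange(5), 2000)
z = rng.normal(size=10_000)
z[:50] += 4.5
weighted = grouped_weight_reject(z, groups)
plain = 2.0 * norm.sf(np.abs(z)) <= 0.05 / z.size
print(weighted.sum(), plain.sum())   # the weighted rule typically recovers more of the signals
```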

8.
Validation studies have been used to increase the reliability of the statistical conclusions for scientific discoveries; such studies improve the reproducibility of the findings and reduce the possibility of false positives. Here, one of the important roles of statistics is to quantify reproducibility rigorously. Two concepts were recently defined for this purpose: (i) rediscovery rate (RDR), which is the expected proportion of statistically significant findings in a study that can be replicated in the validation study and (ii) false discovery rate in the validation study (vFDR). In this paper, we aim to develop a nonparametric approach to estimate the RDR and vFDR and show an explicit link between the RDR and the FDR. Among other things, the link explains why reproducing statistically significant results even with low FDR level may be difficult. Two metabolomics datasets are considered to illustrate the application of the RDR and vFDR concepts in high‐throughput data analysis. Copyright © 2016 John Wiley & Sons, Ltd.
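A small simulation of the phenomenon the abstract alludes to: when per-test power is modest, findings that pass an FDR threshold in the primary study often fail to replicate at a nominal level in a same-sized validation study, so the empirical rediscovery proportion sits well below one even though the FDR is controlled. The effect size, study sizes, and thresholds are arbitrary illustrations, and the quantity computed is a plain empirical proportion, not the nonparametric RDR/vFDR estimators proposed in the paper.

```python
# Rediscovery under modest power: apply BH in a primary study, then check how
# many discoveries replicate (same direction, p < 0.05) in a validation study.
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(4)
m, m1, effect = 10_000, 300, 2.5          # tests, true signals, signal mean (z units)
mu = np.zeros(m)
mu[:m1] = effect

def one_study():
    z = mu + rng.normal(size=m)
    return z, 2.0 * norm.sf(np.abs(z))

z1, p1 = one_study()                       # primary study
z2, p2 = one_study()                       # independent validation study
discovered = multipletests(p1, alpha=0.05, method="fdr_bh")[0]
replicated = discovered & (p2 < 0.05) & (np.sign(z1) == np.sign(z2))
print(discovered.sum(), replicated.sum(),
      replicated.sum() / max(discovered.sum(), 1))   # rediscovery proportion noticeably below 1
```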

9.
Using genome-wide association studies to identify genetic variants contributing to disease has been highly successful, with many novel genetic predispositions identified and biological pathways revealed. Several pitfalls leading to spurious association or non-replication have been highlighted, ranging from population structure and automated genotype scoring for cases and controls to age-varying association. We describe an important yet unreported source of bias in case-control studies due to variations in chip technology between different commercial array releases. As cases are commonly genotyped with newer arrays and freely available control resources are frequently used for comparison, there exists an important potential for false associations that are robust to standard quality control and replication design.

10.
The genetic basis of multiple phenotypes such as gene expression, metabolite levels, or imaging features is often investigated by testing a large collection of hypotheses, probing the existence of association between each of the traits and hundreds of thousands of genotyped variants. Appropriate multiplicity adjustment is crucial to guarantee replicability of findings, and the false discovery rate (FDR) is frequently adopted as a measure of global error. In the interest of interpretability, results are often summarized so that reporting focuses on variants discovered to be associated with some phenotypes. We show that applying FDR‐controlling procedures on the entire collection of hypotheses fails to control the rate of false discovery of associated variants as well as the expected value of the average proportion of false discovery of phenotypes influenced by such variants. We propose a simple hierarchical testing procedure that allows control of both these error rates and provides a more reliable basis for the identification of variants with functional effects. We demonstrate the utility of this approach through simulation studies comparing various error rates and measures of power for genetic association studies of multiple traits. Finally, we apply the proposed method to identify genetic variants that impact flowering phenotypes in Arabidopsis thaliana, expanding the set of discoveries.
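One standard way to implement the hierarchy the abstract argues for, in the spirit of Benjamini-Bogomolov selective inference rather than necessarily the authors' exact procedure: combine each variant's phenotype p-values into one screening p-value (for example by Simes), apply BH across variants, and then test phenotypes only within the selected variants at a correspondingly reduced level. The data and levels below are illustrative.

```python
# Hierarchical variant-then-phenotype testing: Simes screening p-value per
# variant, BH across variants, then BH across phenotypes within selected
# variants at level q * (selected / total), a Benjamini-Bogomolov-style scheme.
import numpy as np
from statsmodels.stats.multitest import multipletests

def hierarchical_test(pmat, q=0.05):
    """pmat: (variants x phenotypes) p-value matrix.
    Returns (selected-variant mask, phenotype rejections per selected variant)."""
    n_var, n_phe = pmat.shape
    ranks = np.arange(1, n_phe + 1)
    simes = np.min(np.sort(pmat, axis=1) * n_phe / ranks, axis=1)   # Simes p per variant
    selected = multipletests(simes, alpha=q, method="fdr_bh")[0]
    inner_q = q * selected.sum() / n_var            # adjusted level for the second stage
    rejections = {i: multipletests(pmat[i], alpha=inner_q, method="fdr_bh")[0]
                  for i in np.flatnonzero(selected)}
    return selected, rejections

# Toy example: 1,000 variants, 10 phenotypes, 5 variants each affecting 3 phenotypes.
rng = np.random.default_rng(5)
p = rng.uniform(size=(1000, 10))
p[:5, :3] = rng.uniform(0, 1e-5, size=(5, 3))
sel, rej = hierarchical_test(p)
print(sel.sum(), {i: r.sum() for i, r in rej.items()})   # typically the 5 planted variants, 3 phenotypes each
```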

11.
The multiplicity problem has become increasingly important in genetic studies as the capacity for high-throughput genotyping has increased. The control of False Discovery Rate (FDR) (Benjamini and Hochberg [1995] J. R. Stat. Soc. Ser. B 57:289-300) has been adopted to address the problems of false positive control and low power inherent in high-volume genome-wide linkage and association studies. In many genetic studies, there is often a natural stratification of the m hypotheses to be tested. Given the FDR framework and the presence of such stratification, we investigate the performance of a stratified false discovery control approach (i.e. control or estimate FDR separately for each stratum) and compare it to the aggregated method (i.e. consider all hypotheses in a single stratum). Under the fixed rejection region framework (i.e. reject all hypotheses with unadjusted p-values less than a pre-specified level and then estimate FDR), we demonstrate that the aggregated FDR is a weighted average of the stratum-specific FDRs. Under the fixed FDR framework (i.e. reject as many hypotheses as possible and meanwhile control FDR at a pre-specified level), we specify a condition necessary for the expected total number of true positives under the stratified FDR method to be equal to or greater than that obtained from the aggregated FDR method. Application to a recent Genome-Wide Association (GWA) study by Maraganore et al. ([2005] Am. J. Hum. Genet. 77:685-693) illustrates the potential advantages of control or estimation of FDR by stratum. Our analyses also show that controlling FDR at a low rate, e.g. 5% or 10%, may not be feasible for some GWA studies.
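A minimal sketch of the stratified approach under the fixed-FDR framework: run Benjamini-Hochberg separately within each stratum at the same target level, rather than once on the pooled p-values. The strata and p-values below are simulated for illustration, with a small candidate-gene stratum enriched for signals and a large, mostly null genome-wide stratum.

```python
# Stratified versus aggregated FDR control with Benjamini-Hochberg.
import numpy as np
from statsmodels.stats.multitest import multipletests

def stratified_bh(pvals, strata, q=0.05):
    """Apply BH at level q separately within each stratum; returns rejections."""
    reject = np.zeros(len(pvals), dtype=bool)
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        reject[idx] = multipletests(pvals[idx], alpha=q, method="fdr_bh")[0]
    return reject

rng = np.random.default_rng(6)
p_candidate = np.concatenate([rng.uniform(0, 1e-4, 30), rng.uniform(size=470)])
p_genomewide = rng.uniform(size=50_000)
pvals = np.concatenate([p_candidate, p_genomewide])
strata = np.concatenate([np.zeros(500, int), np.ones(50_000, int)])

agg = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
strat = stratified_bh(pvals, strata)
print(agg.sum(), strat.sum())   # stratified analysis recovers the candidate-gene signals; aggregated mostly does not
```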

12.
One of the main roles of omics-based association studies with high-throughput technologies is to screen out relevant molecular features, such as genetic variants, genes, and proteins, from a large pool of such candidate features based on their associations with the phenotype of interest. Typically, screened features are subject to validation studies using more established or conventional assays, where the number of evaluable features is relatively limited, so that there may exist a fixed number of features measurable by these assays. Such a limitation necessitates narrowing a feature set down to a fixed size, following an initial screening analysis via multiple testing where adjustment for multiplicity is made. We propose a two-stage screening approach to control the false discovery rate (FDR) for a feature set with fixed size that is subject to validation studies, rather than for a feature set from the initial screening analysis. From the feature set selected in the first stage at a relaxed FDR level, a fraction of the features with the greatest statistical significance is selected first. For the remaining feature set, features are selected based on biological consideration only, without regard to any statistical information, which allows the FDR level to be evaluated for the finally selected feature set of fixed size. Improvement in power under the proposed two-stage screening approach is also discussed. Simulation experiments based on parametric models and real microarray datasets demonstrated a substantial increase in the number of features screened for biological consideration compared with the standard screening approach, allowing for more extensive and in-depth biological investigations in omics association studies.

13.
Genome‐wide association (GWA) studies have proved to be extremely successful in identifying novel common polymorphisms contributing effects to the genetic component underlying complex traits. Nevertheless, one source of as yet undiscovered genetic determinants of complex traits is variation mediated through the effects of rare variants. With the increasing availability of large‐scale re‐sequencing data for rare variant discovery, we have developed a novel statistical method for the detection of complex trait associations with these loci, based on searching for accumulations of minor alleles within the same functional unit. We have undertaken simulations to evaluate strategies for the identification of rare variant associations in population‐based genetic studies when data are available from re‐sequencing discovery efforts or from commercially available GWA chips. Our results demonstrate that methods based on accumulations of rare variants discovered through re‐sequencing offer substantially greater power than conventional analysis of GWA data, and thus provide an exciting opportunity for future discovery of genetic determinants of complex traits. Genet. Epidemiol. 34: 188–193, 2010. © 2009 Wiley‐Liss, Inc.
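The accumulation-of-minor-alleles idea belongs to the same family as the familiar collapsing and burden tests. The sketch below shows a simple CAST-style version (an indicator of carrying at least one rare minor allele in the functional unit, compared between cases and controls with Fisher's exact test); it is a generic illustration of accumulation testing, not a reimplementation of the authors' method, and the simulated genotypes are assumptions.

```python
# CAST-style collapsing test: carrier of >= 1 rare minor allele in the gene
# versus non-carrier, compared between cases and controls.
import numpy as np
from scipy.stats import fisher_exact

def collapsing_test(rare_genotypes, case_status):
    """rare_genotypes: (individuals x rare variants) 0/1/2 minor-allele counts
    for one functional unit; case_status: 0/1 per individual."""
    carrier = (rare_genotypes.sum(axis=1) > 0).astype(int)
    table = np.array([[np.sum((carrier == 1) & (case_status == 1)),
                       np.sum((carrier == 0) & (case_status == 1))],
                      [np.sum((carrier == 1) & (case_status == 0)),
                       np.sum((carrier == 0) & (case_status == 0))]])
    return fisher_exact(table)   # (odds ratio, p-value)

# Toy example: 20 rare variants (MAF ~0.5%), carriers enriched among cases.
rng = np.random.default_rng(7)
n = 4000
status = np.repeat([1, 0], n // 2)
maf = np.full(20, 0.005) * np.where(status[:, None] == 1, 2.0, 1.0)  # 2x MAF in cases
geno = rng.binomial(2, maf)
print(collapsing_test(geno, status))
```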

14.
Large exploratory studies are often characterized by a preponderance of true null hypotheses, with a small though multiple number of false hypotheses. Traditional multiple-test adjustments consider either each hypothesis separately, or all hypotheses simultaneously, but it may be more desirable to consider the combined evidence for subsets of hypotheses, in order to reduce the number of hypotheses to a manageable size. Previously, Zaykin et al. ([2002] Genet. Epidemiol. 22:170-185) proposed forming the product of all P-values falling below a preset threshold, in order to combine evidence from all significant tests. Here we consider a complementary strategy: form the product of the K most significant P-values. This has certain advantages for genomewide association scans: K can be chosen on the basis of a hypothesised disease model, and is independent of sample size. Furthermore, the alternative hypothesis corresponds more closely to the experimental situation where all loci have fixed effects. We give the distribution of the rank truncated product and suggest some methods to account for correlated tests in genomewide scans. We show that, under realistic scenarios, it provides increased power to detect genomewide association, while identifying a candidate set of good quality and fixed size for follow-up studies.
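A sketch of the rank truncated product under the simplifying assumption of independent tests: the statistic is the product of the K smallest p-values (equivalently the sum of their logs), and its null distribution is obtained here by Monte Carlo rather than from the paper's analytic distribution. The corrections for correlated tests described in the abstract are not reproduced; K and the simulation count are illustrative.

```python
# Rank truncated product: product of the K smallest p-values, calibrated by
# Monte Carlo under an independence null.
import numpy as np

def rtp_statistic(pvals, k):
    return np.sum(np.log(np.sort(pvals)[:k]))     # log of the truncated product

def rtp_test(pvals, k, n_sim=10_000, seed=0):
    rng = np.random.default_rng(seed)
    obs = rtp_statistic(pvals, k)
    null = np.array([rtp_statistic(rng.uniform(size=len(pvals)), k)
                     for _ in range(n_sim)])
    return np.mean(null <= obs)                   # smaller (more negative) is more extreme

# Toy example: 1,000 tests with 5 modest signals that individually miss Bonferroni.
rng = np.random.default_rng(8)
p = rng.uniform(size=1000)
p[:5] = rng.uniform(1e-4, 1e-3, size=5)
print(p.min() < 0.05 / len(p), rtp_test(p, k=10))
# With these settings the smallest single p-value usually misses the Bonferroni
# cut of 5e-5, while the truncated-product p-value is very small.
```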

15.
We consider detecting associations between a trait and multiple single nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD). To maximize the use of information contained in multiple SNPs while minimizing the cost of large degrees of freedom (DF) in testing multiple parameters, we first theoretically explore the sum test derived under a working assumption of a common association strength between the trait and each SNP, testing the corresponding parameter with only one DF. Under scenarios in which the association strengths between the trait and the SNPs are close to each other (and in the same direction), as considered by Wang and Elston [Am. J. Hum. Genet. [2007] 80:353–360], we show with simulated data that the sum test was powerful compared with several existing tests; otherwise, the sum test might have much reduced power. To overcome the limitation of the sum test, based on our theoretical analysis of the sum test, we propose five new tests that are closely related to each other and are shown to consistently perform similarly well across a wide range of scenarios. We point out the close connection of the proposed tests to the Goeman test. Furthermore, we derive the asymptotic distributions of the proposed tests so that P‐values can be easily calculated, in contrast to the use of computationally demanding permutations or simulations for the Goeman test. A distinguishing feature of the five new tests is their use of a diagonal working covariance matrix, rather than a full covariance matrix as used in the usual Wald or score test. We recommend the routine use of two of the new tests, along with several other tests, to detect disease associations with multiple linked SNPs. Genet. Epidemiol. 33:497–507, 2009. © 2009 Wiley‐Liss, Inc.

16.
Genome-wide association studies are carried out to identify unknown genes for a complex trait. Polymorphisms showing the most statistically significant associations are reported and followed up in subsequent confirmatory studies. In addition to the test of association, the statistical analysis provides point estimates of the relationship between the genotype and phenotype at each polymorphism, typically an odds ratio in case-control association studies. The statistical significance of the test and the estimator of the odds ratio are completely correlated. Selecting the most extreme statistics is equivalent to selecting the most extreme odds ratios. The value of the estimator, given the value of the statistical significance, depends on the standard error of the estimator and the power of the study. This report shows that when power is low, estimates of the odds ratio from a genome-wide association study, or any large-scale association study, will be upwardly biased. Genome-wide association studies are often underpowered given the low alpha levels required to declare statistical significance and the small individual genetic effects known to characterize complex traits. Factors such as low allele frequency, inadequate sample size and weak genetic effects contribute to large standard errors in the odds ratio estimates, low power and upwardly biased odds ratios. Studies that have high power to detect an association with the true odds ratio will have little or no bias, regardless of the statistical significance threshold. The results have implications for the interpretation of genome-wide association analysis and the planning of subsequent confirmatory stages.
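A short simulation of the bias described (the winner's curse): repeat an underpowered case-control analysis of a single SNP many times, keep only the replicates that reach genome-wide significance, and compare the average of the selected odds-ratio estimates with the true value. The allele frequency, odds ratio, and sample size are arbitrary illustrative choices.

```python
# Winner's curse: odds-ratio estimates conditioned on reaching significance are
# biased upward when power is low.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
n_cases = n_controls = 2000
p0, true_or, alpha = 0.2, 1.15, 5e-8
p1 = true_or * p0 / (1 - p0 + true_or * p0)        # case allele frequency

est, sig = [], []
for _ in range(20_000):
    a = rng.binomial(2 * n_cases, p1)              # risk alleles in cases
    b = rng.binomial(2 * n_controls, p0)           # risk alleles in controls
    table = np.array([[a, 2 * n_cases - a], [b, 2 * n_controls - b]], float) + 0.5
    log_or = np.log(table[0, 0] * table[1, 1] / (table[0, 1] * table[1, 0]))
    se = np.sqrt((1.0 / table).sum())
    est.append(np.exp(log_or))
    sig.append(2 * norm.sf(abs(log_or) / se) < alpha)

est, sig = np.array(est), np.array(sig)
print(est.mean(), est[sig].mean() if sig.any() else "no hits")
# The unconditional mean sits near the true odds ratio of 1.15; the mean among
# the 'genome-wide significant' replicates is substantially larger.
```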

17.
It is increasingly recognized that multiple genetic variants, within the same or different genes, combine to affect liability for many common diseases. Indeed, the variants may interact among themselves and with environmental factors. Thus realistic genetic/statistical models can include an extremely large number of parameters, and it is by no means obvious how to find the variants contributing to liability. For models of multiple candidate genes and their interactions, we prove that statistical inference can be based on controlling the false discovery rate (FDR), which is defined as the expected proportion of rejections that are false. Controlling the FDR automatically controls the overall error rate in the special case that all the null hypotheses are true. So do more standard methods such as Bonferroni correction. However, when some null hypotheses are false, the goals of Bonferroni and FDR differ, and FDR will have better power. Model selection procedures, such as forward stepwise regression, are often used to choose important predictors for complex models. By analysis of simulations of such models, we compare a computationally efficient form of forward stepwise regression against the FDR methods. We show that model selection includes numerous genetic variants having no impact on the trait, whereas FDR maintains a false-positive rate very close to the nominal rate. With good control over false positives and better power than Bonferroni, the FDR-based methods we introduce present a viable means of evaluating complex, multivariate genetic models. Naturally, as for any method seeking to explore complex genetic models, the power of the methods is limited by sample size and model complexity.

18.
When many correlated traits are measured, the potential exists to discover the coordinated control of these traits via genotyped polymorphisms. A common statistical approach to this problem involves assessing the relationship between each phenotype and each single nucleotide polymorphism (SNP) individually (PHN) and applying a Bonferroni correction for the effective number of independent tests conducted. Alternatively, one can apply a dimension reduction technique, such as estimation of principal components, and test for an association with the principal components of the phenotypes (PCP) rather than the individual phenotypes. Building on the work of Lange and colleagues, we develop an alternative method based on the principal component of heritability (PCH). For each SNP the PCH approach reduces the phenotypes to a single trait that has a higher heritability than any other linear combination of the phenotypes. As a result, the association between a SNP and derived trait is often easier to detect than an association with any of the individual phenotypes or the PCP. When applied to unrelated subjects, PCH has a drawback. For each SNP it is necessary to estimate the vector of loadings that maximize the heritability over all phenotypes. We develop a method of iterated sample splitting that uses one portion of the data for training and the remainder for testing. This cross-validation approach maintains type I error control and yet utilizes the data efficiently, resulting in a powerful test for association.

19.
With reductions in genotyping costs and the fast pace of improvements in genotyping technology, it is not uncommon for the individuals in a single study to undergo genotyping using several different platforms, where each platform may contain different numbers of markers selected via different criteria. For example, a set of cases and controls may be genotyped at markers in a small set of carefully selected candidate genes, and shortly thereafter, the same cases and controls may be used for a genome-wide single nucleotide polymorphism (SNP) association study. After such initial investigations, often, a subset of "interesting" markers is selected for validation or replication. Specifically, by validation, we refer to the investigation of associations between the selected subset of markers and the disease in independent data. However, it is not obvious how to choose the best set of markers for this validation. There may be a prior expectation that some sets of genotyping data are more likely to contain real associations. For example, it may be more likely for markers in plausible candidate genes to show disease associations than markers in a genome-wide scan. Hence, it would be desirable to select proportionally more markers from the candidate gene set. When a fixed number of markers are selected for validation, we propose an approach for identifying an optimal marker-selection configuration by basing the approach on minimizing the stratified false discovery rate. We illustrate this approach using a case-control study of colorectal cancer from Ontario, Canada, and we show that this approach leads to substantial reductions in the estimated false discovery rates in the Ontario dataset for the selected markers, as well as reductions in the expected false discovery rates for the proposed validation dataset.

20.
We study the link between two quality measures of SNP (single nucleotide polymorphism) data in genome‐wide association (GWA) studies, that is, per SNP call rates (CR) and p‐values for testing Hardy–Weinberg equilibrium (HWE). The aim is to improve these measures by applying methods based on realized randomized p‐values, the false discovery rate and estimates for the proportion of false hypotheses. While exact non‐randomized conditional p‐values for testing HWE cannot be recommended for estimating the proportion of false hypotheses, their realized randomized counterparts should be used. P‐values corresponding to the asymptotic unconditional chi‐square test lead to reasonable estimates only if SNPs with low minor allele frequency are excluded. We provide an algorithm to compute the probability that SNPs violate HWE given the observed CR, which yields an improved measure of data quality. The proposed methods are applied to SNP data from the KORA (Cooperative Health Research in the Region of Augsburg, Southern Germany) 500 K project, a GWA study in a population‐based sample genotyped by Affymetrix GeneChip 500 K arrays using the calling algorithm BRLMM 1.4.0. We show that all SNPs with CR = 100 per cent are nearly in perfect HWE, which suggests that, at least for these SNPs, the population meets the conditions required for HWE. Moreover, we show that the proportion of SNPs not in HWE increases with decreasing CR. We conclude that using a single threshold for judging HWE p‐values without taking the CR into account is problematic. Instead, we recommend a stratified analysis with respect to CR. Copyright © 2010 John Wiley & Sons, Ltd.
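A sketch of the basic ingredient the abstract builds on: a realized randomized p-value for the exact HWE test. For a discrete test, the randomized p-value takes the probability of strictly more extreme outcomes plus a Uniform(0,1) share of the boundary probability, which makes the p-value exactly uniform under the null. The genotype counts below are illustrative, and this is only the single-SNP ingredient, not the authors' full algorithm relating HWE to call rate.

```python
# Realized randomized p-value for the exact Hardy-Weinberg equilibrium test.
import math
import numpy as np

def hwe_het_pmf(n, n_a):
    """Exact null distribution of the heterozygote count given n individuals
    and n_a copies of the A allele (conditional on the allele counts)."""
    n_b = 2 * n - n_a
    hs = np.arange(n_a % 2, min(n_a, n_b) + 1, 2)          # feasible het counts
    def logp(h):
        n_aa, n_bb = (n_a - h) // 2, (n_b - h) // 2
        return (math.lgamma(n + 1) - math.lgamma(n_aa + 1) - math.lgamma(h + 1)
                - math.lgamma(n_bb + 1) + h * math.log(2)
                + math.lgamma(n_a + 1) + math.lgamma(n_b + 1) - math.lgamma(2 * n + 1))
    probs = np.exp([logp(h) for h in hs])
    return hs, probs / probs.sum()                          # renormalize away rounding

def randomized_hwe_p(n_aa, n_ab, n_bb, rng):
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab
    hs, probs = hwe_het_pmf(n, n_a)
    p_obs = probs[hs == n_ab][0]
    # 'More extreme' = outcomes with smaller null probability (standard exact test);
    # the boundary mass gets a Uniform(0,1) share, making the p-value exactly uniform.
    return probs[probs < p_obs].sum() + rng.uniform() * probs[probs == p_obs].sum()

rng = np.random.default_rng(10)
print(randomized_hwe_p(n_aa=908, n_ab=86, n_bb=6, rng=rng))   # illustrative genotype counts
```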
