Similar Articles
20 similar articles retrieved.
1.
We examine the measurement properties of pooled DNA odds ratio estimates for 7,357 single nucleotide polymorphisms (SNPs) genotyped in a genome‐wide association study of postmenopausal breast cancer. This study involved DNA pools formed from 125 cases or 125 matched controls. Individual genotyping for these SNPs subsequently became available for a substantial majority of women included in seven pool pairs, providing the opportunity for a comparison of pooled DNA and individual odds ratio estimates and their variances. We find that the “per minor allele” odds ratio estimates from the pooled DNA comparisons agree fairly well with those from individual genotyping. Furthermore, the log‐odds ratio variance estimates support a pooled DNA measurement model that we previously described, although with somewhat greater extra‐binomial variation than was hypothesized in project design. Implications for the role of pooled DNA comparisons in the future genetic epidemiology research agenda are discussed. Genet. Epidemiol. 34: 603–612, 2010. © 2010 Wiley‐Liss, Inc.
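For orientation, a generic form of a "per minor allele" odds ratio built from pooled allele-frequency estimates is shown below, with a delta-method log-variance that carries an assumed extra-binomial (measurement) component per pool. This is a textbook-style sketch, not necessarily the measurement model the authors reference; the symbol τ² is an assumption introduced here.

```latex
\widehat{\mathrm{OR}} = \frac{\hat p_1\,(1-\hat p_0)}{\hat p_0\,(1-\hat p_1)},
\qquad
\widehat{\operatorname{Var}}\bigl(\log \widehat{\mathrm{OR}}\bigr)
\approx \sum_{g \in \{0,1\}} \frac{1}{\bigl[\hat p_g (1-\hat p_g)\bigr]^2}
\left( \frac{\hat p_g (1-\hat p_g)}{2 n_g} + \tau^2 \right),
```

where \(\hat p_1, \hat p_0\) are the pooled minor-allele frequency estimates in cases and controls, \(n_1, n_0\) are the numbers of individuals per pool, and \(\tau^2\) is an assumed extra-binomial (measurement) variance per pool on the allele-frequency scale.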

2.
Next‐generation DNA sequencing technologies are facilitating large‐scale association studies of rare genetic variants. The depth of the sequence read coverage is an important experimental variable in the next‐generation technologies and it is a major determinant of the quality of genotype calls generated from sequence data. When case and control samples are sequenced separately or in different proportions across batches, they are unlikely to be matched on sequencing read depth and a differential misclassification of genotypes can result, causing confounding and an increased false‐positive rate. Data from Pilot Study 3 of the 1000 Genomes project were used to demonstrate that a difference between the mean sequencing read depth of case and control samples can result in false‐positive association for rare and uncommon variants, even when the mean coverage depth exceeds 30× in both groups. The degree of the confounding and inflation in the false‐positive rate depended on the extent to which the mean depth was different in the case and control groups. A logistic regression model was used to test for association between case‐control status and the cumulative number of alleles in a collapsed set of rare and uncommon variants. Including each individual's mean sequence read depth across the variant sites in the logistic regression model nearly eliminated the confounding effect and the inflated false‐positive rate. Furthermore, accounting for the potential error by modeling the probability of the heterozygote genotype calls in the regression analysis had a relatively minor but beneficial effect on the statistical results. Genet. Epidemiol. 35: 261‐268, 2011. © 2011 Wiley‐Liss, Inc.
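A minimal sketch of the depth adjustment described in this abstract: a logistic regression of case status on the collapsed rare-allele count, with and without each subject's mean read depth as a covariate. The simulated data, array names, and parameter values are illustrative assumptions, not the 1000 Genomes Pilot Study data or the authors' code.

```python
# Sketch: adjust a collapsed rare-variant test for mean sequencing depth.
# Illustrative simulation only (not the 1000 Genomes data or the authors' code).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
# Cases and controls sequenced at different mean depths (assumed values).
depth = np.r_[rng.normal(30, 4, n // 2), rng.normal(40, 4, n // 2)]
status = np.r_[np.ones(n // 2), np.zeros(n // 2)]
# Under the null the true rare-allele burden is unrelated to status,
# but shallower coverage tends to miss heterozygote calls.
true_burden = rng.poisson(1.0, n)
called_burden = rng.binomial(true_burden, np.clip(depth / 45, 0, 1))

# Unadjusted collapsed test (prone to depth-driven confounding).
X0 = sm.add_constant(called_burden.astype(float))
unadj = sm.Logit(status, X0).fit(disp=0)

# Adjusted test: include mean read depth across variant sites as a covariate.
X1 = sm.add_constant(np.column_stack([called_burden, depth]).astype(float))
adj = sm.Logit(status, X1).fit(disp=0)

print("burden p-value, unadjusted:", unadj.pvalues[1])
print("burden p-value, depth-adjusted:", adj.pvalues[1])
```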

3.
Power and sample size for DNA microarray studies
A microarray study aims at having a high probability of declaring genes to be differentially expressed if they are truly differentially expressed, while keeping the probability of making false declarations of expression acceptably low. Thus, in formal terms, well-designed microarray studies will have high power while controlling type I error risk. Achieving this objective is the purpose of this paper. Here, we discuss conceptual issues and present computational methods for statistical power and sample size in microarray studies, taking account of the multiple testing that is generic to these studies. The discussion encompasses choices of experimental design and replication for a study. Practical examples are used to demonstrate the methods. The examples show forcefully that replication of a microarray experiment can yield large increases in statistical power. The paper refers to cDNA arrays in the discussion and illustrations but the proposed methodology is equally applicable to expression data from oligonucleotide arrays.
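As a hedged illustration of the kind of calculation involved (a generic two-sample normal-approximation formula with a Bonferroni-adjusted significance level, not the paper's exact procedure), the number of replicate arrays per group needed for a target per-gene power can be sketched as follows; the function name and default parameter values are assumptions.

```python
# Sketch: replicates per group for a two-sample comparison of log-expression,
# using a normal approximation and a Bonferroni-adjusted alpha.
# Generic illustrative calculation, not the paper's exact method.
from math import ceil
from scipy.stats import norm

def arrays_per_group(delta, sigma, n_genes=10_000, fwer=0.05, power=0.90):
    """delta: smallest log2 fold change to detect; sigma: per-gene SD of log2 expression."""
    alpha = fwer / n_genes                 # Bonferroni adjustment for multiple testing
    z_a = norm.ppf(1 - alpha / 2)          # two-sided critical value
    z_b = norm.ppf(power)
    return ceil(2 * ((z_a + z_b) * sigma / delta) ** 2)

# Example: detect a 2-fold change (delta = 1 on the log2 scale) with SD 0.7.
print(arrays_per_group(delta=1.0, sigma=0.7))
```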

4.
Large collections of individuals are required to investigate the association of commonly occurring genetic variation with disease. The laboratory assessment of one form of variation, single nucleotide polymorphisms, is costly in time and DNA. Robust statistical approaches are developed to allow the successful implementation of a recently described laboratory method for rapidly estimating allele frequency using pools of DNA. A substantial reduction in Type I error is demonstrated using simulation, through the incorporation of measurement error into confidence limits for a case-control study, illustrated on a case-control study of acute leukaemia in adults. A method for creating multiple sub-pools is described which will allow large studies, such as the proposed U.K. Biobank, to take advantage of this method. Furthermore, a set-based logistic regression is presented which allows the investigation of joint effects and interactions with other genes or environmental factors.

5.
In case-control studies, subjects in the case group may be recruited from suspected patients who are diagnosed positively with disease. While many statistical methods have been developed for measurement error or misclassification of exposure variables in epidemiological studies, no studies have been reported on the effect of errors in diagnosing disease on testing genetic association in case-control studies. We study the impact of using the original Cochran-Armitage trend test assuming no diagnostic error when, in fact, cases and controls may be clinically diagnosed by an imperfect gold standard or a reference test. The type I error, sample size and asymptotic power of trend tests are examined under a family of genetic models in the presence of diagnostic error. The empirical powers of the trend tests are also compared by simulation studies under various genetic models.
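For reference, the error-free Cochran-Armitage trend test that this abstract takes as its starting point can be computed from a 2×3 case-control genotype table. The sketch below uses the standard asymptotic chi-square form with additive scores (0, 1, 2); the example counts are illustrative assumptions.

```python
# Sketch: Cochran-Armitage trend test for a 2 x 3 genotype table
# (standard asymptotic form; additive scores assumed).
import numpy as np
from scipy.stats import chi2

def catt(cases, controls, scores=(0, 1, 2)):
    """cases/controls: counts per genotype (aa, aA, AA). Returns (chi2_1df, p)."""
    r = np.asarray(cases, float)          # cases per genotype
    n = r + np.asarray(controls, float)   # total per genotype
    R, N = r.sum(), n.sum()
    x = np.asarray(scores, float)
    num = N * (N * (x * r).sum() - R * (x * n).sum()) ** 2
    den = R * (N - R) * (N * (x * x * n).sum() - (x * n).sum() ** 2)
    stat = num / den
    return stat, chi2.sf(stat, df=1)

print(catt(cases=[30, 60, 40], controls=[50, 55, 25]))  # illustrative counts
```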

6.
Using genome-wide association studies to identify genetic variants contributing to disease has been highly successful, with many novel genetic predispositions identified and biological pathways revealed. Several pitfalls leading to spurious association or non-replication have been highlighted, ranging from population structure and automated genotype scoring for cases and controls to age-varying association. We describe an important yet unreported source of bias in case-control studies due to variations in chip technology between different commercial array releases. As cases are commonly genotyped with newer arrays and freely available control resources are frequently used for comparison, there exists an important potential for false associations which are robust to standard quality control and replication design.

7.
With recent advances in genomewide microarray technologies, whole-genome association (WGA) studies have aimed at identifying susceptibility genes for complex human diseases using hundreds of thousands of single nucleotide polymorphisms (SNPs) genotyped at the same time. In this context and to take into account multiple testing, false discovery rate (FDR)-based strategies are now used frequently. However, a critical aspect of these strategies is that they are applied to a collection or a family of hypotheses and, thus, critically depend on these precise hypotheses. We investigated how modifying the family of hypotheses to be tested affected the performance of FDR-based procedures in WGA studies. We showed that FDR-based procedures performed more poorly when excluding SNPs with high prior probability of being associated. Results of simulation studies mimicking WGA studies according to three scenarios are reported, and show the extent to which SNP elimination (family contraction) prior to the analysis impairs the performance of FDR-based procedures. To illustrate this situation, we used the data from a recent WGA study on type-1 diabetes (Clayton et al. [2005] Nat. Genet. 37:1243-1246) and report the results obtained when SNPs located inside the human leukocyte antigen region were excluded or retained. Based on our findings, excluding markers with high prior probability of being associated cannot be recommended for the analysis of WGA data with FDR-based strategies.
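As context for the family-dependence argument above, the Benjamini-Hochberg step-up procedure, one widely used FDR-based strategy, is easy to state; the sketch below simply shows that removing p-values from the family (family contraction) changes the rejection threshold and hence the rejections. The example p-values are illustrative assumptions.

```python
# Sketch: Benjamini-Hochberg step-up procedure at FDR level q,
# illustrating that the rejection set depends on the family of hypotheses.
import numpy as np

def bh_reject(pvals, q=0.05):
    p = np.asarray(pvals, float)
    m = p.size
    order = np.argsort(p)
    thresh = q * (np.arange(1, m + 1) / m)
    passed = np.nonzero(p[order] <= thresh)[0]
    k = passed.max() + 1 if passed.size else 0   # largest i with p_(i) <= q*i/m
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

pvals = [1e-6, 0.003, 0.004, 0.02, 0.2, 0.7]     # illustrative p-values
print(bh_reject(pvals).sum(), "rejections over the full family")
print(bh_reject(pvals[1:]).sum(), "rejections after removing one hypothesis")
```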

8.
Errors in genotyping can greatly affect family-based association studies. If a Mendelian inconsistency is detected, the family is usually removed from the analysis. This reduces power, and may introduce bias. In addition, a large proportion of genotyping errors remain undetected, and these also reduce power. We present a Bayesian framework for performing association studies with SNP data on samples of trios consisting of parents with an affected offspring, while allowing for the presence of both detectable and undetectable genotyping errors. This framework also allows for the inclusion of missing genotypes. Associations between the SNP and disease were modelled in terms of the genotypic relative risks. The performances of the analysis methods were investigated under a variety of models for disease association and genotype error, looking at both power to detect association and precision of genotypic relative risk estimates. As expected, power to detect association decreased as genotyping error probability increased. Importantly, however, analyses allowing for genotyping error had similar power to standard analyses when applied to data without genotyping error. Furthermore, allowing for genotyping error yielded relative risk estimates that were approximately unbiased, together with 95% credible intervals giving approximately correct coverage. The methods were also applied to a real dataset: a sample of schizophrenia cases and their parents genotyped at SNPs in the dysbindin gene. The analysis methods presented here require no prior information on the genotyping error probabilities, and may be fitted in WinBUGS.

9.
A key step in genomic studies is to assess high throughput measurements across millions of markers for each participant's DNA, either using microarrays or sequencing techniques. Accurate genotype calling is essential for downstream statistical analysis of genotype‐phenotype associations, and next generation sequencing (NGS) has recently become a more common approach in genomic studies. How the accuracy of variant calling in NGS‐based studies affects downstream association analysis has not, however, been studied using empirical data in which both microarrays and NGS were available. In this article, we investigate the impact of variant calling errors on the statistical power to identify associations between single nucleotides and disease, and on associations between multiple rare variants and disease. Both differential and nondifferential genotyping errors are considered. Our results show that the power of burden tests for rare variants is strongly influenced by the specificity in variant calling, but is rather robust with regard to sensitivity. By using the variant calling accuracies estimated from a substudy of a Cooperative Studies Program project conducted by the Department of Veterans Affairs, we show that the power of association tests is mostly retained with commonly adopted variant calling pipelines. An R package, GWAS.PC, is provided to accommodate power analysis that takes account of genotyping errors (http://zhaocenter.org/software/).

10.
Susceptibility to breast cancer is likely to be the result of susceptibility alleles in many different genes. In particular, one segregation analysis of breast cancer suggested that disease susceptibility in noncarriers of BRCA1/2 mutations may be explicable in terms of a polygenic model, with large numbers of susceptibility polymorphisms acting multiplicatively on risk. We considered the implications of such a model for the design of association studies to detect susceptibility polymorphisms, in particular the efficacy of utilizing cases with a family history of the disease, together with unrelated controls. Relative to a standard case-control association study with cases unselected for family history, the sample size required to detect a common disease susceptibility allele was typically reduced by more than twofold if cases with an affected first-degree relative were selected, and by more than fourfold if cases with two affected first-degree relatives were utilized. The relative efficiency obtained by using familial cases was greater for rarer alleles. Analysis of extended families indicated that the power was most dependent on the immediate (first-degree) family history. Bilateral cases may offer a similar gain in power to cases with two affected first-degree relatives. In contrast to the strong effect of family history, varying the ages at diagnosis of the cases across the range of 35-65 years did not strongly affect the power to detect association. These results indicate that association studies based on cases with a strong family history, identified for example through cancer genetics clinics, may be substantially more efficient than population-based studies.

11.
Association studies using DNA pools are in principle powerful and efficient to detect association between a marker allele and disease status, e.g., in a case-control design. A common observation with the use of DNA pools is that the two alleles at a polymorphic SNP locus are not amplified in equal amounts in heterozygous individuals. In addition, there are pool-specific experimental errors so that there is variation in the estimates of allele frequencies from different pools that are from the same individuals. As a result of these additional sources of variation, the outcome of an experiment is an estimated count of alleles rather than the usual outcome in terms of observed counts. In this study, we show analytically and by computer simulation that unequal amplification should be taken into account when testing for differences in allele frequencies between pools, and suggest a simple modification of the standard χ² test to control the type I error rate in the presence of experimental error variation. The impact of experimental errors on the power of association studies is shown.
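The paper's exact adjustment is not reproduced here; as a hedged sketch of the general idea, the variance of the difference between pooled allele-frequency estimates can be inflated by an assumed pool-level measurement-error term before forming the usual z (equivalently one-degree-of-freedom chi-square) statistic. The argument var_e and the example numbers below are assumptions introduced for illustration.

```python
# Sketch: test for an allele-frequency difference between a case pool and a
# control pool, inflating the binomial variance by an experimental error term.
# var_e is an assumed pool-level measurement-error variance, one term per pool.
from math import sqrt
from scipy.stats import norm

def pooled_freq_test(p_case, p_ctrl, n_case, n_ctrl, var_e=0.0):
    """p_*: estimated allele frequencies from the pools; n_*: individuals per pool."""
    p_bar = (p_case * n_case + p_ctrl * n_ctrl) / (n_case + n_ctrl)
    var = p_bar * (1 - p_bar) * (1 / (2 * n_case) + 1 / (2 * n_ctrl)) + 2 * var_e
    z = (p_case - p_ctrl) / sqrt(var)
    return z, 2 * norm.sf(abs(z))

print(pooled_freq_test(0.32, 0.27, 250, 250))              # ignoring measurement error
print(pooled_freq_test(0.32, 0.27, 250, 250, var_e=4e-4))  # allowing for it
```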

12.
A combination of common and rare variants is thought to contribute to genetic susceptibility to complex diseases. Recently, next‐generation sequencers have greatly lowered sequencing costs, providing an opportunity to identify rare disease variants in large genetic epidemiology studies. At present, it is still expensive and time-consuming to resequence large numbers of individual genomes. However, given that next‐generation sequencing technology can provide accurate estimates of allele frequencies from pooled DNA samples, it is possible to detect associations of rare variants using pooled DNA sequencing. Current statistical approaches to the analysis of associations with rare variants are not designed for use with pooled next‐generation sequencing data. Hence, they may not be optimal in terms of both validity and power. Therefore, we propose here a new statistical procedure to analyze the output of pooled sequencing data. The test statistic can be computed rapidly, making it feasible to test the association of a large number of variants with disease. By simulation, we compare this approach to Fisher's exact test based either on pooled or individual genotypic data. Our results demonstrate that the proposed method provides good control of the Type I error rate, while yielding substantially higher power than Fisher's exact test using pooled genotypic data for testing rare variants, and has similar or higher power than that of Fisher's exact test using individual genotypic data. Our results also provide guidelines on how various parameters of the pooled sequencing design affect the efficiency of detecting associations. Genet. Epidemiol. 34: 492–501, 2010. © 2010 Wiley‐Liss, Inc.

13.
Most previous sample size calculations for case-control studies to detect genetic associations with disease assumed that the disease gene locus is known, whereas, in fact, markers are used. We calculated sample sizes for unmatched case-control and sibling case-control studies to detect an association between a biallelic marker and a disease governed by a putative biallelic disease locus. Required sample sizes increase with increasing discrepancy between the marker and disease allele frequencies, and with less-than-maximal linkage disequilibrium between the marker and disease alleles. Qualitatively similar results were found for studies of parent-offspring triads based on the transmission disequilibrium test (Abel and Müller-Myhsok, 1998, Am. J. Hum. Genet. 63:664-667; Tu and Whittemore, 1999, Am. J. Hum. Genet. 64:641-649). We also studied other factors affecting required sample size, including attributable risk for the disease allele, inheritance mechanism, disease prevalence, and for sibling case-control designs, extragenetic familial aggregation of disease and recombination. The large sample-size requirements represent a formidable challenge to studies of this type.

14.
The purpose of this work is the development of linear trend tests that allow for error (LTT_ae), specifically incorporating double-sampling information on phenotypes and/or genotypes. We use a likelihood framework. Misclassification errors are estimated via double sampling. Unbiased estimates of penetrances and genotype frequencies are determined through application of the Expectation-Maximization algorithm. We perform simulation studies to evaluate false-positive rates for various genotype classification weights (recessive, dominant, additive). We compare simulated power between the LTT_ae and its genotypic test equivalent, the LRT_ae, in the presence of phenotype and genotype misclassification, to evaluate power gains of the LTT_ae for multi-locus haplotype association with a dominant mode of inheritance. Finally, we apply the LTT_ae and a method without double-sample information (LTT_std) to double-sampled phenotype data for an actual Alzheimer's disease (AD) case-control study with ApoE genotypes. Simulation results suggest that the LTT_ae maintains correct false-positive rates in the presence of misclassification. For power simulations, the LTT_ae method is at least as powerful as the LRT_ae method, with a maximum power gain of 0.42 over the LRT_ae method for certain parameter settings. For the AD data, the LTT_ae provides more significant evidence for association (permutation p=0.0522) than the LTT_std (permutation p=0.1684). This is due to observed phenotype misclassification. The LTT_ae statistic enables researchers to apply linear trend tests to case-control genetic data, increasing power to detect association in the presence of misclassification. If the disease mode of inheritance (MOI) is known, LTT_ae methods are usually more powerful because the statistic has fewer degrees of freedom.

15.
Tong T, Zhao H. Statistics in Medicine 2008, 27(11): 1960-1972.
One major goal in microarray studies is to identify genes having different expression levels across different classes/conditions. In order to achieve this goal, a study needs to have an adequate sample size to ensure the desired power. Owing to the importance of this topic, a number of approaches to sample size calculation have been developed. However, due to the cost and/or experimental difficulties in obtaining sufficient biological materials, it might be difficult to attain the required sample size. In this article, we address more practical questions for assessing power and false discovery rate (FDR) for a fixed sample size. The relationships between power, sample size and FDR are explored. We also conduct simulations and a real data study to evaluate the proposed findings.
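A textbook-style approximation (not necessarily the paper's exact formulation) makes the power/FDR trade-off at fixed sample size concrete: with a proportion \(\pi_0\) of truly null genes, per-gene significance level \(\alpha\), and average power \(1-\beta(n)\) at sample size \(n\),

```latex
\mathrm{FDR} \;\approx\; \frac{\pi_0\,\alpha}{\pi_0\,\alpha + (1-\pi_0)\,\bigl(1-\beta(n)\bigr)},
```

so for a fixed \(n\), tightening \(\alpha\) lowers the expected FDR at the cost of power, and relaxing it does the reverse.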

16.
Genome‐wide scans of nucleotide variation in human subjects are providing an increasing number of replicated associations with complex disease traits. Most of the variants detected have small effects and, collectively, they account for a small fraction of the total genetic variance. Very large sample sizes are required to identify and validate findings. In this situation, even small sources of systematic or random error can cause spurious results or obscure real effects. The need for careful attention to data quality has been appreciated for some time in this field, and a number of strategies for quality control and quality assurance (QC/QA) have been developed. Here we extend these methods and describe a system of QC/QA for genotypic data in genome‐wide association studies (GWAS). This system includes some new approaches that (1) combine analysis of allelic probe intensities and called genotypes to distinguish gender misidentification from sex chromosome aberrations, (2) detect autosomal chromosome aberrations that may affect genotype calling accuracy, (3) infer DNA sample quality from relatedness and allelic intensities, (4) use duplicate concordance to infer SNP quality, (5) detect genotyping artifacts from dependence of Hardy‐Weinberg equilibrium test P‐values on allelic frequency, and (6) demonstrate sensitivity of principal components analysis to SNP selection. The methods are illustrated with examples from the “Gene Environment Association Studies” (GENEVA) program. The results suggest several recommendations for QC/QA in the design and execution of GWAS. Genet. Epidemiol. 34: 591–602, 2010. © 2010 Wiley‐Liss, Inc.

17.
The potential for bias from population stratification (PS) has raised concerns about case-control studies involving admixed ethnicities. We evaluated the potential bias due to PS in relating a binary outcome with a candidate gene under simulated settings where study populations consist of multiple ethnicities. Disease risks were assigned within the range of prostate cancer rates of African Americans reported in SEER registries assuming k=2, 5, or 10 admixed ethnicities. Genotype frequencies were considered in the range of 5-95%. Under a model assuming no genotype effect on disease (odds ratio (OR)=1), the range of observed OR estimates ignoring ethnicity was 0.64-1.55 for k=2, 0.72-1.33 for k=5, and 0.81-1.22 for k=10. When genotype effect on disease was modeled to be OR=2, the ranges of observed OR estimates were 1.28-3.09, 1.43-2.65, and 1.62-2.42 for k=2, 5, and 10 ethnicities, respectively. Our results indicate that the magnitude of bias is small unless extreme differences exist in genotype frequency. Bias due to PS decreases as the number of admixed ethnicities increases. The biases are bounded by the minimum and maximum of all pairwise baseline disease odds ratios across ethnicities. Therefore, bias due to PS alone may be small when baseline risk differences are small within major categories of admixed ethnicity, such as African Americans.
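A small numerical sketch of the mechanism (the risks, genotype frequencies, and stratum sizes below are illustrative assumptions, not values from the SEER-based simulations): with two admixed ethnicities, no genotype effect within either stratum, and differing baseline risks and genotype frequencies, the crude odds ratio that ignores ethnicity is biased away from 1 but stays within the bound set by the pairwise baseline disease odds ratio.

```python
# Sketch: crude vs. within-stratum odds ratios under population stratification.
# All inputs are illustrative assumptions.
import numpy as np

# Two ethnic strata: (baseline disease risk, carrier genotype frequency, stratum size).
# The genotype has no effect within either stratum (within-stratum OR = 1).
strata = [(0.10, 0.60, 10_000), (0.03, 0.20, 10_000)]

counts = np.zeros((2, 2))  # rows: genotype carrier yes/no, cols: case/control
for risk, g_freq, n in strata:
    carriers = n * g_freq
    noncarr = n - carriers
    counts += [[carriers * risk, carriers * (1 - risk)],
               [noncarr * risk, noncarr * (1 - risk)]]

crude_or = (counts[0, 0] * counts[1, 1]) / (counts[0, 1] * counts[1, 0])
baseline_or = (0.10 / 0.90) / (0.03 / 0.97)  # pairwise baseline disease odds ratio
print(f"crude OR ignoring ethnicity: {crude_or:.2f} (true within-stratum OR = 1)")
print(f"bound from the baseline disease odds ratio between strata: {baseline_or:.2f}")
```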

18.
The multiplicity problem has become increasingly important in genetic studies as the capacity for high-throughput genotyping has increased. The control of False Discovery Rate (FDR) (Benjamini and Hochberg [1995] J. R. Stat. Soc. Ser. B 57:289-300) has been adopted to address the problems of false positive control and low power inherent in high-volume genome-wide linkage and association studies. In many genetic studies, there is often a natural stratification of the m hypotheses to be tested. Given the FDR framework and the presence of such stratification, we investigate the performance of a stratified false discovery control approach (i.e. control or estimate FDR separately for each stratum) and compare it to the aggregated method (i.e. consider all hypotheses in a single stratum). Under the fixed rejection region framework (i.e. reject all hypotheses with unadjusted p-values less than a pre-specified level and then estimate FDR), we demonstrate that the aggregated FDR is a weighted average of the stratum-specific FDRs. Under the fixed FDR framework (i.e. reject as many hypotheses as possible and meanwhile control FDR at a pre-specified level), we specify a condition necessary for the expected total number of true positives under the stratified FDR method to be equal to or greater than that obtained from the aggregated FDR method. Application to a recent Genome-Wide Association (GWA) study by Maraganore et al. ([2005] Am. J. Hum. Genet. 77:685-693) illustrates the potential advantages of control or estimation of FDR by stratum. Our analyses also show that controlling FDR at a low rate, e.g. 5% or 10%, may not be feasible for some GWA studies.
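Under the fixed rejection region framework, the weighted-average relationship stated above can be written generically; the notation and the choice of weights below (proportional to the number of rejections per stratum) are a plausible reading rather than a quotation from the paper. If stratum \(s\) contributes \(R_s\) rejections with stratum-specific false discovery rate \(\mathrm{FDR}_s\), then

```latex
\mathrm{FDR}_{\text{agg}} \;=\; \sum_{s} w_s\,\mathrm{FDR}_s,
\qquad
w_s \;=\; \frac{R_s}{\sum_{s'} R_{s'}},
```

i.e. the aggregated FDR is the rejection-weighted average of the stratum-specific FDRs (in expectation, the weights become the expected numbers of rejections per stratum).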

19.
Several groups have developed methods for estimating allele frequencies in DNA pools as a fast and cheap way for detecting allelic association between genetic markers and disease. To obtain accurate estimates of allele frequencies, a correction factor k for the degree to which measurement of allele-specific products is biased is generally applied. Factor k is usually obtained as the ratio of the two allele-specific signals in samples from heterozygous individuals, a step that can significantly impair throughput and increase cost. We have systematically investigated the properties of k through the use of empirical and simulated data. We show that for the dye terminator primer extension genotyping method we have applied, the correction factor k is substantially influenced by the dye terminators incorporated, but also by the terminal 3' base of the extension primer. We also show that the variation in k is large enough to result in unacceptable error rates if association studies are conducted without regard to k. We show that the impact of ignoring k can be neutralized by applying a correction factor k_max that can be easily derived, but at the potential cost of an increase in type I error. Finally, based upon observed distributions for k we derive a method allowing the estimation of the probability that pooled data reflect significant differences in the allele frequencies between the subjects comprising the pools. By controlling the error rates in the absence of knowledge of the appropriate SNP-specific correction factors, each approach enhances the performance of DNA pooling, while considerably streamlining the method by reducing time and cost.
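As background for the correction factor discussed above, a commonly used form estimates k as the mean ratio of the two allele-specific signals in individually genotyped heterozygotes and then rescales the pooled signals; the corrected estimator shown below is that common form, and the exact estimator used in this paper may differ. The function names and example signal values are assumptions.

```python
# Sketch: k-corrected allele-frequency estimate from pooled allele-specific signals.
# Commonly used corrected estimator; the paper's exact estimator may differ.
import numpy as np

def estimate_k(het_signal_a, het_signal_b):
    """Mean ratio of allele-A to allele-B signal in known heterozygotes."""
    return float(np.mean(np.asarray(het_signal_a) / np.asarray(het_signal_b)))

def pooled_allele_freq(pool_signal_a, pool_signal_b, k=1.0):
    """Estimated frequency of allele A in the pool, corrected by k."""
    return pool_signal_a / (pool_signal_a + k * pool_signal_b)

k = estimate_k([1.30, 1.22, 1.41], [1.00, 0.95, 1.10])  # heterozygote measurements (illustrative)
print(pooled_allele_freq(5200.0, 4100.0))        # uncorrected (k = 1)
print(pooled_allele_freq(5200.0, 4100.0, k=k))   # k-corrected estimate
```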

20.
OBJECTIVES: Genotyping errors can induce biases in frequency estimates for haplotypes of single nucleotide polymorphisms (SNPs). Here, we considered the impact of SNP allele misclassification on haplotype odds ratio estimates from case-control studies of unrelated individuals. METHODS: We calculated bias analytically, using the haplotype counts expected in cases and controls under genotype misclassification. We evaluated the bias due to allele misclassification across a range of haplotype distributions using empirical haplotype frequencies within blocks of limited haplotype diversity. We also considered simple two- and three-locus haplotype distributions to understand the impact of haplotype frequency and number of SNPs on misclassification bias. RESULTS: We found that for common haplotypes (>5% frequency), realistic genotyping error rates (0.1-1% chance of miscalling an allele), and moderate relative risks (2-4), the bias was always towards the null and increased in magnitude with increasing error rate and increasing odds ratio. For common haplotypes, bias generally increased with increasing haplotype frequency, while for rare haplotypes, bias generally increased with decreasing frequency. When the chance of miscalling an allele is 0.5%, the median bias in haplotype-specific odds ratios for common haplotypes was generally small (<4% on the log odds ratio scale), but the bias for some individual haplotypes was larger (10-20%). Bias towards the null leads to a loss in power; the relative efficiency using a test statistic based upon misclassified haplotype data compared to a test based on the unobserved true haplotypes ranged from roughly 60% to 80%, and worsened with increasing haplotype frequency. CONCLUSIONS: The cumulative effect of small allele-calling errors across multiple loci can induce noticeable bias and reduce power in realistic scenarios. This has implications for the design of candidate gene association studies that utilize multi-marker haplotypes.
