Similar Articles
20 similar articles found.
1.
Most common hereditary diseases in humans are complex and multifactorial. Large-scale genome-wide association studies based on SNP genotyping have only identified a small fraction of the heritable variation of these diseases. One explanation may be that many rare variants (minor allele frequency, MAF < 5%), which are not included in the common genotyping platforms, may contribute substantially to the genetic variation of these diseases. Next-generation sequencing, which allows the analysis of rare variants, is now becoming so cheap that it provides a viable alternative to SNP genotyping. In this paper, we present cost-effective protocols for using next-generation sequencing in association mapping studies based on pooled and un-pooled samples, and identify optimal designs with respect to the total number of individuals, the number of individuals per pool, and the sequencing coverage. We perform a small empirical study to evaluate the pooling variance in a realistic setting where pooling is combined with exon capture. To test for associations, we develop a likelihood ratio statistic that accounts for the high error rate of next-generation sequencing data. We also perform extensive simulations to determine the power and accuracy of this method. Overall, our findings suggest that, for a fixed cost, sequencing many individuals at a shallower depth with a larger pool size achieves higher power than sequencing a small number of individuals at higher depth with a smaller pool size, even in the presence of high error rates. Our results provide guidelines for researchers who are developing association mapping studies based on next-generation sequencing. Genet. Epidemiol. 34: 479–491, 2010. © 2010 Wiley-Liss, Inc.
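A minimal sketch of such an error-aware likelihood ratio test (not the authors' implementation): minor-allele reads from a pool are modeled as binomial with success probability p(1 - e) + (1 - p)e, where p is the pool allele frequency and e an assumed per-read error rate, and a shared-frequency null is compared against separate case and control frequencies.

```python
# Sketch of a pooled-sequencing likelihood ratio test with a known
# per-read error rate e; illustrative, not the paper's exact model.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom, chi2

def read_loglik(k, n, p, e):
    """Log-likelihood of k minor-allele reads out of n, pool frequency p."""
    q = p * (1 - e) + (1 - p) * e   # error-perturbed minor-read probability
    return binom.logpmf(k, n, q)

def mle_loglik(k, n, e):
    """Profile the pool allele frequency over [0, 1]."""
    res = minimize_scalar(lambda p: -read_loglik(k, n, p, e),
                          bounds=(0.0, 1.0), method="bounded")
    return -res.fun

def pooled_lrt(k_case, n_case, k_ctrl, n_ctrl, e=0.01):
    """LRT: H0 shared allele frequency vs H1 separate case/control frequencies."""
    l1 = mle_loglik(k_case, n_case, e) + mle_loglik(k_ctrl, n_ctrl, e)
    res0 = minimize_scalar(
        lambda p: -(read_loglik(k_case, n_case, p, e)
                    + read_loglik(k_ctrl, n_ctrl, p, e)),
        bounds=(0.0, 1.0), method="bounded")
    stat = 2 * (l1 - (-res0.fun))
    return stat, chi2.sf(stat, df=1)

# Example: ~30x coverage of a site in pools of 100 case and 100 control chromosomes.
stat, pval = pooled_lrt(k_case=160, n_case=3000, k_ctrl=95, n_ctrl=3000)
print(f"LRT = {stat:.2f}, p = {pval:.3g}")
```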

2.
Traditional reviews, meta-analyses and pooled analyses in epidemiology
BACKGROUND: The use of review articles and meta-analysis has become an important part of epidemiological research, mainly for reconciling previously conducted studies with inconsistent results. Numerous methodologic issues, particularly with respect to biases and the use of meta-analysis, remain controversial. METHODS: Four methods for summarizing data from epidemiological studies are described. The rationale for meta-analysis and the statistical methods used are outlined. The strengths and limitations of these methods are compared, particularly with respect to their ability to investigate heterogeneity between studies and to provide quantitative risk estimation. RESULTS: Meta-analyses of published data are in general insufficient for calculating a pooled estimate, since published estimates are based on heterogeneous populations, different study designs, and, above all, different statistical models. More reliable results can be expected if individual data are available for a pooled analysis, although some heterogeneity still remains. Large, prospectively planned meta-analyses of multicentre studies would be preferable for investigating small risk factors; however, this type of meta-analysis is expensive and time-consuming. CONCLUSION: For a full assessment of risk factors with a high prevalence in the general population, pooling of data will become increasingly important. Future research needs to focus on the deficiencies of review methods, in particular the errors and biases that can be produced when studies that have used different designs, methods, and analytic models are combined.
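As a worked illustration of the pooled-estimate arithmetic discussed above, the sketch below performs a basic fixed-effect inverse-variance meta-analysis on the log odds ratio scale, with Cochran's Q as a simple heterogeneity check; the study estimates are invented.

```python
# Fixed-effect inverse-variance meta-analysis on the log-odds-ratio scale,
# with Cochran's Q as a heterogeneity check. Inputs are illustrative.
import numpy as np
from scipy.stats import norm, chi2

log_or = np.log([1.3, 1.1, 1.6, 0.9])        # per-study odds ratios
se     = np.array([0.12, 0.20, 0.25, 0.15])  # their standard errors

w         = 1.0 / se**2                      # inverse-variance weights
pooled    = np.sum(w * log_or) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))
ci        = pooled + np.array([-1, 1]) * norm.ppf(0.975) * pooled_se

Q  = np.sum(w * (log_or - pooled)**2)        # Cochran's Q statistic
pQ = chi2.sf(Q, df=len(log_or) - 1)

print(f"pooled OR = {np.exp(pooled):.2f} "
      f"(95% CI {np.exp(ci[0]):.2f}-{np.exp(ci[1]):.2f}), Q p = {pQ:.2f}")
```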

3.
A combination of common and rare variants is thought to contribute to genetic susceptibility to complex diseases. Recently, next-generation sequencers have greatly lowered sequencing costs, providing an opportunity to identify rare disease variants in large genetic epidemiology studies. At present, it is still expensive and time-consuming to resequence large numbers of individual genomes. However, given that next-generation sequencing technology can provide accurate estimates of allele frequencies from pooled DNA samples, it is possible to detect associations of rare variants using pooled DNA sequencing. Current statistical approaches to the analysis of associations with rare variants are not designed for use with pooled next-generation sequencing data and hence may not be optimal in terms of both validity and power. Therefore, we propose here a new statistical procedure to analyze the output of pooled sequencing data. The test statistic can be computed rapidly, making it feasible to test the association of a large number of variants with disease. By simulation, we compare this approach to Fisher's exact test based either on pooled or individual genotypic data. Our results demonstrate that the proposed method provides good control of the Type I error rate, while yielding substantially higher power than Fisher's exact test using pooled genotypic data for testing rare variants, and similar or higher power than Fisher's exact test using individual genotypic data. Our results also provide guidelines on how various parameters of the pooled sequencing design affect the efficiency of detecting associations. Genet. Epidemiol. 34: 492–501, 2010. © 2010 Wiley-Liss, Inc.
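The comparator in the abstract, Fisher's exact test on allele counts recovered from pools, is simple to reproduce; the counts below are illustrative.

```python
# Fisher's exact test on allele counts recovered from pooled sequencing
# (e.g., pool frequency estimates multiplied by chromosomes per pool).
from scipy.stats import fisher_exact

table = [[12, 1988],   # cases:    minor-allele, major-allele counts
         [3,  1997]]   # controls: minor-allele, major-allele counts
odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(f"OR = {odds_ratio:.2f}, p = {p:.3g}")
```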

4.
Additive measurement error and pooling design are, on the face of it, two different issues, each of which has been dealt with extensively and separately in the biostatistics literature. However, both correspond to the problem of reconstructing the distribution of a biomarker summand from the distribution of the convolved observations. We therefore combine the two issues into a single stated problem. The integrated approach opens new areas of investigation, e.g., pooling error, i.e., pooled data affected by measurement error. Specifically, we consider the stated problem in the context of receiver operating characteristic (ROC) curve analysis, a well-accepted tool for evaluating the ability of a biomarker to discriminate between two populations. The paper considers a wide family of biospecimen distributions; the assumptions imposed on the biomarker distribution functions are mainly dictated by the reconstruction problem. We propose and examine maximum likelihood techniques based on the following data: a biomarker with measurement error; pooled samples; and pooled samples with measurement error. The resulting methods are illustrated by applications to real-data studies.
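A small sketch of the central attenuation effect under a binormal model: the AUC is Phi((mu1 - mu0)/sqrt(s0^2 + s1^2)), and independent additive measurement error with variance s_e^2 enters both group variances, pulling the AUC toward 0.5. This is a textbook simplification, not the paper's likelihood machinery.

```python
# Attenuation of a binormal ROC AUC by additive measurement error.
# Independent error variance s_err^2 is added to each group's variance.
import numpy as np
from scipy.stats import norm

def binormal_auc(mu0, s0, mu1, s1, s_err=0.0):
    return norm.cdf((mu1 - mu0) / np.sqrt(s0**2 + s1**2 + 2 * s_err**2))

print(binormal_auc(0.0, 1.0, 1.0, 1.0))            # error-free AUC ~ 0.76
print(binormal_auc(0.0, 1.0, 1.0, 1.0, s_err=1.0)) # attenuated toward 0.5
```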

5.
Case-control association studies using unrelated individuals may offer an effective approach for identifying genetic variants that have small to moderate disease risks. In general, two different strategies may be employed to establish associations between genotypes and phenotypes: (1) collecting individual genotypes or (2) quantifying allele frequencies in DNA pools. These two technologies have their respective advantages. Individual genotyping gathers more information, whereas DNA pooling may be more cost effective. Recent technological advances in DNA pooling have generated great interest in using DNA pooling in association studies. In this article, we investigate the impacts of errors in genotyping or measuring allele frequencies on the identification of genetic associations with these two strategies. We find that, with current technologies, compared to individual genotyping, a larger sample is generally required to achieve the same power using DNA pooling. We further consider the use of DNA pooling as a screening tool to identify candidate regions for follow-up studies. We find that the majority of the positive regions identified from DNA pooling results may represent false positives if measurement errors are not appropriately considered in the design of the study.
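A back-of-envelope version of the sample-size comparison, assuming a two-proportion test of allele frequencies in which pooling-related measurement error simply inflates the variance; the inflation term and constants are illustrative assumptions, not the paper's model.

```python
# Required sample size (chromosomes per group) to detect an allele-frequency
# difference, with an assumed extra variance term for pooled measurement error.
import numpy as np
from scipy.stats import norm

def n_per_group(p1, p2, alpha=5e-8, power=0.8, extra_var=0.0):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2) + 2 * extra_var
    return int(np.ceil(z**2 * var / (p1 - p2)**2))

print(n_per_group(0.30, 0.25))                  # individual genotyping
print(n_per_group(0.30, 0.25, extra_var=0.05))  # pooling measurement error added
```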

6.
With its potential to discover a much greater amount of genetic variation, next-generation sequencing is fast becoming an essential tool for genetic association studies. However, the cost of sequencing all individuals in a large-scale population study is still high compared with most alternative genotyping options. While the ability to identify individual-level data is lost (without bar-coding), sequencing pooled samples can substantially lower costs without compromising the power to detect significant associations. We propose a hierarchical Bayesian model that estimates the association of each variant using pools of cases and controls, accounting for the variation in read depth across pools and for sequencing error. To investigate the performance of our method across a range of numbers of pools, numbers of individuals within each pool, and average coverage, we undertook extensive simulations varying effect sizes, minor allele frequencies, and sequencing error rates. In general, the number of pools and the pool size have dramatic effects on power, while the total depth of coverage per pool has only a moderate impact. This information can guide the selection of a study design that maximizes power subject to cost, sample size, or other laboratory constraints. We provide an R package (hiPOD: hierarchical Pooled Optimal Design) to find the optimal design, allowing the user to specify a cost function, cost and sample size limitations, and distributions of effect size, minor allele frequency, and sequencing error rate.
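A toy version of the design search such a tool performs (this is not the hiPOD package): grid over numbers of pools, pool sizes, and per-pool depths under a cost cap, scoring each design with a normal-approximation power for a case/control frequency comparison. Cost constants and the variance decomposition are illustrative assumptions.

```python
# Toy pooled-design grid search under a cost cap; all constants are invented.
import numpy as np
from itertools import product
from scipy.stats import norm

def approx_power(p1, p0, n_pools, pool_size, depth, alpha=1e-6):
    # Variance of a group's frequency estimate: chromosome + read sampling.
    def var(p):
        chrom = p * (1 - p) / (2 * pool_size * n_pools)
        reads = p * (1 - p) / (depth * n_pools)
        return chrom + reads
    se = np.sqrt(var(p1) + var(p0))
    z = abs(p1 - p0) / se - norm.ppf(1 - alpha / 2)
    return norm.cdf(z)

COST_SAMPLE, COST_DEPTH = 10.0, 50.0   # assumed per-sample / per-depth-unit costs
best = max(
    (d for d in product([5, 10, 20, 40],      # pools per group
                        [5, 10, 25, 50],      # individuals per pool
                        [20, 50, 100, 200])   # average per-pool depth
     if 2 * (d[0] * d[1] * COST_SAMPLE + d[0] * d[2] * COST_DEPTH) <= 60000),
    key=lambda d: approx_power(0.05, 0.02, *d))
print("best (pools, pool size, depth):", best,
      f"power ~ {approx_power(0.05, 0.02, *best):.2f}")
```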

7.
Genome-wide association studies may be necessary to identify genes underlying certain complex diseases. Because such studies can be extremely expensive, DNA pooling has been introduced, as it may greatly reduce the genotyping burden. In parallel with developments in DNA pooling, the importance of haplotypes in genetic studies has been amply demonstrated in the literature. However, pooling the DNA of a large number of samples may lose haplotype information among tightly linked genetic markers. Here, we examine the cost-effectiveness of DNA pooling in the estimation of haplotype frequencies from population data. When the maximum likelihood estimates of haplotype frequencies are obtained from pooled samples, we compare the overall cost of the study, including both DNA collection and marker genotyping, between the individual genotyping strategy and the DNA pooling strategy. We find that DNA pooling of two individuals can be more cost-effective than individual genotyping, especially when a large number of haplotype systems are studied.
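The trade-off reduces to simple arithmetic once per-subject collection and per-reaction genotyping costs are assumed; the sketch below compares individual genotyping with pools of two, with all constants invented for illustration.

```python
# Cost comparison: pooling two individuals halves genotyping reactions per
# marker, but may require extra subjects to offset lost haplotype information.
def study_cost(n_individuals, n_markers, pool_size,
               cost_collect=20.0, cost_genotype=1.0):
    reactions = (n_individuals // pool_size) * n_markers
    return n_individuals * cost_collect + reactions * cost_genotype

for pool, n in [(1, 1000), (2, 1150)]:   # assume ~15% more subjects when pooled
    print(f"pool size {pool}, n = {n}: cost = {study_cost(n, 500, pool):,.0f}")
```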

8.
Genome-wide association studies (GWAS) can identify common alleles that contribute to complex disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of most common SNPs must be evaluated indirectly using either genotyped markers or haplotypes thereof as proxies. We have previously implemented a computationally efficient Markov Chain framework for genotype imputation and haplotyping in the freely available MaCH software package. The approach describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes, together with useful measures of the quality of these estimates. Our approach is already widely used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here, we use simulations and experimental genotypes to evaluate its accuracy and utility, considering choices of genotyping panels, reference panel configurations, and designs where genotyping is replaced with shotgun sequencing. Importantly, we show that genotype imputation not only facilitates cross study analyses but also increases power of genetic association studies. We show that genotype imputation of common variants using HapMap haplotypes as a reference is very accurate using either genome-wide SNP data or smaller amounts of data typical in fine-mapping studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we illustrate how association analyses of unobserved variants will benefit from ongoing advances such as larger HapMap reference panels and whole genome shotgun sequencing technologies.
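A toy forward pass of a Li and Stephens-style mosaic HMM, the kind of model underlying imputation engines such as MaCH; real software estimates the switch and error rates and runs a full forward-backward pass, whereas this sketch only filters forward over a handful of invented haplotypes.

```python
# Toy mosaic-model forward filter: the study haplotype is modeled as a mosaic
# of reference haplotypes, with switches at rate theta and mismatches at eps.
import numpy as np

ref = np.array([[0, 1, 1, 0],    # K reference haplotypes x M sites (0/1 alleles)
                [1, 1, 0, 0],
                [0, 0, 1, 1]])
obs = np.array([0, 1, 1, -1])    # observed study haplotype; -1 = untyped site
theta, eps = 0.1, 0.01           # switch (recombination) and mismatch rates

K, M = ref.shape
f = np.full(K, 1.0 / K)          # forward probabilities over copied haplotypes
for m in range(M):
    f = (1 - theta) * f + theta * f.sum() / K        # transition (copy switch)
    if obs[m] >= 0:                                   # emission at typed sites
        f *= np.where(ref[:, m] == obs[m], 1 - eps, eps)
    f /= f.sum()                                      # normalize for stability
    # imputed allele dosage at site m: posterior-weighted reference alleles
    print(f"site {m}: P(allele=1) ~ {np.dot(f, ref[:, m]):.2f}")
```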

9.
Accurate genetic association studies are crucial for the detection and validation of disease determinants. One of the main confounding factors that affect accuracy is population stratification, and great efforts have been expended over the past decade to detect and adjust for it. We now have efficient solutions for population stratification adjustment for single-SNP (single-nucleotide polymorphism) inference in genome-wide association studies, but it is unclear whether these solutions can be effectively applied to rare-variation studies and in particular to gene-based (or set-based) association methods that jointly analyze multiple rare and common variants. We examine here, both theoretically and empirically, the performance of two commonly used approaches for population stratification adjustment (genomic control and principal component analysis) when used with gene-based association tests. We show that, unlike single-SNP inference, genes with diverse compositions of rare and common variants may suffer from population stratification to varying extents. The inflation in gene-level statistics can be affected by the number and the allele frequency spectrum of SNPs in the gene, and by the gene-based testing method used in the analysis. As a consequence, using a universal inflation factor as a genomic control should be avoided in gene-based inference with sequencing data. We also demonstrate that caution is needed when using principal component adjustment, because the accuracy of the adjusted analyses depends on the underlying population substructure, on the way the principal components are constructed, and on the number of principal components used to recover the substructure.
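For reference, the classic single-SNP form of genomic control that the abstract argues against reusing wholesale for gene-based statistics: deflate 1-df chi-square statistics by lambda = median(chi-square)/0.4549. Data are simulated.

```python
# Classic single-SNP genomic control on simulated, mildly inflated statistics.
import numpy as np
from scipy.stats import chi2

rng   = np.random.default_rng(1)
stats = chi2.rvs(df=1, size=100_000, random_state=rng) * 1.08  # inflated scan

lam = np.median(stats) / chi2.ppf(0.5, df=1)   # chi2.ppf(0.5, 1) ~ 0.4549
adjusted = stats / lam                          # deflated statistics
print(f"lambda = {lam:.3f}; top statistic {stats.max():.1f} -> {adjusted.max():.1f}")
```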

10.
A key step in genomic studies is to assay high-throughput measurements across millions of markers for each participant's DNA, using either microarrays or sequencing techniques. Accurate genotype calling is essential for downstream statistical analysis of genotype-phenotype associations, and next-generation sequencing (NGS) has recently become a more common approach in genomic studies. How the accuracy of variant calling in NGS-based studies affects downstream association analysis has not, however, been studied using empirical data in which both microarrays and NGS were available. In this article, we investigate the impact of variant-calling errors on the statistical power to identify associations between single-nucleotide variants and disease, and on associations between multiple rare variants and disease. Both differential and nondifferential genotyping errors are considered. Our results show that the power of burden tests for rare variants is strongly influenced by the specificity of variant calling, but is rather robust with regard to sensitivity. By using the variant-calling accuracies estimated from a substudy of a Cooperative Studies Program project conducted by the Department of Veterans Affairs, we show that the power of association tests is mostly retained with commonly adopted variant-calling pipelines. An R package, GWAS.PC, is provided to accommodate power analysis that takes account of genotyping errors (http://zhaocenter.org/software/).
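A small simulation in the spirit of the abstract's finding, assuming carrier miscalls governed by sensitivity and specificity and a simple carrier-burden chi-square; all rates and effect sizes are invented.

```python
# How calling specificity affects a simple carrier-burden test (simulation).
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)

def burden_pvalue(n=2000, carrier_freq=0.05, rel_risk=2.0,
                  sensitivity=0.95, specificity=0.999):
    carrier_ctrl = rng.random(n) < carrier_freq
    carrier_case = rng.random(n) < carrier_freq * rel_risk
    def call(truth):   # miscall true carriers and non-carriers independently
        return np.where(truth, rng.random(truth.size) < sensitivity,
                               rng.random(truth.size) > specificity)
    a, b = call(carrier_case).sum(), call(carrier_ctrl).sum()
    table = [[a, n - a], [b, n - b]]
    return chi2_contingency(table)[1]

for spec in (0.9999, 0.99):
    pvals = [burden_pvalue(specificity=spec) for _ in range(200)]
    print(f"specificity {spec}: power ~ {np.mean(np.array(pvals) < 5e-4):.2f}")
```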

11.
Association of calpain-10 gene polymorphisms with type 2 diabetes mellitus: a meta-analysis
OBJECTIVE: To investigate the association of the calpain-10 (CAPN10) gene SNP43 and SNP44 loci, major haplotypes, and haplotype combinations with type 2 diabetes mellitus (T2DM). METHODS: Following the principles and standards of systematic review, PubMed and Chinese journal databases were searched for case-control studies of the CAPN10 gene and T2DM. Stratified meta-analysis by ethnicity was used to assess the association between CAPN10 polymorphisms and T2DM, and publication bias was also assessed. RESULTS: The polymorphisms associated with T2DM in each ethnic group were as follows. In Mongoloid populations: the SNP43 G allele, OR = 1.368 (95% CI: 1.155-1.620); the G/G genotype, OR = 1.437 (95% CI: 1.186-1.741); and the 111/221 haplotype combination, OR = 2.762 (95% CI: 1.287-5.927). In Caucasian populations: the SNP44 C allele, OR = 1.144 (95% CI: 1.023-1.278), and the 111/111 haplotype combination, OR = 1.291 (95% CI: 1.050-1.586). In admixed populations: the SNP44 C allele, OR = 1.653 (95% CI: 1.025-2.665). CONCLUSION: The SNP43 G allele, the G/G genotype, and the 111/221 haplotype combination are risk factors in Mongoloid populations; the SNP44 C allele and the 111/111 haplotype combination are risk factors in Caucasian populations; and the SNP44 C allele is a risk factor in admixed populations.
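For readers unfamiliar with the arithmetic behind the reported estimates, the sketch below forms an allelic odds ratio and Wald 95% CI from a 2x2 genotype-by-disease table; the counts are invented, not taken from the meta-analysis.

```python
# Odds ratio and Wald 95% CI from a 2x2 allele-by-disease table.
import numpy as np
from scipy.stats import norm

a, b = 240, 180   # G allele count in cases, controls
c, d = 760, 820   # non-G allele count in cases, controls

or_hat = (a * d) / (b * c)
se     = np.sqrt(1/a + 1/b + 1/c + 1/d)
lo, hi = np.exp(np.log(or_hat) + np.array([-1, 1]) * norm.ppf(0.975) * se)
print(f"OR = {or_hat:.3f} (95% CI: {lo:.3f}-{hi:.3f})")
```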

12.
13.
For analyzing complex trait associations with sequencing data, most current studies test the aggregated effects of variants in a gene or genomic region. Although gene-based tests have insufficient power even for moderately sized samples, pathway-based analyses combine information across multiple genes in biological pathways and may offer additional insight. However, most existing pathway association methods were originally designed for genome-wide association studies and have not been comprehensively evaluated for sequencing data. Moreover, region-based rare-variant association methods, although potentially applicable to pathway-based analysis by extending their region definition to gene sets, have never been rigorously tested in that setting. In the context of exome-based studies, we use simulated and real datasets to evaluate pathway-based association tests. Our simulation strategy adopts a genome-wide genetic model that distributes total genetic effects hierarchically into pathways, genes, and individual variants, allowing the evaluation of pathway-based methods under realistic, quantifiable assumptions about the underlying genetic architectures. The results show that, although no single pathway-based association method offers superior performance in all simulated scenarios, a modification of the Gene Set Enrichment Analysis approach using statistics from single-marker tests without gene-level collapsing (the weighted Kolmogorov-Smirnov [WKS]-Variant method) is consistently powerful. Interestingly, directly applying rare-variant association tests (e.g., the sequence kernel association test) to pathway analysis offers similar power, but the results are sensitive to assumptions about the genetic architecture. We applied pathway association analysis to exome-sequencing data on chronic obstructive pulmonary disease and found that the WKS-Variant method confirms previously published associated genes.
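A compact illustration of the GSEA-style running-sum statistic underlying a weighted Kolmogorov-Smirnov pathway test: pathway variants push the walk up in proportion to their single-marker statistics, non-pathway variants push it down, and the maximum deviation is the enrichment score. Data are simulated and significance would come from permutation.

```python
# GSEA-style weighted running-sum enrichment score over ranked variants.
import numpy as np

rng = np.random.default_rng(3)
scores  = np.abs(rng.normal(size=1000))        # single-marker test statistics
in_path = rng.random(1000) < 0.05              # pathway membership indicator

order = np.argsort(-scores)                    # rank variants by statistic
hit   = in_path[order].astype(float)
w     = scores[order] * hit
up    = np.cumsum(w) / w.sum()                 # weighted hits accumulate
down  = np.cumsum(1 - hit) / (1 - hit).sum()   # misses accumulate uniformly
es    = (up - down)[np.argmax(np.abs(up - down))]  # signed enrichment score
print(f"enrichment score = {es:.3f}")          # significance via permutation
```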

14.
Rare variants (RVs) have been shown to be significant contributors to complex disease risk. By definition, these variants have very low minor allele frequencies and traditional single‐marker methods for statistical analysis are underpowered for typical sequencing study sample sizes. Multimarker burden‐type approaches attempt to identify aggregation of RVs across case‐control status by analyzing relatively small partitions of the genome, such as genes. However, it is generally the case that the aggregative measure would be a mixture of causal and neutral variants, and these omnibus tests do not directly provide any indication of which RVs may be driving a given association. Recently, Bayesian variable selection approaches have been proposed to identify RV associations from a large set of RVs under consideration. Although these approaches have been shown to be powerful at detecting associations at the RV level, there are often computational limitations on the total quantity of RVs under consideration and compromises are necessary for large‐scale application. Here, we propose a computationally efficient alternative formulation of this method using a probit regression approach specifically capable of simultaneously analyzing hundreds to thousands of RVs. We evaluate our approach to detect causal variation on simulated data and examine sensitivity and specificity in instances of high RV dimensionality as well as apply it to pathway‐level RV analysis results from a prostate cancer (PC) risk case‐control sequencing study. Finally, we discuss potential extensions and future directions of this work.
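A minimal Albert and Chib (1993) data-augmentation Gibbs sampler for probit regression, the computational core that such variable-selection methods build on; the spike-and-slab selection layer is omitted here, and the ridge prior and simulated data are illustrative.

```python
# Albert-Chib probit Gibbs sampler: alternate latent z | beta, y (truncated
# normal) and beta | z (multivariate normal under a ridge prior).
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:2] = (1.0, -0.8)
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

tau2 = 10.0                                    # prior variance on coefficients
V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)  # posterior covariance (fixed)
L = np.linalg.cholesky(V)
beta, draws = np.zeros(p), []
for it in range(2000):
    mu = X @ beta                              # latent z: N(mu,1) truncated by y
    lo = np.where(y == 1, -mu, -np.inf)
    hi = np.where(y == 1, np.inf, -mu)
    z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
    beta = V @ (X.T @ z) + L @ rng.normal(size=p)   # beta | z draw
    if it >= 500:                              # keep post-burn-in draws
        draws.append(beta)
print(np.mean(draws, axis=0).round(2))         # posterior means near beta_true
```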

15.
Several groups have developed methods for estimating allele frequencies in DNA pools as a fast and cheap way of detecting allelic association between genetic markers and disease. To obtain accurate estimates of allele frequencies, a correction factor k, reflecting the degree to which measurement of the allele-specific products is biased, is generally applied. Factor k is usually obtained as the ratio of the two allele-specific signals in samples from heterozygous individuals, a step that can significantly impair throughput and increase cost. We have systematically investigated the properties of k through the use of empirical and simulated data. We show that for the dye terminator primer extension genotyping method we applied, the correction factor k is substantially influenced by the dye terminators incorporated, but also by the terminal 3' base of the extension primer. We also show that the variation in k is large enough to result in unacceptable error rates if association studies are conducted without regard to k. We show that the impact of ignoring k can be neutralized by applying a correction factor k(max) that can be easily derived, albeit at the potential cost of an increase in type I error. Finally, based upon the observed distributions of k, we derive a method for estimating the probability that pooled data reflect significant differences in allele frequencies between the subjects comprising the pools. By controlling error rates in the absence of knowledge of the appropriate SNP-specific correction factors, each approach enhances the performance of DNA pooling, while considerably streamlining the method by reducing time and cost.
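The correction itself is one line once k is in hand: estimate k as the mean ratio of allele-specific signals in known heterozygotes, then use it when converting a pool's two signal intensities into an allele frequency estimate. Signal values below are invented.

```python
# Differential-amplification correction: p_hat = A / (A + k * B).
import numpy as np

# Signal ratios A/B measured in individual heterozygotes (illustrative);
# in an unbiased assay these would scatter around 1.
het_ratios = np.array([1.32, 1.18, 1.27, 1.41, 1.22])
k = het_ratios.mean()

def pooled_freq(signal_a, signal_b, k=k):
    """Corrected frequency of allele A from a pool's two signal intensities."""
    return signal_a / (signal_a + k * signal_b)

print(f"k = {k:.2f}, uncorrected = {pooled_freq(520.0, 480.0, k=1.0):.3f}, "
      f"corrected = {pooled_freq(520.0, 480.0):.3f}")
```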

16.
With advances in next-generation sequencing technology, a massive amount of sequencing data is being generated, offering a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, high-dimensional sequencing data pose a great challenge for statistical analysis. Association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a Weighted U Sequencing test, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a nonparametric U-statistic, WU-SEQ makes no assumption about the underlying disease model or phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used sequence kernel association test (SKAT) when the underlying assumptions were violated (e.g., when the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained performance comparable to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS) and detected an association between ANGPTL4 and very-low-density lipoprotein cholesterol.
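The generic shape of such a weighted U-statistic is U = sum over pairs i < j of w_ij K(y_i, y_j), with genotype-similarity weights and a phenotype kernel; the specific weight and kernel choices of WU-SEQ differ from the illustration below, and inference would use permutation or the statistic's asymptotic distribution.

```python
# Generic weighted U-statistic: phenotype-pair kernel weighted by
# genotype-pair similarity, summed over distinct pairs.
import numpy as np

rng = np.random.default_rng(5)
n, m = 200, 50
G = rng.binomial(2, 0.01, size=(n, m)).astype(float)   # rare-variant genotypes
y = rng.standard_t(df=3, size=n)                        # heavy-tailed phenotype

W = G @ G.T                                  # genotype-similarity weights
K = -np.abs(y[:, None] - y[None, :])         # phenotype kernel: closer = larger
iu = np.triu_indices(n, k=1)                 # distinct pairs i < j
U = np.sum(W[iu] * K[iu])
print(f"U = {U:.2f}")
```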

17.
Genotyping errors can create a problem for the analysis of case-parents data because some families will exhibit genotypes that are inconsistent with Mendelian inheritance. The problem with correcting Mendelian-inconsistent genotype errors by regenotyping or by removing the families in which they occur is that the remaining unidentified genotype errors can produce excess type I (false positive) error for some family-based tests of association. We address this problem by developing a likelihood ratio test (LRT) for association in a case-parents design that incorporates nuisance parameters for a general genotype error model. We extend the likelihood approach for a single SNP to short haplotypes consisting of 2 or 3 SNPs. The extension to haplotypes is based on assumptions of random mating, multiplicative penetrances, and at most a single genotype error per family. For a single SNP, we found, using Monte Carlo simulation, that the type I error rate can be controlled for a number of genotype error models at different error rates. Simulation results suggest the same is true for 2 and 3 SNPs. In all cases, power declined with increasing genotyping error rates. In the absence of genotyping errors, power was similar whether or not nuisance parameters for genotype error were included in the LRT. The LRT developed here does not require prior specification of a particular model for genotype errors, and it can be readily computed using the EM algorithm. Consequently, this test may be generally useful as a test of association with case-parents data in which Mendelian-inconsistent families are observed.
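For orientation, the classical transmission disequilibrium statistic that case-parents analyses build on, computed from transmitted versus untransmitted minor alleles in heterozygous parents; the LRT above generalizes this kind of test by adding genotype-error nuisance parameters. Counts are illustrative.

```python
# Classical TDT: McNemar-type 1-df chi-square on transmission counts.
from scipy.stats import chi2

b, c = 85, 52                      # transmissions vs non-transmissions
tdt = (b - c) ** 2 / (b + c)
print(f"TDT = {tdt:.2f}, p = {chi2.sf(tdt, df=1):.3g}")
```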

18.
Genetic heterogeneity, which may manifest on a population level as different frequencies of a specific disease susceptibility allele in different subsets of patients, is a common problem for candidate gene and genome‐wide association studies of complex human diseases. The ordered subset analysis (OSA) was originally developed as a method to reduce genetic heterogeneity in the context of family‐based linkage studies. Here, we have extended a previously proposed method (OSACC) for applying the OSA methodology to case‐control datasets. We have evaluated the type I error and power of different OSACC permutation tests with an extensive simulation study. Case‐control datasets were generated under two different models by which continuous clinical or environmental covariates may influence the relationship between susceptibility genotypes and disease risk. Our results demonstrate that OSACC is more powerful under some disease models than the commonly used trend test and a previously proposed joint test of main genetic and gene‐environment interaction effects. An additional unique benefit of OSACC is its ability to identify a more informative subset of cases that may be subjected to more detailed molecular analysis, such as DNA sequencing of selected genomic regions to detect functional variants in linkage disequilibrium with the associated polymorphism. The OSACC‐identified covariate threshold may also improve the power of an additional dataset to replicate previously reported associations that may only be detectable in a fraction of the original and replication datasets. In summary, we have demonstrated that OSACC is a useful method for improving SNP association signals in genetically heterogeneous datasets. Genet. Epidemiol. 34: 407–417, 2010. © 2010 Wiley‐Liss, Inc.
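A toy ordered-subset scan in the spirit of OSACC, assuming the covariate is measured on cases only: order cases by the covariate, recompute an allelic chi-square at each threshold, and keep the maximum; in practice significance must come from permuting the covariate.

```python
# Ordered-subset scan: maximize an allelic chi-square over covariate thresholds.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(11)
n_cases, n_ctrl = 500, 500
cov = rng.normal(size=n_cases)                       # covariate on cases
# genotype effect only among high-covariate cases (built-in heterogeneity)
p_case = np.where(cov > 0.5, 0.35, 0.20)
g_case = rng.binomial(2, p_case)
g_ctrl = rng.binomial(2, 0.20, size=n_ctrl)

order = np.argsort(-cov)                             # scan from high covariate down
best = (0.0, 0.0)
for k in range(25, n_cases + 1, 25):                 # subset sizes to scan
    sub = g_case[order[:k]]
    table = [[sub.sum(), 2 * k - sub.sum()],
             [g_ctrl.sum(), 2 * n_ctrl - g_ctrl.sum()]]
    stat = chi2_contingency(table)[0]
    if stat > best[0]:
        best = (stat, cov[order[k - 1]])
print(f"max chi2 = {best[0]:.1f} at covariate threshold {best[1]:.2f}")
```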

19.
Case-control studies are prone to low power for testing gene-environment interactions (GXE), given the need for a sufficient number of individuals in each stratum of disease, gene, and environment. We propose a new study design that increases power by strategically pooling biospecimens. Pooling biospecimens allows us to increase the number of subjects significantly, thereby providing a substantial increase in power. We focus on a special, although realistic, case where disease and environmental statuses are binary, and gene status is ordinal, with each individual having 0, 1, or 2 minor alleles. Through pooling, we obtain an allele frequency for each level of disease and environmental status. Using the allele frequencies, we develop a new methodology for estimating and testing GXE that is comparable to the situation in which we have complete data on gene status for each individual. We also explore the measurement process and its effect on the GXE estimator. With an illustration, we show the effectiveness of pooling in an epidemiologic study that tests an interaction of fiber and paraoxonase on anovulation. Through simulation, we show that taking 12 pooled measurements from 1000 individuals achieves more power than individually genotyping 500 individuals. Our findings suggest that strategic pooling should be considered when an investigator designs a pilot study to test for a GXE. Published 2012. This article is a US Government work and is in the public domain in the USA.
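How an interaction can be read off four pooled allele frequencies, one per disease-by-environment stratum: form an allelic odds ratio within each exposure stratum and contrast the two. Frequencies below are invented, and the paper's estimator additionally models the measurement process.

```python
# GXE from stratum-level pooled allele frequencies: contrast the allelic
# odds ratios of the exposed and unexposed strata.
import numpy as np

def allelic_or(p_case, p_ctrl):
    return (p_case * (1 - p_ctrl)) / (p_ctrl * (1 - p_case))

# pooled minor-allele frequency estimates by (disease, exposure) stratum
freq = {("case", 1): 0.32, ("ctrl", 1): 0.20,
        ("case", 0): 0.22, ("ctrl", 0): 0.20}

or_e1 = allelic_or(freq[("case", 1)], freq[("ctrl", 1)])
or_e0 = allelic_or(freq[("case", 0)], freq[("ctrl", 0)])
print(f"interaction OR = {or_e1 / or_e0:.2f} "
      f"(log scale: {np.log(or_e1) - np.log(or_e0):.2f})")
```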

20.
Genome-wide association studies have identified hundreds of genetic variants associated with complex diseases, although most variants identified so far explain only a small proportion of heritability, suggesting that rare variants may be responsible for the missing heritability. Identification of rare variants through large-scale resequencing is becoming increasingly important but remains prohibitively expensive despite the rapid decline in sequencing costs. Nevertheless, group-testing-based overlapping pool sequencing, in which pooled rather than individual samples are sequenced, greatly reduces the effort of sample preparation as well as the cost of screening for rare variants. Here, we propose an overlapping pool sequencing strategy for screening rare variants, with an optimal sequencing depth and a corresponding cost model. We formulate a model to compute the optimal depth for sufficient observation of variants in pooled sequencing. Using the shifted transversal design algorithm, appropriate parameters for overlapping pool sequencing can be selected to minimize cost and guarantee accuracy. Owing to the mixing constraint and the high depth required for pooled sequencing, the results show that it is more cost-effective to divide a large population into smaller blocks, each tested independently with an optimized strategy. Finally, we conducted an experiment to screen for carriers of variants with a frequency of 1%. With simulated pools and publicly available human exome sequencing data, the experiment achieved 99.93% accuracy. With overlapping pool sequencing, the cost of screening for carriers of a 1%-frequency variant among 200 diploid individuals dropped to at most 66% of that of separate individual sequencing, with the target sequencing region set to 30 Mb.
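A minimal version of the depth computation described: reads supporting a single heterozygous carrier's allele in a pool of s diploid individuals arrive at rate D/(2s), so the smallest depth D giving at least k supporting reads with high probability follows from a Poisson tail; the threshold and target are illustrative assumptions.

```python
# Minimal pooled-sequencing depth so a lone heterozygous carrier's allele
# is seen at least k times with probability >= target (Poisson model).
from scipy.stats import poisson

def min_depth(pool_size, k=3, target=0.999, max_depth=100_000):
    for depth in range(1, max_depth):
        if poisson.sf(k - 1, depth / (2 * pool_size)) >= target:
            return depth
    raise ValueError("target not reachable")

for s in (10, 50, 100):
    print(f"pool of {s}: depth >= {min_depth(s)}x")
```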
