Similar articles
20 similar articles found; search time: 62 ms
1.
Liu Z  Lin S 《Genetic epidemiology》2005,29(4):353-364
Linkage disequilibrium (LD) plays a central role in fine mapping of disease genes and, more recently, in characterizing haplotype blocks. Classical LD measures, such as D' and r², are frequently used to quantify the relationship between two loci. A pairwise "distance" matrix among a set of loci can be constructed using such a measure, and a number of haplotype block detection and tagging single nucleotide polymorphism (SNP) selection algorithms have been devised on that basis. Although successful in many applications, the pairwise nature of these measures does not provide a direct characterization of joint linkage disequilibrium among multiple loci. Consequently, applications based on them may lead to loss of important information. In this report, we propose a multilocus LD measure based on generalized mutual information, which is also known as relative entropy or Kullback-Leibler distance. In essence, this measure seeks to quantify the distance between the observed haplotype distribution and the expected distribution assuming linkage equilibrium. We can show that this measure is approximately equal to r² in the special case with two loci. Based on this multilocus LD measure and an entropy measure that characterizes haplotype diversity, we propose a class of stepwise tagging SNP selection algorithms. This represents a unified approach for SNP selection in that it takes into account both the haplotype diversity and linkage disequilibrium objectives. Applications to both simulated and real data demonstrate the utility of the proposed methods for handling a large number of SNPs. The results indicate that multilocus LD patterns can be captured well, and informative and nonredundant SNPs can be selected effectively from a large set of loci.
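The idea of measuring multilocus LD as a Kullback-Leibler distance can be illustrated with a minimal sketch. It assumes haplotypes are given as equal-length strings of allele codes, estimates the linkage-equilibrium distribution as the product of per-locus allele frequencies, and uses the natural logarithm. The function name `multilocus_ld` is hypothetical, and any normalization the authors apply to relate the measure to r² is not reproduced here.

```python
import math
from collections import Counter


def multilocus_ld(haplotypes):
    """Kullback-Leibler distance between the observed haplotype distribution
    and the distribution expected under linkage equilibrium (the product of
    per-locus allele frequencies). Hypothetical helper for illustration."""
    n = len(haplotypes)
    n_loci = len(haplotypes[0])
    observed = Counter(haplotypes)
    # Allele counts at each locus, used to build the equilibrium expectation.
    allele_counts = [Counter(h[j] for h in haplotypes) for j in range(n_loci)]
    kl = 0.0
    for hap, count in observed.items():
        p_obs = count / n
        p_exp = 1.0
        for j, allele in enumerate(hap):
            p_exp *= allele_counts[j][allele] / n
        # Only observed haplotypes contribute; unobserved ones add zero.
        kl += p_obs * math.log(p_obs / p_exp)
    return kl
```

With all four two-locus haplotypes equally frequent the distance is zero, while a sample containing only two fully complementary haplotypes gives the value log 2 under these assumptions.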

2.
Hao K  Liu S  Niu T 《Genetic epidemiology》2005,29(4):336-352
Single nucleotide polymorphisms (SNPs) play a central role in the identification of susceptibility genes for common diseases. Recent empirical studies on the human genome have revealed block-like structures, and each block contains a set of haplotype tagging SNPs (htSNPs) that capture a large fraction of the haplotype diversity. Herein, we present an innovative sparse marker extension tree (SMET) algorithm to select optimal htSNP set(s). SMET reduces the search space considerably (compared to a full enumeration strategy), and therefore improves computing efficiency. We tested this algorithm on several datasets at three different genomic scales: (1) gene-wide (NOS3, CRP, IL6, PPARA, and TNF), (2) region-wide (a Whitehead Institute inflammatory bowel disease dataset and a UK Graves' disease dataset), and (3) chromosome-wide (chromosome 22) levels. SMET offers geneticists greater flexibility in SNP tagging than lossless methods by allowing adjustable haplotype diversity coverage (phi). In simulation studies, we found that (1) an initial sample size of 50 individuals (100 chromosomes) or more is needed for htSNP selection; (2) the SNP tagging strategy is considerably more efficient when the underlying block structure is taken into account; and (3) htSNP sets at 80-90% phi are more cost-effective than the lossless sets in terms of relative power, relative risk ratio estimation, and genotyping efforts. Our study suggests that the novel SMET algorithm is a valuable tool for association tests.

3.
Single nucleotide polymorphisms (SNPs) are becoming widely used as genotypic markers in genetic association studies of common, complex human diseases. For such association screens, a crucial part of study design is determining what SNPs to prioritize for genotyping. We present a novel power-based algorithm to select a subset of tag SNPs for genotyping from a map of available SNPs. Blocks of markers in strong linkage disequilibrium (LD) are identified, and SNPs are selected to represent each block such that power to detect disease association with an underlying disease allele in LD with block members is preserved; all markers outside of blocks are also included in the tagging subset. A key, novel element of this method is that it incorporates information about the phase of LD observed among marker pairs to retain markers likely to be in coupling phase with an underlying disease locus, thus increasing power compared to a phase-blind approach. Power calculations illustrate important issues regarding LD phase and make clear the advantages of our approach to SNP selection. We apply our algorithm to genotype data from the International HapMap Consortium and demonstrate that considerable reduction in SNP genotyping may be attained while retaining much of the available power for a disease association screen. We also demonstrate that these tag SNPs effectively represent underlying variants not included in the LD analysis and SNP selection, by using leave-one-out tests to show that most (approximately 90%) of the "untyped" variants lying in blocks are in coupling-phase LD with a tag SNP. Additional performance tests using the HapMap ENCyclopedia of DNA Elements (ENCODE) regions show that the method compares well with the popular r² bin tagging method. This work is a concrete example of how empirical LD phase may be used to benefit study design.

4.
Genome-wide association studies (GWAS) routinely apply principal component analysis (PCA) to infer population structure within a sample to correct for confounding due to ancestry. GWAS implementation of PCA uses tens of thousands of single-nucleotide polymorphisms (SNPs) to infer structure, despite the fact that only a small fraction of such SNPs provides useful information on ancestry. The identification of this reduced set of ancestry-informative markers (AIMs) from a GWAS has practical value; for example, researchers can genotype the AIM set to correct for potential confounding due to ancestry in follow-up studies that utilize custom SNP or sequencing technology. We propose a novel technique to identify AIMs from genome-wide SNP data using sparse PCA. The procedure uses penalized regression methods to identify those SNPs in a genome-wide panel that significantly contribute to the principal components while encouraging SNPs that provide negligible loadings to vanish from the analysis. We found that sparse PCA leads to negligible loss of ancestry information compared to traditional PCA analysis of genome-wide SNP data. We further demonstrate the value of sparse PCA for AIM selection using real data from the International HapMap Project and a genomewide study of inflammatory bowel disease. We have implemented our approach in open-source R software for public use.

5.
Linkage disequilibrium (LD) in the human genome, often measured as pairwise correlation between adjacent markers, shows substantial spatial heterogeneity. Congruent with these results, studies have found that certain regions of the genome have far less haplotype diversity than expected if the alleles at multiple markers were independent, while other sets of adjacent markers behave almost independently. Regions with limited haplotype diversity have been described as "blocked" or "haplotype blocks." In this article, we propose a new method that aims to distinguish between blocked and unblocked regions in the genome. Like some other approaches, the method analyses haplotype diversity. Unlike other methods, it allows for adjacent, distinct blocks and also multiple, independent single nucleotide polymorphisms (SNPs) separating blocks. Based on an approximate likelihood model and a parsimony criterion to penalize for model complexity, the method partitions a genomic region into blocks relatively quickly, and simulations suggest that its partitions are accurate. We also propose a new, efficient method to select SNPs for association analysis, namely tag SNPs. These methods compare favorably to similar blocking and tagging methods using simulations.

6.
Recent studies suggest that haplotypes tend to have block-like structures throughout the human genome. Several methods were proposed for haplotype block partitioning and for tagging single-nucleotide polymorphism (SNP) identification. In population genetics studies, several research groups compared block structures across human populations. However, the measures used to quantify population similarity are either less than satisfactory or nonexistent. In this article, we propose several similarity measures to facilitate the comparisons of haplotype structures, namely block boundaries and tagging SNPs, across populations. With these measures, we can more objectively compare haplotype block structures and tagging SNP sets between different populations. In addition, these measures allow us to compare the results of different methods for block partition and tagging SNP identification. When we applied these measures to a real data set on chromosome 10 in 16 worldwide populations, we found that in this genome region: 1) haplotype block boundaries vary among populations, with European and some African populations showing similar boundaries but other populations showing other patterns; 2) tagging SNP sets are generally similar for populations with similar haplotype block structures but differ if the block structures differ; and 3) all but one of the block finding methods we tested yield consistent results, although variations exist regarding consistency. Our tentative results show that at least in the genome region studied, it is unlikely that a common haplotype pattern exists for all human populations: many populations, even in the same geographical region, may have different haplotype patterns.

7.
It has been proposed that using association analysis of single nucleotide polymorphism (SNP) markers in candidate genes may be more successful in identifying disease susceptibility genes for complex diseases. Finding all the SNPs within a candidate gene and genotyping a large case‐control cohort is a resource‐intensive process. As linkage disequilibrium extends across small regions of the genome, the expectation is that a few common anonymous SNPs will be sufficient to detect functional disease‐associated alleles. The aim of this investigation was to compare the ability of a number of family‐ and population‐based association methods to identify known susceptibility loci using the Genetic Analysis Workshop 12 simulated data set. As expected, case‐control methods were more likely to detect association with individual SNPs but family‐based haplotyping methods appeared better able to localize the position of functional polymorphism. © 2001 Wiley‐Liss, Inc.

8.
Kernel machine learning methods, such as the SNP‐set kernel association test (SKAT), have been widely used to test associations between traits and genetic polymorphisms. In contrast to traditional single‐SNP analysis methods, these methods are designed to examine the joint effect of a set of related SNPs (such as a group of SNPs within a gene or a pathway) and are able to identify sets of SNPs that are associated with the trait of interest. However, as with many multi‐SNP testing approaches, kernel machine testing can draw conclusions only at the SNP‐set level and does not directly indicate which SNP(s) in the identified set are actually driving the associations. A recently proposed procedure, KerNel Iterative Feature Extraction (KNIFE), provides a general framework for incorporating variable selection into kernel machine methods. In this article, we focus on quantitative traits and relatively common SNPs, adapt the KNIFE procedure to genetic association studies, and propose an approach to identify driver SNPs after the application of SKAT to gene set analysis. Our approach accommodates several kernels that are widely used in SNP analysis, such as the linear kernel and the Identity by State (IBS) kernel. The proposed approach provides practically useful utilities to prioritize SNPs, and fills the gap between SNP set analysis and biological functional studies. Both simulation studies and a real data application are used to demonstrate the proposed approach.

9.
Polygenic risk scores (PRSs) are a method to summarize the additive trait variance captured by a set of SNPs, and can increase the power of set‐based analyses by leveraging public genome‐wide association study (GWAS) datasets. PRS aims to assess the genetic liability to some phenotype on the basis of polygenic risk for the same or different phenotype estimated from independent data. We propose the application of PRSs as a set‐based method with an additional component of adjustment for linkage disequilibrium (LD), with potential extension of the PRS approach to analyze biologically meaningful SNP sets. We call this method POLARIS: POlygenic Ld‐Adjusted RIsk Score. POLARIS identifies the LD structure of SNPs using spectral decomposition of the SNP correlation matrix and replaces the individuals' SNP allele counts with LD‐adjusted dosages. Using a raw genotype dataset together with SNP effect sizes from a second independent dataset, POLARIS can be used for set‐based analysis. MAGMA is an alternative set‐based approach employing principal component analysis to account for LD between markers in a raw genotype dataset. We used simulations, both with simple constructed and real LD‐structure, to compare the power of these methods. POLARIS shows more power than MAGMA applied to the raw genotype dataset only, but less or comparable power to combined analysis of both datasets. POLARIS has the advantages that it produces a risk score per person per set using all available SNPs, and aims to increase power by leveraging the effect sizes from the discovery set in a self‐contained test of association in the test dataset.
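As a point of reference for the score being adjusted, a plain (unadjusted) polygenic risk score is a weighted sum of allele counts. The sketch below assumes allele counts coded 0, 1 or 2 and per-SNP effect sizes taken from an independent dataset; the function name `polygenic_risk_score` is hypothetical. It does not implement the LD adjustment of POLARIS, which replaces the raw allele counts with LD-adjusted dosages.

```python
def polygenic_risk_score(allele_counts, effect_sizes):
    """Unadjusted PRS for one individual and one SNP set: the sum over SNPs
    of (effect size) x (risk-allele count). Hypothetical helper; no LD
    adjustment is applied."""
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("allele_counts and effect_sizes must have equal length")
    return sum(beta * count for count, beta in zip(allele_counts, effect_sizes))
```

Calling the function once per individual and per SNP set yields one score per person per set, which is the shape of output the abstract describes.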

10.
We have developed a single nucleotide polymorphism (SNP) association scan statistic that takes into account the complex distribution of the human genome variation in the identification of chromosomal regions with significant SNP associations. This scan statistic has wide applicability for genetic analysis, whether to identify important chromosomal regions associated with common diseases based on whole-genome SNP association studies or to identify disease susceptibility genes based on dense SNP positional candidate studies. To illustrate this method, we analyzed patterns of SNP associations on chromosome 19 in a large cohort study. Among 2,944 SNPs, we found seven regions that contained clusters of significantly associated SNPs. The average width of these regions was 35 kb with a range of 10-72 kb. We compared the scan statistic results to Fisher's product method using a sliding window approach, and detected 22 regions with significant clusters of SNP associations. The average width of these regions was 131 kb with a range of 10.1-615 kb. Given that the distances between SNPs are not taken into consideration in the sliding window approach, it is likely that a large fraction of these regions represents false positives. However, all seven regions detected by the scan statistic were also detected by the sliding window approach. The linkage disequilibrium (LD) patterns within the seven regions were highly variable indicating that the clusters of SNP associations were not due to LD alone. The scan statistic developed here can be used to make gene-based or region-based SNP inferences about disease association.
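The comparison method, Fisher's product method in a sliding window, can be sketched as follows. The sketch assumes windows defined by a fixed number of consecutive SNPs and p-values that are independent within a window, an assumption that LD between neighbouring SNPs would violate. The function names are hypothetical, and the window definition used in the study is not given in the abstract.

```python
import math


def fisher_combined_p(p_values):
    """Combine k p-values with Fisher's product method.
    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under independence; for even degrees of freedom the
    upper tail has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def sliding_window_fisher(p_values, window):
    """Combined p-value for every run of `window` consecutive SNPs.
    Physical distances between SNPs are ignored, as in the approach the
    abstract criticizes."""
    return [
        fisher_combined_p(p_values[i:i + window])
        for i in range(len(p_values) - window + 1)
    ]
```

Because the windows count SNPs rather than base pairs, a window can span very different physical distances, which is the weakness the abstract points to.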

11.
Hao K  Xu X  Laird N  Wang X  Xu X 《Genetic epidemiology》2004,26(1):22-30
At the current stage, a large number of single nucleotide polymorphisms (SNPs) have been deployed in searching for genes underlying complex diseases. A powerful method is desirable for efficient analysis of SNP data. Recently, a novel method for multiple SNP association test using a combination of allelic association (AA) and Hardy-Weinberg disequilibrium (HWD) has been proposed. However, the power of this test has not been systematically examined. In this study, we conducted a simulation study to further evaluate the statistical power of the new procedure, as well as the influence of the HWD on its performance. The simulation examined the scenarios of multiple disease SNPs among a candidate pool, assuming different parameters including allele frequencies and risk ratios, dominant, additive, and recessive genetic models, and the existence of gene-gene interactions and linkage disequilibrium (LD). We also evaluated the performance of this test in capturing real disease associated SNPs, when a significant global P value is detected. Our results suggest that this new procedure is more powerful than conventional single-point analyses with correction for multiple testing. However, inclusion of HWD reduces the power under most circumstances. We applied the novel association test procedure to a case-control study of preterm delivery (PTD), examining the effects of 96 candidate gene SNPs concurrently, and detected a global P value of 0.0250 by using Cochran-Armitage χ² statistics as "starting" statistics in the procedure. In the following single point analysis, SNPs on IL1RN, IL1R2, ESR1, Factor 5, and OPRM1 genes were identified as possible risk factors in PTD.

12.
Significance testing one SNP at a time has proven useful for identifying genomic regions that harbor variants affecting human disease. But after an initial genome scan has identified a "hit region" of association, single-locus approaches can falter. Local linkage disequilibrium (LD) can make both the number of underlying true signals and their identities ambiguous. Simultaneous modeling of multiple loci should help. However, it is typically applied ad hoc: conditioning on the top SNPs, with limited exploration of the model space and no assessment of how sensitive model choice was to sampling variability. Formal alternatives exist but are seldom used. Bayesian variable selection is coherent but requires specifying a full joint model, including priors on parameters and the model space. Penalized regression methods (e.g., LASSO) appear promising but require calibration, and, once calibrated, lead to a choice of SNPs that can be misleadingly decisive. We present a general method for characterizing uncertainty in model choice that is tailored to reprioritizing SNPs within a hit region under strong LD. Our method, LASSO local automatic regularization resample model averaging (LLARRMA), combines LASSO shrinkage with resample model averaging and multiple imputation, estimating for each SNP the probability that it would be included in a multi-SNP model in alternative realizations of the data. We apply LLARRMA to simulations based on case-control genome-wide association studies data, and find that when there are several causal loci and strong LD, LLARRMA identifies a set of candidates that is enriched for true signals relative to single locus analysis and to the recently proposed method of Stability Selection.

13.
In genetic association studies, much effort has focused on moving beyond the initial single‐nucleotide polymorphism (SNP)‐by‐SNP analysis. One approach is to reanalyze a chromosomal region where an association has been detected, jointly analyzing the SNP thought to best represent that association with each additional SNP in the region. Such joint analyses may help identify additional, statistically independent association signals. However, it is possible for a single genetic effect to produce joint SNP results that would typically be interpreted as two distinct effects (e.g., both SNPs are significant in the joint model). We present a general approach that can (1) identify conditions under which a single variant could produce a given joint SNP result, and (2) use these conditions to identify variants from a list of known SNPs (e.g., 1000 Genomes) as candidates that could produce the observed signal. We apply this method to our previously reported joint result for smoking involving rs16969968 and rs588765 in CHRNA5. We demonstrate that it is theoretically possible for a joint SNP result suggestive of two independent signals to be produced by a single causal variant. Furthermore, this variant need not be highly correlated with the two tested SNPs or have a large odds ratio. Our method aids in interpretation of joint SNP results by identifying new candidate variants for biological causation that would be missed by traditional approaches. Also, it can connect association findings that may seem disparate due to lack of high correlations among the associated SNPs.

14.
Multiple testing is a challenging issue in genetic association studies using large numbers of single nucleotide polymorphism (SNP) markers, many of which exhibit linkage disequilibrium (LD). Failure to adjust for multiple testing appropriately may produce excessive false positives or overlook true positive signals. The Bonferroni method of adjusting for multiple comparisons is easy to compute, but is well known to be conservative in the presence of LD. On the other hand, permutation-based corrections can correctly account for LD among SNPs, but are computationally intensive. In this work, we propose a new multiple testing correction method for association studies using SNP markers. We show that it is simple, fast and more accurate than the recently developed methods and is comparable to permutation-based corrections using both simulated and real data. We also demonstrate how it might be used in whole-genome association studies to control type I error. The efficiency and accuracy of the proposed method make it an attractive choice for multiple testing adjustment when there is high intermarker LD in the SNP data set.
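For reference, the Bonferroni adjustment mentioned above divides the family-wise error rate by the number of tests. The sketch below also shows the Šidák threshold, which the abstract does not mention and which is included only for comparison. Neither function is the authors' proposed method, whose details the abstract does not give, and both function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni adjustment."""
    return alpha / n_tests


def sidak_threshold(alpha, n_tests):
    """Per-test threshold under the Sidak adjustment, which is exact when the
    tests are independent. Included for comparison only."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)
```

Both thresholds treat the tests as unrelated, so with many SNPs in LD they are stricter than necessary, which is the conservativeness the abstract describes.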

15.
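Objective: To investigate the association of the rs3025039 C/T and rs3025020 C/T single nucleotide polymorphisms (SNPs) of the vascular endothelial growth factor gene (vegf) with the risk of polycystic ovary syndrome (PCOS). Methods: Polymerase chain reaction-ligase detection reaction (PCR-LDR) was used to analyze the relationship between the vegf rs3025039 C/T and rs3025020 C/T SNPs and PCOS risk in 152 PCOS patients and 160 healthy controls. Results: Compared with the CC genotype, carrying the CT genotype may increase the risk of PCOS, with an odds ratio (OR) of 1.75 (95% CI: 1.04-2.95). A joint analysis of the two polymorphic sites with the 2LD software showed linkage disequilibrium between them (D' = 0.81). Haplotype analysis with the EH software showed that CC was the most common haplotype in the population and that the CT haplotype increased the risk of PCOS (OR = 1.62, 95% CI: 1.01-2.60). Conclusion: The vegf rs3025039 C/T SNP increases genetic susceptibility to PCOS, and carrying the CT genotype increases the risk of PCOS.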
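The D' statistic reported above is a standard pairwise LD measure. A minimal sketch of how D' and r² follow from haplotype and allele frequencies is given below; it assumes the frequency of the haplotype carrying allele A at the first site and allele B at the second is already known (the study estimated LD with the 2LD software, which is not reproduced here). The function name `pairwise_ld` is hypothetical.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (D', r^2) for two biallelic loci.
    p_ab: frequency of the haplotype carrying allele A and allele B;
    p_a, p_b: frequencies of allele A and allele B.
    D' is returned with its sign; its absolute value is often reported."""
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2
```

Under this definition a value such as D' = 0.81 means that D reaches 81% of the largest value the allele frequencies allow.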

16.
In association analyses, it is critical that informative single-nucleotide polymorphisms (SNPs) be selected for study and utilized appropriately. We sequenced 38 kb, including exons of ELAC2, promoter region and conserved upstream intergenic sequences. A comprehensive characterization of linkage disequilibrium (LD) structure and mutation history was performed using our principal components analysis (PCA) method and a phylogenetic analysis. We identified a complex pattern of LD structure consistent with the occurrence of both recombination and mutation events within ELAC2. Four overlapping and noncontiguous LD groups were defined. Eight tagging SNPs (tSNPs) were identified, accounting for over 90% of the genetic variation of the 19 total variants. We tested associations between familial early-onset prostate cancer (PRCA) and each variant independently and in haplotypes. We performed these tests using all 19 variants and the 8 tSNPs; the results using tSNP haplotypes accurately represent the association evidence for the full haplotypes. We observed increased evidence for association when SNPs were analyzed in haplotypes. The phylogenetic analysis indicated three haplotypes, clustered farthest from the root-node, all of which were found more often in cases than controls. These three haplotypes together showed the best evidence of association with familial, early-onset PRCA (P=0.0024; odds ratio=2.23; 95% CI, 1.33-3.74), indicating possible allelic heterogeneity. Our results suggest that 8 tSNPs are required to comprehensively assess associations in ELAC2, that haplotypes should be considered for analysis, and that knowledge of mutation history may be helpful in parsing allelic heterogeneity and suggesting combinations of haplotypes to be tested.

17.
Tag SNP selection for association studies (cited by: 6 in total; self-citations: 0; citations by others: 6)
This report describes current methods for selection of informative single nucleotide polymorphisms (SNPs) using data from a dense network of SNPs that have been genotyped in a relatively small panel of subjects. We discuss the following issues: (1) Optimal selection of SNPs based upon maximizing either the predictability of unmeasured SNPs or the predictability of SNP haplotypes as selection criteria. (2) The dependence of the performance of tag SNP selection methods upon the density of SNP markers genotyped for the purpose of haplotype discovery and tag SNP selection. (3) The likely power of case-control studies to detect the influence upon disease risk of common disease-causing variants in candidate genes in a haplotype-based analysis. We propose a quasi-empirical approach towards evaluating the power of large studies with this calculation based upon the SNP genotype and haplotype frequencies estimated in a haplotype discovery panel. In this calculation, each common SNP in turn is treated as a potential unmeasured causal variant and subjected to a correlation analysis using the remaining SNPs. We use a small portion of the HapMap ENCODE data (488 common SNPs genotyped over approximately a 500 kb region of chromosome 2) as an illustrative example of this approach towards power evaluation.

18.
A new multimarker test for family-based association studies (cited by: 1 in total; self-citations: 0; citations by others: 1)

19.
Genomewide association studies (GWAS) sometimes identify loci at which both the number and identities of the underlying causal variants are ambiguous. In such cases, statistical methods that model effects of multiple single‐nucleotide polymorphisms (SNPs) simultaneously can help disentangle the observed patterns of association and provide information about how those SNPs could be prioritized for follow‐up studies. Current multi‐SNP methods, however, tend to assume that SNP effects are well captured by additive genetics; yet when genetic dominance is present, this assumption translates to reduced power and faulty prioritizations. We describe a statistical procedure for prioritizing SNPs at GWAS loci that efficiently models both additive and dominance effects. Our method, LLARRMA‐dawg, combines a group LASSO procedure for sparse modeling of multiple SNP effects with a resampling procedure based on fractional observation weights. It estimates for each SNP the robustness of association with the phenotype both to sampling variation and to competing explanations from other SNPs. In producing an SNP prioritization that best identifies underlying true signals, we show the following: our method easily outperforms a single‐marker analysis; when additive‐only signals are present, our joint model for additive and dominance is equivalent to or only slightly less powerful than modeling additive‐only effects; and when dominance signals are present, even in combination with substantial additive effects, our joint model is unequivocally more powerful than a model assuming additivity. We also describe how performance can be improved through calibrated randomized penalization, and discuss how dominance in ungenotyped SNPs can be incorporated through either heterozygote dosage or multiple imputation.

20.
Genome‐wide association studies allow detection of non‐genotyped disease‐causing variants through testing of nearby genotyped SNPs. This approach may fail when there are no genotyped SNPs in strong LD with the causal variant. Several genotyped SNPs in weak LD with the causal variant may, however, when considered together, provide equivalent information. This observation motivates popular but computationally intensive approaches based on imputation or haplotyping. Here we present a new method and accompanying software designed for this scenario. Our approach proceeds by selecting, for each genotyped “anchor” SNP, a nearby genotyped “partner” SNP, chosen via a specific algorithm we have developed. These two SNPs are used as predictors in linear or logistic regression analysis to generate a final significance test. In simulations, our method captures much of the signal captured by imputation, while taking a fraction of the time and disc space, and generating a smaller number of false‐positives. We apply our method to a case/control study of severe malaria genotyped using the Affymetrix 500K array. Previous analysis showed that fine‐scale sequencing of a Gambian reference panel in the region of the known causal locus, followed by imputation, increased the signal of association to genome‐wide significance levels. Our method also increases the signal of association from to . Our method thus, in some cases, eliminates the need for more complex methods such as sequencing and imputation, and provides a useful additional test that may be used to identify genetic regions of interest.
