Similar articles
 20 similar articles found
1.
Recent studies suggest that haplotypes tend to have block-like structures throughout the human genome. Several methods were proposed for haplotype block partitioning and for tagging single-nucleotide polymorphism (SNP) identification. In population genetics studies, several research groups compared block structures across human populations. However, the measures used to quantify population similarity are either less than satisfactory or nonexistent. In this article, we propose several similarity measures to facilitate the comparisons of haplotype structures, namely block boundaries and tagging SNPs, across populations. With these measures, we can more objectively compare haplotype block structures and tagging SNP sets between different populations. In addition, these measures allow us to compare the results of different methods for block partition and tagging SNP identification. When we applied these measures to a real data set on chromosome 10 in 16 worldwide populations, we found that in this genome region: 1) haplotype block boundaries vary among populations, with European and some African populations showing similar boundaries but other populations showing other patterns; 2) tagging SNP sets are generally similar for populations with similar haplotype block structures but differ if the block structures differ; and 3) all but one of the block finding methods we tested yield consistent results, although variations exist regarding consistency. Our tentative results show that at least in the genome region studied, it is unlikely that a common haplotype pattern exists for all human populations: many populations, even in the same geographical region, may have different haplotype patterns.

2.
Liu Z, Lin S. Genetic Epidemiology, 2005, 29(4): 353-364
Linkage disequilibrium (LD) plays a central role in fine mapping of disease genes and, more recently, in characterizing haplotype blocks. Classical LD measures, such as D' and r², are frequently used to quantify the relationship between two loci. A pairwise "distance" matrix among a set of loci can be constructed using such a measure, and a number of haplotype block detection and tagging single nucleotide polymorphism (SNP) selection algorithms have been built on such matrices. Although successful in many applications, the pairwise nature of these measures does not provide a direct characterization of joint linkage disequilibrium among multiple loci. Consequently, applications based on them may lead to loss of important information. In this report, we propose a multilocus LD measure based on generalized mutual information, which is also known as relative entropy or Kullback-Leibler distance. In essence, this measure seeks to quantify the distance between the observed haplotype distribution and the expected distribution assuming linkage equilibrium. We can show that this measure is approximately equal to r² in the special case with two loci. Based on this multilocus LD measure and an entropy measure that characterizes haplotype diversity, we propose a class of stepwise tagging SNP selection algorithms. This represents a unified approach for SNP selection in that it takes into account both the haplotype diversity and linkage disequilibrium objectives. Applications to both simulated and real data demonstrate the utility of the proposed methods for handling a large number of SNPs. The results indicate that multilocus LD patterns can be captured well, and informative and nonredundant SNPs can be selected effectively from a large set of loci.
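
The idea of measuring multilocus LD as a Kullback-Leibler distance between the observed haplotype distribution and the distribution expected under linkage equilibrium can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `multilocus_ld`, the string encoding of haplotypes, and the use of the raw (unnormalized) divergence in nats are all assumptions, since the abstract does not give the exact normalization.

```python
import math
from collections import Counter


def multilocus_ld(haplotypes):
    """Kullback-Leibler divergence (in nats) between the observed haplotype
    frequencies and the product of per-locus allele frequencies, which is
    the distribution expected under linkage equilibrium.

    `haplotypes` is a list of equal-length strings, one character per locus
    (hypothetical encoding chosen for this sketch).
    """
    n = len(haplotypes)
    n_loci = len(haplotypes[0])
    # Observed haplotype frequencies.
    hap_freq = {h: c / n for h, c in Counter(haplotypes).items()}
    # Observed allele frequencies at each locus.
    allele_freq = [
        {a: c / n for a, c in Counter(h[j] for h in haplotypes).items()}
        for j in range(n_loci)
    ]
    kl = 0.0
    for hap, p in hap_freq.items():
        # Expected frequency if alleles at different loci were independent.
        q = math.prod(allele_freq[j][hap[j]] for j in range(n_loci))
        kl += p * math.log(p / q)
    return kl
```

The divergence is zero when all loci are independent and grows as the observed haplotypes concentrate on fewer combinations than independence would predict.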

3.
Candidate gene association studies often utilize one single nucleotide polymorphism (SNP) for analysis, with an initial report typically not being replicated by subsequent studies. The failure to replicate may result from incomplete or poor identification of disease-related variants or haplotypes, possibly due to naive SNP selection. A method for identification of linkage disequilibrium (LD) groups and selection of SNPs that capture sufficient intra-genic genetic diversity is described. We assume all SNPs with minor allele frequency above a pre-determined frequency have been identified. Principal component analysis (PCA) is applied to evaluate multivariate SNP correlations to infer groups of SNPs in LD (LD-groups) and to establish an optimal set of group-tagging SNPs (gtSNPs) that provide the most comprehensive coverage of intra-genic diversity while minimizing the resources necessary to perform an informative association analysis. This PCA method differs from haplotype block (HB) and haplotype-tagging SNP (htSNP) methods, in that an LD-group of SNPs need not be a contiguous DNA fragment. Results of the PCA method compared well with existing htSNP methods while also providing advantages over those methods, including an indication of the optimal number of SNPs needed. Further, evaluation of the method over multiple replicates of simulated data indicated PCA to be a robust method for SNP selection. Our findings suggest that PCA may be a powerful tool for establishing an optimal SNP set that maximizes the amount of genetic variation captured for a candidate gene using a minimal number of SNPs.

4.
Few comparison studies have been performed on single nucleotide polymorphism (SNP) tagging methods to examine their consistency and effectiveness in terms of inferences about association with disease. We applied several SNP tagging methods to SNPs on chromosome 12q (n=713) and compared the utility of these methods to detect association for asthma and serum IgE levels among a sample of African Caribbean families from Barbados selected through asthmatic probands. We found that a high level of information regarding association is retained in Clayton's htSNP, Stram's TagSNP, and de Bakker's Tagger. We also found a high degree of consistency between TagSNP and Tagger. Using this set of 713 SNPs on chromosome 12q, our study provides insight towards analytic strategies for future studies of complex traits.

5.
Polygenic risk scores (PRSs) are a method to summarize the additive trait variance captured by a set of SNPs, and can increase the power of set‐based analyses by leveraging public genome‐wide association study (GWAS) datasets. PRS aims to assess the genetic liability to some phenotype on the basis of polygenic risk for the same or different phenotype estimated from independent data. We propose the application of PRSs as a set‐based method with an additional component of adjustment for linkage disequilibrium (LD), with potential extension of the PRS approach to analyze biologically meaningful SNP sets. We call this method POLARIS: POlygenic Ld‐Adjusted RIsk Score. POLARIS identifies the LD structure of SNPs using spectral decomposition of the SNP correlation matrix and replaces the individuals' SNP allele counts with LD‐adjusted dosages. Using a raw genotype dataset together with SNP effect sizes from a second independent dataset, POLARIS can be used for set‐based analysis. MAGMA is an alternative set‐based approach employing principal component analysis to account for LD between markers in a raw genotype dataset. We used simulations, both with simple constructed and real LD‐structure, to compare the power of these methods. POLARIS shows more power than MAGMA applied to the raw genotype dataset only, but less or comparable power to combined analysis of both datasets. POLARIS has the advantages that it produces a risk score per person per set using all available SNPs, and aims to increase power by leveraging the effect sizes from the discovery set in a self‐contained test of association in the test dataset.

6.
Linkage disequilibrium (LD) in the human genome, often measured as pairwise correlation between adjacent markers, shows substantial spatial heterogeneity. Congruent with these results, studies have found that certain regions of the genome have far less haplotype diversity than expected if the alleles at multiple markers were independent, while other sets of adjacent markers behave almost independently. Regions with limited haplotype diversity have been described as "blocked" or "haplotype blocks." In this article, we propose a new method that aims to distinguish between blocked and unblocked regions in the genome. Like some other approaches, the method analyses haplotype diversity. Unlike other methods, it allows for adjacent, distinct blocks and also multiple, independent single nucleotide polymorphisms (SNPs) separating blocks. Based on an approximate likelihood model and a parsimony criterion to penalize for model complexity, the method partitions a genomic region into blocks relatively quickly, and simulations suggest that its partitions are accurate. We also propose a new, efficient method to select SNPs for association analysis, namely tag SNPs. These methods compare favorably to similar blocking and tagging methods using simulations.

7.
Single nucleotide polymorphisms (SNPs) are becoming widely used as genotypic markers in genetic association studies of common, complex human diseases. For such association screens, a crucial part of study design is determining what SNPs to prioritize for genotyping. We present a novel power-based algorithm to select a subset of tag SNPs for genotyping from a map of available SNPs. Blocks of markers in strong linkage disequilibrium (LD) are identified, and SNPs are selected to represent each block such that power to detect disease association with an underlying disease allele in LD with block members is preserved; all markers outside of blocks are also included in the tagging subset. A key, novel element of this method is that it incorporates information about the phase of LD observed among marker pairs to retain markers likely to be in coupling phase with an underlying disease locus, thus increasing power compared to a phase-blind approach. Power calculations illustrate important issues regarding LD phase and make clear the advantages of our approach to SNP selection. We apply our algorithm to genotype data from the International HapMap Consortium and demonstrate that considerable reduction in SNP genotyping may be attained while retaining much of the available power for a disease association screen. We also demonstrate that these tag SNPs effectively represent underlying variants not included in the LD analysis and SNP selection, by using leave-one-out tests to show that most (approximately 90%) of the "untyped" variants lying in blocks are in coupling-phase LD with a tag SNP. Additional performance tests using the HapMap ENCyclopedia of DNA Elements (ENCODE) regions show that the method compares well with the popular r² bin tagging method. This work is a concrete example of how empirical LD phase may be used to benefit study design.
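
As background for the phase-aware selection described above, the standard pairwise LD quantities can be computed as in the sketch below. It is not the authors' algorithm; the function name `pairwise_ld` and its arguments are hypothetical. The sign of D is taken here as a simple stand-in for LD phase relative to the two alleles being tracked (positive when they tend to sit on the same haplotype), and both loci are assumed to be polymorphic.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (D, D_prime, r2) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and B at locus 2
    p_a, p_b: frequencies of allele A and allele B (each strictly between 0 and 1)
    """
    # Raw disequilibrium coefficient; its sign depends on which alleles are tracked.
    d = p_ab - p_a * p_b
    # Largest magnitude D can reach given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    # Signed D': positive when A and B co-occur more often than expected.
    d_prime = d / d_max if d_max > 0 else 0.0
    # Squared correlation between the two loci; blind to the sign of D.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2
```

Two marker pairs can share the same r² while having opposite signs of D, which is the information a phase-blind selection rule discards.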

8.
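Objective: To investigate the population haplotype distribution of polymorphisms in the promoter region of the gene encoding the Aα subunit of protein phosphatase 2A (PP2A). Methods: Haploview software was used to analyze the genetic characteristics, linkage disequilibrium (LD), tag SNPs, and haplotype (block) distribution of seven polymorphic sites identified by screening the 5′-flanking region of the PPP2R1A gene in a sample of the Guangdong Han population. Results: Genotype frequencies at all polymorphic sites conformed to Hardy-Weinberg equilibrium (P>0.05). Heterozygosity (π) differed among sites in this population, and at -568G>A and +87T>C it differed significantly from that of the various HapMap populations (P<0.05). Strong LD was observed between -1039G>T(+Ins) and both +87T>C and +108A>G, and between -568G>A and -241-/G; +87T>C and +108A>G were in complete LD. Five haplotypes (H1 to H5) were constructed for this population: the wild-type haplotype (H1) had a frequency of 53%, and the other four variant haplotypes (H2 to H5) together accounted for 44%. Two haplotype blocks were obtained; -1039G>T(+Ins), -512G>A, -241-/G, and +107-/C were selected as four tag SNPs, and the haplotype-tagging SNPs (htSNPs) within the two blocks and the haplotypes each represents were determined. Conclusion: This is the first determination and report of the tag SNPs and haplotype (block) distribution of the 5′-flanking region of the PPP2R1A gene in a healthy Han Chinese population from Guangdong.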

9.
Single nucleotide polymorphisms (SNPs) are important markers for investigating the genetic heterogeneity of populations and for performing linkage disequilibrium (LD) mapping. We propose a new method, Psi, to express the frequencies of the 2^(N_s) haplotypes for N_s di-allelic SNPs. Using the new expression of haplotype frequency, we propose a novel measure of LD, D_g, not only for SNP pairs but also for multiple markers. The values of D_g for SNP pairs were similar to those of the conventional pairwise LD indices, D' and r², and D_g quantified components of LD that are not measured by the conventional LD indices for SNP pairs. We also propose a distinct method, D_g-based absolute estimation, to infer the absolute maximum estimates of haplotype frequency. The results of the D_g-based absolute estimation of haplotype frequency for SNP pairs were compared with those of the conventional expectation-maximization (EM) algorithm; the new method gave better inference than the EM algorithm, which infrequently converged to a local extreme.

10.
The products of the renin-angiotensin system (RAS) play an important role in the pathogenesis of cardiovascular disease. Studies examining RAS gene variants and cardiovascular disease have focused on single-nucleotide polymorphisms (SNPs) rather than haplotypes, which better characterize the patterns of genetic variation. The authors conducted a population-based, case-control study at Group Health (Seattle, Washington) between 1995 and 1999 to determine whether common haplotypes in the angiotensinogen gene (AGT), the renin gene, the angiotensin-converting enzyme gene, and the angiotensin II receptor type 1 and receptor type 2 genes were associated with the risk of myocardial infarction and stroke among pharmacologically treated hypertensive patients. SNP discovery was done using 23 European-origin samples. Thirty tagSNPs (the minimum sets of SNPs that capture most of the haplotype diversity within a block) were genotyped in cases and controls. Haplotypes were inferred using the program PHASE (http://www.stat.washington.edu/stephens/software.html). The authors used weighted logistic regression to estimate associations and conducted a permutation test to estimate the probability of a chance finding. AGT haplotype B was associated with the risk of myocardial infarction (odds ratio = 1.58, 95% confidence interval: 1.06, 2.35); however, results were not statistically significant given the number of tests performed (permutation p = 0.17). In this case-control study, RAS gene haplotypes were not significantly associated with increased risks of myocardial infarction or stroke.

11.
Tag SNP selection for association studies   Total citations: 6, self-citations: 0, citations by others: 6
This report describes current methods for selection of informative single nucleotide polymorphisms (SNPs) using data from a dense network of SNPs that have been genotyped in a relatively small panel of subjects. We discuss the following issues: (1) Optimal selection of SNPs based upon maximizing either the predictability of unmeasured SNPs or the predictability of SNP haplotypes as selection criteria. (2) The dependence of the performance of tag SNP selection methods upon the density of SNP markers genotyped for the purpose of haplotype discovery and tag SNP selection. (3) The likely power of case-control studies to detect the influence upon disease risk of common disease-causing variants in candidate genes in a haplotype-based analysis. We propose a quasi-empirical approach towards evaluating the power of large studies with this calculation based upon the SNP genotype and haplotype frequencies estimated in a haplotype discovery panel. In this calculation, each common SNP in turn is treated as a potential unmeasured causal variant and subjected to a correlation analysis using the remaining SNPs. We use a small portion of the HapMap ENCODE data (488 common SNPs genotyped over approximately a 500 kb region of chromosome 2) as an illustrative example of this approach towards power evaluation.

12.
Kernel machine learning methods, such as the SNP‐set kernel association test (SKAT), have been widely used to test associations between traits and genetic polymorphisms. In contrast to traditional single‐SNP analysis methods, these methods are designed to examine the joint effect of a set of related SNPs (such as a group of SNPs within a gene or a pathway) and are able to identify sets of SNPs that are associated with the trait of interest. However, as with many multi‐SNP testing approaches, kernel machine testing can draw conclusions only at the SNP‐set level, and does not directly indicate which SNP(s) in an identified set are actually driving the associations. A recently proposed procedure, KerNel Iterative Feature Extraction (KNIFE), provides a general framework for incorporating variable selection into kernel machine methods. In this article, we focus on quantitative traits and relatively common SNPs, adapt the KNIFE procedure to genetic association studies, and propose an approach to identify driver SNPs after the application of SKAT to gene set analysis. Our approach accommodates several kernels that are widely used in SNP analysis, such as the linear kernel and the Identity by State (IBS) kernel. The proposed approach provides practically useful utilities to prioritize SNPs, and fills the gap between SNP set analysis and biological functional studies. Both simulation studies and real data application are used to demonstrate the proposed approach.

13.
We consider two-stage case-control designs for testing associations between single nucleotide polymorphisms (SNPs) and disease, in which a subsample of subjects is used to select a panel of "tagging" SNPs that will be considered in the main study. We propose a pseudolikelihood [Pepe and Fleming, 1991: JASA 86:108-113] that combines the information from both the main study and the substudy to test the association with any polymorphism in the original set. SNP-tagging [Chapman et al., 2003: Hum Hered 56:18-31] and haplotype-tagging [Stram et al., 2003a: Hum Hered 55:27-36] approaches are compared. We show that the cost-efficiency of such a design for estimating the relative risk associated with the causal polymorphism can be considerably better than for a single-stage design, even if the causal polymorphism is not included in the tag-SNP set. We also consider the optimal selection of cases and controls in such designs and the relative efficiency for estimating the location of a causal variant in linkage disequilibrium mapping. Nevertheless, as the cost of high-volume genotyping plummets and haplotype tagging information from the International HapMap project [Gibbs et al., 2003: Nature 426:789-796] rapidly accumulates in public databases, such two-stage designs may soon become unnecessary.

14.
The power of genome‐wide association studies (GWAS) for mapping complex traits with single‐SNP analysis (where SNP is single‐nucleotide polymorphism) may be undermined by modest SNP effect sizes, unobserved causal SNPs, correlation among adjacent SNPs, and SNP‐SNP interactions. Alternative approaches for testing the association between a single SNP set and individual phenotypes have been shown to be promising for improving the power of GWAS. We propose a Bayesian latent variable selection (BLVS) method to simultaneously model the joint association mapping between a large number of SNP sets and complex traits. Compared with single SNP set analysis, such joint association mapping not only accounts for the correlation among SNP sets but also is capable of detecting causal SNP sets that are marginally uncorrelated with traits. The spike‐and‐slab prior assigned to the effects of SNP sets can greatly reduce the dimension of effective SNP sets, while speeding up computation. An efficient Markov chain Monte Carlo algorithm is developed. Simulations demonstrate that BLVS outperforms several competing variable selection methods in some important scenarios.

15.
Here we summarize the contributions to Group 13 of the Genetic Analysis Workshop 15 held in St. Pete Beach, Florida, on November 12-14, 2006. The focus of this group was to identify candidate genes associated with rheumatoid arthritis or surrogate outcomes. The association methods proposed in this group were diverse, ranging from better-known approaches, such as logistic regression for single nucleotide polymorphism (SNP) analysis and haplotype sharing tests, to methods less familiar to genetic epidemiologists, such as machine learning and visualization methods. The majority of papers analyzed Genetic Analysis Workshop 15 Problems 2 (rheumatoid arthritis data) and 3 (simulated data). The main points highlighted by this group's analyses were: (1) haplotype-based statistics can be more powerful than single SNP analysis for risk-locus localization; (2) considering linkage disequilibrium block structure in haplotype analysis may reduce the likelihood of false-positive results; and (3) visual representation of genetic models for continuous covariates may help identify SNPs associated with the underlying quantitative trait loci.

16.
We describe a novel method for assessing the strength of disease association with single nucleotide polymorphisms (SNPs) in a candidate gene or small candidate region, and for estimating the corresponding haplotype relative risks of disease, using unphased genotype data directly. We begin by estimating the relative frequencies of haplotypes consistent with observed SNP genotypes. Under the Bayesian partition model, we specify cluster centres from this set of consistent SNP haplotypes. The remaining haplotypes are then assigned to the cluster with the "nearest" centre, where distance is defined in terms of SNP allele matches. Within a logistic regression modelling framework, each haplotype within a cluster is assigned the same disease risk, reducing the number of parameters required. Uncertainty in phase assignment is addressed by considering all possible haplotype configurations consistent with each unphased genotype, weighted in the logistic regression likelihood by their probabilities, calculated according to the estimated relative haplotype frequencies. We develop a Markov chain Monte Carlo algorithm to sample over the space of haplotype clusters and corresponding disease risks, allowing for covariates that might include environmental risk factors or polygenic effects. Application of the algorithm to SNP genotype data in an 890-kb region flanking the CYP2D6 gene illustrates that we can identify clusters of haplotypes with similar risk of poor drug metaboliser (PDM) phenotype, and can distinguish PDM cases carrying different high-risk variants. Further, the results of a detailed simulation study suggest that we can identify positive evidence of association for moderate relative disease risks with a sample of 1,000 cases and 1,000 controls.
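
The step in which haplotypes are assigned to the cluster with the "nearest" centre can be illustrated with a short sketch. It covers only that assignment step, not the Bayesian partition model or the MCMC sampler. The function names, the string encoding of haplotypes, the use of a plain count of matching alleles as the similarity, and the tie-breaking rule (first centre listed wins) are all assumptions made for illustration.

```python
def allele_matches(hap, centre):
    """Number of SNP positions at which two haplotypes carry the same allele."""
    return sum(a == b for a, b in zip(hap, centre))


def assign_to_clusters(haplotypes, centres):
    """Assign each haplotype to the centre it matches at the most positions.

    Ties go to the centre that appears first in `centres` (an arbitrary
    choice for this sketch).
    """
    clusters = {c: [] for c in centres}
    for hap in haplotypes:
        nearest = max(centres, key=lambda c: allele_matches(hap, c))
        clusters[nearest].append(hap)
    return clusters
```

Because every haplotype in a cluster shares one disease-risk parameter, the number of risk parameters equals the number of centres rather than the number of distinct haplotypes.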

17.
Complex genetic disorders are a result of a combination of genetic and nongenetic factors, all potentially interacting. Machine learning methods hold the potential to identify multilocus and environmental associations thought to drive complex genetic traits. Decision trees, a popular machine learning technique, offer a computationally low complexity algorithm capable of detecting associated sets of single nucleotide polymorphisms (SNPs) of arbitrary size, including modern genome-wide SNP scans. However, interpretation of the importance of an individual SNP within these trees can present challenges. We present a new decision tree algorithm denoted as Bagged Alternating Decision Trees (BADTrees) that is based on identifying common structural elements in a bootstrapped set of Alternating Decision Trees (ADTrees). The algorithm is of order nk², where n is the number of SNPs considered and k is the number of SNPs in the tree constructed. Our simulation study suggests that BADTrees have higher power and lower type I error rates than ADTrees alone and comparable power with lower type I error rates compared to logistic regression. We illustrate the application of this algorithm using simulated data as well as data from the Lupus Large Association Study 1 (7,822 SNPs in 3,548 individuals). Our results suggest that BADTrees hold promise as a low computational order algorithm for detecting complex combinations of SNP and environmental factors associated with disease.

18.
The genetic case-control association study of unrelated subjects is a leading method to identify single nucleotide polymorphisms (SNPs) and SNP haplotypes that modulate the risk of complex diseases. Association studies often genotype several SNPs in a number of candidate genes; we propose a two-stage approach to address the inherent statistical multiple comparisons problem. In the first stage, each gene's association with disease is summarized by a single p-value that controls a familywise error rate. In the second stage, summary p-values are adjusted for multiplicity using a false discovery rate (FDR) controlling procedure. For the first stage, we consider marginal and joint tests of SNPs and haplotypes within genes, and we construct an omnibus test that combines SNP and haplotype analysis. Simulation studies show that when disease susceptibility is conferred by a SNP, and all common SNPs in a gene are genotyped, marginal analysis of SNPs using the Simes test has similar or higher power than marginal or joint haplotype analysis. Conversely, haplotype analysis can be more powerful when disease susceptibility is conferred by a haplotype. The omnibus test tracks the more powerful of the two approaches, which is generally unknown. Multiple testing balances the desire for statistical power against the implicit costs of false positive results, which up to now appear to be common in the literature.
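
The two-stage scheme described above can be sketched as follows: a Simes p-value summarizes the SNPs within each gene, and the gene-level p-values are then adjusted for multiplicity. The abstract names the Simes test but does not say which FDR-controlling procedure is used, so the Benjamini-Hochberg step-up adjustment here is an assumption, as are the function names.

```python
def simes_p(p_values):
    """Stage 1: Simes combined p-value for the SNPs of one gene,
    min over i of m * p_(i) / i, where p_(i) is the i-th smallest p-value."""
    m = len(p_values)
    ordered = sorted(p_values)
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))


def benjamini_hochberg(p_values):
    """Stage 2: Benjamini-Hochberg adjusted p-values, returned in the
    original order (assumed here as the FDR-controlling procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A typical use would compute `simes_p` once per gene from that gene's SNP-level p-values, then pass the list of gene-level values to `benjamini_hochberg` and report genes whose adjusted value falls below the chosen FDR level.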

19.
With the aim of improving detection of novel single‐nucleotide polymorphisms (SNPs) in genetic association studies, we propose a method of including prior biological information in a Bayesian shrinkage model that jointly estimates SNP effects. We assume that the SNP effects follow a normal distribution centered at zero with variance controlled by a shrinkage hyperparameter. We use biological information to define the amount of shrinkage applied on the SNP effects distribution, so that the effects of SNPs with more biological support are less shrunk toward zero, thus being more likely detected. The performance of the method was tested in a simulation study (1,000 datasets, 500 subjects with ~200 SNPs in 10 linkage disequilibrium (LD) blocks) using a continuous and a binary outcome. It was further tested in an empirical example on body mass index (continuous) and overweight (binary) in a dataset of 1,829 subjects and 2,614 SNPs from 30 blocks. Biological knowledge was retrieved using the bioinformatics tool Dintor, which queried various databases. The joint Bayesian model with inclusion of prior information outperformed the standard analysis: in the simulation study, the mean ranking of the true LD block was 2.8 for the Bayesian model versus 3.6 for the standard analysis of individual SNPs; in the empirical example, the mean ranking of the six true blocks was 8.5 versus 9.3 in the standard analysis. These results suggest that our method is more powerful than the standard analysis. We expect its performance to improve further as more biological information about SNPs becomes available.

20.

Background

Rapid advances in genotyping technology have made it possible to easily utilize a large number of genetic markers. According to information theory, an increase in the number of markers provides more information; however, the clinical usefulness does not increase linearly. This study aimed to assess the effect of folic acid supplementation quantitatively in MTHFR haplotypes, and compare its prediction power with that of the C677T single nucleotide polymorphism (SNP) alone.

Methods

The study was a randomized, double-blind, placebo-controlled trial, designed in accordance with the CONSORT statement. The participants were 202 healthy Japanese males who were administered either folic acid at 1 mg/day or a placebo postoperatively for 3 months. The primary endpoint was the total plasma homocysteine levels (tHcy). Stratified analysis by HapMap-based tag SNPs was performed.

Results

Of 52 SNPs on the MTHFR gene, 4 SNP loci covering more than 80% of the information were selected, and the haplotypes were estimated. The haplotypes were classified into 3 groups (Hap0, Hap1, and Hap2), on the basis of the number of times the most frequent haplotype was present. The greatest decrease was observed in Hap2 (6.61 µmol/L), compared with the other haplotypes (Hap0, 2.67; Hap1, 2.60) (trend test, P < 0.01). The haplotype information obtained was not more informative than that obtained with grouping by a single SNP, C677T, which strongly influences enzyme activity.

Conclusions

Grouping by the C677T SNP alone was almost as good a predictor of the homocysteine-lowering effects as was grouping by the 4 best SNPs. This shows that increasing the number of typed SNPs does not necessarily provide more information, at least for this gene. A more efficient, cost-informative method for analyzing genomic data is required.

Key words: Folic Acid, Randomized Controlled Trials, Methylenetetrahydrofolate Reductase (MTHFR), Haplotypes, Informativeness
