Similar Articles
20 matching articles found (query time: 46 ms)
1.
A goal of association analysis is to determine whether variation in a particular candidate region or gene is associated with liability to complex disease. To evaluate such candidates, ubiquitous Single Nucleotide Polymorphisms (SNPs) are useful. It is critical, however, to select a set of SNPs that are in substantial linkage disequilibrium (LD) with all other polymorphisms in the region. Whether there is an ideal statistical framework to test such a set of 'tag SNPs' for association is unknown. Compared to tests for association based on frequencies of haplotypes, recent evidence suggests tests for association based on linear combinations of the tag SNPs (Hotelling T2 test) are more powerful. Following this logical progression, we wondered whether single-locus tests would prove generally more powerful than the regression-based tests. We answer this question by investigating four inferential procedures: the maximum of a series of test statistics corrected for multiple testing by the Bonferroni procedure, TB, or by permutation of case-control status, TP; a procedure that tests the maximum of a smoothed curve fitted to the series of test statistics, TS; and the Hotelling T2 procedure, which we call TR. These procedures are evaluated by simulating data like that from human populations, including realistic levels of LD and realistic effects of alleles conferring liability to disease. We find that power depends on the correlation structure of SNPs within a gene, the density of tag SNPs, and the placement of the liability allele. The clearest pattern emerges between power and the number of SNPs selected. When a large fraction of the SNPs within a gene are tested, and multiple SNPs are highly correlated with the liability allele, TS has better power. Using a SNP selection scheme that optimizes power but also requires a substantial number of SNPs to be genotyped (roughly 10-20 SNPs per gene), power of TP is generally superior to that of the other procedures, including TR.
Finally, when a SNP selection procedure that targets a minimal number of SNPs per gene is applied, the average performances of TP and TR are indistinguishable. Genet. Epidemiol. © 2005 Wiley-Liss, Inc.
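As a rough illustration of the TR (Hotelling T2) procedure compared above, the statistic can be computed from case and control tag-SNP allele counts. The data below are invented for the sketch, and the significance step is only noted afterwards:

```python
import numpy as np

def hotelling_t2(cases, controls):
    """Two-sample Hotelling T^2 statistic on tag-SNP allele counts.

    cases, controls: (n_i, p) arrays of 0/1/2 genotype codes.
    Larger values indicate stronger multivariate separation.
    """
    n1, n2 = len(cases), len(controls)
    d = cases.mean(axis=0) - controls.mean(axis=0)
    # Pooled within-group covariance of the p tag SNPs.
    pooled = ((n1 - 1) * np.cov(cases, rowvar=False)
              + (n2 - 1) * np.cov(controls, rowvar=False)) / (n1 + n2 - 2)
    return float(n1 * n2 / (n1 + n2) * d @ np.linalg.solve(pooled, d))

rng = np.random.default_rng(0)
cases = rng.integers(0, 3, size=(100, 5)).astype(float)    # invented genotypes
controls = rng.integers(0, 3, size=(120, 5)).astype(float)
t2 = hotelling_t2(cases, controls)
```

Under the null, T2 scaled by (n1+n2-p-1)/((n1+n2-2)p) follows an F(p, n1+n2-p-1) distribution, which supplies the p-value.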

2.
Single nucleotide polymorphisms (SNPs) are becoming widely used as genotypic markers in genetic association studies of common, complex human diseases. For such association screens, a crucial part of study design is determining what SNPs to prioritize for genotyping. We present a novel power-based algorithm to select a subset of tag SNPs for genotyping from a map of available SNPs. Blocks of markers in strong linkage disequilibrium (LD) are identified, and SNPs are selected to represent each block such that power to detect disease association with an underlying disease allele in LD with block members is preserved; all markers outside of blocks are also included in the tagging subset. A key, novel element of this method is that it incorporates information about the phase of LD observed among marker pairs to retain markers likely to be in coupling phase with an underlying disease locus, thus increasing power compared to a phase-blind approach. Power calculations illustrate important issues regarding LD phase and make clear the advantages of our approach to SNP selection. We apply our algorithm to genotype data from the International HapMap Consortium and demonstrate that considerable reduction in SNP genotyping may be attained while retaining much of the available power for a disease association screen. We also demonstrate that these tag SNPs effectively represent underlying variants not included in the LD analysis and SNP selection, by using leave-one-out tests to show that most (approximately 90%) of the "untyped" variants lying in blocks are in coupling-phase LD with a tag SNP. Additional performance tests using the HapMap ENCyclopedia of DNA Elements (ENCODE) regions show that the method compares well with the popular r2 bin tagging method. This work is a concrete example of how empirical LD phase may be used to benefit study design.
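The phase information this algorithm exploits comes down to the sign of the pairwise disequilibrium coefficient D: positive when the minor alleles co-occur on the same haplotype (coupling), negative for repulsion. A minimal sketch from a hypothetical 2x2 haplotype count table, not the paper's implementation:

```python
import numpy as np

def ld_stats(hap_counts):
    """Pairwise LD from a 2x2 table of haplotype counts.

    hap_counts[i, j] = count of the haplotype carrying allele i at SNP1
    and allele j at SNP2 (alleles coded 0 = major, 1 = minor).
    Returns (D, r2); D > 0 means the minor alleles are in coupling
    phase, D < 0 means repulsion.
    """
    f = hap_counts / hap_counts.sum()
    p1 = f[1, :].sum()          # minor-allele frequency at SNP1
    p2 = f[:, 1].sum()          # minor-allele frequency at SNP2
    D = f[1, 1] - p1 * p2
    r2 = D**2 / (p1 * (1 - p1) * p2 * (1 - p2))
    return D, r2

# Hypothetical counts: the minor alleles mostly co-occur (coupling phase).
counts = np.array([[70.0, 10.0], [10.0, 10.0]])
D, r2 = ld_stats(counts)
```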

3.
Genome-wide association studies (GWASs) commonly use marginal association tests for each single-nucleotide polymorphism (SNP). Because these tests treat SNPs as independent, their power will be suboptimal for detecting SNPs hidden by linkage disequilibrium (LD). One way to improve power is to use a multiple regression model. However, the large number of SNPs precludes simultaneous fitting with multiple regression, and subset regression is infeasible because of an exorbitant number of candidate subsets. We therefore propose a new method for detecting hidden SNPs having significant yet weak marginal association in a multiple regression model. Our method begins by constructing a bidirected graph locally around each SNP that demonstrates a moderately sized marginal association signal (the focal SNPs). Vertexes correspond to SNPs, and adjacency between vertexes is defined by an LD measure. Subsequently, the method collects from each graph all shortest paths to the focal SNP. Finally, for each shortest path the method fits a multiple regression model to all the SNPs lying in the path and tests the significance of the regression coefficient corresponding to the terminal SNP in the path. Simulation studies show that the proposed method can detect susceptibility SNPs hidden by LD that go undetected with marginal association testing or with existing multivariate methods. When applied to real GWAS data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), our method detected two groups of SNPs: one in a region containing the apolipoprotein E (APOE) gene, and another in a region close to the semaphorin 5A (SEMA5A) gene.
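A toy version of this pipeline, with invented genotypes, an arbitrary LD cutoff of 0.3 for adjacency, and ordinary least squares standing in for the full testing machinery:

```python
import numpy as np
from collections import deque

def shortest_path(adj, start, goal):
    """Breadth-first shortest path in an unweighted graph (dict of sets)."""
    prev = {start: None}
    q = deque([start])
    while q:
        u = q.popleft()
        if u == goal:
            path = [u]
            while prev[path[-1]] is not None:
                path.append(prev[path[-1]])
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    return None

rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(200, 6)).astype(float)  # invented genotypes
G[:, 3] = G[:, 0] + rng.normal(0, 0.4, 200)          # SNP 3 in strong LD with SNP 0
y = G[:, 3] + rng.normal(0, 1, 200)                  # trait driven by SNP 3

# Vertexes are SNPs; adjacency is defined by an LD measure exceeding a cutoff.
R = np.corrcoef(G, rowvar=False)
adj = {i: {j for j in range(6) if j != i and abs(R[i, j]) > 0.3}
       for i in range(6)}
path = shortest_path(adj, 0, 3)                      # focal SNP 0 to candidate 3

# Fit a multiple regression on all SNPs along the path and inspect the
# coefficient of the terminal SNP (the quantity the method tests).
Xp = np.column_stack([np.ones(200)] + [G[:, j] for j in path])
beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)
terminal_beta = float(beta[-1])
```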

4.
Polygenic risk scores (PRSs) are a method to summarize the additive trait variance captured by a set of SNPs, and can increase the power of set-based analyses by leveraging public genome-wide association study (GWAS) datasets. A PRS aims to assess the genetic liability to some phenotype on the basis of polygenic risk for the same or a different phenotype estimated from independent data. We propose the application of PRSs as a set-based method with an additional component of adjustment for linkage disequilibrium (LD), with potential extension of the PRS approach to analyze biologically meaningful SNP sets. We call this method POLARIS: POlygenic Ld-Adjusted RIsk Score. POLARIS identifies the LD structure of SNPs using spectral decomposition of the SNP correlation matrix and replaces the individuals' SNP allele counts with LD-adjusted dosages. Using a raw genotype dataset together with SNP effect sizes from a second, independent dataset, POLARIS can be used for set-based analysis. MAGMA is an alternative set-based approach employing principal component analysis to account for LD between markers in a raw genotype dataset. We used simulations, with both simple constructed and real LD structures, to compare the power of these methods. POLARIS shows more power than MAGMA applied to the raw genotype dataset only, but less than or comparable power to combined analysis of both datasets. POLARIS has the advantages that it produces a risk score per person per set using all available SNPs, and aims to increase power by leveraging the effect sizes from the discovery set in a self-contained test of association in the test dataset.
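One plausible reading of the LD adjustment is a whitening transform built from the spectral decomposition of the SNP correlation matrix. The sketch below uses invented genotypes and external effect sizes, and applies an R^(-1/2) transform to standardized dosages, which may differ from POLARIS's exact dosage formula:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 150, 8
X = rng.integers(0, 3, size=(n, p)).astype(float)   # raw allele counts
beta_ext = rng.normal(0, 0.1, p)   # effect sizes from an external GWAS (invented)

# Standardize, then take the spectral decomposition of the SNP correlation matrix.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Xs, rowvar=False)
evals, evecs = np.linalg.eigh(R)
evals = np.clip(evals, 1e-8, None)  # guard against tiny negative eigenvalues
# LD-adjusted dosages: rotate, whiten, rotate back (an R^(-1/2) transform).
R_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
X_adj = Xs @ R_inv_sqrt
# One LD-adjusted risk score per person for this SNP set.
polaris_score = X_adj @ beta_ext
```

After the transform the adjusted dosages are mutually uncorrelated, so the score aggregates effect sizes without double-counting correlated SNPs.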

5.
Knowledge of the extent and distribution of linkage disequilibrium (LD) is critical to the design and interpretation of gene mapping studies. Because the demographic history of each population varies and is often not accurately known, it is necessary to evaluate LD empirically on a population-specific basis. Here we present the first genome-wide survey of LD in the Old Order Amish (OOA) of Lancaster County, Pennsylvania, a closed population derived from a modest number of founders. Specifically, we present a comparison of LD between OOA individuals and US Utah participants in the International HapMap project (abbreviated CEU) using a high-density single nucleotide polymorphism (SNP) map. Overall, the allele (and haplotype) frequency distributions and LD profiles were remarkably similar between these two populations. For example, the median absolute allele frequency difference for autosomal SNPs was 0.05, with an inter-quartile range of 0.02–0.09, and for autosomal SNPs 10–20 kb apart with common alleles (minor allele frequency ≥ 0.05), the LD measure r2 was at least 0.8 for 15% and 14% of SNP pairs in the OOA and CEU, respectively. Moreover, tag SNPs selected from the HapMap CEU sample captured a substantial portion of the common variation in the OOA (~88%) at r2 ≥ 0.8. These results suggest that the OOA and CEU may share similar LD profiles for other common but untyped SNPs. Thus, in the context of the common variant-common disease hypothesis, genetic variants discovered in gene mapping studies in the OOA may generalize to other populations. Genet. Epidemiol. 34: 146–150, 2010. © 2009 Wiley-Liss, Inc.

6.
Allelic expression (AE) imbalance between the two alleles of a gene can be used to detect cis-acting regulatory SNPs (rSNPs) in individuals heterozygous for a transcribed SNP (tSNP). In this paper, we propose three tests for AE analysis focusing on phase-unknown data and any degree of linkage disequilibrium (LD) between the rSNP and tSNP: a test based on the minimum P-value of a one-sided F test and a two-sided t test (proposed previously for phase-unknown data), a test that combines the F and t tests, and a mixture-model-based test. We compare these three tests to the F and t tests and an existing regression-based test for phase-known data. We show that the ranking of the tests based on power depends most strongly on the magnitude of the LD between the rSNP and tSNP. For phase-unknown data, we find that under a range of scenarios, our proposed tests have higher power than the F and t tests when LD between the rSNP and tSNP is moderate (roughly 0.2 to 0.8). We further demonstrate that the presence of a second ungenotyped rSNP almost never invalidates the proposed tests nor substantially changes their power rankings. For detection of cis-acting regulatory SNPs using phase-unknown AE data, we recommend the F test when the rSNP and tSNP are in or near linkage equilibrium (LD below 0.2); the t test when the two SNPs are in strong LD (above 0.7); and the mixture-model-based test for intermediate LD levels (0.2 to 0.7). Genet. Epidemiol. 35:515–525, 2011. © 2011 Wiley-Liss, Inc.

7.
We present a novel method, IBDLD, for estimating the probability of identity by descent (IBD) for a pair of related individuals at a locus, given dense genotype data and a pedigree of arbitrary size and complexity. IBDLD overcomes the challenges of exact multipoint estimation of IBD in pedigrees of potentially large size and eliminates the difficulty of accommodating the background linkage disequilibrium (LD) that is present in high-density genotype data. We show that IBDLD is much more accurate at estimating true IBD sharing than methods that remove LD by pruning SNPs, and is highly robust to pedigree errors or other forms of misspecified relationships. The method is fast and can be used to estimate the probability of each possible IBD sharing state at every SNP from a high-density genotyping array for hundreds of thousands of pairs of individuals. We use it to estimate point-wise and genomewide IBD sharing between 185,745 pairs of subjects, all of whom are related through a single, large and complex 13-generation pedigree and genotyped with the Affymetrix 500K chip. We find that we are able to identify the true pedigree relationship for individuals who were misidentified in the collected data and estimate empirical kinship coefficients that can be used in follow-up QTL mapping studies. IBDLD is implemented as an open source software package and is freely available. Genet. Epidemiol. 35:557–567, 2011. © 2011 Wiley-Liss, Inc.

8.
While data sets based on dense genome scans are becoming increasingly common, there are many theoretical questions that remain unanswered. How can a large number of markers in high linkage disequilibrium (LD) and rare disease variants be simulated efficiently? How should markers in high LD be analyzed: individually or jointly? Are there fast and simple methods to adjust for correlation of tests? What is the power penalty for conservative Bonferroni adjustments? Assuming that association scans are adequately powered, we attempt to answer these questions. Performance of single-point and multipoint tests, and their hybrids, is investigated using two simulation designs. The first simulation design uses theoretically derived LD patterns. The second design uses LD patterns based on real data. For the theoretical simulations we used polychoric correlation as a measure of LD to facilitate simulation of markers in LD and rare disease variants. Based on the simulation results of the two studies, we conclude that statistical tests assuming only additive genotype effects (i.e. Armitage and especially multipoint T2) should be used cautiously due to their suboptimal power in certain settings. A false discovery rate (FDR)-adjusted combination of tests for additive, dominant and recessive effects had close to optimal power. However, the common genotypic χ2 test performed adequately and could be used in lieu of the FDR combination. While some hybrid methods yield (sometimes spectacularly) higher power, they are computationally intensive. We also propose an "exact" method to adjust for multiple testing, which yields nominally higher power than the Bonferroni correction. Genet. Epidemiol. 2008. © 2008 Wiley-Liss, Inc.
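The FDR-adjusted combination of additive, dominant and recessive codings can be sketched as follows. The trend test here is a simple correlation-based approximation rather than the paper's exact procedure, and the genotype frequencies are invented:

```python
import numpy as np
from math import erfc, sqrt

def trend_p(geno, status, scores):
    """Approximate trend test: z = sqrt(n) * corr(coded genotype, status)."""
    x = np.array([scores[int(g)] for g in geno], dtype=float)
    z = sqrt(len(x)) * np.corrcoef(x, status)[0, 1]
    return erfc(abs(z) / sqrt(2))          # two-sided normal tail probability

def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running = 1.0
    for rank in range(m - 1, -1, -1):
        running = min(running, p[order[rank]] * m / (rank + 1))
        adj[order[rank]] = running
    return adj

rng = np.random.default_rng(3)
n = 400
status = rng.integers(0, 2, n).astype(float)
geno = rng.integers(0, 3, n)               # controls: uniform genotype frequencies
# Cases enriched for the minor-allele homozygote (a recessive-like signal).
geno[status == 1] = rng.choice([0, 1, 2], size=int(status.sum()),
                               p=[0.1, 0.2, 0.7])
codings = {"additive": {0: 0, 1: 1, 2: 2},
           "dominant": {0: 0, 1: 1, 2: 1},
           "recessive": {0: 0, 1: 0, 2: 1}}
raw = {name: trend_p(geno, status, s) for name, s in codings.items()}
adj = dict(zip(raw, bh_adjust(list(raw.values()))))
```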

9.
We develop a new genetic prediction method, smooth-threshold multivariate genetic prediction, using single nucleotide polymorphism (SNP) data in genome-wide association studies (GWASs). Our method consists of two stages. At the first stage, unlike the usual discontinuous SNP screening used in the gene score method, our method continuously screens SNPs based on the output from a standard univariate analysis for marginal association of each SNP. At the second stage, the predictive model is built by a generalized ridge regression simultaneously using the screened SNPs, with SNP weights determined by the strength of marginal association. Continuous SNP screening by smooth thresholding not only makes prediction stable but also leads to a closed-form expression of the generalized degrees of freedom (GDF). The GDF leads to Stein's unbiased risk estimate (SURE), which enables data-dependent choice of the optimal SNP screening cutoff without using cross-validation. Our method is very rapid because the computationally expensive genome-wide scan is required only once, in contrast to penalized regression methods such as the lasso and elastic net. Simulation studies that mimic real GWAS data with quantitative and binary traits demonstrate that the proposed method outperforms the gene score method and genomic best linear unbiased prediction (GBLUP), and shows comparable or sometimes improved performance relative to the lasso and elastic net, which are known to have good predictive ability but heavy computational cost. Application to whole-genome sequencing (WGS) data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) shows that the proposed method has higher predictive power than the gene score and GBLUP methods.
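A stripped-down sketch of the two stages, with simulated Gaussian "genotypes" and an ad hoc penalty-weight function standing in for the paper's smooth-thresholding rule (the SURE-based cutoff selection is omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 50
X = rng.normal(size=(n, p))          # stand-in for standardized genotypes
true_beta = np.zeros(p)
true_beta[:3] = [0.5, -0.4, 0.3]     # three causal SNPs
y = X @ true_beta + rng.normal(size=n)

# Stage 1: marginal association score for each SNP (z-like statistic).
marginal = np.abs(X.T @ y) / np.sqrt(n)

# Continuous ("smooth-threshold") screening: instead of discarding SNPs
# below a cutoff, give weakly associated SNPs a large ridge penalty.
tau = np.quantile(marginal, 0.8)
w = np.minimum((tau / marginal) ** 2, 1e6)

# Stage 2: generalized ridge regression with SNP-specific penalty weights.
lam = 5.0
beta_hat = np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ y)
```

Strongly associated SNPs receive nearly no shrinkage while weak ones are driven toward zero, which mimics screening without a hard cutoff.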

10.
Single nucleotide polymorphisms (SNPs) are important markers to investigate genetic heterogeneity of populations and to perform linkage disequilibrium (LD) mapping. We propose a new method, Psi, to express the frequency of the 2^Ns haplotypes for Ns di-allelic SNPs. Using the new expression of haplotype frequency, we propose a novel measure of LD, Dg, not only for SNP pairs but also for multiple markers. The values of Dg for SNP pairs were similar to the values of the conventional pairwise LD indices D' and r2, and Dg was found to quantitate components of LD that are not measured by the conventional indices for SNP pairs. We also propose a distinct method, Dg-based absolute estimation, to infer the absolute maximum estimates of haplotype frequency. The results of Dg-based absolute estimation of haplotype frequency for SNP pairs were compared with the conventional expectation-maximization (EM) algorithm; the new method gave better inference than the EM algorithm, which sometimes converged to a local extreme.

11.
Recently, large scale genome-wide association study (GWAS) meta-analyses have boosted the number of known signals for some traits into the tens and hundreds. Typically, however, variants are only analysed one-at-a-time. This complicates the ability of fine-mapping to identify a small set of SNPs for further functional follow-up. We describe a new and scalable algorithm, joint analysis of marginal summary statistics (JAM), for the re-analysis of published marginal summary statistics under joint multi-SNP models. Correlation between SNPs is accounted for according to estimates from a reference dataset, and models and SNPs that best explain the complete joint pattern of marginal effects are highlighted via an integrated Bayesian penalized regression framework. We provide both enumerated and Reversible Jump MCMC implementations of JAM and present some comparisons of performance. In a series of realistic simulation studies, JAM demonstrated identical performance to various alternatives designed for single region settings. In multi-region settings, where the only multivariate alternative involves stepwise selection, JAM offered greater power and specificity. We also present an application to real published results from MAGIC (meta-analysis of glucose and insulin related traits consortium) – a GWAS meta-analysis of more than 15,000 people. We re-analysed several genomic regions that produced multiple significant signals with glucose levels 2 hr after oral stimulation. Through joint multivariate modelling, JAM was able to formally rule out many SNPs and, for one gene, ADCY5, suggests that an additional SNP, which transpired to be more biologically plausible, should be followed up with equal priority to the reported index.

12.
Candidate gene association studies often utilize one single nucleotide polymorphism (SNP) for analysis, with an initial report typically not being replicated by subsequent studies. The failure to replicate may result from incomplete or poor identification of disease-related variants or haplotypes, possibly due to naive SNP selection. A method for identification of linkage disequilibrium (LD) groups and selection of SNPs that capture sufficient intra-genic genetic diversity is described. We assume all SNPs with minor allele frequency above a pre-determined frequency have been identified. Principal component analysis (PCA) is applied to evaluate multivariate SNP correlations to infer groups of SNPs in LD (LD-groups) and to establish an optimal set of group-tagging SNPs (gtSNPs) that provide the most comprehensive coverage of intra-genic diversity while minimizing the resources necessary to perform an informative association analysis. This PCA method differs from haplotype block (HB) and haplotype-tagging SNP (htSNP) methods, in that an LD-group of SNPs need not be a contiguous DNA fragment. Results of the PCA method compared well with existing htSNP methods while also providing advantages over those methods, including an indication of the optimal number of SNPs needed. Further, evaluation of the method over multiple replicates of simulated data indicated PCA to be a robust method for SNP selection. Our findings suggest that PCA may be a powerful tool for establishing an optimal SNP set that maximizes the amount of genetic variation captured for a candidate gene using a minimal number of SNPs.
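The core idea, PCA on the SNP correlation matrix to count LD-groups and assign possibly non-contiguous SNPs to them, can be sketched with simulated data. The 80% variance cutoff and the argmax-loading assignment below are illustrative choices, not the paper's exact rules:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
# Two LD-groups of SNPs (correlated within, independent between),
# deliberately interleaved so the groups are not contiguous.
z1, z2 = rng.normal(size=(2, n))
X = np.column_stack([
    z1 + 0.3 * rng.normal(size=n),   # SNP 0 (group A)
    z2 + 0.3 * rng.normal(size=n),   # SNP 1 (group B)
    z1 + 0.3 * rng.normal(size=n),   # SNP 2 (group A)
    z2 + 0.3 * rng.normal(size=n),   # SNP 3 (group B)
    z1 + 0.3 * rng.normal(size=n),   # SNP 4 (group A)
])
R = np.corrcoef(X, rowvar=False)
evals, evecs = np.linalg.eigh(R)
evals, evecs = evals[::-1], evecs[:, ::-1]        # sort descending
explained = np.cumsum(evals) / evals.sum()
k = int(np.searchsorted(explained, 0.80) + 1)     # components for 80% variance
# Assign each SNP to the retained component on which it loads most heavily.
groups = np.argmax(np.abs(evecs[:, :k]), axis=1)
```

Here k suggests the number of gtSNPs to genotype, and the group labels recover the interleaved LD-groups despite their non-contiguity.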

13.
Liu Z, Lin S. Genetic Epidemiology 2005, 29(4):353–364
Linkage disequilibrium (LD) plays a central role in fine mapping of disease genes and, more recently, in characterizing haplotype blocks. Classical LD measures, such as D' and r2, are frequently used to quantify the relationship between two loci. A pairwise "distance" matrix among a set of loci can be constructed using such a measure, based upon which a number of haplotype block detection and tagging single nucleotide polymorphism (SNP) selection algorithms have been devised. Although successful in many applications, the pairwise nature of these measures does not provide a direct characterization of joint linkage disequilibrium among multiple loci. Consequently, applications based on them may lead to loss of important information. In this report, we propose a multilocus LD measure based on generalized mutual information, which is also known as relative entropy or Kullback-Leibler distance. In essence, this measure seeks to quantify the distance between the observed haplotype distribution and the expected distribution assuming linkage equilibrium. We show that this measure is approximately equal to r2 in the special case with two loci. Based on this multilocus LD measure and an entropy measure that characterizes haplotype diversity, we propose a class of stepwise tagging SNP selection algorithms. This represents a unified approach for SNP selection in that it takes into account both the haplotype diversity and linkage disequilibrium objectives. Applications to both simulated and real data demonstrate the utility of the proposed methods for handling a large number of SNPs. The results indicate that multilocus LD patterns can be captured well, and informative and nonredundant SNPs can be selected effectively from a large set of loci.
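The measure reduces to a KL divergence between the observed haplotype distribution and the product of allele frequencies. A direct sketch, with hypothetical haplotype frequencies and natural-log units:

```python
import numpy as np
from itertools import product

def kl_ld(hap_freq):
    """Multilocus LD as the KL divergence between the observed haplotype
    distribution and the product of allele frequencies (linkage equilibrium).

    hap_freq: dict mapping allele tuples (0/1 per locus) to frequencies.
    """
    n_loci = len(next(iter(hap_freq)))
    # Marginal allele-1 frequency at each locus.
    marg = [sum(f for h, f in hap_freq.items() if h[l] == 1)
            for l in range(n_loci)]
    kl = 0.0
    for h in product((0, 1), repeat=n_loci):
        p = hap_freq.get(h, 0.0)
        if p > 0:
            q = 1.0
            for l, a in enumerate(h):
                q *= marg[l] if a == 1 else 1 - marg[l]
            kl += p * np.log(p / q)     # contribution of this haplotype
    return kl

# Perfect LD across three loci: only haplotypes 000 and 111 segregate.
perfect = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
# Linkage equilibrium: every haplotype frequency is the product of allele freqs.
equil = {h: 0.5 ** 3 for h in product((0, 1), repeat=3)}
```

The measure is zero exactly at equilibrium and grows with the joint dependence among the loci.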

14.
Population-based case-control design has become one of the most popular approaches for conducting genome-wide association scans for rare diseases like cancer. In this article, we propose a novel method for improving the power of the widely used single-SNP two-degrees-of-freedom (2 d.f.) association test for case-control studies by exploiting the common assumption of Hardy-Weinberg Equilibrium (HWE) for the underlying population. A key feature of the method is that it can relax the assumed model constraints via a completely data-adaptive shrinkage estimation approach so that the number of false-positive results due to departure from HWE is controlled. The method is computationally simple and is easily scalable to association tests involving hundreds of thousands or millions of genetic markers. Simulation studies as well as an application involving data from a real genome-wide association study illustrate that the proposed method is very robust for large-scale association studies and can improve the power for detecting susceptibility SNPs with recessive effects, when compared to existing methods. Implications of the general estimation strategy beyond the simple 2 d.f. association test are discussed. Genet. Epidemiol. 33:740–750, 2009. Published 2009 Wiley-Liss, Inc.

15.
Kernel machine learning methods, such as the SNP-set kernel association test (SKAT), have been widely used to test associations between traits and genetic polymorphisms. In contrast to traditional single-SNP analysis methods, these methods are designed to examine the joint effect of a set of related SNPs (such as a group of SNPs within a gene or a pathway) and are able to identify sets of SNPs that are associated with the trait of interest. However, as with many multi-SNP testing approaches, kernel machine testing can draw conclusions only at the SNP-set level, and does not directly inform which SNP(s) in an identified set actually drive the association. A recently proposed procedure, KerNel Iterative Feature Extraction (KNIFE), provides a general framework for incorporating variable selection into kernel machine methods. In this article, we focus on quantitative traits and relatively common SNPs, adapt the KNIFE procedure to genetic association studies, and propose an approach to identify driver SNPs after the application of SKAT to gene set analysis. Our approach accommodates several kernels that are widely used in SNP analysis, such as the linear kernel and the Identity by State (IBS) kernel. The proposed approach provides practically useful utilities to prioritize SNPs, and fills the gap between SNP set analysis and biological functional studies. Both simulation studies and a real data application are used to demonstrate the proposed approach.
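The two kernels named in the abstract are straightforward to compute from a genotype matrix. The sketch below shows only the kernel construction on invented genotypes, not the SKAT variance-component test or the KNIFE selection step:

```python
import numpy as np

def linear_kernel(G):
    """Linear kernel on a genotype matrix G (n x p, allele counts 0/1/2)."""
    return G @ G.T

def ibs_kernel(G):
    """Identity-by-state kernel: the average number of shared alleles per
    SNP between two individuals is 2 - |g_i - g_j|; scale the sum to [0, 1]."""
    n, p = G.shape
    K = np.zeros((n, n))
    for i in range(n):
        # Allele-count differences summed over SNPs, vectorized over j.
        diff = np.abs(G - G[i]).sum(axis=1)
        K[i] = (2 * p - diff) / (2 * p)
    return K

rng = np.random.default_rng(6)
G = rng.integers(0, 3, size=(10, 20)).astype(float)   # invented genotypes
K_lin = linear_kernel(G)
K_ibs = ibs_kernel(G)
```

Either matrix can then be supplied as the kernel in a variance-component score test.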

16.
Not accounting for interaction in association analyses may reduce the power to detect the variants involved. Using simulation, we investigate the power of different designs to detect, under two-locus models, the effect of disease-causing variants among several hundred markers with family-based association tests. This setting reflects realistic situations of exploration of linkage regions or of biological pathways. We define four strategies: (S1) single-marker analysis of all Single Nucleotide Polymorphisms (SNPs), (S2) two-marker analysis of all possible SNP pairs, (S3) lax preliminary selection of SNPs followed by a two-marker analysis of all selected SNP pairs, (S4) stringent preliminary selection of SNPs, each being later paired with all the SNPs for two-marker analysis. Strategy S2 is never the best design, except when there is an inversion of the gene effect (flip-flop model). Testing individual SNPs (S1) is the most efficient when the two genes act multiplicatively. Designs S3 and S4 are the most powerful for nonmultiplicative models. Their respective powers depend on the level of symmetry of the model. Because the true genetic model is unknown, we cannot conclude that one design outperforms another. The optimal approach would be the two-step strategy (S3 or S4), as it is often the most powerful, or the second best. Genet.

17.
We study the link between two quality measures of SNP (single nucleotide polymorphism) data in genome-wide association (GWA) studies, that is, per-SNP call rates (CR) and p-values for testing Hardy-Weinberg equilibrium (HWE). The aim is to improve these measures by applying methods based on realized randomized p-values, the false discovery rate and estimates for the proportion of false hypotheses. While exact non-randomized conditional p-values for testing HWE cannot be recommended for estimating the proportion of false hypotheses, their realized randomized counterparts should be used. P-values corresponding to the asymptotic unconditional chi-square test lead to reasonable estimates only if SNPs with low minor allele frequency are excluded. We provide an algorithm to compute the probability that SNPs violate HWE given the observed CR, which yields an improved measure of data quality. The proposed methods are applied to SNP data from the KORA (Cooperative Health Research in the Region of Augsburg, Southern Germany) 500K project, a GWA study in a population-based sample genotyped by Affymetrix GeneChip 500K arrays using the calling algorithm BRLMM 1.4.0. We show that all SNPs with CR = 100 per cent are nearly in perfect HWE, which supports the assumption that the population meets the conditions required for HWE, at least for these SNPs. Moreover, we show that the proportion of SNPs not in HWE increases with decreasing CR. We conclude that using a single threshold for judging HWE p-values without taking the CR into account is problematic. Instead we recommend a stratified analysis with respect to CR. Copyright © 2010 John Wiley & Sons, Ltd.
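The asymptotic unconditional chi-square test of HWE mentioned above can be sketched directly from genotype counts. The counts below are invented; the exact and randomized p-values the paper recommends require more machinery:

```python
import numpy as np
from math import erfc, sqrt

def hwe_chisq_p(n_aa, n_ab, n_bb):
    """Asymptotic 1-d.f. chi-square test of Hardy-Weinberg equilibrium
    from genotype counts; returns the p-value."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)           # allele-A frequency
    exp = np.array([n * p**2, 2 * n * p * (1 - p), n * (1 - p)**2])
    obs = np.array([n_aa, n_ab, n_bb], dtype=float)
    chi2 = ((obs - exp) ** 2 / exp).sum()
    # 1-d.f. chi-square upper tail via the normal: P(X > c) = erfc(sqrt(c/2)).
    return erfc(sqrt(chi2 / 2))

p_equil = hwe_chisq_p(360, 480, 160)   # exactly HWE at allele frequency 0.6
p_off = hwe_chisq_p(500, 200, 300)     # marked heterozygote deficit
```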

18.
Prioritization is the process whereby a set of possible candidate genes or SNPs is ranked so that the most promising can be taken forward into further studies. In a genome-wide association study, prioritization is usually based on the P-values alone, but researchers sometimes take account of external annotation information about the SNPs, such as whether the SNP lies close to a good candidate gene. Using external information in this way is inherently subjective and is often not formalized, making the analysis difficult to reproduce. Building on previous work that has identified 14 important types of external information, we present an approximate Bayesian analysis that produces an estimate of the probability of association. The calculation combines four sources of information: the genome-wide data, SNP information derived from bioinformatics databases, empirical SNP weights, and the researchers' subjective prior opinions. The calculation is fast enough that it can be applied to millions of SNPs, and although it does rely on subjective judgments, those judgments are made explicit so that the final SNP selection can be reproduced. We show that the resulting probability of association is intuitively more appealing than the P-value because it is easier to interpret and it makes allowance for the power of the study. We illustrate the use of the probability of association for SNP prioritization by applying it to a meta-analysis of kidney function genome-wide association studies and demonstrate that SNP selection performs better using the probability of association compared with P-values alone.

19.
Genomewide association studies (GWAS) sometimes identify loci at which both the number and identities of the underlying causal variants are ambiguous. In such cases, statistical methods that model effects of multiple single-nucleotide polymorphisms (SNPs) simultaneously can help disentangle the observed patterns of association and provide information about how those SNPs could be prioritized for follow-up studies. Current multi-SNP methods, however, tend to assume that SNP effects are well captured by additive genetics; yet when genetic dominance is present, this assumption translates to reduced power and faulty prioritizations. We describe a statistical procedure for prioritizing SNPs at GWAS loci that efficiently models both additive and dominance effects. Our method, LLARRMA-dawg, combines a group LASSO procedure for sparse modeling of multiple SNP effects with a resampling procedure based on fractional observation weights. It estimates for each SNP the robustness of association with the phenotype both to sampling variation and to competing explanations from other SNPs. In producing an SNP prioritization that best identifies underlying true signals, we show the following: our method easily outperforms a single-marker analysis; when additive-only signals are present, our joint model for additive and dominance is equivalent to or only slightly less powerful than modeling additive-only effects; and when dominance signals are present, even in combination with substantial additive effects, our joint model is unequivocally more powerful than a model assuming additivity. We also describe how performance can be improved through calibrated randomized penalization, and discuss how dominance in ungenotyped SNPs can be incorporated through either heterozygote dosage or multiple imputation.
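The additive-plus-dominance parameterization can be sketched as a two-column design per SNP: the additive allele count plus a heterozygote indicator. The encoding below is a common convention and may differ in detail from LLARRMA-dawg's:

```python
import numpy as np

def add_dom_design(geno):
    """Expand 0/1/2 allele counts into additive and dominance columns.

    Additive column: centered allele count. Dominance column: centered
    heterozygote indicator, so a purely additive SNP loads only on the
    first column while dominance deviations load on the second.
    """
    g = np.asarray(geno, dtype=float)
    additive = g - g.mean()
    dominance = (g == 1).astype(float)
    dominance -= dominance.mean()
    return np.column_stack([additive, dominance])

geno = np.array([0, 1, 2, 1, 0, 2, 1, 1])
X = add_dom_design(geno)
```

Stacking such pairs of columns across SNPs yields the design matrix on which a group LASSO (grouping each SNP's two columns) could be applied.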

20.
Analysis of single nucleotide polymorphisms (SNPs) is an important genetic tool that provides molecular markers for rapid differentiation of closely related strains. We have applied SNP discovery and analysis for distinguishing each of the four Shigella serogroups (Boydii, Dysenteriae, Flexneri, and Sonnei) and for discriminating individual strains within the same serogroup by using 24 SNPs selected from nine genes. Five SNPs were identified from sequence analysis of two housekeeping genes (gapA and thrB) used previously in our lab to differentiate Shigella isolates into distinct lineages. The remaining 19 SNPs were identified by in silico analyses of eight Shigella genomes and are within the genes lpxC, sanA, yaaH, ybaP, ygaZ, yhbO, and ynhA. A total of 118 Shigella strains comprising 20 Boydii, 29 Dysenteriae, 42 Flexneri, and 27 Sonnei isolates were analyzed using the SNP typing scheme reported here. The combination of the 24 SNPs resulted in the identification of 26 SNP genotypes among the four Shigella serogroups and also provided some discriminatory resolution among individual strains within the same serogroup. The SNPs presented here should prove useful in identifying Shigella using PCR amplification and rapid sequence typing strategies.
