Similar Articles
20 similar articles found.
1.
Although imputation of missing SNP results has been widely used in genetic studies, claims about the quality and usefulness of imputation have outnumbered the few studies that have questioned its limitations. But it is becoming clear that these limitations are real—for example, disease association signals can be missed in regions of LD breakdown. Here, as a case study, using the chromosomal region of the well-known lactase gene, LCT, we address the issue of imputation in the context of variants that have become frequent in a limited number of modern population groups only recently, due to selection. We study SNPs in a 500 bp region covering the enhancer of LCT, and compare imputed genotypes with directly genotyped data. We examine the haplotype pairs of all individuals with discrepant and missing genotypes. We highlight the nonrandom nature of the allelic errors and show that most incorrect imputations and missing data result from long haplotypes that are evolutionarily closely related to those carrying the derived alleles, while some relate to rare and recombinant haplotypes. We conclude that bias of incorrectly imputed and missing genotypes can decrease the accuracy of imputed results substantially.

2.

Background

Missing data is a common nuisance in eHealth research: it is hard to prevent and may invalidate research findings.

Objective

In this paper several statistical approaches to data “missingness” are discussed and tested in a simulation study. Basic approaches (complete case analysis, mean imputation, and last observation carried forward) and advanced methods (expectation maximization, regression imputation, and multiple imputation) are included in this analysis, and strengths and weaknesses are discussed.

Methods

The dataset used for the simulation was obtained from a prospective cohort study following participants in an online self-help program for problem drinkers. It contained 124 nonnormally distributed endpoints, that is, daily alcohol consumption counts of the study respondents. Missing-at-random (MAR) missingness was induced in a selected variable for 50% of the cases. Validity, reliability, and coverage of the estimates obtained using the different imputation methods were assessed in a bootstrapping simulation study.

Results

In the performed simulation study, the use of multiple imputation techniques led to accurate results. Differences were found between the 4 tested multiple imputation programs: NORM, MICE, Amelia II, and SPSS MI. Among the tested approaches, Amelia II outperformed the others, led to the smallest deviation from the reference value (Cohen’s d = 0.06), and had the largest coverage percentage of the reference confidence interval (96%).

Conclusions

The use of multiple imputation improves the validity of the results when analyzing datasets with missing observations. Some of the often-used approaches (LOCF, complete-case analysis) did not perform well, and we therefore recommend against using them. Support for the analysis of multiply imputed datasets is accumulating in recent versions of widely used statistical software programs, making multiple imputation more readily available to less mathematically inclined researchers.
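The basic approaches compared in this study (complete case analysis, mean imputation, and LOCF) can be sketched in a few lines; the toy dataset and variable names below are invented for illustration and are not the study's data:

```python
import numpy as np
import pandas as pd

# Toy longitudinal data: 4 subjects x 3 visits, NaN marks dropout.
data = pd.DataFrame(
    {"v1": [10.0, 12.0, 8.0, 11.0],
     "v2": [9.0, np.nan, 7.0, 10.0],
     "v3": [8.0, np.nan, np.nan, 9.0]},
    index=["s1", "s2", "s3", "s4"],
)

# Complete case analysis: drop every subject with any missing visit.
complete = data.dropna()

# Mean imputation: replace each missing value with its column mean.
mean_imp = data.fillna(data.mean())

# Last observation carried forward (LOCF): propagate the last observed value.
locf = data.ffill(axis=1)

print(complete.shape[0])     # -> 2 subjects survive complete case analysis
print(locf.loc["s2", "v3"])  # -> 12.0, the baseline value carried forward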

3.
Genotype imputation across populations of mixed ancestry is critical for optimal discovery in large‐scale genome‐wide association studies (GWAS). Methods for direct imputation of GWAS summary statistics were previously shown to be practically as accurate as summary statistics produced after raw genotype imputation, while incurring orders of magnitude lower computational burden. Given that direct imputation requires a precise estimate of linkage disequilibrium (LD), and that most methods use a small reference panel (e.g., the ~2,500 subjects of the 1000 Genomes Project), there is a great need for much larger and more diverse reference panels. To accurately estimate the LD needed for an exhaustive analysis of any cosmopolitan cohort, we developed DISTMIX2. DISTMIX2: (a) uses a much larger and more diverse reference panel than traditional reference panels, and (b) can estimate ethnic-mixture weights based solely on Z-scores when allele frequencies are not available. We applied DISTMIX2 to GWAS summary statistics from the Psychiatric Genomics Consortium (PGC). DISTMIX2 uncovered signals in numerous new regions, with most of these findings coming from rarer variants. Rarer variants provide a much sharper location for association signals than common variants, as LD for rare variants extends over a shorter distance than for common ones. For example, while the original PGC post‐traumatic stress disorder GWAS found only 3 marginal signals for common variants, we now uncover a very strong signal for a rare variant in PKN2, a gene associated with neuronal and hippocampal development. Thus, DISTMIX2 provides a robust and fast (re)imputation approach for most psychiatric GWAS studies.

4.
Imputation allows the inference of unobserved genotypes in low-density data sets, and is often used to test for disease association at variants that are poorly captured by standard genotyping chips (such as low-frequency variants). Although much effort has gone into developing the best imputation algorithms, less is known about the effects of reference set choice on imputation accuracy. We assess the improvements afforded by increases in reference size and diversity, specifically comparing the HapMap2 data set, which has been used to date for imputation, and the new HapMap3 data set, which contains more samples from a more diverse range of populations. We find that, for imputation into Western European samples, the HapMap3 reference provides more accurate imputation with better-calibrated quality scores than HapMap2, and that increasing the number of HapMap3 populations included in the reference set grants further improvements. Improvements are most pronounced for low-frequency variants (frequency <5%), with the largest and most diverse reference sets bringing the accuracy of imputation of low-frequency variants close to that of common ones. For low-frequency variants, reference set diversity can improve the accuracy of imputation, independent of reference sample size. HapMap3 reference sets provide significant increases in imputation accuracy relative to HapMap2, and are of particular use if highly accurate imputation of low-frequency variants is required. Our results suggest that, although the sample sizes from the 1000 Genomes Pilot Project will not allow reliable imputation of low-frequency variants, the larger sample sizes of the main project will.

5.
Recently, the Haplotype Reference Consortium (HRC) released a large imputation panel that allows more accurate imputation of genetic variants. In this study, we compared a set of directly assayed common and rare variants from an exome array to genotypes imputed from two reference panels, the 1000 Genomes Project (1000GP) and HRC. We showed that imputation using the HRC panel improved the concordance between assayed and imputed genotypes at common, and especially, low‐frequency variants. Furthermore, we performed a genome‐wide association meta‐analysis of vertical cup‐disc ratio, a highly heritable endophenotype of glaucoma, in four cohorts using 1000GP and HRC imputations, and compared the results of the two meta‐analyses. Overall, we found that using HRC imputation significantly improved P values (P = 3.07 × 10−61), particularly for suggestive variants. Both meta‐analyses were performed on the same sample size, yet we found eight genome‐wide significant loci in the HRC‐based meta‐analysis versus seven in the 1000GP‐based meta‐analysis. This study provides supporting evidence of the new avenues for gene discovery and fine mapping that the HRC imputation panel offers.

6.
Background:

Missing data can compromise inferences from clinical trials, yet the topic has received little attention in the clinical trial community. Shortcomings in methods commonly used to analyze studies with missing data (complete case, last- or baseline-observation carried forward) have been highlighted in a recent Food and Drug Administration-sponsored report, which recommends ways to mitigate the issues associated with missing data. We present an example of the proposed concepts using data from recent clinical trials.

Methods:

CD4+ cell count data from the previously reported SINGLE and MOTIVATE studies of dolutegravir and maraviroc were analyzed using a variety of statistical methods to explore the impact of missing data. Four methodologies were used: complete case analysis, simple imputation, mixed models for repeated measures, and multiple imputation. We compared the sensitivity of conclusions to the volume of missing data and to the assumptions underpinning each method.

Results:

Rates of missing data were greater in the MOTIVATE studies (35%–68% premature withdrawal) than in SINGLE (12%–20%). The sensitivity of results to assumptions about missing data was related to the volume of missing data. Estimates of treatment differences by various analysis methods ranged across a 61 cells/mm³ window in MOTIVATE and a 22 cells/mm³ window in SINGLE.

Conclusions:

Where missing data are anticipated, analyses require robust statistical and clinical debate of the necessary but unverifiable underlying statistical assumptions. Multiple imputation makes these assumptions transparent, can accommodate a broad range of scenarios, and is a natural analysis for clinical trials in HIV with missing data.

7.
Apolipoprotein E, encoded by APOE, is the main apoprotein for catabolism of chylomicrons and very low density lipoprotein. Two common single-nucleotide polymorphisms (SNPs) in APOE, rs429358 and rs7412, determine the three epsilon alleles that are established genetic risk factors for late-onset Alzheimer's disease (AD), cerebral amyloid angiopathy, and intracerebral hemorrhage (ICH). These two SNPs are not present in most commercially available genome-wide genotyping arrays and cannot be inferred through imputation using HapMap reference panels. Therefore, these SNPs are often separately genotyped. Introduction of reference panels compiled from the 1000 Genomes project has made imputation of these variants possible. We compared the directly genotyped and imputed SNPs that define the APOE epsilon alleles to determine the accuracy of imputation for inference of unobserved epsilon alleles. We utilized genome-wide genotype data obtained from two cohorts of ICH and AD constituting subjects of European ancestry. Our data suggest that imputation is highly accurate, yields an acceptable proportion of missing data that is non-differentially distributed across case and control groups, and generates comparable results to genotyped data for hypothesis testing. Further, we explored the effect of imputation algorithm parameters and demonstrated that customization of these parameters yields an improved balance between accuracy and missing data for inferred genotypes.
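The deterministic mapping from phased rs429358/rs7412 haplotypes to epsilon alleles is simple enough to state in code. A minimal sketch follows; the function name and the "e?" fallback for the rare (C, T) haplotype are my own choices:

```python
# APOE epsilon haplotypes defined by the (rs429358, rs7412) alleles:
# e2 = (T, T), e3 = (T, C), e4 = (C, C); (C, T) is a rare haplotype.
EPSILON = {("T", "T"): "e2", ("T", "C"): "e3", ("C", "C"): "e4"}

def apoe_genotype(hap1, hap2):
    """Return the epsilon genotype for two phased (rs429358, rs7412) haplotypes."""
    return "/".join(sorted(EPSILON.get(h, "e?") for h in (hap1, hap2)))

print(apoe_genotype(("T", "C"), ("C", "C")))  # -> e3/e4
```

Phasing matters here: the two SNPs must be assigned to the same chromosome copy before the lookup, which is what haplotype-based imputation against a 1000 Genomes reference panel provides.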

8.
Many studies have suggested that myelin dysfunction may be causally involved in the pathogenesis of schizophrenia. Nogo (RTN4), myelin‐associated glycoprotein (MAG) and oligodendrocyte myelin glycoprotein (OMG) all bind to the common receptor, Nogo‐66 receptor 1 (RTN4R). We examined 68 single nucleotide polymorphisms (SNPs) (51 with genotyping and 17 with imputation analysis) from these four genes for genetic association with schizophrenia, using a 2,120 case–control sample from the Japanese population. Allelic tests showed nominally significant association of two RTN4 SNPs (P = 0.047 and 0.037 for rs11894868 and rs2968804, respectively) and two MAG SNPs (P = 0.034 and 0.029 for rs7249617 and rs16970218, respectively) with schizophrenia. The MAG SNP rs7249617 also showed nominal significance in a genotypic test (P = 0.017). In haplotype analysis, the MAG haplotype block including rs7249617 and rs16970218 showed nominal significance (P = 0.008). These associations did not remain significant after correction for multiple testing, possibly due to their small genetic effect. In the imputation analysis of RTN4, the untyped SNP rs2972090 showed nominally significant association (P = 0.032) and several imputed SNPs showed marginal associations. Moreover, in silico analysis (PolyPhen) of a missense variant (rs11677099: Asp357Val), which is in strong linkage disequilibrium with rs11894868, predicted a deleterious effect on Nogo protein function. Despite a failure to detect robust associations in this Japanese cohort, our nominally positive signals, taken together with previously reported biological and genetic findings, add further support to the “disturbed myelin system theory of schizophrenia” across different populations.

9.
With the advent of publicly available genome‐wide genotyping data, the use of genotype imputation methods is becoming increasingly common. These methods are of particular use in joint analyses, where data from different genotyping platforms are imputed to a reference set and combined in a single analysis. We show here that such an analysis can miss strong genetic association signals, such as that of the apolipoprotein E (APOE) gene in late‐onset Alzheimer disease. This can occur in regions of weak to moderate LD; unobserved SNPs are not imputed with confidence, so there is no consensus SNP set on which to perform association tests. Both the IMPUTE and MaCH software packages are tested, with similar results. Additionally, we show that a meta‐analysis that properly accounts for the genotype uncertainty can recover association signals that were lost under a joint analysis. This shows that joint analyses of imputed genotypes, particularly failure to replicate strong signals, should be considered critically and examined on a case‐by‐case basis.

10.
Genome‐wide association (GWA) meta‐analysis has become a popular approach for discovering genetic variants responsible for complex diseases. Between‐study heterogeneity is a serious issue that may complicate the interpretation of results. Aiming to improve the interpretation of meta‐analysis results, we empirically explored the extent and sources of heterogeneity. We analyzed a previously reported GWA meta‐analysis of obesity, in which over 21,000 subjects from seven individual samples were meta‐analyzed. We first evaluated the extent and distribution of heterogeneity across the entire genome. We then studied the effects of several potentially confounding factors, including age, ethnicity, gender composition, study type, and genotype imputation, on heterogeneity with a random‐effects meta‐regression model. Of the 4,325,550 SNPs tested, heterogeneity was moderate to very large for 25.4%. Heterogeneity was more severe in SNPs with stronger association signals. Ethnicity, average age, and genotype imputation accuracy had significant effects on heterogeneity. Exploring the effects of ethnicity can provide clues to potential ethnic‐specific effects for two loci known to affect obesity, MC4R and MTCH2. Our analysis can help to clarify understanding of the obesity mechanism and may provide guidance for the effective design of future GWA meta‐analyses.

11.
Genotype imputation has become an essential tool in the analysis of genome-wide association scans. This technique allows investigators to test association at ungenotyped genetic markers, and to combine results across studies that rely on different genotyping platforms. In addition, imputation is used within long-running studies to reuse genotypes produced across generations of platforms. Typically, genotypes of controls are reused and cases are genotyped on newer platforms, yielding a case-control study that is not matched for genotyping platform. In this study, we scrutinize such a situation and validate GWAS results by retyping top-ranking SNPs with the Sequenom MassArray platform. We discuss the needed quality controls (QCs). In doing so, we report a considerable discrepancy between the results from imputed and retyped data when applying recommended QCs from the literature. These discrepancies appear to be caused by extrapolating differences between arrays through the process of imputation. To avoid false positive results, we recommend that more stringent QCs be applied. We also advocate reporting the imputation quality measure (R_T²) for the post-imputation QCs in publications.

12.
Human leucocyte antigen (HLA) genes play a central role in response to pathogens and in autoimmunity. Research to understand the effects of HLA genes on health has been limited because HLA genotyping protocols are labour intensive and expensive. Recently, algorithms to impute HLA genotype data using genome‐wide association study (GWAS) data have been published. However, imputation accuracy for most of these algorithms was based primarily on training data sets of European ancestry individuals. We considered performance of two HLA‐dedicated imputation algorithms – SNP2HLA and HIBAG – in a multiracial population of n = 1587 women with HLA genotyping data by gold standard methods. We first compared accuracy – defined as the percentage of correctly predicted alleles – of HLA‐B and HLA‐C imputation using SNP2HLA and HIBAG using a breakdown of the data set into an 80% training group and a 20% testing group. Estimates of accuracy for HIBAG were either the same or better than those for SNP2HLA. We then conducted a more thorough test of HIBAG imputation accuracy using five independent 10‐fold cross‐validation procedures with delineation of ancestry groups using ancestry informative markers. Overall accuracy for HIBAG was 89%. Accuracy by HLA gene was 93% for HLA‐A, 84% for HLA‐B, 94% for HLA‐C, 83% for HLA‐DQA1, 91% for HLA‐DQB1 and 88% for HLA‐DRB1. Accuracy was highest in the African ancestry group (the largest group) and lowest in the Hispanic group (the smallest group). Despite suboptimal imputation accuracy for some HLA gene/ancestry group combinations, the HIBAG algorithm has the advantage of providing posterior estimates of accuracy which enable the investigator to analyse subsets of the population with high predicted (e.g. >95%) imputation accuracy.

13.
We hypothesize that imputation based on data from the 1000 Genomes Project can identify novel association signals on a genome-wide scale due to the dense marker map and the large number of haplotypes. To test the hypothesis, the Wellcome Trust Case Control Consortium (WTCCC) Phase I genotype data were imputed using 1000 genomes as reference (20100804 EUR), and seven case/control association studies were performed using imputed dosages. We observed two 'missed' disease-associated variants that were undetectable by the original WTCCC analysis, but were reported by later studies after the 2007 WTCCC publication. One is within the IL2RA gene for association with type 1 diabetes and the other in proximity with the CDKN2B gene for association with type 2 diabetes. We also identified two refined associations. One is SNP rs11209026 in exon 9 of IL23R for association with Crohn's disease, which is predicted to be probably damaging by PolyPhen2. The other refined variant is in the CUX2 gene region for association with type 1 diabetes, where the newly identified top SNP rs1265564 has an association P-value of 1.68 × 10−16. The new lead SNP for the two refined loci provides a more plausible explanation for the disease association. We demonstrated that 1000 Genomes-based imputation could indeed identify both novel (in our case, 'missed' because they were detected and replicated by studies after 2007) and refined signals. We anticipate the findings derived from this study to provide timely information when individual groups and consortia are beginning to engage in 1000 genomes-based imputation.

14.
With advances in high-throughput single-nucleotide polymorphism (SNP) genotyping, the amount of genotype data available for genetic studies is steadily increasing, and with it comes new abilities to study multigene interactions as well as to develop higher dimensional genetic models that more closely represent the polygenic nature of common disease risk. The combined impact of even small amounts of missing data on a multi-SNP analysis may be considerable. In this study, we present a neural network method for imputing missing SNP genotype data. We compared its imputation accuracy with fastPHASE and an expectation-maximization algorithm implemented in HelixTree. In a simulation data set of 1000 SNPs and 1000 subjects, 1, 5 and 10% of genotypes were randomly masked. Four levels of linkage disequilibrium (LD), R²<0.2, R²<0.5, R²<0.8 and no LD threshold, were examined to evaluate the impact of LD on imputation accuracy. All three methods are capable of imputing most missing genotypes accurately (accuracy >86%). The neural network method accurately predicted 92.0–95.9% of the missing genotypes. In a real data set comparison with 419 subjects and 126 SNPs from chromosome 2, the neural network method achieved the highest imputation accuracy (>83.1%) with missing rates from 1 to 5%. Using 90 HapMap subjects with 1962 SNPs, fastPHASE had the highest accuracy (~97%) while the other two methods exceeded 95% accuracy. These results indicate that the neural network model is an accurate and convenient tool, requiring minimal parameter tuning for SNP data recovery, and provides a valuable alternative to the usual complete-case analysis.
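The masking protocol used above to score imputation accuracy can be sketched as follows. The imputer here is a naive per-SNP mode fill standing in for the neural network, fastPHASE, and EM methods of the study, and all data are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated genotypes: 200 subjects x 50 SNPs coded 0/1/2.
geno = rng.integers(0, 3, size=(200, 50)).astype(float)

# Randomly mask 5% of entries, mimicking the masking step of the study.
mask = rng.random(geno.shape) < 0.05
observed = np.where(mask, np.nan, geno)

def mode_impute(g):
    """Fill each SNP's missing entries with its most common observed genotype."""
    out = g.copy()
    for j in range(g.shape[1]):
        col = g[:, j]
        vals = col[~np.isnan(col)]
        counts = [np.sum(vals == k) for k in (0, 1, 2)]
        out[np.isnan(col), j] = float(np.argmax(counts))
    return out

imputed = mode_impute(observed)

# Accuracy = fraction of masked genotypes recovered exactly.
accuracy = np.mean(imputed[mask] == geno[mask])
print(round(accuracy, 3))
```

With independent, uniformly distributed SNPs, a mode fill can do little better than chance; the high accuracies reported above come from exploiting LD between neighboring SNPs, which this baseline ignores.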

15.
Multiple Imputation of Missing Genotype Data for Unrelated Individuals
The objective of this study was to investigate the performance of multiple imputation of missing genotype data for unrelated individuals using the polytomous logistic regression model, focusing on different missingness mechanisms, percentages of missing data, and imputation models. A complete dataset of 581 individuals, each analysed for eight biallelic polymorphisms and the quantitative phenotype HDL-C, was used. From this dataset one hundred replicates with missing data were created, in different ways for different scenarios. The performance was assessed by comparing the mean bias in parameter estimates, the root mean squared standard errors, and the genotype-imputation error rates. Overall, the mean bias was small in all scenarios, and in most scenarios the mean did not differ significantly from 'no bias'. Including polymorphisms that are highly correlated in the imputation model reduced the genotype-imputation error rate and increased precision of the parameter estimates. The method works well for data that are missing completely at random, and for data that are missing at random. In conclusion, our results indicate that multiple imputation with the polytomous logistic regression model can be used for association studies to deal with the problem of missing genotype data, when attention is paid to the imputation model and the percentage of missing data.
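A minimal sketch of the core idea, with a polytomous logistic (softmax) model fitted by plain gradient descent on simulated SNPs. Note that a proper multiple imputation would also draw the regression parameters from their posterior before each draw, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated biallelic SNPs coded 0/1/2; snp2 predicts snp1.
snp2 = rng.integers(0, 3, size=500)
snp1 = np.clip(snp2 + rng.integers(-1, 2, size=500), 0, 2)
missing = rng.random(500) < 0.2          # 20% of snp1 missing at random

# Fit a polytomous logistic (softmax) model of snp1 on snp2,
# using only the observed cases.
X = np.column_stack([np.ones(500), snp2])   # intercept + predictor
Y = np.eye(3)[snp1]                         # one-hot genotype labels
W = np.zeros((2, 3))
for _ in range(2000):
    logits = X @ W
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    W -= 0.5 * X[~missing].T @ (P[~missing] - Y[~missing]) / (~missing).sum()

# Multiple imputation: draw M completed datasets from the predicted
# genotype probabilities rather than taking a single best guess.
M = 5
logits = X[missing] @ W
P_miss = np.exp(logits - logits.max(axis=1, keepdims=True))
P_miss /= P_miss.sum(axis=1, keepdims=True)
imputations = [np.array([rng.choice(3, p=p) for p in P_miss]) for _ in range(M)]
print(len(imputations), imputations[0].size == missing.sum())
```

Each of the M completed datasets would then be analysed separately and the results pooled, which is where Rubin's rules (entry 17 below describes them) come in.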

16.
Missing data, such as appropriateness ratings in clinical research, are a common problem that often biases results. This paper introduces the multiple imputation method for handling missing data in clinical research and suggests that multiple imputation can give more accurate estimates than a complete-case analysis. The idea of multiple imputation is that each missing value is replaced with more than one plausible value. The appropriateness method was developed as a pragmatic solution to the problem of assessing "appropriate" surgical and medical procedures for patients. Cataract surgery was selected as one of four procedures evaluated as part of the Clinical Appropriateness Initiative. We created mild to high missing rates of 10%, 30% and 50% and compared the performance of logistic regression on the cataract surgery data. We treated the coefficients in the original data as true parameters and compared them with the other results. At the mild missing rate (10%), the deviation from the true coefficients was small and ignorable, and the complete-case analysis did not reveal any serious bias after removing the missing data. However, as the missing rate increased, the bias was no longer ignorable and distorted the results. This simulation study suggests that a multiple imputation technique can give more accurate estimates than a complete-case analysis, especially for moderate to high missing rates (30–50%). In addition, multiple imputation yields better accuracy than single imputation. Therefore, multiple imputation is useful and efficient in clinical research situations with large amounts of missing data.

17.

Background  

Multiple imputation (MI) provides an effective approach to handle missing covariate data within prognostic modelling studies, as it can properly account for the missing data uncertainty. The multiply imputed datasets are each analysed using standard prognostic modelling techniques to obtain the estimates of interest. The estimates from each imputed dataset are then combined into one overall estimate and variance, incorporating both the within and between imputation variability. Rubin's rules for combining these multiply imputed estimates are based on asymptotic theory. The resulting combined estimates may be more accurate if the posterior distribution of the population parameter of interest is better approximated by the normal distribution. However, the normality assumption may not be appropriate for all the parameters of interest when analysing prognostic modelling studies, such as predicted survival probabilities and model performance measures.
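Rubin's rules as described above reduce to a few lines; the numbers in the example call are invented for illustration:

```python
def rubins_rules(estimates, variances):
    """Pool M multiply imputed estimates and variances with Rubin's rules."""
    m = len(estimates)
    qbar = sum(estimates) / m                  # pooled point estimate
    ubar = sum(variances) / m                  # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total = ubar + (1 + 1 / m) * b             # total variance
    return qbar, total, b

# Five imputed-dataset estimates of one coefficient, each with variance 0.04.
qbar, total, b = rubins_rules([1.0, 1.2, 0.9, 1.1, 1.3], [0.04] * 5)
print(round(qbar, 2), round(total, 4))  # -> 1.1 0.07
```

The between-imputation component b is exactly what Rubin's rules add on top of a single-imputation analysis; the asymptotic-normality caveat raised above concerns the assumption that the pooled estimate's posterior is approximately Gaussian.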

18.
Retinal dystrophies are a heterogeneous group of disorders of visual function leading to partial or complete blindness. We report the genetic basis of an unusual retinal dystrophy in five families with affected females and no affected males. Heterozygous missense variants were identified in the X‐linked phosphoribosyl pyrophosphate synthetase 1 (PRPS1) gene: c.47C > T, p.(Ser16Phe); c.586C > T, p.(Arg196Trp); c.641G > C, p.(Arg214Pro); and c.640C > T, p.(Arg214Trp). Missense variants in PRPS1 are usually associated with disease in male patients, including Arts syndrome, Charcot–Marie–Tooth, and nonsyndromic sensorineural deafness. In our study families, affected females manifested a retinal dystrophy with interocular asymmetry. Three unrelated females from these families had hearing loss leading to a diagnosis of Usher syndrome. Other neurological manifestations were also observed in three individuals. Our data highlight the unexpected X‐linked inheritance of retinal degeneration in females caused by variants in PRPS1 and suggest that tissue‐specific skewed X‐inactivation or variable levels of pyrophosphate synthetase‐1 deficiency are the underlying mechanism(s). We speculate that the absence of affected males in the study families suggests that some variants may be male embryonic lethal when inherited in the hemizygous state. The unbiased nature of next‐generation sequencing enables all possible modes of inheritance to be considered for association of gene variants with novel phenotypic presentation.

19.
Missing values in microarray data can significantly affect subsequent analysis, thus it is important to estimate these missing values accurately. In this paper, a sequential local least squares imputation (SLLSimpute) method is proposed to solve this problem. It estimates missing values sequentially from the gene containing the fewest missing values and partially utilizes these estimated values. In addition, an automatic parameter selection algorithm, which can generate an appropriate number of neighboring genes for each target gene, is presented for parameter estimation. Experimental results confirmed that SLLSimpute method exhibited better estimation ability compared with other currently used imputation methods.
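The regression step at the heart of local least squares imputation can be sketched as follows (simulated expression values; the sequential ordering and automatic neighbor selection of SLLSimpute are omitted, and all names are my own):

```python
import numpy as np

def lls_impute_row(target, neighbors, miss_idx):
    """Regress the target gene's observed entries on its neighbor genes,
    then predict the target's missing entries from the fitted weights."""
    obs = np.setdiff1d(np.arange(target.size), miss_idx)
    # Least-squares solve: neighbors[:, obs].T @ w ~ target[obs]
    w, *_ = np.linalg.lstsq(neighbors[:, obs].T, target[obs], rcond=None)
    filled = target.copy()
    filled[miss_idx] = neighbors[:, miss_idx].T @ w
    return filled

rng = np.random.default_rng(2)
base = rng.normal(size=20)                                # shared profile
neighbors = base + rng.normal(scale=0.05, size=(3, 20))   # 3 correlated genes
target = base + rng.normal(scale=0.05, size=20)
filled = lls_impute_row(target, neighbors, miss_idx=np.array([4, 11]))
print(np.all(np.isfinite(filled)))  # -> True
```

SLLSimpute's refinement is to order the genes by missingness, impute the least-missing gene first, and let later regressions reuse the already-filled values.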

20.
Discovering statistical correlation between causal genetic variation and clinical traits through association studies is an important method for identifying the genetic basis of human diseases. Since fully resequencing a cohort is prohibitively costly, genetic association studies take advantage of local correlation structure (or linkage disequilibrium) between single nucleotide polymorphisms (SNPs) by selecting a subset of SNPs to be genotyped (tag SNPs). While many current association studies are performed using commercially available high‐throughput genotyping products that define a set of tag SNPs, choosing tag SNPs remains an important problem for both custom follow‐up studies as well as designing the high‐throughput genotyping products themselves. The most widely used tag SNP selection method optimizes the correlation between SNPs (r²). However, tag SNPs chosen based on an r² criterion do not necessarily maximize the statistical power of an association study. We propose a study design framework that chooses SNPs to maximize power and efficiently measures the power through empirical simulation. Empirical results based on the HapMap data show that our method gains considerable power over a widely used r²‐based method, or equivalently reduces the number of tag SNPs required to attain the desired power of a study. Our power‐optimized 100k whole genome tag set provides equivalent power to the Affymetrix 500k chip for the CEU population. For the design of custom follow‐up studies, our method provides up to twice the power increase using the same number of tag SNPs as r²‐based methods. Our method is publicly available via a web server at http://design.cs.ucla.edu.
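The widely used r²-based baseline that this method improves on can be sketched as a greedy cover; the data here are simulated, with five SNPs in one perfect-LD block plus one independent SNP:

```python
import numpy as np

def greedy_tag_snps(genotypes, r2_threshold=0.8):
    """Greedily pick tag SNPs until every SNP has r^2 >= threshold with some tag."""
    r2 = np.corrcoef(genotypes.T) ** 2
    untagged = set(range(genotypes.shape[1]))
    tags = []
    while untagged:
        # Choose the SNP covering the most remaining untagged SNPs.
        best = max(untagged,
                   key=lambda j: sum(r2[j, k] >= r2_threshold for k in untagged))
        tags.append(best)
        untagged -= {k for k in untagged if r2[best, k] >= r2_threshold}
    return tags

rng = np.random.default_rng(3)
block = rng.integers(0, 3, size=(100, 1)).astype(float)
# Five copies of one SNP (perfect LD) plus one independent SNP.
geno = np.hstack([np.repeat(block, 5, axis=1),
                  rng.integers(0, 3, size=(100, 1)).astype(float)])
tags = greedy_tag_snps(geno)
print(len(tags))  # -> 2: one tag covers the LD block, one the lone SNP
```

As the abstract points out, covering every SNP at r² ≥ 0.8 is not the same as maximizing power: power also depends on allele frequencies and effect sizes, which this criterion ignores.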
