Similar literature
 Found 20 similar documents; search took 31 ms
1.
Many computational approaches estimate the effect of coding variants, but their predictions often disagree with each other. These contradictions confound users and raise questions regarding reliability. Performance assessments can indicate the expected accuracy for each method and highlight advantages and limitations. The Critical Assessment of Genome Interpretation (CAGI) community aims to organize objective and systematic assessments: they challenge predictors on unpublished experimental and clinical data and assign independent assessors to evaluate the submissions. We participated in CAGI experiments as predictors, using the Evolutionary Action (EA) method to estimate the fitness effect of coding mutations. EA is untrained, uses homology information, and relies on a formal equation: the fitness effect equals the functional sensitivity to residue changes multiplied by the magnitude of the substitution. In previous CAGI experiments (between 2011 and 2016), our submissions aimed to predict the protein activity of single mutants. In 2018 (CAGI5), we also submitted predictions regarding clinical associations, folding stability, and matching genomic data with phenotype. For all these diverse challenges, we used EA to predict the fitness effect of variants, adjusted to specifically address each question. Our submissions had consistently good performance, suggesting that EA reliably predicts the effects of genetic variants.
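The EA relation stated in this abstract can be sketched as a simple product. This is only a minimal illustration, assuming both quantities are already available as numbers; the function name `evolutionary_action` and the example values are hypothetical, and the abstract does not say how the functional sensitivity or the substitution magnitude are actually derived.

```python
def evolutionary_action(sensitivity: float, substitution_magnitude: float) -> float:
    """Fitness effect as the product described in the abstract.

    Both inputs are hypothetical placeholders: the abstract does not specify
    how the functional sensitivity or the substitution magnitude are computed.
    """
    return sensitivity * substitution_magnitude


# Illustrative values only (not taken from the paper).
effect = evolutionary_action(sensitivity=0.8, substitution_magnitude=0.5)
print(effect)
```

Under this reading, a large substitution at an insensitive position and a small substitution at a sensitive position can yield similar predicted effects.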

2.
Precision medicine and sequence‐based clinical diagnostics seek to predict disease risk or to identify causative variants from sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype‐phenotype prediction challenges; participants build models, undergo assessment, and share key findings. In the past, few CAGI challenges have addressed the impact of sequence variants on splicing. In CAGI5, two challenges (Vex‐seq and MaPSy) involved prediction of the effect of variants, primarily single‐nucleotide changes, on splicing. Although there are significant differences between these two challenges, both involved prediction of results from high‐throughput exon inclusion assays. Here, we discuss the methods used to predict the impact of these variants on splicing, their performance, strengths, and weaknesses, and prospects for predicting the impact of sequence variation on splicing and disease phenotypes.

3.
The CAGI‐5 pericentriolar material 1 (PCM1) challenge aimed to predict the effect of 38 transgenic human missense mutations in the PCM1 protein implicated in schizophrenia. Participants were provided with 16 benign variants (negative controls), 10 hypomorphic, and 12 loss of function variants. Six groups participated and were asked to predict the probability of effect and the standard deviation associated with each mutation. Here, we present the challenge assessment. Prediction performance was evaluated using different measures to arrive at a final ranking that highlights the strengths and weaknesses of each group. The results show a great variety of predictions, with some methods performing significantly better than others. Benign variants played an important role as negative controls, highlighting predictors biased to identify disease phenotypes. The best predictor, Bromberg lab, used a neural‐network‐based method able to discriminate between neutral and non‐neutral single nucleotide polymorphisms. The CAGI‐5 PCM1 challenge allowed us to evaluate the state‐of‐the‐art techniques for interpreting the effect of novel variants for a difficult target protein.

4.
Predicting the impact of mutations on proteins remains an important problem. As part of the CAGI5 frataxin challenge, we evaluate the accuracy with which Provean, FoldX, and ELASPIC can predict changes in the Gibbs free energy of a protein using a limited data set of eight mutations. We find that different methods have distinct strengths and limitations, with no method being strictly superior to other methods on all metrics. ELASPIC achieves the highest accuracy while also providing a web interface which simplifies the evaluation and analysis of mutations. FoldX is slightly less accurate than ELASPIC but is easier to run locally, as it does not depend on external tools or datasets. Provean achieves reasonable results while being computationally less expensive than the other methods and not requiring a structure of the protein. In addition to methods submitted to the CAGI5 community experiment, and with the aim to inform about other methods with high accuracy, we also evaluate predictions made by Rosetta's ddg_monomer protocol, Rosetta's cartesian_ddg protocol, and thermodynamic integration calculations using the Amber package. ELASPIC still achieves the highest accuracy, while Rosetta's cartesian_ddg protocol appears to perform best in capturing the overall trend in the data.

5.
The NAGLU challenge of the fourth edition of the Critical Assessment of Genome Interpretation experiment (CAGI4) in 2016 invited participants to predict the impact of variants of unknown significance (VUS) on the enzymatic activity of the lysosomal hydrolase α‐N‐acetylglucosaminidase (NAGLU). Deficiencies in NAGLU activity lead to a rare, monogenic, recessive lysosomal storage disorder, Sanfilippo syndrome type B (MPS type IIIB). This challenge attracted 17 submissions from 10 groups. We observed that top models were able to predict the impact of missense mutations on enzymatic activity with Pearson's correlation coefficients of up to .61. We also observed that top methods were significantly more correlated with each other than they were with observed enzymatic activity values, which we believe speaks to the importance of sequence conservation across the different methods. Improved functional predictions on the VUS will help population‐scale analysis of disease epidemiology and rare variant association analysis.
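The Pearson correlation used to score submissions in this abstract can be computed as in the following minimal sketch. The function name `pearson_r` and the sample values are illustrative assumptions, not data or code from the challenge.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted vs. observed relative enzymatic activities.
predicted = [0.1, 0.4, 0.35, 0.8]
observed = [0.2, 0.3, 0.5, 0.9]
print(round(pearson_r(predicted, observed), 3))
```

The same function applied to two sets of predictions, rather than predictions and observations, gives the method-to-method correlation that the abstract compares against.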

6.
7.
The availability of disease‐specific genomic data is critical for developing new computational methods that predict the pathogenicity of human variants and advance the field of precision medicine. However, the lack of gold standards to properly train and benchmark such methods is one of the greatest challenges in the field. In response to this challenge, the scientific community is invited to participate in the Critical Assessment of Genome Interpretation (CAGI), where unpublished disease variants are available for classification by in silico methods. As part of the CAGI‐5 challenge, we evaluated the performance of 18 submissions and three additional methods in predicting the pathogenicity of single nucleotide variants (SNVs) in checkpoint kinase 2 (CHEK2) for cases of breast cancer in Hispanic females. As part of the assessment, the efficacy of the analysis method and the setup of the challenge were also considered. The results indicated that though the challenge could benefit from additional participant data, the combined generalized linear model analysis and odds of pathogenicity analysis provided a framework to evaluate the methods submitted for SNV pathogenicity identification and for comparison to other available methods. The outcome of this challenge and the approaches used can help guide further advancements in identifying SNV‐disease relationships.

8.
Classification of variants of unknown significance is a challenging technical problem in clinical genetics. As up to one‐third of disease‐causing mutations are thought to affect pre‐mRNA splicing, it is important to accurately classify splicing mutations in patient sequencing data. Several consortia and healthcare systems have conducted large‐scale patient sequencing studies, which discover novel variants faster than they can be classified. Here, we compare the advantages and limitations of several high‐throughput splicing assays aimed at mitigating this bottleneck, and describe a data set of ~5,000 variants that we analyzed using our Massively Parallel Splicing Assay (MaPSy). The Critical Assessment of Genome Interpretation group (CAGI) organized a challenge, in which participants submitted machine learning models to predict the splicing effects of variants in this data set. We discuss the winning submission of the challenge (MMSplice), which outperformed existing software. Finally, we highlight methods to overcome the limitations of MaPSy and similar assays, such as tissue‐specific splicing, the effect of surrounding sequence context, classifying intronic variants, synthesizing large exons, and amplifying complex libraries of minigene species. Further development of these assays will greatly benefit the field of clinical genetics, which lacks high‐throughput methods for variant interpretation.

9.
Interpretation of genomic variation plays an essential role in the analysis of cancer and monogenic disease, and increasingly also in complex trait disease, with applications ranging from basic research to clinical decisions. Many computational impact prediction methods have been developed, yet the field lacks a clear consensus on their appropriate use and interpretation. The Critical Assessment of Genome Interpretation (CAGI, /'kā‐jē/) is a community experiment to objectively assess computational methods for predicting the phenotypic impacts of genomic variation. CAGI participants are provided genetic variants and make blind predictions of resulting phenotype. Independent assessors evaluate the predictions by comparing with experimental and clinical data. CAGI has completed five editions with the goals of establishing the state of the art in genome interpretation and of encouraging new methodological developments. This special issue ( https://onlinelibrary.wiley.com/toc/10981004/2019/40/9 ) comprises reports from CAGI, focusing on the fifth edition that culminated in a conference that took place 5 to 7 July 2018. CAGI5 comprised 14 challenges and engaged hundreds of participants from a dozen countries. This edition had a notable increase in splicing and expression regulatory variant challenges, while also continuing challenges on clinical genomics, as well as complex disease datasets and missense variants in diseases ranging from cancer to Pompe disease to schizophrenia. Full information about CAGI is at https://genomeinterpretation.org .

10.
BRCA1 and BRCA2 (BRCA1/2) germline variants disrupting the DNA protective role of these genes increase the risk of hereditary breast and ovarian cancers. Correct identification of these variants then becomes clinically relevant, because it may increase the survival rates of the carriers. Unfortunately, we are still unable to systematically predict the impact of BRCA1/2 variants. In this article, we present a family of in silico predictors that address this problem, using a gene‐specific approach. For each protein, we have developed two tools, aimed at predicting the impact of a variant at two different levels: functional and clinical. Testing their performance in different datasets shows that specific information compensates for the small number of predictive features and the reduced training sets employed to develop our models. When applied to the variants of the BRCA1/2 (ENIGMA) challenge in the fifth Critical Assessment of Genome Interpretation (CAGI 5), we find that these methods, particularly those predicting the functional impact of variants, have a good performance, identifying the large compositional bias towards neutral variants in the CAGI sample. This performance is further improved when estimates of the target variant's impact on splicing are incorporated into our prediction protocol.

11.
With the advent of rapid sequencing technologies, making sense of all the genomic variations that we see among us has been a major challenge. A plethora of algorithms and methods exist that try to address genome interpretation through genotype–phenotype linkage analysis or evaluating the loss of function/stability mutations in protein. Critical Assessment of Genome Interpretation (CAGI) offers an exceptional platform to blind‐test all such algorithms and methods to assess their true ability. We take advantage of this opportunity to explore the use of molecular dynamics simulation as a tool to assess alteration of phenotype, loss of protein function, interaction, and stability. The results show that coarse‐grained dynamics based protein flexibility analysis on 34 CHEK2 and 1719 CALM1 single mutants performs reasonably well for class‐based predictions for phenotype alteration, and two‐thirds of the predicted scores return a correlation coefficient of 0.6 or more. When all‐atom dynamics is used to predict altered stability due to mutations for Frataxin protein (8 cases), the predictions are comparable to the state‐of‐the‐art methods. The competitive performance of our straightforward approach to phenotype interpretation contrasts with heavily trained machine learning approaches, and opens new avenues to rationally improve genome interpretation.

12.
In silico approaches are routinely adopted to predict the effects of genetic variants and their relation to diseases. The Critical Assessment of Genome Interpretation (CAGI) has established a common framework for the assessment of available predictors of variant effects on specific problems, and our group has been an active participant in CAGI since its first edition. In this paper, we summarize our experience and lessons learned from the last edition of the experiment (CAGI‐5). In particular, we analyze prediction performances of our tools on five selected CAGI‐5 challenges grouped into three different categories: prediction of variant effects on protein stability, prediction of variant pathogenicity, and prediction of complex functional effects. For each challenge, we analyze in detail the performance of our tools, highlighting their potential and drawbacks. The aim is to better define the application boundaries of each tool.

13.
We present a computational model for predicting mutational impact on enzymatic activity of human acid α‐glucosidase (GAA), an enzyme associated with Pompe disease. Using a model that combines features specific to GAA with other general evolutionary and physicochemical features, we made blind predictions of enzymatic activity relative to wildtype human GAA for >300 GAA mutants, as part of the Critical Assessment of Genome Interpretation 5 GAA challenge. We found that gene‐specific features can improve the performance of existing impact prediction tools that mostly rely on general features for pathogenicity prediction. The majority of the poorly predicted mutants that lower wildtype GAA enzyme activity occurred on the surface of the GAA protein. We also found that gene‐specific features were uncorrelated with existing methods and provided orthogonal information for interpreting the origin of pathogenicity, particularly in variants that are poorly predicted by existing general methods. Specific variants in GAA, when investigated in the context of its protein structure, suggested gene‐specific information like the disruption of local backbone torsional geometry and disruption of particular sidechain‐sidechain hydrogen bonds as some potential sources for pathogenicity.

14.
15.
The Critical Assessment of Genome Interpretation‐5 intellectual disability challenge asked participants to use computational methods to predict patient clinical phenotypes and the causal variant(s) based on an analysis of their gene panel sequence data. Sequence data for 74 genes associated with intellectual disability (ID) and/or autism spectrum disorders (ASD) from a cohort of 150 patients with a range of neurodevelopmental manifestations (i.e., ID, autism, epilepsy, microcephaly, macrocephaly, hypotonia, ataxia) were made available for this challenge. For each patient, predictors had to report the causative variants and which of the seven phenotypes were present. Since neurodevelopmental disorders are characterized by strong comorbidity, tested individuals often present more than one pathological condition. Considering the overall clinical manifestation of each patient, the correct phenotype was predicted by at least one group for 93 individuals (62%). ID and ASD were the best predicted among the seven phenotypic traits. Also, causative or potentially pathogenic variants were predicted correctly by at least one group. However, the prediction of the correct causative variant seems to be insufficient to predict the correct phenotype. In some cases, the correct prediction was supported by rare or common variants in genes different from the causative one.

16.
Single nucleotide mutations in exonic regions can significantly affect gene function through a disruption of splicing, and various computational methods have been developed to predict the splicing‐related effects of a single nucleotide mutation. We implemented a new method using ensemble learning that combines two types of predictive models: (a) base sequence‐based deep neural networks (DNNs) and (b) machine learning models based on genomic attributes. This method was applied to the Massively Parallel Splicing Assay challenge of the Fifth Critical Assessment of Genome Interpretation, in which challenge participants predicted various experimentally defined exonic splicing mutations, and achieved a promising result. We showed that combining different predictive models based upon the stacked generalization method led to significant improvement in prediction performance. In addition, whereas most of the genomic features adopted in constructing machine learning models were previously reported, feature values generated with DSSP, a DNN‐based splice site prediction tool, were novel and helpful for the prediction. Learning the sequence patterns associated with normal splicing and the change in splicing site probabilities caused by a mutation was presumed to be helpful in predicting splicing disruption.

17.
Accurate interpretation of genomic variants that alter RNA splicing is critical to precision medicine. We present a computational framework, Prediction of variant Effect on Percent Spliced In (PEPSI), that predicts the splicing impact of coding and noncoding variants for the Fifth Critical Assessment of Genome Interpretation (CAGI5) “Vex‐seq” challenge. PEPSI is a random forest regression model trained on multiple layers of features associated with sequence conservation and regulatory sequence elements. Compared to other splicing defect prediction tools from the literature, our framework integrates secondary structure information in predicting variants that disrupt splicing regulatory elements (SREs). We applied our model to classify splice‐disrupting variants among 2,094 single‐nucleotide polymorphisms from the Exome Aggregation Consortium using model‐predicted changes in percent spliced in (ΔPSI) associated with tested variants. Benchmarking our model against widely used state‐of‐the‐art tools, we demonstrate that PEPSI achieves comparable performance in terms of sensitivity and precision. Moreover, we also show that using secondary structure context can help resolve several cases where changes in the counts of SREs do not correspond with the directionality of ΔPSI measured for tested variants.
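The ΔPSI quantity that this abstract uses for classification can be illustrated with a short sketch. It assumes PSI is computed from inclusion and exclusion read counts and expressed as a fraction between 0 and 1; the function names and the 0.1 cutoff are hypothetical choices for illustration and are not taken from PEPSI, which predicts ΔPSI with a trained model rather than from read counts.

```python
def percent_spliced_in(inclusion_reads: int, exclusion_reads: int) -> float:
    """PSI as the fraction of reads supporting exon inclusion (0 to 1)."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("no reads cover this exon")
    return inclusion_reads / total


def delta_psi(ref_inc: int, ref_exc: int, var_inc: int, var_exc: int) -> float:
    """Change in PSI of the variant allele relative to the reference allele."""
    return percent_spliced_in(var_inc, var_exc) - percent_spliced_in(ref_inc, ref_exc)


def is_splice_disrupting(dpsi: float, threshold: float = 0.1) -> bool:
    """Flag a variant when |dPSI| reaches a cutoff (hypothetical value)."""
    return abs(dpsi) >= threshold


# Illustrative read counts only.
dpsi = delta_psi(ref_inc=80, ref_exc=20, var_inc=50, var_exc=50)
print(dpsi, is_splice_disrupting(dpsi))
```

The sign of ΔPSI carries the directionality mentioned in the abstract: negative values indicate reduced exon inclusion for the variant, positive values increased inclusion.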

18.
The growth of publicly available data informing upon genetic variations, mechanisms of disease, and disease subphenotypes offers great potential for personalized medicine. Computational approaches are likely required to assess a large number of novel genetic variants. However, the integration of genetic, structural, and pathophysiological data still represents a challenge for computational predictions and their clinical use. We addressed these issues for alpha‐1‐antitrypsin deficiency, a disease mediated by mutations in the SERPINA1 gene encoding alpha‐1‐antitrypsin. We compiled a comprehensive database of SERPINA1 coding mutations and assigned them apparent pathological relevance based upon available data. “Benign” and “pathogenic” variations were used to assess performance of 31 pathogenicity predictors. Well‐performing algorithms clustered the subset of variants known to be severely pathogenic with high scores. Eight new mutations identified in the ExAC database and achieving high scores were selected for characterization in cell models and showed secretory deficiency and polymer formation, supporting the predictive power of our computational approach. The behavior of the new pathogenic variants and of consistent outliers was rationalized by considering the protein structural context and residue conservation. These findings highlight the potential of computational methods to provide meaningful predictions of the pathogenic significance of novel mutations and identify areas for further investigation.

19.
Thermodynamic stability is a fundamental property shared by all proteins. Changes in stability due to mutation are a widespread molecular mechanism in genetic diseases. Methods for the prediction of mutation‐induced stability change have typically been developed and evaluated on incomplete and/or biased data sets. As part of the Critical Assessment of Genome Interpretation, we explored the utility of high‐throughput variant stability profiling (VSP) assay data as an alternative for the assessment of computational methods and evaluated state‐of‐the‐art predictors against over 7,000 nonsynonymous variants from two proteins. We found that predictions were modestly correlated with actual experimental values. Predictors fared better when evaluated as classifiers of extreme stability effects. With different methods emerging as top performers depending on the metric, it is nontrivial to draw conclusions on their adoption or improvement. Our analyses revealed that only 16% of all variants in VSP assays could be confidently defined as stability‐affecting. Furthermore, it is unclear to what extent VSP abundance scores were reasonable proxies for the stability‐related quantities that participating methods were designed to predict. Overall, our observations underscore the need for clearly defined objectives when developing and using both computational and experimental methods in the context of measuring variant impact.

20.
Recent years have seen a drastic increase in the amount of available genomic sequences. Alongside this explosion, hundreds of computational tools were developed to assess the impact of observed genetic variation. Critical Assessment of Genome Interpretation (CAGI) provides a platform to evaluate the performance of these tools in experimentally relevant contexts. In the CAGI‐5 challenge assessing the 38 missense variants affecting the human Pericentriolar material 1 protein (PCM1), our SNAP‐based submission was the top performer, although it did worse than expected from other evaluations. Here, we compare the CAGI‐5 submissions, and 24 additional commonly used variant effect predictors, to analyze the reasons for this observation. We identified per‐residue conservation, structural, and functional PCM1 characteristics, which may be responsible. As expected, predictors had a hard time distinguishing effect variants in nonconserved positions. They were also better able to call effect variants in a structurally rich region than in a less‐structured one; in the latter, they more often correctly identified benign than effect variants. Curiously, most of the protein was predicted to be functionally robust to mutation, a feature that likely makes it a harder problem for generalized variant effect predictors.
