Similar articles
20 similar articles found.
1.
Protein secondary structure prediction with a neural network.   Total citations: 22, self-citations: 3, citations by others: 22
A method is presented for protein secondary structure prediction based on a neural network. A training phase was used to teach the network to recognize the relation between secondary structure and amino acid sequences on a sample set of 48 proteins of known structure. On a separate test set of 14 proteins of known structure, the method achieved a maximum overall predictive accuracy of 63% for three states: helix, sheet, and coil. A numerical measure of helix and sheet tendency for each residue was obtained from the calculations. When predictions were filtered to include only the strongest 31% of predictions, the predictive accuracy rose to 79%.
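The two figures quoted in item 1, the overall three-state accuracy and the accuracy on the strongest predictions, can be illustrated with a minimal sketch. It assumes one state letter per residue (H, E, C) and one strength score per residue; the function names `q3_accuracy` and `filter_strongest` and the toy data are hypothetical and are not taken from the paper.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state (H, E or C) matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    if len(predicted) == 0:
        raise ValueError("no residues to score")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


def filter_strongest(predicted, observed, scores, fraction):
    """Keep only the top `fraction` of residues, ranked by prediction strength."""
    keep = max(1, round(fraction * len(scores)))
    # Indices of the strongest predictions, restored to sequence order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    ranked.sort()
    return [predicted[i] for i in ranked], [observed[i] for i in ranked]


# Toy data (hypothetical, for illustration only).
predicted = "HHHECCCC"
observed = "HHHHCCEC"
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.5, 0.1, 0.4]

overall = q3_accuracy(predicted, observed)
strong_pred, strong_obs = filter_strongest(predicted, observed, scores, 0.5)
filtered = q3_accuracy(strong_pred, strong_obs)
```

In this toy case the weakest predictions are also the wrong ones, so filtering raises the accuracy. That is the kind of effect the abstract reports, although the real relationship between score and correctness depends on the trained network.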

2.
3.
The strong coupling between secondary and tertiary structure formation in protein folding is neglected in most structure prediction methods. In this work we investigate the extent to which nonlocal interactions in predicted tertiary structures can be used to improve secondary structure prediction. The architecture of a neural network for secondary structure prediction that utilizes multiple sequence alignments was extended to accept low-resolution nonlocal tertiary structure information as an additional input. By using this modified network, together with tertiary structure information from native structures, the Q3-prediction accuracy is increased by 7-10% on average and by up to 35% in individual cases for independent test data. By using tertiary structure information from models generated with the ROSETTA de novo tertiary structure prediction method, the Q3-prediction accuracy is improved by 4-5% on average for small and medium-sized single-domain proteins. Analysis of proteins with particularly large improvements in secondary structure prediction using tertiary structure information provides insight into the feedback from tertiary to secondary structure.

4.
Peptide-binding proteins play key roles in biology, and predicting their binding specificity is a long-standing challenge. While considerable protein structural information is available, the most successful current methods use sequence information alone, in part because it has been a challenge to model the subtle structural changes accompanying sequence substitutions. Protein structure prediction networks such as AlphaFold model sequence-structure relationships very accurately, and we reasoned that if it were possible to specifically train such networks on binding data, more generalizable models could be created. We show that placing a classifier on top of the AlphaFold network and fine-tuning the combined network parameters for both classification and structure prediction accuracy leads to a model with strong generalizable performance on a wide range of Class I and Class II peptide-MHC interactions that approaches the overall performance of the state-of-the-art NetMHCpan sequence-based method. The peptide-MHC optimized model shows excellent performance in distinguishing binding and non-binding peptides to SH3 and PDZ domains. This ability to generalize well beyond the training set far exceeds that of sequence-only models and should be particularly powerful for systems where less experimental data are available.

Sequence-based methods utilize large sets of experimentally validated binding and non-binding peptides to assemble position specific weight matrices or more sophisticated neural networks with several layers to discriminate binder from non-binder peptides (17). Methods such as NetMHCpan are the current state of the art to address key biological challenges like major histocompatibility complex (MHC)-peptide-binding specificity which is central to the adaptive immune system (T cell surveillance, differentiation, etc.), since they can readily optimize parameters over large sets of binding and non-binding peptides. However, sequence-based methods are limited by their inability to incorporate detailed structural information, and as a result, they have reduced generalizability, particularly in cases where there is less training data. While structure-based methods have shown promise to fill this “gap”, they have been limited by their inability to accurately predict protein and peptide backbone changes which can affect both affinity and specificity, and more importantly, they lack a way to optimize many model parameters on the large amounts of peptide-binding data that are often available (8).

5.
Our objective was to accurately predict, from complex mutation patterns, human immunodeficiency virus type 1 resistance to the protease inhibitor lopinavir, by use of artificial intelligence. Two neural network models were constructed: one based on changes at 11 positions in the protease that were previously recognized as being significant for lopinavir resistance and another based on a newly derived set of 28 mutations that were identified by performing category prevalence analysis. Both models were trained, validated, and tested with 1322 clinical samples. A procedure for determining the optimal neural network parameters was proposed to speed up the training processes. The results suggested that the 28-mutation set was a more accurate predictor of lopinavir susceptibility (correlation coefficient, R² = 0.88). We identified potentially significant new mutations associated with lopinavir resistance and demonstrated the utility of neural network models in predicting phenotypic susceptibility from complex genotypes.

6.
Mutant Escherichia coli strains in which export of the LamB protein (coded for by the lamB gene) to the outer membrane of the cell is prevented have been described previously. One of these mutant strains contains a small (12-base pair) deletion mutation within the region of the lamB gene that codes for the NH2-terminal signal sequence. In this mutant strain, export but not synthesis of the LamB protein is blocked. We have isolated pseudorevertants that restore export of functional LamB protein to the outer membrane. DNA sequence analysis showed that two of the revertants contain a point mutation in addition to the original deletion. These point mutations lead to amino acid substitutions within the signal sequence. Our results indicate that these secondary mutations efficiently suppress the export defect caused by the deletion mutation. Analysis of the secondary structure of the wild-type, mutant, and pseudorevertant LamB signal sequences suggests that the secondary mutations restore export by allowing the formation of a stable alpha-helical conformation in the central, hydrophobic region of the signal sequence.

7.
Ribosomal 30S protein S1 causes disruption of the secondary structure of certain pyrimidine-containing polynucleotides. Helical poly(U), poly(C, U), and neutral and acidic poly(C) are stoichiometrically converted by S1 to structures indistinguishable from their partially or completely thermally denatured forms, as revealed by circular dichroism. Of the several double- and triple-stranded helical polynucleotides tested that contain one polypurine strand and at least one polypyrimidine strand, only the conformation of the DNA·RNA hybrid, poly(A)-poly(dT), is perturbed. In the presence of S1, this hybrid undergoes a transition to a new structure that has a circular dichroism spectrum unlike either the native or thermally denatured forms. Intercalated ethidium bromide is released from poly(A)-poly(dT) by S1, confirming the occurrence of a conformational rearrangement. The translation inhibitor, aurintricarboxylic acid, completely inhibits the action of S1 on polypyrimidines, but has no effect on the conformational perturbation of poly(A)-poly(dT). The possible relation between these observations and the biological function of protein S1 is discussed.

8.
In this study, we estimate the statistical significance of structure prediction by threading. We introduce a single parameter epsilon that serves as a universal measure determining the probability that the best alignment is indeed a native-like analog. Parameter epsilon takes into account both length and composition of the query sequence and the number of decoys in threading simulation. It can be computed directly from the query sequence and potential of interactions, eliminating the need for sequence reshuffling and realignment. Although our theoretical analysis is general, here we compare its predictions with the results of gapless threading. Finally we estimate the number of decoys from which the native structure can be found by existing potentials of interactions. We discuss how this analysis can be extended to determine the optimal gap penalties for any sequence-structure alignment (threading) method, thus optimizing it to maximum possible performance.

9.
This paper explores the knowledge of linguistic structure learned by large artificial neural networks, trained via self-supervision, whereby the model simply tries to predict a masked word in a given context. Human language communication is via sequences of words, but language understanding requires constructing rich hierarchical structures that are never observed explicitly. The mechanisms for this have been a prime mystery of human language acquisition, while engineering work has mainly proceeded by supervised learning on treebanks of sentences hand labeled for this latent structure. However, we demonstrate that modern deep contextual language models learn major aspects of this structure, without any explicit supervision. We develop methods for identifying linguistic hierarchical structure emergent in artificial neural networks and demonstrate that components in these models focus on syntactic grammatical relationships and anaphoric coreference. Indeed, we show that a linear transformation of learned embeddings in these models captures parse tree distances to a surprising degree, allowing approximate reconstruction of the sentence tree structures normally assumed by linguists. These results help explain why these models have brought such large improvements across many language-understanding tasks.

10.
BACKGROUND: Atherosclerosis is a complex histopathologic process that is analogous to chronic inflammatory conditions. Several factors have been shown to correlate with the extent of atherosclerosis. Whereas hypertension, obesity, hyperlipidemia, diabetes, smoking, and family history are all well documented, recent literature points to additional associated factors. Thus, antibodies to oxidized low-density lipoprotein (oxLDL), cytomegalovirus (CMV), Chlamydia pneumoniae, Helicobacter pylori, as well as homocysteine and C-reactive protein (CRP) levels have all been implicated as independent markers of accelerated atherosclerosis. HYPOTHESIS: In the current study we attempted to formulate a system by which to predict the extent of coronary atherosclerosis as assessed by angiographic vessel occlusion. METHODS: The 81 patients were categorized as having single-, double-, triple-, or no vessel involvement. The clinical data concerning the "classic" risk factors were obtained from clinical records, and sera were drawn from the patients for determination of the various parameters that are thought to be associated with atherosclerosis. RESULTS: Using four artificial neural networks, we found that the most effective parameters predictive of coronary vessel involvement were (in decreasing order of importance) antibodies to oxLDL, to cardiolipin, to CMV, to Chlamydia pneumoniae, and to β2-glycoprotein I (β2GPI). Although important in the prediction of vessel occlusion, hyperlipidemia, hypertension, CRP levels, and diabetes were less accurate. CONCLUSION: The results of the current study, if reproduced in a larger population, may establish an integrated system based on the creation of artificial neural networks by which to predict the extent of atherosclerosis in a given subject fairly and noninvasively.

11.
The nucleotide sequence of two recombinant plasmids containing hamster vimentin cDNA was determined. The sequence comprises 1,640 base pairs and reveals virtually the total primary structure of vimentin and a large part of the 3' noncoding region. Secondary structure prediction methods allow the characterization of two distinct regions of the polypeptide chain, 135 and 145 residues long, which are able to form alpha helices organized in "coiled coils." Three nonhelical domains can be distinguished: a very basic NH2-terminal domain of at least 67 residues, a nonhelical region of 45 amino acids separating the two helix domains, and a COOH-terminal region of 55 residues, which contains an excess of acidic amino acids. The meaning of each of these domains of the vimentin polypeptide for the subunit and filament formation is discussed.

12.
It is shown that most present empirical prediction algorithms provide information about the conformational states of individual residues, but give little information about the three-dimensional structure of a protein. It is necessary to predict the conformational state of every residue before the resulting structure can serve as a starting conformation to compute the native structure. It is also shown that even a perfect five-state algorithm (which does not include long-range interactions from disulfide loop closing or solvation) will not lead to a globular structure resembling the native one. However, starting from the results of a perfect prediction algorithm, it appears that conformational energy minimization (with long-range interactions included) can lead to a structure having the general features of the native protein.

13.
Sequential state generation by model neural networks.   Total citations: 3, self-citations: 3, citations by others: 3
Sequential patterns of neural output activity form the basis of many biological processes, such as the cyclic pattern of outputs that control locomotion. I show how such sequences can be generated by a class of model neural networks that make defined sets of transitions between selected memory states. Sequence-generating networks depend upon the interplay between two sets of synaptic connections. One set acts to stabilize the network in its current memory state, while the second set, whose action is delayed in time, causes the network to make specified transitions between the memories. The dynamic properties of these networks are described in terms of motion along an energy surface. The performance of the networks, both with intact connections and with noisy or missing connections, is illustrated by numerical examples. In addition, I present a scheme for the recognition of externally generated sequences by these networks.

14.
With the advent of molecular cloning methods, the amino acid sequences for a number of membrane proteins have been determined. The relative paucity of detailed three-dimensional structural information available for these molecules has led to attempts to predict the secondary structures of membrane proteins based on folding motifs found in soluble proteins of known three-dimensional structure and sequence. In this study, we evaluated the accuracy of several of these methods in predicting the conformation of 15 integral membrane proteins and membrane-spanning polypeptides for which both primary and secondary structural information are available. χ² analyses indicated a less than 0.5% correlation between the net predicted secondary structures and the experimental results. A more stringent test of the accuracy of the methods, the index of prediction, was calculated for individual residues in four of the polypeptides for which the crystal structures were known; this criterion also indicated that the predicted assignments for the secondary structures of the residues were inaccurate. Thus, prediction schemes using soluble protein bases appear to be inappropriate for the prediction of membrane protein folding.

15.
Artificial neural networks (ANN) are promising tools for learning the complex interplay of factors affecting a particular outcome. We performed this study to compare the predictive power of ANN and conventional methods in the prediction of bone mineral density (BMD) in Iranian post-menopausal women. A database of 10 input variables from 2158 participants was randomly divided into training (1400), validation (150) and test (608) groups. Multivariate linear regression and ANN models were developed and validated on the training and validation sets, and outcomes (femoral neck and lumbar T-scores) were predicted and compared on the test group using different numbers of input variables. Results were evaluated by comparing the mean square of differences between predicted and reference values (non-central chi-square test) and by measuring the area under the receiver operating characteristic curve (AUROC) around a cut-off value of -2.5 for T-scores. For models with fewer than 3 input variables in the femoral neck and 4 variables in the spinal column, the performance of regression and ANN models was almost the same. As more variables were imported into the models, ANN outperformed linear regression models. AUROC varied in 2 to 10 variable models as follows: for ANN in spine, from 0.709 to 0.774; linear models in spine, from 0.709 to 0.744; ANN in femoral neck, from 0.801 to 0.867; linear models in femoral neck, from 0.799 to 0.834. The ANN model performed better than five established patient selection tools in the test group. The superior performance of neural networks over linear models demonstrates their advantage, especially in mass screening applications, where even a slight enhancement in performance results in a significant decrease in the number of misclassifications.

16.

Purpose

Detection of breast cancer at an early stage increases patient survival. Mass spectrometry-based protein analysis of serum samples is a promising approach to obtain biomarker profiles for early detection. A combination of commonly applied solid-phase extraction procedures for clean-up may increase the number of detectable peptides and proteins. In this study, we have evaluated whether the classification performance of breast cancer profiles improves by using two serum workup procedures.

Methods

Serum samples from 105 breast cancer patients and 202 healthy volunteers were processed according to a standardized protocol implemented on a high-end liquid-handling robot. Peptide and protein enrichments were carried out using weak-cation exchange (WCX) and reversed-phase (RP) C18 magnetic beads. Profiles were acquired on a matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometer. In this way, two different biomarker profiles were obtained for each serum sample, yielding a WCX- and RPC18-dataset.

Results

The profiles were statistically evaluated with double cross-validation. Classification results of WCX- and RPC18-datasets were determined for each set separately and for the combination of both sets. Sensitivity and specificity were 82% and 87% (WCX) and 73% and 93% (RPC18) for the individual workup procedures. These values increased up to 84% and 95%, respectively, upon combining the data.

Conclusion

It was found that MALDI-TOF peptide and protein profiles can be used for classification of breast cancer with high sensitivity and specificity. The classification performance even improved when two workup procedures were applied, since these provide a greater number of features (proteins).
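For readers less familiar with the two measures reported in item 16, the following minimal sketch shows how sensitivity and specificity are computed from true and predicted class labels. The function name `sensitivity_specificity`, the 1/0 label encoding, and the toy labels are assumptions made for illustration; they are not the study's data or code.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels: 1 = patient, 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (hypothetical): 4 patients followed by 6 controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Sensitivity is the fraction of patients correctly flagged and specificity the fraction of controls correctly cleared, which is why the abstract reports them as a pair for each workup procedure.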

17.
We designed a single-chain variant of the Arc repressor homodimer in which the beta strands that contact operator DNA are connected by a hairpin turn and the alpha helices that form the tetrahelical scaffold of the dimer are attached by a short linker. The designed protein represents a noncyclic permutation of secondary structural elements in another single-chain Arc molecule (Arc-L1-Arc), in which the two subunits are fused by a single linker. The permuted protein binds operator DNA with nanomolar affinity, refolds on the sub-millisecond time scale, and is as stable as Arc-L1-Arc. The crystal structure of the permuted protein reveals an essentially wild-type fold, demonstrating that crucial folding information is not encoded in the wild-type order of secondary structure. Noncyclic rearrangement of secondary structure may allow grouping of critical active-site residues in other proteins and could be a useful tool for protein design and minimization.

18.
The search for amino acid sequence homologies can be a powerful tool for predicting protein structure. Discovered sequence homologies are currently used in predicting the function of oncogene proteins. To sharpen this tool, we investigated the structural significance of short sequence homologies by searching proteins of known three-dimensional structure for subsequence identities. In 62 proteins with 10,000 residues, we found that the longest isolated homologies between unrelated proteins are five residues long. In 6 (out of 25) cases we saw surprising structural adaptability: the same five residues are part of an alpha-helix in one protein and part of a beta-strand in another protein. These examples show quantitatively that pentapeptide structure within a protein is strongly dependent on sequence context, a fact essentially ignored in most protein structure prediction methods: just considering the local sequence of five residues is not sufficient to predict correctly the local conformation (secondary structure). Cooperativity of length six or longer must be taken into account. Also, we are warned that in the growing practice of comparing a new protein sequence with a data base of known sequences, finding an identical pentapeptide sequence between two proteins is not a significant indication of structural similarity or of evolutionary kinship.
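The search for subsequence identities described in item 18 can be sketched as a k-mer lookup between two sequences. This is only an illustration under stated assumptions: the function `shared_kmers` and the two short sequences are hypothetical, and the sketch reports every identical pentapeptide without checking that the flanking residues differ, so it does not reproduce the paper's notion of an isolated homology.

```python
def shared_kmers(seq_a, seq_b, k=5):
    """Map each k-mer found in both sequences to (positions in seq_a, positions in seq_b)."""
    # Index every k-mer of the first sequence by its start positions.
    index = {}
    for i in range(len(seq_a) - k + 1):
        index.setdefault(seq_a[i:i + k], []).append(i)

    # Scan the second sequence and record matches against the index.
    hits = {}
    for j in range(len(seq_b) - k + 1):
        kmer = seq_b[j:j + k]
        if kmer in index:
            hits.setdefault(kmer, (index[kmer], []))[1].append(j)
    return hits


# Hypothetical toy sequences in one-letter amino acid code.
hits = shared_kmers("MKVLAAGIVK", "TTVLAAGPQ", k=5)
```

A hit from such a search gives only the positions of the shared residues. As the abstract stresses, the same pentapeptide can adopt different secondary structures in the two proteins, so a hit by itself says little about structural similarity.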

19.
Circular dichroism (CD) spectroscopy is a widely used technique for the study of protein structure. Numerous algorithms have been developed for the estimation of the secondary structure composition from the CD spectra. These methods often fail to provide acceptable results on α/β-mixed or β-structure–rich proteins. The problem arises from the spectral diversity of β-structures, which has hitherto been considered as an intrinsic limitation of the technique. The predictions are less reliable for proteins of unusual β-structures such as membrane proteins, protein aggregates, and amyloid fibrils. Here, we show that the parallel/antiparallel orientation and the twisting of the β-sheets account for the observed spectral diversity. We have developed a method called β-structure selection (BeStSel) for the secondary structure estimation that takes into account the twist of β-structures. This method can reliably distinguish parallel and antiparallel β-sheets and accurately estimates the secondary structure for a broad range of proteins. Moreover, the secondary structure components applied by the method are characteristic to the protein fold, and thus the fold can be predicted to the level of topology in the CATH classification from a single CD spectrum. By constructing a web server, we offer a general tool for a quick and reliable structure analysis using conventional CD or synchrotron radiation CD (SRCD) spectroscopy for the protein science research community. The method is especially useful when X-ray or NMR techniques fail. Using BeStSel on data collected by SRCD spectroscopy, we investigated the structure of amyloid fibrils of various disease-related proteins and peptides.

Optically active macromolecules, such as proteins, exhibit differential absorption of circular polarized light. The far-UV circular dichroism (CD) spectroscopy of proteins and peptides (180–250 nm) is predominantly based on the excitation of electronic transitions in amide groups. The peptide backbone forms characteristic secondary structures such as α-helices, β-pleated sheets, turns, and disordered sections with specific Φ, Ψ dihedral angles and H-bond patterns affecting the CD spectrum (1). CD has been exploited for protein folding and stability assays, intermolecular interactions, and ligand binding studies, and has recently been applied in the investigations of protein disorder (2, 3). Synchrotron radiation CD (SRCD) spectroscopy is an emerging technique complementary to small-angle X-ray scattering or infrared spectroscopy, synergistic to biochemical and biophysical assays characterizing the protein folding state. SRCD extends the limits of conventional CD spectroscopy by broadening the spectral range, increasing the signal-to-noise ratio, and accelerating the data acquisition, in the presence of absorbing components (buffers, salts, etc.) (4). Additionally, SRCD has the capability of time-resolved and stopped-flow measurements as well as high-throughput screening (3).

Quantitative analysis of CD spectra allows the prediction of the protein secondary structure content. In the past decades, a multitude of enhanced algorithms, based on variable selection, or singular value decomposition of standardized, scaled, and calibrated reference spectra, have been proposed to predict the secondary structure content, with good overall secondary structure prediction (5, 6). Validated reference spectra are nowadays available and collected in a publicly accessible Protein Circular Dichroism Data Bank (PCDDB) (7). The most populated SP175 reference dataset (8) currently available from PCDDB does not yet fully cover the fold space compared with the X-ray structures presented in Protein Data Bank (PDB) (9) (Fig. 1C). As a consequence, the prediction of β-sheet–rich proteins has proven to be difficult and biased due to their spectral variety and lower spectral amplitudes (Fig. 1) (5). This is assumed to be an intrinsic limitation of CD spectroscopy (11). Our goal has been to improve the accuracy and to increase the information content of the secondary structure prediction.

Fig. 1. SRCD spectra of α-helical and β-sheet–rich proteins and the secondary structure and protein fold occurrence in the PDB and in the SP175 CD dataset. The CD spectra of proteins containing ∼50% α-helix (A) are similar to one another, whereas proteins having 50% β-sheet with negligible α-helical content (B) show spectral properties diverse in amplitude, number, and positions of components. (C) The secondary structure composition of proteins in the PDB and in the SP175 reference set. Although the average secondary structure composition is similar in the two, the representation of protein folds is limited in SP175, based on the CATH topology category presented at the bottom (10).

In the absence of high-resolution structures, CD is regarded as the method of choice, providing structural information of proteins in solution. Crystallization failure or the sheer size of macromolecules are the drawbacks for structure determination by X-ray crystallography or solution NMR spectroscopy, respectively. Examples are β-sheet–rich membrane proteins, protein aggregates, and amyloid fibrils. CD spectral results are cited more often each year in biophysical and macromolecular structure publications, but are too often limited to qualitative spectral differences and comparisons, lacking quantitative evaluation. SRCD should increase the information content, therefore improving quantitative secondary structure predictions (3, 12). For a reliable quantitative analysis of conventional CD and SRCD spectra, a suitable algorithm should correlate the spectral information to the complete fold space. This algorithm accurately predicts the secondary structure content and elucidates the folding pattern.

Protein aggregates play a central role in several degenerative disorders including amyloidosis in the central nervous system, observed for Alzheimer’s and Parkinson’s diseases. In vitro as in vivo, proteins can form different aggregates of various sizes and morphologies (amorphous aggregates, oligomers, protofibrils, amyloid fibrils), which have distinct physiological effects depending on the environmental conditions (13, 14). Prediction of these β-sheet–rich structures has so far been controversial. The lack of calibrated and standardized CD reference spectra of such proteins has so far resulted in a misestimation of α-helical content due to the influence of strong spectral amplitudes (Table S1 and Fig. S1). Secondary structure information is essential to understand the molecular mechanisms of self-assembly and the pathophysiological effects of these aggregates.

Here, we present a novel algorithm, β-structure selection (BeStSel), which reliably distinguishes parallel from antiparallel β-sheets by CD spectroscopy. We show that the twisting of the β-sheets has a strong influence on the CD spectrum. By taking into account the twisting angles between β-strands, our algorithm improves secondary structure prediction in general, and specifically for β-structure–rich proteins and amyloid fibrils. For the first time (to our knowledge), the increased information content obtained from the CD spectra makes protein fold prediction possible down to the topology level, in terms of the CATH protein structure classification (10).

20.
A lattice model for protein structure prediction at low resolution.   Total citations: 11, self-citations: 7, citations by others: 11
The prediction of the folded structure of a protein from its sequence has proven to be a very difficult computational problem. We have developed an exceptionally simple representation of a polypeptide chain, with which we can enumerate all possible backbone conformations of small proteins. A protein is represented by a self-avoiding path of connected vertices on a tetrahedral lattice, with several amino acid residues assigned to each lattice vertex. For five small structurally dissimilar proteins, we find that we can separate native-like structures from the vast majority of non-native folds by using only simple structural and energetic criteria. This method demonstrates significant generality and predictive power without requiring foreknowledge of any native structural details.
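The exhaustive enumeration of self-avoiding paths mentioned in item 20 can be sketched as a depth-first search. The sketch assumes that the tetrahedral lattice is the diamond lattice, in which each vertex has four neighbours and the two sublattices use opposite bond vectors; the names `EVEN_BONDS`, `neighbors`, and `enumerate_walks` are hypothetical. It covers only the path enumeration, not the residue assignment or the structural and energetic criteria of the paper.

```python
# Bond vectors leaving a vertex on the "even" sublattice of the diamond lattice.
# Vertices on the "odd" sublattice use the negated vectors.
EVEN_BONDS = ((1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1))


def neighbors(site):
    """Return the four lattice neighbours of a vertex."""
    x, y, z = site
    # Every bond changes x by +1 or -1, so the parity of x identifies the sublattice.
    sign = 1 if x % 2 == 0 else -1
    return [(x + sign * dx, y + sign * dy, z + sign * dz) for dx, dy, dz in EVEN_BONDS]


def enumerate_walks(n_steps):
    """List all self-avoiding paths of n_steps bonds that start at the origin."""
    start = (0, 0, 0)
    walks = []

    def extend(path, visited):
        if len(path) == n_steps + 1:
            walks.append(tuple(path))
            return
        for nxt in neighbors(path[-1]):
            if nxt not in visited:  # self-avoidance: never revisit a vertex
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    extend([start], {start})
    return walks


walk_counts = [len(enumerate_walks(n)) for n in range(1, 6)]
```

The number of paths roughly triples with every added bond, which helps explain why the abstract restricts the enumeration to small proteins and assigns several residues to each vertex.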
