Similar literature
1.
Protein secondary structure prediction with a neural network.   Total citations: 25 (self-citations: 3; citations by others: 22)
A method is presented for protein secondary structure prediction based on a neural network. A training phase was used to teach the network to recognize the relation between secondary structure and amino acid sequences on a sample set of 48 proteins of known structure. On a separate test set of 14 proteins of known structure, the method achieved a maximum overall predictive accuracy of 63% for three states: helix, sheet, and coil. A numerical measure of helix and sheet tendency for each residue was obtained from the calculations. When predictions were filtered to include only the strongest 31% of predictions, the predictive accuracy rose to 79%.
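
The trade-off between coverage and accuracy described above can be sketched as follows. This is a minimal illustration rather than the paper's network: the toy scores, the helper names `filter_strongest` and `accuracy`, and the choice of "strength" as the gap between the two highest state scores are all assumptions made for the example.

```python
def filter_strongest(predictions, fraction):
    """Keep the given fraction of predictions with the largest confidence margin.

    predictions: list of (scores, true_state) pairs, where scores maps each of
    the three states ("H" helix, "E" sheet, "C" coil) to a network output.
    """
    def margin(scores):
        # Assumed strength measure: best score minus second-best score.
        top, second = sorted(scores.values(), reverse=True)[:2]
        return top - second

    ranked = sorted(predictions, key=lambda p: margin(p[0]), reverse=True)
    keep = max(1, round(fraction * len(ranked)))
    return ranked[:keep]


def accuracy(predictions):
    """Fraction of residues whose highest-scoring state equals the true state."""
    correct = sum(1 for scores, truth in predictions
                  if max(scores, key=scores.get) == truth)
    return correct / len(predictions)


# Hypothetical toy data: (network outputs, observed state) per residue.
toy = [
    ({"H": 0.90, "E": 0.05, "C": 0.05}, "H"),
    ({"H": 0.10, "E": 0.80, "C": 0.10}, "E"),
    ({"H": 0.40, "E": 0.35, "C": 0.25}, "C"),
    ({"H": 0.30, "E": 0.36, "C": 0.34}, "C"),
]

overall = accuracy(toy)                           # all residues
confident = accuracy(filter_strongest(toy, 0.5))  # strongest half only
```

In this toy set the weakly separated predictions are the wrong ones, so restricting the evaluation to the strongest predictions raises the measured accuracy at the cost of covering fewer residues.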

2.
Recently developed methods have shown considerable promise in predicting residue–residue contacts in protein 3D structures using evolutionary covariance information. However, these methods require large numbers of evolutionarily related sequences to robustly assess the extent of residue covariation, and the larger the protein family, the more likely that contact information is unnecessary because a reasonable model can be built based on the structure of a homolog. Here we describe a method that integrates sequence coevolution and structural context information using a pseudolikelihood approach, allowing more accurate contact predictions from fewer homologous sequences. We rigorously assess the utility of predicted contacts for protein structure prediction using large and representative sequence and structure databases from recent structure prediction experiments. We find that contact predictions are likely to be accurate when the number of aligned sequences (with sequence redundancy reduced to 90%) is greater than five times the length of the protein, and that accurate predictions are likely to be useful for structure modeling if the aligned sequences are more similar to the protein of interest than to the closest homolog of known structure. These conditions are currently met by 422 of the protein families collected in the Pfam database.

There has been long-standing interest in the prediction of residue–residue contacts based on the covariance of residue-substitution patterns in multiple aligned sequences (1). For many years, these methods met with relatively little success, but with the increase in the number of known protein sequences and improvements in methods, such approaches have recently demonstrated considerable promise. The methodological improvements distinguish between direct couplings and indirect correlations that arise from chains of the direct couplings (i.e., if A is coupled to B, and B to C, one might erroneously conclude A is coupled to C). Two recent methods, Direct Coupling Analysis (DCA) and Protein Sparse InverseCOVariance (PSICOV) (2, 3), achieve this separation by inverting a residue–residue covariance matrix.

In parallel with the growth of the sequence databases, there has been a considerable increase in the number of known structures over the past decade. Comparative modeling methods, which predict protein structure based on homologs of known structures, have become increasingly powerful, and can generate models more accurate than those produced by de novo modeling in most cases. Thus, the growth in the databases over the last decade represents something of a catch-22 for contact prediction: there are many more sequences, so such predictions can be made much more accurately, but there are few protein families with the many sequences required for accurate contact prediction whose structures cannot be modeled relatively accurately using comparative modeling methods.

In this paper we begin by examining the approximations involved in the residue–residue covariation matrix inversion used by PSICOV and DCA, and show that more accurate contact predictions can be obtained using fewer sequences by going beyond the second-order approximation to the underlying distribution implicit in both methods. We then rigorously assess the utility and limits of contact prediction for protein tertiary structure modeling by evaluating the extent to which predicted contacts can contribute to modeling in the presence of the homologous structure information likely to be available.
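
The A–B–C example of direct versus indirect coupling, and why inverting a covariance matrix separates the two, can be illustrated with a three-variable toy model. This is only a sketch under stated assumptions: it uses a Gaussian model, for which the covariance matrix is exactly the inverse of the coupling (precision) matrix, whereas DCA and PSICOV operate on categorical alignment data and add regularization. The `invert` helper and the coupling values are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [M | I] becomes [I | M^-1].
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical couplings for positions A, B, C: A-B and B-C are directly
# coupled, A-C is not (zero off-diagonal entry).
couplings = [
    [1.0, -0.5, 0.0],
    [-0.5, 1.0, -0.5],
    [0.0, -0.5, 1.0],
]

# Covariance implied by these couplings in a Gaussian model.
covariance = invert(couplings)

# Inverting the covariance recovers the couplings, including the zero for A-C.
recovered = invert(covariance)

print("covariance A-C:", covariance[0][2])
print("recovered coupling A-C:", recovered[0][2])
```

The A–C covariance is nonzero even though A and C are not directly coupled, which is the indirect correlation described above; the inverse of the covariance matrix restores a (numerically) zero A–C entry.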

3.
Automated de novo prediction of native-like RNA tertiary structures   Total citations: 2 (self-citations: 0; citations by others: 2)
RNA tertiary structure prediction has been based almost entirely on base-pairing constraints derived from phylogenetic covariation analysis. We describe here a complementary approach, inspired by the Rosetta low-resolution protein structure prediction method, that seeks the lowest energy tertiary structure for a given RNA sequence without using evolutionary information. In a benchmark test of 20 RNA sequences with known structure and lengths of approximately 30 nt, the new method reproduces better than 90% of Watson-Crick base pairs, comparable with the accuracy of secondary structure prediction methods. In more than half the cases, at least one of the top five models agrees with the native structure to better than 4 Å rmsd over the backbone. Most importantly, the method recapitulates more than one-third of non-Watson-Crick base pairs seen in the native structures. Tandem stacks of "sheared" base pairs, base triplets, and pseudoknots are among the noncanonical features reproduced in the models. In the cases in which none of the top five models were native-like, higher energy conformations similar to the native structures are still sampled frequently but not assigned low energies. These results suggest that modest improvements in the energy function, together with the incorporation of information from phylogenetic covariance, may allow confident and accurate structure prediction for larger and more complex RNA chains.

4.
We introduce a method for discriminating correctly folded proteins from well designed decoy structures using atom-atom and atom-solvent contact surfaces. The measure used to quantify contact surfaces integrates the solvent accessible surface and interatomic contacts into one quantity, allowing solvent to be treated as an atom contact. A scoring function was derived from statistical contact preferences within known protein structures and validated by using established protein decoy sets, including the "Rosetta" decoys and data from the CASP4 structure predictions. The scoring function effectively distinguished native structures from all corresponding decoys in >90% of the cases, using isolated protein subunits as target structures. If contacts between subunits within quaternary structures are included, the accuracy increases to 97%. Interactions beyond atom-atom contact range were not required to distinguish native structures from the decoys using this method. The contact scoring performed as well as or better than existing statistical and physicochemical potentials and may be applied as an independent means of evaluating putative structural models.

5.
The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein’s behavior and properties. We present a supervised deep learning framework to learn the sequence–function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network’s internal representation affects its ability to learn the sequence–function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find that networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks’ ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models’ ability to navigate sequence space and design new proteins beyond the training set. We applied the protein G B1 domain (GB1) models to design a sequence that binds to immunoglobulin G with substantially higher affinity than wild-type GB1.

Understanding the mapping from protein sequence to function is important for describing natural evolutionary processes, diagnosing genetic disease, and designing new proteins with useful properties. This mapping is shaped by thousands of intricate molecular interactions, dynamic conformational ensembles, and nonlinear relationships between biophysical properties. These highly complex features make it challenging to model and predict how changes in amino acid sequence affect function.

The volume of protein data has exploded over the last decade with advances in DNA sequencing, three-dimensional structure determination, and high-throughput screening. With these increasing data, statistics and machine learning approaches have emerged as powerful methods to understand the complex mapping from protein sequence to function. Unsupervised learning methods such as EVmutation (1) and DeepSequence (2) are trained on large alignments of evolutionarily related protein sequences. These methods can model a protein family’s native function, but they are not capable of predicting specific protein properties that were not subject to long-term evolutionary selection. In contrast, supervised methods learn the mapping to a specific protein property directly from sequence–function examples. Many prior supervised learning approaches have limitations, such as the inability to capture nonlinear interactions (3, 4), poor scalability to large datasets (5), making predictions only for single-mutation variants (6), or a lack of available code (7). Other learning methods leverage multiple sequence alignments and databases of annotated genetic variants to make qualitative predictions about a mutation’s effect on organismal fitness or disease, rather than making quantitative predictions of molecular phenotype (8–10). There is a current need for general, easy to use supervised learning methods that can leverage large sequence–function datasets to predict specific molecular phenotypes with the high accuracy required for protein design. We address this need with a usable software framework that can be readily adopted by others for new proteins (11).

We present a deep learning framework to learn protein sequence–function relationships from large-scale data generated by deep mutational scanning experiments. We train supervised neural networks to learn the mapping from sequence to function. These trained networks can then generalize to predict the functions of previously unseen sequences. We examine network architectures with different representational capabilities including linear regression, nonlinear fully connected networks, and convolutional networks that share parameters. Our supervised modeling approach displays strong predictive accuracy on five diverse deep mutational scanning datasets and compares favorably with state-of-the-art physics-based and unsupervised prediction methods. Across the different architectures tested, we find that networks that capture nonlinear interactions and share information across sequence positions display the greatest predictive performance. We explore what our neural network models have learned about proteins and how they comprehend the sequence–function mapping. The convolutional neural networks learn a protein sequence representation that organizes sequences according to their structural and functional differences. In addition, the importance of input sequence features displays a strong correspondence to the protein’s three-dimensional structure and known key residues. Finally, we used an ensemble of the supervised learning models to design five protein G B1 domain (GB1) sequences with varying distances from the wild type. We experimentally characterized these sequences and found the top design binds to immunoglobulin G (IgG) with at least an order of magnitude higher affinity than wild-type GB1.
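
The limitation of purely linear models mentioned above (no nonlinear interactions between positions) can be made concrete with a small sketch. It assumes a one-hot sequence encoding and an additive per-position model; the abstract does not specify the encoding used by the framework, and the sequences, weights, and function names here are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Flatten a sequence into a binary vector of length 20 * len(sequence)."""
    vec = []
    for residue in sequence:
        vec.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vec


def linear_score(sequence, weights, bias=0.0):
    """Additive model: bias plus one weight per (position, amino acid) pair."""
    return bias + sum(w * x for w, x in zip(weights, one_hot(sequence)))


# Hypothetical 3-residue wild type "MKV" and made-up mutation effects.
weights = [0.0] * (len(AMINO_ACIDS) * 3)
weights[0 * 20 + AMINO_ACIDS.index("A")] = -1.0  # effect of M1A
weights[2 * 20 + AMINO_ACIDS.index("L")] = 0.5   # effect of V3L

single_a = linear_score("AKV", weights)
single_l = linear_score("MKL", weights)
double = linear_score("AKL", weights)
```

Because the model is additive, the predicted effect of the double mutant is always the sum of the two single-mutant effects, so any interaction between the two positions is invisible to it. Capturing such interactions requires the nonlinear architectures the abstract describes.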

6.
We examine the interactions between amino acid residues in the context of their secondary structural environments (helix, strand, and coil) in proteins. Effective contact energies for an expanded 60-residue alphabet (20 aa × three secondary structural states) are estimated from the residue-residue contacts observed in known protein structures. Similar to the prototypical contact energies for 20 aa, the newly derived energy parameters reflect mainly the hydrophobic interactions; however, the relative strength of such interactions shows a strong dependence on the secondary structural environment, with nonlocal interactions in beta-sheet structures and alpha-helical structures dominating the energy table. Environment-dependent residue contact energies outperform existing residue pair potentials in both threading and three-dimensional contact prediction tests and should be generally applicable to protein structure prediction.
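
The expanded alphabet can be pictured as an index over (amino acid, secondary-structure state) pairs. The sketch below is illustrative only: the ordering of letters, the function names, and the single energy value are assumptions, not the paper's parameters.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = "HEC"  # helix, strand, coil


def letter_index(amino_acid, state):
    """Map an (amino acid, secondary-structure state) pair to an index 0..59."""
    return STATES.index(state) * len(AMINO_ACIDS) + AMINO_ACIDS.index(amino_acid)


ALPHABET_SIZE = len(AMINO_ACIDS) * len(STATES)

# A symmetric contact-energy table has one entry per unordered pair of letters.
N_PARAMETERS = ALPHABET_SIZE * (ALPHABET_SIZE + 1) // 2


def contact_energy(table, res_a, res_b):
    """Look up an energy in a symmetric table keyed by sorted index pairs."""
    i, j = sorted((letter_index(*res_a), letter_index(*res_b)))
    return table.get((i, j), 0.0)


# Hypothetical entry: Leu in a helix contacting Val in a strand.
table = {tuple(sorted((letter_index("L", "H"), letter_index("V", "E")))): -1.2}
energy = contact_energy(table, ("V", "E"), ("L", "H"))
```

With 60 letters the symmetric table holds 1,830 independent entries, compared with 210 for the plain 20-letter alphabet, which is why such a potential needs many more observed contacts to estimate.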

7.
The relationship between protein sequence and structure arises entirely from amino acid physical properties. An alternative method is therefore proposed to identify homologs in which residue equivalence is based exclusively on the pairwise physical property similarities of sequences. This approach, the property factor method (PFM), is entirely different from those in current use. A comparison is made between our method and PSI-BLAST. We demonstrate that traditionally defined sequence similarity can be very low for pairs of sequences (which therefore cannot be identified using PSI-BLAST), but similarity of physical property distributions results in almost identical 3D structures. The performance of PFM is shown to be better than that of PSI-BLAST when sequence matching is comparable, based on a comparison using targets from CASP10 (89 targets) and CASP11 (51 targets). It is also shown that PFM outperforms PSI-BLAST in informatically challenging targets.

The prediction of protein structure from sequence is a central problem in molecular biology, with important practical applications. The most reliable approach to this problem is homology modeling (1–3), in which the structure of a target sequence is modeled on the known structures of candidate proteins whose sequences are judged to be similar to that of the target. The standard procedure for calculating the degree of similarity between two sequences involves alignment techniques. Sequence alignment, in its usual incarnation, involves two computational features: i) a matrix that establishes a quantitative degree of similarity between any two amino acids; and ii) a procedure for locating equivalent subsequence fragments in the two sequences of interest and assessing numerical penalties for the presence of insertions or deletions in one sequence relative to the other.

In previous work (4), we discussed the problems associated with this approach. In standard alignment, equivalence matrices are self-referential, in the sense that they are generally constructed from alignments. Biases and preconceptions inherent in the initial alignment will therefore be incorporated in a nonlinear fashion into the final result. In the present work, we use a residue-equivalence measure based entirely on amino acid physical properties (5) [the property factor-based method (PFM), details of which are given in Methods]. As a result, no preconceptions as to correct matches color our results. Furthermore, a matrix-matching algorithm adapted from image processing is used to provide an extremely efficient sequence search/matching procedure. It is shown that, together, these lead to results that are not attainable by current methods and to the accurate prediction of structures that were poorly predicted in the two most recent CASP (critical assessment of protein structure prediction) exercises. This approach recalibrates what is possible using sequence alignment, and in the process eliminates much of the computational complexity that arises in the application of traditional methods.

8.
9.
The strong coupling between secondary and tertiary structure formation in protein folding is neglected in most structure prediction methods. In this work we investigate the extent to which nonlocal interactions in predicted tertiary structures can be used to improve secondary structure prediction. The architecture of a neural network for secondary structure prediction that utilizes multiple sequence alignments was extended to accept low-resolution nonlocal tertiary structure information as an additional input. By using this modified network, together with tertiary structure information from native structures, the Q3-prediction accuracy is increased by 7-10% on average and by up to 35% in individual cases for independent test data. By using tertiary structure information from models generated with the ROSETTA de novo tertiary structure prediction method, the Q3-prediction accuracy is improved by 4-5% on average for small and medium-sized single-domain proteins. Analysis of proteins with particularly large improvements in secondary structure prediction using tertiary structure information provides insight into the feedback from tertiary to secondary structure.
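
For readers unfamiliar with the Q3 measure, it is the percentage of residues whose predicted three-state label (helix, strand, coil) matches the observed one. A minimal sketch, using made-up state strings and a hypothetical function name:

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state (H, E, or C) is correct."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 17-residue example: the prediction shortens one helix by two
# residues and one strand by one residue.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHCCCCEEEECCC"

q3 = q3_accuracy(predicted, observed)
print(f"Q3 = {q3:.1f}%")
```

Here 14 of 17 residues agree, so correcting the three boundary residues would raise Q3 by roughly 18 percentage points for this one chain, which illustrates how individual proteins can show much larger gains than the average.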

10.
The similarity in the three-dimensional structures of homologous proteins imposes strong constraints on their sequence variability. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure. Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced direct-coupling analysis (DCA). Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information. DCA is shown to yield a large number of correctly predicted contacts, recapitulating the global structure of the contact map for the majority of the protein domains examined. Furthermore, our analysis captures clear signals beyond intradomain residue contacts, arising, e.g., from alternative protein conformations, ligand-mediated residue couplings, and interdomain interactions in protein oligomers. Our findings suggest that contacts predicted by DCA can be used as a reliable guide to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures, contingent on the existence of a large number of homologous sequences which are being rapidly made available due to advances in genome sequencing.

11.
Information derived from metagenome sequences through deep-learning techniques has significantly improved the accuracy of template free protein structure modeling. However, most of the deep learning–based modeling studies are based on blind sequence database searches and suffer from low efficiency in computational resource utilization and model construction, especially when the sequence library becomes prohibitively large. We proposed a MetaSource model built on 4.25 billion microbiome sequences from four major biomes (Gut, Lake, Soil, and Fermentor) to decode the inherent linkage of microbial niches with protein homologous families. Large-scale protein family folding experiments on 8,700 unknown Pfam families showed that a microbiome targeted approach with multiple sequence alignment constructed from individual MetaSource biomes requires more than threefold less computer memory and CPU (central processing unit) time but generates contact-map and three-dimensional structure models with a significantly higher accuracy, compared with that using combined metagenome datasets. These results demonstrate an avenue to bridge the gap between the rapidly increasing metagenome databases and the limited computing resources for efficient genome-wide database mining, which provides a useful bluebook to guide future microbiome sequence database and modeling development for high-accuracy protein structure and function prediction.

Given the rapid explosion of protein sequences, computer-based approaches play an increasingly important role in protein structure determination and structure-based function annotations (1, 2). Two types of strategies have been widely considered for protein three-dimensional (3D) structure prediction (2): the first is template-based modeling, which constructs structural models using solved structures as templates and whose success requires the availability of homologous templates in the Protein Data Bank (PDB); the second is the template-free modeling (FM) approach (or ab initio modeling), which is dedicated to modeling the “Hard” proteins that do not have close homologous structures in the PDB. Due to the lack of reliable physics-based force fields, the most efficient FM methods, including Rosetta (3), QUARK (4), and I-TASSER (Iterative Threading ASSEmbly Refinement) (5), and most recently AlphaFold (6) and trRosetta (7), rely on a priori spatial restraints derived, usually through deep neural-network learning (8, 9), from the coevolution information based on multiple sequence alignments (MSAs) of homologous sequences (10). Hence, to model the 3D structure of the “Hard” proteins, a sufficient number of homologous sequences is critical to ensure the accuracy of deep machine-learning models and the quality of subsequent 3D structure constructions (11).

Considerable effort has recently been devoted to the utilization of metagenome sequence data to enhance the MSA and FM model constructions. For example, Ovchinnikov et al. used the Integrated Microbial Genomes database to generate contact-map predictions and create high-confidence models for 614 Pfam protein families that lack homologous structures in the PDB (12). Using UniRef20 (13), Michel et al. combined contact-map prediction with the CNS (Crystallography & NMR System) folding method (14) to model protein structure for 558 Pfam families of unknown structure with an estimated 90% specificity. Most recently, Wang et al. examined the usefulness of the Tara Oceans microbial genomes and found that the microbiome genomes can provide additional help for high-quality MSA construction and protein structure and function modeling (15). This result demonstrated a significant role of the microbiome sequences, which represent one of the largest reservoirs of microbial species on this planet, in FM structural folding and structure-based function annotations.

Despite the success of metagenome-assisted 3D structure modeling, there are still thousands of Pfam families whose structure cannot be modeled with satisfactory confidence. One critical reason is that despite the rapid accumulation of sequences, the current sequence databases are far from complete, and very few homologous sequences are available for many of the FM targets. On the other hand, the metagenome sequence databases have become extremely large (e.g., the Joint Genome Institute database contains more than 60 billion microbial genes and keeps increasing with at least 20,000 new sequences added per day) (16, 17), which makes a thorough and balanced database search increasingly slow and difficult. In a recent study, Zhang et al. showed that using current data mining tools, the quality of MSAs from metagenome libraries is not always proportional to the effective number of homologous sequences (Neff; see SI Appendix, Eq. S1), partly due to the complexity of the sequence family relations and the bias of sequence database searches (10). The recent CASP experiments also witnessed various examples where the folding simulations for FM targets are negatively impacted by the contact/distance predictions due to the biased MSAs from the large metagenome datasets despite the high Neff value (18, 19). Therefore, balanced sequence mining with accurate MSA construction is of critical importance to help improve the efficiency of sequence database searching and the subsequent 3D structure modeling.

In this work, we hypothesize that there exists an inherent evolutionary linkage between microbial niches (biome) and protein families, where a targeted approach built on linked biome families can be used to improve both efficiency and accuracy of MSA construction and protein structure predictions. To examine the hypothesis, we collected a model library of 4.25 billion microbiome sequences from the EBI metagenomic database (MGnify database) (20) that cover four major biomes (Gut, Lake, Soil, and Fermentor). The “marginal effect” analyses showed profoundly different effects of specific biomes on supplementing homologous sequences for different Pfam families. A machine-learning model named MetaSource is then developed to predict the source biome of target proteins, which can significantly improve the accuracy of contact-map and 3D structure models while using more than threefold less computer memory and CPU time. These results have validated the important biome-sequence–Pfam associations, which can lead the way toward better efficiency and effectiveness of the microbiome-based targeted approach to protein structure and function predictions.

12.
To test a different approach to understanding the relationship between the sequence of part of a protein and its conformation in the overall folded structure, the amino acid sequence corresponding to an alpha-helix of T4 lysozyme was duplicated in tandem. The presence of such a sequence repeat provides the protein with "choices" during folding. The mutant protein folds with almost wild-type stability, is active, and crystallizes in two different space groups, one isomorphous with wild type and the other with two molecules in the asymmetric unit. The fold of the mutant is essentially the same in all cases, showing that the inserted segment has a well-defined structure. More than half of the inserted residues are themselves helical and extend the helix present in the wild-type protein. Participation of additional duplicated residues in this helix would have required major disruption of the parent structure. The results clearly show that the residues within the duplicated sequence tend to maintain a helical conformation even though the packing interactions with the remainder of the protein are different from those of the original helix. This finding supports the hypothesis that the structures of individual alpha-helices are determined predominantly by the nature of the amino acids within the helix, rather than the structural environment provided by the rest of the protein.

13.
With the advent of molecular cloning methods, the amino acid sequences for a number of membrane proteins have been determined. The relative paucity of detailed three-dimensional structural information available for these molecules has led to attempts to predict the secondary structures of membrane proteins based on folding motifs found in soluble proteins of known three-dimensional structure and sequence. In this study, we evaluated the accuracy of several of these methods in predicting the conformation of 15 integral membrane proteins and membrane-spanning polypeptides for which both primary and secondary structural information are available. χ² analyses indicated a less than 0.5% correlation between the net predicted secondary structures and the experimental results. A more stringent test of the accuracy of the methods, the index of prediction, was calculated for individual residues in four of the polypeptides for which the crystal structures were known; this criterion also indicated that the predicted assignments for the secondary structures of the residues were inaccurate. Thus, prediction schemes using soluble protein bases appear to be inappropriate for the prediction of membrane protein folding.

14.
The detection of ligand-binding sites is often the starting point for protein function identification and drug discovery. Because of inaccuracies in predicted protein structures, extant binding pocket-detection methods are limited to experimentally solved structures. Here, FINDSITE, a method for ligand-binding site prediction and functional annotation based on binding-site similarity across groups of weakly homologous template structures identified from threading, is described. For crystal structures, considering a cutoff distance of 4 Å as the hit criterion, the success rate is 70.9% for identifying the best of top five predicted ligand-binding sites with a ranking accuracy of 76.0%. Both high prediction accuracy and ability to correctly rank identified binding sites are sustained when approximate protein models (<35% sequence identity to the closest template structure) are used, showing a 67.3% success rate with 75.5% ranking accuracy. In practice, FINDSITE tolerates structural inaccuracies in protein models up to an rmsd of 8-10 Å from the crystal structure. This is because analysis of weakly homologous protein models reveals that about half have an rmsd from the native binding site of <2 Å. Furthermore, the chemical properties of template-bound ligands can be used to select ligand templates associated with the binding site. In most cases, FINDSITE can accurately assign a molecular function to the protein model.

15.
Reliable structure-prediction methods for membrane proteins are important because the experimental determination of high-resolution membrane protein structures remains very difficult, especially for eukaryotic proteins. However, membrane proteins are typically longer than 200 aa and represent a formidable challenge for structure prediction. We have developed a method for predicting the structures of large membrane proteins by constraining helix–helix packing arrangements at particular positions predicted from sequence or identified by experiments. We tested the method on 12 membrane proteins of diverse topologies and functions with lengths ranging between 190 and 300 residues. Enforcing a single constraint during the folding simulations enriched the population of near-native models for 9 proteins. In 4 of the cases in which the constraint was predicted from the sequence, 1 of the 5 lowest energy models was superimposable within 4 Å on the native structure. Near-native structures could also be selected for heme-binding and pore-forming domains from simulations in which pairs of conserved histidine-chelating hemes and one experimentally determined salt bridge were constrained, respectively. These results suggest that models within 4 Å of the native structure can be achieved for complex membrane proteins if even limited information on residue-residue interactions can be obtained from protein structure databases or experiments.

16.
The Meccano (or Lego) set approach to synthetic protein design envisages covalent assembly of prefabricated units of peptide secondary structure. Stereochemical control over peptide folding is achieved by incorporation of conformationally constrained residues like alpha-aminoisobutyric acid (Aib) or DPro that nucleate helical and beta-hairpin structures, respectively. The generation of a synthetic sequence containing both a helix and a hairpin is achieved in the peptide BH17, Boc-Val-Ala-Leu-Aib-Val-Ala-Leu-Gly-Gly-Leu-Phe-Val-DPro-Gly-Leu-Phe-Val-OMe (where Boc is t-butoxycarbonyl), as demonstrated by a crystal structure determination. The achiral -Gly-Gly- linker permits helix termination as a Schellman motif and extension to the strand segment of the hairpin. Structure parameters for C₈₉H₁₄₃N₁₇O₂₀·2H₂O are space group P2₁, a = 14.935(7) Å, b = 18.949(6) Å, c = 19.231(8) Å, β = 101.79(4)°, Z = 2, agreement factor R₁ = 8.50% for 4,862 observed reflections > 4σ(F), and resolution of approximately 0.98 Å.
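
As a worked illustration of the quoted cell parameters, the unit cell volume of a monoclinic crystal follows from V = a·b·c·sin β. The volume itself is not reported in the abstract; the sketch below simply derives it from the stated values, ignoring the standard uncertainties in parentheses, and the function name is hypothetical.

```python
import math


def monoclinic_cell_volume(a, b, c, beta_degrees):
    """Volume of a monoclinic unit cell: V = a * b * c * sin(beta)."""
    return a * b * c * math.sin(math.radians(beta_degrees))


# Cell edges in angstroms and beta in degrees, as quoted in the abstract.
volume = monoclinic_cell_volume(14.935, 18.949, 19.231, 101.79)

# Z = 2 formula units per cell, so each occupies half the cell volume.
volume_per_formula_unit = volume / 2

print(f"cell volume: {volume:.0f} cubic angstroms")
print(f"per formula unit: {volume_per_formula_unit:.0f} cubic angstroms")
```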

17.
Local protein structure prediction efforts have consistently failed to exceed approximately 70% accuracy. We characterize the degeneracy of the mapping from local sequence to local structure responsible for this failure by investigating the extent to which similar sequence segments found in different proteins adopt similar three-dimensional structures. Sequence segments 3-15 residues in length from 154 different protein families are partitioned into neighborhoods containing segments with similar sequences using cluster analysis. The consistency of the sequence-to-structure mapping is assessed by comparing the local structures adopted by sequence segments in the same neighborhood in proteins of known structure. In the 154 families, 45% and 28% of the positions occur in neighborhoods in which one and two local structures predominate, respectively. The sequence patterns that characterize the neighborhoods in the first class probably include virtually all of the short sequence motifs in proteins that consistently occur in a particular local structure. These patterns, many of which occur in transitions between secondary structural elements, are an interesting combination of previously studied and novel motifs. The identification of sequence patterns that consistently occur in one or a small number of local structures in proteins should contribute to the prediction of protein structure from sequence.  相似文献   

18.
For a large class of proteins called sandwich-like proteins (SPs), the secondary structures consist of two beta-sheets packed face-to-face, with each beta-sheet consisting typically of three to five beta-strands. An important step in the prediction of the three-dimensional structure of an SP is the prediction of its supersecondary structure, namely the prediction of the arrangement of the beta-strands in the two beta-sheets. Recently, significant progress in this direction was made, where it was shown that 91% of observed SPs form what we here call "canonical motifs." Here, we show that all canonical motifs can be constructed in a simple manner that is based on thermodynamic considerations and uses certain geometric structures. The number of these structures is much smaller than the number of possible strand arrangements. For instance, whereas for SPs consisting of six strands there exist a priori 900 possible strand arrangements, there exist only five geometric structures. Furthermore, the few motifs that are noncanonical can be constructed from canonical motifs by a simple procedure.  相似文献   

19.
The complete sequence of the male-specific region of the human Y chromosome (MSY) has been determined recently; however, detailed characterization for many of its encoded proteins still remains to be done. We applied state-of-the-art protein structure prediction methods to all 27 distinct MSY-encoded proteins to provide better understanding of their biological functions and their mechanisms of action at the molecular level. The results of such large-scale structure-functional annotation provide a comprehensive view of the MSY proteome, shedding light on MSY-related processes. We found that, in total, at least 60 domains are encoded by 27 distinct MSY genes, of which 42 (70%) were reliably mapped to currently known structures. The most challenging predictions include the unexpected but confident 3D structure assignments for three domains identified here encoded by the USP9Y, UTY, and BPY2 genes. The domains with unknown 3D structures that are not predictable with currently available theoretical methods are established as primary targets for crystallographic or NMR studies. The data presented here set up the basis for additional scientific discoveries in human biology of the Y chromosome, which plays a fundamental role in sex determination.  相似文献   

20.
Transmembrane β-barrels (TMBs) carry out major functions in substrate transport and protein biogenesis but experimental determination of their 3D structure is challenging. Encouraged by successful de novo 3D structure prediction of globular and α-helical membrane proteins from sequence alignments alone, we developed an approach to predict the 3D structure of TMBs. The approach combines the maximum-entropy evolutionary coupling method for predicting residue contacts (EVfold) with a machine-learning approach (boctopus2) for predicting β-strands in the barrel. In a blinded test for 19 TMB proteins of known structure that have a sufficient number of diverse homologous sequences available, this combined method (EVfold_bb) predicts hydrogen-bonded residue pairs between adjacent β-strands at an accuracy of ∼70%. This accuracy is sufficient for the generation of all-atom 3D models. In the transmembrane barrel region, the average 3D structure accuracy [template-modeling (TM) score] of top-ranked models is 0.54 (ranging from 0.36 to 0.85), with a higher (44%) number of residue pairs in correct strand–strand registration than in earlier methods (18%). Although the nonbarrel regions are predicted less accurately overall, the evolutionary couplings identify some highly constrained loop residues and, for FecA protein, the barrel including the structure of a plug domain can be accurately modeled (TM score = 0.68). Lower prediction accuracy tends to be associated with insufficient sequence information and we therefore expect increasing numbers of β-barrel families to become accessible to accurate 3D structure prediction as the number of available sequences increases.
Transmembrane β-barrels (TMBs) constitute 2–3% of all genes in Gram-negative bacterial genomes (1) and are also found in eukaryotes (2). There has been increasing interest in this class of proteins, as their roles have been uncovered in a wide range of biomedical fields. 
These roles include outer-membrane protein biogenesis (3, 4), antibiotic resistance (5), vaccine design, translocation of virulence factors, and the design of cancer therapeutics (6). In many of these examples, the 3D structure of the TMB has been crucial in elucidating the mechanisms of, for instance, substrate transport and voltage gating and in aiding therapeutic design.
Existing computational approaches can successfully identify the location of β-strands (7, 8), but 3D-modeling techniques such as tobmodel and 3d-spot (9, 10) cannot account for the nonsymmetrical, noncircular shape of the barrel pore or the barrel/plug or the transmembrane β-strand/loop interactions. Recent work has shown that 3D structures of globular (11) and α-helical membrane proteins (12, 13) can be successfully predicted from the identification of coevolved residues in multiple-sequence alignments (MSA). The idea is that spatially close residues coevolve to maintain structural and functional integrity of the protein (11). Although this approach was first suggested and tried in 1994 (14–17), only recent methods using a global statistical model identify sufficiently accurate residue–residue contacts from evolutionary covariation to successfully fold proteins de novo (11, 18–20). The key innovation was to distinguish direct from indirect correlations, using maximum-entropy or related statistical approaches under the constraints of the data.
Here, we present a hybrid method based on evolutionary couplings for contact prediction obtained from EVFold-PLM (11, 19) together with an improved β-strand prediction method based on boctopus2, a topology prediction method for transmembrane β-barrels (7). The method predicts consistent sets of backbone hydrogen-bonding restraints that can be used to fold large TMBs. 
Our approach relies on structural features that are common to the known 3D structures of bacterial TMBs, such as the antiparallel arrangement of transmembrane β-strands and the facts that the first strand, as far as is known, always traverses from the inner-membrane region to the extracellular side and pairs of β-strands have a right-handed twist when viewed along the direction of the strand. In addition to these features, our algorithm uses evolutionary couplings (ECs) in transmembrane β-strands to infer the optimal strand registration between pairs of adjacent β-strands and, by implication, backbone hydrogen-bonded residue pairs. The method achieves reasonably correct strand registration between adjacent β-strand pairs and all-atom 3D structures of TMBs and, where applicable, predicts interactions between the β-barrel and plug domains and between transmembrane β-strands and long extracellular loops. Moreover, in a few proteins we show that ECs can be used to detect functionally important residues.  相似文献   
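The idea of using evolutionary couplings to choose a strand registration can be illustrated with a small sketch. It is only a toy under stated assumptions, not the EVfold_bb algorithm: the function name `best_registration`, the `max_shift` window, and the scoring rule (sum of coupling scores over the residue pairs implied by a given offset) are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def best_registration(
    strand_a: List[int],
    strand_b: List[int],
    ec_scores: Dict[Tuple[int, int], float],
    max_shift: int = 2,
) -> Tuple[int, float]:
    """Pick the offset between two adjacent antiparallel strands that
    maximizes the summed coupling score of the implied residue pairs.

    strand_a, strand_b: residue indices in sequence order (hypothetical input).
    ec_scores: coupling score keyed by (i, j) with i < j; missing pairs count as 0.
    Returns (shift, total_score).
    """
    # Antiparallel pairing: the first residue of strand_a faces the last of strand_b.
    reversed_b = strand_b[::-1]
    best_shift, best_total = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        total = 0.0
        for k, i in enumerate(strand_a):
            m = k + shift
            if 0 <= m < len(reversed_b):
                j = reversed_b[m]
                total += ec_scores.get((min(i, j), max(i, j)), 0.0)
        # Strict comparison keeps the first (most negative) shift on ties.
        if total > best_total:
            best_shift, best_total = shift, total
    return best_shift, best_total
```

With strands `[10, 11, 12, 13]` and `[20, 21, 22, 23]` and strong couplings on the pairs (10, 22), (11, 21), and (12, 20), the sketch selects a shift of 1, because that offset is the one under which all three coupled pairs face each other.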
