Similar literature
1.
The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction. Here we present a revised approach (SIFTER version 2.0) that enables annotations on a genomic scale. SIFTER 2.0 produces equivalently precise predictions compared to the earlier version on a carefully studied family and on a collection of 100 protein families. We have added an approximation method to SIFTER 2.0 and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family. On the Nudix protein family, previously inaccessible to the SIFTER framework because of the 66 possible molecular functions, SIFTER achieved 47.4% accuracy on experimental data (where BLAST achieved 34.0%). Finally, we used SIFTER to annotate all of the Schizosaccharomyces pombe proteins with experimental functional characterizations, based on annotations from proteins in 46 fungal genomes. SIFTER precisely predicted molecular function for 45.5% of the characterized proteins in this genome, as compared with four current function prediction methods that precisely predicted function for 62.6%, 30.6%, 6.0%, and 5.7% of these proteins. We use both precision-recall curves and ROC analyses to compare these genome-scale predictions across the different methods and to assess performance on different types of applications. SIFTER 2.0 is capable of predicting protein molecular function for large and functionally diverse protein families using an approximate statistical model, enabling phylogenetics-based protein function prediction for genome-wide analyses. The code for SIFTER and protein family data are available at http://sifter.berkeley.edu.

2.
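Protein fold prediction based on multi-feature fusion (total citations: 1; self-citations: 0; citations by others: 1)
Protein fold prediction provides useful information for heuristic searches of protein tertiary structure. Most known fold prediction methods rely on a single feature or on a simple combination of several features. This paper adopts a multi-feature fusion method that predicts 27 fold classes starting from the protein primary sequence. A support vector machine is used as the classifier with an all-versus-all multi-class strategy, and six features of each sample are fused: amino acid composition, polarity, polarizability, van der Waals volume, hydrophobicity, and predicted secondary structure. The overall prediction accuracy on the independent test set is 59.22%, an improvement of 3.2% over the result of Ding et al. The results show that multi-feature fusion is an effective method for protein fold prediction.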
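
As a rough illustration of the first of the six features listed in this abstract, the sketch below computes an amino acid composition vector from a primary sequence. It is not the authors' code: the function name, the 20-letter alphabet ordering, and the example sequence are assumptions made here, and the fusion step itself is not shown because the abstract does not describe it.

```python
# Minimal sketch of one input feature (amino acid composition).
# Names and the example sequence are hypothetical, not taken from the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, in a fixed order


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    sequence = sequence.upper()
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    # One entry per residue type, so every protein maps to a 20-dimensional vector.
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]


composition = amino_acid_composition("MKVLAAGIVK")  # made-up toy sequence
```

A vector like this could serve as one input to a classifier such as an SVM; the other five features would need their own encodings.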

3.
This work describes a hidden Markov model (HMM) with a reduced number of states that simultaneously learns the amino acid sequence and secondary structure of proteins of known three-dimensional structure. The model is used for two tasks: protein class prediction and fold recognition. The Protein Data Bank and the annotation of the SCOP database are used for training and evaluation of the proposed HMM for a number of protein classes and folds. Results demonstrate that the reduced state-space HMM classifies proteins as well as, and in some cases better than, an HMM trained with the amino acid sequence. The major advantage of the proposed approach is that a small number of states is employed and the training algorithm is of low complexity and thus relatively fast.

4.
In protein fold recognition, the main disadvantage of hidden Markov models (HMMs) is the employment of large-scale model architectures which require large data sets and high computational resources for training. Also, HMMs must consider sequential information about secondary structures of proteins, to improve prediction performance and reduce model parameters. Therefore, we propose a novel method for protein fold recognition based on a hidden Markov model, called a 9-state HMM. The method can (i) reduce the number of states using secondary structure information about proteins for each fold and (ii) recognize protein folds more accurately than other HMMs.

5.
Knowing the type of an uncharacterized membrane protein often provides a useful clue in both basic research and drug discovery. With the explosion of protein sequences generated in the post-genomic era, determination of membrane protein types by experimental methods is expensive and time consuming. It therefore becomes important to develop an automated method to find the possible types of membrane proteins. In view of this, various computational membrane protein prediction methods have been proposed. They extract protein feature vectors, such as PseAAC (pseudo amino acid composition) and PsePSSM (pseudo position-specific scoring matrix), to represent the protein sequence, and then learn a distance metric for the KNN (K nearest neighbor) or NN (nearest neighbor) classifier to predict the final type. Most of the metrics are learned using linear dimensionality reduction algorithms such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Such metrics are common to all the proteins in the dataset. In fact, they assume that the proteins lie on a uniform distribution, which can be captured by the linear dimensionality reduction algorithm. We doubt this assumption, and learn local metrics that are optimized for local subsets of the full protein set. The learning procedure is iterated with the protein clustering. A novel ensemble distance metric is then obtained by combining the local metrics through Tikhonov regularization. The experimental results on a benchmark dataset demonstrate the feasibility and effectiveness of the proposed algorithm, named ProClusEnsem.

6.
The outer membrane of gram-negative bacteria contains several proteins, and some of these proteins, the porins, have numerous biological functions in the interaction with the host; porins are involved in the activation of signal transduction pathways and, in particular, in the activation of the Raf/MEK1-MEK2/mitogen-activated protein kinase (MAPK) cascade. The P2 porin is the most abundant outer membrane protein of Haemophilus influenzae type b. A three-dimensional structural model for P2 was constructed based on the crystal structures of Klebsiella pneumoniae OmpK36 and Escherichia coli PhoE and OmpF. The protein was readily assembled into the beta-barrel fold characteristic of porins, despite the low sequence identity with the template proteins. The model provides information on the structural features of P2 and insights relevant for prediction of domains corresponding to surface-exposed loops, which could be involved in the activation of signal transduction pathways. To identify the role of surface-exposed loops, a set of peptides was synthesized according to the proposed model and assayed for MEK1-MEK2/MAPK pathway activation. Our results show that synthetic peptides corresponding to surface loops of protein P2 are able to activate the MEK1-MEK2/MAPK pathways like the entire protein, while peptides modeled on internal beta strands are unable to induce significant phosphorylation of the MEK1-MEK2/MAPK pathways. In particular, the peptides corresponding to loops L5 (Lys206 to Gly219), L6B (Ser239 to Lys253), and L7 (Thr280 to Lys287) activate, as the whole protein does, essentially JNK and p38.

7.
Protein data contain discriminative patterns that can be used in many beneficial applications if they are defined correctly. In this work, sequential pattern mining (SPM) is utilized for sequence-based fold recognition. Protein classification in terms of fold recognition plays an important role in computational protein analysis, since it can contribute to the determination of the function of a protein whose structure is unknown. Specifically, one of the most efficient SPM algorithms, cSPADE, is employed for the analysis of protein sequences. A classifier uses the extracted sequential patterns to classify proteins into the appropriate fold category. For training and evaluating the proposed method we used the protein sequences from the Protein Data Bank and the annotation of the SCOP database. The method exhibited an overall accuracy of 25% in a classification problem with 36 candidate categories. The classification performance reaches up to 56% when the five most probable protein folds are considered.
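
The two figures quoted in this abstract (25% overall, 56% when the five most probable folds are considered) correspond to what is usually called top-1 and top-k accuracy. The sketch below shows how such a metric can be computed. It assumes each protein comes with a list of candidate folds ranked from most to least probable; the function name and the toy labels are hypothetical and unrelated to the paper's data.

```python
# Minimal sketch of top-k accuracy; labels below are made-up placeholders.
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the k highest-ranked predictions."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("one ranked list is required per true label")
    if not true_labels:
        raise ValueError("at least one sample is required")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]  # only the k most probable candidates count
    )
    return hits / len(true_labels)


# Each inner list is ordered from most to least probable fold.
ranked = [
    ["a.1", "b.2", "c.3"],
    ["b.2", "a.1", "c.3"],
    ["c.3", "b.2", "a.1"],
    ["a.1", "c.3", "b.2"],
]
truth = ["a.1", "a.1", "a.1", "b.2"]

top1 = top_k_accuracy(ranked, truth, 1)
top2 = top_k_accuracy(ranked, truth, 2)
top3 = top_k_accuracy(ranked, truth, 3)
```

Top-k accuracy never decreases as k grows, which is why the reported figure rises when five candidate folds are allowed.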

8.
Remote protein homology detection and fold recognition refer to the detection of structural homology in proteins whose sequences show little or no similarity. To detect protein structural classes from protein primary sequence information, homology-based methods have been developed, which can be divided into three types: discriminative classifiers, generative models for protein families, and pairwise sequence comparisons. Support Vector Machines (SVM) and Neural Networks (NN) are two popular discriminative methods. Recent studies have shown that SVM trains quickly and is more accurate and efficient than NN. We present a comprehensive method based on two-layer classifiers. The first layer is used to detect up to the superfamily and family levels of the SCOP hierarchy using optimized binary SVM classification rules. It uses a kernel function known as the Bio-kernel, which incorporates biological information in the classification process. The second layer uses a discriminative SVM algorithm with a string kernel that detects up to the protein fold level of the SCOP hierarchy. The results were evaluated using mean ROC and mean MRFP, and their significance was tested with a pairwise t-test. Experimental results show that our approach significantly improves the performance of remote protein homology detection and fold recognition on three different versions of the SCOP dataset (1.53, 1.67 and 1.73). We achieved improvements in mean ROC of 4.19% on SCOP 1.53, 4.75% on SCOP 1.67 and 4.03% on SCOP 1.73 when compared to the results produced by well-known methods. The combination of the first and second layers of BioSVM-2L performs well in remote homology detection and fold recognition across all three dataset versions.

9.
OBJECTIVE: One of the interesting computational topics in bioinformatics is the prediction of the secondary structure of proteins. Over 30 years of research has been devoted to the topic, but we are still far from having reliable prediction methods. A critical piece of information for accurate prediction of secondary structure is the helix and strand content of a given protein sequence. The ability to accurately predict the content of those two secondary structures has good potential to improve the accuracy of secondary structure prediction. Most of the existing methods use a composition vector to predict the content. Their underlying assumption is that the vector can be used to provide a functional mapping between primary sequence and helix/strand content. While this is true for small sets of proteins, we show that for larger protein sets such mappings are inconsistent, i.e. the same composition vectors correspond to different contents. To this end, we propose a method for prediction of helix/strand content from primary protein sequences that is fundamentally different from currently available methods. METHODS AND MATERIAL: Our method is accurate and uses a novel approach to obtain information from the primary sequence based on a composition moment vector, a measure that includes information about both the composition of a given primary sequence and the position of amino acids in the sequence. In contrast to the composition vector, we show that it provides a functional mapping between primary sequence and the helix/strand content. RESULTS: A set of benchmarks involving a large protein dataset consisting of over 11,000 protein sequences from the Protein Data Bank was performed to validate the method. Prediction done by a neural network had an average accuracy of 91.5% for the helix and 94.5% for the strand contents. We also show that using the new measure results in about a 40% reduction of error rates when compared with the composition vector results. CONCLUSIONS: The developed method has much better accuracy when compared with other existing methods, as shown on a large body of proteins, in contrast to other reported results that often target small sets of specific protein types, such as globular proteins.
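
The abstract's central point is that a plain composition vector cannot distinguish sequences that contain the same residues in a different order, whereas a position-aware measure can. The sketch below illustrates that idea only. The formula used here (sum of residue positions raised to a power, divided by a length-dependent normalizer) is an assumed form chosen for illustration; the abstract does not give the authors' exact definition, and the function name and toy sequences are hypothetical.

```python
# Illustrative sketch under an assumed formula, not the paper's definition.
def composition_moment(sequence, amino_acid, order):
    """Position-weighted count of one amino acid.

    Assumed form: sum(position ** order) / product(length - d for d in 1..order),
    with 1-based positions. With order 0 this reduces to a plain residue count.
    """
    length = len(sequence)
    if order < 0 or order >= length:
        raise ValueError("order must satisfy 0 <= order < len(sequence)")
    positions = [i + 1 for i, residue in enumerate(sequence) if residue == amino_acid]
    normalizer = 1
    for d in range(1, order + 1):
        normalizer *= length - d
    return sum(p ** order for p in positions) / normalizer


# Two toy sequences with identical composition but different residue order.
count_first = composition_moment("AAG", "A", 0)
count_second = composition_moment("GAA", "A", 0)
moment_first = composition_moment("AAG", "A", 1)
moment_second = composition_moment("GAA", "A", 1)
```

Both toy sequences contain two alanines, so the order-0 values agree, while the order-1 values differ because the alanines sit at different positions.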

10.
Detecting remote homology of a protein is a critical step toward annotating its structure and function. Supervised learning algorithms such as support vector machines are currently the most accurate methods. Position-specific score matrices (PSSMs) contain rich information about the evolutionary relationships of proteins. However, PSSMs often have different lengths, which makes them difficult for machine-learning methods to use. In this study, a simple, fast and powerful method is presented for protein remote homology detection, which combines a support vector machine with the auto-cross covariance transformation. The PSSMs are converted into a series of fixed-length vectors by the auto-cross covariance transformation, and these vectors are then input to a support vector machine classifier for remote homology detection. The sequence-order effects can be effectively captured by this scheme. Experiments are performed on well-established datasets, and remote homology is simulated at the superfamily and the fold level, respectively. The results show that the proposed method, referred to as ACCRe, is comparable to or even better than the state-of-the-art methods in terms of detection performance, and its time complexity is superior to those of other profile-based SVM methods. The auto-cross covariance transformation provides a novel way to use evolutionary information, which can be widely applied in protein-level studies.
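
The key step in this abstract is turning a PSSM whose number of rows depends on protein length into a vector of fixed size. The sketch below shows one common formulation of the auto-cross covariance transformation in plain Python. It is a sketch under assumptions, not the ACCRe implementation: the function names, the row-per-residue layout, and the toy matrices are all hypothetical, and a real PSSM would have 20 columns rather than 3.

```python
# Minimal sketch of an auto-cross covariance (ACC) transform; toy data only.
def auto_covariance(column, lag):
    """Covariance between a column and itself shifted by `lag` positions."""
    n = len(column)
    mean = sum(column) / n
    total = sum((column[j] - mean) * (column[j + lag] - mean) for j in range(n - lag))
    return total / (n - lag)


def cross_covariance(col_a, col_b, lag):
    """Covariance between one column and a different column shifted by `lag`."""
    n = len(col_a)
    mean_a = sum(col_a) / n
    mean_b = sum(col_b) / n
    total = sum((col_a[j] - mean_a) * (col_b[j + lag] - mean_b) for j in range(n - lag))
    return total / (n - lag)


def acc_transform(pssm, max_lag):
    """Map a PSSM (one row per residue) to max_lag * columns**2 features."""
    if max_lag >= len(pssm):
        raise ValueError("max_lag must be smaller than the sequence length")
    columns = [list(col) for col in zip(*pssm)]  # transpose to per-column lists
    features = []
    for lag in range(1, max_lag + 1):
        for col in columns:
            features.append(auto_covariance(col, lag))
        for a, col_a in enumerate(columns):
            for b, col_b in enumerate(columns):
                if a != b:
                    features.append(cross_covariance(col_a, col_b, lag))
    return features


# Two toy "PSSMs" of different lengths with 3 columns each.
short_pssm = [[1, 0, 2], [2, 1, 0], [0, 3, 1], [1, 1, 1]]
long_pssm = [[0, 1, 2], [2, 2, 0], [1, 0, 1], [3, 1, 0], [0, 2, 2], [1, 1, 3]]
short_vec = acc_transform(short_pssm, 2)
long_vec = acc_transform(long_pssm, 2)
```

Both outputs have 2 * 3 * 3 = 18 entries regardless of sequence length, which is what allows them to be passed to a classifier that expects fixed-length input.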

11.
Objective: To propose a new flexible and sparse classifier that results in interpretable decision support systems. Methods: Support vector machines (SVMs) for classification are very powerful methods to obtain classifiers for complex problems. Although the performance of these methods is consistently high, and non-linearities and interactions between variables can be handled efficiently when using non-linear kernels such as the radial basis function (RBF) kernel, their use in domains where interpretability is an issue is hampered by their lack of transparency. Many feature selection algorithms have been developed to allow for some interpretation, but the impact of the different input variables on the prediction still remains unclear. Alternative models using additive kernels are restricted to main effects, reducing their usefulness in many applications. This paper proposes a new approach to expand the RBF kernel into interpretable and visualizable components, including main and two-way interaction effects. In order to obtain a sparse model representation, an iterative l1-regularized parametric model using the interpretable components as inputs is proposed. Results: Results on toy problems illustrate the ability of the method to select the correct contributions and an improved performance over standard RBF classifiers in the presence of irrelevant input variables. For a 10-dimensional x-or problem, an SVM using the standard RBF kernel obtains an area under the receiver operating characteristic curve (AUC) of 0.947, whereas the proposed method achieves an AUC of 0.997. The latter additionally identifies the relevant components. In a second 10-dimensional artificial problem, the underlying class probability follows a logistic regression model. An SVM with the RBF kernel results in an AUC of 0.975, as opposed to 0.994 for the presented method. The proposed method is applied to two benchmark datasets: the Pima Indian diabetes and the Wisconsin Breast Cancer dataset. The AUC is in both cases comparable to those of the standard method (0.826 versus 0.826 and 0.990 versus 0.996) and those reported in the literature. The selected components are consistent with different approaches reported in other work. However, this method is able to visualize the effect of each of the components, allowing for interpretation of the learned logic by experts in the application domain. Conclusions: This work proposes a new method to obtain flexible and sparse risk prediction models. The proposed method performs as well as a support vector machine using the standard RBF kernel, but has the additional advantage that the resulting model can be interpreted by experts in the application domain.

12.
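Many important cellular processes, such as signal transduction, transport, cell motility, and most regulatory mechanisms, are mediated by protein-protein interactions. Physically, these interactions are realized through short stretches of residues that form the interface between the two interacting proteins. Identifying protein-protein interaction sites, and detecting the specificity and strength of the interactions between the amino acid residues involved, is a topic with important application prospects, ranging from rational drug design to the analysis of metabolic and signal transduction networks. Although many experimental techniques and computational methods of steadily improving accuracy exist for detecting protein-protein interactions, few can pinpoint the specific residues that take part in an interaction and their positions, and this information is required to apply interaction data directly to drug development. With the development of bioinformatics and computational biology, computational methods that predict protein-protein interaction sites from sequence and structure information have emerged from the study of the distinctive characteristics of known interaction sites. This paper briefly reviews computational methods that have made progress in recent years in predicting protein-protein interaction sites, including methods based on genomic information, methods based on protein primary sequence, and methods based on the structures of protein complexes. Although these methods have advanced markedly over the past few years, most research in this area is still in its infancy, and the shortcomings of current databases and experimental techniques considerably affect the further development and fair evaluation of computational prediction methods. Much work remains to be done to improve the robustness and reliability of protein-protein interaction site prediction. (Published here is Part 1.)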

13.
Diverse proteins are known to be capable of forming amyloid aggregates, self-seeding fibrillar assemblies that may be biologically functional or pathological. Well-known examples include neurodegenerative disease-associated proteins that misfold as amyloid, fungal prion proteins that can transition to a self-propagating amyloid form and certain bacterial proteins that fold as amyloid at the cell surface and promote biofilm formation. To further explore the diversity of amyloidogenic proteins, generally applicable methods for identifying them are critical. Here we describe a cell-based method for generating amyloid aggregates that relies on the natural ability of Escherichia coli cells to elaborate amyloid fibrils at the cell surface. We use several different yeast prion proteins and the human huntingtin protein to show that protein secretion via this specialized export pathway promotes acquisition of the amyloid fold specifically for proteins that have an inherent amyloid-forming propensity. Furthermore, our findings establish the potential of this E. coli-based system to facilitate the implementation of high-throughput screens for identifying amyloidogenic proteins and modulators of amyloid aggregation.

14.
Protein interactions are very important for the control of life activities. To study the principles of protein interactions, we first have to find the positions on a protein that are involved in the interactions, called interaction sites. In this paper, a novel method based on integrated RBF neural networks is proposed for the prediction of protein interaction sites. First, a number of features were extracted, i.e., sequence profiles, entropy, relative entropy, conservation weight, accessible surface area and sequence variability. Then six sliding windows over these features were constructed, containing 1, 3, 5, 7, 9 and 11 amino acid residues respectively. These sliding windows were fed into the input layers of six radial basis function neural networks that were optimized by Particle Swarm Optimization. Thus, six groups of results were obtained. Finally, these six groups of results were integrated by decision fusion (DF) and Genetic Algorithm based Selective Ensemble (GASEN). The experimental results show that the proposed method performs better than other related methods such as neural networks and support vector machines.
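
The sliding-window step described in this abstract can be sketched as follows. Each residue is represented by the features of itself and its neighbours, for odd window sizes from 1 to 11. The function name, the zero padding at the sequence ends, and the toy feature values are assumptions made here; the abstract does not say how the authors handled terminal residues.

```python
# Minimal sketch of sliding-window feature extraction; padding choice is assumed.
def sliding_windows(per_residue_features, window_size, pad_value=0.0):
    """Build one flattened feature vector per residue from a centred window."""
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd number")
    if not per_residue_features:
        raise ValueError("at least one residue is required")
    half = window_size // 2
    n_features = len(per_residue_features[0])
    # Pad both ends so that terminal residues also get a full-width window.
    padding = [[pad_value] * n_features for _ in range(half)]
    padded = padding + list(per_residue_features) + padding
    windows = []
    for center in range(len(per_residue_features)):
        window = padded[center:center + window_size]
        windows.append([value for residue in window for value in residue])
    return windows


# Toy input: three residues with a single made-up feature each.
features = [[1.0], [2.0], [3.0]]
windows_by_size = {size: sliding_windows(features, size) for size in (1, 3, 5, 7, 9, 11)}
```

Each of the six window sizes yields one input set, matching the six networks mentioned in the abstract; wider windows give each network more sequence context per residue.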

15.
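Determining the spatial structure of proteins helps in understanding protein function and how proteins carry out their functions. Predicting protein structure from the protein sequence, or from the amino acid sequence translated from a DNA reading frame, is therefore one of the current research hotspots. This paper reviews methods for protein spatial structure prediction, including homology modeling, threading, and hybrid methods, together with progress in their application.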

16.
MULTIPROSPECTOR, a multimeric threading algorithm for the prediction of protein–protein interactions, is applied to the genome of Saccharomyces cerevisiae. Each possible pairwise interaction among more than 6000 encoded proteins is evaluated against a dimer database of 768 complex structures by using a confidence estimate of the fold assignment and the magnitude of the statistical interfacial potentials. In total, 7321 interactions between pairs of different proteins are predicted, based on 304 complex structures. Quality estimation based on the coincidence of subcellular localizations and biological functions of the predicted interactors shows that our approach ranks third when compared with all other large-scale methods. Unlike other in silico methods, MULTIPROSPECTOR is able to identify the residues that participate directly in the interaction. Three hundred seventy-four of our predictions can be found by at least one of the other studies, which is compatible with the overlap between two different other methods. From the analysis of the mRNA abundance data, our method does not bias towards proteins with high abundance. Finally, several relevant predictions involved in various functions are presented. In summary, we provide a novel approach to predict protein–protein interactions on a genomic scale that is a useful complement to experimental methods.

17.
Identifying new drug target (DT) proteins is important in pharmaceutical and biomedical research. General machine learning method (GMLM) classifiers perform fairly well at prediction if the training dataset is well prepared. However, a common problem in preparing the training dataset is the lack of a negative dataset. To address this problem, we proposed two methods that can help GMLM better select the negative training dataset from the test dataset. The prediction accuracy was improved with the training dataset from the proposed strategies. The classifier identified 1797 and 227 potential DT proteins, some of which were mentioned in previous research, which added correlative weight to the new method. Practically, these two sets of potential DT proteins or their homologues are worth considering.

18.
This study introduces new neural network based methods for the assessment of the dynamics of the heart rate variability (HRV) signal. The heart rate regulation is assessed as a dynamical system operating in chaotic regimes. Radial-basis function (RBF) networks are applied as a tool for learning and predicting the HRV dynamics. HRV signals are analyzed from normal subjects before and after pharmacological autonomic nervous system (ANS) blockade and from diabetic patients with dysfunctional ANS. The heart rate of normal subjects presents notable predictability. The prediction error is minimized, in fewer degrees of freedom, in the case of diabetic patients. However, for the case of pharmacological ANS blockade, although correlation dimension approaches indicate a significant reduction in complexity, the RBF networks fail to reconstruct the underlying dynamics adequately. The transient attributes of the HRV dynamics under the pharmacological disturbance are proposed as the explanation for this failure of prediction.

19.
A number of techniques such as information extraction, document classification, document clustering and information visualization have been developed to ease the extraction and understanding of information embedded within text documents. However, knowledge that is embedded in natural language texts is difficult to extract using simple pattern matching techniques, and most of these methods do not help users directly understand key concepts and their semantic relationships in document corpora, which are critical for capturing their conceptual structures. The problem arises because most of the information is embedded within unstructured or semi-structured texts that computers cannot interpret easily. In this paper, we present a novel Biomedical Knowledge Extraction and Visualization framework, BioKEVis, to identify key information components from biomedical text documents. The information components are centered on key concepts. BioKEVis applies linguistic analysis and Latent Semantic Analysis (LSA) to identify key concepts. The information component extraction principle is based on natural language processing techniques and semantic-based analysis. The system is also integrated with a biomedical named entity recognizer, ABNER, to tag genes, proteins and other entity names in the text. We also present a method for collating information extracted from multiple sources to generate a semantic network. The network provides distinct user perspectives, allows navigation over documents with similar information components, and is also used to provide a comprehensive view of the collection. The system stores the extracted information components in a structured repository which is integrated with a query-processing module to handle biomedical queries over text documents. We also propose a document ranking mechanism to present retrieved documents in order of their relevance to the user query.

20.
This paper presents an intelligent decision support system designed on a decision fusion framework coupled with an a priori knowledge base for abnormality detection from endoscopic images. Sub-decisions are made based on associated component feature sets derived from the endoscopic images and predefined algorithms, and are subsequently fused to classify the patient state. Bayesian probability computations are employed to evaluate the accuracies of the sub-decisions, which are then used to estimate the probability of the fused decision. Compared with the corresponding results from individual methods, the proposed fusion approach improves the overall detectability of abnormalities in terms of detecting both true positive and true negative conditions.
