Similar Articles (20 results)
1.
Gene selection is important for cancer classification based on gene expression data because of the data's high dimensionality and small sample size. In this paper, we present a new gene selection method based on clustering, in which dissimilarity measures are obtained through kernel functions. The method iteratively searches for the best gene weights while optimizing the clustering objective function. An adaptive distance is used in the process, which is well suited to learning the gene weights during clustering and improves the performance of the algorithm. The proposed algorithm is simple and does not require any modification or parameter optimization for each dataset. We tested it on eight publicly available datasets, using two classifiers (support vector machine and k-nearest neighbor), and compared it with six other competitive feature selectors. The results show that the proposed algorithm achieves better accuracies and may be an efficient tool for finding possible biomarkers in gene expression data.

2.
Classification into multiple classes when the measured variables outnumber the samples is a major methodological challenge in -omics studies. Two algorithms that overcome this dimensionality problem are presented: the forest classification tree (FCT) and the forest support vector machine (FSVM). In FCT, a set of variables is randomly chosen and a classification tree (CT) is grown using a forward classification algorithm. The process is repeated and a forest of CTs is derived. Finally, the most frequent variables from the trees with the smallest apparent misclassification rate (AMR) are used to construct a productive tree. In FSVM, the CTs are replaced by SVMs. The methods are demonstrated using prostate gene expression data for classifying tissue samples into four tumor types. For a threshold split value of 0.001 and 100 markers, the productive CT consisted of 29 terminal nodes and achieved perfect classification (AMR = 0). When the threshold value was set to 0.01, a tree with 17 terminal nodes was constructed based on 15 markers (AMR = 7%). In FSVM, reducing the fraction of the forest used to construct the best classifier from the top 80% to the top 20% reduced the misclassification to 25% (when using 200 markers). The proposed methodologies may be used for identifying important variables in high-dimensional data. Furthermore, the FCT allows exploring the data structure and provides a decision rule.
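The forest idea above, grow many classifiers on random variable subsets, then count which variables recur in the best-AMR members, can be sketched compactly. The fragment below is an illustrative simplification, not the authors' code: decision stumps stand in for full classification trees, and the function names and parameter defaults are hypothetical.

```python
import numpy as np

def stump_amr(x, y):
    """Best apparent misclassification rate of a threshold stump on one feature."""
    best = 1.0
    for t in np.unique(x):
        pred = (x > t).astype(int)
        # min over both orientations of the stump
        amr = min(np.mean(pred != y), np.mean(pred == y))
        best = min(best, amr)
    return best

def forest_variable_frequency(X, y, n_rounds=200, subset_size=5, top_frac=0.2, seed=0):
    """Repeatedly fit a classifier on a random variable subset; return how often
    each variable appears among the best-AMR members of the forest."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    records = []  # (AMR, winning feature index) per forest member
    for _ in range(n_rounds):
        subset = rng.choice(n_features, size=subset_size, replace=False)
        amrs = [stump_amr(X[:, j], y) for j in subset]
        k = int(np.argmin(amrs))
        records.append((amrs[k], subset[k]))
    records.sort(key=lambda r: r[0])                    # best AMR first
    top = records[: max(1, int(top_frac * n_rounds))]   # keep the best fraction
    return np.bincount([j for _, j in top], minlength=n_features)
```

In the full method, the most frequent variables from this count would seed the construction of the final "productive" tree.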

3.
Gene selection is an important task in bioinformatics studies, because the accuracy of cancer classification generally depends on genes that have biological relevance to the classification problem. In this work, a randomization test (RT) is used as a gene selection method for gene expression data. A statistic derived from the regression coefficients of a series of partial least squares discriminant analysis (PLSDA) models is used to evaluate the significance of the genes. Informative genes are selected for classifying four gene expression datasets covering prostate cancer, lung cancer, leukemia and non-small cell lung cancer (NSCLC), and the rationality of the results is validated by multiple linear regression (MLR) modeling and principal component analysis (PCA). With the selected genes, satisfactory results are obtained.
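A randomization test of this kind scores a gene by comparing its observed class statistic against the distribution obtained when the class labels are permuted. The sketch below substitutes a simple difference-of-means statistic for the paper's PLSDA regression-coefficient statistic; the function name and defaults are illustrative only.

```python
import numpy as np

def permutation_p_value(x, y, n_perm=1000, seed=0):
    """One gene's significance: how often a label permutation produces a
    between-class mean difference at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = abs(x[y == 1].mean() - x[y == 0].mean())
    count = 0
    for _ in range(n_perm):
        yp = rng.permutation(y)
        if abs(x[yp == 1].mean() - x[yp == 0].mean()) >= observed:
            count += 1
    # add-one correction so the p-value is never exactly zero
    return (count + 1) / (n_perm + 1)
```

Genes with a small permutation p-value would then be retained as informative.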

4.
Gene expression data provide a means to classify and predict the diagnostic category of a patient. Selecting informative genes and representative samples are two important aspects of reducing gene expression data. Identifying and pruning redundant genes and samples simultaneously can improve classification performance and circumvent the local optima problem. In the present paper, a modified particle swarm optimization was applied to select optimal genes and samples simultaneously, and a support vector machine was used as the objective function to determine the optimum set of genes and samples. To evaluate the performance of the proposed method, it was applied to three publicly available microarray datasets. The results demonstrate that the proposed gene and sample selection method is a useful tool for mining high-dimensional data.

5.
With the development of bioinformatics, tumor classification from gene expression data has become an important and useful technology for cancer diagnosis. Since gene expression data often contain thousands of genes and only a small number of samples, gene selection becomes a key step for tumor classification. Attribute reduction from rough set theory has been successfully applied to gene selection, as it is data-driven and requires no additional information. However, traditional rough set methods deal with discrete data only. Gene expression data containing real-valued or noisy measurements are therefore usually discretized in a preprocessing step, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can handle real-valued data while maintaining the original gene classification information. Moreover, this paper introduces an entropy measure within the framework of neighborhood rough sets for tackling the uncertainty and noise of gene expression data. This measure enables the discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression datasets show that the proposed method is effective at improving the accuracy of tumor classification.

6.
Gene expression profile classification is a pivotal research domain assisting in the transformation from traditional to personalized medicine. A major challenge in gene expression data classification is the small number of samples relative to the large number of genes. To address this problem, researchers have devised various feature selection algorithms to reduce the number of genes. Recent studies have experimented with using semantic similarity between genes in the Gene Ontology (GO) to improve feature selection. While a few studies discuss how to use GO for feature selection, no simulation study addresses when to use GO-based feature selection. To investigate this, we developed a novel simulation that generates binary class datasets in which the differentially expressed genes between the two classes have some underlying relationship in GO. This allows us to investigate the effects of factors such as the relative connectedness of the underlying genes in GO, the mean magnitude of separation between differentially expressed genes denoted by δ, and the number of training samples. Our simulation results suggest that the connectedness in GO of the differentially expressed genes for a biological condition is the primary factor determining the efficacy of GO-based feature selection. In particular, as the connectedness of differentially expressed genes increases, the classification accuracy improvement increases. To quantify this notion of connectedness, we defined a measure called the Biological Condition Annotation Level, BCAL(G), where G is a graph of differentially expressed genes.
Our main conclusions with respect to GO-based feature selection are the following: (1) it increases classification accuracy when BCAL(G) ≥ 0.696; (2) it decreases classification accuracy when BCAL(G) ≤ 0.389; (3) it provides marginal accuracy improvement when 0.389 < BCAL(G) < 0.696 and δ < 1; (4) as the number of genes in a biological condition increases beyond 50 and δ ≥ 0.7, the improvement from GO-based feature selection decreases; and (5) we recommend not using GO-based feature selection when a biological condition has fewer than ten genes. Our results are derived from datasets preprocessed using RMA (Robust Multi-array Average), cases where δ is between 0.3 and 2.5, and training sample sizes between 20 and 200; our conclusions are therefore limited to these specifications. Overall, this simulation is innovative and addresses the question of when SoFoCles-style feature selection should be used for classification instead of statistical ranking measures.
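The numbered conclusions can be collected into a small helper that returns the recommendation for given BCAL(G), δ, and gene-count values. This is a reader's convenience sketch encoding only the main thresholds stated in the abstract, not software from the study:

```python
def go_fs_recommendation(bcal, delta, n_genes):
    """Recommendation for GO-based feature selection per the abstract's rules.
    Thresholds (0.696, 0.389, delta = 1, ten genes) are quoted from the text;
    the caveat about conditions with >50 genes and delta >= 0.7 is not encoded."""
    if n_genes < 10:        # rule (5): too few genes in the biological condition
        return "not recommended"
    if bcal >= 0.696:       # rule (1): accuracy increases
        return "recommended"
    if bcal <= 0.389:       # rule (2): accuracy decreases
        return "not recommended"
    if delta < 1:           # rule (3): marginal improvement
        return "marginal"
    return "unclear"        # intermediate BCAL with large separation: not stated
```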

7.
Objective

Support vector machines (SVMs) have drawn considerable attention due to their high generalisation ability and superior classification performance compared to other pattern recognition algorithms. However, the assumption that the learning data are identically generated from unknown probability distributions may limit the application of SVMs to real problems. In this paper, we propose a vicinal support vector classifier (VSVC) which is shown to effectively handle practical applications where the learning data may originate from different probability distributions.

Methods

The proposed VSVC method utilises a set of new vicinal kernel functions constructed through supervised clustering in the kernel-induced feature space. The approach comprises two steps. In the clustering step, a supervised kernel-based deterministic annealing (SKDA) clustering algorithm partitions the training data into different soft vicinal areas of the feature space in order to construct the vicinal kernel functions. In the training step, the SVM technique is used to minimise the vicinal risk function under the constraints of the vicinal areas defined in the SKDA clustering step.

Results

Experimental results on both artificial and real medical datasets show that the proposed VSVC achieves better classification accuracy and lower computational time than a standard SVM. For an artificial dataset constructed from non-separated data, the classification accuracy of VSVC is between 95.5% and 96.25% (using different cluster numbers), which compares favourably to the 94.5% achieved by SVM. The VSVC training time is between 8.75 s and 17.83 s (for 2–8 clusters), considerably less than the 65.0 s required by SVM. On a real mammography dataset, the best classification accuracy of VSVC is 85.7%, clearly outperforming a standard SVM, which obtains an accuracy of only 82.1%. A similar performance improvement is confirmed on two further real datasets, a breast cancer dataset (74.01% vs. 72.52%) and a heart dataset (84.77% vs. 83.81%), coupled with a reduction in learning time (32.07 s vs. 92.08 s and 25.00 s vs. 53.31 s, respectively). Furthermore, VSVC yields a number of support vectors equal to the specified cluster number, and hence a much sparser solution than a standard SVM.

Conclusion

Incorporating a supervised clustering algorithm into the SVM technique leads to a sparse but effective solution, while making the proposed VSVC adaptive to different probability distributions of the training data.

8.
A new spike sorting method based on the support vector machine (SVM) is proposed to resolve the superposition problem. Spike superposition is generally resolved by template matching, and previous template matching methods separate spikes with linear classifiers, whose classification performance is severely degraded by the background noise in spike trains. Nonlinear classifiers with high generalization ability are required for this task. A multi-class SVM classifier, composed of several binary SVM classifiers, is therefore applied to separate the spikes. Each binary SVM classifier, corresponding to one spike class, identifies both single and superposition spikes of that class. The superposition spikes are then decomposed through template extraction. Experimental results on simulated and real data demonstrate the utility of the proposed method.

9.
The present work proposes an automated medical diagnostic tool that classifies ECG beats. This is an important problem, as accurate and timely detection of cardiac arrhythmia can help provide proper medical attention to cure or reduce the ailment. The proposed scheme uses a cross-correlation based approach in which cross-spectral density information in the frequency domain is used to extract suitable features. A least squares support vector machine (LS-SVM) classifier is built on these features to classify ECG beats into three categories: normal beats, PVC beats and other beats. The three-class classification scheme is developed using a small training dataset and tested with a much larger testing dataset to show its generalization capability. When employed on 40 files from the MIT/BIH arrhythmia database, the scheme produced high classification accuracy in the range 95.51–96.12% and outperformed several competing algorithms.
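Cross-spectral density features of the kind described can be computed with a Fourier transform: multiply one signal's spectrum by the conjugate spectrum of a reference. The snippet below is a generic illustration assuming a reference beat template, not the paper's exact feature set; the number of retained bins is an arbitrary choice.

```python
import numpy as np

def cross_spectral_features(beat, template, n_features=8):
    """Cross-spectral density of an ECG beat against a reference template:
    the FFT of one signal times the conjugate FFT of the other. Magnitudes
    of the first few frequency bins serve as classifier features."""
    n = len(beat)
    csd = np.fft.rfft(beat) * np.conj(np.fft.rfft(template, n))
    return np.abs(csd)[:n_features]
```

The resulting low-dimensional vectors would then be fed to a classifier such as the LS-SVM.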

10.

Objective

Medical data sets are usually small yet have very high dimensionality. Too many attributes make the analysis less efficient without necessarily increasing accuracy, while too little data decreases modeling stability. Consequently, the main objective of this study is to extract the optimal subset of features to increase analytical performance when the data set is small.

Methods

This paper proposes a fuzzy-based non-linear transformation method to extend classification-related information from the original attribute values of a small data set. Based on the transformed data set, this study applies principal component analysis (PCA) to extract the optimal subset of features. Finally, we use the transformed data with these optimal features as input to a learning tool, a support vector machine (SVM). Six medical data sets, Pima Indians' diabetes, Wisconsin diagnostic breast cancer, Parkinson's disease, echocardiogram, the BUPA liver disorders dataset, and bladder cancer cases in Taiwan, are employed to illustrate the approach.
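The PCA feature-extraction step of the pipeline can be sketched in a few lines. This is a generic SVD-based PCA, not the study's implementation, and the preceding fuzzy transformation step is omitted:

```python
import numpy as np

def pca_features(X, n_components=2):
    """Project data onto its top principal components, computed via SVD of
    the centered data matrix; returns the component scores as features."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt: components
    return Xc @ Vt[:n_components].T
```

The returned scores would then be the inputs to the SVM learner.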

Results

This research uses the t-test to evaluate classification accuracy on a single data set, and the Friedman test to show that the proposed method outperforms other methods across multiple data sets. The experimental results indicate that the proposed method has better classification performance than either PCA or kernel principal component analysis (KPCA) when the data set is small, and suggest that creating new purpose-related information can improve analytical performance.

Conclusion

This paper has shown that feature extraction, like feature selection, is important for efficient data analysis. When the data set is small, using the fuzzy-based transformation method presented in this work to increase the available information produces better results than the PCA and KPCA approaches.

11.
The successful decoding of kinematic variables from the spike trains of motor cortical neurons is essential for cortical neural prostheses. Spike trains from each single unit must be extracted from extracellular neural signals, so a spike detection and sorting procedure is indispensable, but detection and sorting may involve considerable error. A decoding algorithm should therefore be robust to spike train errors. Here, we show that spike train decoding algorithms employing nonlinear mapping, especially a support vector machine (SVM), may be more advantageous, contrary to previous results which showed that an optimal linear filter is sufficient. The advantage becomes more conspicuous for erroneous spike trains. Using the SVM, satisfactory training of the decoder could be achieved much more easily than with a multilayer perceptron, which has been employed in previous studies. Tests were performed on simulated spike trains from primary motor cortical neurons with a realistic distribution of preferred directions. The results suggest that a neuroprosthetic device with a low-quality spike sorting preprocessor can be achieved by adopting a spike train decoder that is robust to spike sorting errors.

12.
The proposed system provides new textural information for segmenting tumours efficiently and accurately, with less computational time, from benign and malignant tumour images, especially for small tumour regions in computed tomography (CT) images. Region-based segmentation of tumours from brain CT image data is an important but time-consuming task performed manually by medical experts. The objective of this work is to segment brain tumours from CT images using combined grey-level and texture features, together with new edge features, and a nonlinear support vector machine (SVM) classifier. The selected optimal features are used to model and train the nonlinear SVM classifier to segment the tumour from CT images, and the segmentation accuracy is evaluated for each slice of the tumour image. The method is applied to real data comprising 80 benign and malignant tumour images, and the results are compared with the radiologist-labelled ground truth. Quantitative analysis between the ground truth and the segmented tumour is presented in terms of segmentation accuracy and the overlap similarity measure, the Dice metric. From this analysis, it is inferred that better segmentation accuracy and a higher Dice metric are achieved with the normalized cut segmentation method than with the fuzzy c-means clustering method.

13.
Our main interest in supervised classification of gene expression data is to infer whether the expressions can discriminate biological characteristics of samples. With thousands of gene expressions to consider, gene selection has been advocated to simplify classification by including only the discriminating genes. We propose to make the gene selection based on partial least squares (PLS) and logistic regression random-effects (RE) estimates before the selected genes are evaluated in classification models. We compare this selection with selection based on the two-sample t-statistic, the current practice, and on modified t-statistics. The results indicate that gene selection based on logistic regression RE estimates is recommended in general, while selection based on the PLS estimates is recommended when the number of samples is low. Gene selection based on the modified t-statistics performs well when the genes exhibit moderate-to-high variability with moderate group separation. Respecting the characteristics of the data is a key aspect to consider in gene selection.
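The two-sample t-statistic baseline described as current practice is straightforward to sketch. The function below ranks genes by absolute t-statistic (Welch form) and is an illustration, not the paper's code:

```python
import numpy as np

def t_statistic_ranking(X, y):
    """Rank genes (columns of X) by absolute two-sample t-statistic between
    the two classes in y; most discriminative gene index comes first."""
    a, b = X[y == 0], X[y == 1]
    t = (a.mean(0) - b.mean(0)) / np.sqrt(
        a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
    return np.argsort(-np.abs(t))
```

The PLS and RE selections compared in the paper would replace the t-statistic with their own per-gene scores, keeping the same ranking step.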

14.
This study proposes an improved decision forest (IDF) with an integrated graphical user interface. On four gene expression data sets, the IDF not only outperforms the original decision forest, but is also superior or comparable to other state-of-the-art machine learning methods, especially for high-dimensional data. With an integrated built-in feature selection (FS) mechanism and fewer parameters to tune, it can be trained more efficiently than methods such as the support vector machine, and can be built with far fewer trees than other popular tree-based ensemble methods. Moreover, it suffers less from the curse of dimensionality.

15.
Selecting a subset of genes with strong discriminative power is a very important step in classification problems based on gene expression data. The Lasso and the Dantzig selector are known to perform automatic variable selection in linear regression analysis. This paper applies the Lasso and the Dantzig selector to select the most informative genes for representing the probability of an example being positive as a linear function of the gene expression data. The selected genes are then used to fit different classifiers for cancer classification. Comparative experiments were conducted on six publicly available cancer datasets, and the detailed results show that, in general, the Lasso is more capable than the Dantzig selector of selecting informative genes for cancer classification.
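The Lasso's automatic variable selection comes from its L1 penalty, which drives uninformative coefficients exactly to zero, so the selected genes are simply the nonzero coordinates. Below is a minimal coordinate-descent Lasso for illustration (columns assumed standardized; `lam` and the sweep count are arbitrary choices, and a real analysis would use a tuned, established implementation):

```python
import numpy as np

def lasso_select(X, y, lam=0.5, n_iter=200):
    """Tiny coordinate-descent Lasso for (1/2n)||y - Xw||^2 + lam*||w||_1;
    returns indices of features with nonzero coefficients."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]          # residual excluding feature j
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            # soft-thresholding update: small correlations are zeroed out
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return np.flatnonzero(w)
```

The Dantzig selector solves a related linear program and tends to select similar, but not identical, gene subsets.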

16.
To improve the walking fluency and human–machine coordination of a lower-limb exoskeleton robot and its wearer, this paper proposes a method for recognizing the wearer's walking speed based on inertial sensor signals. Triaxial acceleration and triaxial angular velocity signals are collected at the thigh and shank, the signal over the 0.5 s window preceding the current moment is extracted, and the Fourier transform coefficients of the frequency-domain signal are used as features. A support vector machine (SVM) combined with a hidden Markov model (HMM) serves as the classification model, which is trained and used for speed recognition. Finally, the walking speed at the current moment is predicted by combining the pattern of speed changes with the human–machine constraint force. Experimental results show that the proposed method can effectively recognize the wearer's walking-speed intent, achieving a recognition rate of 92.14% over seven speed modes. This method provides a new approach to human–machine coordinated control between the exoskeleton and its wearer.
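The windowed Fourier feature extraction described above can be sketched as follows for one sensor channel; the sampling rate, the window length in samples, and the number of retained coefficients are assumptions for illustration, not values from the paper:

```python
import numpy as np

def window_fft_features(signal, fs=100, window_s=0.5, n_coeffs=6):
    """Magnitudes of the first Fourier coefficients of the most recent
    0.5 s window of one inertial channel (fs is an assumed sampling rate)."""
    n = int(fs * window_s)
    window = np.asarray(signal, dtype=float)[-n:]   # last 0.5 s before "now"
    return np.abs(np.fft.rfft(window))[:n_coeffs]
```

In the full method, such vectors from all twelve thigh and shank channels would be concatenated and passed to the SVM/HMM classifier.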

17.
Independent component analysis (ICA) has been widely applied to the analysis of microarray datasets. Although it has been pointed out that, after ICA transformation, different independent components (ICs) carry different biological significance, the IC selection problem is still far from fully explored. In this paper, we propose a genetic algorithm (GA) based ensemble independent component selection (EICS) system. In this system, the GA selects a set of optimal IC subsets, which are then used to build diverse and accurate base classifiers. Finally, all base classifiers are combined using the majority vote rule. To show the validity of the proposed method, we apply it to classify three DNA microarray data sets involving various human normal and tumor tissue samples. The experimental results show that our ensemble method obtains stable and satisfactory classification results compared with several existing methods.
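The majority-vote combination of base classifiers is the simplest ensemble rule: each sample receives the label most of the base classifiers predict for it. A minimal sketch (ties resolved by first-seen label, an arbitrary choice):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine base-classifier outputs by majority vote. `predictions` is a
    list of per-classifier label lists, one label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```

With an odd number of base classifiers and binary labels, ties cannot occur, which is one reason ensembles are often built with an odd member count.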

18.
Translation of electroencephalographic (EEG) recordings into control signals for brain–computer interface (BCI) systems needs to be based on a robust classification of the various types of information. EEG-based BCI features are often noisy and likely to contain outliers. This contribution describes the application of a fuzzy support vector machine (FSVM) with a radial basis function kernel for classifying motor imagery tasks, with statistical features over the set of wavelet coefficients extracted to characterize the time–frequency distribution of the EEG signals. In the proposed FSVM classifier, a low fraction of support vectors was used as the criterion for choosing the kernel parameter and the trade-off parameter, together with the membership parameter, based solely on training data. The FSVM and support vector machine (SVM) classifiers outperformed the winner of the BCI Competition 2003 and other similar studies on the same Graz dataset in terms of the competition criterion of mutual information (MI), with the FSVM classifier yielding better performance than the SVM approach. Both classifiers also perform much better than the winner of the BCI Competition 2005 on the same Graz dataset for subject O3 according to the competition criterion of maximal MI steepness, with the FSVM classifier again outperforming the SVM method. The proposed FSVM model has potential for reducing the effects of noise and outliers in the online classification of EEG signals in BCIs.

19.
The degree of malignancy of a brain glioma needs to be assessed from MRI findings and clinical data before operating. Previous attempts to solve this problem used a fuzzy rule extraction algorithm based on fuzzy min-max neural networks. We utilize support vector machines with the floating search method to select relevant features and to predict the degree of malignancy. Computational results show that the feature subset selected by our technique yields better classification performance. In contrast with the baseline method, which generated two rules and obtained 83.21% accuracy on the whole data set, our method generates one rule and yields 88.21% accuracy.

20.
The clinical feature selection problem is the task of selecting and identifying a subset of informative clinical features that are useful for promoting accurate clinical diagnosis. This task has pragmatic value in clinical settings, as each clinical test is associated with a different financial cost, diagnostic value, and measurement risk. Moreover, with the continual introduction of new clinical features, the need to repeat feature selection can be very time consuming. To address this issue, we propose a novel feature selection technique for the diagnosis of myocardial infarction (MI), one of the leading causes of morbidity and mortality in many high-income countries. The method adopts the conceptual framework of the biological continuum, the optimization capability of the genetic algorithm for performing feature selection, and the classification ability of the support vector machine. Together, a network of clinical risk factors, called the biological continuum based etiological network (BCEN), was constructed. Evaluation of the proposed method was carried out using the cardiovascular heart study (CHS) dataset. Results demonstrate that a significant 4.73-fold speedup can be achieved in the development of the MI classification model. The key advantage of this methodology is the provision of a reusable (feature subset) paradigm for the efficient development of up-to-date and efficacious clinical classification models.

