Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
With the development of bioinformatics, tumor classification from gene expression data has become an important technology for cancer diagnosis. Since gene expression data often contain thousands of genes but only a small number of samples, gene selection is a key step for tumor classification. Attribute reduction based on rough sets has been successfully applied to gene selection because it is data-driven and requires no additional information. However, the traditional rough set method handles discrete data only. Gene expression data containing real-valued or noisy measurements are therefore usually discretized in a preprocessing step, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can handle real-valued data while preserving the original gene classification information. Moreover, this paper introduces an entropy measure within the framework of neighborhood rough sets to tackle the uncertainty and noise of gene expression data. Using this measure leads to the discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression datasets show that the proposed gene selection method is effective in improving the accuracy of tumor classification.
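
The neighborhood idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract's entropy measure is not specified, so the sketch swaps in the standard neighborhood dependency (the fraction of samples whose neighborhood granule contains only samples of their own class) as the selection criterion. The function names, the Euclidean distance, and the radius `delta` are assumptions made for illustration.

```python
from math import sqrt

def neighborhood_dependency(X, y, genes, delta):
    """Fraction of samples whose delta-neighborhood (on `genes`) is class-pure."""
    def dist(a, b):
        return sqrt(sum((a[g] - b[g]) ** 2 for g in genes))

    consistent = 0
    for i, xi in enumerate(X):
        # Neighborhood granule of sample i: every sample within radius delta.
        neighbors = [j for j, xj in enumerate(X) if dist(xi, xj) <= delta]
        if all(y[j] == y[i] for j in neighbors):
            consistent += 1
    return consistent / len(X)

def greedy_select(X, y, delta, eps=1e-9):
    """Forward search: keep adding the gene that most increases the dependency.

    Assumption: dependency stands in for the paper's (unspecified) entropy measure.
    """
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        gene = max(remaining,
                   key=lambda g: neighborhood_dependency(X, y, selected + [g], delta))
        score = neighborhood_dependency(X, y, selected + [gene], delta)
        if score - best <= eps:  # no further improvement: stop
            break
        selected.append(gene)
        remaining.remove(gene)
        best = score
    return selected, best

# Toy data (hypothetical): gene 0 separates the classes, gene 1 is noise.
X = [[0.1, 0.50], [0.2, 0.52], [0.9, 0.48], [1.0, 0.51]]
y = [0, 0, 1, 1]
print(greedy_select(X, y, delta=0.3))
```

Because distances are computed directly on real-valued expression levels, no discretization step is needed, which is the property the abstract emphasizes.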

2.
Compared with backward feature selection (BFS) methods in gene expression data analysis, forward feature selection (FFS) methods can obtain the desired feature subset in fewer iterations. However, far fewer FFS methods than BFS methods are available, so more efficient FFS methods need to be developed. In this paper, two FFS methods based on pruning the classifier ensembles generated by single attributes are proposed for gene selection. The main contributions are as follows: (1) a new loss function, the p-insensitive loss function, is proposed to overcome the disadvantage of the margin Euclidean distance loss function in the pruning of classifier ensembles; (2) two FFS methods based on the margin Euclidean distance loss function and the p-insensitive loss function, named FFS-ACSA1 and FFS-ACSA2 respectively, are proposed; (3) comparison experiments on four gene expression datasets show that FFS-ACSA2 obtains the best results among three FFS methods (i.e. signal-to-noise ratio (SNR), FFS-ACSA1 and FFS-ACSA2) and is competitive with the well-known support vector machine-based recursive feature elimination (SVM-RFE), while FFS-ACSA1 is unstable.
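
The SNR baseline named in the abstract above can be sketched as follows. The abstract does not define SNR, so this assumes the definition commonly used for two-class microarray data, |mean1 - mean0| / (std1 + std0) per gene; the function names and the 0/1 label encoding are illustrative assumptions, and FFS-ACSA1 and FFS-ACSA2 themselves are not reproduced here.

```python
from statistics import mean, stdev

def snr_scores(X, y):
    """Per-gene signal-to-noise ratio for binary labels 0/1 (assumed definition)."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        denom = stdev(a) + stdev(b)  # sample standard deviations of each class
        scores.append(abs(mean(b) - mean(a)) / denom if denom > 0 else 0.0)
    return scores

def top_genes(X, y, k):
    """Indices of the k genes with the highest SNR, best first."""
    scores = snr_scores(X, y)
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:k]

# Toy data (hypothetical): gene 0 differs between classes, gene 1 does not.
X = [[1.0, 5.0], [2.0, 4.0], [8.0, 5.0], [9.0, 4.0]]
y = [0, 0, 1, 1]
print(top_genes(X, y, k=1))
```

Each class needs at least two samples, since `statistics.stdev` is undefined for a single value.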

3.
Owing to recent advances in DNA microarray technology, the diagnostic category of tissue samples can be predicted with high accuracy from gene expression profiles. In this study, we discuss shortcomings of some existing gene expression profile classification methods and propose a new approach based on linear Bayesian classifiers. In our approach, we first construct gene-level linear classifiers to identify genes that provide high class-prediction accuracies, i.e., low error rates. After this screening phase, starting with the gene that offers the lowest error rate, we construct a multi-dimensional linear classifier by incorporating the next best-performing genes until the prediction error reaches its minimum (0, if possible). When we compared the classification performance of our approach against approaches based on prediction analysis of microarrays (PAM) and support vector machines (SVM), we found that our method outperforms PAM and produces results comparable with SVM. In addition, we observed that the gene selection scheme of PAM could be misleading. Although SVM achieves relatively higher prediction performance, it has two major disadvantages: complexity and lack of insight into important genes. Our intuitive approach offers competitive performance and also an efficient means of finding important genes.

4.
Gene expression profile classification is a pivotal research domain assisting in the transformation from traditional to personalized medicine. A major challenge associated with gene expression data classification is the small number of samples relative to the large number of genes. To address this problem, researchers have devised various feature selection algorithms to reduce the number of genes. Recent studies have been experimenting with the use of semantic similarity between genes in Gene Ontology (GO) as a method to improve feature selection. While there are few studies that discuss how to use GO for feature selection, there is no simulation study that addresses when to use GO-based feature selection. To investigate this, we developed a novel simulation, which generates binary class datasets, where the differentially expressed genes between two classes have some underlying relationship in GO. This allows us to investigate the effects of various factors such as the relative connectedness of the underlying genes in GO, the mean magnitude of separation between differentially expressed genes denoted by δ, and the number of training samples. Our simulation results suggest that the connectedness in GO of the differentially expressed genes for a biological condition is the primary factor for determining the efficacy of GO-based feature selection. In particular, as the connectedness of differentially expressed genes increases, the classification accuracy improvement increases. To quantify this notion of connectedness, we defined a measure called Biological Condition Annotation Level BCAL(G), where G is a graph of differentially expressed genes. 
Our main conclusions with respect to GO-based feature selection are the following: (1) it increases classification accuracy when BCAL(G) ≥ 0.696; (2) it decreases classification accuracy when BCAL(G) ≤ 0.389; (3) it provides marginal accuracy improvement when 0.389 < BCAL(G) < 0.696 and δ < 1; (4) as the number of genes in a biological condition increases beyond 50 and δ  0.7, the improvement from GO-based feature selection decreases; and (5) we recommend not using GO-based feature selection when a biological condition has fewer than ten genes. Our results are derived from datasets preprocessed using RMA (Robust Multi-array Average), cases where δ is between 0.3 and 2.5, and training sample sizes between 20 and 200; our conclusions are therefore limited to these specifications. Overall, this simulation is innovative and addresses the question of when SoFoCles-style feature selection should be used for classification instead of statistical-based ranking measures.

5.
Objective: Feature selection (FS) methods are widely used in grading and diagnosing prostate histopathological images. In this context, FS is based on the texture features obtained from the lumen, nuclei, cytoplasm and stroma, all of which are important tissue components. However, it is difficult to represent the high-dimensional textures of these tissue components. To solve this problem, we propose a new FS method that enables the selection of features with minimal redundancy in the tissue components. Methodology: We categorise tissue images based on the texture of individual tissue components via the construction of a single classifier and also construct an ensemble learning model by merging the values obtained by each classifier. Another issue that arises is overfitting due to the high-dimensional texture of individual tissue components. We propose a new FS method, SVM-RFE(AC), that integrates a Support Vector Machine-Recursive Feature Elimination (SVM-RFE) embedded procedure with an absolute cosine (AC) filter method to prevent redundancy in the selected features of the SVM-RFE and an unoptimised classifier in the AC. Results: We conducted experiments on H&E histopathological prostate and colon cancer images with respect to three prostate classifications, namely benign vs. grade 3, benign vs. grade 4 and grade 3 vs. grade 4. The colon benchmark dataset requires a distinction between grades 1 and 2, which are the most difficult cases to distinguish in the colon domain. The results obtained by both the single and ensemble classification models (the latter using the product rule as its merging method) confirm that the proposed SVM-RFE(AC) is superior to the other SVM and SVM-RFE-based methods. Conclusion: We developed an FS method based on SVM-RFE and AC and successfully showed that its use enabled the identification of the most crucial texture feature of each tissue component. Thus, it makes possible the distinction between multiple Gleason grades (e.g. grade 3 vs. grade 4) and its performance is far superior to other reported FS methods.
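
The redundancy-filtering half of the idea above can be sketched as follows. This is a hedged illustration, not the published SVM-RFE(AC) procedure: the abstract does not say how the two steps are interleaved, so the sketch assumes a feature ranking is already available (for example from an SVM-RFE run) and only applies an absolute cosine filter to it. The function names and the threshold value are hypothetical.

```python
from math import sqrt

def abs_cosine(u, v):
    """Absolute cosine similarity between two feature columns."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return abs(dot) / (nu * nv) if nu > 0 and nv > 0 else 0.0

def filter_redundant(columns, ranking, threshold=0.95):
    """Walk the ranking best-first; keep a feature only if it is not nearly
    collinear (|cos| >= threshold) with any feature already kept."""
    kept = []
    for f in ranking:
        if all(abs_cosine(columns[f], columns[k]) < threshold for k in kept):
            kept.append(f)
    return kept

# Toy data (hypothetical): feature 1 is an exact multiple of feature 0.
columns = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, -1.0, 0.0]]
ranking = [0, 1, 2]  # assumed to come from a ranking step such as SVM-RFE
print(filter_redundant(columns, ranking))
```

Taking the absolute value treats a feature and its negation as equally redundant, which is the usual reason for preferring |cos| over the signed cosine in a filter of this kind.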

6.
Our main interest in supervised classification of gene expression data is to infer whether the expressions can discriminate biological characteristics of samples. With thousands of gene expressions to consider, gene selection has been advocated to decrease classification error by including only the discriminating genes. We propose to base gene selection on partial least squares (PLS) and logistic regression random-effects (RE) estimates before the selected genes are evaluated in classification models. We compare this selection with selection based on the two-sample t-statistic, which is current practice, and on modified t-statistics. The results indicate that gene selection based on logistic regression RE estimates is recommended in general, while selection based on the PLS estimates is recommended when the number of samples is low. Gene selection based on the modified t-statistics performs well when the genes exhibit moderate-to-high variability with moderate group separation. Respecting the characteristics of the data is a key aspect to consider in gene selection.

7.
Gene expression datasets are a means to classify and predict the diagnostic categories of a patient. Selecting informative genes and representative samples are two important aspects of reducing gene expression data. Identifying and pruning redundant genes and samples simultaneously can improve classification performance and circumvent the local optima problem. In the present paper, a modified particle swarm optimization was applied to select optimal genes and samples simultaneously, and a support vector machine was used as the objective function to determine the optimum set of genes and samples. To evaluate its performance, the proposed method was applied to three publicly available microarray datasets. The results demonstrate that the proposed method for gene and sample selection is a useful tool for mining high-dimensional data.

8.
Gene selection is important for cancer classification based on gene expression data because of the high dimensionality and small sample size. In this paper, we present a new gene selection method based on clustering, in which dissimilarity measures are obtained through kernel functions. It iteratively searches for the best gene weights while optimizing the clustering objective function. An adaptive distance is used in the process, which is suitable for learning the weights of genes during clustering and improves the performance of the algorithm. The proposed algorithm is simple and does not require any modification or parameter optimization for each dataset. We tested it on eight publicly available datasets using two classifiers (support vector machine and k-nearest neighbor) and compared it with six other competitive feature selectors. The results show that the proposed algorithm achieves better accuracies and may be an efficient tool for finding possible biomarkers from gene expression data.

9.
10.
Objective: To propose a new flexible and sparse classifier that results in interpretable decision support systems. Methods: Support vector machines (SVMs) for classification are very powerful methods to obtain classifiers for complex problems. Although the performance of these methods is consistently high and non-linearities and interactions between variables can be handled efficiently when using non-linear kernels such as the radial basis function (RBF) kernel, their use in domains where interpretability is an issue is hampered by their lack of transparency. Many feature selection algorithms have been developed to allow for some interpretation, but the impact of the different input variables on the prediction still remains unclear. Alternative models using additive kernels are restricted to main effects, reducing their usefulness in many applications. This paper proposes a new approach to expand the RBF kernel into interpretable and visualizable components, including main and two-way interaction effects. In order to obtain a sparse model representation, an iterative l1-regularized parametric model using the interpretable components as inputs is proposed. Results: Results on toy problems illustrate the ability of the method to select the correct contributions and an improved performance over standard RBF classifiers in the presence of irrelevant input variables. For a 10-dimensional x-or problem, an SVM using the standard RBF kernel obtains an area under the receiver operating characteristic curve (AUC) of 0.947, whereas the proposed method achieves an AUC of 0.997. The latter additionally identifies the relevant components. In a second 10-dimensional artificial problem, the underlying class probability follows a logistic regression model. An SVM with the RBF kernel results in an AUC of 0.975, as opposed to 0.994 for the presented method. The proposed method is applied to two benchmark datasets: the Pima Indian diabetes and the Wisconsin Breast Cancer dataset. The AUC is in both cases comparable to those of the standard method (0.826 versus 0.826 and 0.990 versus 0.996) and those reported in the literature. The selected components are consistent with different approaches reported in other work. However, this method is able to visualize the effect of each of the components, allowing for interpretation of the learned logic by experts in the application domain. Conclusions: This work proposes a new method to obtain flexible and sparse risk prediction models. The proposed method performs as well as a support vector machine using the standard RBF kernel, but has the additional advantage that the resulting model can be interpreted by experts in the application domain.

11.
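Objective: To construct a transient expression vector that can simultaneously express shRNAs targeting two genes, in order to improve RNA interference efficiency and reduce the nonspecific toxicity of the vector. Methods: The vector was built on the basic plasmid sequence of pUC19 with the H1 promoter recognized by RNA polymerase III, and two multiple cloning sites were designed by exploiting the properties of isocaudomers; the vector pCSH1 was constructed by DNA recombination. Based on pCSH1, single-interference vectors targeting N-Ras or c-Myc and a dual-interference vector targeting both N-Ras and c-Myc were constructed and their interference efficiency was verified; the antitumor activity of the vectors was verified by colony formation assay. In addition, two shRNA transcription units targeting N-Ras were linked in tandem in one vector to test the advantage of this arrangement. Results: The transient expression vector pCSH1, capable of simultaneously expressing shRNAs against two genes, was successfully designed and constructed. Validation showed that the single-interference shRNA expression vector pCSH1-shNR targeting N-Ras and the single-interference shRNA expression vector pCSH1-shMyc targeting c-Myc markedly interfered with the mRNA and protein of their target genes; the dual-interference vector pCSH1-shNM also suppressed the mRNA and protein expression levels of both N-Ras and c-Myc, and showed the highest inhibition rate of cell colony formation. The inhibitory effect on cell growth of pCSH1-2shNR, the vector carrying two tandem N-Ras-shRNA expression units, was measured by cell growth curves. With equal plasmid mass, the double-expression vector pCSH1-2shNR inhibited cell growth significantly more strongly than the single N-Ras-shRNA expression vector pCSH1-shNR, whereas equal masses of the control plasmids pCSH1-2shMock and pCSH1-shMock did not differ appreciably in their inhibition of cell growth. Conclusion: Using the isocaudomer principle, a transient expression vector, pCSH1, that can simultaneously express two shRNAs was constructed; it enables simultaneous interference with two genes. pCSH1 also has the advantages of a small molecular weight, shRNA expression units that can be linked in tandem, and enhanced RNA interference activity after tandem linkage. By simultaneously inhibiting two target genes, or by increasing the number of shRNA expression units against a single target gene, the vector can achieve better antitumor activity.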

12.
Genome-wide profiling of gene expression, made possible by the development of DNA microarray technology and made more powerful by the sequencing of the human genome, has led to advances in tumor classification and biomarker discovery for the common types of human neoplasia. Application of this approach to the field of endocrine neoplasia is in its infancy, although some progress has recently been reported. In this review, the progress to date is summarized, and the promise of DNA microarray analysis, in conjunction with tissue array immunohistochemistry, to significantly impact endocrine tumor diagnosis and prognosis is discussed.

13.
The Escherichia coli thymidine kinase/thymidylate kinase (tk/tmk) fusion gene encodes an enzyme that efficiently converts the prodrug 3'-azido-2',3'-dideoxythymidine (AZT) into its toxic triphosphate derivative, a substance which stops DNA chain elongation. Integration of this marker gene into vaccinia virus that normally is not inhibited by AZT allowed the establishment of a powerful selection procedure for recombinant viruses. In contrast to the conventional vaccinia thymidine kinase (tk) selection that is performed in tk-negative cell lines, AZT selection can be performed in normal (tk-positive) cell lines. The technique is especially useful for the generation of replication-deficient vaccinia viruses and may also be used for gene knock-out studies of essential vaccinia genes.

14.
Mammography is a widely used screening tool and is the gold standard for the early detection of breast cancer. The classification of breast masses into the benign and malignant categories is an important problem in the area of computer-aided diagnosis of breast cancer. A small dataset of 57 breast mass images, each with 22 features computed, was used in this investigation; the same dataset has been previously used in other studies. The extracted features relate to edge-sharpness, shape, and texture. The novelty of this paper is the adaptation and application of the classification technique called genetic programming (GP), which performs feature selection implicitly. To refine the pool of features available to the GP classifier, we used feature-selection methods, including three statistical measures: Student's t test, the Kolmogorov–Smirnov test, and the Kullback–Leibler divergence. Both the training and test accuracies obtained were high: above 99.5% for training and typically above 98% for test experiments. A leave-one-out experiment showed 97.3% success in the classification of benign masses and 95.0% success in the classification of malignant tumors. A shape feature known as fractional concavity was found to be the most important among those tested, since it was automatically selected by the GP classifier in almost every experiment.
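
One of the three statistical measures mentioned above, the two-sample Kolmogorov–Smirnov statistic, can be sketched in a few lines. This only illustrates how such a measure can rank features before a classifier sees them; how the paper combines it with the t test, the Kullback–Leibler divergence, and the GP classifier is not stated in the abstract. The function names and the toy values are hypothetical.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs of a and b."""
    gap = 0.0
    for x in list(a) + list(b):
        # Empirical CDFs evaluated at each observed value; the maximum gap
        # between two step functions is always attained at a data point.
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def rank_features(benign, malignant):
    """Rank feature indices by KS statistic, most discriminative first.

    `benign` and `malignant` are lists of samples, each a list of feature values.
    """
    n_features = len(benign[0])
    scores = [ks_statistic([s[f] for s in benign], [s[f] for s in malignant])
              for f in range(n_features)]
    return sorted(range(n_features), key=lambda f: scores[f], reverse=True)

# Toy data (hypothetical): feature 1 separates the groups, feature 0 does not.
benign = [[0.5, 1.0], [0.6, 2.0], [0.4, 3.0]]
malignant = [[0.5, 4.0], [0.6, 5.0], [0.4, 6.0]]
print(rank_features(benign, malignant))
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), so it needs no assumption about the shape of the feature distributions, unlike the t test.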

15.
Gene expression data collected from DNA microarrays are characterized by a large number of variables (genes) but only a small number of observations (experiments). In this paper, a manifold learning method is proposed to map the gene expression data to a low-dimensional space and then explore the intrinsic structure of the features so as to classify the microarray data more accurately. The proposed algorithm can project the gene expression data into a subspace with high intra-class compactness and inter-class separability. Experimental results on six DNA microarray datasets demonstrate that our method is efficient for discriminant feature extraction and gene expression data classification. This work is a meaningful attempt to analyze microarray data using manifold learning; given its performance, there should be much room for applying manifold learning to bioinformatics.

16.
This study investigates the effect of feature dimensionality reduction strategies on the classification of surface electromyography (EMG) signals, toward developing a practical myoelectric control system. Two dimensionality reduction strategies, feature selection and feature projection, were each tested on both EMG feature sets. A feature selection based myoelectric pattern recognition system was introduced to select features by eliminating the redundant features of EMG recordings instead of directly choosing a subset of EMG channels. The Markov random field (MRF) method and a forward orthogonal search algorithm were each employed to evaluate the contribution of each individual feature to the classification. Our results from 15 healthy subjects indicate that, with a feature selection analysis and independent of the type of feature set, high overall accuracies can be achieved across all subjects in the classification of seven different forearm motions with a small number of top-ranked original EMG features obtained from the forearm muscles (average overall classification accuracy >95% with 12 selected EMG features). Compared with various feature dimensionality reduction techniques in myoelectric pattern recognition, the proposed filter-based feature selection approach is independent of the type of classification algorithm and features, and can effectively reduce redundant information not only across different channels but also across different features in the same channel. This may enable robust EMG feature dimensionality reduction without needing to change the classification algorithms in ongoing, practical use, an important step toward clinical utility.

17.
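Objective: To identify differentially expressed genes and proteins in the hippocampal tissue of rats with temporal lobe epilepsy, so as to lay a foundation for further investigating the pathogenesis of temporal lobe epilepsy, finding new therapeutic targets, and developing new treatments. Methods: cDNA microarray, two-dimensional electrophoresis, and MALDI-TOF-MS were used to analyze the gene expression profile and protein expression profile of hippocampal tissue from the lithium chloride-pilocarpine (LiCl-PILO) rat model of epilepsy, and the differentially expressed genes and proteins found were analyzed and identified. Results and conclusions: In the hippocampal tissue of LiCl-PILO epileptic rats, 192 genes were differentially expressed, of which 159 could be found in GenBank; of these, 84 were up-regulated and 75 were down-regulated. Seventy-eight differentially expressed protein spots were screened out, of which 31 were down-regulated and 47 were up-regulated in the epilepsy group. Five proteins were ultimately identified and confirmed. These results provide an experimental basis for using proteomic methods to search for new therapeutic targets for epilepsy.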

18.
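Relationship between interleukin-18 gene polymorphisms and IL-18 expression level   Total citations: 2 (self-citations: 0, citations by others: 2)
Objective: To investigate whether the -607C/A and -137G/C polymorphisms in the IL-18 gene promoter affect its expression level. Methods: Eighty healthy individuals undergoing physical examination were selected, and their genotypes at the -137G/C and -607C/A sites of the IL-18 gene promoter region were determined. Peripheral blood mononuclear cells (PBMC) were isolated from these subjects and cultured at a uniform concentration for 24 h, after which the cultured cells and supernatants were collected. The IL-18 content of the supernatant was measured by ELISA, and the IL-18 mRNA level in the cultured cells was measured by RT-PCR. Results: For individuals with the CC, CA, and AA genotypes at position -607 of the IL-18 gene promoter, the concentrations of IL-18 secreted by PBMC were (11.54±6.48) ng/ml, (10.92±5.16) ng/ml, and (11.79±3.18) ng/ml, and the IL-18 mRNA expression levels were 0.878±0.633, 0.877±0.521, and 0.881±0.400, respectively. For individuals with the GG genotype and with the GC or CC genotype at position -137, the concentrations of IL-18 secreted by PBMC were (11.27±5.42) ng/ml and (11.31±4.62) ng/ml, and the IL-18 mRNA expression levels were 0.835±0.485 and 0.984±0.613, respectively. The differences in IL-18 secretion and expression by PBMC among the genotypes at either site were not significant (P>0.05). Conclusion: The -607C/A and -137G/C polymorphisms in the promoter upstream of exon 1 of the IL-18 gene have no obvious effect on IL-18 secretion or expression level.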

19.
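Objective: To clone the LIGHT gene, construct an expression vector containing the human LIGHT gene, induce its soluble expression in Escherichia coli, and test the biological activity of the expressed LIGHT protein. Methods: The full-length LIGHT cDNA and its extracellular-region fragment were cloned from human peripheral blood mononuclear cells, and the extracellular-region fragment was subcloned into the prokaryotic expression vector pET-11a. The positive recombinant plasmid pET-LIGHT was screened, soluble expression was induced with IPTG, and the product was analyzed by SDS-PAGE and Western blot. After preliminary purification, the expressed protein was assayed for biological activity. Results: RT-PCR amplified the full-length 723 bp LIGHT cDNA. SDS-PAGE and Western blot analysis confirmed that the recombinant pET-LIGHT plasmid expressed a protein with a relative molecular mass (Mr) of 19 × 10³. The soluble recombinant LIGHT protein co-stimulated T-cell proliferation and induced IFN-γ production. Conclusion: The extracellular-region fragment of LIGHT was successfully expressed in E. coli, and the expressed protein is biologically functional, which lays a foundation for further functional studies of the LIGHT gene.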

20.
Summary: We describe the construction of a cosmid cloning vector, pMT555, which allows positive selection for the presence of an inserted DNA fragment. The vector contains sequences which enable its replication and selection in either E. coli or Saccharomyces cerevisiae. We demonstrate that pMT555 may be used for the efficient construction of total genomic banks from small quantities of donor DNA. The positive selection permits the stable maintenance of the cosmid in E. coli and the faithful replication of inserted sequences.
