Similar articles
Found 20 similar articles (search time: 31 ms)
1.
Mass spectrometry is being used to generate protein profiles from human serum, and proteomic data obtained from mass spectrometry have attracted great interest for the detection of early-stage cancer. However, high-dimensional mass spectrometry data pose considerable challenges. In this paper we propose a feature extraction algorithm based on wavelet analysis for high-dimensional mass spectrometry data. A set of wavelet detail coefficients at different scales is used to detect transient changes in mass spectrometry data. The experiments are performed on 2 datasets. The accuracy achieved is highly competitive with the best performance of other kinds of classification models. Experimental results show that wavelet detail coefficients are an efficient way to characterize the features of high-dimensional mass spectra and to reduce their dimensionality.
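To make the idea of multi-scale detail coefficients concrete, here is a minimal sketch in plain Python. It assumes the Haar wavelet and a toy 8-point "spectrum"; the abstract does not say which wavelet or how many levels the authors used, so `haar_step`, `detail_features`, and the data are illustrative only.

```python
import math

def haar_step(signal):
    """One level of the Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2:
        signal = signal + [signal[-1]]  # pad odd-length input by repeating the last sample
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def detail_features(spectrum, levels):
    """Collect the detail coefficients at each scale, finest scale first."""
    features = []
    approx = list(spectrum)
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx, detail = haar_step(approx)
        features.append(detail)
    return features

# Hypothetical toy spectrum: flat baseline with one sharp peak.
spectrum = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0, 1.0]
details = detail_features(spectrum, levels=2)
```

The only non-zero coefficients sit where the peak is, which is the sense in which detail coefficients pick out transient changes while leaving flat regions at zero.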

2.
Data mining techniques for cancer detection using serum proteomic profiling   Total citations: 9, self-citations: 0, citations by others: 9
OBJECTIVE: Pathological changes in an organ or tissue may be reflected in proteomic patterns in serum. It is possible that unique serum proteomic patterns could be used to discriminate cancer samples from non-cancer ones. Due to the complexity of proteomic profiling, a higher-order analysis such as data mining is needed to uncover the differences in complex proteomic patterns. The objectives of this paper are (1) to briefly review the application of data mining techniques in proteomics for cancer detection/diagnosis; (2) to explore a novel analytic method with different feature selection methods; (3) to compare the results obtained on different datasets with those reported by Petricoin et al. in terms of detection performance and selected proteomic patterns. METHODS AND MATERIAL: Three serum SELDI MS data sets were used in this research to identify serum proteomic patterns that distinguish the serum of ovarian cancer cases from non-cancer controls. A support vector machine-based method is applied in this study, in which statistical testing and genetic algorithm-based methods are each used for feature selection. Leave-one-out cross-validation with the receiver operating characteristic (ROC) curve is used for evaluation and comparison of cancer detection performance. RESULTS AND CONCLUSIONS: The results showed that (1) data mining techniques can be successfully applied to ovarian cancer detection with a reasonably high performance; (2) classification using features selected by the genetic algorithm consistently outperformed classification using features selected by statistical testing in terms of accuracy and robustness; (3) the discriminatory features (proteomic patterns) can be very different from one selection method to another. In other words, the pattern selection and its classification efficiency are highly classifier-dependent. Therefore, when using data mining techniques, the discrimination of cancer from normal does not depend solely upon the identity and origin of cancer-related proteins.
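The ROC-based evaluation mentioned in this abstract can be summarized by the area under the ROC curve. The sketch below computes the empirical area as the fraction of (cancer, control) score pairs ranked correctly, with ties counted as half. The function name `auc` and the scores are hypothetical; in a leave-one-out setting each score would come from a classifier trained without that sample.

```python
def auc(scores_pos, scores_neg):
    """Empirical area under the ROC curve via pairwise comparison."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical held-out decision scores, not data from the paper.
cancer_scores = [0.9, 0.8, 0.4]
control_scores = [0.5, 0.3, 0.2]
area = auc(cancer_scores, control_scores)
```

Here 8 of the 9 pairs are ordered correctly, so the area is 8/9.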

3.
Recently, mass spectrometry analysis has become an effective and rapid approach to detecting early-stage cancer. To identify proteomic patterns in serum that discriminate cancer patients from normal individuals, machine-learning methods such as feature selection and classification have already been applied to the analysis of mass spectrometry (MS) data with some success. However, the performance of existing machine-learning methods for MS data analysis still needs improving. This paper proposes a wavelet-based pre-processing approach to MS data analysis. The approach applies wavelet-based transforms to MS data with the aim of de-noising data that are potentially contaminated during acquisition. The effects of the choice of wavelet function and decomposition level on the de-noising performance are also investigated. Our comparative experimental results demonstrate that the proposed de-noising pre-processing approach has the potential to remove noise embedded in MS data, which can lead to improved performance for existing machine-learning methods in cancer detection.
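A common way to de-noise with wavelets is to transform the signal, shrink the small detail coefficients toward zero, and invert the transform. The following is a minimal sketch under stated assumptions: a single-level Haar transform, soft thresholding, and a hand-picked threshold. The abstract does not specify the authors' wavelet, level, or threshold rule, so none of these choices should be read as their method.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(x):
    """Single-level Haar transform of an even-length list."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Exact inverse of haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def soft_threshold(c, t):
    """Shrink a coefficient toward zero by t; zero it if |c| <= t."""
    if c > t:
        return c - t
    if c < -t:
        return c + t
    return 0.0

def denoise(x, threshold):
    """Transform, shrink the detail coefficients, and reconstruct."""
    approx, detail = haar_forward(x)
    detail = [soft_threshold(d, threshold) for d in detail]
    return haar_inverse(approx, detail)

# Hypothetical intensities: the 2.0/2.2 wobble is treated as noise.
smoothed = denoise([2.0, 2.2, 5.0, 5.0], threshold=0.5)
```

With a threshold of 0 the function returns the input unchanged, which is a quick check that the forward and inverse transforms agree.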

4.
Specimens from 60 cases of renal cell carcinoma (RCC) were graded using quantitative nuclear data combined with multivariate discriminant analysis. Patient survival was analysed with respect to quantitative microscopic and qualitative features. Both morphometric and stereological estimators were used to establish the nuclear size and form pattern of the RCC specimens. Tumoural dedifferentiation paralleled progressive increases in nuclear elongation and in two-dimensional and, especially, three-dimensional (mean nuclear volume, MNV) size parameters. Using stepwise discriminant analysis, 85.0 per cent of the specimens were correctly classified when differentiating grade 2 and 3 tumours. It is concluded that simple and realistic estimates of MNV are the best discriminator for objective grading in patients with RCC. Univariate survival analysis demonstrated the significance of several features such as MNV, clinical stage, and nuclear discriminant and histopathological tumour grades. Nuclear form factor PE, area, and perimeter were also significant. A prognosis study based on the Cox model using a stepwise selection of parameters showed that only MNV has an independent prognostic role when examining all investigated quantitative parameters. The clinical stage was the best prognostic feature when all quantitative and qualitative characteristics were included in the analysis.

5.
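Objective: To screen for marker proteins in the seminal plasma of azoospermia patients using protein chips and bioinformatics methods. Methods: Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS) and IMAC3 protein chips were used to measure the protein spectra of seminal plasma from 30 azoospermia patients and 57 healthy men. Data were read with a PBS II-C protein chip reader, and the results were analyzed with Ciphergen's Biomarker Wizard and Biomarker Patterns System software. Results: Compared with healthy men, the seminal plasma protein spectra of azoospermia patients showed 16 significantly different proteins, of which 6 were highly expressed and 10 were under-expressed in patient seminal plasma. The Biomarker Patterns System software built a classification tree model for azoospermia diagnosis from 3 marker molecules among the 16 significantly different proteins. The sensitivity was 86.7% (26/30) and the specificity was 96.5% (55/57). Conclusion: The experiments show that proteomics and bioinformatics methods can identify azoospermia-related marker proteins in seminal plasma, and that protein chip technology is an effective, rapid tool for discovering and screening such marker proteins.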
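The sensitivity and specificity quoted in this abstract follow directly from the counts it gives (26 of 30 patients and 55 of 57 healthy controls classified correctly). A short check, with the helper name `rate` chosen here for illustration:

```python
def rate(correct, total):
    """Percentage of correctly classified cases, rounded to one decimal."""
    return round(100.0 * correct / total, 1)

sensitivity = rate(26, 30)  # azoospermia patients correctly identified
specificity = rate(55, 57)  # healthy controls correctly identified
```

Both values agree with the figures reported in the abstract.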

6.
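For high-dimensional, small-sample gene microarray data, effectively extracting feature genes is a difficult task. Building on random feature selection methods, we introduce "seed variables" and a rolling ranking mechanism and propose a feature selection algorithm based on professional tennis players ranking (PTPR). Seed variables make the variable search more selective and more efficient, while historical records are fully used to update the seed variables dynamically and speed up optimization. Tests on public databases show that, across multiple independent random runs, PTPR obtains on average 50% to 80% identical genes, whereas Michal Draminski's method keeps only about 10% to 50% identical genes. Convergence experiments show that PTPR converges faster, and significantly so. Classification experiments on the independent test sets of 5 datasets show that PTPR maintains a high classification rate: its highest classification rates are approximately 98%, 90%, 89%, 95%, and 75%, compared with 96%, 89%, 85%, 95%, and 70% for Michal Draminski's method. PTPR also achieves higher classification rates than other typical methods. Overall, the PTPR algorithm searches quickly, gives stable results, and maintains good classification rates across different classifiers.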

7.
We are developing new computer vision techniques for characterization of breast masses on mammograms. We had previously developed a characterization method based on texture features. The goal of the present work was to improve our characterization method by making use of morphological features. Toward this goal, we have developed a fully automated, three-stage segmentation method that includes clustering, active contour, and spiculation detection stages. After segmentation, morphological features describing the shape of the mass were extracted. Texture features were also extracted from a band of pixels surrounding the mass. Stepwise feature selection and linear discriminant analysis were employed in the morphological, texture, and combined feature spaces for classifier design. The classification accuracy was evaluated using the area Az under the receiver operating characteristic curve. A data set containing 249 films from 102 patients was used. When the leave-one-case-out method was applied to partition the data set into trainers and testers, the average test Az for the task of classifying the mass on a single mammographic view was 0.83 ± 0.02, 0.84 ± 0.02, and 0.87 ± 0.02 in the morphological, texture, and combined feature spaces, respectively. The improvement obtained by supplementing texture features with morphological features in classification was statistically significant (p = 0.04). For classifying a mass as malignant or benign, we combined the leave-one-case-out discriminant scores from different views of a mass to obtain a summary score. In this task, the test Az value using the combined feature space was 0.91 ± 0.02. Our results indicate that combining texture features with morphological features extracted from automatically segmented mass boundaries will be an effective approach for computer-aided characterization of mammographic masses.

8.
In papillary thyroid carcinoma (PTC), metastasis is a feature of an aggressive tumor phenotype. To identify protein biomarkers that distinguish patients with an aggressive tumor behavior, proteomic signatures in metastatic and non-metastatic tumors were investigated comparatively. In particular, matrix-assisted laser desorption/ionization (MALDI) imaging mass spectrometry (IMS) was used to analyze primary tumor samples. We investigated a tumor cohort of PTC (n = 118) that were matched for age, tumor stage, and gender. Proteomic screening by MALDI-IMS was performed for a discovery set (n = 29). Proteins related to the discriminating mass peaks were identified by 1D-gel electrophoresis followed by mass spectrometry. The candidate proteins were subsequently validated by immunohistochemistry (IHC) using a tissue microarray for an independent PTC validation set (n = 89). In this study, we found 36 mass-to-charge-ratio (m/z) species that specifically distinguished metastatic from non-metastatic tumors, among which m/z 11,608 was identified as thioredoxin, m/z 11,184 as S100-A10, and m/z 10,094 as S100-A6. Furthermore, using IHC on the validation set, we showed that the overexpression of these three proteins was highly associated with lymph node metastasis in PTC (p …

9.
Four proteomic biomarkers (human neutrophil peptide 1 [HNP1], HNP2 [defensins], calgranulin C [Cal-C], and Cal-A) characterize the fingerprint of intra-amniotic inflammation (IAI). We compared proteomic technology using surface-enhanced laser desorption-ionization-time of flight (SELDI-TOF) mass spectrometry to enzyme-linked immunosorbent assay (ELISA) for detection of these biomarkers. Amniocentesis was performed on 48 women enrolled in two groups: those with intact membranes (n = 27; gestational age [GA], 26.0 ± 0.8 weeks) and those with preterm premature rupture of the membranes (PPROM; n = 21; GA, 28.4 ± 0.9 weeks). Paired abdominal amniotic fluids (aAFs)-vaginal AFs (vAFs) were analyzed in PPROM women. Quantitative aspects of HNP1-3, Cal-C, Cal-A, and calprotectin (a complex of Cal-A with Cal-B) were assessed by ELISA. SELDI-TOF mass spectrometry tracings from 16/48 (33.3%) aAFs and 13/17 (88.2%) vAFs were consistent with IAI (three or four biomarkers present). IAI (by SELDI-TOF mass spectrometry) was associated with increased HNP1-3 and Cal-C measured by ELISA. However, immunoassays detected Cal-A in only 4 of the AFs even though its specific SELDI-TOF mass spectrometry peak was identified in 19/48 AFs. Calprotectin immunoreactivity was decreased in AFs retrieved from women with IAI (P = 0.01). In conclusion, IAI is associated with increased HNP1-3 levels. In the absence of isoform-specific ELISAs, mass spectrometry remains the only way to discriminate the HNP biomarker isoforms. Monomeric Cal-A is not reliably estimated by specific ELISA as it binds to Cal-B to form the calprotectin complex. Cal-C was reliably measured by SELDI-TOF mass spectrometry or specific ELISA.

10.
Biomarker identification by feature wrappers.   Total citations: 9, self-citations: 0, citations by others: 9
M. Xiong, X. Fang, J. Zhao. Genome Research, 2001, 11(11): 1878-1887
Gene expression studies bridge the gap between DNA information and trait information by dissecting biochemical pathways into intermediate components between genotype and phenotype. These studies open new avenues for identifying complex disease genes and biomarkers for disease diagnosis and for assessing drug efficacy and toxicity. However, the majority of analytical methods applied to gene expression data are not efficient for biomarker identification and disease diagnosis. In this paper, we propose a general framework that incorporates feature (gene) selection into pattern recognition in order to identify biomarkers. Using this framework, we develop three feature wrappers that search through the space of feature subsets, using the classification error as the measure of goodness for a particular feature subset, around three classifiers: linear discriminant analysis, logistic regression, and support vector machines. To carry out this computationally intensive search effectively, we employ sequential forward search and sequential forward floating search algorithms. To evaluate the performance of feature selection for biomarker identification, we applied the proposed methods to three data sets. The preliminary results demonstrate that very high classification accuracy can be attained by the identified composite classifiers with several biomarkers.
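The wrapper idea described above (search over feature subsets, scoring each subset by the error of the classifier being wrapped) can be sketched as a plain sequential forward search. This is an illustration under assumptions, not the paper's implementation: a nearest-centroid classifier stands in for the LDA, logistic regression, and SVM classifiers the abstract names, the error is estimated by leave-one-out, the floating variant is not shown, and the data and function names are hypothetical.

```python
def loo_error(X, y, features):
    """Leave-one-out error of a nearest-centroid classifier on `features`."""
    errors = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue  # hold out sample i
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            # squared distance from sample i to the centroid of `label`
            return sum((X[i][f] - sums[label][k] / counts[label]) ** 2
                       for k, f in enumerate(features))

        predicted = min(sums, key=dist)
        errors += predicted != y[i]
    return errors / len(X)

def sequential_forward_search(X, y, max_features):
    """Greedily add the feature that most reduces the wrapped error."""
    selected, best_error = [], float("inf")
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        error, f = min((loo_error(X, y, selected + [f]), f) for f in remaining)
        if error >= best_error:
            break  # no candidate improves the error; stop
        selected.append(f)
        remaining.remove(f)
        best_error = error
    return selected, best_error

# Hypothetical toy data: feature 0 is noise, feature 1 separates the classes.
X = [[5.0, 0.1], [1.0, 0.2], [4.0, 0.0], [2.0, 1.0], [5.0, 0.9], [1.0, 1.1]]
y = [0, 0, 0, 1, 1, 1]
selected, error = sequential_forward_search(X, y, max_features=2)
```

On this toy data the search keeps only feature 1 and stops, because adding the noisy feature cannot lower an error that is already zero.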

11.
The different steps of a proteomics analysis workflow generate a plethora of features for each extracted proteomic object (a protein spot in 2D gel electrophoresis (2-DE), or a peptide peak in liquid chromatography-mass spectrometry (LC-MS) analysis). Yet the joint visualization of multiple object features on 2D gel-like maps is rather limited in currently available proteomics software packages. We introduce a new, simple, and intuitive visualization method that uses spheres to represent proteomic objects on proteomic feature maps, and exploits the spheres' size and color to provide simultaneous visualization of user-selected feature pairs. Our contribution, a unified and flexible visualization mechanism that can be easily applied at any stage of a 2-DE or LC-MS based differential proteomics study, is demonstrated and discussed using five representative scenarios. The joint visualization of proteomic object features and their spatial distribution is a powerful tool for inspecting and comparing proteomics analysis results, drawing the user's attention to useful information such as differential expression trends and patterns, and even assisting in the evaluation and refinement of a proteomics experiment.

12.
Medical applications are often characterized by a large number of disease markers and a relatively small number of data records. We demonstrate that complete feature ranking followed by selection can lead to appreciable reductions in data dimensionality, with significant improvements in the implementation and performance of classifiers for medical diagnosis. We describe a novel approach for ranking all features according to their predictive quality using properties unique to learning algorithms based on the group method of data handling (GMDH). An abductive network training algorithm is repeatedly used to select groups of optimum predictors from the feature set at gradually increasing levels of model complexity specified by the user. Groups selected earlier are better predictors. The process is then repeated to rank features within individual groups. The resulting full feature ranking can be used to determine the optimum feature subset by starting at the top of the list and progressively including more features until the classification error rate on an out-of-sample evaluation set starts to increase due to overfitting. The approach is demonstrated on two medical diagnosis datasets (breast cancer and heart disease) and comparisons are made with other feature ranking and selection methods. Receiver operating characteristic (ROC) analysis is used to compare classifier performance. At default model complexity, dimensionality reductions of 22% and 54% could be achieved for the breast cancer and heart disease data, respectively, leading to improvements in the overall classification performance. For both datasets, considerable dimensionality reduction introduced no significant reduction in the area under the ROC curve. GMDH-based feature selection results have also proved effective with neural network classifiers.
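The stopping rule described in this abstract (walk down the ranked list and stop when the out-of-sample error starts to rise) is simple to state in code. The sketch below assumes the error for each prefix of the ranking has already been measured; `choose_subset_size` and the error values are hypothetical and are not results from the paper.

```python
def choose_subset_size(errors_by_size):
    """errors_by_size[k - 1] is the out-of-sample error using the top k features.

    Returns the largest k reached before the error first increases.
    """
    best_k = 1
    for k in range(2, len(errors_by_size) + 1):
        if errors_by_size[k - 1] > errors_by_size[k - 2]:
            break  # error started to increase: likely overfitting
        best_k = k
    return best_k

# Hypothetical evaluation-set error rates for the top 1..6 ranked features.
errors = [0.30, 0.22, 0.18, 0.18, 0.21, 0.19]
k = choose_subset_size(errors)
```

The rule stops at the first increase, so a later dip (0.19 at six features here) is deliberately ignored.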

13.
Four proteomic biomarkers (human neutrophil peptide 1 [HNP1], HNP2 [defensins], calgranulin C [Cal-C], and Cal-A) characterize the fingerprint of intra-amniotic inflammation (IAI). We compared proteomic technology using surface-enhanced laser desorption-ionization-time of flight (SELDI-TOF) mass spectrometry to enzyme-linked immunosorbent assay (ELISA) for detection of these biomarkers. Amniocentesis was performed on 48 women enrolled in two groups: those with intact membranes (n = 27; gestational age [GA], 26.0 ± 0.8 weeks) and those with preterm premature rupture of the membranes (PPROM; n = 21; GA, 28.4 ± 0.9 weeks). Paired abdominal amniotic fluids (aAFs)-vaginal AFs (vAFs) were analyzed in PPROM women. Quantitative aspects of HNP1-3, Cal-C, Cal-A, and calprotectin (a complex of Cal-A with Cal-B) were assessed by ELISA. SELDI-TOF mass spectrometry tracings from 16/48 (33.3%) aAFs and 13/17 (88.2%) vAFs were consistent with IAI (three or four biomarkers present). IAI (by SELDI-TOF mass spectrometry) was associated with increased HNP1-3 and Cal-C measured by ELISA. However, immunoassays detected Cal-A in only 4 of the AFs even though its specific SELDI-TOF mass spectrometry peak was identified in 19/48 AFs. Calprotectin immunoreactivity was decreased in AFs retrieved from women with IAI (P = 0.01). In conclusion, IAI is associated with increased HNP1-3 levels. In the absence of isoform-specific ELISAs, mass spectrometry remains the only way to discriminate the HNP biomarker isoforms. Monomeric Cal-A is not reliably estimated by specific ELISA as it binds to Cal-B to form the calprotectin complex. Cal-C was reliably measured by SELDI-TOF mass spectrometry or specific ELISA.

14.
The advent of systems biology approaches that have stemmed from the sequencing of the human genome has led to the search for new methods to diagnose diseases. While much effort has been focused on the identification of disease-specific biomarkers, recent efforts are underway toward the use of proteomic and metabonomic patterns to indicate disease. We have developed and contrasted the use of both proteomic and metabonomic patterns in urine for the detection of interstitial cystitis (IC). The methodology relies on advanced bioinformatics to scrutinize information contained within mass spectrometry (MS) and high-resolution proton nuclear magnetic resonance (1H-NMR) spectral patterns to distinguish IC-affected from non-affected individuals as well as those suffering from bacterial cystitis (BC). We have applied a novel pattern recognition tool that employs an unsupervised system (self-organizing-type cluster mapping) as a fitness test for a supervised system (a genetic algorithm). With this approach, a training set composed of mass spectra and 1H-NMR spectra from urine derived from either unaffected individuals or patients with IC is employed so that the most fit combination of relative, normalized intensity features defined at precise m/z or chemical shift values plotted in n-space can reliably distinguish the cohorts used in training. Using this bioinformatic approach, we were able to discriminate spectral patterns associated with IC-affected, BC-affected, and unaffected patients with a success rate of approximately 84%.

15.
This paper presents a novel feature selection approach to deal with issues of high dimensionality in biomedical data classification. Extensive research has been performed in the field of pattern recognition and machine learning. Dozens of feature selection methods have been developed in the literature, which can be classified into three main categories: filter, wrapper, and hybrid approaches. Filter methods apply an independent test without involving any learning algorithm, while wrapper methods require a predetermined learning algorithm for feature subset evaluation. Filter and wrapper methods each have their drawbacks and are complementary to each other: filter approaches have low computational cost but insufficient reliability in classification, while wrapper methods tend to have superior classification accuracy but require great computational power. The approach proposed in this paper integrates filter and wrapper methods into a sequential search procedure with the aim of improving the classification performance of the selected features. The proposed approach is characterized by (1) adding a pre-selection step to improve the effectiveness of searching for feature subsets with improved classification performance and (2) using Receiver Operating Characteristic (ROC) curves to characterize the performance of individual features and feature subsets in the classification. Compared with the conventional Sequential Forward Floating Search (SFFS), which has been considered one of the best feature selection methods in the literature, experimental results demonstrate that (i) the proposed approach is able to select feature subsets with better classification performance than the SFFS method and (ii) the integrated feature pre-selection mechanism, by means of a new selection criterion and filter method, helps to solve over-fitting problems and reduces the chances of getting a locally optimal solution.

16.
This article presents a novel method for diagnosis of valvular heart disease (VHD) based on phonocardiography (PCG) signals. The application of pattern classification, feature selection, and feature reduction methods to the analysis of normal and pathological heart sounds was investigated. After signal preprocessing using independent component analysis (ICA), 32 features are extracted. These include carefully selected linear and nonlinear time-domain, wavelet, and entropy features. By examining different feature selection and feature reduction methods such as principal component analysis (PCA), genetic algorithms (GA), genetic programming (GP), and generalized discriminant analysis (GDA), the four most informative features are extracted. Furthermore, support vector machine (SVM) and neural network classifiers are compared for diagnosis of pathological heart sounds. Three valvular heart diseases are considered: aortic stenosis (AS), mitral stenosis (MS), and mitral regurgitation (MR). An overall accuracy of 99.47% was achieved by the proposed algorithm.

17.
Gene selection is important for cancer classification based on gene expression data because of the high dimensionality and small sample size. In this paper, we present a new gene selection method based on clustering, in which dissimilarity measures are obtained through kernel functions. It iteratively searches for the best gene weights while optimizing the clustering objective function. An adaptive distance is used in the process, which is suitable for learning the weights of genes during clustering and improves the performance of the algorithm. The proposed algorithm is simple and does not require any modification or parameter optimization for each dataset. We tested it on eight publicly available datasets using two classifiers (support vector machine and k-nearest neighbor) and compared it with six other competitive feature selectors. The results show that the proposed algorithm is capable of achieving better accuracies and may be an efficient tool for finding possible biomarkers from gene expression data.

18.
OBJECTIVE: Medical data are often very high dimensional. Depending upon the use, some data dimensions might be more relevant than others. In processing medical data, choosing the optimal subset of features is important, not only to reduce the processing cost but also to improve the usefulness of the model built from the selected data. This paper presents a data mining study of medical data with fuzzy modeling methods that use feature subsets selected by some indices/methods. METHODS: Specifically, three fuzzy modeling methods are employed: the fuzzy k-nearest neighbor algorithm, a fuzzy clustering-based modeling, and the adaptive network-based fuzzy inference system. For feature selection, a total of 11 indices/methods are used. The medical data mined include the Wisconsin breast cancer dataset and the Pima Indians diabetes dataset. The classification accuracy and computational time are reported. To show how good the best performer is, the global optimum was also found by exhaustively testing all possible combinations of feature subsets with three features. RESULTS: For the Wisconsin breast cancer dataset, the best accuracy of 97.17% was obtained, which is only 0.25% lower than that obtained by exhaustive testing. For the Pima Indians diabetes dataset, the best accuracy of 77.65% was obtained, which is only 0.13% lower than that obtained by exhaustive testing. CONCLUSION: This paper has shown that feature selection is important in mining medical data for reducing processing time and for increasing classification accuracy. However, not all combinations of feature selection and modeling methods are equally effective, and the best combination is often data-dependent, as supported by the breast cancer and diabetes data analyzed in this paper.

19.
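Objective: To compare the serum proteomes of patients with knee osteoarthritis and healthy individuals, and to screen for serum biomarkers that can be used in the diagnosis, treatment, or pathogenesis research of osteoarthritis. Methods: Two-dimensional fluorescence difference gel electrophoresis was used to compare the serum protein profiles of knee osteoarthritis patients (4 cases) and healthy volunteers (4 cases). The differential proteins obtained were identified and preliminarily analyzed using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry and related bioinformatics techniques, and the identified proteins were further verified by Western blotting. Results: A research method for differential serum proteomics in osteoarthritis was successfully established. Mass spectrometry preliminarily identified 8 differential proteins, of which 5 were up-regulated and 3 were down-regulated in the knee osteoarthritis group. Western blotting confirmed that α2-macroglobulin was up-regulated in the knee osteoarthritis group, consistent with the results of two-dimensional fluorescence difference gel electrophoresis combined with mass spectrometry. Conclusion: Serum protein expression differs markedly between knee osteoarthritis patients and healthy individuals. α2-macroglobulin can be studied further as a potential disease-related biomarker of osteoarthritis, providing new clues and an experimental basis for research on the diagnosis, treatment, and pathogenesis of osteoarthritis.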

20.
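This paper studies pattern selection, an important class of problems in bioinformatics. To address the high algorithmic complexity of pattern selection and the difficulty of determining the optimal number of patterns, we propose an algorithm that performs pattern selection based on mutual information (MI) theory and selects the optimal number of patterns using a fuzzy-neural evaluation criterion for pattern subsets. The algorithm selects patterns according to the degree of correlation between pattern information and class information and the degree of redundancy among sub-patterns, and it evaluates feature pattern subsets using a fuzzy pattern index. The experimental data are mouse gene expression data after data mining (from Leiden University) and UCI data. The results show that the algorithm performs well, with improvements in both complexity and accuracy.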
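The selection principle in this abstract (prefer patterns that share much information with the class label and little with patterns already chosen) can be sketched with discrete mutual information. The scoring rule below, relevance minus mean redundancy, is one common choice and is an assumption here; the abstract does not give the authors' exact criterion, and their fuzzy-neural subset evaluation is not covered. All names and data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

def select_features(columns, labels, k):
    """Greedy selection: maximize relevance minus mean redundancy."""
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < k:
        def score(f):
            relevance = mutual_information(columns[f], labels)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(columns[f], columns[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical discretized data: column 1 duplicates column 0.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # informative
    [0, 0, 0, 1, 1, 1, 1, 1],  # exact copy of column 0 (redundant)
    [0, 0, 1, 0, 1, 0, 1, 1],  # weaker, but not redundant with column 0
]
chosen = select_features(columns, labels, k=2)
```

The duplicate column is skipped in favor of the weaker but non-redundant one, which is the behavior the redundancy term is meant to produce.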
