Similar articles
1.
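Application of partial least squares dimension reduction in discriminant analysis of microarray data   Total citations: 2 (self-citations: 2, citations by others: 2)
Objective: To explore discriminant analysis methods for microarray data. Methods: Partial least squares (PLS) was first used to reduce the dimensionality of the high-dimensional data, and Fisher's linear discriminant analysis was then applied. The paper also introduces the basic principles and algorithm of PLS, discusses issues such as choosing the number of components, and demonstrates the approach on real microarray data. Results: PLS dimension reduction not only made the data visualizable but also gave good subsequent discrimination. Conclusion: PLS is a new and practical dimension-reduction method that can serve as the preliminary reduction step in discriminant analysis of microarray data.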
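
The two-step idea in this abstract (reduce with PLS, then discriminate) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it keeps a single PLS component, codes the two classes as 0/1, and uses a midpoint cut between the class mean scores, which is what Fisher's linear discriminant reduces to in one dimension with equal priors. The toy data and all function names are hypothetical.

```python
import math

def column_means(X):
    """Mean of each column (gene) across samples."""
    n = len(X)
    return [sum(col) / n for col in zip(*X)]

def pls_first_weights(X, y):
    """First PLS weight vector: proportional to Xc^T yc, scaled to unit length."""
    means = column_means(X)
    y_mean = sum(y) / len(y)
    w = [
        sum((row[j] - means[j]) * (yi - y_mean) for row, yi in zip(X, y))
        for j in range(len(means))
    ]
    # Assumption: at least one gene covaries with y, so the norm is nonzero.
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w], means

def pls_score(x, w, means):
    """Project one centered sample onto the first PLS component."""
    return sum((xj - mj) * wj for xj, mj, wj in zip(x, means, w))

def fit_midpoint_rule(scores, y):
    """One-dimensional discriminant: cut halfway between the class mean scores."""
    s1 = [s for s, yi in zip(scores, y) if yi == 1]
    s0 = [s for s, yi in zip(scores, y) if yi == 0]
    m1, m0 = sum(s1) / len(s1), sum(s0) / len(s0)
    return (m0 + m1) / 2, m1 > m0

def classify(score, cut, class1_is_high):
    """Assign class 1 to the side of the cut where class 1 scores lie."""
    return 1 if (score > cut) == class1_is_high else 0

# Hypothetical toy data: 4 samples x 3 genes, class labels 0/1.
X = [[5.0, 1.0, 0.2], [4.8, 1.2, 0.1], [1.0, 1.1, 0.3], [1.2, 0.9, 0.2]]
y = [1, 1, 0, 0]

w, means = pls_first_weights(X, y)
scores = [pls_score(row, w, means) for row in X]
cut, class1_is_high = fit_midpoint_rule(scores, y)
predicted = [classify(s, cut, class1_is_high) for s in scores]
```

In practice more than one component is usually retained, and the number of components is itself a tuning choice, as the abstract notes.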

2.
武海滨  张涛  赵发林  李康 《中国卫生统计》2013,30(4):517-520,524
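Objective: To evaluate the feature-screening performance of a genetic algorithm based on partial least squares linear discriminant analysis (PLS-LDA) and to apply it to high-dimensional metabolomics data. Methods: Simulation experiments were used to verify the feature-screening ability of the PLS-LDA-based genetic algorithm, which was also applied to feature screening of metabolomics data for distinguishing benign from malignant ovarian tumors. Results: The simulations showed that the PLS-LDA-based genetic algorithm screened informative variables markedly better than the PLS variable importance in projection (VIP) index. In the metabolomics analysis, variables selected by the genetic algorithm yielded a lower error rate, and the selected variables were more likely to include metabolites related to a biological outcome. Conclusion: As an optimization technique, the PLS-LDA-based genetic algorithm performs well for feature screening of high-dimensional data under small-sample conditions.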

3.
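Objective: Because the classification accuracy of disease recognition models, particularly for tumors, matters greatly for treatment and prognosis, this study explored methods for building disease-subtype recognition models from gene expression profiles. Methods: Using leukemia gene expression profile data, partial least squares discriminant analysis (PLS-DA) was applied to gene microarray data to build a leukemia subtype model, and its discrimination performance was compared with that of the modeling method proposed by Golub et al. Results: The PLS-DA-based leukemia recognition model reached 100% accuracy in both fitting and prediction. Conclusion: The PLS-DA-based model markedly improved the correct classification rate for leukemia subtypes; both its fitting accuracy and its prediction accuracy exceeded those of the method proposed by Golub et al.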

4.
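A random forest stepwise discriminant analysis method for gene expression data   Total citations: 3 (self-citations: 2, citations by others: 3)
Objective: To present a new random forest algorithm that automatically screens variables during model building and establishes an "optimal" discriminant model. Methods: Variable importance scores and a stepwise iterative algorithm were used to select useful variables; the method was assessed on real gene expression data, and simulation experiments programmed in R were used to verify its effectiveness. Results: Discriminant models for gene expression data from three diseases achieved satisfactory classification with only a small number of genes. The simulations showed that when between-class separation was large, random forest stepwise discriminant analysis worked well, effectively retaining useful variables in the model and improving its discrimination; when between-class separation was not large enough, classification improved little. Conclusion: Random forest stepwise discriminant analysis can be applied effectively to gene screening and classification of gene expression data, but particular attention should be paid to the influence of random fluctuation on the results.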

5.
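Objective: To identify cytogenetically abnormal regions in primary liver cancer using partial least squares (PLS) combined with Fisher's exact test. Methods: Using the primary liver cancer gene expression database published on SMD by Chen et al., the PLS variable importance in projection (VIP) index was used to screen for abnormally expressed genes in liver cancer. Based on each gene's chromosomal location, the numbers of up-regulated, down-regulated, and normally expressed genes on each chromosome were counted, and Fisher's exact test was used to identify statistically significant cytogenetically abnormal regions. Results: Seven regions of increased gene expression (1q, 5q, 6p, 8q, 12q, 17q, 20q) and eight regions of decreased expression (4q, 8p, 9p, 12p, 13q, 16q, 17p, 21q) were found; all 15 abnormal regions have been confirmed experimentally. Conclusion: Compared with traditional experimental methods and some prediction algorithms, PLS combined with Fisher's exact test identifies chromosomal regions of abnormal gene expression effectively and quickly, with considerably improved sensitivity.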
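
The region-level test described in this abstract can be illustrated with a small two-sided Fisher's exact test on a 2×2 table (up-regulated vs. other genes, inside vs. outside a chromosome arm). This is a minimal sketch, not the authors' code: the gene counts are invented for illustration, and the two-sided p-value uses the common convention of summing all tables that are no more probable than the observed one.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    total = sum(prob(k) for k in range(lo, hi + 1)
                if prob(k) <= p_obs * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, total)

# Hypothetical counts: 12 of 40 genes on one arm are up-regulated,
# versus 50 of 960 genes elsewhere in the genome.
p_region = fisher_exact_two_sided(12, 28, 50, 910)
```

A small `p_region` would flag the arm as enriched for up-regulated genes; with many arms tested, some multiple-testing correction would normally be applied as well.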

6.
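Objective: To propose a new false discovery rate (FDR) estimation method based on the partial least squares (PLS) model and to verify its accuracy. Methods: Variables were screened using PLS VIP scores, and the FDR of the screening results was estimated by combining a permutation method with backward elimination. Results: Simulations showed that when the variables were independent, both the PLS-FDR method and three univariate estimation methods estimated the FDR accurately; when the variables were linearly related, the PLS-FDR estimate remained unbiased, whereas the three univariate methods could not estimate the FDR accurately. Analysis of a real example showed that the PLS-FDR method can provide important information for high-dimensional data analysis. Conclusion: For data with a linear structure, the PLS-FDR method presented in the paper yields multivariate FDR estimates.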
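
The permutation part of such an FDR estimate can be sketched generically. The example below is an assumption-laden illustration rather than the PLS-FDR method itself: it swaps in a simple per-variable statistic (absolute difference in group means) in place of PLS VIP scores, omits the backward-elimination step, and uses invented toy data. The estimate is the average number of variables that pass the threshold under shuffled labels, divided by the number that pass with the real labels.

```python
import random

def abs_mean_diff(X, labels):
    """Per-variable absolute difference between the two group means."""
    g1 = [row for row, lab in zip(X, labels) if lab == 1]
    g0 = [row for row, lab in zip(X, labels) if lab == 0]
    return [
        abs(sum(r[j] for r in g1) / len(g1) - sum(r[j] for r in g0) / len(g0))
        for j in range(len(X[0]))
    ]

def permutation_fdr(X, labels, threshold, n_perm=200, seed=0):
    """Return (estimated FDR, number of variables selected at the threshold)."""
    rng = random.Random(seed)
    observed = abs_mean_diff(X, labels)
    n_selected = sum(s >= threshold for s in observed)
    if n_selected == 0:
        return 0.0, 0
    shuffled = list(labels)
    false_total = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between labels and data
        false_total += sum(s >= threshold for s in abs_mean_diff(X, shuffled))
    expected_false = false_total / n_perm
    return min(1.0, expected_false / n_selected), n_selected

# Hypothetical toy data: 6 samples x 3 variables; only variable 0 separates groups.
X = [[5.0, 1.0, 2.0], [6.0, 2.0, 1.0], [5.5, 1.5, 2.0],
     [1.0, 1.0, 1.0], [0.5, 2.0, 2.0], [1.5, 1.5, 1.0]]
labels = [1, 1, 1, 0, 0, 0]

fdr, n_selected = permutation_fdr(X, labels, threshold=3.0)
```

Replacing `abs_mean_diff` with a function that returns VIP scores from a fitted PLS model would bring the sketch closer to the multivariate setting the abstract describes.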

7.
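Objective: To explore a dimension-reduction strategy for microarray data driven by discriminant analysis. Methods: A three-step strategy was used: first, dimensions were preselected by combining univariate tests under FDR control with a composite scoring method based on correlation-matrix differences; second, methods such as PCA and PLS were used for further reduction; finally, variables were screened using the idea of stepwise discrimination. Results: The colon cancer data of Alon et al. were used to illustrate the three-step strategy in discriminant analysis; the resubstitution error rate was 9.68%, and the leave-one-out cross-validation error rate was 11.29%. Conclusion: The proposed three-step strategy of "preliminary dimension selection → further dimension reduction → stepwise discriminant screening" is practical and feasible for subsequent discriminant analysis of microarray data.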
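
The two error rates reported in this abstract differ in how the test sample is treated: resubstitution scores each sample with a model that was trained on it, while leave-one-out cross-validation refits without it. The sketch below shows that difference on invented two-dimensional data, with a nearest-centroid rule standing in for the paper's discriminant model (an assumption made only to keep the example short).

```python
def centroid(rows):
    """Component-wise mean of a list of points."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def fit(X, y):
    """Nearest-centroid model: one centroid per class label."""
    return {lab: centroid([x for x, l in zip(X, y) if l == lab]) for lab in set(y)}

def predict(model, x):
    """Assign the class whose centroid is closest to x."""
    return min(model, key=lambda lab: sq_dist(model[lab], x))

def resubstitution_error(X, y):
    """Error rate when the training samples are re-classified by their own model."""
    model = fit(X, y)
    return sum(predict(model, x) != l for x, l in zip(X, y)) / len(X)

def loocv_error(X, y):
    """Error rate when each sample is classified by a model fitted without it."""
    errors = 0
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        errors += predict(model, X[i]) != y[i]
    return errors / len(X)

# Hypothetical data: the class-0 point (2.1, 2.1) sits between the two clusters.
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [2.1, 2.1],
     [3.0, 3.0], [3.2, 2.9], [2.9, 3.1]]
y = [0, 0, 0, 0, 1, 1, 1]

resub = resubstitution_error(X, y)
loo = loocv_error(X, y)
```

On this toy data the borderline point is classified correctly only when it helps define its own class centroid, so the leave-one-out error is higher than the resubstitution error, the same direction as the gap reported above.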

9.
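Objective: To explore a multi-stage combined dimension-reduction strategy for high-dimensional biological data. Methods: Taking discriminant analysis of microarray data as an example and using both real and simulated data, a two-stage combined strategy of "preliminary dimension selection → further dimension reduction" was proposed and joined with the subsequent "discrimination → validation" steps to form a "selection → reduction → discrimination → validation" workflow for discriminant analysis. Two single dimension-reduction methods (PCA, PLS) and four combined methods (PCA+SIR, PCA+SAVE, PLS+SIR, and PLS+SAVE) were examined, using the predictive performance of the subsequent discriminant analysis and the stability and sensitivity of the predictions as criteria. Results: In terms of the predictive performance, stability, and sensitivity of the discriminant model, PLS outperformed PCA, and the combined PLS+SIR/SAVE reduction performed better still. Conclusion: A two-stage combined strategy that selects dimensions by t-scores and reduces them with PLS+SIR/SAVE is practical and feasible for discriminant analysis of microarray data.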

10.
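A simulation comparison of three dimension-reduction methods for collinearity in logistic regression models   Total citations: 2 (self-citations: 0, citations by others: 2)
Objective: To compare the bias and stability of regression coefficient estimates from three dimension-reduction methods, integrated partial least squares logistic regression (IPLSLR), coupled partial least squares logistic regression (CPLSLR), and principal component logistic regression (PCLR), when the explanatory variables in a logistic regression model are collinear. Methods: Monte Carlo simulation, programmed in SAS, was used to estimate the means and standard deviations of the regression coefficient estimates from the three methods under different collinearity scenarios. Results: Under simple collinearity, IPLSLR and CPLSLR gave coefficient estimates with smaller bias, whereas PCLR estimates were relatively more stable. Under multicollinearity with a large sample, all three methods gave coefficient estimates with small bias and good stability when collinearity was high; when collinearity was low, the bias of the PCLR estimates was clearly larger than that of IPLSLR and CPLSLR. With a small sample, IPLSLR and CPLSLR each had advantages and disadvantages in bias and stability. Conclusion: The dimension-reduction method for handling collinearity in logistic regression should be chosen according to the degree of collinearity and the sample size. Based on an analysis of the shortcomings of the three methods, principles for further improvement are proposed.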

11.
Wang A  Gehan EA 《Statistics in medicine》2005,24(13):2069-2087
Principal component analysis (PCA) has been widely used in multivariate data analysis to reduce the dimensionality of the data in order to simplify subsequent analysis and allow for summarization of the data in a parsimonious manner. It has become a useful tool in microarray data analysis. For a typical microarray data set, it is often difficult to compare the overall gene expression difference between observations from different groups or conduct the classification based on a very large number of genes. In this paper, we propose a gene selection method based on the strategy proposed by Krzanowski. We demonstrate the effectiveness of this procedure using a cancer gene expression data set and compare it with several other gene selection strategies. It turns out that the proposed method selects the best gene subset for preserving the original data structure.

12.
A rotation forest method for classifying DNA microarray data   Total citations: 1 (self-citations: 0, citations by others: 1)
陈金瓯  柳青 《中国卫生统计》2012,29(4):525-528,534
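Objective: To explore the application of the rotation forest algorithm to classification of DNA microarray data. Methods: The classification performance of rotation forest was examined on four classic gene expression data sets and compared with that of other classifiers; the algorithm's parameters were then varied to study their effect on classification. Results: Rotation forest classified gene expression data with high and stable accuracy. Apart from the type of linear transformation and the ensemble size, which strongly affected performance, classification did not vary with the algorithm's other main parameters. Conclusion: Rotation forest gives good discrimination in classifying gene expression profile data.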

13.
We propose a block principal component analysis method for extracting information from a database with a large number of variables and a relatively small number of subjects, such as a microarray gene expression database. This new procedure has the advantage of computational simplicity, and theory and numerical results demonstrate it to be as efficient as the ordinary principal component analysis when used for dimension reduction, variable selection and data visualization and classification. The method is illustrated with the well-known National Cancer Institute database of 60 human cancer cell lines data (NCI60) of gene microarray expressions, in the context of classification of cancer cell lines.

14.
任雨冬  陆震  李婧惟  刘艳 《实用预防医学》2020,27(12):1537-1539
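Objective: To modify significance analysis of microarray (SAM) using a Gaussian kernel function and a Euclidean distance function, giving the MSAM1 (modified significance analysis of microarray-1) and MSAM2 (modified significance analysis of microarray-2) methods, and to compare them with SAM, Relief, and support vector machine recursive feature elimination (SVM-RFE) in order to evaluate the gene selection and classification ability of MSAM1 and MSAM2 on gene expression data. Methods: The leukemia data set was obtained from the golubEsets package in Bioconductor (Golub et al. reported the 50 differentially expressed genes contained in this data set). The five algorithms were implemented in R. Gene selection ability was evaluated by accuracy, and classification ability by the area under the ROC curve (AUC). The Kruskal-Wallis H test was used to compare accuracy and AUC among the five methods, followed by pairwise comparisons with the SNK-q test. Results: For both accuracy and AUC, MSAM1 and MSAM2 performed best, SAM and SVM-RFE came next, and Relief ranked last. Differences among the five methods were statistically significant (H=150.333, P<0.0001 and H=293.2579, P<0.0001). Pairwise comparisons showed no statistically significant difference between MSAM1 and MSAM2 (P>0.05), but both differed significantly from the other three methods (P<0.05). Conclusion: The weighted SAM method modified with the Gaussian kernel and Euclidean distance functions improved the gene selection and classification ability of SAM and can give more stable results when applied to real gene expression data.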
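
For readers unfamiliar with the AUC criterion used in this abstract, the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the scores and labels are invented, and the study itself used R rather than Python.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores and true class labels (1 = positive).
example_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])  # 3 of 4 pairs ordered correctly -> 0.75
```

The pairwise form costs O(n_pos × n_neg) comparisons, which is fine for small data sets such as a few dozen samples; rank-based formulas give the same value more efficiently on large ones.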

15.
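Objective: To introduce the basic principles and methods of Radviz visualization and to apply it to classification and feature selection for gene expression data. Methods: Using colon cancer gene expression data as an example, heuristic search and Vizrank visualization scoring were combined with Radviz visualization to classify gene expression data and rank differentially expressed genes. Results: The Vizrank algorithm produced the 100 top-ranked Radviz visualizations; the best Vizrank score was 0.9491, and 17 differentially expressed genes useful for visual classification were obtained, some of which have a biological interpretation. Conclusion: Radviz vividly presents pattern features hidden in the data and works well for visual classification of gene expression data and screening of differentially expressed genes.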

16.

Background

We have reported arginine-sensitive regulation of LAT1 amino acid transporter (SLC7A5) in normal rodent hepatic cells with loss of arginine sensitivity and high level constitutive expression in tumor cells. We hypothesized that liver cell gene expression is highly sensitive to alterations in the amino acid microenvironment and that tumor cells may differ substantially in gene sets sensitive to amino acid availability. To assess the potential number and classes of hepatic genes sensitive to arginine availability at the RNA level and compare these between normal and tumor cells, we used an Affymetrix microarray approach, a paired in vitro model of normal rat hepatic cells and a tumorigenic derivative with triplicate independent replicates. Cells were exposed to arginine-deficient or control conditions for 18 hours in medium formulated to maintain differentiated function.

Results

Initial two-way analysis with a p-value of 0.05 identified 1419 genes in normal cells versus 2175 in tumor cells whose expression was altered in arginine-deficient conditions relative to controls, representing 9–14% of the rat genome. More stringent bioinformatic analysis with 9-way comparisons and a minimum of 2-fold variation narrowed this set to 56 arginine-responsive genes in normal liver cells and 162 in tumor cells. Approximately half the arginine-responsive genes in normal cells overlap with those in tumor cells. Of these, the majority was increased in expression and included multiple growth, survival, and stress-related genes. GADD45, TA1/LAT1, and caspases 11 and 12 were among this group. Previously known amino acid regulated genes were among the pool in both cell types. Available cDNA probes allowed independent validation of microarray data for multiple genes. Among genes downregulated under arginine-deficient conditions were multiple genes involved in cholesterol and fatty acid metabolism. Expression of low-density lipoprotein receptor was decreased in both normal and tumor cells.

Conclusion

Arginine-sensitive regulation appears to be an important homeostatic mechanism to coordinate cell response and nutrient availability in hepatic cells. Genes predicted as arginine-responsive in stringent microarray data analysis were confirmed by Northern blot and RT-PCR. Although the profile of arginine-responsive genes is altered and increased, a considerable portion of the "arginome" is maintained upon neoplastic transformation.

17.
Lipidomics is an emerging field of science that holds the potential to provide a readout of biomarkers for an early detection of a disease. Our objective was to identify an efficient statistical methodology for lipidomics, especially in finding interpretable and predictive biomarkers useful for clinical practice. In two case studies, we address the need for data preprocessing for regression modeling of a binary response. These are based on a normalization step, in order to remove experimental variability, and on a multiple imputation step, to make the full use of the incompletely observed data with potentially informative missingness. Finally, by cross‐validation, we compare stepwise variable selection to penalized regression models on stacked multiple imputed data sets and propose the use of a permutation test as a global test of association. Our results show that, depending on the design of the study, these data preprocessing methods modestly improve the precision of classification, and no clear winner among the variable selection methods is found. Lipidomics profiles are found to be highly important predictors in both of the two case studies. Copyright © 2014 John Wiley & Sons, Ltd.

18.
The purpose of the study was to design a hybrid decision support system (HDSS) that could simulate the embolized coil selection pattern of the radiologists in aneurysms treatment. Because the longest available length of the coils should be used in most cases, only the shape diameter (SD) selection was modeled and varied. Ninety-eight aneurysms successfully treated by a radiologist with coil embolization were divided into two groups (86 for training and 12 randomly selected for validating). Eight aneurysms treated by another radiologist were also used to cross validate the proposed HDSS. The HDSS was developed using the classification and the linear regression methods (LRM). The dome and the width of an aneurysm were used as the system inputs. The system outputs were the SDs of the first three coils indexed according to the insertion order. The HDSS that consisted of Bagging classification and LRM achieved the highest accuracy for all cases. The errors were within 1 mm for the SD selection of the first two coils. For the third coil, the SD selection within the 1 mm bound had 80% accuracy. The experimental results indicated the feasibility of using the HDSS as the guidance for selecting the SDs of the first two coils. The selection of the third coil required more training data for the rarely used SD. Moreover, the cross validation with another radiologist showed the feasibility of using the proposed HDSS as the guidance; however, further validation with more data is recommended.
