Similar articles
 20 similar articles found
1.
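Objective: To explore the characteristics of the kernel orthogonal partial least squares (KOPLS) method and its application to metabolomics data analysis. Methods: Simulation experiments and real metabolomics data were used to evaluate the predictive ability and visualization quality of KOPLS models. Results: The simulation analysis showed that when the data are linearly related, KOPLS performs the same as conventional linear OPLS; when the data are nonlinearly related, KOPLS has relatively higher predictive ability and produces clearer score plots. The real-data analysis showed that KOPLS improves both model predictive ability and visualization. Conclusion: KOPLS is better suited to high-dimensional metabolomics data with nonlinear relationships.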

2.
Wu Haibin, Zhang Tao, Zhao Falin, Li Kang. 《中国卫生统计》 (Chinese Journal of Health Statistics), 2013, 30(4): 517-520, 524
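Objective: To examine the feature selection performance of a genetic algorithm based on partial least squares linear discriminant analysis (PLS-LDA) and to apply it to high-dimensional metabolomics data. Methods: Simulation experiments were used to verify the feature selection ability of the PLS-LDA-based genetic algorithm, which was also applied to feature selection on metabolomics data for distinguishing benign from malignant ovarian tumors. Results: The simulations showed that the PLS-LDA-based genetic algorithm selected informative variables markedly better than the partial least squares variable importance in projection (VIP) index. The metabolomics data analysis showed that the variables selected by the genetic algorithm achieved a lower error rate, and that the variables selected by this method are more likely to include metabolites related to a given biological outcome. Conclusion: As an optimization technique, the PLS-LDA-based genetic algorithm performs well for feature selection on high-dimensional data with small sample sizes.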

3.
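Objective: To examine the classification performance of the HingeBoost algorithm, which is based on misclassification costs, on high-dimensional binary data. Methods: Using simulation experiments and real metabolomics data, HingeBoost was compared with AdaBoost, support vector machines, and random forest, and evaluated by the area under the ROC curve, sensitivity, specificity, and error rate. Results: Both the simulations and the real metabolomics data analysis showed that the misclassification cost, an internal parameter of HingeBoost, affects the classification decision, and that HingeBoost outperformed the other three algorithms when the data had a linear structure together with a large number of noise variables. Conclusion: HingeBoost incorporates misclassification costs into the model to reduce false positive or false negative errors, is highly robust to noise, is suitable for high-dimensional metabolomics data analysis, and merits further study.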

5.
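Objective: To examine the performance of the stochastic gradient boosting (SGB) algorithm for classification and metabolite selection in metabolomics data. Methods: At each iteration, "pseudo-residuals" are derived by minimizing the loss function, and a base classifier (a decision tree) is fitted to them by least squares; the classifiers are finally combined to form the stochastic gradient boosting model. Using simulation experiments and real metabolomics data, SGB was compared with AdaBoost, RF, and SVM. Results: In both the simulations and the real data, the classification accuracy of SGB was better than that of the other three algorithms. The algorithm can assess the importance of each metabolite and effectively select a subset of metabolites. Conclusion: SGB is suitable for metabolomics studies, is of considerable value for early diagnosis, treatment, and prognosis of disease, and merits further study and exploration.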
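
The procedure described in the Methods of this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes binary labels coded 0/1, logistic loss (so the pseudo-residual is the label minus the current predicted probability), depth-one trees (stumps) as base learners, and hypothetical names such as `fit_sgb`, `predict_proba`, and `split_counts`. The `subsample` fraction is what makes the boosting stochastic.

```python
import math
import random
from collections import Counter


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, r):
    """Least-squares stump: choose the (feature, threshold) whose two leaf
    means best approximate the pseudo-residuals r."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Threshold at the lower of two adjacent values, so both leaves
        # are guaranteed to be non-empty.
        for t in values[:-1]:
            left = [ri for row, ri in zip(X, r) if row[j] <= t]
            right = [ri for row, ri in zip(X, r) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # every feature is constant in this subsample
        m = sum(r) / len(r)
        return (0, float("inf"), m, m)
    return best[1:]


def stump_predict(stump, row):
    j, t, lm, rm = stump
    return lm if row[j] <= t else rm


def fit_sgb(X, y, n_rounds=50, learning_rate=0.3, subsample=0.7, seed=0):
    """Fit stumps to pseudo-residuals computed on random subsamples."""
    rng = random.Random(seed)
    n = len(X)
    k = max(2, min(n, int(subsample * n)))
    scores = [0.0] * n  # additive model on the log-odds scale
    stumps = []
    for _ in range(n_rounds):
        idx = rng.sample(range(n), k)
        # Negative gradient of the logistic loss: y - p.
        resid = [y[i] - sigmoid(scores[i]) for i in idx]
        stump = fit_stump([X[i] for i in idx], resid)
        stumps.append(stump)
        for i in range(n):
            scores[i] += learning_rate * stump_predict(stump, X[i])
    return stumps


def predict_proba(stumps, row, learning_rate=0.3):
    return sigmoid(sum(learning_rate * stump_predict(s, row) for s in stumps))


def split_counts(stumps):
    """Crude importance: how many rounds split on each feature."""
    return Counter(j for j, t, _, _ in stumps if t != float("inf"))
```

Counting how often each feature is chosen for a split gives only a rough importance ranking; it is one simple stand-in for the metabolite importance the abstract mentions, whose exact definition the abstract does not give.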

6.
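Objective: To examine the classification performance of the XGBoost algorithm on high-dimensional, imbalanced binary data. Methods: Using simulation experiments and real metabolomics data, five methods were compared: XGBoost, random forest, support vector machines, random undersampling, and stochastic gradient boosted trees. Results: The simulations showed that when the class imbalance was pronounced, XGBoost was better than or no worse than the other four algorithms under all experimental conditions; it also classified well when the classes were closer to balanced, and it showed some robustness to noise variables. In the real-data example, XGBoost had the best classification performance of the five algorithms and, while maintaining that performance, ran faster. Conclusion: XGBoost is suitable for discriminant analysis of imbalanced high-dimensional data and merits study.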

7.
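Objective: To build a model with the Boosting algorithm and analyze urinary metabolomic data from patients with ovarian cancer and without ovarian cancer (ovarian cysts and uterine fibroids), extracting biologically meaningful metabolic components to provide clues for the early diagnosis and disease mechanism of ovarian cancer. Methods: Decision trees were combined with the Boosting algorithm to analyze metabolomic data from patients' clinical samples, and the metabolic components were screened stepwise to obtain those important for identifying ovarian cancer patients. Results: The 10 top-ranked differential metabolic components obtained from the Boosting model discriminated well between ovarian cancer patients and controls, with an area under the ROC curve of 0.944. Conclusion: The Boosting model can be applied effectively to ovarian cancer metabolomic data; it maintains a high classification accuracy while identifying the metabolic components that matter for classification.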

8.
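Objective: A traditional Chinese medicine (TCM) is a unified whole with complex components. Whether it is a compound formula or a single herb, its therapeutic effect is the combined result of interactions among many chemical constituents, and it acts through multiple components, multiple targets, and multiple pathways. This paper describes the data analysis methods, and their characteristics, for studying the mechanisms of action of TCM with metabolomics, offering medical researchers new ideas and strategies for TCM research. Methods: From the perspectives of biostatistics and bioinformatics, the authors present their views on the basis of the literature and current related research. The main difficulty in metabolomic data analysis is that the number of spectral peaks is huge relative to the number of samples, so using traditional statistical methods to identify potentially biologically meaningful "differential peaks" produces a large number of false positives. Feature selection methods can be divided by algorithm into filter, wrapper, and embedded methods, each with its own characteristics. Results: Metabolomic fingerprint data provide a wealth of information on medicinal chemistry, especially on secondary metabolites. Many analysis methods are available for such high-dimensional data, but without variable selection the analysis is inevitably disturbed by a large number of irrelevant variables that do not contribute to classification. Variable selection has several advantages: it simplifies the model, aids visualization and interpretation, helps avoid the overfitting caused by the curse of dimensionality, and improves classification performance. Metabolomics databases and some software packages are also available as tools. Conclusion: Using metabolomics to study the mechanisms of action of TCM is feasible. Studies of the metabolic fingerprints of TCM should cover both chemistry and efficacy; to extract the biological information effectively, appropriate statistical models must be combined with biological knowledge.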

9.
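Objective: To apply a random survival forest model to assess the importance of prognostic factors in lung cancer patients and to evaluate the predictions. Methods: A follow-up study was conducted on 342 patients with confirmed lung cancer at a tertiary grade-A hospital in Shanxi Province; a random survival forest model was built and compared with the conventional Cox regression model. Results: Of the 342 lung cancer patients, 226 died, and the median survival time was 28.23 months. Treatment modality, tumor size, and clinical stage were important factors affecting prognosis; lymph node metastasis, degree of differentiation, pathological type, and age were moderate predictors. Interactions between variables were also analyzed. The model comparison showed that both the prediction error rate and the prediction error of the random survival forest model were lower than those of the Cox regression model. Conclusion: The random survival forest model fits well and can be used to analyze right-censored survival data. It identifies not only important factors but also interactions between variables, providing a scientific basis for improving the prognosis and quality of life of lung cancer patients.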

10.
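Objective: To explore the performance and characteristics of a genetic-algorithm-based random forest model for feature gene selection. Methods: The genetic-algorithm-based random forest model (GARF) constructed in this paper was used to select feature genes from real gene data and simulated data. Discriminant analysis was performed on the selected genes and the area under the ROC curve (AUC) was calculated; the ranking that GARF assigned to the differential genes preset in the simulation was also examined. Results: Analyses of both the real gene data and the simulated data showed that discriminant models built on feature genes selected by GARF achieved better classification, and that in the simulations GARF ranked the preset differential genes higher than random forest did. Conclusion: GARF can be used effectively for feature gene selection in gene chip (microarray) data, shows potential for FDR control, and is worth further study.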

11.
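Objective: To apply the random forest algorithm to the analysis of high-dimensional methylation data from a case-control study of rheumatoid arthritis and to examine its performance. Methods: The example data came from the Gene Expression Omnibus (GEO), accession number GSE42861, and comprised 354 cases and 335 controls. Chromosome 9, which contains gene regions related to rheumatoid arthritis, was selected, giving 2,433 cytosine-phosphate-guanine dinucleotide (CpG) sites. Random forest was used to compute and rank variable importance scores; a stepwise random forest procedure was then applied to the ranked variables to find the subset most likely to be associated with the outcome; finally, stepwise logistic regression was performed on the reduced variable subset. Results: The stepwise random forest selected 80 important CpG sites, of which 13 were statistically significant in the logistic regression model. A logistic regression model built on these sites had a prediction accuracy of 88.29%. Conclusion: The random forest algorithm can greatly reduce the number of noise variables and improve statistical power, and it is suitable for the analysis of high-dimensional methylation data.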

12.
Random forest is a supervised learning method that combines many classification or regression trees for prediction. Here we describe an extension of the random forest method for building event risk prediction models in survival analysis with competing risks. In case of right‐censored data, the event status at the prediction horizon is unknown for some subjects. We propose to replace the censored event status by a jackknife pseudo‐value, and then to apply an implementation of random forests for uncensored data. Because the pseudo‐responses take on values on a continuous scale, the node variance is chosen as split criterion for growing regression trees. In a simulation study, the pseudo split criterion is compared with the Gini split criterion when the latter is applied to the uncensored event status. To investigate the resulting pseudo random forest method for building risk prediction models, we analyze it in a simulation study of predictive performance where we compare it to Cox regression and random survival forest. The method is further illustrated in two real data sets. Copyright © 2013 John Wiley & Sons, Ltd.
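
The jackknife pseudo-value mentioned in this abstract can be illustrated with a short sketch. It is an illustration under stated assumptions, not the authors' code: the abstract does not name the underlying estimator, so the Aalen-Johansen estimate of the cumulative incidence at the prediction horizon is assumed here, and the function names and the coding of `status` (0 for censored, positive integers for event causes) are hypothetical. The pseudo-value for subject i is n times the full-sample estimate minus (n - 1) times the estimate with subject i left out.

```python
def cumulative_incidence(times, status, horizon, cause=1):
    """Aalen-Johansen estimate of P(T <= horizon, event type = cause).
    status: 0 = censored, any other integer = cause of the event."""
    data = sorted(zip(times, status))
    n_at_risk = len(data)
    surv = 1.0  # all-cause event-free probability just before the current time
    cif = 0.0
    i = 0
    while i < len(data) and data[i][0] <= horizon:
        t = data[i][0]
        tied = [s for u, s in data[i:] if u == t]  # subjects sharing time t
        d_cause = sum(1 for s in tied if s == cause)
        d_any = sum(1 for s in tied if s != 0)
        cif += surv * d_cause / n_at_risk
        surv *= 1.0 - d_any / n_at_risk
        n_at_risk -= len(tied)
        i += len(tied)
    return cif


def pseudo_values(times, status, horizon, cause=1):
    """Jackknife pseudo-values: n * theta_hat - (n - 1) * theta_hat_without_i."""
    n = len(times)
    full = cumulative_incidence(times, status, horizon, cause)
    values = []
    for i in range(n):
        loo = cumulative_incidence(times[:i] + times[i + 1:],
                                   status[:i] + status[i + 1:],
                                   horizon, cause)
        values.append(n * full - (n - 1) * loo)
    return values
```

Without censoring, each pseudo-value reduces to the observed 0/1 event status at the horizon; for a subject censored before the horizon it takes an intermediate value, which is why a regression-type split criterion is a natural fit for these responses.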

13.
Wang Fucheng, Qi Ping, Jiang Jianjun, Huang Yong, Yang Xiaoling. 《现代预防医学》 (Modern Preventive Medicine), 2020, (13): 2310-2313
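Objective: Given that the health examination data of residents of the Tianqiao community in Tongling City involve many factors and a limited number of valid samples, to mine and analyze the factors influencing hypertension and the interaction effects among them, providing a reference for hypertension intervention. Methods: Health examination data from 801 residents of the community in 2017 were studied. Random forest was used to select the features with the highest importance scores, which were then entered into a full quadratic logistic regression model; stepwise regression was used to analyze the influencing factors and their interaction effects. Results: The random forest model had an accuracy of 83.67%; the top 10 features by importance were age, diabetes, exercise frequency, body mass index, total cholesterol, smoking status, alcohol consumption, central obesity, triglycerides, and blood urea nitrogen. The full quadratic logistic regression model had an accuracy of 84.17% and yielded 2 main effects and 8 second-order interaction effects. The features with statistically significant (P<0.05) main effects were age and exercise frequency; the features involved in statistically significant (P<0.05) second-order interaction effects were age, diabetes, body mass index, total cholesterol, smoking status, alcohol consumption, triglycerides, and blood urea nitrogen. Conclusion: Combining random forest with a full quadratic logistic regression model addresses the difficulty classical methods have in mining interaction effects from multi-factor data with limited samples. It identifies the factors influencing hypertension and the interaction effects among them, providing useful guidance for hypertension intervention.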

14.
When fitting regression models to investigate the relationship between an outcome variable and independent variables of primary interest, there is often concern whether omitted variables or assuming a different functional relationship could have changed the conclusion or interpretation of the results. In longitudinal studies of ageing, the concern with omitted variables is well known in the context of cohort and period effects, which refer to unmeasured variables systematically related to the individual's year of birth and secular trends in outcome, respectively. We present and compare three approaches to detecting omitted confounders and non-linearity in the random effects model for longitudinal data (Laird and Ware, 1982) with random slope and intercept across individuals. The first approach compares simple unweighted within and between regression coefficients, the second is the Hausman specification test for regression models, and the third approach involves testing directly the significance of functions of individual-specific covariate means x̄_i in the random effects regression model. This last approach is motivated by the models that arise when cohort or period effects are ignored. We compare the three approaches, and illustrate their application.

15.
Analysis of health care cost data is often complicated by a high level of skewness, heteroscedastic variances and the presence of missing data. Most of the existing literature on cost data analysis has focused on modeling the conditional mean. In this paper, we study a weighted quantile regression approach for estimating the conditional quantiles of health care cost data with missing covariates. The weighted quantile regression estimator is consistent, unlike the naive estimator, and asymptotically normal. Furthermore, we propose a modified BIC for variable selection in quantile regression when the covariates are missing at random. The quantile regression framework allows us to obtain a more complete picture of the effects of the covariates on the health care cost and is naturally adapted to the skewness and heterogeneity of the cost data. The method is semiparametric in the sense that it does not require specifying the likelihood function for the random error or the covariates. We investigate the weighted quantile regression procedure and the modified BIC via extensive simulations. We illustrate the application by analyzing a real data set from a health care cost study. Copyright © 2013 John Wiley & Sons, Ltd.
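
To make the idea of weighted quantile estimation concrete, here is a small sketch of the check (pinball) loss and its weighted minimizer for an intercept-only model. It is an illustration under assumptions rather than the estimator in the paper: the abstract does not state how the weights are built, so `weights` stands in for whatever per-subject weights are used (for example, inverse probabilities of having complete covariates), and `check_loss` and `weighted_quantile` are hypothetical names.

```python
def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def weighted_quantile(values, weights, tau):
    """Minimize sum_i w_i * rho_tau(y_i - q) over q.

    The objective is convex and piecewise linear in q, so a minimizer is
    always attained at one of the observed values; searching them suffices."""
    def objective(q):
        return sum(w * check_loss(y - q, tau) for y, w in zip(values, weights))
    return min(sorted(values), key=objective)


# Skewed "cost" data: the median ignores the extreme value, the mean does not.
costs = [1.0, 2.0, 3.0, 4.0, 100.0]
median = weighted_quantile(costs, [1.0] * 5, tau=0.5)
upper = weighted_quantile(costs, [1.0] * 5, tau=0.9)
```

With unit weights this recovers ordinary sample quantiles; unequal weights shift the estimate toward the up-weighted observations, which is how a weighting scheme can compensate for subjects dropped because of missing covariates.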

16.
In genetic and genomic studies, gene‐environment (G×E) interactions have important implications. Some of the existing G×E interaction methods are limited by analyzing a small number of G factors at a time, by assuming linear effects of E factors, by assuming no data contamination, and by adopting ineffective selection techniques. In this study, we propose a new approach for identifying important G×E interactions. It jointly models the effects of all E and G factors and their interactions. A partially linear varying coefficient model is adopted to accommodate possible nonlinear effects of E factors. A rank‐based loss function is used to accommodate possible data contamination. Penalization, which has been extensively used with high‐dimensional data, is adopted for selection. The proposed penalized estimation approach can automatically determine if a G factor has an interaction with an E factor, main effect but not interaction, or no effect at all. The proposed approach can be effectively realized using a coordinate descent algorithm. Simulation shows that it has satisfactory performance and outperforms several competing alternatives. The proposed approach is used to analyze a lung cancer study with gene expression measurements and clinical variables. Copyright © 2015 John Wiley & Sons, Ltd.

17.
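Objective: To use resampling techniques to improve classification and prediction on imbalanced diabetes data from middle-aged and older residents in China. Methods: Four resampling techniques, namely random undersampling, random oversampling, the synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN), were applied to the imbalanced diabetes data in the CHARLS database. The classification performance of logistic regression, support vector machines, and random forest was compared before and after resampling, with G-means and AUC used to evaluate predictive performance. Results: On the imbalanced CHARLS diabetes data set, the G-means of the logistic regression, support vector machine, and random forest models were 0.2227, 0, and 0, respectively, and the AUCs were 0.7612, 0.7363, and 0.7429. The logistic regression model was significantly better than the support vector machine, with statistically significant differences in both model accuracy (χ²=1231.501, P<0.001) and AUC (Z=2.634, P=0.028). G-means improved for all models after each of the four resampling techniques, especially SMOTE and ADASYN; in addition, random undersampling could not significantly improve logistic ...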
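
The G-means criterion and the simplest of the four resampling schemes can be sketched as follows. This is an illustrative sketch with hypothetical function names, not the study's code; it takes G-means to be the geometric mean of sensitivity and specificity, which is the usual definition, and implements only random oversampling (SMOTE and ADASYN instead synthesize new minority points by interpolation).

```python
import math
import random
from collections import Counter


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# A classifier that always predicts the majority class looks accurate
# (9 of 10 correct) but has zero sensitivity, hence a G-mean of 0.
y_true = [0] * 9 + [1]
always_majority = [0] * 10
score = g_mean(y_true, always_majority)
```

A G-means of 0, as reported for two of the models before resampling, is what this definition yields whenever sensitivity or specificity is 0, for instance when every subject is assigned to the majority class; the abstract itself does not state the cause.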

18.
Testing for association between two random vectors is a common and important task in many fields; however, existing tests, such as Escoufier's RV test, are suitable only for low‐dimensional data, not for high‐dimensional data. In moderate to high dimensions, it is necessary to consider sparse signals, which are often expected with only a few, but not many, variables associated with each other. We generalize the RV test to moderate‐to‐high dimensions. The key idea is to weight each variable pair data‐adaptively, based on its empirical association. As a consequence, the proposed test is adaptive, alleviating the effects of noise accumulation in high‐dimensional data, and thus maintaining the power for both dense and sparse alternative hypotheses. We show the connections between the proposed test and several existing tests, such as a generalized estimating equations‐based adaptive test, multivariate kernel machine regression (KMR), and kernel distance methods. Furthermore, we modify the proposed adaptive test so that it can be powerful for nonlinear or nonmonotonic associations. We use both real data and simulated data to demonstrate the advantages and usefulness of the proposed new test. The new test is freely available in R package aSPC on CRAN at https://cran.r-project.org/web/packages/aSPC/index.html and https://github.com/jasonzyx/aSPC .

19.
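Objective: To explore the dimension reduction performance of random survival forest on large-scale sequencing data from a lung cancer follow-up study, providing a basis for building prognostic prediction models. Methods: Random survival forest was used for dimension reduction on 399 single nucleotide polymorphism (SNP) loci in 120 lung cancer patients, selecting the subset of SNPs with high importance scores and a low misclassification rate. A multivariable Cox proportional hazards model was then built on this subset, and cross-validation was used to evaluate its predictive performance. Results: Random survival forest selected 25 important SNPs. A multivariable Cox proportional hazards model adjusted for clinical covariates (clinical stage, whether surgery was performed, and histopathological type) showed that 4 loci were statistically significant. Cross-validation showed that the model's average accuracy was 83.63%. Conclusion: For high-dimensional association study data, first using random survival forest to remove noise and reduce dimensionality, and then carrying out further analysis, helps in building subsequent prognostic prediction models.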

20.
A significant source of missing data in longitudinal epidemiologic studies on elderly individuals is death. It is generally believed that these missing data by death are non-ignorable to likelihood-based inference. Inference based on data only from surviving participants in the study may lead to biased results. In this paper we model both the probability of disease and the probability of death using shared random effect parameters. We also propose to use the Laplace approximation for obtaining an approximate likelihood function so that high dimensional integration over the distributions of the random effect parameters is not necessary. Parameter estimates can be obtained by maximizing the approximate log-likelihood function. Data from a longitudinal dementia study will be used to illustrate the approach. A small simulation is conducted to compare parameter estimates from the proposed method to the 'naive' method where missing data are considered missing at random.
