Similar articles
 20 similar articles found (search time: 15 ms)
1.
Attempts to predict prognosis in cancer patients using high-dimensional genomic data such as gene expression in tumor tissue can be made difficult by the large number of features and the potential complexity of the relationship between features and the outcome. Integrating prior biological knowledge into risk prediction with such data by grouping genomic features into pathways and networks reduces the dimensionality of the problem and could improve prediction accuracy. Additionally, such knowledge-based models may be more biologically grounded and interpretable. Prediction could potentially be further improved by allowing for complex nonlinear pathway effects. The kernel machine framework has been proposed as an effective approach for modeling the nonlinear and interactive effects of genes in pathways for both censored and noncensored outcomes. When multiple pathways are under consideration, one may efficiently select informative pathways and aggregate their signals via multiple kernel learning (MKL), which has been proposed for prediction of noncensored outcomes. In this paper, we propose MKL methods for censored survival outcomes. We derive our approach for a general survival modeling framework with a convex objective function and illustrate its application under the Cox proportional hazards and semiparametric accelerated failure time models. Numerical studies demonstrate that the proposed MKL-based prediction methods work well in finite samples and can potentially outperform models constructed assuming linear effects or ignoring the group knowledge. The methods are illustrated with an application to two cancer data sets.
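A loose illustration of the idea, not the authors' algorithm: instead of jointly optimizing kernel weights inside the survival objective, the sketch below simply grid-searches a convex combination of two pathway kernels and scores each candidate by the training concordance of a kernel-PCA Cox model. All data, pathway assignments, and tuning values are simulated.

```python
# Simplified sketch of multiple-kernel learning for a Cox model (simulated data).
# Pathway kernel weights are chosen by a grid search on the concordance index,
# rather than by the joint convex optimization described in the abstract.
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.decomposition import KernelPCA
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=(n, 10))            # genes in pathway 1 (simulated)
X2 = rng.normal(size=(n, 15))            # genes in pathway 2 (simulated)
time = rng.exponential(scale=np.exp(-X1[:, 0] ** 2 / 2), size=n)  # nonlinear pathway-1 effect
event = rng.uniform(size=n) < 0.7        # roughly 30% censoring

K1, K2 = rbf_kernel(X1), rbf_kernel(X2)  # one kernel per pathway

best = (None, -np.inf)
for w in np.linspace(0, 1, 11):          # candidate kernel weights
    K = w * K1 + (1 - w) * K2            # combined kernel
    Z = KernelPCA(n_components=5, kernel="precomputed").fit_transform(K)
    df = pd.DataFrame(Z, columns=[f"z{j}" for j in range(5)])
    df["time"], df["event"] = time, event.astype(int)
    cph = CoxPHFitter(penalizer=0.1).fit(df, duration_col="time", event_col="event")
    if cph.concordance_index_ > best[1]:
        best = (w, cph.concordance_index_)

print(f"selected pathway-1 kernel weight: {best[0]:.1f}, C-index: {best[1]:.3f}")
```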

2.
The importance of developing personalized risk prediction estimates has become increasingly evident in recent years. In general, patient populations may be heterogeneous and represent a mixture of different unknown subtypes of disease. When the source of this heterogeneity and resulting subtypes of disease are unknown, accurate prediction of survival may be difficult. However, in certain disease settings, the onset time of an observable short-term event may be highly associated with these unknown subtypes of disease and thus may be useful in predicting long-term survival. One approach to incorporate short-term event information along with baseline markers for the prediction of long-term survival is through a landmark Cox model, which assumes a proportional hazards model for the residual life at a given landmark point. In this paper, we use this modeling framework to develop procedures to assess how a patient's long-term survival trajectory may change over time given good short-term outcome indications along with prognosis on the basis of baseline markers. We first propose time-varying accuracy measures to quantify the predictive performance of landmark prediction rules for residual life and provide resampling-based procedures to make inference about such accuracy measures. Simulation studies show that the proposed procedures perform well in finite samples. Throughout, we illustrate our proposed procedures by using a breast cancer dataset with information on time to metastasis and time to death. In addition to baseline clinical markers available for each patient, a chromosome instability genetic score, denoted by CIN25, is also available for each patient and has been shown to be predictive of survival for various types of cancer. We provide procedures to evaluate the incremental value of CIN25 for the prediction of residual life and examine how the residual life profile changes over time. This allows us to identify an informative landmark point, t0, such that accurate risk predictions of the residual life could be made for patients who survive past t0 without metastasis. Copyright © 2013 John Wiley & Sons, Ltd.
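A minimal sketch of the landmark idea with the lifelines package, assuming simulated data and hypothetical variable names: subjects still alive at the landmark time t0 are retained, an indicator of the short-term event (e.g., metastasis before t0) is added to the baseline marker, and a Cox model is fit to the residual life.

```python
# Minimal sketch of a landmark Cox model for residual life (simulated data).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "marker": rng.normal(size=n),                  # baseline marker (e.g., a CIN25-like score)
    "meta_time": rng.exponential(2.0, size=n),     # time to the short-term event
    "time": rng.exponential(5.0, size=n),          # observed follow-up time
    "event": rng.uniform(size=n) < 0.7,            # death indicator
})

t0 = 2.0                                           # landmark time
risk_set = df[df["time"] > t0].copy()              # subjects still alive at t0
risk_set["short_term_event"] = (risk_set["meta_time"] <= t0).astype(int)
risk_set["residual_time"] = risk_set["time"] - t0  # residual life after t0
risk_set["event"] = risk_set["event"].astype(int)

cph = CoxPHFitter()
cph.fit(risk_set[["marker", "short_term_event", "residual_time", "event"]],
        duration_col="residual_time", event_col="event")
cph.print_summary()                                # effects on residual survival past t0
```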

3.
Some interesting recent studies have shown that neural network models are useful alternatives in modeling survival data when the assumptions of a classical parametric or semiparametric survival model such as the Cox (1972) model are seriously violated. However, to the best of our knowledge, the plausibility of adapting the emerging extreme learning machine (ELM) algorithm for single-hidden-layer feedforward neural networks to survival analysis has not been explored. In this paper, we present a kernel ELM Cox model regularized by an L0-based broken adaptive ridge (BAR) penalization method. Then, we demonstrate that the resulting method, referred to as ELMCoxBAR, can outperform some other state-of-art survival prediction methods such as L1- or L2-regularized Cox regression, random survival forest with various splitting rules, and boosted Cox model, in terms of its predictive performance using both simulated and real world datasets. In addition to its good predictive performance, we illustrate that the proposed method has a key computational advantage over the above competing methods in terms of computation time efficiency using an a real-world ultra–high-dimensional survival data.  相似文献   

4.
Deep learning is a class of machine learning algorithms that are popular for building risk prediction models. When observations are censored, the outcomes are only partially observed and standard deep learning algorithms cannot be directly applied. We develop a new class of deep learning algorithms for outcomes that are potentially censored. To account for censoring, the unobservable loss function used in the absence of censoring is replaced by a censoring unbiased transformation. The resulting class of algorithms can be used to estimate both survival probabilities and restricted mean survival. We show how the deep learning algorithms can be implemented by adapting software for uncensored data by using a form of response transformation. We provide comparisons of the proposed deep learning algorithms to existing risk prediction algorithms for predicting survival probabilities and restricted mean survival through both simulated datasets and analysis of data from breast cancer patients.
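The transformation step can be sketched roughly as follows, assuming an inverse-probability-of-censoring-weighted (IPCW) transformed response for restricted mean survival; the paper's particular censoring unbiased transformation may differ. Once the response is transformed, any off-the-shelf regressor for uncensored data (here a small scikit-learn MLP) can be trained directly.

```python
# Rough sketch of a censoring-unbiased (IPCW) response transformation for
# restricted mean survival, followed by an ordinary neural-network regressor.
import numpy as np
from lifelines import KaplanMeierFitter
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
n, tau = 1000, 5.0
X = rng.normal(size=(n, 5))
true_t = rng.exponential(np.exp(0.5 * X[:, 0]))      # latent event times (simulated)
cens = rng.exponential(4.0, size=n)                   # censoring times
time = np.minimum(true_t, cens)
delta = (true_t <= cens).astype(int)

# Kaplan-Meier estimate of the censoring survival function G(t)
kmf = KaplanMeierFitter().fit(time, event_observed=1 - delta)
t_restricted = np.minimum(time, tau)
observed = (delta == 1) | (time >= tau)               # T truncated at tau is fully observed
G = np.clip(kmf.survival_function_at_times(t_restricted).to_numpy(), 0.05, None)

y_star = observed * t_restricted / G                  # transformed (pseudo) response

# Any regressor trained on y_star targets E[min(T, tau) | X]; here a small MLP.
net = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
net.fit(X, y_star)
print("predicted restricted mean survival, first 5 subjects:", net.predict(X[:5]).round(2))
```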

5.
BACKGROUND/OBJECTIVES: This is the first study to identify common genetic factors associated with the basal metabolic rate (BMR) and body mass index (BMI) in overweight and obese Korean women. It is intended as a basic study for future research on obesity gene-BMR interaction. SUBJECTS/METHODS: The experimental design was 2 by 2, with BMR and BMI as the variables. A genome-wide association study (GWAS) of single nucleotide polymorphisms (SNPs) was conducted in overweight and obese women (BMI > 23 kg/m²) compared with normal-weight women, and in women with low BMR (< 1426.3 kcal/day) compared with high BMR. A total of 140 SNPs reached formal genome-wide statistical significance in this study (P < 1 × 10⁻⁴). Surveys estimating energy intake by the 24-h recall method over three days, questionnaires on family history and physical activity, and a medical examination were conducted. RESULTS: Two NRG3 gene SNPs in the 10q23.1 chromosomal region were highly associated with BMR (rs10786764, P = 8.0 × 10⁻⁷; rs1040675, P = 2.3 × 10⁻⁶) and BMI (rs10786764, P = 2.5 × 10⁻⁵; rs10786764, P = 6.57 × 10⁻⁵). The other genes related to BMI (HSD52, TMA16, MARCH1, NRG1, NRXN3, and STK4) yielded P < 10 × 10⁻⁴. Five new loci associated with BMR and BMI, including NRG3, OR8U8, BCL2L2-PABPN1, PABPN1, and SLC22A17, were identified in obese Korean women (P < 1 × 10⁻⁴). In the questionnaire investigation, significant differences were found between groups in the number of starvation periods per week, family history of stomach cancer, coffee intake, and attempts at weight control. CONCLUSION: We discovered several common BMR- and BMI-related genes using GWAS. Although most of these newly established loci were not previously associated with obesity, they may provide new insights into body weight regulation. Our findings of five common genes associated with BMR and BMI in Koreans will serve as a reference for replication and validation in future studies on the metabolic rate.

6.
Objective: To construct a lung cancer risk prediction model for the Chinese Han population by combining genetic factors and smoking information. Methods: Based on genome-wide association study (GWAS) data from the Chinese Han population, samples were divided by geographic origin into a training set (Nanjing and Shanghai: 1,473 cases vs. 1,962 controls) and a test set (Beijing and Wuhan: 858 cases vs. 1,115 controls). Previously reported lung cancer susceptibility loci were systematically compiled, loci with independent effects were selected in the training set by backward stepwise selection, and a weighted method was used to compute an individual genetic risk score for modeling. Three risk prediction models were built in the training set: a smoking model, a genetic model, and a combined model using both smoking and genetic information. Predictive performance for lung cancer risk was evaluated with receiver operating characteristic (ROC) curves, the area under the curve (AUC), the net reclassification improvement (NRI), and the integrated discrimination improvement (IDI). The constructed models were then validated in the test set. Results: In the training set, the AUCs of the combined, smoking, and genetic models were 0.69 (0.67-0.71), 0.65 (0.63-0.66), and 0.60 (0.59-0.62), respectively. In both the training and test sets, the combined model predicted risk significantly better than the smoking or genetic model (P < 0.001). Reclassification analysis showed that, compared with the smoking model, the combined model increased the NRI by 4.57% (2.23%-6.91%) and the IDI by 3.11% (2.52%-3.69%) in the training set; in the test set, the NRI and IDI increased by 2.77% and 3.16%, respectively. Conclusion: The genetic risk score significantly improves the predictive performance of a conventional lung cancer risk model. The combined genetic-and-smoking risk prediction model can be used to identify individuals at high risk of lung cancer in the Chinese Han population.
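A hedged sketch of the combined-model construction on simulated data: risk-allele counts are weighted by stand-in log odds ratios to form a genetic risk score, the score is added to smoking exposure in a logistic model, and the smoking-only and combined models are compared by AUC. None of the numbers correspond to the study's loci or estimates.

```python
# Weighted genetic risk score + smoking information in a logistic model (simulated data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, n_snp = 4000, 10
snps = rng.binomial(2, 0.3, size=(n, n_snp))            # risk-allele counts (0/1/2)
log_or = rng.normal(0.15, 0.05, size=n_snp)             # stand-in for published log-ORs
grs = snps @ log_or                                      # weighted genetic risk score
packyears = rng.gamma(2.0, 10.0, size=n)                 # smoking exposure
logit = -3 + 0.04 * packyears + 1.0 * (grs - grs.mean())
y = rng.uniform(size=n) < 1 / (1 + np.exp(-logit))       # simulated lung cancer status

Xs = packyears.reshape(-1, 1)                            # smoking-only model
Xc = np.column_stack([packyears, grs])                   # combined model
for name, X in [("smoking", Xs), ("combined", Xc)]:
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
    prob = LogisticRegression(max_iter=1000).fit(Xtr, ytr).predict_proba(Xte)[:, 1]
    print(f"{name} model AUC: {roc_auc_score(yte, prob):.3f}")
```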

7.
Predicting an individual's risk of experiencing a future clinical outcome is a statistical task with important consequences for both practicing clinicians and public health experts. Modern observational databases such as electronic health records provide an alternative to the longitudinal cohort studies traditionally used to construct risk models, bringing with them both opportunities and challenges. Large sample sizes and detailed covariate histories enable the use of sophisticated machine learning techniques to uncover complex associations and interactions, but observational databases are often 'messy', with high levels of missing data and incomplete patient follow-up. In this paper, we propose an adaptation of the well-known Naive Bayes machine learning approach to time-to-event outcomes subject to censoring. We compare the predictive performance of our method with the Cox proportional hazards model which is commonly used for risk prediction in healthcare populations, and illustrate its application to prediction of cardiovascular risk using an electronic health record dataset from a large Midwest integrated healthcare system. Copyright © 2015 John Wiley & Sons, Ltd.
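One plausible way to adapt naive Bayes to censored outcomes (not necessarily the paper's exact construction) is to classify event status by a fixed horizon and correct for censoring with inverse-probability-of-censoring weights, which scikit-learn's GaussianNB accepts as sample weights. A sketch on simulated data:

```python
# IPCW-weighted Gaussian naive Bayes for 5-year risk (simulated data).
import numpy as np
from lifelines import KaplanMeierFitter
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n, horizon = 3000, 5.0
X = rng.normal(size=(n, 4))                               # e.g., EHR covariates (simulated)
true_t = rng.exponential(6 * np.exp(-0.7 * X[:, 0]))
cens = rng.exponential(8.0, size=n)
time, delta = np.minimum(true_t, cens), (true_t <= cens).astype(int)

kmf = KaplanMeierFitter().fit(time, event_observed=1 - delta)  # censoring distribution
G = np.clip(kmf.survival_function_at_times(np.minimum(time, horizon)).to_numpy(), 0.05, None)

label = ((time <= horizon) & (delta == 1)).astype(int)    # event by the horizon
known = (delta == 1) | (time > horizon)                   # status at the horizon is known
w = 1.0 / G                                               # IPCW weights for the known subjects

nb = GaussianNB()
nb.fit(X[known], label[known], sample_weight=w[known])
risk = nb.predict_proba(X)[:, 1]                          # predicted 5-year risk
print("apparent AUC among subjects with known status:",
      round(roc_auc_score(label[known], risk[known]), 3))
```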

8.
目的 开发和验证基于机器学习算法的孕期大于胎龄儿(LGA)风险预测模型,并比较其与传统逻辑回归方法建模的性能差异。方法 研究对象来自"中国免费孕前优生健康检查项目",于2010-2012年在全国31个省市的220个县开展,覆盖全部农村计划妊娠夫妇,本研究选取分娩新生儿胎龄在24~42周内,单胎活产的所有育龄期夫妇及其新生儿为研究对象。应用10种机器学习算法分别建立LGA预测模型,评估模型对LGA的预测性能。结果 最终纳入104 936名新生儿,男婴54 856例(52.3%),女婴50 080例(47.7%),LGA的发生率为11.7%(12 279例)。经过下采样数据平衡处理后,机器学习方法建立模型的整体效能出现明显提高,其中以CatBoost模型在预测LGA风险方面表现最佳,模型的受试者工作特征曲线的曲线下面积(AUC)为0.932;逻辑回归模型表现最差,AUC仅为0.555。结论 与传统的逻辑回归方法相比,通过机器学习算法可建立更有效的孕期LGA风险预测模型,具有潜在的应用价值。  相似文献   

9.
In oncology clinical trials, overall survival, time to progression, and progression-free survival are three commonly used endpoints. Empirical correlations among them have been published for different cancers, but statistical models describing the dependence structures are limited. Recently, Fleischer et al. proposed a statistical model that is mathematically tractable and shows some flexibility to describe the dependencies in a realistic way, based on the assumption of exponential distributions. This paper aims to extend their model to the more flexible Weibull distribution. We derived theoretical correlations among different survival outcomes, as well as the distribution of overall survival induced by the model. Model parameters were estimated by the maximum likelihood method and the goodness of fit was assessed by plotting estimated versus observed survival curves for overall survival. We applied the method to three cancer clinical trials. In the non-small-cell lung cancer trial, both the exponential and the Weibull models provided an adequate fit to the data, and the estimated correlations were very similar under both models. In the prostate cancer trial and the laryngeal cancer trial, the Weibull model exhibited advantages over the exponential model and yielded larger estimated correlations. Simulations suggested that the proposed Weibull model is robust for data generated from a range of distributions. Copyright © 2015 John Wiley & Sons, Ltd.
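The dependence structure can be checked quickly by Monte Carlo under the same illness-death construction with Weibull components: time to progression and time to death without progression compete to define PFS, and a post-progression survival time is added when progression comes first. The shape and scale values below are illustrative, not estimates from any of the trials.

```python
# Monte Carlo sketch of the Weibull extension of the Fleischer-type OS/PFS model.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

def weibull(shape, scale, size):
    return scale * rng.weibull(shape, size)       # Weibull(shape, scale) event times

t_prog = weibull(1.3, 10.0, n)                    # latent time to progression
t_death_direct = weibull(1.1, 18.0, n)            # death without prior progression
t_post_prog = weibull(0.9, 8.0, n)                # survival after progression

pfs = np.minimum(t_prog, t_death_direct)
os_ = np.where(t_prog < t_death_direct, t_prog + t_post_prog, t_death_direct)

print("median PFS:", round(np.median(pfs), 2), " median OS:", round(np.median(os_), 2))
print("Pearson correlation(OS, PFS):", round(np.corrcoef(os_, pfs)[0, 1], 3))
```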

10.
Among the surrogate endpoints for overall survival (OS) in oncology trials, progression-free survival (PFS) is more and more taking the leading role. Although there have been some empirical investigations on the dependence structure between OS and PFS (in particular between the median OS and the median PFS), statistical models are almost nonexistent. This paper aims at filling this gap by introducing an easy-to-handle model based on exponential time-to-event distributions that describes the dependence structure between OS and PFS. Based on this model, explicit formulae for individual correlations are derived together with a lower bound for the correlation of OS and PFS, which is given by the fraction of the two medians for OS and PFS. Two methods on how to estimate the parameter of the model from real data are discussed. One method is based on a maximum-likelihood estimator whereas the other method uses a plug-in approach. Three examples from non-small cell lung cancer are considered. In the first example, the parameters of the model are determined and the estimated survival curve is compared with the observed one. The second example explains how to obtain sample size estimates for OS based on assumptions on median PFS and OS. Finally, the third example provides a way of modelling and quantifying confounding effects that might explain a levelling of differences in OS although a difference in PFS is observed. Copyright © 2009 John Wiley & Sons, Ltd.

11.
Resampling techniques are often used to provide an initial assessment of accuracy for prognostic prediction models developed using high-dimensional genomic data with binary outcomes. Risk prediction is most important, however, in medical applications and frequently the outcome measure is a right-censored time-to-event variable such as survival. Although several methods have been developed for survival risk prediction with high-dimensional genomic data, there has been little evaluation of the use of resampling techniques for the assessment of such models. Using real and simulated datasets, we compared several resampling techniques for their ability to estimate the accuracy of risk prediction models. Our study showed that accuracy estimates for popular resampling methods, such as sample splitting and leave-one-out cross-validation (LOOCV), have a higher mean square error than for other methods. Moreover, the large variability of the split-sample and LOOCV estimates may make the point estimates of accuracy obtained using these methods unreliable and hence they should be interpreted carefully. K-fold cross-validation with k = 5 or 10 was seen to provide a good balance between bias and variability for a wide range of data settings and should be more widely adopted in practice.
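For concreteness, a minimal sketch of the 5-fold variant on simulated high-dimensional data, estimating the cross-validated concordance of a ridge-penalized Cox model with lifelines:

```python
# 5-fold cross-validated C-index for a penalized Cox model (simulated data).
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

rng = np.random.default_rng(6)
n, p = 300, 50
X = rng.normal(size=(n, p))
time = rng.exponential(np.exp(-0.8 * X[:, 0]))
event = (rng.uniform(size=n) < 0.7).astype(int)

df = pd.DataFrame(X, columns=[f"g{j}" for j in range(p)])
df["time"], df["event"] = time, event

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    cph = CoxPHFitter(penalizer=0.5).fit(train, duration_col="time", event_col="event")
    risk = cph.predict_partial_hazard(test)        # higher risk = shorter survival
    scores.append(concordance_index(test["time"], -risk, test["event"]))

print("5-fold cross-validated C-index:", round(np.mean(scores), 3))
```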

12.
Progression-free survival is an increasingly popular end point in oncology clinical trials. A complete blinded independent central review (BICR) is often required by regulators in an attempt to reduce the bias in progression-free survival (PFS) assessment. In this paper, we propose a new methodology that uses a sample-based BICR as an audit tool to decide whether a complete BICR is needed. More specifically, we propose a new index, the differential risk, to measure the reading discordance pattern, and develop a corresponding hypothesis testing procedure to decide whether the bias in local evaluation is acceptable. Simulation results demonstrate that our new index is sensitive to the change of discordance pattern; type I error is well controlled in the hypothesis testing procedure, and the calculated sample size provides the desired power. Copyright © 2016 John Wiley & Sons, Ltd.

13.
The completion of the Human Genome Project (HGP) marked the entry of the life sciences into the genomic era, and genomic epidemiology emerged from the application of genomic methods to epidemiological research. Its main focus is the study, in human populations, of genetic variants (genetic markers) associated with disease onset, disease progression, or health, and their use in disease prevention, treatment, and health promotion.

14.
杨磊  聂艳武  朱凯  周青  蔡雯 《现代预防医学》2021,(18):3270-3276
Objective: To build decision tree, random forest, and support vector machine risk prediction models for nonalcoholic fatty liver disease (NAFLD) in Urumqi, Xinjiang Uygur Autonomous Region, using different machine learning (ML) algorithms, compare them with a classical logistic regression model, and identify the best NAFLD risk prediction model. Methods: 429 patients diagnosed with NAFLD between January 2018 and December 2019 were selected as the case group, and 561 healthy volunteers undergoing physical examination during the same period served as the control group. Environmental factors such as health status, lifestyle, and behavior were examined. Based on these factors, a classical logistic regression model was built, and decision tree, random forest, and support vector machine models were built in R. ROC curves were plotted and the area under the curve (AUC) computed for the four models, and 10-fold cross-validation was used to compare accuracy, sensitivity, specificity, and related metrics. Results: Eleven predictors were included in the models: body mass index (P < 0.001; OR = 14.479; 95% CI: 4.000-52.407), waist-to-hip ratio (P = 0.001; OR = 3.692; 95% CI: 1.713-7.956), passive smoking (P = 0.004; OR = 3.074; 95% CI: 1.426-6.623), type of staple food (P = 0.001; OR = 4.938; 95% CI: 2.004-12.164), hypertension (P = 0.008; OR = 3.601; 95% CI: 1.407-9.219), diabetes (P = 0.018; OR = 4.719; 95% CI: 1.301-17.124), dyslipidemia (P < 0.001; OR = 8.538; 95% CI: 3.582-20.350), frequency of red meat consumption (P < 0.001; OR = 5.923; 95% CI: 2.487-14.106), stress (P = 0.019; OR = 2.466; 95% CI: 1.158-5.252), frequency of fruit consumption (P = 0.034; OR = 0.498; 95% CI: 0.261-0.949), and education level (P = 0.011; OR = 0.444; 95% CI: 0.238-0.828). All four models predicted NAFLD risk well, with accuracy above 0.80, sensitivity above 0.85, Kappa above 0.65, and positive and negative predictive values of at least 0.80. The support vector machine had the highest accuracy (0.852), specificity (0.855), and positive predictive value (0.877), and an AUC (0.9086) second only to that of the logistic regression model. Conclusion: Overall, the support vector machine model has advantages, can effectively predict NAFLD risk, and can better support NAFLD prevention, early treatment, and management.
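The four-model comparison can be outlined in a few lines. The study used R, so the scikit-learn sketch below (simulated data, stand-in predictors) only illustrates the workflow rather than reproducing the study's models.

```python
# 10-fold cross-validated AUC comparison of four classifiers (simulated data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 990
X = rng.normal(size=(n, 11))                     # 11 stand-in predictors (BMI, waist-hip ratio, ...)
y = rng.uniform(size=n) < 1 / (1 + np.exp(-(0.9 * X[:, 0] + 0.6 * X[:, 1] - 0.5)))

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True)),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: 10-fold CV AUC = {auc:.3f}")
```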

15.
Objective: To evaluate the performance of six machine learning models, including extreme gradient boosting (XGBoost), support vector machine (SVM), and naive Bayes, against a conventional logistic regression model in predicting small-for-gestational-age (SGA) infants. Methods: 9,972 pregnant women who delivered at the obstetrics department of the First Hospital of Shanxi Medical University between March 2012 and September 2016 were enrolled; data were collected by questionnaire and from the hospital information system. Subjects were divided into an SGA group (n = 1,124) and a non-SGA group (n = 8,848) according to delivery outcome, and split into training and test sets at a 75:25 ratio. Risk factors were screened with a multivariable logistic regression model, and prediction models were built with XGBoost, SVM, naive Bayes, gradient boosting decision tree (GBDT), k-nearest neighbor (KNN), and conventional logistic regression. Performance was compared using the area under the receiver operating characteristic curve (AUC), accuracy, precision, and related metrics. Results: Logistic regression identified seven variables, including gestational hypertension and eclampsia, as factors influencing SGA. With these factors included, the SVM model performed best, with an AUC of 0.72 and an accuracy of 71%; the conventional logistic regression model performed less well, with an AUC of 0.71 and an accuracy of 66%. Conclusion: SGA risk prediction models built with machine learning algorithms, especially SVM, perform well, can effectively predict the occurrence of SGA in Shanxi Province, and provide a reference for primary prevention of SGA.

16.
When statistical models are used to predict the values of unobserved random variables, loss functions are often used to quantify the accuracy of a prediction. The expected loss over some specified set of occasions is called the prediction error. This paper considers the estimation of prediction error when regression models are used to predict survival times and discusses the use of these estimates. Extending the previous work, we consider both point and confidence interval estimations of prediction error, and allow for variable selection and model misspecification. Different estimators are compared in a simulation study for an absolute relative error loss function, and results indicate that cross-validation procedures typically produce reliable point estimates and confidence intervals, whereas model-based estimates are sensitive to model misspecification. Links between performance measures for point predictors and for predictive distributions of survival times are also discussed. The methodology is illustrated in a medical setting involving survival after treatment for disease. Copyright © 2009 John Wiley & Sons, Ltd.

17.
Objective: Ensemble learning is a newer class of algorithms that has been widely used in machine learning in recent years to improve predictive accuracy. This article introduces the application of super learner-based ensemble learning to predictive modeling of longitudinal censored data and its implementation in R. Methods: The article describes the basic principles of the super learner algorithm, its application to modeling longitudinal censored data, and how to implement such models in R. Next, ...

18.
Given a predictive marker and a time-to-event response variable, the proportion of concordant pairs in a data set is called concordance index. A specifically useful marker is the risk predicted by a survival regression model. This article extends the existing methodology for applications where the length of the follow-up period depends on the predictor variables. A class of inverse probability of censoring weighted estimators is discussed in which the estimates rely on a working model for the conditional censoring distribution. The estimators are consistent for a truncated concordance index if the working model is correctly specified and if the probability of being uncensored at the truncation time is positive. In this framework, all kinds of prediction models can be assessed, and time trends in the discrimination ability of a model can be captured by varying the truncation time point. For illustration, we re-analyze a study on risk prediction for prostate cancer patients. The effects of misspecification of the censoring model are studied in simulated data. Copyright © 2012 John Wiley & Sons, Ltd.
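A sketch of how such a truncated, IPCW-based concordance index can be computed, assuming the scikit-survival package and simulated data; varying the truncation time tau traces the time trend in discrimination described above.

```python
# IPCW (truncated) concordance index for a Cox risk score (simulated data).
import numpy as np
from sksurv.util import Surv
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_ipcw

rng = np.random.default_rng(8)
n = 600
X = rng.normal(size=(n, 3))                              # stand-in prognostic markers
true_t = rng.exponential(6 * np.exp(-0.6 * X[:, 0]))
cens = rng.exponential(8.0, size=n)
time, event = np.minimum(true_t, cens), true_t <= cens

y = Surv.from_arrays(event=event, time=time)
train, test = np.arange(n) < 400, np.arange(n) >= 400
model = CoxPHSurvivalAnalysis().fit(X[train], y[train])
risk = model.predict(X[test])                            # higher = higher risk

for tau in (2.0, 5.0, 8.0):                              # truncation times
    c, *_ = concordance_index_ipcw(y[train], y[test], risk, tau=tau)
    print(f"IPCW C-index truncated at t = {tau}: {c:.3f}")
```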

19.
Objective: To construct combined models based on machine learning (ML) and logistic regression to predict 5-year survival outcomes in colorectal adenocarcinoma. Methods: 12,980 patients were selected from the SEER database, and conventional logistic regression was used to identify factors associated with 5-year survival. Using these factors, combined models were built in which the predicted probabilities from extreme gradient boosting, adaptive boosting, support vector machine, random forest, and regression decision tree served as input variables to a final predictor based on extreme gradient boosting, adaptive boosting, or logistic regression, and the 5-year survival prediction performance of the combined models was compared. Results: Nine factors influenced 5-year survival in colorectal adenocarcinoma patients: age, surgery, chemotherapy, degree of differentiation, T stage, N stage, M stage, CEA status, and marital status. For the combined model logistic + AdaBoost + RF + XGBoost, the AUC, accuracy, and F1 score were 0.861, 0.801, and 0.832 on the internal test set, and 0.833, 0.806, and 0.869 on the external validation set. The combined models outperformed the single models. Conclusion: Combined machine learning models have advantages, can effectively predict 5-year survival in colorectal adenocarcinoma, and can help clinicians formulate diagnosis and treatment plans and optimize cancer prevention and control measures.
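The combined models described here, base learners whose predicted probabilities feed a final learner, correspond to stacking. A scikit-learn sketch on simulated data follows; GradientBoostingClassifier stands in for XGBoost so the example needs only scikit-learn, and the nine prognostic factors are simulated placeholders.

```python
# Stacked ("combined") classifier with a logistic meta-learner (simulated data).
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, StackingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(9)
n = 5000
X = rng.normal(size=(n, 9))                  # 9 stand-in prognostic factors (age, stage, ...)
y = rng.uniform(size=n) < 1 / (1 + np.exp(-(1.2 * X[:, 0] + 0.8 * X[:, 1])))

base = [
    ("gb", GradientBoostingClassifier()),
    ("ada", AdaBoostClassifier()),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("svm", SVC(probability=True)),
    ("tree", DecisionTreeClassifier(max_depth=4)),
]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           stack_method="predict_proba", cv=5)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
stack.fit(Xtr, ytr)
print("stacked model AUC:", round(roc_auc_score(yte, stack.predict_proba(Xte)[:, 1]), 3))
```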

20.
A variety of prediction methods are used to relate high-dimensional genome data with a clinical outcome using a prediction model. Once a prediction model is developed from a data set, it should be validated using a resampling method or an independent data set. Although the existing prediction methods have been intensively evaluated by many investigators, there has not been a comprehensive study investigating the performance of the validation methods, especially with a survival clinical outcome. Understanding the properties of the various validation methods can allow researchers to perform more powerful validations while controlling for type I error. In addition, a sample size calculation strategy based on these validation methods is lacking. We conduct extensive simulations to examine the statistical properties of these validation strategies. In both simulations and a real data example, we found that 10-fold cross-validation with permutation gave the best power while controlling type I error close to the nominal level. Based on this, we have also developed a sample size calculation method for designing a validation study with a user-chosen combination of prediction and validation methods. Microarray and genome-wide association study data are used as illustrations. The power calculation method in this paper can be used for the design of any biomedical studies involving high-dimensional data and survival outcomes.
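The recommended strategy, k-fold cross-validation combined with a permutation null, can be outlined as follows on simulated data: the cross-validated C-index of a penalized Cox model is compared with its distribution under random permutation of the (time, event) pairs. Twenty permutations are used here only for brevity; in practice many more would be drawn.

```python
# 10-fold cross-validated C-index with a permutation-based null (simulated data).
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

rng = np.random.default_rng(10)
n, p = 200, 30
X = rng.normal(size=(n, p))
time = rng.exponential(np.exp(-0.7 * X[:, 0]))
event = (rng.uniform(size=n) < 0.7).astype(int)

def cv_cindex(X, time, event, k=10):
    df = pd.DataFrame(X, columns=[f"g{j}" for j in range(X.shape[1])])
    df["time"], df["event"] = time, event
    scores = []
    for tr, te in KFold(n_splits=k, shuffle=True, random_state=0).split(df):
        cph = CoxPHFitter(penalizer=0.5).fit(df.iloc[tr], duration_col="time", event_col="event")
        risk = cph.predict_partial_hazard(df.iloc[te])
        scores.append(concordance_index(df.iloc[te]["time"], -risk, df.iloc[te]["event"]))
    return np.mean(scores)

observed = cv_cindex(X, time, event)
null = []
for _ in range(20):                                   # permutation null distribution
    perm = rng.permutation(n)
    null.append(cv_cindex(X, time[perm], event[perm]))
p_value = (1 + sum(c >= observed for c in null)) / (1 + len(null))
print(f"observed CV C-index = {observed:.3f}, permutation p = {p_value:.3f}")
```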

