Similar Documents
20 similar documents found.
1.
For many high-dimensional studies, additional information on the variables, like (genomic) annotation or external p-values, is available. In the context of binary and continuous prediction, we develop a method for adaptive group-regularized (logistic) ridge regression, which makes structural use of such 'co-data'. Here, 'groups' refer to a partition of the variables according to the co-data. We derive empirical Bayes estimates of group-specific penalties, which possess several nice properties: (i) They are analytical. (ii) They adapt to the informativeness of the co-data for the data at hand. (iii) Only one global penalty parameter requires tuning by cross-validation. In addition, the method allows use of multiple types of co-data at little extra computational effort. We show that the group-specific penalties may lead to a larger distinction between 'near-zero' and relatively large regression parameters, which facilitates post hoc variable selection. The method, termed GRridge, is implemented in an easy-to-use R package. It is demonstrated on two cancer genomics studies, which both concern the discrimination of precancerous cervical lesions from normal cervix tissues using methylation microarray data. For both examples, GRridge clearly improves the predictive performances of ordinary logistic ridge regression and the group lasso. In addition, we show that for the second study, the relatively good predictive performance is maintained when selecting only 42 variables. Copyright © 2015 John Wiley & Sons, Ltd.
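
The closed-form nature of group-wise ridge penalties is easy to illustrate. Below is a minimal numpy sketch of ridge regression with group-specific penalty multipliers on simulated data; the multipliers are illustrative placeholders, not the empirical Bayes estimates that GRridge itself derives.

```python
import numpy as np

def group_ridge(X, y, groups, lam, multipliers):
    """Solve (X'X + lam * diag(m_{g(j)}))^{-1} X'y for a linear outcome."""
    penalties = lam * np.array([multipliers[g] for g in groups])  # one penalty per feature
    A = X.T @ X + np.diag(penalties)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
beta = np.r_[rng.standard_normal(20), np.zeros(180)]       # group 0 carries the signal
y = X @ beta + rng.standard_normal(50)
groups = np.r_[np.zeros(20, int), np.ones(180, int)]       # partition defined by co-data
# informative group gets a small multiplier, uninformative group a large one
beta_hat = group_ridge(X, y, groups, lam=5.0, multipliers={0: 0.2, 1: 5.0})
```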

2.
Array comparative genomic hybridization (aCGH) provides genome-wide information on DNA copy number that is potentially useful for disease classification. One immediate problem is that the data contain many features (probes) but only a few samples. Existing approaches to overcome this problem include feature selection, ridge regression and partial least squares. However, these methods typically ignore the spatial characteristics of aCGH data. To make explicit use of this spatial information, we develop a procedure called the smoothed logistic regression (SLR) model. The procedure is based on a mixed logistic regression model, where the random component is a mixture distribution that controls smoothness and sparseness. Conceptually such a procedure is straightforward, but its implementation is complicated by computational problems. We develop a fast and reliable iterative weighted least-squares algorithm based on the singular value decomposition. Simulated data and two real data sets are used to illustrate the procedure. For the real data sets, error rates are calculated using the leave-one-out cross-validation procedure. For both the simulated and real data examples, SLR achieves better misclassification error rates than previous methods. Copyright © 2009 John Wiley & Sons, Ltd.

3.
This paper proposes a risk prediction model using semi-varying coefficient multinomial logistic regression. We use a penalized local likelihood method to perform model selection and to estimate both the functional and constant coefficients in the selected model. The model can be used to improve predictive modelling when non-linear interactions between predictors are present. We conduct a simulation study to assess our method's performance, and the results show that the model selection procedure works well, with small average numbers of wrongly selected or missed variables. We illustrate the use of our method by applying it to classify patients with early rheumatoid arthritis at baseline into different risk groups for future disease progression. We use a leave-one-out cross-validation method to assess its correct prediction rate and propose a recalibration framework to evaluate how reliable the predicted risks are. Copyright © 2016 John Wiley & Sons, Ltd.

4.
Polygenic scores (PGS) summarize the genetic contribution of a person's genotype to a disease or phenotype. They can be used to group participants into different risk categories for diseases, and are also used as covariates in epidemiological analyses. A number of possible ways of calculating PGS have been proposed, and recently there has been much interest in methods that incorporate information available in published summary statistics. As there is no inherent information on linkage disequilibrium (LD) in summary statistics, a pertinent question is how we can use LD information available elsewhere to supplement such analyses. To answer this question, we propose a method for constructing PGS using summary statistics and a reference panel in a penalized regression framework, which we call lassosum. We also propose a general method for choosing the value of the tuning parameter in the absence of validation data. In our simulations, pseudovalidation often resulted in prediction accuracy comparable to that obtained using a dataset with validation phenotypes, and was clearly superior to the conservative option of setting the tuning parameter of lassosum to its lowest value. We also showed that lassosum achieved better prediction accuracy than simple clumping and P-value thresholding in almost all scenarios. It was also substantially faster and more accurate than the recently proposed LDpred.
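
The clumping and P-value thresholding comparator mentioned above reduces to scoring genotypes with published effect sizes for variants passing a significance cutoff. A hedged sketch on simulated data follows (LD clumping is omitted for brevity); it is not the lassosum estimator.

```python
import numpy as np

def pgs_pvalue_threshold(genotypes, betas, pvalues, threshold=5e-2):
    """Polygenic score: genotype dosages times summary-statistic effect sizes,
    restricted to variants with a published P-value below the cutoff."""
    keep = pvalues < threshold
    return genotypes[:, keep] @ betas[keep]

rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(100, 500)).astype(float)   # 100 people, 500 SNP dosages
beta_hat = rng.normal(0, 0.05, 500)                        # published effect sizes
pvals = rng.uniform(size=500)                              # published P-values
scores = pgs_pvalue_threshold(G, beta_hat, pvals)
```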

5.
It is now common in clinical practice to make clinical decisions based on combinations of multiple biomarkers. In this paper, we propose new approaches for combining multiple biomarkers linearly to maximize the partial area under the receiver operating characteristic curve (pAUC). The parametric and nonparametric methods that have been developed for this purpose have limitations. When the biomarker values for populations with and without a given disease follow a multivariate normal distribution, it is easy to implement our proposed parametric approach, which adopts an alternative analytic expression of the pAUC. When normality assumptions are violated, a kernel-based approach is presented, which handles multiple biomarkers simultaneously. We evaluated the proposed as well as existing methods through simulations and found that when the covariance matrices for the disease and nondisease samples are disproportional, traditional methods (such as logistic regression) are more likely to fail to maximize the pAUC, while the proposed methods are more robust. The proposed approaches are illustrated through application to a prostate cancer data set, and a rank-based leave-one-out cross-validation procedure is proposed to obtain a realistic estimate of the pAUC when no independent validation set is available.
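
For intuition, the pAUC of any given linear combination can be evaluated empirically by restricting the ROC curve to a false-positive-rate range. The sketch below, on simulated biomarkers with illustrative weights, shows only that evaluation step; it does not implement the proposed maximization.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def partial_auc(score, labels, fpr_max=0.1):
    """Empirical pAUC of a score over false positive rates in [0, fpr_max]."""
    fpr, tpr, _ = roc_curve(labels, score)
    mask = fpr <= fpr_max
    tpr_at_max = np.interp(fpr_max, fpr, tpr)      # interpolate the boundary point
    return auc(np.r_[fpr[mask], fpr_max], np.r_[tpr[mask], tpr_at_max])

rng = np.random.default_rng(2)
y = np.r_[np.ones(60), np.zeros(60)]
X = rng.standard_normal((120, 3)) + 0.8 * y[:, None]   # three biomarkers, shifted in cases
w = np.array([0.5, 0.3, 0.2])                           # candidate linear weights
print(partial_auc(X @ w, y, fpr_max=0.2))
```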

6.
It is well known that measurement error in the covariates of regression models generally causes bias in parameter estimates. Correction for such biases requires information concerning the measurement error, which is often in the form of internal validation or replication data. Regression calibration (RC) is a popular approach to correct for covariate measurement error, which involves predicting the true covariate using error-prone measurements. Likelihood methods have previously been proposed as an alternative approach to estimate the parameters in models affected by measurement error, but have been relatively infrequently employed in medical statistics and epidemiology, partly because of computational complexity and concerns regarding robustness to distributional assumptions. We show how a standard random-intercepts model can be used to obtain maximum likelihood (ML) estimates when the outcome model is linear or logistic regression under certain normality assumptions, when internal error-prone replicate measurements are available. Through simulations we show that for linear regression, ML gives more efficient estimates than RC, although the gain is typically small. Furthermore, we show that RC and ML estimates remain consistent even when the normality assumptions are violated. For logistic regression, our implementation of ML is consistent if the true covariate is conditionally normal given the outcome, in contrast to RC. In simulations, this ML estimator showed less bias in situations where RC gives non-negligible biases. Our proposal makes the ML approach to dealing with covariate measurement error more accessible to researchers, which we hope will improve its viability as a useful alternative to methods such as RC. Copyright © 2009 John Wiley & Sons, Ltd.
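
A minimal regression calibration sketch for the linear-outcome case with two error-prone replicates is shown below; the reliability factor is estimated from the replicates and the calibrated covariate is substituted into the outcome model. Data and parameter values are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(0, 1, n)                       # true covariate (unobserved)
w1 = x + rng.normal(0, 0.7, n)                # error-prone replicate 1
w2 = x + rng.normal(0, 0.7, n)                # error-prone replicate 2
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)       # linear outcome model, true slope = 2

wbar = (w1 + w2) / 2
sigma2_u = np.mean((w1 - w2) ** 2) / 2        # measurement-error variance per replicate
sigma2_x = wbar.var(ddof=1) - sigma2_u / 2    # variance of the true covariate
lam = sigma2_x / (sigma2_x + sigma2_u / 2)    # reliability of the replicate mean
x_hat = wbar.mean() + lam * (wbar - wbar.mean())   # calibrated (predicted) covariate

naive = np.polyfit(wbar, y, 1)[0]             # attenuated slope
rc = np.polyfit(x_hat, y, 1)[0]               # regression-calibration slope, close to 2
print(naive, rc)
```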

7.
The use of individual participant data (IPD) from multiple studies is an increasingly popular approach when developing a multivariable risk prediction model. Corresponding datasets, however, typically differ in important aspects, such as baseline risk. This has driven the adoption of meta-analytical approaches for appropriately dealing with heterogeneity between study populations. Although these approaches provide an averaged prediction model across all studies, little guidance exists about how to apply or validate this model in new individuals or study populations outside the derivation data. We consider several approaches to develop a multivariable logistic regression model from an IPD meta-analysis (IPD-MA) with potential between-study heterogeneity. We also propose strategies for choosing a valid model intercept when the model is to be validated or applied to new individuals or study populations. These strategies can be implemented by the IPD-MA developers or future model validators. Finally, we show how model generalizability can be evaluated when external validation data are lacking using internal–external cross-validation, and we extend our framework to count and time-to-event data. In an empirical evaluation, our results show how stratified estimation allows study-specific model intercepts, which can then inform the intercept to be used when applying the model in practice, even to a population not represented by included studies. In summary, our framework allows the development (through stratified estimation), implementation in new individuals (through focused intercept choice), and evaluation (through internal–external validation) of a single, integrated prediction model from an IPD-MA in order to achieve improved model performance and generalizability. Copyright © 2013 John Wiley & Sons, Ltd.
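
The internal-external cross-validation idea can be sketched as follows: each study in the IPD meta-analysis is left out in turn, a logistic model with study-specific (stratified) intercepts is fitted to the remaining studies, and the held-out study is predicted using a chosen intercept, here simply the average of the fitted ones. The simulated data and this particular intercept choice are illustrative assumptions, not the paper's recommended strategy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n_stud = 5
study = np.repeat(np.arange(n_stud), 200)                     # 5 studies, 200 subjects each
X = rng.standard_normal((n_stud * 200, 3))
alpha = rng.normal(-0.5, 0.7, n_stud)[study]                  # heterogeneous baseline risk
y = rng.binomial(1, 1 / (1 + np.exp(-(alpha + X @ np.array([1.0, 0.5, 0.0])))))
D = np.eye(n_stud)[study]                                     # study indicator columns

for s in range(n_stud):
    train, test = study != s, study == s
    cols = [k for k in range(n_stud) if k != s]
    Z = np.c_[X, D[:, cols]]                                  # stratified study intercepts
    fit = LogisticRegression(C=1e6, fit_intercept=False, max_iter=5000).fit(Z[train], y[train])
    beta, intercepts = fit.coef_[0, :3], fit.coef_[0, 3:]
    lin = X[test] @ beta + intercepts.mean()                  # average intercept for the new study
    print(f"left-out study {s}: AUC = {roc_auc_score(y[test], lin):.2f}")
```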

8.
The validity of prognostic models is an important prerequisite for their applicability in practical clinical settings. Here, we report on a specific prognostic study on stroke patients and describe how we explored the prediction performance of our model. We considered two practically highly relevant generalization aspects, namely, the model's performance in patients recruited at a later time point (temporal transportability) and in medical centers different from those used for model building (geographic transportability). To estimate the accuracy of the model, we investigated classical internal validation techniques and leave-one-center-out cross validation (CV). Prognostic models predicting functional independence of stroke patients were developed in a training set using logistic regression, support vector machines, and random forests (RFs). Tenfold CV and leave-one-center-out CV were employed to estimate temporal and geographic transportability of the models. For temporal and external validation, the resulting models were used to classify patients from a later time point and from different clinics. When applying the regression model or the RFs, accuracy in the temporal validation data was well predicted from classical internal validation. However, when predicting geographic transportability, all approaches had difficulties. We observed that the leave-one-center-out CV yielded better estimates than classical CV. On the basis of our results, we conclude that external validation in patients from different clinics is required before a prognostic model can be applied in practice. Even validating the model in patients recruited merely at a later time point does not suffice to predict how it may fare in another clinic.
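
Leave-one-center-out cross-validation is directly available through scikit-learn's LeaveOneGroupOut splitter. The short sketch below uses simulated centers and a random forest; the per-center accuracies stand in for the geographic transportability assessment described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(5)
center = rng.integers(0, 6, 600)                                  # 6 recruiting centers
X = rng.standard_normal((600, 8)) + 0.3 * center[:, None]         # center-specific shift
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.2 * center))))  # outcome depends on center too

scores = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                         X, y, groups=center, cv=LeaveOneGroupOut(), scoring="accuracy")
print(scores)   # one accuracy per held-out center (geographic transportability)
```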

9.
An important question for clinicians appraising a meta-analysis is: are the findings likely to be valid in their own practice—does the reported effect accurately represent the effect that would occur in their own clinical population? To this end we advance the concept of statistical validity—where the parameter being estimated equals the corresponding parameter for a new independent study. Using a simple ('leave-one-out') cross-validation technique, we demonstrate how we may test meta-analysis estimates for statistical validity using a new validation statistic, Vn, and derive its distribution. We compare this with the usual approach of investigating heterogeneity in meta-analyses and demonstrate the link between statistical validity and homogeneity. Using a simulation study, the properties of Vn and the Q statistic are compared for univariate random effects meta-analysis and a tailored meta-regression model, where information from the setting (included as model covariates) is used to calibrate the summary estimate to the setting of application. Their properties are found to be similar when there are 50 studies or more, but for fewer studies Vn has greater power but a higher type 1 error rate than Q. The power and type 1 error rate of Vn are also shown to depend on the within-study variance, between-study variance, study sample size, and the number of studies in the meta-analysis. Finally, we apply Vn to two published meta-analyses and conclude that it usefully augments standard methods when deciding upon the likely validity of summary meta-analysis estimates in clinical practice. © 2017 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
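
The leave-one-out construction can be sketched by recomputing a DerSimonian-Laird random-effects summary with each study removed and standardizing the discrepancy of the left-out estimate. The statistic printed below is a simple illustrative discrepancy, not the exact Vn of the paper.

```python
import numpy as np

def dersimonian_laird(theta, v):
    """Random-effects summary: returns pooled estimate, its variance, and tau^2."""
    w = 1 / v
    fixed = np.sum(w * theta) / w.sum()
    q = np.sum(w * (theta - fixed) ** 2)
    tau2 = max(0.0, (q - (len(theta) - 1)) / (w.sum() - np.sum(w ** 2) / w.sum()))
    wstar = 1 / (v + tau2)
    return np.sum(wstar * theta) / wstar.sum(), 1 / wstar.sum(), tau2

rng = np.random.default_rng(6)
k = 12
v = rng.uniform(0.02, 0.1, k)                   # within-study variances
theta = rng.normal(0.4, np.sqrt(v + 0.05))      # study estimates, true tau^2 = 0.05

for i in range(k):
    keep = np.arange(k) != i
    mu, var_mu, tau2 = dersimonian_laird(theta[keep], v[keep])
    z = (theta[i] - mu) / np.sqrt(v[i] + tau2 + var_mu)
    print(f"study {i}: standardized leave-one-out discrepancy = {z:.2f}")
```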

10.
The use of longitudinal measurements to predict a categorical outcome is an increasingly common goal in research studies. Joint models are commonly used to describe two or more models simultaneously by considering the correlated nature of their outcomes and the random error present in the longitudinal measurements. However, there is limited research on joint models with longitudinal predictors and categorical cross-sectional outcomes. Perhaps the most challenging task is how to model the longitudinal predictor process such that it represents the true biological mechanism that dictates the association with the categorical response. We propose a joint logistic regression and Markov chain model to describe a binary cross-sectional response, where the unobserved transition rates of a two-state continuous-time Markov chain are included as covariates. We use the method of maximum likelihood to estimate the parameters of our model. In a simulation study, coverage probabilities of about 95%, standard deviations close to standard errors, and low biases for the parameter values show that our estimation method is adequate. We apply the proposed joint model to a dataset of patients with traumatic brain injury to describe and predict a 6-month outcome based on physiological data collected post-injury and admission characteristics. Our analysis indicates that the information provided by physiological changes over time may help improve prediction of long-term functional status of these severely ill subjects. Copyright © 2017 John Wiley & Sons, Ltd.

11.
Causal inference practitioners are routinely presented with the challenge of model selection and, in particular, reducing the size of the covariate set with the goal of improving estimation efficiency. Collaborative targeted minimum loss-based estimation (CTMLE) is a general framework for constructing doubly robust semiparametric causal estimators that data-adaptively limit model complexity in the propensity score to optimize a preferred loss function. This stepwise complexity reduction is based on a loss function placed on a strategically updated model for the outcome variable, through which the error is assessed using cross-validation. We demonstrate how the existing stepwise variable selection CTMLE can be generalized using regression shrinkage of the propensity score. We present 2 new algorithms that involve stepwise selection of the penalization parameter(s) in the regression shrinkage. Simulation studies demonstrate that, under a misspecified outcome model, mean squared error and bias can be reduced by a CTMLE procedure that separately penalizes individual covariates in the propensity score. We demonstrate these approaches in an example using electronic medical data with sparse indicator covariates to evaluate the relative safety of 2 similarly indicated asthma therapies for pregnant women with moderate asthma.

12.
Advances in high-throughput technologies in genomics and imaging yield unprecedentedly large numbers of prognostic biomarkers. To accommodate the scale of biomarkers and study their association with disease outcomes, penalized regression is often used to identify important biomarkers. The ideal variable selection procedure would search for the best subset of predictors, which is equivalent to imposing an ℓ0-penalty on the regression coefficients. Since this optimization is a nondeterministic polynomial-time hard (NP-hard) problem that does not scale with the number of biomarkers, alternative methods mostly place smooth penalties on the regression parameters, which lead to computationally feasible optimization problems. However, empirical studies and theoretical analyses show that convex approximations of the ℓ0-norm (e.g., ℓ1) do not outperform their ℓ0 counterpart. Progress on ℓ0-norm feature selection has been relatively slow, with the main methods being greedy algorithms such as stepwise regression or orthogonal matching pursuit. Penalized regression based on regularizing the ℓ0-norm remains much less explored in the literature. In this work, inspired by the recently popular augmenting and data-splitting algorithms, including the alternating direction method of multipliers, we propose a 2-stage procedure for ℓ0-penalty variable selection, referred to as augmented penalized minimization-L0 (APM-L0). APM-L0 targets the ℓ0-norm as closely as possible while keeping computation tractable, efficient, and simple, which is achieved by iterating between a convex regularized regression and a simple hard-thresholding estimation. The procedure can be viewed as arising from regularized optimization with a truncated ℓ1 norm. Thus, we propose to treat the regularization parameter and the thresholding parameter as tuning parameters and select them based on cross-validation. A 1-step coordinate descent algorithm is used in the first stage to significantly improve computational efficiency. Through extensive simulation studies and real data application, we demonstrate superior performance of the proposed method in terms of selection accuracy and computational speed compared with existing methods. The proposed APM-L0 procedure is implemented in the R package APML0.
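
The alternation between a convex regularized fit and hard thresholding can be illustrated with a toy scheme that keeps the k largest coefficients and refits by ridge regression; this is only in the spirit of APM-L0 and is not the algorithm implemented in APML0.

```python
import numpy as np

def ridge_then_threshold(X, y, lam=1.0, k=10, n_iter=20):
    """Alternate a ridge fit on the active set with hard-thresholding to the top k coefficients."""
    p = X.shape[1]
    active = np.arange(p)
    beta = np.zeros(p)
    for _ in range(n_iter):
        Xa = X[:, active]
        b = np.linalg.solve(Xa.T @ Xa + lam * np.eye(len(active)), Xa.T @ y)  # convex (ridge) step
        beta[:] = 0.0
        beta[active] = b
        active = np.argsort(-np.abs(beta))[:k]        # hard-threshold: keep the k largest
    return beta

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 300))
true = np.zeros(300)
true[:5] = [3, -2, 2, 1.5, -1.5]
y = X @ true + rng.standard_normal(100)
beta_hat = ridge_then_threshold(X, y, lam=2.0, k=5)
print(np.nonzero(beta_hat)[0])                        # ideally recovers indices 0..4
```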

13.
To date, numerous genetic variants have been identified as associated with diverse phenotypic traits. However, identified associations generally explain only a small proportion of trait heritability, and the predictive power of models incorporating only known-associated variants has been small. Multiple regression is a popular framework in which to consider the joint effect of many genetic variants simultaneously. Ordinary multiple regression is seldom appropriate in the context of genetic data, due to the high dimensionality of the data and the correlation structure among the predictors. There has been a resurgence of interest in the use of penalised regression techniques to circumvent these difficulties. In this paper, we focus on ridge regression, a penalised regression approach that has been shown to offer good performance in multivariate prediction problems. One challenge in the application of ridge regression is the choice of the ridge parameter that controls the amount of shrinkage of the regression coefficients. We present a method to determine the ridge parameter based on the data, with the aim of good performance in high-dimensional prediction problems. We establish a theoretical justification for our approach, and demonstrate its performance on simulated genetic data and on a real data example. Fitting a ridge regression model to hundreds of thousands to millions of genetic variants simultaneously presents computational challenges. We have developed an R package, ridge, which addresses these issues. The package implements the automatic choice of ridge parameter presented in this paper, and is freely available from CRAN.
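
To illustrate the tuning problem, the sketch below selects the ridge parameter by efficient leave-one-out cross-validation with scikit-learn's RidgeCV on simulated SNP-like data; the paper's own automatic, data-based rule (implemented in the R package ridge) is a different method.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(8)
n, p = 300, 2000                                            # many more predictors than samples
G = rng.binomial(2, 0.25, size=(n, p)).astype(float)        # SNP-like 0/1/2 predictors
beta = np.zeros(p)
beta[rng.choice(p, 50, replace=False)] = rng.normal(0, 0.3, 50)   # 50 causal variants
y = G @ beta + rng.standard_normal(n)

fit = RidgeCV(alphas=np.logspace(-2, 4, 25)).fit(G, y)      # grid of candidate ridge parameters
print("selected ridge parameter:", fit.alpha_)
```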

14.
Measurement error is common in epidemiological and biomedical studies. When biomarkers are measured in batches or groups, measurement error is potentially correlated within each batch or group. In regression analysis, most existing methods are not applicable in the presence of batch-specific measurement error in predictors. We propose a robust conditional likelihood approach to account for batch-specific error in predictors when the batch effect is additive and the predominant source of error, which requires no assumptions on the distribution of measurement error. Although a regression model with batch as a categorical covariate yields the same parameter estimates as the proposed conditional likelihood approach for linear regression, this result does not hold in general for all generalized linear models, in particular, logistic regression. Our simulation studies show that the conditional likelihood approach achieves better finite sample performance than the regression calibration approach or a naive approach without adjustment for measurement error. In the case of logistic regression, our proposed approach is shown to also outperform the regression approach with batch as a categorical covariate. In addition, we also examine a 'hybrid' approach combining the conditional likelihood method and the regression calibration method, which is shown in simulations to achieve good performance in the presence of both batch-specific and measurement-specific errors. We illustrate our method by using data from a colorectal adenoma study. Copyright © 2012 John Wiley & Sons, Ltd.
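
The linear-regression special case noted above is easy to verify numerically: with an additive batch-specific error, centering predictor and outcome within each batch removes the batch error and recovers the slope, whereas the naive fit is attenuated. A simulated sketch follows.

```python
import numpy as np

rng = np.random.default_rng(9)
batch = np.repeat(np.arange(20), 10)                  # 20 batches of 10 samples
x_true = rng.standard_normal(200)
x_obs = x_true + rng.normal(0, 1.0, 20)[batch]        # additive batch-specific error
y = 2.0 * x_true + rng.standard_normal(200)           # true slope = 2

def center_within_batch(v, batch):
    out = v.astype(float)
    for b in np.unique(batch):
        out[batch == b] -= v[batch == b].mean()
    return out

naive = np.polyfit(x_obs, y, 1)[0]                    # attenuated toward 0
within = np.polyfit(center_within_batch(x_obs, batch),
                    center_within_batch(y, batch), 1)[0]   # close to 2
print(naive, within)
```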

15.
Prediction of an outcome for a given unit based on prediction models built on a training sample plays a major role in many research areas. In current practice, the uncertainty of the prediction is predominantly characterized by the subject sampling variation, where prediction models built on hypothetically re-sampled units yield variable predictions for the same unit of interest. It is almost always true that the predictors used to build prediction models are simply a subset of the entirety of factors related to the outcome. Following the frequentist principle, we can also account for the variation due to hypothetically re-sampled predictors used to build the prediction models. This is particularly important in medicine, where the prediction can have important and sometimes life-or-death consequences for a patient's health status. In this article, we discuss some rationale along this line in the context of medicine. We propose a simple approach to estimate the standard error of the prediction that accounts for the variation due to sampling both subjects and predictors under logistic and Cox regression models. A simulation study is presented to support our argument and demonstrate the performance of our method. The concept and method are applied to a real data set. Copyright © 2015 John Wiley & Sons, Ltd.
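
A toy bootstrap can convey the idea: the prediction for one new patient is recomputed while resampling both the subjects and the subset of predictors used, and the spread of those predictions gives a standard error reflecting both sources of variation. This is an illustrative simulation, not the paper's estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(10)
n, p = 300, 10
X = rng.standard_normal((n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, :4].sum(axis=1))))   # outcome driven by 4 predictors
x_new = rng.standard_normal(p)                                  # one new patient

preds = []
for _ in range(200):
    rows = rng.integers(0, n, n)                     # resample subjects
    cols = np.sort(rng.choice(p, 6, replace=False))  # resample which predictors are used
    m = LogisticRegression(max_iter=1000).fit(X[rows][:, cols], y[rows])
    preds.append(m.predict_proba(x_new[cols].reshape(1, -1))[0, 1])
print("prediction SE over subjects and predictors:", np.std(preds))
```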

16.
Despite the extensive discovery of disease-associated common variants, much of the genetic contribution to complex traits remains unexplained. Rare variants may explain additional disease risk or trait variability. Although sequencing technology provides an unprecedented opportunity to investigate the roles of rare variants in complex diseases, detection of these variants in sequencing-based association studies presents substantial challenges. In this article, we propose novel statistical tests of the association between rare and common variants in a genomic region and a complex trait of interest based on cross-validation prediction error (PE). We first propose a PE method based on ridge regression. Building on PE, we also propose two further tests, PE-WS and PE-TOW, which test a weighted combination of variants under two different weighting schemes. PE-WS is the PE version of the test based on the weighted sum statistic (WS), and PE-TOW is the PE version of the test based on the optimally weighted combination of variants (TOW). Using extensive simulation studies, we show that (1) PE-TOW and PE-WS are consistently more powerful than TOW and WS, respectively, and (2) PE is the most powerful test when the causal variants include both common and rare variants.
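
A basic cross-validation prediction error test can be sketched as follows: the CV error of a ridge model built on the region's variants is compared with its permutation distribution. The weighting schemes behind PE-WS and PE-TOW are not reproduced here, and the data are simulated.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
maf = rng.uniform(0.005, 0.3, 60)                              # mix of rare and common variants
G = rng.binomial(2, maf, size=(400, 60)).astype(float)
y = G[:, :3] @ np.array([0.4, -0.3, 0.5]) + rng.standard_normal(400)   # 3 causal variants

def cv_error(G, y):
    """5-fold cross-validated mean squared prediction error of a ridge model."""
    return -cross_val_score(Ridge(alpha=1.0), G, y, cv=5,
                            scoring="neg_mean_squared_error").mean()

observed = cv_error(G, y)
perm = np.array([cv_error(G, rng.permutation(y)) for _ in range(200)])
print("permutation p-value:", np.mean(perm <= observed))       # small error => association
```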

17.
When statistical models are used to predict the values of unobserved random variables, loss functions are often used to quantify the accuracy of a prediction. The expected loss over some specified set of occasions is called the prediction error. This paper considers the estimation of prediction error when regression models are used to predict survival times, and discusses the use of these estimates. Extending previous work, we consider both point and confidence interval estimation of prediction error, and allow for variable selection and model misspecification. Different estimators are compared in a simulation study for an absolute relative error loss function, and the results indicate that cross-validation procedures typically produce reliable point estimates and confidence intervals, whereas model-based estimates are sensitive to model misspecification. Links between performance measures for point predictors and for predictive distributions of survival times are also discussed. The methodology is illustrated in a medical setting involving survival after treatment for disease. Copyright © 2009 John Wiley & Sons, Ltd.

18.
To study significant predictors of condom use in HIV-infected adults, we propose the use of generalized partially linear models and develop a variable selection procedure incorporating a least squares approximation. Local polynomial regression and spline smoothing techniques are used to estimate the baseline nonparametric function. The asymptotic normality of the resulting estimate is established. We further demonstrate that, with the proper choice of the penalty functions and the regularization parameter, the resulting estimate performs as well as an oracle procedure. Finite sample performance of the proposed inference procedure is assessed by Monte Carlo simulation studies. An application to condom use by HIV-infected patients yields some interesting results that cannot be obtained when an ordinary logistic model is used.

19.
A variety of prediction methods are used to relate high-dimensional genome data to a clinical outcome using a prediction model. Once a prediction model is developed from a data set, it should be validated using a resampling method or an independent data set. Although the existing prediction methods have been intensively evaluated by many investigators, there has not been a comprehensive study investigating the performance of the validation methods, especially with a survival clinical outcome. Understanding the properties of the various validation methods can allow researchers to perform more powerful validations while controlling the type I error. In addition, a sample size calculation strategy based on these validation methods is lacking. We conduct extensive simulations to examine the statistical properties of these validation strategies. In both simulations and a real data example, we found that 10-fold cross-validation with permutation gave the best power while controlling the type I error close to the nominal level. Based on this, we have also developed a sample size calculation method for designing a validation study with a user-chosen combination of prediction and validation methods. Microarray and genome-wide association study data are used as illustrations. The power calculation method presented here can be used for the design of any biomedical study involving high-dimensional data and survival outcomes.

20.
Inverse probability weights used to fit marginal structural models are typically estimated using logistic regression. However, a data-adaptive procedure may be able to better exploit information available in measured covariates. By combining predictions from multiple algorithms, ensemble learning offers an alternative to logistic regression modeling to further reduce bias in estimated marginal structural model parameters. We describe the application of two ensemble learning approaches to estimating stabilized weights: super learning (SL), an ensemble machine learning approach that relies on V-fold cross validation, and an ensemble learner (EL) that creates a single partition of the data into training and validation sets. Longitudinal data from two multicenter cohort studies in Spain (CoRIS and CoRIS-MD) were analyzed to estimate the mortality hazard ratio for initiation versus no initiation of combined antiretroviral therapy among HIV positive subjects. Both ensemble approaches produced hazard ratio estimates further away from the null, and with tighter confidence intervals, than logistic regression modeling. Computation time for EL was less than half that of SL. We conclude that ensemble learning using a library of diverse candidate algorithms offers an alternative to parametric modeling of inverse probability weights when fitting marginal structural models. With large datasets, EL provides a rich search over the solution space in less time than SL, with comparable results. Copyright © 2014 John Wiley & Sons, Ltd.
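
For illustration, stabilized weights with an ensemble propensity model can be sketched in a simple point-treatment setting using scikit-learn's StackingClassifier (stacking with V-fold cross validation); the study above concerns longitudinal marginal structural models and a larger candidate library, so this is only a schematic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(12)
n = 2000
L = rng.standard_normal((n, 4))                                     # measured covariates
A = rng.binomial(1, 1 / (1 + np.exp(-(L[:, 0] + 0.5 * L[:, 1]))))   # treatment assignment

# ensemble (stacked) propensity model: logistic regression + random forest, V-fold CV
ensemble = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=200, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000), cv=5)
p_treat = ensemble.fit(L, A).predict_proba(L)[:, 1]

# stabilized weights: marginal treatment probability over conditional probability
p_marg = A.mean()
sw = np.where(A == 1, p_marg / p_treat, (1 - p_marg) / (1 - p_treat))
print("mean stabilized weight (should be close to 1):", sw.mean())
```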
