Similar Documents
20 similar documents found (search time: 281 ms)
1.
Variable selection and Bayesian model averaging in case-control studies.   (Cited by 5: 0 self-citations, 5 by others)
Covariate and confounder selection in case-control studies is often carried out using a statistical variable selection method, such as a two-step method or a stepwise method in logistic regression. Inference is then carried out conditionally on the selected model, but this ignores the model uncertainty implicit in the variable selection process, and so may underestimate uncertainty about relative risks. We report on a simulation study designed to be similar to actual case-control studies. This shows that p-values computed after variable selection can greatly overstate the strength of conclusions. For example, for our simulated case-control studies with 1000 subjects, of variables declared to be 'significant' with p-values between 0.01 and 0.05, only 49 per cent actually were risk factors when stepwise variable selection was used. We propose Bayesian model averaging as a formal way of taking account of model uncertainty in case-control studies. This yields an easily interpreted summary, the posterior probability that a variable is a risk factor, and our simulation study indicates this to be reasonably well calibrated in the situations simulated. The methods are applied and compared in the context of a case-control study of cervical cancer.
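The post-selection inflation described above is easy to reproduce. The following sketch (sample sizes, exposure frequencies, and the number of true risk factors are illustrative assumptions, not values from the paper) screens candidate binary exposures with univariate chi-square tests and shows noise variables surfacing as "significant":

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_cases = n_controls = 500
n_vars, n_true = 20, 5                     # first 5 are real risk factors

# Exposure probabilities: the true risk factors are more common among cases.
p_cases = np.r_[np.full(n_true, 0.5), np.full(n_vars - n_true, 0.3)]
p_controls = np.full(n_vars, 0.3)
cases = rng.random((n_cases, n_vars)) < p_cases
controls = rng.random((n_controls, n_vars)) < p_controls

pvals = np.empty(n_vars)
for j in range(n_vars):
    table = [[cases[:, j].sum(), n_cases - cases[:, j].sum()],
             [controls[:, j].sum(), n_controls - controls[:, j].sum()]]
    pvals[j] = chi2_contingency(table)[1]

selected = np.where(pvals < 0.05)[0]       # "significant" after screening
false_hits = [int(j) for j in selected if j >= n_true]
print("selected variables:", selected.tolist())
print("selected but not risk factors:", false_hits)
```

With many null candidates, some clear p < 0.05 by chance alone; conditioning inference on the selected set is exactly what the paper warns against.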

2.
OBJECTIVES: Automated variable selection methods are frequently used to determine the independent predictors of an outcome. The objective of this study was to determine the reproducibility of logistic regression models developed using automated variable selection methods. STUDY DESIGN AND SETTING: An initial set of 29 candidate variables was considered for predicting mortality after acute myocardial infarction (AMI). We drew 1,000 bootstrap samples from a dataset consisting of 4,911 patients admitted to hospital with an AMI. Using each bootstrap sample, logistic regression models predicting 30-day mortality were obtained using backward elimination, forward selection, and stepwise selection. The agreement between the different model selection methods and the agreement across the 1,000 bootstrap samples were compared. RESULTS: Using 1,000 bootstrap samples, backward elimination identified 940 unique models for predicting mortality. Similar results were obtained for forward and stepwise selection. Three variables were identified as independent predictors of mortality among all bootstrap samples. Over half the candidate prognostic variables were identified as independent predictors in less than half of the bootstrap samples. CONCLUSION: Automated variable selection methods result in models that are unstable and not reproducible. The variables selected as independent predictors are sensitive to random fluctuations in the data.
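The instability reported here can be demonstrated with ordinary least squares standing in for logistic regression (a simplification; the effect sizes, dimensions, and number of bootstrap replicates below are illustrative, not the study's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 300, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [0.40, 0.25, 0.15]          # three true predictors, the last one weak
y = X @ beta + rng.standard_normal(n)

def backward_eliminate(X, y, alpha=0.05):
    """OLS backward elimination on Wald p-values; the intercept is always kept."""
    cols = list(range(X.shape[1]))
    while cols:
        Xc = np.column_stack([np.ones(len(y)), X[:, cols]])
        b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        resid = y - Xc @ b
        df = len(y) - Xc.shape[1]
        sigma2 = resid @ resid / df
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xc.T @ Xc)))
        pvals = 2 * stats.t.sf(np.abs(b / se), df)[1:]   # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break
        cols.pop(worst)                # drop the least significant variable
    return tuple(cols)

B = 200
models = [backward_eliminate(X[idx], y[idx])
          for idx in (rng.integers(0, n, n) for _ in range(B))]
unique_models = set(models)
print(f"{len(unique_models)} distinct final models in {B} bootstrap samples")
```

Even with only ten candidates, the bootstrap samples disagree on the final model, mirroring the 940 unique models found in the study.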

3.
Identifying factors that affect mortality requires a robust statistical approach. This study's objective is to assess an optimal set of variables that are independently associated with the mortality risk of 433 older comorbid adults who had been discharged from the geriatric ward. We used both the stepwise backward variable selection and the iterative Bayesian model averaging (BMA) approaches to Cox proportional hazards models. Potential predictors of mortality were drawn from a broad range of clinical data and functional and laboratory tests, including the geriatric nutritional risk index (GNRI), lymphocyte count, vitamin D plasma level, and the age-weighted Charlson comorbidity index. The results of the multivariable analysis identified seven explanatory variables that are independently associated with the length of survival. The mortality rate was higher in males than in females; it increased with the comorbidity level and C-reactive protein plasma level but was negatively associated with a person's mobility, GNRI, lymphocyte count, and vitamin D plasma level.

4.
In developing regression models, data analysts are often faced with many predictor variables that may influence an outcome variable. After more than half a century of research, the 'best' way of selecting a multivariable model is still unresolved. It is generally agreed that subject matter knowledge, when available, should guide model building. However, such knowledge is often limited, and data-dependent model building is required. We limit the scope of the modelling exercise to selecting important predictors and choosing interpretable and transportable functions for continuous predictors. Assuming linear functions, stepwise selection and all-subset strategies are discussed; the key tuning parameters are the nominal P-value for testing a variable for inclusion and the penalty for model complexity, respectively. We argue that stepwise procedures perform better than a literature-based assessment would suggest. Concerning selection of functional form for continuous predictors, the principal competitors are fractional polynomial functions and various types of spline techniques. We note that a rigorous selection strategy known as multivariable fractional polynomials (MFP) has been developed. No spline-based procedure for simultaneously selecting variables and functional forms has found wide acceptance. Results of FP and spline modelling are compared in two data sets. It is shown that spline modelling, while extremely flexible, can generate fitted curves with uninterpretable 'wiggles', particularly when automatic methods for choosing the smoothness are employed. We give general recommendations to practitioners for carrying out variable and function selection. While acknowledging that further research is needed, we argue why MFP is our preferred approach for multivariable model building with continuous covariates.
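A minimal sketch of the first-degree fractional polynomial (FP1) selection step underlying MFP: each candidate power from the conventional set {-2, -1, -0.5, 0, 0.5, 1, 2, 3} (with 0 denoting log) is fitted by least squares and the best-fitting power is kept. The data-generating curve and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0.1, 10.0, n)
y = np.log(x) + 0.1 * rng.standard_normal(n)      # true curve: log(x)

# FP1 candidate powers; by convention x**0 denotes log(x).
powers = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]

def fp1_rss(p):
    """Residual sum of squares for the FP1 model y ~ a + b * x**p."""
    t = np.log(x) if p == 0 else x ** p
    X = np.column_stack([np.ones(n), t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

rss = {p: fp1_rss(p) for p in powers}
best_power = min(rss, key=rss.get)                # deviance comparison
print("selected FP1 power:", best_power)
```

Because the candidate set is small and the fitted transformations are simple monotone functions, the resulting curves stay interpretable, in contrast to the spline "wiggles" the abstract mentions.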

5.
Some research studies in the medical literature use multiple stepwise variable selection (SVS) algorithms to build multivariable models. The purpose of this study is to determine whether the use of multiple SVS algorithms in tandem (stepwise agreement) is a valid variable selection procedure. Computer simulations were developed to address stepwise agreement. Three popular SVS algorithms were tested (backward elimination, forward selection, and stepwise) on three statistical methods (linear, logistic, and Cox proportional hazards regression). Other simulation parameters explored were the sample size, number of predictors considered, degree of correlation between pairs of predictors, p-value-based entrance and exit criteria, predictor type (normally distributed or binary), and differences in stepwise agreement between any two or all three algorithms. For the stepwise methods, we measured the rate of agreement on a final model, agreement on a model including only those predictors truly associated with the outcome, and agreement on a model containing all predictors truly associated with the outcome. These rates were dependent on all simulation parameters. Mostly, the SVS algorithms agreed on a final model, but rarely on a model with only the true predictors. Sample size and candidate predictor pool size are the most influential simulation conditions. To conclude, stepwise agreement is often a poor strategy that gives misleading results, and researchers should avoid using multiple SVS algorithms to build multivariable models. More research on the relationship between sample size and variable selection is needed. Published in 2010 by John Wiley & Sons, Ltd.

6.
Recently, several authors have proposed the use of linear regression models in cost-effectiveness analysis. In this paper, by modelling costs and outcomes using patient and Health Centre covariates, we seek to identify the part of the cost or outcome difference that is not attributable to the treatment itself, but to the patients' condition or to characteristics of the Centres. Selection of the covariates to be included as predictors of effectiveness and cost is usually made by the researcher. This practice ignores the uncertainty associated with model selection and leads to underestimation of the uncertainty about quantities of interest. We propose the use of Bayesian model averaging as a mechanism to account for such uncertainty about the model. Data from a clinical trial are used to analyze the effect of incorporating model uncertainty, by comparing two highly active antiretroviral treatments applied to asymptomatic HIV patients. The joint posterior density of incremental effectiveness and cost and cost-effectiveness acceptability curves are proposed as decision-making measures.

7.
In matched case-crossover studies, it is generally accepted that the covariates on which a case and associated controls are matched cannot exert a confounding effect on independent predictors included in the conditional logistic regression model. This is because any stratum effect is removed by the conditioning on the fixed number of sets of the case and controls in the stratum. Hence, the conditional logistic regression model is not able to detect any effects associated with the matching covariates by stratum. However, some matching covariates such as time often play an important role as an effect modification, leading to incorrect statistical estimation and prediction. Therefore, we propose three approaches to evaluate effect modification by time. The first is a parametric approach, the second is a semiparametric penalized approach, and the third is a semiparametric Bayesian approach. Our parametric approach is a two-stage method, which uses conditional logistic regression in the first stage and then estimates polynomial regression in the second stage. Our semiparametric penalized and Bayesian approaches are one-stage approaches developed by using regression splines. Our semiparametric one-stage approaches allow us not only to detect the parametric relationship between the predictor and binary outcomes, but also to evaluate nonparametric relationships between the predictor and time. We demonstrate the advantage of our semiparametric one-stage approaches using both a simulation study and an epidemiological example of a 1-4 bi-directional case-crossover study of childhood aseptic meningitis with drinking water turbidity. We also provide statistical inference for the semiparametric Bayesian approach using Bayes Factors. Copyright © 2016 John Wiley & Sons, Ltd.

8.
OBJECTIVE: To evaluate whether different categorization strategies for introducing continuous variables in multivariable logistic regression analysis result in prognostic models that differ in content and performance. STUDY DESIGN AND SETTING: Backward multivariable logistic regression (P < 0.05 and P < 0.157) was performed with possible predictors for persistent complaints in patients with nonspecific neck pain. The continuous variables were introduced in the analysis in three separate ways: (1) continuous, (2) split into multiple categories, and (3) dichotomized. The different models were compared with regard to model content, goodness of fit, explained variation, and discriminative ability. We also compared the effect on performance of categorization before and after the selection procedure. RESULTS: For P < 0.05, the final model with continuous variables, containing five predictors, disagreed on three predictors with both categorization strategies. For P < 0.157, the model with continuous variables, containing six predictors, disagreed on three predictors with the model containing stratified continuous variables and on six predictors compared with the model with dichotomized variables. The models in which the variables were kept continuous performed best. There was no clear difference in performance between categorization before and after the selection procedure. CONCLUSION: Categorization of continuous variables resulted in different content and poorer performance of the final model.
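The performance loss from dichotomization is visible directly in discriminative ability. The sketch below (the logistic data-generating model and coefficients are illustrative assumptions) compares the AUC of a continuous predictor with that of its median split, using the rank (Mann-Whitney) formula:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(3)
n = 2000
x = rng.standard_normal(n)
prob = 1.0 / (1.0 + np.exp(-(-0.5 + 1.2 * x)))    # logistic outcome model
y = rng.random(n) < prob

def auc(score, y):
    """AUC via the rank (Mann-Whitney) formula; midranks handle ties."""
    ranks = rankdata(score)
    n1 = int(y.sum())
    n0 = len(y) - n1
    return (ranks[y].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

auc_cont = auc(x, y)                                 # predictor kept continuous
auc_dich = auc((x > np.median(x)).astype(float), y)  # median split
print(f"AUC continuous: {auc_cont:.3f}, dichotomized: {auc_dich:.3f}")
```

Dichotomizing discards all within-group ordering, so its AUC is systematically lower, consistent with the study's conclusion.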

9.
In this paper we propose a Bayesian modeling approach to the analysis of genome-wide association studies based on single nucleotide polymorphism (SNP) data. Our latent seed model combines various aspects of k-means clustering, hidden Markov models (HMMs) and logistic regression into a fully Bayesian model. It is fitted using the Markov chain Monte Carlo stochastic simulation method, with Metropolis-Hastings update steps. The approach is flexible, both in allowing different types of genetic models, and because it can be easily extended while remaining computationally feasible due to the use of fast algorithms for HMMs. It allows for inference primarily on the location of the causal locus and also on other parameters of interest. The latent seed model is used here to analyze three data sets, using both synthetic and real disease phenotypes with real SNP data, and shows promising results. Our method is able to correctly identify the causal locus in examples where single SNP analysis is both successful and unsuccessful at identifying the causal SNP.

10.
The continual reassessment method (CRM) is a method for estimating the maximum tolerated dose in a dose-finding study. Traditionally, use is made of a single working model or 'skeleton' idealizing an underlying true dose-toxicity relationship. This working model is chosen either through discussion with investigators or from published data before the beginning of the trial, or simply on the basis of operating characteristics. To overcome the arbitrariness of the choice of such a single working model, Yin and Yuan (J. Am. Statist. Assoc. 2009; 104:954-968) propose a model averaging over a set of working models. Here, instead of averaging, we investigate some alternative Bayesian model criteria that maximize the posterior distribution. We propose three adaptive model-selecting CRMs using Bayesian model selection criteria, in which we specify in advance a collection of candidate working models for the dose-toxicity relationship, especially initial guesses of toxicity probabilities, and adaptively select a single working model among the candidates, updated by using the original CRM for each working model, based on the posterior model probability, the posterior predictive loss, or the deviance information criterion, during the course of the trial. These approaches were compared via a simulation study with the model averaging approach.
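For reference, the core CRM update for a single working model is a one-parameter posterior. The grid-based sketch below uses the common power model p_i(a) = s_i^exp(a) with a N(0, 1.34²) prior; the skeleton and accumulated toxicity data are invented for illustration, not taken from the paper:

```python
import numpy as np
from scipy.stats import norm

skeleton = np.array([0.05, 0.12, 0.25, 0.40, 0.55])   # assumed working model
target = 0.25
# Accumulated trial data: dose level given and toxicity observed (illustrative).
doses = np.array([0, 0, 1, 1, 2, 2, 2])
tox = np.array([0, 0, 0, 1, 0, 1, 1])

a = np.linspace(-4.0, 4.0, 2001)                   # grid for the model parameter
prior = norm.pdf(a, loc=0.0, scale=1.34)           # a commonly used prior scale
p = skeleton[doses][:, None] ** np.exp(a)[None, :]   # p_i(a) = s_i ** exp(a)
like = np.prod(np.where(tox[:, None] == 1, p, 1.0 - p), axis=0)
post = prior * like
post /= post.sum()                                 # discrete posterior on the grid

post_tox = np.array([(skeleton[i] ** np.exp(a) * post).sum()
                     for i in range(len(skeleton))])
mtd = int(np.argmin(np.abs(post_tox - target)))    # dose closest to the target
print("posterior toxicity estimates:", post_tox.round(3))
print("recommended next dose level:", mtd)
```

The model-selection CRMs in the abstract run an update of this kind for each candidate skeleton and then pick one skeleton by a posterior criterion rather than averaging over them.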

11.
Clinical decision making often requires estimates of the likelihood of a dichotomous outcome in individual patients. When empirical data are available, these estimates may well be obtained from a logistic regression model. Several strategies may be followed in the development of such a model. In this study, the authors compare alternative strategies in 23 small subsamples from a large data set of patients with an acute myocardial infarction, where they developed predictive models for 30-day mortality. Evaluations were performed in an independent part of the data set. Specifically, the authors studied the effect of coding of covariables and stepwise selection on discriminative ability of the resulting model, and the effect of statistical "shrinkage" techniques on calibration. As expected, dichotomization of continuous covariables implied a loss of information. Remarkably, stepwise selection resulted in less discriminating models compared to full models including all available covariables, even when more than half of these were randomly associated with the outcome. Using qualitative information on the sign of the effect of predictors slightly improved the predictive ability. Calibration improved when shrinkage was applied on the standard maximum likelihood estimates of the regression coefficients. In conclusion, a sensible strategy in small data sets is to apply shrinkage methods in full models that include well-coded predictors that are selected based on external information.
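A minimal sketch of one shrinkage device in this spirit: ridge-penalized maximum likelihood for logistic regression fitted by Newton-Raphson. This stands in for the paper's shrinkage techniques; the simulated data and penalty strength are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 250, 8
X = rng.standard_normal((n, p))
beta_true = np.r_[[1.0, -0.8, 0.5], np.zeros(p - 3)]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

def logistic_fit(X, y, lam=0.0, iters=50):
    """Newton-Raphson logistic regression with an optional ridge penalty lam."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Xc.shape[1])
    pen = lam * np.eye(Xc.shape[1])
    pen[0, 0] = 0.0                        # never penalize the intercept
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(Xc @ beta)))
        grad = Xc.T @ (y - mu) - pen @ beta
        hess = (Xc.T * (mu * (1 - mu))) @ Xc + pen
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta

beta_ml = logistic_fit(X, y, lam=0.0)      # plain maximum likelihood
beta_shrunk = logistic_fit(X, y, lam=5.0)  # shrunk towards zero
print("slope norm, ML:     ", round(float(np.linalg.norm(beta_ml[1:])), 3))
print("slope norm, shrunk: ", round(float(np.linalg.norm(beta_shrunk[1:])), 3))
```

Pulling the maximum likelihood coefficients towards zero reduces the over-optimistic spread of predicted probabilities in small samples, which is why shrinkage improves calibration.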

12.
OBJECTIVE: Researchers have proposed using bootstrap resampling in conjunction with automated variable selection methods to identify predictors of an outcome and to develop parsimonious regression models. Using this method, multiple bootstrap samples are drawn from the original data set. Traditional backward variable elimination is used in each bootstrap sample, and the proportion of bootstrap samples in which each candidate variable is identified as an independent predictor of the outcome is determined. The performance of this method for identifying predictor variables has not been examined. STUDY DESIGN AND SETTING: Monte Carlo simulation methods were used to determine the ability of bootstrap model selection methods to correctly identify predictors of an outcome when those variables that are selected for inclusion in at least 50% of the bootstrap samples are included in the final regression model. We compared the performance of the bootstrap model selection method to that of conventional backward variable elimination. RESULTS: Bootstrap model selection and conventional backward variable elimination yielded approximately equal proportions of selected models matching the true regression model. CONCLUSION: Bootstrap model selection performed comparably to backward variable elimination for identifying the true predictors of a binary outcome.
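The 50%-inclusion rule can be sketched as follows. To keep the example short, a one-pass significance screen on the full OLS model stands in for full backward elimination, and all data, effect sizes, and thresholds are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, B = 250, 8, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [0.8, 0.6, 0.4]                 # three true predictors
y = X @ beta + rng.standard_normal(n)

def significant_vars(X, y, alpha=0.05):
    """Boolean mask of variables with p < alpha in the full OLS fit."""
    Xc = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ b
    df = len(y) - Xc.shape[1]
    se = np.sqrt(np.diag(resid @ resid / df * np.linalg.inv(Xc.T @ Xc)))
    pvals = 2 * stats.t.sf(np.abs(b / se), df)[1:]   # skip the intercept
    return pvals < alpha

counts = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, n)            # bootstrap resample with replacement
    counts += significant_vars(X[idx], y[idx])
inclusion = counts / B
final_model = np.where(inclusion >= 0.5)[0]   # the 50% inclusion rule
print("inclusion fractions:", inclusion.round(2))
print("final model:", final_model.tolist())
```

Variables with real effects are selected in nearly every bootstrap sample, while noise variables rarely reach the 50% threshold.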

13.
OBJECTIVE: To investigate whether stratification of the risk of developing a surgical-site infection (SSI) is improved when a logistic regression model is used to weight the risk factors for each procedure category individually instead of the modified NNIS System risk index. DESIGN AND SETTING: The German Nosocomial Infection Surveillance System, based on NNIS System methodology, has 273 acute care surgical departments participating voluntarily. Data on 9 procedure categories were included (214,271 operations). METHODS: For each of the procedure categories, the significant risk factors from the available data (NNIS System risk index variables of ASA score, wound class, duration of operation, and endoscope use, as well as gender and age) were identified by multiple logistic regression analyses with stepwise variable selection. The area under the receiver operating characteristic (ROC) curve resulting from these analyses was used to evaluate the predictive power of logistic regression models. RESULTS: For most procedures, at least two of the three variables contributing to the NNIS System risk index were shown to be independent risk factors (appendectomy, knee arthroscopy, cholecystectomy, colon surgery, herniorrhaphy, hip prosthesis, knee prosthesis, and vascular surgery). The predictive power of logistic regression models (including age and gender, when appropriate) was low (between 0.55 and 0.71) and for most procedures only slightly better than that of the NNIS System risk index. CONCLUSION: Without the inclusion of additional procedure-specific variables, logistic regression models do not improve the comparison of SSI rates from various hospitals.

14.
OBJECTIVE: Scarlet fever is a statutory Class B notifiable infectious disease under key prevention and control in China and seriously endangers human health; its incidence data exhibit typical spatiotemporal characteristics. This study applied spatiotemporal analysis methods to the factors influencing scarlet fever incidence, to support disease prevention and control. METHODS: Monthly scarlet fever incidence data for 31 Chinese provinces and municipalities from 2014 to 2017, together with the corresponding meteorological and air pollution data, were collected, and Spearman correlation analysis and stepwise...

15.
In modelling we usually endeavour to find a single 'best' model that explains the relationship between independent and dependent variables. Selection of a single model fails to take into account the prior uncertainty in the model space. The Bayesian model averaging (BMA) approach tackles this problem by considering the set of all possible models. We apply the BMA approach to the estimation of the false negative fraction (FNF) in a particular case of a two-stage multiple screening test for bowel cancer. We find that after taking model uncertainty into consideration, the estimate of the FNF obtained is largely dependent on the covariance structure of the priors. Results obtained when the Zellner g-prior for the prior variance is used are largely influenced by the magnitude of g.
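A compact illustration of BMA under a Zellner g-prior for a linear model (the paper's screening-test application is more involved; the data here are simulated). For a model with k centred predictors and fit R², the Bayes factor against the null model has the closed form BF = (1+g)^((n-1-k)/2) (1+g(1-R²))^(-(n-1)/2), so all 2^p models can be enumerated and the g-sensitivity shown by varying g:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n, p = 100, 6
X = rng.standard_normal((n, p))
y = 1.0 + 0.8 * X[:, 0] - 0.6 * X[:, 1] + rng.standard_normal(n)

Xc = X - X.mean(axis=0)          # centre; the intercept is handled implicitly
yc = y - y.mean()
tss = yc @ yc

def log_bf(cols, g):
    """Log Bayes factor of the model with predictors `cols` vs the null model."""
    k = len(cols)
    if k == 0:
        return 0.0
    Z = Xc[:, list(cols)]
    b, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    r2 = 1.0 - ((yc - Z @ b) @ (yc - Z @ b)) / tss
    return 0.5 * (n - 1 - k) * np.log(1 + g) - 0.5 * (n - 1) * np.log(1 + g * (1 - r2))

def inclusion_probs(g):
    """Posterior inclusion probabilities under a uniform prior over all 2^p models."""
    models = [c for k in range(p + 1) for c in combinations(range(p), k)]
    lw = np.array([log_bf(m, g) for m in models])
    w = np.exp(lw - lw.max())
    w /= w.sum()
    incl = np.zeros(p)
    for m, wi in zip(models, w):
        incl[list(m)] += wi
    return incl

results = {g: inclusion_probs(g) for g in (float(n), 1000.0)}
for g, incl in results.items():
    print(f"g={g:g}: inclusion probabilities {incl.round(3)}")
```

Comparing the two settings of g shows how larger g favours sparser models, the sensitivity the abstract reports.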

16.
Multiple imputation is commonly used to impute missing covariates in the Cox semiparametric regression setting. It fills in each missing value with plausible values via a Gibbs sampling procedure, specifying an imputation model for each incomplete variable. This imputation method is implemented in several software packages that offer imputation models chosen according to the type of the variable to be imputed, but all of these imputation models assume that covariate effects are linear. However, this assumption often does not hold in practice, as covariates can have nonlinear effects. Such a linearity assumption can lead to misleading conclusions, because the imputation model should be constructed to reflect the true distributional relationship between the missing values and the observed values. To estimate nonlinear effects of continuous time-invariant covariates in the imputation model, we propose a method based on B-spline functions. To assess the performance of this method, we conducted a simulation study in which we compared multiple imputation using a Bayesian spline imputation model with multiple imputation using a Bayesian linear imputation model in a survival analysis setting. We evaluated the proposed method on the motivating data set collected in HIV-infected patients enrolled in an observational cohort study in Senegal, which contains several incomplete variables. We found that our method performs well in estimating hazard ratios compared with the linear imputation methods when data are missing completely at random or missing at random. Copyright © 2013 John Wiley & Sons, Ltd.

17.
18.
Variable selection is growing in importance with the advent of high-throughput genotyping methods requiring analysis of hundreds to thousands of single nucleotide polymorphisms (SNPs) and the increased interest in using these genetic studies to better understand common, complex diseases. Up to now, the standard approach has been to analyze the genotypes for each SNP individually to look for an association with a disease. Alternatively, combinations of SNPs or haplotypes are analyzed for association. Another added complication in studying complex diseases or phenotypes is that genetic risk for the disease is often due to multiple SNPs in various locations on the chromosome with small individual effects that may have a collectively large effect on the phenotype. Hence, multi-locus SNP models, as opposed to single SNP models, may better capture the true underlying genotypic-phenotypic relationship. Thus, innovative methods for determining which SNPs to include in the model are needed. The goal of this article is to describe several methods currently available for variable and model selection using Bayesian approaches and to illustrate their application for genetic association studies using both real and simulated candidate gene data for a complex disease. In particular, Bayesian model averaging (BMA), stochastic search variable selection (SSVS), and Bayesian variable selection (BVS) using a reversible jump Markov chain Monte Carlo (MCMC) for candidate gene association studies are illustrated using a study of age-related macular degeneration (AMD) and simulated data.
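A runnable stand-in for the model-space samplers discussed here: MC3-style Metropolis sampling over inclusion indicators for a linear model with a Zellner g-prior. This is not the article's reversible-jump sampler, and the "SNP-like" predictors, effect sizes, and chain length are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 150, 10                       # e.g. 10 candidate SNP-like predictors
X = rng.standard_normal((n, p))
y = 0.9 * X[:, 0] + 0.7 * X[:, 3] + rng.standard_normal(n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()
tss = yc @ yc
g = float(n)

def log_marginal(gamma):
    """g-prior log marginal likelihood (up to a constant) for inclusion vector gamma."""
    cols = np.flatnonzero(gamma)
    k = len(cols)
    if k == 0:
        return 0.0
    Z = Xc[:, cols]
    b, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    r2 = 1.0 - ((yc - Z @ b) @ (yc - Z @ b)) / tss
    return 0.5 * (n - 1 - k) * np.log(1 + g) - 0.5 * (n - 1) * np.log(1 + g * (1 - r2))

# MC3: a Metropolis random walk that flips one inclusion indicator per step.
gamma = np.zeros(p, dtype=int)
cur = log_marginal(gamma)
draws = np.zeros(p)
steps = 4000
for _ in range(steps):
    j = rng.integers(p)
    prop = gamma.copy()
    prop[j] ^= 1                     # flip indicator j
    new = log_marginal(prop)
    if np.log(rng.random()) < new - cur:   # uniform model prior cancels
        gamma, cur = prop, new
    draws += gamma
incl = draws / steps
print("posterior inclusion frequencies:", incl.round(2))
```

The chain spends nearly all its time in models containing the truly associated predictors, so their visit frequencies approximate posterior inclusion probabilities.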

19.
When modeling the risk of a disease, the very act of selecting the factors to be included can heavily impact the results. This study compares the performance of several variable selection techniques applied to logistic regression. We performed realistic simulation studies to compare five methods of variable selection: (1) a confidence interval (CI) approach for significant coefficients, (2) backward selection, (3) forward selection, (4) stepwise selection, and (5) Bayesian stochastic search variable selection (SSVS) using both informed and uninformed priors. We defined our simulated diseases to mimic odds ratios for cancer risk found in the literature for environmental factors, such as smoking; dietary risk factors, such as fiber; genetic risk factors, such as XPD; and interactions. We modeled the distribution of our covariates, including correlation, after the reported empirical distributions of these risk factors. We also used a null data set to calibrate the priors of the Bayesian method and evaluate its sensitivity. Of the standard methods (95 per cent CI, backward, forward, and stepwise selection), the CI approach resulted in the highest average per cent of correct associations and the lowest average per cent of incorrect associations. SSVS with an informed prior had a higher average per cent of correct associations and a lower average per cent of incorrect associations than the CI approach. This study shows that the Bayesian methods offer a way to use prior information to both increase power and decrease false-positive results when selecting factors to model complex disease risk.

20.
A common problem in the statistical analysis of clinical studies is the selection of those variables in the framework of a regression model that might influence the outcome variable. Stepwise methods have been available for a long time, but as with many other possible strategies, there is a lot of criticism of their use. Investigations of the stability of a selected model are often called for, but usually are not carried out in a systematic way. Since analytical approaches are extremely difficult, data-dependent methods might be a useful alternative. Based on a bootstrap resampling procedure, Chen and George investigated the stability of a stepwise selection procedure in the framework of the Cox proportional hazards regression model. We extend their proposal and develop a bootstrap model selection procedure, combining the bootstrap method with existing selection techniques such as stepwise methods. We illustrate the proposed strategy in the process of model building by using data from two cancer clinical trials featuring two different situations commonly arising in clinical research. In a brain tumour study, the adjustment for covariates in an overall treatment comparison is of primary interest, calling for the selection of even 'mild' effects. In a prostate cancer study, we concentrate on the analysis of treatment-covariate interactions, demanding that only 'strong' effects should be selected. Both variants of the strategy will be demonstrated by analysing the clinical trials with a Cox model, but they can be applied in other types of regression with obvious and straightforward modifications.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号