Similar Articles
 20 similar articles retrieved.
1.
Eluted dried blood spot specimens from newborn screening, collected in 2004 in North Thames and anonymously linked to birth registration data, were tested for maternally acquired rubella IgG antibody as a proxy for maternal antibody concentration using an enzyme-linked immunosorbent assay. Finite mixture regression models were fitted to the antibody concentrations from 1964 specimens. The Bayesian Information Criterion (BIC) was used as a model selection criterion to avoid over-fitting the number of mixture model components. This allowed investigation of the independent effect of maternal age and maternal country of birth on rubella antibody concentration without dichotomizing the outcome variable using cut-off values set a priori. Mixture models are a highly useful method of analysis in seroprevalence studies of vaccine-preventable infections in which preset cut-off values may overestimate the size of the seronegative population.
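A rough sketch of the BIC-driven choice of the number of mixture components is given below. It uses scikit-learn's GaussianMixture on simulated, hypothetical log antibody concentrations; the study itself fits finite mixture regression models that also include covariates such as maternal age, so this should be read as a simplified illustration of the selection step only, not the authors' implementation.

```python
# Simplified sketch: choose the number of mixture components by minimizing the BIC.
# Hypothetical univariate data stand in for the antibody concentrations; the study
# fits mixture regression models with covariates, which is not reproduced here.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Assumed two-component structure: a low (seronegative-like) and a high component.
y = np.concatenate([rng.normal(0.5, 0.3, 300),
                    rng.normal(2.5, 0.6, 1664)]).reshape(-1, 1)

fits = []
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(y)
    fits.append((gm.bic(y), k, gm))

bic_best, k_best, best = min(fits, key=lambda t: t[0])   # lowest BIC wins
print(f"Selected {k_best} components (BIC = {bic_best:.1f})")
print("Component means:", best.means_.ravel())
print("Component weights:", best.weights_.round(3))
```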

2.
PURPOSE: To explore the best approach to identify and adjust for confounders in epidemiologic practice. METHODS: In the Port Pirie cohort study, the selection of covariates was based on both a priori and empirical considerations. In an assessment of the relationship between exposure to environmental lead and child development, change-in-estimate (CE) and significance testing (ST) criteria were compared in identifying potential confounders. The Pearson correlation coefficients were used to evaluate the potential for collinearity between pairs of major quantitative covariates. In multivariate analyses, the effects of confounding factors were assessed with multiple linear regression models. RESULTS: The nature and number of covariates selected varied with different confounder selection criteria and different cutoffs. Four covariates (i.e., quality of home environment, socioeconomic status (SES), maternal intelligence, and parental smoking behaviour) met the conventional CE criterion (≥10%), whereas 14 variables met the ST criterion (p ≤ 0.25). However, the magnitude of the relationship between blood lead concentration and children's IQ differed slightly after adjustment for confounding, using either the CE (partial regression coefficient: -4.4; 95% confidence interval (CI): -8.3 to -0.5) or ST criterion (-4.3; 95% CI: -8.4 to -0.2). CONCLUSIONS: Identification and selection of confounding factors need to be viewed cautiously in epidemiologic studies. Either the CE (e.g., ≥10%) or ST (e.g., p ≤ 0.25) criterion may be implemented in identification of a potential confounder if a study sample is sufficiently large, and both methods are subject to the arbitrariness of selecting a cut-off point. In this study, the CE criterion (i.e., ≥10%) appears to be more stringent than the ST method (i.e., p ≤ 0.25) in the identification of confounders. However, the ST rule cannot be used to establish whether confounding is genuinely present because it does not reflect the causal relationship between the confounder and the outcome. This study shows the complexities one can expect to encounter in the identification of and adjustment for confounders.
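The change-in-estimate screen can be sketched as follows: refit the exposure-outcome model with each candidate covariate removed in turn and flag covariates whose omission shifts the exposure coefficient by 10% or more. The simulated data, variable names, and the "drop one covariate from the full model" variant used here are illustrative assumptions, not the Port Pirie analysis.

```python
# Sketch of the change-in-estimate (CE) screen: a covariate is flagged as a
# potential confounder if removing it changes the exposure coefficient by >= 10%.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
lead = rng.normal(10, 3, n)                              # hypothetical exposure
home_quality = -0.15 * lead + rng.normal(0, 1, n)        # correlated with exposure
ses = rng.normal(0, 1, n)
maternal_iq = rng.normal(100, 15, n)
iq = 105 - 0.4 * lead + 2.0 * home_quality + 1.5 * ses + rng.normal(0, 8, n)

df = pd.DataFrame({"lead": lead, "home_quality": home_quality,
                   "ses": ses, "maternal_iq": maternal_iq, "iq": iq})

covariates = ["home_quality", "ses", "maternal_iq"]
full = sm.OLS(df["iq"], sm.add_constant(df[["lead"] + covariates])).fit()
beta_full = full.params["lead"]

for c in covariates:
    reduced = ["lead"] + [v for v in covariates if v != c]
    beta_red = sm.OLS(df["iq"], sm.add_constant(df[reduced])).fit().params["lead"]
    change = abs(beta_red - beta_full) / abs(beta_full)
    verdict = "flagged as confounder" if change >= 0.10 else "not flagged"
    print(f"{c:>13}: change-in-estimate = {change:5.1%} -> {verdict}")
```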

3.
OBJECTIVE: Researchers have proposed using bootstrap resampling in conjunction with automated variable selection methods to identify predictors of an outcome and to develop parsimonious regression models. Using this method, multiple bootstrap samples are drawn from the original data set. Traditional backward variable elimination is used in each bootstrap sample, and the proportion of bootstrap samples in which each candidate variable is identified as an independent predictor of the outcome is determined. The performance of this method for identifying predictor variables has not been examined. STUDY DESIGN AND SETTING: Monte Carlo simulation methods were used to determine the ability of bootstrap model selection methods to correctly identify predictors of an outcome when those variables that are selected for inclusion in at least 50% of the bootstrap samples are included in the final regression model. We compared the performance of the bootstrap model selection method to that of conventional backward variable elimination. RESULTS: Bootstrap model selection identified the true regression model in approximately the same proportion of simulated data sets as conventional backward variable elimination. CONCLUSION: Bootstrap model selection performed comparably to backward variable elimination for identifying the true predictors of a binary outcome.
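A stripped-down sketch of the procedure: draw bootstrap samples, run a simple p-value-based backward elimination on a logistic model in each, and keep the variables selected in at least 50% of the samples. The simulated data, the 0.05 stay criterion, and the 200 bootstrap replicates are assumptions for illustration, not the study's settings.

```python
# Sketch: bootstrap inclusion fractions for backward elimination, with the
# "retain if selected in >= 50% of bootstrap samples" rule.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X, y, p_remove=0.05):
    """Crude backward elimination on a logistic model by largest p-value."""
    cols = list(X.columns)
    while cols:
        fit = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= p_remove:
            break
        cols.remove(worst)
    return set(cols)

rng = np.random.default_rng(2)
n, B = 400, 200
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"x{j}" for j in range(6)])
logit_p = -0.5 + 1.0 * X["x0"] + 0.7 * X["x1"]            # x2..x5 are noise
y = pd.Series(rng.binomial(1, 1 / (1 + np.exp(-logit_p.to_numpy()))))

counts = {c: 0 for c in X.columns}
for _ in range(B):
    idx = rng.integers(0, n, n)                           # bootstrap resample
    sel = backward_eliminate(X.iloc[idx].reset_index(drop=True),
                             y.iloc[idx].reset_index(drop=True))
    for c in sel:
        counts[c] += 1

final_model = [c for c, k in counts.items() if k / B >= 0.5]
print({c: round(k / B, 2) for c, k in counts.items()})
print("Variables retained by the 50% rule:", final_model)
```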

4.
OBJECTIVES: Automated variable selection methods are frequently used to determine the independent predictors of an outcome. The objective of this study was to determine the reproducibility of logistic regression models developed using automated variable selection methods. STUDY DESIGN AND SETTING: An initial set of 29 candidate variables was considered for predicting mortality after acute myocardial infarction (AMI). We drew 1,000 bootstrap samples from a dataset consisting of 4,911 patients admitted to hospital with an AMI. Using each bootstrap sample, logistic regression models predicting 30-day mortality were obtained using backward elimination, forward selection, and stepwise selection. The agreement between the different model selection methods and the agreement across the 1,000 bootstrap samples were compared. RESULTS: Using 1,000 bootstrap samples, backward elimination identified 940 unique models for predicting mortality. Similar results were obtained for forward and stepwise selection. Three variables were identified as independent predictors of mortality among all bootstrap samples. Over half the candidate prognostic variables were identified as independent predictors in less than half of the bootstrap samples. CONCLUSION: Automated variable selection methods result in models that are unstable and not reproducible. The variables selected as independent predictors are sensitive to random fluctuations in the data.

5.
Correct selection of prognostic biomarkers among multiple candidates is becoming increasingly challenging as the dimensionality of biological data becomes higher. Therefore, minimizing the false discovery rate (FDR) is of primary importance, while a low false negative rate (FNR) is a complementary measure. The lasso is a popular selection method in Cox regression, but its results depend heavily on the penalty parameter λ. Usually, λ is chosen using maximum cross-validated log-likelihood (max-cvl). However, this method often has a very high FDR. We review methods for a more conservative choice of λ. We propose an empirical extension of the cvl by adding a penalization term, which trades off between the goodness-of-fit and the parsimony of the model, leading to the selection of fewer biomarkers and, as we show, to a reduction of the FDR without a large increase in FNR. We conducted a simulation study considering null and moderately sparse alternative scenarios and compared our approach with the standard lasso and 10 other competitors: Akaike information criterion (AIC), corrected AIC, Bayesian information criterion (BIC), extended BIC, Hannan and Quinn information criterion (HQIC), risk information criterion (RIC), one-standard-error rule, adaptive lasso, stability selection, and percentile lasso. Our extension achieved the best compromise across all the scenarios between a reduction of the FDR and a limited rise in the FNR, followed by the AIC, the RIC, and the adaptive lasso, which performed well in some settings. We illustrate the methods using gene expression data of 523 breast cancer patients. In conclusion, we propose to apply our extension to the lasso whenever a stringent FDR with a limited FNR is targeted. Copyright © 2016 John Wiley & Sons, Ltd.
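The abstract does not give the exact form of the added penalization, so the following is only a schematic way of writing the idea, with an assumed trade-off constant κ and the number of nonzero coefficients standing in for model size:

$$\hat\lambda=\arg\max_{\lambda}\Big\{\mathrm{cvl}(\lambda)-\kappa\,\widehat{df}(\lambda)\Big\},$$

where cvl(λ) is the cross-validated partial log-likelihood of the lasso Cox model at penalty λ and df̂(λ) is the number of biomarkers with nonzero coefficients. Setting κ = 0 recovers the standard max-cvl rule, while larger κ yields sparser models, which is what drives the reduction in FDR at the cost of a possibly higher FNR.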

6.
Confidence intervals (CIs) and the reported predictive ability of statistical models may be misleading if one ignores uncertainty in the model selection procedure. When analyzing time-to-event data using Cox regression, one typically checks the proportional hazards (PH) assumption and subsequently alters the model to address any violations. Such an examination and correction constitute a model selection procedure and, if not accounted for, could result in misleading CIs. With the bootstrap, I study the impact of checking the PH assumption using (1) data to predict AIDS-free survival among HIV-infected patients initiating antiretroviral therapy and (2) simulated data. In the HIV study, due to non-PH, a Cox model was stratified on age quintiles. Interestingly, bootstrap CIs that ignored the PH check (always stratified on age quintiles) were wider than those which accounted for the PH check (on each bootstrap replication PH was tested and corrected through stratification only if violated). Simulations demonstrated that such a phenomenon is not an anomaly, although on average CIs widen when accounting for the PH check. In most simulation scenarios, coverage probabilities adjusting and not adjusting for the PH check were similar. However, when data were generated under a minor PH violation, the 95 per cent bootstrap CI ignoring the PH check had a coverage of 0.77 as opposed to 0.95 for the CI accounting for the PH check. The impact of checking the PH assumption is greatest when the p-value of the test for PH is close to the test's chosen Type I error probability.

7.
Penalized regression methods offer an attractive alternative to single marker testing in genetic association analysis. Penalized regression methods shrink toward zero the coefficients of markers that have little apparent effect on the trait of interest, resulting in a parsimonious subset of what we hope are true pertinent predictors. Here we explore the performance of penalization in selecting SNPs as predictors in genetic association studies. The strength of the penalty can be chosen to select a good predictive model (via methods such as computationally expensive cross-validation), via a maximum-likelihood-based model selection criterion (such as the BIC), or, as done here, to control the type I error. We have investigated the performance of several penalized logistic regression approaches, simulating data under a variety of disease locus effect size and linkage disequilibrium patterns. We compared several penalties, including the elastic net, ridge, Lasso, MCP and the normal-exponential-γ shrinkage prior implemented in the hyperlasso software, to standard single locus analysis and simple forward stepwise regression. We examined how markers enter the model as penalties and P-value thresholds are varied, and report the sensitivity and specificity of each of the methods. Results show that penalized methods outperform single marker analysis, the main difference being that penalized methods allow the simultaneous inclusion of a number of markers and generally do not allow correlated variables to enter the model, producing a sparse model in which most of the identified explanatory markers are accounted for.
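Part of this comparison can be sketched with scikit-learn's logistic regression, which supports L1 (lasso), L2 (ridge), and elastic-net penalties; MCP and the normal-exponential-γ prior of the hyperlasso are not available there and are omitted. The simulated genotype matrix, effect sizes, and penalty strengths below are assumptions for illustration only, not the study's simulation design.

```python
# Sketch: lasso, ridge, and elastic-net logistic regression on simulated SNP data,
# compared with a crude single-marker analysis. MCP / NEG (hyperlasso) not shown.
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p = 1000, 200
maf = rng.uniform(0.1, 0.5, p)
G = rng.binomial(2, maf, size=(n, p)).astype(float)       # additive genotype coding
beta = np.zeros(p)
beta[[10, 50, 120]] = 0.4                                  # three causal SNPs
y = rng.binomial(1, 1 / (1 + np.exp(-(G - G.mean(0)) @ beta)))

models = {
    "lasso":       LogisticRegression(penalty="l1", C=0.05, solver="liblinear"),
    "ridge":       LogisticRegression(penalty="l2", C=0.05, solver="liblinear"),
    "elastic_net": LogisticRegression(penalty="elasticnet", C=0.05, l1_ratio=0.5,
                                      solver="saga", max_iter=5000),
}
for name, m in models.items():
    m.fit(G, y)
    selected = np.flatnonzero(np.abs(m.coef_.ravel()) > 1e-6)
    print(f"{name:>11}: {len(selected)} SNPs with nonzero coefficients -> {selected[:10]}")

# Rough single-marker analysis for comparison (correlation test per SNP).
pvals = np.array([stats.pearsonr(G[:, j], y)[1] for j in range(p)])
print("single-marker hits at p < 1e-4:", np.flatnonzero(pvals < 1e-4))
```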

8.
We carried out a discriminant analysis with identity by descent (IBD) at each marker as inputs, and the sib pair type (affected-affected versus affected-unaffected) as the output. Using simple logistic regression for this discriminant analysis, we illustrate the importance of comparing models with different numbers of parameters. Such model comparisons are best carried out using either the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). When AIC (or BIC) stepwise variable selection was applied to the German Asthma data set, a group of markers was selected that provides the best fit to the data (assuming an additive effect). Interestingly, these 25–26 markers were not identical to those with the highest (in magnitude) single-locus lod scores. © 2001 Wiley-Liss, Inc.
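Comparing candidate marker sets by AIC and BIC with statsmodels looks roughly like the sketch below. The data, marker names, and candidate sets are placeholders standing in for the IBD inputs of the German Asthma data, which are not reproduced.

```python
# Sketch: compare logistic discriminant models of different sizes by AIC and BIC.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
X = pd.DataFrame(rng.uniform(0, 2, size=(n, 4)),           # stand-in for IBD sharing at 4 markers
                 columns=["m1", "m2", "m3", "m4"])
y = rng.binomial(1, 1 / (1 + np.exp(-(1.2 * X["m1"] + 0.8 * X["m3"] - 2).to_numpy())))

candidate_sets = [["m1"], ["m1", "m3"], ["m1", "m2", "m3"], list(X.columns)]
for cols in candidate_sets:
    fit = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
    print(f"{cols}: AIC = {fit.aic:.1f}, BIC = {fit.bic:.1f}")
# A larger model is preferred only if its gain in fit outweighs the AIC/BIC complexity penalty.
```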

9.
Jones RH. Statistics in Medicine. 2011;30(25):3050-3056
When a number of models are fit to the same data set, one method of choosing the 'best' model is to select the model for which Akaike's information criterion (AIC) is lowest. AIC applies when maximum likelihood is used to estimate the unknown parameters in the model. The value of -2 log likelihood for each model fit is penalized by adding twice the number of estimated parameters. The number of estimated parameters includes both the linear parameters and parameters in the covariance structure. Another criterion for model selection is the Bayesian information criterion (BIC). BIC penalizes -2 log likelihood by adding the number of estimated parameters multiplied by the log of the sample size. For large sample sizes, BIC penalizes -2 log likelihood much more than AIC, making it harder to enter new parameters into the model. An assumption in BIC is that the observations are independent. In mixed models, the observations are not independent. This paper develops a method for calculating the 'effective sample size' for mixed models based on Fisher's information. The effective sample size replaces the sample size in BIC and can vary from the number of subjects to the number of observations. A number of error models are considered based on a general mixed model, including unstructured and compound-symmetry covariance structures.
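In symbols, the proposal keeps the usual penalties but replaces the raw sample size in the BIC term with an effective sample size derived from Fisher's information (whose exact construction is given in the paper and not reproduced here):

$$\mathrm{AIC}=-2\log\hat L+2p,\qquad \mathrm{BIC}_{\mathrm{eff}}=-2\log\hat L+p\,\log n_{\mathrm{eff}},\qquad n_{\mathrm{subjects}}\le n_{\mathrm{eff}}\le n_{\mathrm{observations}},$$

where p counts both the fixed-effect (linear) parameters and the covariance parameters, so that strongly correlated repeated measurements contribute less to the penalty than independent observations would.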

10.
This preliminary study investigated associations between environmental organochlorine compounds and thyroid function in a sample of 66 sportsmen selected from among participants in the New York State Angler Cohort Study. A cross-sectional design was employed with the primary goal of the analysis being the generation of specific testable hypotheses. Blood samples were analyzed for compounds based on a priori identified literature-cited evidence of thyroid disruption. These included hexachlorobenzene and polychlorinated biphenyl congeners 19, 28, 47, 118, 153, 169, 180, 183, and 187. Time of sample collection, serum triglycerides, cholesterol, high- and low-density lipoproteins, age, body mass index, and cigarette smoking were considered for each participant. Potential associations between organochlorine compounds and serum total thyroxine, controlling for potential confounders, were examined using multivariable linear regression models. The models reported consisted of all variates being entered ("full" model, R2=0.380, P=0.136) and stepwise selection of variates ("reduced" models, alpha=0.15) using the criterion of maximum partial correlation at each step. Several procedures were considered to address contaminant data below the limit of detection in the reduced models, with no change in the selected predictors. Hexachlorobenzene (beta=-0.113) and age (beta=0.007) were selected as predictors of serum T4 in the reduced models (R2=0.083, P=0.065). Power analysis suggested that doubling the sample size would make the existing results statistically significant with a type I error of 0.05 and a power of 0.80. These findings are important in the design of a new specific study of thyroid function and environmental contaminants.

11.
Statistical prediction methods typically require some form of fine-tuning of tuning parameter(s), with K-fold cross-validation as the canonical procedure. For ridge regression there exist numerous procedures, but common to all, including cross-validation, is that a single parameter is chosen for all future predictions. We propose instead to calculate a unique tuning parameter for each individual for which we wish to predict an outcome. This generates an individualized prediction by focusing on the vector of covariates of a specific individual. The focused ridge (fridge) procedure is introduced with a two-part contribution: first, we define an oracle tuning parameter minimizing the mean squared prediction error for a specific covariate vector, and then we propose to estimate this tuning parameter using plug-in estimates of the regression coefficients and error variance parameter. The procedure is extended to logistic ridge regression by using a parametric bootstrap. For high-dimensional data, we propose to use ridge regression with cross-validation as the plug-in estimate, and simulations show that fridge gives smaller average prediction error than ridge with cross-validation for both simulated and real data. We illustrate the new concept for both linear and logistic regression models in two applications of personalized medicine: predicting individual risk and treatment response based on gene expression data. The method is implemented in the R package fridge.
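A minimal sketch of the focused idea for linear ridge regression: for a given covariate vector x0, estimate the mean squared prediction error of the ridge predictor as squared bias plus variance, plugging in preliminary estimates of β and σ², and pick the λ that minimizes it. The plug-in choices below (OLS coefficients and residual variance) and the λ grid are assumptions for illustration; the paper's logistic extension via parametric bootstrap is not shown.

```python
# Sketch of a "focused" (covariate-specific) ridge tuning parameter:
# lambda_hat(x0) = argmin_lambda [bias^2 + variance] of the ridge prediction at x0,
# using plug-in estimates of beta and sigma^2.
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.linspace(1.0, 0.1, p)
y = X @ beta_true + rng.normal(0, 1.0, n)

# Plug-in estimates (here: OLS coefficients and residual variance).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)

XtX = X.T @ X
def mspe_hat(lam, x0):
    """Estimated MSPE of the ridge prediction x0' beta_ridge(lam)."""
    A = np.linalg.solve(XtX + lam * np.eye(p), XtX)        # ridge shrinkage matrix
    bias = x0 @ (A - np.eye(p)) @ beta_hat
    M = np.linalg.solve(XtX + lam * np.eye(p), X.T)        # (X'X + lam I)^{-1} X'
    var = sigma2_hat * np.sum((x0 @ M) ** 2)
    return bias ** 2 + var

x0 = rng.normal(size=p)                                    # the individual to be predicted
grid = np.logspace(-2, 3, 60)
lam_focused = grid[np.argmin([mspe_hat(l, x0) for l in grid])]
print(f"Focused tuning parameter for this x0: lambda = {lam_focused:.3g}")
```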

12.
Logistic regression is the standard method for assessing predictors of diseases. In logistic regression analyses, a stepwise strategy is often adopted to choose a subset of variables. Inference about the predictors is then made based on the chosen model, constructed of only those variables retained in that model. This method subsequently ignores both the variables not selected by the procedure and the uncertainty due to the variable selection procedure. This limitation may be addressed by adopting a Bayesian model averaging approach, which considers a number of the possible models and uses the posterior probabilities of these models to perform all inferences and predictions. This study compares the Bayesian model averaging approach with stepwise procedures for selection of predictor variables in logistic regression using simulated data sets and the Framingham Heart Study data. The results show that in most cases Bayesian model averaging selects the correct model and outperforms stepwise approaches at predicting an event of interest.
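One common, simplified implementation of Bayesian model averaging approximates posterior model probabilities with BIC weights over the candidate models; the study may use other priors or software, so the sketch below is only illustrative, on hypothetical data with four candidate predictors.

```python
# Sketch: Bayesian model averaging for logistic regression using BIC-based
# approximate posterior model probabilities over all subsets of 4 candidates.
from itertools import combinations
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = rng.binomial(1, 1 / (1 + np.exp(-(0.9 * X["x1"] - 0.6 * X["x3"]).to_numpy())))

def fit_logit(cols):
    """Logistic model with an intercept plus the given predictors."""
    Z = np.column_stack([np.ones(n)] + [X[c].to_numpy() for c in cols])
    return sm.Logit(y, Z).fit(disp=0)

models = []
for k in range(len(X.columns) + 1):
    for cols in combinations(X.columns, k):
        models.append((set(cols), fit_logit(cols)))

bic = np.array([f.bic for _, f in models])
w = np.exp(-(bic - bic.min()) / 2)
w /= w.sum()                                    # approximate posterior model probabilities

best = max(zip(models, w), key=lambda t: t[1])[0][0]
print("Highest-probability model:", best or "{intercept only}")
for c in X.columns:
    pip = sum(wi for (cols, _), wi in zip(models, w) if c in cols)
    print(f"Posterior inclusion probability of {c}: {pip:.2f}")
```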

13.
BACKGROUND: Incidence rates for cancers of the upper aero-digestive tract in Southern Brazil are among the highest in the world. A case-control study was designed to identify the main risk factors for carcinomas of mouth, pharynx, and larynx in the region. We tested the hypothesis of whether use of wood stoves is associated with these cancers. METHODS: Information on known and potential risk factors was obtained from interviews with 784 cases and 1568 non-cancer controls. We estimated the effect of use of wood stoves by conditional logistic regression, with adjustment for smoking, alcohol consumption and for other sociodemographic and dietary variables chosen as empirical confounders based on a change-in-estimate criterion. RESULTS: After extensive adjustment for all the empirical confounders the odds ratio (OR) for all upper aero-digestive tract cancers was 2.68 (95% confidence interval [CI]: 2.2-3.3). Increased risks were also seen in site-specific analyses for mouth (OR = 2.73; 95% CI: 1.8-4.2), pharyngeal (OR = 3.82; 95% CI: 2.0-7.4), and laryngeal carcinomas (OR = 2.34; 95% CI: 1.2-4.7). Significant risk elevations remained for each of the three anatomic sites and for all sites combined even after we purposefully biased the analyses towards the null hypothesis by adjusting the effect of wood stove use only for positive empirical confounders. CONCLUSIONS: The association of use of wood stoves with cancers of the upper aero-digestive tract is genuine and unlikely to result from insufficient control of confounding. Due to its high prevalence, use of wood stoves may be linked to as many as 30% of all cancers occurring in the region.

14.
Varying-coefficient models have claimed an increasing portion of statistical research and are now applied to censored data analysis in medical studies. We incorporate such flexible semiparametric regression tools for interval-censored data with a cured proportion. We adopted a two-part model to describe the overall survival experience for such complicated data. To fit the unknown functional components in the model, we take the local polynomial approach with bandwidth chosen by cross-validation. We establish consistency and asymptotic distribution of the estimation and propose to use the bootstrap for inference. We constructed a BIC-type model selection method to recommend an appropriate specification of the parametric and nonparametric components in the model. We conducted extensive simulations to assess the performance of our methods. An application to a decompression sickness data set illustrates our methods. Copyright © 2013 John Wiley & Sons, Ltd.

15.
Applied researchers frequently use automated model selection methods, such as backwards variable elimination, to develop parsimonious regression models. Statisticians have criticized the use of these methods for several reasons, amongst them the facts that the estimated regression coefficients are biased and that the derived confidence intervals do not have the advertised coverage rates. We developed a method to improve estimation of regression coefficients and confidence intervals, which employs backwards variable elimination in multiple bootstrap samples. In a given bootstrap sample, predictor variables that are not selected for inclusion in the final regression model have their regression coefficient set to zero. Regression coefficients are averaged across the bootstrap samples, and non-parametric percentile bootstrap confidence intervals are then constructed for each regression coefficient. We conducted a series of Monte Carlo simulations to examine the performance of this method for estimating regression coefficients and constructing confidence intervals for variables selected using backwards variable elimination. We demonstrated that this method results in confidence intervals with superior coverage compared with those developed from conventional backwards variable elimination. We illustrate the utility of our method by applying it to a large sample of subjects hospitalized with a heart attack.
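The estimation side of the method can be sketched as follows, under several illustrative assumptions (a linear model, a crude p-value-based backward elimination, and 200 bootstrap replicates): in each bootstrap sample run backward elimination, set the coefficients of unselected variables to zero, then average across samples and take percentile confidence intervals.

```python
# Sketch: bootstrap-averaged coefficients after backward elimination,
# with non-parametric percentile confidence intervals.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_ols(X, y, p_remove=0.157):          # ~AIC-equivalent stay criterion
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")
        if pvals.max() <= p_remove:
            return fit.params.drop("const")
        cols.remove(pvals.idxmax())
    return pd.Series(dtype=float)

rng = np.random.default_rng(7)
n, B = 300, 200
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{j}" for j in range(5)])
y = 2.0 * X["x0"] + 0.5 * X["x1"] + rng.normal(0, 1, n)

boot = np.zeros((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, n, n)
    est = backward_ols(X.iloc[idx].reset_index(drop=True),
                       y.iloc[idx].reset_index(drop=True))
    boot[b] = est.reindex(X.columns).fillna(0.0).to_numpy()   # unselected -> coefficient 0

for j, c in enumerate(X.columns):
    lo, hi = np.percentile(boot[:, j], [2.5, 97.5])
    print(f"{c}: mean coef = {boot[:, j].mean():+.2f}, 95% percentile CI = ({lo:+.2f}, {hi:+.2f})")
```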

16.
Birthweight distributions have been conceptualized as a predominant Gaussian distribution contaminated in the tails by an unspecified ‘residual’ distribution. Acknowledging this idea, we propose a technique for measuring certain features of birthweight distributions useful to epidemiologists: the mean and variance of the predominant distribution; the proportions of births in the low- and high-birthweight residual distributions; and the boundaries of support for these residual distributions. Our technique, based on an underlying multinomial sampling distribution, involves estimating parameters in a mixture model for the multinomial bin probabilities after having chosen the support of the residual distribution with a model selection criterion. A modest simulation study and experience with a few actual datasets indicate that use of the Bayesian information criterion (BIC) as the model selection criterion is superior to use of Akaike's information criterion (AIC) in this application.

17.
Fang Liu. Statistics in Medicine. 2018;37(24):3471-3485
The Bayesian expected power (BEP) has become increasingly popular in assessing the probability of success for a future trial. While the traditional power assumes a single value for the unknown effect size Δ and is thus highly dependent on the assumed value, the BEP embraces the uncertainty around Δ given the prior information and is therefore a less subjective measure of the probability of success than the traditional power, especially when the prior information is not rich. Current methods for assessing the BEP are often based on a parametric framework, imposing a model on the pilot data to derive and sample from the posterior distributions of Δ. The model-based approach can be analytically challenging and computationally costly, especially for multivariate data sets, and it also runs the risk of generating a misleading BEP if the model is misspecified. We propose an approach based on the Bayesian bootstrap (BBS) technique to simulate future trials in the presence of individual-level pilot data, based on which the empirical BEP can be calculated. The BBS approach is model-free, with no assumptions about the distribution of the prior data, and also circumvents the analytical and computational complexity associated with obtaining the posterior distribution of Δ. Information from multiple pilot studies is also straightforward to combine. We also propose the double bootstrap technique, a frequentist counterpart to the BBS, which shares similar properties and achieves the same goal as the BBS for BEP assessment. Simulation and case studies are presented to demonstrate the implementation of the BBS and double bootstrap techniques and to compare the BEP results with the model-based approach.
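A compact sketch of the Bayesian-bootstrap idea for a two-arm trial with a continuous endpoint: reweight the pilot data with Dirichlet(1, ..., 1) weights to draw a plausible effect size, simulate a future trial of the planned size under that draw, and record whether it would be significant; the proportion of significant simulated trials estimates the BEP. The pilot data, the planned sample size, the normal simulation of the future trial, and the t-test are placeholders, not the paper's settings.

```python
# Sketch: empirical Bayesian expected power (BEP) via the Bayesian bootstrap.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
pilot_trt = rng.normal(0.4, 1.0, 40)          # hypothetical pilot data, treatment arm
pilot_ctl = rng.normal(0.0, 1.0, 40)          # hypothetical pilot data, control arm
n_future, alpha, n_sims = 150, 0.05, 2000     # planned per-arm size of the future trial

significant = 0
for _ in range(n_sims):
    # Bayesian bootstrap: Dirichlet(1,...,1) weights over the pilot observations.
    w_t = rng.dirichlet(np.ones(len(pilot_trt)))
    w_c = rng.dirichlet(np.ones(len(pilot_ctl)))
    delta = w_t @ pilot_trt - w_c @ pilot_ctl              # one draw of the effect size
    sd_t = np.sqrt(w_t @ (pilot_trt - w_t @ pilot_trt) ** 2)
    sd_c = np.sqrt(w_c @ (pilot_ctl - w_c @ pilot_ctl) ** 2)
    # Simulate the future trial under this draw and test it.
    future_t = rng.normal(delta, sd_t, n_future)
    future_c = rng.normal(0.0, sd_c, n_future)
    if stats.ttest_ind(future_t, future_c).pvalue < alpha:
        significant += 1

print(f"Empirical BEP ~= {significant / n_sims:.2f}")
```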

18.
In air pollution epidemiology, improvements in statistical analysis tools can help improve signal-to-noise ratios and untangle large correlations between exposures and confounders. For this reason, we welcome a novel model-selection approach that helps to identify the time-windows of exposure to pollutants that produce adverse health effects. However, there are concerns about approaches that select a model based on a given data set and then estimate health effects in the same data. This can create problems when (1) the sample size is small in relation to the magnitude of the health effects; and (2) candidate predictors are highly correlated and likely to have similar effects. Bayesian Model Averaging has been advocated as a way to estimate health effects that accounts for model uncertainty. However, implementations where posterior model probabilities are approximated using BIC, as well as other default choices, may not reflect the ability of each model to provide an estimate of the health effect that is properly adjusted for confounding. Air pollution studies need to focus on estimating health effects while accounting for the uncertainty in the adjustment for confounding factors. This is true especially when model choice and estimation are performed on the same data. The development of appropriate statistical tools remains an open area of investigation.

19.

Background

Multivariable confounder adjustment in comparative studies of newly marketed drugs can be limited by small numbers of exposed patients and even fewer outcomes. Disease risk scores (DRSs) developed in historical comparator drug users before the new drug entered the market may improve adjustment. However, in a high-dimensional data setting, empirical selection of hundreds of potential confounders and modeling of the DRS even in the historical cohort can lead to over-fitting and reduced predictive performance in the study cohort. We propose the use of combinations of dimension reduction and shrinkage methods to overcome this problem, and compare the performance of these modeling strategies for implementing high-dimensional (hd) DRSs from historical data in two empirical study examples of newly marketed drugs versus comparator drugs after the new drugs' market entry: dabigatran versus warfarin for the outcome of major hemorrhagic events, and cyclooxygenase-2 inhibitors (coxibs) versus nonselective non-steroidal anti-inflammatory drugs (nsNSAIDs) for gastrointestinal bleeds.

Results

Historical hdDRSs that included predefined and empirical outcome predictors with dimension reduction (principal component analysis; PCA) and shrinkage (lasso and ridge regression) approaches had higher c-statistics in the warfarin users (0.66 for the PCA model, 0.64 for the PCA + ridge model, and 0.65 for the PCA + lasso model) than an unreduced model (c-statistic, 0.54) in the dabigatran example. The odds ratio (OR) from PCA + lasso hdDRS-stratification [OR, 0.64; 95% confidence interval (CI) 0.46–0.90] was closer to the benchmark estimate (0.93) from a randomized trial than the model without empirical predictors (OR, 0.58; 95% CI 0.41–0.81). In the coxibs example, c-statistics of the hdDRSs in the nsNSAID initiators were 0.66 for the PCA model, 0.67 for the PCA + ridge model, and 0.67 for the PCA + lasso model; these were higher than for the unreduced model (c-statistic, 0.45), and comparable to the demographics + risk score model (c-statistic, 0.67).

Conclusions

Building hdDRSs from historical data with dimension reduction and shrinkage was feasible and improved confounding adjustment in two studies of newly marketed medications.
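Schematically, such a historical hdDRS pipeline looks like the sketch below: fit a regularized outcome model on PCA-reduced empirical covariates in the historical comparator-drug cohort, apply it to the later study cohort, and stratify the treatment comparison on the resulting score. All data, dimensions, and settings here are placeholders, and the covariate-prioritization steps of the full hdDRS machinery are not reproduced.

```python
# Sketch: a historical high-dimensional disease risk score (hdDRS)
# built with PCA + lasso-penalized logistic regression.
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(9)
p = 300                                                    # empirical covariates (claims codes, etc.)

# Historical cohort: comparator-drug users observed before the new drug's launch.
n_hist = 5000
X_hist = rng.binomial(1, 0.1, size=(n_hist, p)).astype(float)
risk = X_hist[:, :10] @ np.full(10, 0.4) - 2.5
y_hist = rng.binomial(1, 1 / (1 + np.exp(-risk)))          # outcome (e.g., major bleed)

drs_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=50),
    LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, cv=5, max_iter=5000),
).fit(X_hist, y_hist)

# Study cohort: new-drug and comparator initiators after market entry.
n_study = 2000
X_study = rng.binomial(1, 0.1, size=(n_study, p)).astype(float)
treated = rng.binomial(1, 0.5, n_study)                    # 1 = new drug, 0 = comparator

drs = drs_model.predict_proba(X_study)[:, 1]               # predicted outcome risk
strata = pd.qcut(drs, q=5, labels=False, duplicates="drop")
# The treatment-outcome association would then be estimated within (or adjusted for)
# these DRS strata, e.g., via stratified or conditional logistic regression.
print(pd.crosstab(strata, treated))
```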

20.
Analysis of health care cost data is often complicated by a high level of skewness, heteroscedastic variances and the presence of missing data. Most of the existing literature on cost data analysis has focused on modeling the conditional mean. In this paper, we study a weighted quantile regression approach for estimating the conditional quantiles of health care cost data with missing covariates. The weighted quantile regression estimator is consistent, unlike the naive estimator, and asymptotically normal. Furthermore, we propose a modified BIC for variable selection in quantile regression when the covariates are missing at random. The quantile regression framework allows us to obtain a more complete picture of the effects of the covariates on health care costs and is naturally adapted to the skewness and heterogeneity of the cost data. The method is semiparametric in the sense that it does not require specifying the likelihood function for the random error or the covariates. We investigate the weighted quantile regression procedure and the modified BIC via extensive simulations. We illustrate the application by analyzing a real data set from a health care cost study. Copyright © 2013 John Wiley & Sons, Ltd.
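A minimal sketch of the inverse-probability-weighted quantile regression idea: estimate each subject's probability of being completely observed from always-observed variables, weight the quantile check loss by the inverse of that probability, and minimize over the coefficients using complete cases only. The missingness model, the median (τ = 0.5) choice, and the optimizer are illustrative assumptions; the modified BIC for variable selection is not reproduced here.

```python
# Sketch: inverse-probability-weighted quantile regression for skewed cost data
# with a covariate missing at random.
import numpy as np
import statsmodels.api as sm
from scipy.optimize import minimize

rng = np.random.default_rng(10)
n, tau = 800, 0.5
x1 = rng.normal(size=n)                        # always observed
x2 = rng.normal(size=n)                        # subject to missingness
cost = np.exp(1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(0, 0.5, n))   # skewed costs

# Missingness of x2 depends only on observed quantities (missing at random).
p_obs = 1 / (1 + np.exp(-(1.0 + 0.8 * x1 - 0.3 * np.log(cost))))
observed = rng.binomial(1, p_obs).astype(bool)

# Step 1: estimate observation probabilities from fully observed variables.
Z = sm.add_constant(np.column_stack([x1, np.log(cost)]))
pi_hat = sm.Logit(observed.astype(int), Z).fit(disp=0).predict(Z)
w = 1.0 / pi_hat[observed]                     # inverse-probability weights, complete cases

# Step 2: minimize the weighted check loss over complete cases.
Xc = np.column_stack([np.ones(observed.sum()), x1[observed], x2[observed]])
yc = cost[observed]

def check_loss(beta):
    u = yc - Xc @ beta
    return np.sum(w * u * (tau - (u < 0)))     # rho_tau(u) = u * (tau - 1{u < 0})

beta0 = np.linalg.lstsq(Xc, yc, rcond=None)[0]  # crude starting values
fit = minimize(check_loss, beta0, method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-6})
print("Weighted median-regression coefficients:", np.round(fit.x, 3))
```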
