Similar Articles
20 similar articles found
1.
A central goal of medical genetics is to accurately predict complex disease from genotypes. Here, we present a comprehensive analysis of simulated and real data using lasso and elastic-net penalized support-vector machine models, a mixed-effects linear model, a polygenic score, and unpenalized logistic regression. In simulation, the sparse penalized models achieved lower false-positive rates and higher precision than the other methods for detecting causal SNPs. The common practice of prefiltering SNP lists before penalized modeling was examined and shown to substantially reduce the ability to recover the causal SNPs. Using genome-wide SNP profiles across eight complex diseases within cross-validation, lasso and elastic-net models achieved substantially better predictive ability in celiac disease, type 1 diabetes, and Crohn's disease, and equivalent predictive ability in the rest, with the celiac disease results replicating strongly between independent datasets. We investigated the effect of linkage disequilibrium on the predictive models, showing that the penalized methods leverage this information to their advantage compared with methods that assume SNP independence. Our findings show that sparse penalized approaches are robust across different disease architectures, producing phenotype predictions and estimates of variance explained that are as good as or better than those of the other methods. This has fundamental ramifications for the selection and future development of methods to genetically predict human disease.
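A minimal sketch of the kind of sparse penalized prediction compared above, assuming a genotype matrix `X` (samples by SNPs, coded 0/1/2) and a binary phenotype `y`. scikit-learn's penalized logistic regression stands in for the paper's models; the data and all settings are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p, n_causal = 500, 2000, 20
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # toy genotypes
beta = np.zeros(p)
beta[:n_causal] = rng.normal(0, 0.5, n_causal)        # sparse causal effects
y = (X @ beta + rng.logistic(size=n) > 0).astype(int)

# Lasso (l1) vs. elastic net (mix of l1 and l2); C = 1/lambda.
lasso = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=0.1, max_iter=5000)
for name, model in [("lasso", lasso), ("elastic net", enet)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean cross-validated AUC = {auc:.3f}")
```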

2.
3.
We study regularized estimation in high-dimensional longitudinal classification problems, using the lasso and fused lasso regularizers. The constructed coefficient estimates are piecewise constant across the time dimension of the longitudinal problem, with adaptively selected change points (break points). We present an efficient algorithm for computing such estimates, based on proximal gradient descent. We apply the proposed technique to a longitudinal dataset on Alzheimer's disease from the Cardiovascular Health Study Cognition Study. Using data analysis and a simulation study, we motivate and demonstrate several practical considerations, such as the selection of tuning parameters and the assessment of model stability. While race, gender, vascular and heart disease, lack of caregivers, and deterioration of learning and memory are all important predictors of dementia, we also find that these risk factors become more relevant in the later stages of life.
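A minimal sketch of the proximal gradient template such estimators build on, shown for the plain l1 (lasso) penalty on a least-squares loss. The full fused lasso additionally penalizes differences of adjacent coefficients across time and requires a total-variation prox (e.g., Condat's algorithm), which is omitted here; all names and settings are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Prox of t * ||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_gradient_lasso(X, y, lam, n_iter=500):
    n, p = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n               # gradient of smooth loss
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = X[:, :5] @ np.ones(5) + rng.normal(size=200)
print(np.nonzero(prox_gradient_lasso(X, y, lam=0.2))[0])  # recovered support
```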

4.
Survival regression is commonly applied in biomedical studies and clinical trials, and evaluating predictive performance plays an essential role in model diagnosis and selection. The presence of censored data, particularly if informative, poses further challenges for assessing predictive accuracy. The existing literature focuses mainly on prediction of survival probabilities, with limited work on predicting survival times. In this work, we focus on accuracy measures for predicted survival times adjusted for a potentially informative censoring mechanism (i.e., coarsening at random (CAR) or non-CAR) by adopting the technique of inverse probability of censoring weighting. Our proposed predictive metric can be adapted to various survival regression frameworks, including but not limited to accelerated failure time models and proportional hazards models. Moreover, we provide the asymptotic properties of the inverse probability of censoring weighting estimators under CAR. As extensions, we consider high-dimensional data settings under CAR or non-CAR. The performance of the proposed method is evaluated through extensive simulation studies and analysis of real data from the Critical Assessment of Microarray Data Analysis.
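A minimal sketch of inverse probability of censoring weighting (IPCW), assuming observed times `T` and event indicators `E` (1 = event, 0 = censored). The censoring survival function G(t) is estimated by a Kaplan-Meier fit on the censoring indicator, and each uncensored subject receives weight 1/G(T_i). Variable names are illustrative, not the paper's notation.

```python
import numpy as np

def km_survival(times, events, eval_times):
    """Kaplan-Meier survival estimate evaluated at eval_times."""
    order = np.argsort(times)
    t, d = times[order], events[order]
    n_at_risk = len(t) - np.arange(len(t))
    surv = np.cumprod(1.0 - d / n_at_risk)
    return np.array([surv[t <= s][-1] if np.any(t <= s) else 1.0
                     for s in eval_times])

rng = np.random.default_rng(2)
true_t = rng.exponential(2.0, size=300)
cens_t = rng.exponential(3.0, size=300)
T = np.minimum(true_t, cens_t)
E = (true_t <= cens_t).astype(float)

G = km_survival(T, 1.0 - E, T)                 # censoring survival G(T_i)
w = np.where(E == 1, 1.0 / np.maximum(G, 1e-8), 0.0)  # IPCW weights
# e.g., an IPCW estimate of squared prediction error for predictions `pred`:
# mse = np.mean(w * (T - pred) ** 2)
print(w[:5])
```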

5.
The aim of this paper is to develop a functional mixed-effects modeling (FMEM) framework for the joint analysis of high-dimensional imaging data across a large number of locations (voxels) in a three-dimensional volume together with a set of genetic markers and clinical covariates. FMEM is extremely useful for efficiently carrying out candidate-gene approaches in imaging genetic studies. FMEM consists of two novel components: a mixed-effects model for nonlinear genetic effects on imaging phenotypes, introduced through genetic random effects at each voxel, and a jumping surface model for the variance components of the genetic random effects and the fixed effects as piecewise-smooth functions of the voxels. Moreover, FMEM naturally accommodates the correlation structure of the genetic markers at each voxel, while the jumping surface model explicitly incorporates the intrinsic spatial smoothness of the imaging data. We propose a novel two-stage adaptive smoothing procedure to spatially estimate the piecewise-smooth functions, particularly the irregular functional genetic variance components, while preserving their edges between piecewise-smooth regions. We develop weighted likelihood ratio tests and derive their exact approximations to test the effect of the genetic markers across voxels. Simulation studies show that FMEM significantly outperforms voxel-wise approaches in terms of higher sensitivity and specificity for identifying regions of interest for candidate genetic mapping in imaging genetic studies. Finally, FMEM is used to identify brain regions affected by three candidate genes, CR1, CD2AP, and PICALM, with the hope of shedding light on the pathological interactions between these candidate genes and brain structure and function.

6.
For many high-dimensional studies, additional information on the variables, such as (genomic) annotation or external p-values, is available. In the context of binary and continuous prediction, we develop a method for adaptive group-regularized (logistic) ridge regression, which makes structural use of such 'co-data'. Here, 'groups' refers to a partition of the variables according to the co-data. We derive empirical Bayes estimates of group-specific penalties, which possess several attractive properties: (i) they are analytical; (ii) they adapt to the informativeness of the co-data for the data at hand; (iii) only one global penalty parameter requires tuning by cross-validation. In addition, the method allows the use of multiple types of co-data at little extra computational cost. We show that the group-specific penalties may lead to a larger distinction between 'near-zero' and relatively large regression parameters, which facilitates post hoc variable selection. The method, termed GRridge, is implemented in an easy-to-use R package. It is demonstrated on two cancer genomics studies, both concerning the discrimination of precancerous cervical lesions from normal cervix tissue using methylation microarray data. For both examples, GRridge clearly improves the predictive performance of ordinary logistic ridge regression and the group lasso. In addition, we show that for the second study, the relatively good predictive performance is maintained when selecting only 42 variables.
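A minimal sketch of group-specific ridge penalties, assuming a co-data partition `groups` of the variables. A ridge fit with group penalty lambda_g is equivalent to rescaling group-g columns by 1/sqrt(lambda_g) and running ordinary ridge logistic regression; this illustrates the idea only and is not the GRridge package or its empirical Bayes tuning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p = 200, 100
X = rng.normal(size=(n, p))
groups = np.repeat([0, 1], p // 2)            # two co-data groups
y = (X[:, :10] @ rng.normal(size=10) + rng.logistic(size=n) > 0).astype(int)

# Group penalties; in GRridge these come from empirical Bayes, here set by hand.
group_lambda = np.array([0.5, 5.0])           # group 0 presumed informative
scale = 1.0 / np.sqrt(group_lambda[groups])
model = LogisticRegression(penalty="l2", C=1.0, max_iter=2000)
model.fit(X * scale, y)                       # ridge on rescaled columns
beta = model.coef_.ravel() * scale            # map back to the original scale
print(beta[:5])
```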

7.
Identification of biomarkers is an emerging area in oncology. In this article, we develop an efficient statistical procedure for classifying protein markers according to their effect on cancer progression. A high-dimensional time-course dataset of protein markers for 80 patients motivates the development of the model. The threshold value is formulated as the level of a marker having maximum impact on cancer progression. A classification algorithm for high-dimensional time-course data is developed, and the algorithm is validated by comparing random components using both proportional hazards and accelerated failure time frailty models. The study elucidates the application of two separate joint modeling techniques, using an autoregressive-type model and a mixed-effects model for the time-course data and a proportional hazards model for the survival data, with appropriate use of Bayesian methodology. In addition, a prognostic score is developed on the basis of a few selected genes and applied to the patients. This study facilitates the identification of relevant biomarkers from a set of candidate markers.

8.
Two main classes of methodology have been developed to address the analytical intractability of generalized linear mixed models: likelihood-based methods and Bayesian methods. Likelihood-based methods such as the penalized quasi-likelihood approach have been shown to produce biased estimates, especially for binary clustered data with small cluster sizes. More recent methods using adaptive Gaussian quadrature perform well but can be overwhelmed by problems with large numbers of random effects, and efficient algorithms to better handle these situations have not yet been integrated into standard statistical packages. Bayesian methods, although they have good frequentist properties when the model is correct, are known to be computationally intensive and also require specialized code, limiting their use in practice. In this article, we introduce a modification of the hybrid approach of Capanu and Begg (2011, Biometrics 67, 371-380) as a bridge between the likelihood-based and Bayesian approaches, employing Bayesian estimation for the variance components followed by Laplacian estimation for the regression coefficients. We investigate its performance, as well as that of several likelihood-based methods, in the setting of generalized linear mixed models with binary outcomes. We apply the methods to three datasets and conduct simulations to illustrate their properties. Simulation results indicate that for moderate to large numbers of observations per random effect, adaptive Gaussian quadrature and the Laplacian approximation are very accurate, with adaptive Gaussian quadrature preferable as the number of observations per random effect increases. The hybrid approach is overall similar to the Laplace method, and it can be superior for data with very sparse random effects.
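A minimal sketch of (non-adaptive) Gauss-Hermite quadrature for the marginal likelihood of one cluster in a random-intercept logistic model: the Bernoulli likelihood is integrated over b ~ N(0, sigma^2). Adaptive quadrature would additionally recenter and rescale the nodes at each cluster's mode; the toy data and names are illustrative.

```python
import numpy as np

def cluster_loglik(y, eta_fixed, sigma, n_nodes=20):
    # Probabilists' Hermite nodes/weights for integrals against exp(-x^2/2)
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    b = sigma * nodes                           # random-intercept values
    eta = eta_fixed[:, None] + b[None, :]       # linear predictor at each node
    ll = (y[:, None] * eta - np.log1p(np.exp(eta))).sum(axis=0)
    w = weights / np.sqrt(2 * np.pi)            # standard-normal weights
    return np.log(np.sum(w * np.exp(ll)))       # log of the marginal likelihood

rng = np.random.default_rng(4)
y = rng.binomial(1, 0.6, size=8).astype(float)  # one cluster's binary outcomes
eta_fixed = np.full(8, 0.3)                     # X @ beta for this cluster
print(cluster_loglik(y, eta_fixed, sigma=1.0))
```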

9.
Misspecification of the covariance structure in a linear mixed model (LMM) can lead to biased estimates of population parameters under MAR drop-out. In our motivating example of modeling CD4 cell counts during untreated HIV infection, random intercept and slope LMMs are frequently used. In this article, we evaluate the performance of LMMs with specific covariance structures, in terms of bias in the fixed-effects estimates under specific MAR drop-out mechanisms, and adopt a Bayesian model comparison criterion to discriminate between the examined approaches in real-data applications. We show analytically that using a random intercept and slope structure when the true structure is more complex can lead to seriously biased estimates, with the degree of bias depending on the magnitude of the MAR drop-out. Under a misspecified covariance structure, we compare, in terms of induced bias, the approach of adding a fractional Brownian motion (BM) process on top of random intercepts and slopes with the approach of using splines for the random effects. In general, the performance of both approaches was satisfactory, with the BM model leading to smaller bias in most cases. A simulation study is carried out to evaluate the performance of the proposed Bayesian criterion in identifying the model with the correct covariance structure. Overall, the proposed method performs better than the AIC and BIC criteria under our specific simulation setting. The models under consideration are applied to real data from the CASCADE study; the most plausible model is identified by all examined criteria.
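A minimal sketch of fitting the random intercept-and-slope LMM whose covariance structure the paper stress-tests, using statsmodels. The simulated CD4-like data and all names are illustrative, and no drop-out mechanism is simulated here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_subj, n_obs = 100, 6
ids = np.repeat(np.arange(n_subj), n_obs)
time = np.tile(np.arange(n_obs, dtype=float), n_subj)
b0 = rng.normal(0, 1.0, n_subj)[ids]           # random intercepts
b1 = rng.normal(0, 0.3, n_subj)[ids]           # random slopes
y = 10.0 - 0.5 * time + b0 + b1 * time + rng.normal(0, 0.5, len(ids))
df = pd.DataFrame({"y": y, "time": time, "id": ids})

fit = smf.mixedlm("y ~ time", df, groups=df["id"], re_formula="~time").fit()
print(fit.params[["Intercept", "time"]])       # fixed-effects estimates
print(fit.cov_re)                              # random-effects covariance
```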

10.
We develop a new genetic prediction method, smooth-threshold multivariate genetic prediction, using single nucleotide polymorphism (SNP) data from genome-wide association studies (GWASs). Our method consists of two stages. At the first stage, unlike the usual discontinuous SNP screening used in the gene score method, our method screens SNPs continuously based on the output of standard univariate analyses of the marginal association of each SNP. At the second stage, the predictive model is built by generalized ridge regression on the screened SNPs simultaneously, with SNP weights determined by the strength of marginal association. Continuous SNP screening by smooth thresholding not only stabilizes prediction but also leads to a closed-form expression for the generalized degrees of freedom (GDF). The GDF yields Stein's unbiased risk estimate (SURE), which enables a data-dependent choice of the optimal SNP screening cutoff without cross-validation. Our method is very fast because the computationally expensive genome-wide scan is required only once, in contrast to penalized regression methods such as the lasso and elastic net. Simulation studies that mimic real GWAS data with quantitative and binary traits demonstrate that the proposed method outperforms the gene score method and genomic best linear unbiased prediction (GBLUP), and performs comparably to, and sometimes better than, the lasso and elastic net, which are known for good predictive ability but heavy computational cost. Application to whole-genome sequencing (WGS) data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) shows that the proposed method has higher predictive power than the gene score and GBLUP methods.
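A loose sketch of the two-stage idea: continuous (smooth-threshold) screening from univariate association scores, followed by ridge regression on the screened, weight-scaled SNPs. The weight formula, cutoff, and data are illustrative; the paper's exact estimator and its SURE-based cutoff choice are not reproduced.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
n, p = 300, 1000
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
y = X[:, :15] @ rng.normal(0.3, 0.1, 15) + rng.normal(size=n)

Xc = (X - X.mean(0)) / (X.std(0) + 1e-12)
z = np.abs(Xc.T @ (y - y.mean())) / np.sqrt(n)   # univariate z-like scores
tau, gamma = 1.5, 2.0
# Smooth threshold: weights rise continuously from 0 as |z| exceeds tau.
w = np.clip(1.0 - (tau / np.maximum(z, 1e-12)) ** gamma, 0.0, None)
keep = w > 0
print(f"{keep.sum()} SNPs pass the smooth threshold")

ridge = Ridge(alpha=10.0)
ridge.fit(Xc[:, keep] * np.sqrt(w[keep]), y)     # weight-scaled ridge fit
```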

11.
Correct selection of prognostic biomarkers among multiple candidates is becoming increasingly challenging as the dimensionality of biological data grows. Therefore, minimizing the false discovery rate (FDR) is of primary importance, while a low false negative rate (FNR) is a complementary measure. The lasso is a popular selection method in Cox regression, but its results depend heavily on the penalty parameter λ. Usually, λ is chosen by maximizing the cross-validated log-likelihood (max-cvl). However, this method often has a very high FDR. We review methods for a more conservative choice of λ. We propose an empirical extension of the cvl that adds a penalization term trading off goodness-of-fit against parsimony of the model, leading to the selection of fewer biomarkers and, as we show, to a reduction of the FDR without a large increase in FNR. We conducted a simulation study considering null and moderately sparse alternative scenarios and compared our approach with the standard lasso and 10 other competitors: Akaike information criterion (AIC), corrected AIC, Bayesian information criterion (BIC), extended BIC, Hannan and Quinn information criterion (HQIC), risk information criterion (RIC), one-standard-error rule, adaptive lasso, stability selection, and percentile lasso. Our extension achieved the best compromise across all scenarios between reducing the FDR and limiting the rise in FNR, followed by the AIC, the RIC, and the adaptive lasso, which performed well in some settings. We illustrate the methods using gene expression data from 523 breast cancer patients. In conclusion, we propose applying our extension of the lasso whenever a stringent FDR with a limited FNR is targeted.
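A minimal sketch of the criterion "cross-validated log-likelihood minus a parsimony penalty" for choosing the lasso penalty. For simplicity, a logistic lasso stands in for Cox regression; the trade-off constant `kappa`, the λ grid, and the data are illustrative, not the paper's proposal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n, p = 200, 300
X = rng.normal(size=(n, p))
y = (X[:, :5].sum(1) + rng.logistic(size=n) > 0).astype(int)

kappa = 1.0                                   # trade-off: fit vs. parsimony
best = (-np.inf, None)
for C in [0.01, 0.03, 0.1, 0.3, 1.0]:         # C = 1/lambda
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    # cross-validated log-likelihood (negative log loss, summed over samples)
    cvl = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss").mean() * n
    k = np.count_nonzero(model.fit(X, y).coef_)   # number of selected markers
    score = cvl - kappa * k                   # penalized cvl criterion
    if score > best[0]:
        best = (score, C)
print(f"selected C = {best[1]}")
```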

12.
In applications such as medical statistics and genetics, we encounter situations where a large number of highly correlated predictors explain a response. For example, the response may be a disease indicator and the predictors may be treatment indicators or single nucleotide polymorphisms (SNPs). Constructing a good predictive model in such cases is well studied. Less well understood is how to recover the 'true sparsity pattern', that is, to find which predictors have direct effects on the response and to indicate the statistical significance of the results. Restricting attention to binary predictors and response, we study the recovery of the true sparsity pattern using a two-stage method that separates establishing the presence of effects from inferring their exact relationship with the predictors. Simulations and a real data application demonstrate that the method discriminates well between associations and direct effects. Comparisons with lasso-based methods demonstrate favourable performance of the proposed method.

13.
Genetically complex diseases are caused by interacting environmental factors and genes. As a consequence, statistical methods that consider multiple unlinked genomic regions simultaneously are desirable. Here, we present a method to analyze case-control studies with multiple-SNP data without phase information that considers gene-gene interaction effects while correcting appropriately for multiple testing. In particular, we allow for interactions of haplotypes belonging to different unlinked regions, as haplotype analysis often proves more powerful than single-marker analysis. In addition, we consider different marker combinations at each unlinked region. The multiple testing issue is settled via the minP approach; the p-value of the "best" marker/region configuration is corrected via Monte Carlo simulation. Thus, we do not explicitly test for a specific predefined interaction model but rather test the global hypothesis that none of the considered haplotype interactions shows association with the disease. We carry out a simulation study for case-control data that confirms the validity of our approach. When simulating two-locus disease models, our test proves more powerful than association methods that analyze each linked region separately. In addition, when one of the tested regions is not involved in the etiology of the disease, only a small amount of power is lost with interaction analysis compared to analysis without interaction. We successfully applied our method to a real case-control dataset with markers from two genes controlling a common pathway. While classical analysis failed to reach significance, we obtained a significant result, even after correction for multiple testing, with the proposed haplotype interaction analysis. The method described here has been implemented in FAMHAP.
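A minimal sketch of the minP idea: take the largest association statistic (smallest p-value) over many tested marker/region configurations, then calibrate it by permuting case-control labels. Simple score-type statistics on random binary markers stand in for the haplotype interaction tests; everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, n_configs = 400, 50
G = rng.binomial(1, 0.4, size=(n, n_configs)).astype(float)  # configurations
y = rng.binomial(1, 0.5, size=n).astype(float)               # case/control

def min_stat(y, G):
    """Largest absolute association statistic over all configurations."""
    yc = y - y.mean()
    stats = np.abs((G - G.mean(0)).T @ yc) / np.sqrt(len(y))
    return stats.max()

observed = min_stat(y, G)
n_perm = 2000
null = np.array([min_stat(rng.permutation(y), G) for _ in range(n_perm)])
p_corrected = (1 + np.sum(null >= observed)) / (n_perm + 1)
print(f"multiplicity-corrected p-value = {p_corrected:.3f}")
```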

14.
A basket trial aims to expedite the drug development process by evaluating a new therapy in multiple populations within the same clinical trial. Each population, referred to as a "basket", can be defined by disease type, biomarkers, or other patient characteristics. The objective of a basket trial is to identify the subset of baskets for which the new therapy shows promise. The conventional approach is to analyze each basket independently. Alternatively, several Bayesian dynamic borrowing methods have been proposed that share data across baskets when responses appear similar. These methods can achieve higher power than independent testing in exchange for some risk of inflation in the type 1 error rate. In this paper, we propose a frequentist approach to dynamic borrowing for basket trials using the adaptive lasso. Through simulation studies, we demonstrate that the adaptive lasso can achieve power and type 1 error similar to the existing Bayesian methods. The proposed approach has the benefit of being easier to implement and faster than existing methods. In addition, the adaptive lasso approach is very flexible: it can be extended to basket trials with any number of treatment arms and any type of endpoint.
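A minimal sketch of the adaptive lasso itself: an initial ridge fit supplies weights 1/|beta_j|^gamma, implemented via the standard column-rescaling trick before an ordinary lasso. A linear response stands in for basket response rates, and the covariates and settings are illustrative, not the paper's basket-trial parameterization.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(9)
n, n_baskets = 200, 6
X = rng.normal(size=(n, n_baskets))            # e.g., basket-effect covariates
y = X[:, :2] @ np.array([1.0, 0.8]) + rng.normal(size=n)

gamma = 1.0
beta_init = Ridge(alpha=1.0).fit(X, y).coef_   # initial consistent estimate
w = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)  # adaptive weights
lasso = Lasso(alpha=0.1).fit(X / w, y)         # rescaling => weighted l1
beta = lasso.coef_ / w                         # back to the original scale
print(np.round(beta, 3))                       # exact zeros: effects not kept
```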

15.
Epistasis could be an important source of risk for disease. How interacting loci might be discovered is an open question for genome-wide association studies (GWAS). Most researchers limit their statistical analyses to testing individual pairwise interactions (i.e., marginal tests for association). A more effective means of identifying important predictors is to fit models that include many predictors simultaneously (i.e., higher-dimensional models). We explore a procedure called screen and clean (SC) for identifying liability loci, including interactions, using the lasso, a model selection tool for high-dimensional regression. We approach the problem using a varying dictionary of terms to include in the model. In the first step, the lasso dictionary includes only main effects. The most promising single-nucleotide polymorphisms (SNPs) are identified using a screening procedure. Next, the lasso dictionary is adjusted to include these main effects and the corresponding interaction terms. Again, promising terms are identified by lasso screening. Significant terms are then identified through the cleaning process. Implementing SC for GWAS requires algorithms to explore the complex model space induced by the many genotyped SNPs and their interactions. We propose and explore a set of algorithms and find that SC successfully controls Type I error while yielding good power to identify risk loci and their interactions. When the method is applied to data from the Wellcome Trust Case Control Consortium study of Type 1 Diabetes, it uncovers evidence supporting interaction within the HLA class II region as well as within chromosome 12q24. Genet. Epidemiol. 34: 275-285, 2010.
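A minimal sketch of the screen-and-clean template: lasso-screen main effects on one half of the data, expand the dictionary with pairwise interactions of the survivors, lasso-screen again, then "clean" with ordinary least-squares p-values on the held-out half. Thresholds, penalties, and data are illustrative, not the paper's tuned procedure.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso

rng = np.random.default_rng(10)
n, p = 400, 200
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
y = X[:, 0] + X[:, 1] + 0.8 * X[:, 0] * X[:, 1] + rng.normal(size=n)
half = n // 2

# Screen: main effects first, then main effects + interactions of survivors.
main = np.nonzero(Lasso(alpha=0.05).fit(X[:half], y[:half]).coef_)[0]
pairs = [(i, j) for k, i in enumerate(main) for j in main[k + 1:]]
D = np.column_stack([X[:, main]] + [X[:, i] * X[:, j] for i, j in pairs])
surv = np.nonzero(Lasso(alpha=0.05).fit(D[:half], y[:half]).coef_)[0]

# Clean: OLS on the held-out half gives valid p-values for surviving terms.
ols = sm.OLS(y[half:], sm.add_constant(D[half:, surv])).fit()
print(ols.pvalues[1:])
```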

16.
High-dimensional information about patients collected in clinical trials (such as genomic, imaging, and wearable-technology data) has the potential to be informative about the efficacy of a new treatment in situations where only a subset of patients benefits from it. The adaptive signature design (ASD) method has been proposed for developing and testing the efficacy of a treatment in a high-efficacy patient group (the sensitive group) using genetic data. The method requires the selection of three tuning parameters, which may be highly computationally expensive. We propose a variation of the ASD method, the cross-validated risk scores (CVRS) design, that does not require the selection of any tuning parameters. The method is based on computing a risk score for each patient and dividing the patients into clusters using a nonparametric clustering procedure. We assess the properties of CVRS against the originally proposed cross-validated ASD using simulated data and a real psychiatry trial. Across various sample sizes and response rates, CVRS achieves a substantial reduction in the computational time required. In many simulation scenarios, it also substantially improves the ability to correctly identify the sensitive group and the power of the design to detect a treatment effect in that group. We illustrate the application of the CVRS method on the psychiatry trial.
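A minimal sketch of the CVRS idea: compute each patient's risk score from a model fit with that patient held out (here via cross-validated predictions), then cluster the scores to define a candidate sensitive group. The clustering choice (k-means on the scores) and the simulated trial are illustrative, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(11)
n, p = 300, 50
X = rng.normal(size=(n, p))
sensitive = X[:, 0] > 0.5                     # hidden benefiting subgroup
treat = rng.binomial(1, 0.5, n)
y = rng.binomial(1, 0.2 + 0.5 * (sensitive & (treat == 1)))

# Cross-validated risk scores: each prediction comes from a held-out fold.
scores = cross_val_predict(LogisticRegression(max_iter=2000), X, y,
                           cv=10, method="predict_proba")[:, 1]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    scores.reshape(-1, 1))
high = labels == labels[np.argmax(scores)]    # cluster with higher scores
print(f"candidate sensitive group size: {high.sum()}")
```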

17.
Disaggregation regression has become an important tool in spatial disease mapping for making fine-scale predictions of disease risk from aggregated response data. By including high-resolution covariate information and modeling the data-generating process on a fine scale, it is hoped that these models can accurately learn the relationships between covariates and response at a fine spatial scale. However, validating these high-resolution predictions can be a challenge, as often no data are observed at this spatial scale. In this study, disaggregation regression was performed on simulated data in various settings, and the resulting fine-scale predictions were compared to the simulated ground truth. Performance was investigated for varying numbers of data points, sizes of aggregated areas, and levels of model misspecification. The effectiveness of cross-validation at the aggregate level as a measure of fine-scale predictive performance was also investigated. Predictive performance improved as the number of observations increased and as the size of the aggregated areas decreased. When the model was well specified, fine-scale predictions were accurate even with small numbers of observations and large aggregated areas. Under model misspecification, predictive performance was significantly worse for large aggregated areas but remained high when the response data were aggregated over smaller regions. Cross-validation correlation at the aggregate level was a moderately good predictor of fine-scale predictive performance. While these simulations are unlikely to capture the nuances of real-life response data, this study gives insight into the effectiveness of disaggregation regression in different contexts.
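A minimal sketch of disaggregation for a linear rate model: the fine-scale rate is X @ beta, but only area sums A @ (X @ beta) are observed, so beta can be estimated by regressing the aggregated response on the aggregated covariates A @ X. Real disaggregation models (e.g., Poisson likelihoods with spatial random fields) are richer; this toy setup is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(12)
n_fine, n_areas, p = 1000, 40, 3
X = rng.normal(size=(n_fine, p))                 # fine-scale covariates
beta_true = np.array([1.0, -0.5, 0.3])
area = rng.integers(0, n_areas, n_fine)          # area membership per pixel
A = np.zeros((n_areas, n_fine))
A[area, np.arange(n_fine)] = 1.0                 # aggregation matrix

y_agg = A @ (X @ beta_true) + rng.normal(0, 0.5, n_areas)  # observed sums
beta_hat, *_ = np.linalg.lstsq(A @ X, y_agg, rcond=None)
y_fine_pred = X @ beta_hat                       # fine-scale predictions
print(np.round(beta_hat, 3))
```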

18.
In cancer genomic studies, an important objective is to identify prognostic markers associated with patients' survival. Network-based regularization has achieved success in variable selection for high-dimensional cancer genomic data because of its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions and are contaminated by outliers, network-constrained regularization that does not take robustness into account leads to false identification of network structure and biased estimation of patients' survival. In this study, we develop a novel robust network-based variable selection method under the accelerated failure time model. Extensive simulation studies show the advantage of the proposed method over alternative methods. Two case studies of lung cancer datasets with high-dimensional gene expression measurements demonstrate that the proposed approach identifies markers with important implications.
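A minimal sketch of robust network-regularized regression: a Huber loss (robust to outlying log survival times) plus an l1 penalty and a network penalty b' L b built from a graph Laplacian L. It is fitted here with a general-purpose optimizer on an AFT-style linear model for illustration; the toy chain network is an assumption, and the paper's estimator and algorithm are not reproduced.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(13)
n, p = 150, 30
X = rng.normal(size=(n, p))
logT = X[:, :3] @ np.ones(3) + rng.standard_t(2, size=n)  # heavy-tailed noise

adj = np.zeros((p, p))                         # toy gene network: a chain
idx = np.arange(p - 1)
adj[idx, idx + 1] = adj[idx + 1, idx] = 1.0
L = np.diag(adj.sum(1)) - adj                  # graph Laplacian

def objective(b, lam1=0.05, lam2=0.5, delta=1.0):
    r = logT - X @ b
    huber = np.where(np.abs(r) <= delta, 0.5 * r**2,
                     delta * (np.abs(r) - 0.5 * delta))
    return huber.mean() + lam1 * np.abs(b).sum() + lam2 * b @ L @ b

fit = minimize(objective, np.zeros(p), method="Powell")  # derivative-free
print(np.round(fit.x[:5], 3))
```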

19.
Frequently, studies are conducted in a real clinic setting. When the outcome of interest is collected longitudinally over a specified period of time, this design can lead to unequally spaced intervals and varying numbers of assessments. In our study, these features were embedded in a randomized factorial design in which interventions to improve blood pressure control were delivered to both patients and providers. We examine the effect of the intervention and compare methods for estimating both fixed effects and variance components in the multilevel generalized linear mixed model. The methods compared include penalized quasi-likelihood (PQL), adaptive quadrature, and Bayesian Monte Carlo methods. We also investigate the implications of reducing the data and analysis to baseline and final measurements. In the full analysis, the PQL fixed-effects estimates were closest to zero, and confidence intervals were generally narrower than those of the other methods. The adaptive quadrature and Bayesian fixed-effects estimates were similar, but the Bayesian credible intervals were consistently wider. Variance component estimation was markedly different across methods, particularly for the patient-level random effects. In the baseline-and-final-measurement analysis, we found that the estimates and corresponding confidence intervals for the adaptive quadrature and Bayesian methods were very similar. However, the time effect was diminished, and other factors also failed to reach statistical significance, most likely due to decreased power. When analyzing data from this type of design, we recommend using either adaptive quadrature or Bayesian methods to fit a multilevel generalized linear mixed model including all available measurements.

20.
We propose a simple method to compute the sample size for an arbitrary test hypothesis in population pharmacokinetic (PK) studies analysed with nonlinear mixed effects models. Sample size procedures exist for linear mixed effects models and were recently extended by Rochon using the generalized estimating equations of Liang and Zeger; thus, fully model-based inference in sample size computation has been possible. The method we propose extends this approach using a first-order linearization of the nonlinear mixed effects model and the Wald chi-squared test statistic. The proposed method is general: it allows an arbitrary nonlinear model as well as an arbitrary distribution of random effects characterizing both inter- and intra-individual variability in the mixed effects model. To illustrate possible uses of the method, we present tables of minimum sample sizes, including an illustration of the effect of sampling design on sample size. We demonstrate how (D-)optimal or frequent sampling requires fewer subjects than a sparse sampling design. We also present results from Monte Carlo simulations showing that the computed sample size can produce the desired power. The proposed method greatly reduces computing time compared with simulation-based methods for estimating sample sizes in population PK studies.
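A minimal sketch of Wald-test sample size computation: given a per-subject Fisher information i1 for the parameter of interest (here a placeholder; in the paper it would come from the linearized mixed effects model and the sampling design), the Wald chi-squared statistic has noncentrality N * theta^2 * i1, and N is the smallest integer achieving the target power. The effect size, information, alpha, and power below are illustrative, not values from the paper.

```python
import numpy as np
from scipy.stats import chi2, ncx2

theta = 0.25     # effect size to detect (e.g., a covariate effect on clearance)
i1 = 0.9         # per-subject Fisher information from the linearized model
alpha, power = 0.05, 0.80
crit = chi2.ppf(1 - alpha, df=1)        # critical value of the 1-df Wald test

N = 1
while ncx2.sf(crit, df=1, nc=N * theta**2 * i1) < power:
    N += 1                              # smallest N reaching the target power
print(f"minimum sample size: N = {N}")
```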
