Similar Documents
20 similar documents found.
1.
In cancer studies with high-throughput genetic and genomic measurements, integrative analysis provides a way to effectively pool and analyze heterogeneous raw data from multiple independent studies and outperforms “classic” meta-analysis and single-dataset analysis. When marker selection is of interest, the genetic basis of multiple datasets can be described using the homogeneity model or the heterogeneity model. In this study, we consider marker selection under the heterogeneity model, which includes the homogeneity model as a special case and can be more flexible. Penalization methods have been developed in the literature for marker selection. This study advances beyond the published ones by introducing contrast penalties, which can accommodate the within- and across-dataset structures of covariates/regression coefficients and, by doing so, further improve marker selection performance. Specifically, we develop a penalization method that accommodates the across-dataset structures by smoothing over regression coefficients. An effective iterative algorithm, which calls an inner coordinate descent iteration, is developed. Simulation shows that the proposed method outperforms the benchmark with more accurate marker identification. The analysis of breast cancer and lung cancer prognosis studies with gene expression measurements shows that the proposed method identifies genes different from those identified using the benchmark and has better prediction performance.
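A minimal LaTeX sketch, in notation of my own choosing (the paper's exact penalty is not reproduced here), of how such a contrast-penalized objective might look: a sparsity penalty selects markers within each dataset, while a quadratic contrast smooths the coefficients of the same gene across datasets.

```latex
% beta_{mj}: coefficient of gene j in dataset m; L_m: loss for dataset m;
% rho: a sparsity penalty such as MCP. All notation here is illustrative.
\min_{\beta}\;\sum_{m=1}^{M} L_m(\beta_m)
  + \sum_{m=1}^{M}\sum_{j=1}^{p} \rho\bigl(|\beta_{mj}|;\lambda_1\bigr)
  + \lambda_2 \sum_{j=1}^{p}\sum_{m<m'} \bigl(\beta_{mj}-\beta_{m'j}\bigr)^2
```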

2.
In profiling studies, the analysis of a single dataset often leads to unsatisfactory results because of the small sample size. Multi-dataset analysis utilizes information from multiple independent datasets and outperforms single-dataset analysis. Among the available multi-dataset analysis methods, integrative analysis methods aggregate and analyze raw data and outperform meta-analysis methods, which analyze multiple datasets separately and then pool summary statistics. In this study, we conduct integrative analysis and marker selection under the heterogeneity structure, which allows different datasets to have overlapping but not necessarily identical sets of markers. Under certain scenarios, it is reasonable to expect some similarity of identified marker sets – or equivalently, similarity of model sparsity structures – across multiple datasets. However, the existing methods do not have a mechanism to explicitly promote such similarity. To tackle this problem, we develop a sparse boosting method. This method uses a BIC/HDBIC criterion to select weak learners in boosting and encourages sparsity. A new penalty is introduced to promote the similarity of model sparsity structures across datasets. The proposed method has an intuitive formulation and is broadly applicable and computationally affordable. In numerical studies, we analyze right-censored survival data under the accelerated failure time model. Simulation shows that the proposed method outperforms alternative boosting and penalization methods with more accurate marker identification. The analysis of three breast cancer prognosis datasets shows that the proposed method can identify marker sets with increased similarity across datasets and improved prediction performance. Copyright © 2016 John Wiley & Sons, Ltd.
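The Python sketch below illustrates generic componentwise sparse boosting with a BIC selection rule on a single dataset; it is not the authors' exact algorithm (which uses a BIC/HDBIC weak-learner criterion plus a cross-dataset similarity penalty), and all names and defaults are illustrative.

```python
import numpy as np

def sparse_boost(X, y, n_iter=200, nu=0.1):
    """Componentwise L2 boosting with a BIC-based model choice (sketch only).

    At each step the single covariate that most reduces the residual sum of
    squares receives a shrunken update; the model with the best BIC along
    the boosting path is returned.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - y.mean()                      # implicit intercept
    best_bic, best_beta = np.inf, beta.copy()
    for _ in range(n_iter):
        # univariate least-squares fit of each covariate to the residual
        coefs = (X.T @ resid) / (X ** 2).sum(axis=0)
        rss = ((resid[:, None] - X * coefs) ** 2).sum(axis=0)
        j = int(np.argmin(rss))               # best weak learner
        beta[j] += nu * coefs[j]              # shrunken update
        resid = resid - nu * coefs[j] * X[:, j]
        df = np.count_nonzero(beta)           # crude model size for BIC
        bic = n * np.log(resid @ resid / n) + np.log(n) * df
        if bic < best_bic:
            best_bic, best_beta = bic, beta.copy()
    return best_beta
```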

3.
Ma S, Huang J, Wei F, Xie Y, Fang K. Statistics in Medicine, 2011, 30(28): 3361-3371.
Although in cancer research microarray gene profiling studies have been successful in identifying genetic variants predisposing to the development and progression of cancer, the markers identified from analysis of single datasets often suffer from low reproducibility. Among multiple possible causes, the most important one is the small sample size, and hence the lack of power, of single studies. Integrative analysis jointly considers multiple heterogeneous studies, has a significantly larger sample size, and can improve reproducibility. In this article, we focus on cancer prognosis studies, where the response variables are progression-free, overall, or other types of survival. A group minimax concave penalty (GMCP) penalized integrative analysis approach is proposed for analyzing multiple heterogeneous cancer prognosis studies with microarray gene expression measurements. An efficient group coordinate descent algorithm is developed. The GMCP can automatically accommodate the heterogeneity across multiple datasets, and the identified markers have consistent effects across multiple studies. Simulation studies show that the GMCP provides significantly improved selection results as compared with the existing meta-analysis approaches, intensity approaches, and group Lasso penalized integrative analysis. We apply the GMCP to four microarray studies and identify genes associated with the prognosis of breast cancer.
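For reference, the MCP of Zhang (2010) has the standard form below; the group version sketched after it (notation mine) shows how gene-level selection with consistent cross-study effects can be encoded: the penalty acts on the norm of one gene's coefficients across the M datasets, so the gene is kept or dropped in all studies jointly.

```latex
% Minimax concave penalty (MCP) for t >= 0, with regularization lambda and
% concavity parameter gamma:
\rho(t;\lambda,\gamma) = \lambda\int_0^{t}\Bigl(1-\frac{x}{\gamma\lambda}\Bigr)_{+}\,dx
% Group MCP applied to gene j's coefficients across M datasets (sketch):
P(\beta_j) = \rho\Bigl(\lVert(\beta_{j1},\dots,\beta_{jM})\rVert_2;\ \sqrt{M}\,\lambda,\ \gamma\Bigr)
```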

4.
Consider the integrative analysis of genetic data with multiple correlated response variables. The goal is to identify important gene–environment (G × E) interactions along with main gene and environment effects that are associated with the responses. The homogeneity and heterogeneity models can be adopted to describe the genetic basis of multiple responses. To accommodate the possible nonlinearity of some environmental effects, a multi-response partially linear varying coefficient model is assumed. Penalization is adopted for marker selection. The proposed penalization method can simultaneously select genetic variants with G × E interactions, variants without G × E interactions, and variants without main effects. It adopts different penalties to accommodate the homogeneity and heterogeneity models. The proposed method can be effectively computed using a coordinate descent algorithm. A simulation study and the analysis of the Health Professionals Follow-up Study, which has two correlated continuous traits, SNP measurements, and multiple environment effects, show the superior performance of the proposed method over its competitors. Copyright © 2014 John Wiley & Sons, Ltd.
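As a hedged illustration (notation entirely mine, not the paper's), a multi-response partially linear varying-coefficient model of this kind might take the form below, where the α's are nonparametric functions of an environment variable u and the η's are the G × E interaction coefficients targeted by the selection penalty.

```latex
% y_k: k-th response; e_l: environment effects; g_j: genetic variants;
% alpha_{kl}(u): varying coefficients; eta_{kjl}: G x E interactions.
y_k = \sum_{l} \alpha_{kl}(u)\,e_l + \sum_{j}\beta_{kj}\,g_j
      + \sum_{j,l}\eta_{kjl}\,g_j e_l + \varepsilon_k,
\qquad k = 1,\dots,K
```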

5.
In high-throughput cancer genomic studies, markers identified from the analysis of single datasets may have unsatisfactory properties because of low sample sizes. Integrative analysis pools and analyzes raw data from multiple studies, and can effectively increase sample size and lead to improved marker identification results. In this study, we consider the integrative analysis of multiple high-throughput cancer prognosis studies. In existing integrative analysis studies, the interplay among genes, which can be described using the network structure, has not been effectively accounted for. In network analysis, tightly connected nodes (genes) are more likely to have related biological functions and similar regression coefficients. The goal of this study is to develop an analysis approach that can incorporate the gene network structure in integrative analysis. To this end, we adopt an AFT (accelerated failure time) model to describe survival. A weighted least squares approach, which has low computational cost, is adopted for estimation. For marker selection, we propose a new penalization approach. The proposed penalty is composed of two parts. The first part is a group MCP penalty and conducts gene selection. The second part is a Laplacian penalty and smooths the differences of coefficients for tightly connected genes. A group coordinate descent approach is developed to compute the proposed estimate. A simulation study shows satisfactory performance of the proposed approach when there exist moderate-to-strong correlations among genes. We analyze three lung cancer prognosis datasets and demonstrate that incorporating the network structure can lead to the identification of important genes and improved prediction performance.
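Weighted least squares estimation under the AFT model is commonly implemented with Kaplan-Meier (Stute) weights; the following is a from-memory sketch of that standard construction, not code from the paper, and the fit shown in the trailing comment is only a plain (unpenalized) illustration.

```python
import numpy as np

def km_weights(time, delta):
    """Kaplan-Meier (Stute) weights for weighted least squares under the AFT
    model. time: observed times; delta: event indicators (1 = event)."""
    n = len(time)
    order = np.argsort(time)
    d = delta[order].astype(float)
    w = np.zeros(n)
    prod = 1.0
    for i in range(n):                       # i-th smallest observed time
        w[i] = d[i] / (n - i) * prod
        prod *= ((n - i - 1) / (n - i)) ** d[i]
    out = np.zeros(n)
    out[order] = w                           # map back to original ordering
    return out

# Unpenalized weighted LS fit of log time on expressions X (illustration):
# w = km_weights(t, delta)
# beta = np.linalg.lstsq(X * np.sqrt(w)[:, None],
#                        np.log(t) * np.sqrt(w), rcond=None)[0]
```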

6.
Integrative analysis of high-dimensional omics datasets has been studied by many authors in recent years. By incorporating prior known relationships among the variables, these analyses have been successful in elucidating the relationships between different sets of omics data. In this article, our goal is to identify important relationships between genomic expression and cytokine data from a human immunodeficiency virus vaccine trial. We propose a flexible partial least squares technique, which incorporates group and subgroup structure in the modelling process. Our new method accounts for both grouping of genetic markers (e.g., gene sets) and temporal effects. The method generalises existing sparse modelling techniques in the partial least squares methodology and establishes theoretical connections to variable selection methods for supervised and unsupervised problems. Simulation studies are performed to investigate the performance of our methods over alternative sparse approaches. Our R package sgspls is available at https://github.com/matt-sutton/sgspls.
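To make the sparse-PLS idea concrete, here is a generic Python sketch of one sparse PLS direction via soft-thresholded power iteration on X'Y; it is plain sparse PLS, not the group/subgroup penalty of the sgspls package, and all names are illustrative.

```python
import numpy as np

def soft(v, lam):
    """Soft-thresholding operator used to sparsify weight vectors."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pls_component(X, Y, lam=0.1, n_iter=50):
    """One sparse PLS direction (generic sketch, not sgspls itself)."""
    M = X.T @ Y                                # cross-covariance matrix
    u = M[:, 0] / (np.linalg.norm(M[:, 0]) + 1e-12)
    for _ in range(n_iter):
        v = M.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = soft(M @ v, lam)                   # sparsity on the X-weights
        u /= np.linalg.norm(u) + 1e-12
    return u, v
```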

7.
In cancer genomic studies, an important objective is to identify prognostic markers associated with patients' survival. Network-based regularization has achieved success in variable selection for high-dimensional cancer genomic data because of its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions and are contaminated by outliers, network-constrained regularization that does not take robustness into account leads to false identification of network structure and biased estimation of patients' survival. In this study, we develop a novel robust network-based variable selection method under the accelerated failure time model. Extensive simulation studies show the advantage of the proposed method over alternative methods. Two case studies of lung cancer datasets with high-dimensional gene expression measurements demonstrate that the proposed approach has identified markers with important implications.
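One plausible form of such an objective (my notation, not the paper's exact criterion): a least-absolute-deviations loss gives robustness to outlying log survival times, and a degree-normalized Laplacian term smooths coefficients of genes connected in a network G.

```latex
% w_i: Kaplan-Meier weights; a_{jk}: adjacency of genes j and k in G;
% d_j: degree of gene j. Notation is illustrative.
\min_{\beta}\;\sum_{i} w_i\bigl|\log t_i - x_i^{\top}\beta\bigr|
  + \lambda_1 \sum_{j} |\beta_j|
  + \lambda_2 \sum_{(j,k)\in G} a_{jk}
      \Bigl(\frac{\beta_j}{\sqrt{d_j}} - \frac{\beta_k}{\sqrt{d_k}}\Bigr)^{2}
```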

8.
The use of individual participant data (IPD) from multiple studies is an increasingly popular approach when developing a multivariable risk prediction model. Corresponding datasets, however, typically differ in important aspects, such as baseline risk. This has driven the adoption of meta-analytical approaches for appropriately dealing with heterogeneity between study populations. Although these approaches provide an averaged prediction model across all studies, little guidance exists about how to apply or validate this model to new individuals or study populations outside the derivation data. We consider several approaches to develop a multivariable logistic regression model from an IPD meta-analysis (IPD-MA) with potential between-study heterogeneity. We also propose strategies for choosing a valid model intercept for when the model is to be validated or applied to new individuals or study populations. These strategies can be implemented by the IPD-MA developers or future model validators. Finally, we show how model generalizability can be evaluated when external validation data are lacking using internal–external cross-validation and extend our framework to count and time-to-event data. In an empirical evaluation, our results show how stratified estimation allows study-specific model intercepts, which can then inform the intercept to be used when applying the model in practice, even to a population not represented by included studies. In summary, our framework allows the development (through stratified estimation), implementation in new individuals (through focused intercept choice), and evaluation (through internal–external validation) of a single, integrated prediction model from an IPD-MA in order to achieve improved model performance and generalizability. Copyright © 2013 John Wiley & Sons, Ltd.
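The internal-external cross-validation loop described above is mechanically simple; here is a minimal Python sketch (illustrative names, logistic model via scikit-learn, discrimination measured by AUC) in which each study is held out in turn, the model is fitted on the remaining studies, and performance is checked on the held-out study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def internal_external_cv(X, y, study):
    """Internal-external cross-validation sketch.

    X, y, study: NumPy arrays; study[i] labels the source study of subject i.
    Returns the held-out AUC for each study.
    """
    aucs = {}
    for s in np.unique(study):
        train, test = study != s, study == s
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        aucs[s] = roc_auc_score(y[test], model.predict_proba(X[test])[:, 1])
    return aucs
```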

9.
Genetic heterogeneity, which may manifest on a population level as different frequencies of a specific disease susceptibility allele in different subsets of patients, is a common problem for candidate gene and genome-wide association studies of complex human diseases. The ordered subset analysis (OSA) was originally developed as a method to reduce genetic heterogeneity in the context of family-based linkage studies. Here, we have extended a previously proposed method (OSACC) for applying the OSA methodology to case-control datasets. We have evaluated the type I error and power of different OSACC permutation tests with an extensive simulation study. Case-control datasets were generated under two different models by which continuous clinical or environmental covariates may influence the relationship between susceptibility genotypes and disease risk. Our results demonstrate that OSACC is more powerful under some disease models than the commonly used trend test and a previously proposed joint test of main genetic and gene-environment interaction effects. An additional unique benefit of OSACC is its ability to identify a more informative subset of cases that may be subjected to more detailed molecular analysis, such as DNA sequencing of selected genomic regions to detect functional variants in linkage disequilibrium with the associated polymorphism. The OSACC-identified covariate threshold may also improve the power of an additional dataset to replicate previously reported associations that may only be detectable in a fraction of the original and replication datasets. In summary, we have demonstrated that OSACC is a useful method for improving SNP association signals in genetically heterogeneous datasets. Genet. Epidemiol. 34: 407–417, 2010. © 2010 Wiley-Liss, Inc.
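A much-simplified Python sketch of the ordered-subset-with-permutation idea (my construction, not OSACC's exact statistic): cases are ranked by a covariate, nested case subsets are scanned for the strongest genotype association against all controls, and the observed maximum is referred to a permutation null obtained by shuffling the covariate.

```python
import numpy as np
from scipy.stats import chi2_contingency

def osa_statistic(case_geno, control_geno, case_cov, min_size=10):
    """Maximum carrier-vs-noncarrier chi-square over nested case subsets
    ordered by a covariate (simplified ordered-subset scan)."""
    order = np.argsort(case_cov)
    best = 0.0
    for k in range(min_size, len(order) + 1):
        sub = case_geno[order[:k]]
        table = np.array([
            [(sub > 0).sum(), (sub == 0).sum()],
            [(control_geno > 0).sum(), (control_geno == 0).sum()],
        ])
        if table.min() > 0:                  # avoid degenerate tables
            best = max(best, chi2_contingency(table)[0])
    return best

def permutation_p(case_geno, control_geno, case_cov, n_perm=1000, seed=0):
    """Permutation p-value: reshuffle the covariate over cases and re-scan."""
    rng = np.random.default_rng(seed)
    obs = osa_statistic(case_geno, control_geno, case_cov)
    null = [osa_statistic(case_geno, control_geno, rng.permutation(case_cov))
            for _ in range(n_perm)]
    return (1 + sum(s >= obs for s in null)) / (n_perm + 1)
```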

10.
Linkage disequilibrium (LD) in the human genome, often measured as pairwise correlation between adjacent markers, shows substantial spatial heterogeneity. Congruent with these results, studies have found that certain regions of the genome have far less haplotype diversity than expected if the alleles at multiple markers were independent, while other sets of adjacent markers behave almost independently. Regions with limited haplotype diversity have been described as "blocked" or "haplotype blocks." In this article, we propose a new method that aims to distinguish between blocked and unblocked regions in the genome. Like some other approaches, the method analyzes haplotype diversity. Unlike other methods, it allows for adjacent, distinct blocks and also multiple, independent single nucleotide polymorphisms (SNPs) separating blocks. Based on an approximate likelihood model and a parsimony criterion to penalize for model complexity, the method partitions a genomic region into blocks relatively quickly, and simulations suggest that its partitions are accurate. We also propose a new, efficient method to select SNPs for association analysis, namely tag SNPs. These methods compare favorably to similar blocking and tagging methods using simulations.
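Block partitioning of this kind can be organized as a standard dynamic program over contiguous marker intervals; the Python sketch below is generic (the `cost` callable and `penalty`, standing in for the approximate block likelihood and parsimony charge, are placeholders), not the paper's likelihood.

```python
def best_partition(n, cost, penalty):
    """Optimal partition of markers 0..n-1 into contiguous blocks.

    cost(i, j): score of block [i, j] (e.g., a negative log-likelihood of
    its haplotype diversity); penalty: charge per block, mimicking a
    parsimony criterion. Returns the list of (start, end) blocks.
    """
    best = [0.0] * (n + 1)      # best[j]: optimal score for markers 0..j-1
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        best[j], cut[j] = min(
            (best[i] + cost(i, j - 1) + penalty, i) for i in range(j)
        )
    blocks, j = [], n           # backtrack block boundaries
    while j > 0:
        blocks.append((cut[j], j - 1))
        j = cut[j]
    return blocks[::-1]
```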

11.
In cancer research, high-throughput profiling studies have been extensively conducted, searching for genes/single nucleotide polymorphisms (SNPs) associated with prognosis. Despite seemingly significant differences, different subtypes of the same cancer (or different types of cancers) may share common susceptibility genes. In this study, we analyze prognosis data on multiple subtypes of the same cancer but note that the proposed approach is directly applicable to the analysis of data on multiple types of cancers. We describe the genetic basis of multiple subtypes using the heterogeneity model, which allows overlapping but different sets of susceptibility genes/SNPs for different subtypes. An accelerated failure time (AFT) model is adopted to describe prognosis. We develop a regularized gradient descent approach that conducts gene-level analysis and identifies genes that contain important SNPs associated with prognosis. The proposed approach belongs to the family of gradient descent approaches, is intuitively reasonable, and has affordable computational cost. A simulation study shows that when prognosis-associated SNPs are clustered in a small number of genes, the proposed approach outperforms alternatives with significantly more true positives and fewer false positives. We analyze an NHL (non-Hodgkin lymphoma) prognosis study with SNP measurements and identify genes associated with the three major subtypes of NHL, namely DLBCL, FL, and CLL/SLL. The proposed approach identifies genes different from those identified using alternative approaches and has the best prediction performance.
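As a purely illustrative Python sketch of gene-level selection inside a gradient descent loop (my construction, not the paper's algorithm): after each gradient step, the SNP coefficients of any gene whose group norm falls below a threshold are zeroed, so genes enter or leave the model as units.

```python
import numpy as np

def gene_level_gd(X, y, groups, tau=0.1, lr=0.01, n_iter=500):
    """Regularized gradient descent with gene-level hard thresholding.

    groups[j] gives the gene membership of SNP j. Hypothetical sketch.
    """
    groups = np.asarray(groups)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta -= lr * (X.T @ (X @ beta - y)) / n   # least-squares gradient step
        for g in np.unique(groups):
            idx = np.flatnonzero(groups == g)
            if np.linalg.norm(beta[idx]) < tau:   # gene-level threshold
                beta[idx] = 0.0
    return beta
```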

12.
The power of genome-wide association studies (GWAS) for mapping complex traits with single-SNP (single-nucleotide polymorphism) analysis may be undermined by modest SNP effect sizes, unobserved causal SNPs, correlation among adjacent SNPs, and SNP-SNP interactions. Alternative approaches that test the association between a single SNP set and individual phenotypes have been shown to be promising for improving the power of GWAS. We propose a Bayesian latent variable selection (BLVS) method to simultaneously model the joint association mapping between a large number of SNP sets and complex traits. Compared with single-SNP-set analysis, such joint association mapping not only accounts for the correlation among SNP sets but is also capable of detecting causal SNP sets that are marginally uncorrelated with traits. The spike-and-slab prior assigned to the effects of SNP sets can greatly reduce the dimension of effective SNP sets while speeding up computation. An efficient Markov chain Monte Carlo algorithm is developed. Simulations demonstrate that BLVS outperforms several competing variable selection methods in some important scenarios.
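The spike-and-slab prior named above has the standard form below (my notation): the Bernoulli indicators are what reduce the effective dimension and drive the MCMC-based selection.

```latex
% b_g: effect of SNP set g; delta_0: point mass at zero (the "spike");
% the normal component is the "slab"; pi: prior inclusion probability.
b_g \mid \gamma_g \sim \gamma_g\,N(0,\tau^2) + (1-\gamma_g)\,\delta_0,
\qquad \gamma_g \sim \mathrm{Bernoulli}(\pi)
```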

13.
For analyzing complex trait association with sequencing data, most current studies test aggregated effects of variants in a gene or genomic region. Although gene-based tests have insufficient power even for moderately sized samples, pathway-based analyses combine information across multiple genes in biological pathways and may offer additional insight. However, most existing pathway association methods were originally designed for genome-wide association studies and have not been comprehensively evaluated for sequencing data. Moreover, region-based rare variant association methods, although potentially applicable to pathway-based analysis by extending their region definition to gene sets, have never been rigorously tested. In the context of exome-based studies, we use simulated and real datasets to evaluate pathway-based association tests. Our simulation strategy adopts a genome-wide genetic model that distributes total genetic effects hierarchically into pathways, genes, and individual variants, allowing the evaluation of pathway-based methods with realistic, quantifiable assumptions on the underlying genetic architectures. The results show that, although no single pathway-based association method offers superior performance in all simulated scenarios, a modification of the Gene Set Enrichment Analysis approach using statistics from single-marker tests without gene-level collapsing (the weighted Kolmogorov-Smirnov [WKS]-Variant method) is consistently powerful. Interestingly, directly applying rare variant association tests (e.g., the sequence kernel association test) to pathway analysis offers similar power, but its results are sensitive to assumptions about the genetic architecture. We applied pathway association analysis to exome-sequencing data from a chronic obstructive pulmonary disease study and found that the WKS-Variant method confirms previously published gene associations.
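A compact Python sketch of a GSEA-style weighted Kolmogorov-Smirnov running-sum statistic applied directly to single-marker statistics (an illustration of the WKS-Variant idea, not the authors' exact implementation; names are mine):

```python
import numpy as np

def wks_statistic(stats, in_pathway, weight=1.0):
    """Weighted KS enrichment score over a ranked list of variants.

    stats: per-variant association scores; in_pathway: boolean mask of
    pathway membership. Returns the signed maximum running-sum deviation.
    """
    order = np.argsort(-np.abs(stats))            # rank variants by |score|
    hit = in_pathway[order].astype(float)
    w = np.abs(stats[order]) ** weight * hit
    p_hit = np.cumsum(w) / (w.sum() + 1e-12)      # weighted CDF of hits
    p_miss = np.cumsum(1 - hit) / max((1 - hit).sum(), 1)
    running = p_hit - p_miss
    return running[np.argmax(np.abs(running))]    # signed enrichment score
```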

14.
In the analysis of complex and high-dimensional data, graphical models have been commonly adopted to describe associations among variables. When common factors exist that make the associations dense, the single factor graphical model has been proposed, which first extracts the common factor and then conducts graphical modeling. In other, simpler contexts, it has been recognized that results generated from analyzing a single dataset are often unsatisfactory, and integrating multiple datasets can effectively improve variable selection and estimation. In graphical modeling, the increased number of parameters makes the “lack of information” problem more severe. In this article, we integrate multiple datasets and conduct the approximate single factor graphical model analysis. A novel penalization approach is developed for the identification and estimation of important loadings and edges. An effective computational algorithm is developed. A wide spectrum of simulations and the analysis of breast cancer gene expression datasets demonstrate the competitive performance of the proposed approach. Overall, this study provides an effective new avenue for taking advantage of multiple datasets and improving graphical model analysis.
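A sketch of the single-factor-plus-graph decomposition for dataset m, in my own notation: the common factor absorbs the dense associations, and graphical modeling targets the sparse precision matrix of the residuals; integration then penalizes loadings and edges with contrasts across the datasets.

```latex
% x^{(m)}: p-vector of variables in dataset m; f^{(m)}: common factor with
% loadings lambda^{(m)}; Theta^{(m)}: sparse residual precision matrix.
x^{(m)} = \lambda^{(m)} f^{(m)} + \epsilon^{(m)}, \qquad
\epsilon^{(m)} \sim N\bigl(0,\ (\Theta^{(m)})^{-1}\bigr)
```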

15.
Multiple imputation (MI) is a commonly used technique for handling missing data in large-scale medical and public health studies. However, variable selection on multiply-imputed data remains an important and longstanding statistical problem. If a variable selection method is applied to each imputed dataset separately, it may select different variables for different imputed datasets, which makes it difficult to interpret the final model or draw scientific conclusions. In this paper, we propose a novel multiple imputation-least absolute shrinkage and selection operator (MI-LASSO) variable selection method as an extension of the least absolute shrinkage and selection operator (LASSO) method to multiply-imputed data. The MI-LASSO method treats the estimated regression coefficients of the same variable across all imputed datasets as a group and applies the group LASSO penalty to yield consistent variable selection across the multiply-imputed datasets. We use a simulation study to demonstrate the advantage of the MI-LASSO method compared with the alternatives. We also apply the MI-LASSO method to the University of Michigan Dioxin Exposure Study to identify important circumstances and exposure factors that are associated with human serum dioxin concentration in Midland, Michigan. Copyright © 2013 John Wiley & Sons, Ltd.
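The group-LASSO construction described above can be written compactly (standard form; superscript d indexes the D imputed datasets):

```latex
% beta_j^{(d)}: coefficient of variable j in imputed dataset d. The group
% penalty zeroes variable j in all D datasets simultaneously.
\min_{\beta}\ \sum_{d=1}^{D}\bigl\lVert y^{(d)} - X^{(d)}\beta^{(d)}\bigr\rVert_2^2
  + \lambda \sum_{j=1}^{p}\sqrt{\sum_{d=1}^{D}\bigl(\beta_j^{(d)}\bigr)^2}
```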

16.
In genomic studies with both genotypes and gene or protein expression profiles available, causal effects of genes or proteins on clinical outcomes can be inferred by using genetic variants as instrumental variables (IVs). The goal of introducing IVs is to remove the effects of unobserved factors that may confound the relationship between the biomarkers and the outcome. A valid inference under the IV framework requires pairwise associations and pathway exclusivity. Among these assumptions, the IV-expression association needs to be strong for the causal effect estimates to be unbiased. However, a small number of single nucleotide polymorphisms (SNPs) often provide limited explanation of the variability in the gene or protein expression and can only serve as weak IVs. In this study, we propose to replace SNPs with haplotypes as IVs to increase the variant-expression association and thus improve the causal effect inference for the expression. Within the classical two-stage procedure, we developed a haplotype regression model combined with a model selection procedure to identify optimal instruments. The performance of the new method was evaluated through simulations and compared with IV approaches using observed multiple SNPs. Our results showed a gain of power to detect a causal effect of the gene or protein on the outcome using haplotypes compared with using only observed SNPs, under either complete or missing genotype scenarios. We applied our proposed method to a study of the effect of interleukin-1 beta (IL-1β) protein expression on 90-day survival following sepsis and found that overly expressed IL-1β is likely to increase mortality.
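The classical two-stage procedure referenced above reduces to two regressions; a minimal Python sketch (illustrative names; the haplotype-regression and model-selection steps of the paper are not shown, only plain two-stage least squares):

```python
import numpy as np

def two_stage_ls(Z, x, y):
    """Two-stage least squares: stage 1 regresses expression x on the
    instrument matrix Z (e.g., haplotype indicators, intercept included);
    stage 2 regresses the outcome y on the fitted (instrumented) expression.
    Returns the causal effect estimate."""
    g1, *_ = np.linalg.lstsq(Z, x, rcond=None)
    x_hat = Z @ g1                                   # instrumented expression
    A = np.column_stack([np.ones_like(x_hat), x_hat])
    g2, *_ = np.linalg.lstsq(A, y, rcond=None)
    return g2[1]
```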

17.
In the past few years, a plethora of methods for rare variant association with phenotype have been proposed. These methods aggregate information from multiple rare variants across genomic region(s), but there is little consensus as to which method is most effective. The weighting scheme adopted when aggregating information across variants is one of the primary determinants of effectiveness. Here we present a systematic evaluation of multiple weighting schemes through a series of simulations intended to mimic large sequencing studies of a quantitative trait. We evaluate existing phenotype-independent and phenotype-dependent methods, as well as weights estimated by penalized regression approaches including Lasso, Elastic Net, and SCAD. We find that the difference in power between phenotype-dependent schemes is negligible when high-quality functional annotations are available. When functional annotations are unavailable or incomplete, all methods suffer from power loss; however, the variable selection methods outperform the others at the cost of increased computational time. Therefore, in the absence of good annotation, we recommend variable selection methods (which can be viewed as “statistical annotation”) on top of regions implicated by a phenotype-independent weighting scheme. Further, once a region is implicated, variable selection can help to identify potential causal single nucleotide polymorphisms for biological validation. These findings are supported by an analysis of a high coverage targeted sequencing study of 1,898 individuals.
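Two common phenotype-independent weighting schemes can be sketched in a few lines of Python; the Beta(1, 25) density weight is the default popularized by SKAT, and the variance-based weight follows the Madsen-Browning style (the function and argument names here are illustrative).

```python
import numpy as np
from scipy.stats import beta

def burden_score(G, maf, scheme="beta"):
    """Weighted burden aggregation of rare variants in a region.

    G: n x q genotype matrix (minor allele counts); maf: minor allele
    frequencies in (0, 1). Returns one aggregated score per subject.
    """
    if scheme == "beta":
        w = beta.pdf(maf, 1, 25)                # upweights the rarest variants
    else:
        w = 1.0 / np.sqrt(maf * (1 - maf))      # Madsen-Browning-style weights
    return G @ w
```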

18.
Rich meta-epidemiological datasets have been collected to explore associations between intervention effect estimates and study-level characteristics. Welton et al. proposed models for the analysis of meta-epidemiological data, but these models are restrictive because they force heterogeneity among studies with a particular characteristic to be at least as large as that among studies without the characteristic. In this paper we present alternative models that are invariant to the labels defining the two categories of studies. To exemplify the methods, we use a collection of meta-analyses in which the Cochrane Risk of Bias tool has been implemented. We first investigate the influence of small trial sample sizes (less than 100 participants), before investigating the influence of multiple methodological flaws (inadequate or unclear sequence generation, allocation concealment, and blinding). We fit both the Welton et al. model and our proposed label-invariant model and compare the results. Estimates of mean bias associated with the trial characteristics and of between-trial variances are not very sensitive to the choice of model. Results from fitting a univariable model show that heterogeneity variance is, on average, 88% greater among trials with less than 100 participants. On the basis of a multivariable model, heterogeneity variance is, on average, 25% greater among trials with inadequate/unclear sequence generation, 51% greater among trials with inadequate/unclear blinding, and 23% lower among trials with inadequate/unclear allocation concealment, although the 95% intervals for these ratios are very wide. Our proposed label-invariant models for meta-epidemiological data analysis facilitate investigations of between-study heterogeneity attributable to certain study characteristics.
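One way to see the label-invariance issue (this parameterization is my guess at the general idea, not the paper's exact model): Welton et al. set the variance for flagged studies as an additive inflation, which forces it to be at least as large as that of unflagged studies, whereas a ratio parameterization treats the two labels symmetrically.

```latex
% tau_0^2, tau_1^2: heterogeneity variances for studies without/with the
% characteristic. Additive (restrictive) vs ratio (label-invariant) forms:
\text{Welton et al.:}\quad \tau_1^2 = \tau_0^2 + \kappa^2 \ \ (\ge \tau_0^2)
\qquad\text{ratio form:}\quad \tau_1^2 = \tau_0^2\, e^{\phi},\ \ \phi \in \mathbb{R}
```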

19.
Ovarian cancer is the fifth leading cause of death among women worldwide. Characterized by complex etiology and multi-level heterogeneity, its origins are not well understood. Intense research efforts over the last decade have furthered our knowledge by identifying multiple risk factors that are associated with the disease. However, it is still unclear how genetic heterogeneity contributes to tumor formation, and more specifically, how genome-level heterogeneity acts as the key driving force of cancer evolution. Most current genomic approaches are based on ‘average molecular profiling.’ While effective for data generation, they often fail to effectively address the issue of high level heterogeneity because they mask variation that exists in a cell population. In this synthesis, we hypothesize that genome-mediated cancer evolution can effectively explain diverse factors that contribute to ovarian cancer. In particular, the key contribution of genome replacement can be observed during major transitions of ovarian cancer evolution including cellular immortalization, transformation, and malignancy. First, we briefly review major updates in the literature, and illustrate how current gene-mediated research will offer limited insight into cellular heterogeneity and ovarian cancer evolution. We next explain a holistic framework for genome-based ovarian cancer evolution and apply it to understand the genomic dynamics of a syngeneic ovarian cancer mouse model. Finally, we employ single cell assays to further test our hypothesis, discuss some predictions, and report some recent findings.

20.
Attempts to predict prognosis in cancer patients using high-dimensional genomic data such as gene expression in tumor tissue can be made difficult by the large number of features and the potential complexity of the relationship between features and the outcome. Integrating prior biological knowledge into risk prediction with such data by grouping genomic features into pathways and networks reduces the dimensionality of the problem and could improve prediction accuracy. Additionally, such knowledge-based models may be more biologically grounded and interpretable. Prediction could potentially be further improved by allowing for complex nonlinear pathway effects. The kernel machine framework has been proposed as an effective approach for modeling the nonlinear and interactive effects of genes in pathways for both censored and noncensored outcomes. When multiple pathways are under consideration, one may efficiently select informative pathways and aggregate their signals via multiple kernel learning (MKL), which has been proposed for the prediction of noncensored outcomes. In this paper, we propose MKL methods for censored survival outcomes. We derive our approach for a general survival modeling framework with a convex objective function and illustrate its application under the Cox proportional hazards and semiparametric accelerated failure time models. Numerical studies demonstrate that the proposed MKL-based prediction methods work well in finite samples and can potentially outperform models constructed assuming linear effects or ignoring the group knowledge. The methods are illustrated with an application to two cancer datasets.
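The core MKL idea referenced here is learning a convex combination of pathway-specific kernels (standard MKL form below; the coupling with the Cox or AFT objective is specific to the paper and not reproduced):

```latex
% K_p: kernel matrix of pathway p; mu: learned kernel weights. Pathways
% receiving mu_p = 0 are effectively deselected.
K_{\mu} = \sum_{p=1}^{P} \mu_p K_p, \qquad \mu_p \ge 0,\ \ \sum_{p=1}^{P}\mu_p = 1
```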
