1.
In matched case‐crossover studies, it is generally accepted that the covariates on which a case and associated controls are matched cannot exert a confounding effect on independent predictors included in the conditional logistic regression model. This is because any stratum effect is removed by the conditioning on the fixed number of sets of the case and controls in the stratum. Hence, the conditional logistic regression model is not able to detect any effects associated with the matching covariates by stratum. However, some matching covariates such as time often act as effect modifiers, and ignoring this leads to incorrect statistical estimation and prediction. Therefore, we propose three approaches to evaluate effect modification by time. The first is a parametric approach, the second is a semiparametric penalized approach, and the third is a semiparametric Bayesian approach. Our parametric approach is a two‐stage method, which uses conditional logistic regression in the first stage and then estimates polynomial regression in the second stage. Our semiparametric penalized and Bayesian approaches are one‐stage approaches developed by using regression splines. Our semiparametric one‐stage approaches allow us to not only detect the parametric relationship between the predictor and binary outcomes, but also evaluate nonparametric relationships between the predictor and time. We demonstrate the advantage of our semiparametric one‐stage approaches using both a simulation study and an epidemiological example of a 1‐4 bi‐directional case‐crossover study of childhood aseptic meningitis with drinking water turbidity. We also provide statistical inference for the semiparametric Bayesian approach using Bayes Factors. Copyright © 2016 John Wiley & Sons, Ltd.
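The claim that conditioning removes any stratum effect can be illustrated with a minimal sketch of the conditional logistic log‐likelihood for matched sets that contain exactly one case each. This is not the authors' code; the function name `conditional_loglik` and its arguments are hypothetical, and the one‐case‐per‐set restriction is an assumption made for brevity.

```python
import numpy as np

def conditional_loglik(beta, X, is_case, stratum):
    """Conditional logistic log-likelihood, assuming one case per matched set."""
    beta = np.asarray(beta, dtype=float)
    X = np.asarray(X, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    stratum = np.asarray(stratum)
    eta = X @ beta  # linear predictor for every subject
    total = 0.0
    for s in np.unique(stratum):
        in_set = stratum == s
        eta_set = eta[in_set]
        case_eta = eta_set[is_case[in_set]]
        # Assumption of this sketch: exactly one case in each matched set.
        assert case_eta.size == 1
        # Stable log-sum-exp over the set; anything constant within
        # the set (the stratum effect) cancels in this difference.
        m = eta_set.max()
        total += case_eta[0] - (m + np.log(np.exp(eta_set - m).sum()))
    return total
```

Because each set contributes the case's linear predictor minus a log‐sum over the whole set, adding any quantity that is constant within a set leaves the likelihood unchanged, which is why the main effect of a matching covariate cannot be estimated from it.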
2.
Bayesian analysis of pair‐matched case‐control studies subject to outcome misclassification
Tanja Högg John Petkau Yinshan Zhao Paul Gustafson José MA Wijnands Helen Tremlett 《Statistics in medicine》2017,36(26):4196-4213
We examine the impact of nondifferential outcome misclassification on odds ratios estimated from pair‐matched case‐control studies and propose a Bayesian model to adjust these estimates for misclassification bias. The model relies on access to a validation subgroup with confirmed outcome status for all case‐control pairs as well as prior knowledge about the positive and negative predictive value of the classification mechanism. We illustrate the model's performance on simulated data and apply it to a database study examining the presence of ten morbidities in the prodromal phase of multiple sclerosis.
3.
When an initial case‐control study is performed, data can be used in a secondary analysis to evaluate the effect of the case‐defining event on later outcomes. In this paper, we study the example in which the role of the event is changed from a response variable to a treatment of interest. If the aim is to estimate marginal effects, such as average effects in the population, the sampling scheme needs to be adjusted for. We study estimators of the average effect of the treatment in a secondary analysis of matched and unmatched case‐control data where the probability of being a case is known. For a general class of estimators, we show the components of the bias resulting from ignoring the sampling scheme and demonstrate a design‐weighted matching estimator of the average causal effect. In simulations, the finite sample properties of the design‐weighted matching estimator are studied. Using a Swedish diabetes incidence register with a matched case‐control design, we study the effect of childhood onset diabetes on the use of antidepressant medication as an adult. Copyright © 2017 John Wiley & Sons, Ltd.
4.
Mulugeta Gebregziabher Paulo Guimaraes Wendy Cozen David V. Conti 《Statistics in medicine》2010,29(9):1004-1013
In genetic association studies it is becoming increasingly imperative to have large sample sizes to identify and replicate genetic effects. To achieve these sample sizes, many research initiatives are encouraging the collaboration and combination of several existing matched and unmatched case–control studies. Thus, it is becoming more common to compare multiple sets of controls with the same case group or multiple case groups to validate or confirm a positive or negative finding. Usually, a naive approach of fitting separate models for each case–control comparison is used to make inference about disease–exposure association. However, this approach does not make use of all the observed data and hence could lead to inconsistent results. The problem is compounded when a common case group is used in each case–control comparison. An alternative to fitting separate models is to use a polytomous logistic model, but this model does not combine matched and unmatched case–control data. Thus, we propose a polytomous logistic regression approach based on a latent group indicator and a conditional likelihood to do a combined analysis of matched and unmatched case–control data. We use simulation studies to evaluate the performance of the proposed method and a case–control study of multiple myeloma and interleukin‐6 as an example. Our results indicate that the proposed method leads to a more efficient homogeneity test and a pooled estimate with smaller standard error. Copyright © 2010 John Wiley & Sons, Ltd.
5.
Computationally simple analysis of matched, outcome‐based studies of ordinal disease states
Rebecca A. Betensky Jackie Szymonifka Eudocia Q. Lee Catherine L. Nutt Tracy T. Batchelor 《Statistics in medicine》2015,34(17):2514-2527
Outcome‐based sampling is an efficient study design for rare conditions, such as glioblastoma. It is often used in conjunction with matching, for increased efficiency and to potentially avoid bias due to confounding. A study was conducted at the Massachusetts General Hospital that involved retrospective sampling of glioblastoma patients with respect to multiple‐ordered disease states, as defined by three categories of overall survival time. To analyze such studies, we posit an adjacent categories logit model and exploit its allowance for prospective analysis of a retrospectively sampled study and its advantageous removal of set and level specific nuisance parameters through conditioning on sufficient statistics. This framework allows for any sampling design and is not limited to one level of disease within each set, such as in previous publications. We describe how this ordinal conditional model can be fit using standard conditional logistic regression procedures. We consider an alternative pseudo‐likelihood approach that potentially offers robustness under partial model misspecification at the expense of slight loss of efficiency under correct model specification for small sample sizes. We apply our methods to the Massachusetts General Hospital glioblastoma study. Copyright © 2015 John Wiley & Sons, Ltd.
6.
Biomedical studies have a common interest in assessing relationships between multiple related health outcomes and high‐dimensional predictors. For example, in reproductive epidemiology, one may collect pregnancy outcomes such as length of gestation and birth weight and predictors such as single nucleotide polymorphisms in multiple candidate genes and environmental exposures. In such settings, there is a need for simple yet flexible methods for selecting true predictors of adverse health responses from a high‐dimensional set of candidate predictors. To address this problem, one may either consider linear regression models for the continuous outcomes or convert these outcomes into binary indicators of adverse responses using predefined cutoffs. The former strategy has the disadvantage of often leading to a poorly fitting model that does not predict risk well, whereas the latter approach can be very sensitive to the cutoff choice. As a simple yet flexible alternative, we propose a method for adverse subpopulation regression, which relies on a two‐component latent class model, with the dominant component corresponding to (presumed) healthy individuals and the risk of falling in the minority component characterized via a logistic regression. The logistic regression model is designed to accommodate high‐dimensional predictors, as occur in studies with a large number of gene by environment interactions, through the use of a flexible nonparametric multiple shrinkage approach. The Gibbs sampler is developed for posterior computation. We evaluate the methods with the use of simulation studies and apply these to a genetic epidemiology study of pregnancy outcomes. Copyright © 2012 John Wiley & Sons, Ltd.
7.
Chuanhua Xing Janice M. McCarthy Josée Dupuis L. Adrienne Cupples James B. Meigs Xihong Lin Andrew S. Allen 《Statistics in medicine》2016,35(23):4226-4237
The case‐control study is a common design for assessing the association between genetic exposures and a disease phenotype. Though association with a given (case‐control) phenotype is always of primary interest, there is often considerable interest in assessing relationships between genetic exposures and other (secondary) phenotypes. However, the case‐control sample represents a biased sample from the general population. As a result, if this sampling framework is not correctly taken into account, analyses estimating the effect of exposures on secondary phenotypes can be biased, leading to incorrect inference. In this paper, we address this problem and propose a general approach for estimating and testing the population effect of a genetic variant on a secondary phenotype. Our approach is based on inverse probability weighted estimating equations, where the weights depend on genotype and the secondary phenotype. We show that, though slightly less efficient than a full likelihood‐based analysis when the likelihood is correctly specified, it is substantially more robust to model misspecification, and can out‐perform likelihood‐based analysis, both in terms of validity and power, when the model is misspecified. We illustrate our approach with an application to a case‐control study extracted from the Framingham Heart Study. Copyright © 2016 John Wiley & Sons, Ltd.
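As a rough illustration of inverse probability weighting in this setting, the sketch below fits a weighted least‐squares regression of a secondary phenotype on genotype, weighting each subject by the inverse of an assumed known sampling probability for cases and controls. This is the simplest textbook form of the idea, not the authors' estimator (their weights depend on genotype and the secondary phenotype); the function name `ipw_slope` and all argument names are hypothetical.

```python
import numpy as np

def ipw_slope(genotype, phenotype, is_case, p_case_sampled, p_control_sampled):
    """Weighted least squares of phenotype on genotype.

    Weights are 1 / (sampling probability), assumed known and assumed to
    depend only on case status in this simplified sketch.
    Returns the array [intercept, slope].
    """
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    d = np.asarray(is_case, dtype=bool)
    w = np.where(d, 1.0 / p_case_sampled, 1.0 / p_control_sampled)
    X = np.column_stack([np.ones_like(g), g])
    # Solve the weighted normal equations (X' W X) b = X' W y.
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)
```

Up‐weighting the under‐sampled group (usually controls) makes the weighted sample resemble the source population, at the cost of some efficiency relative to a correctly specified likelihood.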
8.
In this paper, we propose nonlinear distance‐odds models investigating elevated odds around point sources of exposure, under a matched case‐control design where there are subtypes within cases. We consider models analogous to the polychotomous logit models and adjacent‐category logit models for categorical outcomes and extend them to the nonlinear distance‐odds context. We consider multiple point sources as well as covariate adjustments. We evaluate maximum likelihood, profile likelihood, iteratively reweighted least squares, and a hierarchical Bayesian approach using Markov chain Monte Carlo techniques under these distance‐odds models. We compare these methods using an extensive simulation study and show that with multiple parameters and a nonlinear model, Bayesian methods have advantages in terms of estimation stability, precision, and interpretation. We illustrate the methods by analyzing Medicaid claims data corresponding to the pediatric asthma population in Detroit, Michigan, from 2004 to 2006. Copyright © 2012 John Wiley & Sons, Ltd.
9.
Andrew Crossett Brian P. Kent Lambertus Klei Steven Ringquist Massimo Trucco Kathryn Roeder Bernie Devlin 《Statistics in medicine》2010,29(28):2932-2945
We propose a method to analyze family‐based samples together with unrelated cases and controls. The method builds on the idea of matched case–control analysis using conditional logistic regression (CLR). For each trio within the family, a case (the proband) and matched pseudo‐controls are constructed, based upon the transmitted and untransmitted alleles. Unrelated controls, matched by genetic ancestry, supplement the sample of pseudo‐controls; likewise unrelated cases are also paired with genetically matched controls. Within each matched stratum, the case genotype is contrasted with control/pseudo‐control genotypes via CLR, using a method we call matched‐CLR (mCLR). Eigenanalysis of numerous SNP genotypes provides a tool for mapping genetic ancestry. The result of such an analysis can be thought of as a multidimensional map, or eigenmap, in which the relative genetic similarities and differences amongst individuals are encoded. Once constructed, new individuals can be projected onto the ancestry map based on their genotypes. Successful differentiation of individuals of distinct ancestry depends on having a diverse, yet representative sample from which to construct the ancestry map. Once samples are well‐matched, mCLR yields comparable power to competing methods while ensuring excellent control over Type I error. Copyright © 2010 John Wiley & Sons, Ltd.
10.
Scan‐stratified case‐control sampling for modeling blood–brain barrier integrity in multiple sclerosis
Gina‐Maria Pomann Elizabeth M. Sweeney Daniel S. Reich Ana‐Maria Staicu Russell T. Shinohara 《Statistics in medicine》2015,34(20):2872-2880
Multiple sclerosis (MS) is an immune‐mediated neurological disease that causes morbidity and disability. In patients with MS, the accumulation of lesions in the white matter of the brain is associated with disease progression and worse clinical outcomes. Breakdown of the blood–brain barrier in newer lesions is indicative of more active disease‐related processes and is a primary outcome considered in clinical trials of treatments for MS. Such abnormalities in active MS lesions are evaluated in vivo using contrast‐enhanced structural MRI, during which patients receive an intravenous infusion of a costly magnetic contrast agent. In some instances, the contrast agents can have toxic effects. Recently, local image regression techniques have been shown to have modest performance for assessing the integrity of the blood–brain barrier based on imaging without contrast agents. These models have centered on the problem of cross‐sectional classification in which patients are imaged at a single study visit and pre‐contrast images are used to predict post‐contrast imaging. In this paper, we extend these methods to incorporate historical imaging information, and we find the proposed model to exhibit improved performance. We further develop scan‐stratified case‐control sampling techniques that reduce the computational burden of local image regression models, while respecting the low proportion of the brain that exhibits abnormal vascular permeability. Copyright © 2015 John Wiley & Sons, Ltd.
11.
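Objective: To apply the random forest algorithm to the analysis of high-dimensional methylation data from a case-control study of rheumatoid arthritis and to evaluate its performance. Methods: The example data came from the Gene Expression Omnibus (GEO) database, accession number GSE42861, and comprised 354 cases and 335 controls. We selected chromosome 9, which contains gene regions associated with rheumatoid arthritis, and included a total of 2,433 cytosine-phosphate-guanine (CpG) dinucleotide sites. Random forest was used to compute and rank variable importance scores; a stepwise random forest procedure was then applied to the ranked variables to find the subset most likely to be associated with the outcome; finally, stepwise logistic regression was performed on the reduced variable subset. Results: Stepwise random forest selected 80 important CpG sites, of which 13 were statistically significant in the logistic regression model. A logistic regression model built on these sites achieved a prediction accuracy of 88.29%. Conclusion: The random forest algorithm can greatly reduce the number of noise variables and improve test power, and it is suitable for the analysis of high-dimensional methylation data.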
12.
Many epidemiological studies use a nested case‐control (NCC) design to reduce cost while maintaining study power. Because NCC sampling is conditional on the primary outcome, routine application of logistic regression to analyze a secondary outcome will generally be biased. Recently, several methods have been proposed to obtain unbiased estimates of risk for a secondary outcome from NCC data. All current methods share two requirements: that the times of onset of the secondary outcome are known for cohort members not selected into the NCC study, and that the hazards of the two outcomes are conditionally independent given the available covariates. This last assumption will not be plausible when the individual frailty of study subjects is not captured by the measured covariates. We provide a maximum‐likelihood method that explicitly models the individual frailties and also avoids the need to have access to the full cohort data. We derive the likelihood contribution by respecting the original sampling procedure with respect to the primary outcome. We use proportional hazard models for the individual hazards, and Clayton's copula is used to model additional dependence between primary and secondary outcomes beyond that explained by the measured risk factors. We show that the proposed method is more efficient than weighted likelihood and is unbiased in the presence of shared frailty for the primary and secondary outcome. We illustrate the method with an application to a study of risk factors for diabetes in a Swedish cohort. Copyright © 2014 John Wiley & Sons, Ltd.
13.
Motivated by a matched case-control study to investigate potential risk factors for meningococcal disease amongst adolescents, we consider the analysis of matched case-control studies where disease incidence, and possibly other risk factors, vary with time of year. For the cases, the time of infection may be recorded. For controls, however, the recorded time is simply the time of data collection, which is shortly after the time of infection for the matched case, and so depends on the latter. We show that the effect of risk factors and interactions may be adjusted for the time of year effect in a standard conditional logistic regression analysis without introducing any bias. We also show that, if the time delay between data collection for cases and controls is constant, provided this delay is not very short, estimates of the time of year effect are approximately unbiased. In the case that the length of the delay varies over time, the estimate of the time of year effect is biased. We obtain an approximate expression for the degree of bias in this case.
14.
Case‐control association studies often collect extensive information on secondary phenotypes, which are quantitative or qualitative traits other than the case‐control status. Exploring secondary phenotypes can yield valuable insights into biological pathways and identify genetic variants influencing phenotypes of direct interest. All publications on secondary phenotypes have used standard statistical methods, such as least‐squares regression for quantitative traits. Because of unequal selection probabilities between cases and controls, the case‐control sample is not a random sample from the general population. As a result, standard statistical analysis of secondary phenotype data can be extremely misleading. Although one may avoid the sampling bias by analyzing cases and controls separately or by including the case‐control status as a covariate in the model, the associations between a secondary phenotype and a genetic variant in the case and control groups can be quite different from the association in the general population. In this article, we present novel statistical methods that properly reflect the case‐control sampling in the analysis of secondary phenotype data. The new methods provide unbiased estimation of genetic effects and accurate control of false‐positive rates while maximizing statistical power. We demonstrate the pitfalls of the standard methods and the advantages of the new methods both analytically and numerically. The relevant software is available at our website. Genet. Epidemiol. 2009. © 2008 Wiley‐Liss, Inc.
15.
Xiaoyan Yin Daniel Levy Christine Willinger Aram Adourian Martin G. Larson 《Statistics in medicine》2016,35(8):1315-1326
Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case–control study of 135 incident cases of myocardial infarction and 135 pair‐matched controls from the Framingham Heart Study Offspring cohort. Plasma protein markers (K = 861) were measured on the case–control pairs (N = 135), and the majority of proteins had missing expression values for a subset of samples. In the setting of many more variables than observations (K ≫ N), we explored and documented the feasibility of multiple imputation approaches along with subsequent analysis of the imputed data sets. Initially, we selected proteins with complete expression data (K = 261) and randomly masked some values as the basis of simulation to tune the imputation and analysis process. We randomly shuffled proteins into several bins, performed multiple imputation within each bin, and followed up with stepwise selection using conditional logistic regression within each bin. This process was repeated hundreds of times. We determined the optimal method of multiple imputation, number of proteins per bin, and number of random shuffles using several performance statistics. We then applied this method to 544 proteins with incomplete expression data (≤40% missing values), from which we identified a panel of seven proteins that were jointly associated with myocardial infarction. © 2015 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
16.
PLNseq: a multivariate Poisson lognormal distribution for high‐throughput matched RNA‐sequencing read count data
High‐throughput RNA‐sequencing (RNA‐seq) technology provides an attractive platform for gene expression analysis. In many experimental settings, RNA‐seq read counts are measured from matched samples or taken from the same subject under multiple treatment conditions. The induced correlation therefore should be evaluated and taken into account in deriving tests of differential expression. We proposed a novel method ‘PLNseq’, which uses a multivariate Poisson lognormal distribution to model matched read count data. The correlation is directly modeled through Gaussian random effects, and inferences are made by likelihood methods. A three‐stage numerical algorithm is developed to estimate unknown parameters and conduct differential expression analysis. Results using simulated data demonstrate that our method performs reasonably well in terms of parameter estimation, DE analysis power, and robustness. PLNseq also has better control of FDRs than the benchmarks edgeR and DESeq2 in the situations where the correlation is different across the genes but can still be accurately estimated. Furthermore, direct evaluation of correlation through PLNseq enables us to develop a new and more powerful test for DE analysis. Application to a lung cancer study is provided to illustrate the practical utilities of our method. An R package implementing the method is also publicly available. Copyright © 2015 John Wiley & Sons, Ltd.
17.
A novel case‐control subsampling approach for rapid model exploration of large clustered binary data
In many settings, an analysis goal is the identification of a factor, or set of factors associated with an event or outcome. Often, these associations are then used for inference and prediction. Unfortunately, in the big data era, the model building and exploration phases of analysis can be time‐consuming, especially if constrained by computing power (ie, a typical corporate workstation). To speed up this model development, we propose a novel subsampling scheme to enable rapid model exploration of clustered binary data using flexible yet complex model set‐ups (GLMMs with additive smoothing splines). By reframing the binary response prospective cohort study into a case‐control–type design, and using our knowledge of sampling fractions, we show one can approximate the model estimates as would be calculated from a full cohort analysis. This idea is extended to derive cluster‐specific sampling fractions and thereby incorporate cluster variation into an analysis. Importantly, we demonstrate that previously computationally prohibitive analyses can be conducted in a timely manner on a typical workstation. The approach is applied to analysing risk factors associated with adverse reactions relating to blood donation.
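The role of known sampling fractions can be sketched with the standard result for ordinary logistic regression: if cases are kept with probability f1 and non‐cases with probability f0, slopes are unchanged and the intercept is shifted by log(f1/f0). The sketch below applies that correction; it is a simplification under the assumption of a plain logistic model with fractions that do not depend on covariates, not the GLMM or cluster‐specific scheme of the paper, and the function names are hypothetical.

```python
import math

def cohort_intercept(subsample_intercept, case_fraction, control_fraction):
    """Recover the full-cohort intercept from a case-control subsample fit.

    Assumes a plain logistic model and sampling fractions that depend
    only on the outcome, so only the intercept needs correcting.
    """
    return subsample_intercept - math.log(case_fraction / control_fraction)

def cohort_risk(subsample_linear_predictor, case_fraction, control_fraction):
    """Convert a subsample linear predictor into a full-cohort risk."""
    eta = subsample_linear_predictor - math.log(case_fraction / control_fraction)
    return 1.0 / (1.0 + math.exp(-eta))
```

For example, if every case and 1% of non‐cases are retained, a fitted subsample intercept of 0 (even odds in the subsample) corresponds to cohort odds of 1 to 100.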
18.
Population-based case-control studies measuring associations between haplotypes of single nucleotide polymorphisms (SNPs) are increasingly popular, in part because haplotypes of a few "tagging" SNPs may serve as surrogates for variation in relatively large sections of the genome. Due to current technological limitations, haplotypes in cases and controls must be inferred from unphased genotypic data. Using individual-specific inferred haplotypes as covariates in standard epidemiologic analyses (e.g., conditional logistic regression) is an attractive analysis strategy, as it allows adjustment for nongenetic covariates, provides omnibus and haplotype-specific tests of association, and can estimate haplotype and haplotype × environment interaction effects. In principle, some adjustment for the uncertainty in inferred haplotypes should be made. Via simulation, we compare the performance (bias and mean squared error of haplotype and haplotype × environment interaction effect estimates) of several analytic strategies using inferred haplotypes in the context of matched case-control data. These strategies include using only the most likely haplotype assignment, the expectation substitution approach described by Stram et al. ([2003b] Hum. Hered. 55:179-190) and others, and an improper version of multiple imputation. For relatively uncomplicated haplotype structures and moderate haplotype relative risks (≤2), all methods performed comparably well (small bias with appropriately-sized confidence intervals). For larger relative risks, the most likely haplotype and multiple imputation strategies showed noticeable bias towards the null; the expectation substitution strategy still performed well. When there was more uncertainty in the inferred haplotypes, the most likely and multiple imputation strategies showed even more bias towards the null, while the expectation substitution method had slightly smaller than nominal confidence intervals for larger relative risks (≥5). An application to progesterone-receptor haplotypes and endometrial cancer further illustrates that the performance of all these methods depends on how well the observed haplotypes "tag" the unobserved causal variant.
19.
Family-based case-control studies are popularly used to study the effect of genes and gene-environment interactions in the etiology of rare complex diseases. We consider methods for the analysis of such studies under the assumption that genetic susceptibility (G) and environmental exposures (E) are distributed independently of each other within families in the source population. Conditional logistic regression, the traditional method of analysis of the data, fails to exploit the independence assumption and hence can be inefficient. Alternatively, one can estimate the multiplicative interaction between G and E more efficiently using cases only, but the required population-based G-E independence assumption is very stringent. In this article, we propose a novel conditional likelihood framework for exploiting the within-family G-E independence assumption. This approach leads to a simple and yet highly efficient method of estimating interaction and various other risk parameters of scientific interest. Moreover, we show that the same paradigm also leads to a number of alternative and even more efficient methods for analysis of family-based case-control studies when parental genotype information is available on the case-control study participants. Based on these methods, we evaluate different family-based study designs by examining their relative efficiencies to each other and their efficiencies compared to a population-based case-control design of unrelated subjects. These comparisons reveal important design implications. Extensions of the methodologies for dealing with complex family studies are also discussed.
20.
Mark A. van de Wiel Tonje G. Lien Wina Verlaat Wessel N. van Wieringen Saskia M. Wilting 《Statistics in medicine》2016,35(3):368-381
For many high‐dimensional studies, additional information on the variables, like (genomic) annotation or external p‐values, is available. In the context of binary and continuous prediction, we develop a method for adaptive group‐regularized (logistic) ridge regression, which makes structural use of such ‘co‐data’. Here, ‘groups’ refer to a partition of the variables according to the co‐data. We derive empirical Bayes estimates of group‐specific penalties, which possess several nice properties: (i) They are analytical. (ii) They adapt to the informativeness of the co‐data for the data at hand. (iii) Only one global penalty parameter requires tuning by cross‐validation. In addition, the method allows use of multiple types of co‐data at little extra computational effort. We show that the group‐specific penalties may lead to a larger distinction between ‘near‐zero’ and relatively large regression parameters, which facilitates post hoc variable selection. The method, termed GRridge, is implemented in an easy‐to‐use R‐package. It is demonstrated on two cancer genomics studies, which both concern the discrimination of precancerous cervical lesions from normal cervix tissues using methylation microarray data. For both examples, GRridge clearly improves the predictive performances of ordinary logistic ridge regression and the group lasso. In addition, we show that for the second study, the relatively good predictive performance is maintained when selecting only 42 variables. Copyright © 2015 John Wiley & Sons, Ltd.
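The idea of group‐specific penalties can be sketched for the simpler linear case, where ridge regression with one penalty per group of variables has a closed form. This is only an illustration under stated assumptions: the penalties are supplied by the caller rather than estimated by empirical Bayes, the model is linear rather than logistic, and the function `group_ridge` is hypothetical and is not the GRridge package interface.

```python
import numpy as np

def group_ridge(X, y, groups, penalties):
    """Linear ridge regression with a separate penalty for each group.

    groups[j] is the integer group index of variable j, and
    penalties[g] is the (assumed given) penalty for group g.
    Solves (X'X + diag(lambda_{groups[j]})) beta = X'y.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    lam = np.asarray(penalties, dtype=float)[np.asarray(groups)]
    return np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)
```

Variables in a heavily penalized group are shrunk more strongly toward zero than those in a lightly penalized group, which is the mechanism behind the larger separation between near‐zero and large coefficients described in the abstract.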