Similar articles (20 results retrieved)
1.
ROC curve analysis for biomarkers based on pooled assessments (cited by 2: 0 self-citations, 2 by others)
Interleukin-6 is a biomarker of inflammation that has been suggested to have potential discriminatory ability for myocardial infarction. Because of its high assay cost, evaluating this marker is expensive. To reduce this cost, we propose pooling the specimens. In this paper we examine the efficiency of ROC curve analysis, specifically the estimation of the area under the ROC curve, when dealing with pooled data. We study the effect of pooling when there are only a fixed number of individuals available for testing and pooling is carried out to save on the number of assays. Alternatively, we examine how many pooled assays of size g are necessary to provide essentially the same information as N individual assays. We measure loss of information by means of the change in root mean square error of the estimate of the area under the ROC curve and study the extent of this loss via a simulation study.
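A minimal sketch of the idea, assuming a binormal biomarker and random pools formed as averages of g specimens (an illustration only, not the authors' exact estimator): the pool mean preserves the group means, the individual-level variance is recovered as g times the pool variance, and the moments are plugged into the binormal AUC formula. The toy simulation compares the root mean square error of the AUC estimate from N individual assays with that from N/g pooled assays.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def auc_from_pools(cases, controls, g):
    """Binormal AUC estimate from pooled assays of size g (pools = averages of g specimens)."""
    pools1 = cases.reshape(-1, g).mean(axis=1)
    pools0 = controls.reshape(-1, g).mean(axis=1)
    mu1, mu0 = pools1.mean(), pools0.mean()
    v1, v0 = g * pools1.var(ddof=1), g * pools0.var(ddof=1)  # recover individual-level variances
    return norm.cdf((mu1 - mu0) / np.sqrt(v1 + v0))

N, g, reps = 120, 3, 2000                      # individuals per group, pool size, simulation runs
true_auc = norm.cdf(1.0 / np.sqrt(2.0))        # cases ~ N(1,1), controls ~ N(0,1)
est_pooled, est_indiv = [], []
for _ in range(reps):
    cases, controls = rng.normal(1.0, 1.0, N), rng.normal(0.0, 1.0, N)
    est_pooled.append(auc_from_pools(cases, controls, g))
    est_indiv.append(auc_from_pools(cases, controls, 1))   # g = 1: ordinary individual assays
rmse = lambda est: np.sqrt(np.mean((np.asarray(est) - true_auc) ** 2))
print(f"true AUC {true_auc:.3f}  RMSE individual {rmse(est_indiv):.4f}  pooled g={g} {rmse(est_pooled):.4f}")
```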

2.
Evaluating biomarkers in epidemiological studies can be expensive and time consuming. Many investigators use techniques such as random sampling or pooling biospecimens in order to cut costs and save time on experiments. Commonly, analyses based on pooled data are strongly restricted by distributional assumptions that are challenging to validate because the biospecimens are pooled. Random sampling provides data that can be easily analyzed. However, random sampling methods are not optimal cost-efficient designs for estimating means. We propose and examine a cost-efficient hybrid design that involves taking a sample of both pooled and unpooled data in an optimal proportion in order to efficiently estimate the unknown parameters of the biomarker distribution. In addition, we find that this design can be used to estimate and account for different types of measurement and pooling error, without the need to collect validation data or repeated measurements. We show an example where application of the hybrid design leads to minimization of a given loss function based on variances of the estimators of the unknown parameters. Monte Carlo simulation and biomarker data from a study on coronary heart disease are used to demonstrate the proposed methodology. Published in 2010 by John Wiley & Sons, Ltd.
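The hedged sketch below illustrates only the core of such a hybrid design under a normality assumption (the paper's optimal pooled/unpooled allocation and its measurement and pooling error components are not reproduced): unpooled assays and pooled assays of size g contribute to a single likelihood, with the pooled averages carrying standard deviation sigma/sqrt(g).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
mu, sigma, g = 2.0, 1.5, 4                                  # hypothetical biomarker parameters
individual = rng.normal(mu, sigma, 50)                      # unpooled assays
pools = rng.normal(mu, sigma, (30, g)).mean(axis=1)         # pooled assays (averages of g specimens)

def negloglik(theta):
    m, log_s = theta
    s = np.exp(log_s)                                       # keep sigma positive
    ll = norm.logpdf(individual, m, s).sum()
    ll += norm.logpdf(pools, m, s / np.sqrt(g)).sum()       # a pool mean has sd sigma / sqrt(g)
    return -ll

fit = minimize(negloglik, x0=[0.0, 0.0], method="Nelder-Mead")
print(f"mu_hat = {fit.x[0]:.3f}, sigma_hat = {np.exp(fit.x[1]):.3f}")
```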

3.
Xu J, Yang Y, Ying Z, Ott J. Statistics in Medicine 2008, 27(28): 5801-5815
Pooling DNA samples of multiple individuals has been advocated as a method to reduce genotyping costs. Under such a scheme, only the allele counts at each locus, not the haplotype information, are observed. We develop a systematic way of handling such data by formulating the problem in terms of contingency tables, where pooled allele counts are expressed as the margins and the haplotype counts correspond to the unobserved cell counts. We show that the cell frequencies can be uniquely determined from the marginal frequencies under the usual Hardy-Weinberg equilibrium (HWE) assumption and that the maximum likelihood estimates of haplotype frequencies are consistent and asymptotically normal as the number of pools increases. The limiting covariance matrix is shown to be closely related to the extended hypergeometric distribution. Our results are used to derive Wald-type tests for the linkage disequilibrium (LD) coefficient using pooled data. We find that pooling is not efficient for testing weak LD despite its efficiency in estimating haplotype frequencies. We also show by simulations that the proposed LD tests are robust to slight deviations from HWE and to minor genotyping error. Applications to two real angiotensinogen gene data sets are also provided.

4.
As a cost-efficient data collection mechanism, the process of assaying pooled biospecimens is becoming increasingly common in epidemiological research; for example, pooling has been proposed for the purpose of evaluating the diagnostic efficacy of biological markers (biomarkers). To this end, several authors have proposed techniques that allow for the analysis of continuous pooled biomarker assessments. Regrettably, most of these techniques proceed under restrictive assumptions, are unable to account for the effects of measurement error, and fail to control for confounding variables. These limitations are understandably attributable to the complex structure that is inherent to measurements taken on pooled specimens. Consequently, in order to provide practitioners with the tools necessary to accurately and efficiently analyze pooled biomarker assessments, herein, a general Monte Carlo maximum likelihood-based procedure is presented. The proposed approach allows for the regression analysis of pooled data under practically all parametric models and can be used to directly account for the effects of measurement error. Through simulation, it is shown that the proposed approach can accurately and efficiently estimate all unknown parameters and is more computationally efficient than existing techniques. This new methodology is further illustrated using monocyte chemotactic protein-1 data collected by the Collaborative Perinatal Project in an effort to assess the relationship between this chemokine and the risk of miscarriage. Copyright © 2017 John Wiley & Sons, Ltd.

5.
Next-generation sequencing is widely used to study complex diseases because of its ability to identify both common and rare variants without prior single nucleotide polymorphism (SNP) information. Pooled sequencing of implicated target regions can lower costs and allow more samples to be analyzed, thus improving statistical power for disease-associated variant detection. Several methods for disease association tests of pooled data and for optimal pooling designs have been developed under certain assumptions of the pooling process, for example, equal/unequal contributions to the pool, sequencing depth variation, and error rate. However, these simplified assumptions may not portray the many factors affecting pooled sequencing data quality, such as PCR amplification during target capture and sequencing, reference allele preferential bias, and others. As a result, the properties of the observed data may differ substantially from those expected under the simplified assumptions. Here, we use real datasets from targeted sequencing of pooled samples, together with microarray SNP genotypes of the same subjects, to identify and quantify factors (biases and errors) affecting the observed sequencing data. Through simulations, we find that these factors have a significant impact on the accuracy of allele frequency estimation and the power of association tests. Furthermore, we develop a workflow protocol to incorporate these factors in data analysis to reduce the potential biases and errors in pooled sequencing data and to gain better estimation of allele frequencies. The workflow, Psafe, is available at http://bioinformatics.med.yale.edu/group/.
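As a hedged toy example of one such factor, the sketch below simulates reference-allele preferential bias in a pooled sample: alternate-allele reads are assumed to be captured at a relative rate r, which distorts the naive read-fraction estimate, and inverting that simple bias model recovers the pool allele frequency. The bias model, r, and all names are illustrative assumptions, not part of the Psafe workflow.

```python
import numpy as np

rng = np.random.default_rng(2)

def observed_ref_fraction(f, r):
    """Expected reference-read fraction when alternate reads are captured at relative rate r (< 1)."""
    return f / (f + r * (1.0 - f))

def corrected_frequency(q, r):
    """Invert the toy bias model to recover the reference allele frequency in the pool."""
    return r * q / (1.0 - q + r * q)

f_true, r, depth = 0.30, 0.85, 500            # true pool frequency, assumed bias factor, read depth
reads_ref = rng.binomial(depth, observed_ref_fraction(f_true, r))
q = reads_ref / depth
print(f"naive estimate {q:.3f}  bias-corrected {corrected_frequency(q, r):.3f}  truth {f_true:.3f}")
```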

6.
The focus of this paper is dietary intervention trials. We explore the statistical issues involved when the response variable, intake of a food or nutrient, is based on self-report data that are subject to inherent measurement error. There has been little work on handling error in this context. A particular feature of self-reported dietary intake data is that the error may be differential by intervention group. Measurement error methods require information on the nature of the errors in the self-report data. We assume that there is a calibration sub-study in which unbiased biomarker data are available. We outline methods for handling measurement error in this setting and use theory and simulations to investigate how self-report and biomarker data may be combined to estimate the intervention effect. Methods are illustrated using data from the Trial of Nonpharmacologic Intervention in the Elderly, in which the intervention was a sodium-lowering diet and the response was sodium intake. Simulations are used to investigate the methods under differential error, differing reliability of self-reports relative to biomarkers, and different proportions of individuals in the calibration sub-study. When the reliability of self-report measurements is comparable with that of the biomarker, it is advantageous to use the self-report data in addition to the biomarker to estimate the intervention effect. If, however, the reliability of the self-report data is low compared with that of the biomarker, then there is little to be gained by using the self-report data. Our findings have important implications for the design of dietary intervention trials. © 2016 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
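A hedged toy simulation of the central problem, differential error: self-reported intake is assumed to be attenuated and additionally under-reported in the intervention arm, while the biomarker is unbiased. The numbers and variable names are made up for illustration; this is not the paper's estimator, which combines the two data sources through a calibration sub-study.

```python
import numpy as np

rng = np.random.default_rng(3)
n, delta = 2000, -20.0                                   # subjects, true intervention effect
group = rng.integers(0, 2, n)                            # 0 = control, 1 = intervention
true_intake = 150.0 + delta * group + rng.normal(0, 25, n)

biomarker = true_intake + rng.normal(0, 15, n)           # unbiased, nondifferential error
# self-report: attenuated slope plus extra under-reporting in the intervention arm (differential error)
self_report = 40.0 + 0.6 * true_intake - 15.0 * group + rng.normal(0, 20, n)

effect = lambda y: y[group == 1].mean() - y[group == 0].mean()
print(f"truth {delta:.1f}  biomarker estimate {effect(biomarker):.1f}  self-report estimate {effect(self_report):.1f}")
```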

7.
In many biological studies, biomarkers are measured with errors. In addition, study samples are often divided and measured in separate batches, and data collected from different experiments are used in a single analysis. Generally speaking, the structure of the measurement error is unknown and is not easy to ascertain. While the conditions under which the measurements are taken vary from one batch/experiment to another, they are often held steady within each batch/experiment. Thus, the measurement error can be considered batch/experiment specific, that is, fixed within each batch/experiment, which results in a rank-preserving property within each batch/experiment. Under this condition, we study robust statistical methods for analyzing the association between an outcome variable and predictors measured with error, and for evaluating the diagnostic or predictive accuracy of these biomarkers. Our methods require no assumptions on the structure and distribution of the measurement error; such assumptions are often unrealistic. Compared with existing methods that are predicated on normality and an additive structure of measurement errors, our methods still yield valid inferences under departure from these assumptions. The proposed methods are easy to implement using off-the-shelf software. Simulation studies show that under various measurement error structures, the performance of the proposed methods is satisfactory even for a fairly small sample size, whereas existing methods under misspecified structures and a naive approach exhibit substantial bias. Our methods are illustrated using a biomarker validation case–control study for colorectal neoplasms. Copyright © 2009 John Wiley & Sons, Ltd.

8.
Pooling of biological specimens has been utilised as a cost-efficient sampling strategy, but cost is not the only limiting factor in biomarker development and evaluation. We examine different biospecimen sampling strategies for assessing exposures that cannot be measured below a detection threshold (DT). The paper compares the use of pooled samples with a randomly selected sample from a cohort in order to evaluate the efficiency of parameter estimates.
The proposed approach shows that a pooling design is more efficient than a random sampling strategy under certain circumstances. Moreover, because pooling minimises the amount of information lost below the DT, the use of pooled data is preferable (in the context of parametric estimation) to using all available individual measurements for certain values of the DT. We propose a combined design, which uses both pooled and unpooled biospecimens, in order to capture the strengths of the different sampling strategies and overcome instrument limitations (i.e. the DT). Several Monte Carlo simulations and an example based on actual biomarker data illustrate the results of the article.
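A minimal sketch of why pooling loses less information below a detection threshold, assuming a lognormal biomarker (an illustration only; the paper's parametric estimators are not reproduced): a pooled assay reports the average of g specimens, which is far less likely to fall below the DT than an individual specimen.

```python
import numpy as np

rng = np.random.default_rng(4)
n, g, dt = 600, 3, 0.5                          # specimens, pool size, detection threshold
x = rng.lognormal(0.0, 1.0, n)                  # hypothetical skewed biomarker levels
pools = x.reshape(-1, g).mean(axis=1)           # pooled assays measure the average of g specimens
print(f"individual assays below DT: {np.mean(x < dt):.1%}")
print(f"pooled assays (g={g}) below DT: {np.mean(pools < dt):.1%}")
```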

9.
It is common in the analysis of aggregate data in epidemiology that the variances of the aggregate observations are available. The analysis of such data leads to a measurement error situation, where the known variances of the measurement errors vary between the observations. Assuming a multivariate normal distribution for the 'true' observations and normal distributions for the measurement errors, we derive a simple EM algorithm for obtaining maximum likelihood estimates of the parameters of the multivariate normal distribution. The results also facilitate the estimation of regression parameters between the variables as well as the 'true' values of the observations. The approach is applied to re-estimate recent results of the WHO MONICA Project on cardiovascular disease and its risk factors, where the original estimation of the regression coefficients did not adjust for the regression attenuation caused by the measurement errors.
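The EM algorithm itself is not reproduced here; as a hedged illustration of the same correction target, the sketch below shows the regression attenuation caused by observation-specific measurement error with known variances, together with a simple method-of-moments fix that subtracts the average known error variance. All quantities are simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta = 400, 0.8
x_true = rng.normal(0, 1, n)
err_var = rng.uniform(0.2, 0.6, n)                       # known, observation-specific error variances
x_obs = x_true + rng.normal(0, np.sqrt(err_var))         # observed with heterogeneous error
y = 1.0 + beta * x_true + rng.normal(0, 0.5, n)

beta_naive = np.cov(x_obs, y)[0, 1] / x_obs.var(ddof=1)  # attenuated slope
reliability = (x_obs.var(ddof=1) - err_var.mean()) / x_obs.var(ddof=1)
print(f"naive {beta_naive:.3f}  attenuation-corrected {beta_naive / reliability:.3f}  truth {beta}")
```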

10.
Genome-wide association studies may be necessary to identify genes underlying certain complex diseases. Because such studies can be extremely expensive, DNA pooling has been introduced, as it may greatly reduce the genotyping burden. Parallel to DNA pooling developments, the importance of haplotypes in genetic studies has been amply demonstrated in the literature. However, DNA pooling of a large number of samples may lose haplotype information among tightly linked genetic markers. Here, we examine the cost-effectiveness of DNA pooling in the estimation of haplotype frequencies from population data. When the maximum likelihood estimates of haplotype frequencies are obtained from pooled samples, we compare the overall cost of the study, including both DNA collection and marker genotyping, between the individual genotyping strategy and the DNA pooling strategy. We find that DNA pooling of two individuals can be more cost-effective than individual genotyping, especially when a large number of haplotype systems are studied.

11.
Several groups have developed methods for estimating allele frequencies in DNA pools as a fast and cheap way to detect allelic association between genetic markers and disease. To obtain accurate estimates of allele frequencies, a correction factor k for the degree to which measurement of allele-specific products is biased is generally applied. The factor k is usually obtained as the ratio of the two allele-specific signals in samples from heterozygous individuals, a step that can significantly impair throughput and increase cost. We have systematically investigated the properties of k through the use of empirical and simulated data. We show that, for the dye terminator primer extension genotyping method we have applied, the correction factor k is substantially influenced by the dye terminators incorporated, but also by the terminal 3' base of the extension primer. We also show that the variation in k is large enough to result in unacceptable error rates if association studies are conducted without regard to k. We show that the impact of ignoring k can be neutralized by applying a correction factor k(max) that can be easily derived, but at the potential cost of an increase in type I error. Finally, based upon observed distributions for k, we derive a method allowing the estimation of the probability that pooled data reflect significant differences in allele frequencies between the subjects comprising the pools. By controlling the error rates in the absence of knowledge of the appropriate SNP-specific correction factors, each approach enhances the performance of DNA pooling, while considerably streamlining the method by reducing time and cost.
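A minimal numeric sketch of the correction described above (illustrative signal values only): the allele-A frequency in a pool is estimated as A / (A + k·B), with k taken as the mean A/B signal ratio in known heterozygotes, whose true allele ratio is 1:1.

```python
import numpy as np

rng = np.random.default_rng(6)

def freq_estimate(signal_a, signal_b, k=1.0):
    """Allele-A frequency from pooled allele-specific signals, corrected by factor k."""
    return signal_a / (signal_a + k * signal_b)

# k estimated from heterozygous individuals, whose true allele ratio is 1:1
het_a = rng.normal(1.3, 0.1, 20)          # allele A systematically over-detected in this toy assay
het_b = rng.normal(1.0, 0.1, 20)
k = np.mean(het_a / het_b)

true_freq = 0.25
pool_a = rng.normal(1.3 * true_freq, 0.02)          # pooled signals follow the same bias
pool_b = rng.normal(1.0 * (1.0 - true_freq), 0.02)
print(f"k = {k:.2f}  uncorrected {freq_estimate(pool_a, pool_b):.3f}  "
      f"corrected {freq_estimate(pool_a, pool_b, k):.3f}  truth {true_freq}")
```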

12.
Misclassification of exposure is a well-recognized inherent limitation of epidemiologic studies of disease and the environment. For many agents of interest, exposures take place over time and in multiple locations; accurately estimating the relevant exposures for an individual participant in epidemiologic studies is often daunting, particularly within the limits set by feasibility, participant burden, and cost. Researchers have taken steps to deal with the consequences of measurement error by limiting the degree of error through a study's design, estimating the degree of error using a nested validation study, and adjusting for measurement error in statistical analyses. In this paper, we address measurement error in observational studies of air pollution and health. Because measurement error may have substantial implications for interpreting epidemiologic studies on air pollution, particularly time-series analyses, we developed a systematic conceptual formulation of the problem of measurement error in epidemiologic studies of air pollution and then considered the consequences within this formulation. When possible, we used available relevant data to make simple estimates of measurement error effects. This paper provides an overview of measurement errors in linear regression, distinguishing two extremes of a continuum, Berkson versus classical errors, and the univariate from the multivariate predictor case. We then propose a conceptual framework for the evaluation of measurement errors in the log-linear regression used for time-series studies of particulate air pollution and mortality, and identify three main components of error. We present new simple analyses of data on exposures to particulate matter less than 10 µm in aerodynamic diameter from the Particle Total Exposure Assessment Methodology Study. Finally, we summarize open questions regarding measurement error and suggest the kind of additional data necessary to address them.
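A short simulation of the Berkson/classical distinction drawn above (standard textbook behaviour, sketched with made-up values): classical error added to the measured exposure attenuates the regression slope, whereas Berkson error, where the true exposure scatters around the assigned value, leaves the slope unbiased on average.

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta = 5000, 2.0

# classical error: W = X + U; regressing Y on W attenuates the slope
x = rng.normal(0, 1, n)
y = beta * x + rng.normal(0, 1, n)
w_classical = x + rng.normal(0, 1, n)

# Berkson error: true X scatters around the assigned value W; the slope on W stays unbiased
w_assigned = rng.normal(0, 1, n)
x_berkson = w_assigned + rng.normal(0, 1, n)
y_berkson = beta * x_berkson + rng.normal(0, 1, n)

slope = lambda w, yy: np.cov(w, yy)[0, 1] / w.var(ddof=1)
print(f"truth {beta}  classical {slope(w_classical, y):.2f}  Berkson {slope(w_assigned, y_berkson):.2f}")
```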

13.
Pooling-based strategies that combine samples from multiple participants for laboratory assays have been proposed for epidemiologic investigations of biomarkers to address issues of cost, efficiency, detection limits, and minimal available sample volume. A modification of the standard logistic regression model has been previously described to allow use with pooled data; however, this model makes assumptions regarding the exposure distribution and logit-linearity of risk (i.e., a constant odds ratio) that can be violated in practice. We were motivated by a nested case-control study of miscarriage and inflammatory factors with highly skewed distributions to develop a more flexible model for the analysis of pooled data. Using characteristics of the gamma distribution and the relation between models of binary outcome conditional on exposure and of exposure conditional on outcome, we use a modified logistic regression to accommodate nonlinearity arising from unequal shape parameters in gamma distributed exposure for cases and controls. Using simulations, we compare our approach with existing methods for logistic regression for pooled data considering: (1) constant and dose-dependent effects; (2) gamma and log-normal distributed exposure; (3) effect size; and (4) the proportions of biospecimens pooled. We show that our approach allows estimation of odds ratios that vary with exposure level, yet has minimal loss of efficiency compared with existing approaches when exposure effects are dose-invariant. Our model performed similarly to a maximum likelihood estimation approach in terms of bias and efficiency, and provides an easily implemented approach for estimation with pooled biomarker data when effects may not be constant across exposure. Copyright © 2012 John Wiley & Sons, Ltd.

14.
While family purchase of health insurance may benefit insurance markets by pooling individual risk into family groups, the correlation across illness types in families could exacerbate adverse selection. We analyze the impact of family pooling on risk for health insurers to inform policy about family-level insurance plans. Using data on 8,927,918 enrollees in fee-for-service commercial health plans in the 2013 Truven MarketScan database, we compare the distribution of annual individual health spending across four pooling scenarios: (1) “individual”, where there is no pooling into families; (2) “real families”, where costs are pooled within families; (3) “random groups”, where costs are pooled within randomly generated small groups that mimic families in group size; and (4) “the Sims”, where costs are pooled within random small groups which match families in demographics and size. These four simulations allow us to identify the separate contributions of group size, group composition, and family affinity in family risk pooling. Variation in individual spending under family pooling is very similar to that within “simulated families” and to that within random groups, and substantially lower than when there is no family pooling and individuals choose independently (standard deviation $12,526 vs. $11,919, $12,521 and $17,890 respectively). Within-family correlations in health status and utilization do not “undo” the gains from family pooling of risks. Family pooling can mitigate selection and improve the functioning of health insurance markets.
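A hedged toy sketch of the pooling comparison, using hypothetical lognormal spending and random three-person "families" rather than any MarketScan figures: assigning each member the family-average cost shrinks the spread of individual spending, which is the risk-pooling effect the study quantifies.

```python
import numpy as np

rng = np.random.default_rng(8)
n, family_size = 90_000, 3
spend = rng.lognormal(7.5, 1.6, n)                     # hypothetical skewed annual spending
family_mean = spend.reshape(-1, family_size).mean(axis=1)
pooled = np.repeat(family_mean, family_size)           # each member carries the family's average cost
print(f"SD of individual spending {spend.std():,.0f}   SD under family pooling {pooled.std():,.0f}")
```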

15.
In population surveys of seroprevalence, it may not be most efficient to test every sample individually. The laboratory and statistical issues encountered when individual samples are first pooled into groups before laboratory analysis were discussed recently in relation to the seroprevalence of human immunodeficiency virus. In particular, point and confidence interval estimates for seroprevalence from pooled sera were derived. A potential problem with these confidence intervals is that they may contain negative values. This problem is most likely to occur in low-prevalence populations, where pooling is most efficient. An alternative method of obtaining confidence intervals that cannot contain negative values is proposed, and an example is provided.
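One standard way to keep such limits non-negative, shown as a hedged sketch rather than the paper's specific proposal: form a Clopper-Pearson interval for the pool-positivity probability P and transform it through the pooled-testing relation p = 1 - (1 - P)^(1/g), so that both limits stay within [0, 1].

```python
import numpy as np
from scipy.stats import beta

def prevalence_ci_from_pools(positive_pools, n_pools, g, level=0.95):
    """Prevalence estimate and CI from pools of size g, each tested positive/negative.
    A Clopper-Pearson interval for the pool-positivity probability is transformed through
    p = 1 - (1 - P)**(1/g); the resulting limits cannot be negative."""
    a = (1.0 - level) / 2.0
    p_pool = positive_pools / n_pools
    lo = beta.ppf(a, positive_pools, n_pools - positive_pools + 1) if positive_pools > 0 else 0.0
    hi = beta.ppf(1 - a, positive_pools + 1, n_pools - positive_pools) if positive_pools < n_pools else 1.0
    to_prev = lambda P: 1.0 - (1.0 - P) ** (1.0 / g)
    return to_prev(p_pool), to_prev(lo), to_prev(hi)

est, lo, hi = prevalence_ci_from_pools(positive_pools=4, n_pools=100, g=10)
print(f"prevalence {est:.4f}  95% CI ({lo:.4f}, {hi:.4f})")
```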

16.
The area (A) under the receiver operating characteristic curve is commonly used to quantify the ability of a biomarker to correctly classify individuals into two populations. However, many markers are subject to measurement error, which must be accounted for to prevent understating their effectiveness. In this paper, we develop a new confidence interval procedure for A which is adjusted for measurement error using either external or internal replicated measurements. Based on the observation that A is a function of normal means and variances, we develop the procedure by recovering the variance estimates needed from confidence limits for normal means and variances. Simulation results show that the procedure performs better than previous ones based on the delta method in terms of coverage percentage, balance of tail errors, and interval width. Two examples are presented. Copyright © 2010 John Wiley & Sons, Ltd.
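The paper's confidence-limit construction is not reproduced here; as a hedged illustration of the adjustment it builds on, the sketch below uses replicate measurements (rows = subjects, columns = replicates) to estimate the measurement-error variance and removes it before applying the binormal formula for A. All names and values are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)

def auc_adjusted_for_error(case_reps, control_reps):
    """Point estimate of the binormal AUC with measurement-error variance removed,
    where the error variance is estimated from within-subject replicate variation."""
    def moments(reps):
        subj_means = reps.mean(axis=1)
        err_var_of_mean = reps.var(axis=1, ddof=1).mean() / reps.shape[1]
        return subj_means.mean(), subj_means.var(ddof=1) - err_var_of_mean
    m1, v1 = moments(case_reps)
    m0, v0 = moments(control_reps)
    return norm.cdf((m1 - m0) / np.sqrt(v1 + v0))

cases = rng.normal(1.0, 1.0, (80, 1)) + rng.normal(0, 0.6, (80, 3))      # 3 replicates per case
controls = rng.normal(0.0, 1.0, (80, 1)) + rng.normal(0, 0.6, (80, 3))   # 3 replicates per control
print(f"error-adjusted AUC {auc_adjusted_for_error(cases, controls):.3f}  truth {norm.cdf(1/np.sqrt(2)):.3f}")
```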

17.
Costs can hamper the evaluation of the effectiveness of new biomarkers. Analysis of smaller numbers of pooled specimens has been shown to be a useful cost-cutting technique. The Youden index (J), a function of sensitivity (q) and specificity (p), is a commonly used measure of overall diagnostic effectiveness. More importantly, J is the maximum vertical distance or difference between the ROC curve and the diagonal or chance line; it occurs at the cut-point that optimizes the biomarker's differentiating ability when equal weight is given to sensitivity and specificity. Using the additive property of the gamma and normal distributions, we present a method to estimate the Youden index and the optimal cut-point, and extend its applications to pooled samples. We study the effect of pooling when only a fixed number of individuals are available for testing, and pooling is carried out to save on the number of assays. We measure loss of information by the change in root mean squared error of the estimates of the optimal cut-point and the Youden index, and we study the extent of this loss via a simulation study. In conclusion, pooling can result in a substantial cost reduction while preserving the effectiveness of estimators, especially when the pool size is not very large.
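A hedged sketch of estimating J and the optimal cut-point from pooled assays, assuming normally distributed individual values (the gamma case and the paper's exact estimators are not reproduced): group means are kept, individual-level variances are recovered as g times the pool variances, and J is maximized numerically over the cut-point.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def youden_from_pools(case_pools, control_pools, g):
    """Youden index J and optimal cut-point from pooled assays of size g under normality."""
    mu1, mu0 = case_pools.mean(), control_pools.mean()
    s1 = np.sqrt(g * case_pools.var(ddof=1))            # recover individual-level SDs
    s0 = np.sqrt(g * control_pools.var(ddof=1))
    neg_j = lambda c: -(norm.cdf((c - mu0) / s0) - norm.cdf((c - mu1) / s1))
    res = minimize_scalar(neg_j, bounds=(mu0 - 3 * s0, mu1 + 3 * s1), method="bounded")
    return -res.fun, res.x

rng = np.random.default_rng(10)
g = 2
case_pools = rng.normal(2.0, 1.0, (60, g)).mean(axis=1)       # diseased: N(2, 1) individuals
control_pools = rng.normal(0.0, 1.0, (60, g)).mean(axis=1)    # healthy:  N(0, 1) individuals
J, cut = youden_from_pools(case_pools, control_pools, g)
print(f"Youden index {J:.3f}  optimal cut-point {cut:.3f}")   # truth: J ≈ 0.683 at cut-point 1.0
```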

18.
Nutritional epidemiology relies largely on self-reported measures of dietary intake, errors in which bias estimated diet–disease associations. Self-reported measurements come from questionnaires and food records. Unbiased biomarkers are scarce; however, surrogate biomarkers, which are correlated with intake but not unbiased, can also be useful. It is important to quantify and correct for the effects of measurement error on diet–disease associations. Challenges arise because there is no gold standard, and errors in self-reported measurements are correlated with true intake and with each other. We describe an extended model for error in questionnaire, food record, and surrogate biomarker measurements. The focus is on estimating the degree of bias in estimated diet–disease associations due to measurement error. In particular, we propose using sensitivity analyses to assess the impact of changes in the values of model parameters that are usually assumed fixed. The methods are motivated by and applied to measures of fruit and vegetable intake from questionnaires, 7-day diet diaries, and a surrogate biomarker (plasma vitamin C) from over 25,000 participants in the Norfolk cohort of the European Prospective Investigation into Cancer and Nutrition. Our results show that the estimated effects of error in self-reported measurements are highly sensitive to model assumptions, resulting in anything from a large attenuation to a small amplification in the diet–disease association. Commonly made assumptions could result in a large overcorrection for the effects of measurement error. Increased understanding of relationships between potential surrogate biomarkers and true dietary intake is essential for obtaining good estimates of the effects of measurement error in self-reported measurements on observed diet–disease associations. Copyright © 2013 John Wiley & Sons, Ltd.

19.
Exposure assessment is often subject to measurement error. We consider here the analysis of studies aimed at reducing exposure to potential health hazards, in which exposure is the outcome variable. In these studies, the intervention effect may be estimated using either biomarkers or self-report data, but it is not common to combine these measures of exposure. Bias in self-reported measures of exposure is a well-known fact; however, only a few studies attempt to correct for it. Recently, Keogh et al. addressed this problem, presenting a model for measurement error in this setting and investigating how self-report and biomarker data can be combined. Keogh et al. find the maximum likelihood estimate for the intervention effect in their model via direct numerical maximization of the likelihood. Here, we exploit an alternative presentation of the model that leads to a closed-form expression for the MLE and also for its variance when the number of biomarker replicates is the same for all subjects in the substudy. The variance formula enables efficient design of such intervention studies. When the number of biomarker replicates is not constant, our approach can be used along with the EM algorithm to quickly compute the MLE. We compare the MLE to Buonaccorsi's method (Buonaccorsi, 1996) and find that they have similar efficiency when most subjects have biomarker data, but that the MLE has clear advantages when only a small fraction of subjects have biomarker data. This conclusion extends the findings of Keogh et al. (2016) and has practical importance for efficiently designing studies.

20.
There is growing interest in pooling specimens across subjects in epidemiologic studies, especially those involving biomarkers. This paper is concerned with regression analysis of epidemiologic data where a binary exposure is subject to pooling and the pooled measurement is dichotomized to indicate either that no subjects in the pool are exposed or that some are exposed, without revealing further information about the exposed subjects in the latter case. The pooling process may be stratified on disease status (a binary outcome) and possibly other variables, but is otherwise assumed random. We propose methods for estimating parameters in a prospective logistic regression model and illustrate these with data from a population-based case-control study of colorectal cancer. Simulation results show that the proposed methods perform reasonably well in realistic settings and that pooling can lead to sizable gains in cost efficiency. We make recommendations with regard to the choice of design for pooled epidemiologic studies.
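A minimal moment-based sketch of what a dichotomized pool reveals (not the paper's prospective logistic regression with covariates): under random pooling within case and control strata, the probability that a pool of size g contains no exposed member is (1 - p)^g, so the stratum-specific exposure prevalence, and hence a crude odds ratio, can be recovered from the fraction of all-negative pools. All values below are simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)

def prevalence_from_pools(any_exposed, g):
    """Exposure prevalence from pools dichotomized as 'any member exposed',
    assuming random pooling: P(no exposed member) = (1 - p)**g."""
    frac_negative = 1.0 - np.mean(any_exposed)
    return 1.0 - frac_negative ** (1.0 / g)

g, n_pools = 4, 300
p_cases, p_controls = 0.35, 0.20                       # true exposure prevalences by disease status
case_pools = rng.binomial(g, p_cases, n_pools) > 0     # pool flagged if any member is exposed
control_pools = rng.binomial(g, p_controls, n_pools) > 0
p1, p0 = prevalence_from_pools(case_pools, g), prevalence_from_pools(control_pools, g)
print(f"cases {p1:.3f}  controls {p0:.3f}  crude OR {(p1/(1-p1))/(p0/(1-p0)):.2f}")   # true OR ≈ 2.15
```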
