Similar literature
Found 20 similar articles (search time: 31 ms)
1.
Medical prognostic models can be designed to predict the future course or outcome of disease progression after diagnosis or treatment. When building a prediction model from right-censored time-to-event data, advocates of the full model may rule out existing variable selection methods because of their estimation bias and selection bias. If our objective is to optimize predictive performance by some criterion, we can often achieve a reduced model that has a little bias with low variance, but whose overall performance is enhanced. To accomplish this goal, we propose a new variable selection approach that combines Stepwise Tuning in the Maximum Concordance Index (STMC) with Forward Nested Subset Selection (FNSS) in two stages. In the first stage, the proposed variable selection is employed to identify the best subset of risk factors optimized with the concordance index using inner cross-validation for optimism correction in the outer loop of cross-validation, yielding potentially different final models for each of the folds. We then feed the intermediate results of the prior stage into another selection method in the second stage to resolve the overfitting problem and to select a final model from the variation of predictors in the selected models. Two case studies on survival data sets of different sizes, as well as a simulation study, demonstrate that the proposed approach is able to select an improved and reduced average model under a sufficient sample and event size compared with other selection methods such as stepwise selection using the likelihood ratio test, Akaike Information Criterion (AIC), and lasso. Finally, we achieve better final models in each dataset than their full models by most measures. The results of the selection methods and of the final models are assessed systematically through validation of independent performance.
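The abstract above optimizes the concordance index on right-censored data. As a minimal sketch of that criterion only (not the STMC or FNSS procedures, whose details the abstract does not give), the following computes Harrell's concordance index in plain Python. The function name `concordance_index` and the convention that a higher score means higher risk are assumptions made for illustration.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C for right-censored data (illustrative sketch).

    times:       observed follow-up time for each subject
    events:      1 if the event was observed, 0 if the subject was censored
    risk_scores: model output; higher is assumed to mean higher risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event got the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

In a nested cross-validation scheme such as the one described, a function like this would be evaluated on the held-out fold for each candidate subset; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.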

2.
3.
While Raman spectroscopy provides a powerful tool for noninvasive and real time diagnostics of biological samples, its translation to the clinical setting has been impeded by the lack of robustness of spectroscopic calibration models and the size and cumbersome nature of conventional laboratory Raman systems. Linear multivariate calibration models employing full spectrum analysis are often misled by spurious correlations, such as system drift and covariations among constituents. In addition, such calibration schemes are prone to overfitting, especially in the presence of external interferences that may create nonlinearities in the spectra-concentration relationship. To address both of these issues we incorporate residue error plot-based wavelength selection and nonlinear support vector regression (SVR). Wavelength selection is used to eliminate uninformative regions of the spectrum, while SVR is used to model the curved effects such as those created by tissue turbidity and temperature fluctuations. Using glucose detection in tissue phantoms as a representative example, we show that even a substantial reduction in the number of wavelengths analyzed using SVR leads to calibration models with prediction accuracy equivalent to that of linear full spectrum analysis. Further, with clinical datasets obtained from human subject studies, we also demonstrate the prospective applicability of the selected wavelength subsets without sacrificing prediction accuracy, which has extensive implications for calibration maintenance and transfer. Additionally, such wavelength selection could substantially reduce the collection time of serial Raman acquisition systems. Given the reduced footprint of serial Raman systems in relation to conventional dispersive Raman spectrometers, we anticipate that the incorporation of wavelength selection in such hardware designs will enhance the possibility of miniaturized clinical systems for disease diagnosis in the near future.

4.
5.
Information processing in the brain is believed to require coordinated activity across many neurons. With the recent development of techniques for simultaneously recording the spiking activity of large numbers of individual neurons, the search for complex multicell firing patterns that could help reveal this neural code has become possible. Here we develop a new approach for analyzing sequential firing patterns involving an arbitrary number of neurons based on relative firing order. Specifically, we develop a combinatorial method for quantifying the degree of matching between a "reference sequence" of N distinct "letters" (representing a particular target order of firing by N cells) and an arbitrarily long "word" composed of any subset of those letters including repeats (representing the relative time order of spikes in an arbitrary firing pattern). The method involves computing the probability that a random permutation of the word's letters would by chance alone match the reference sequence as well as or better than the actual word does, assuming all permutations were equally likely. Lower probabilities thus indicate better matching. The overall degree and statistical significance of sequence matching across a heterogeneous set of words (such as those produced during the course of an experiment) can be computed from the corresponding set of probabilities. This approach can reduce the sample size problem associated with analyzing complex firing patterns. The approach is general and thus applicable to other types of neural data beyond multiple spike trains, such as EEG events or imaging signals from multiple locations. We have recently applied this method to quantify memory traces of sequential experience in the rodent hippocampus during slow wave sleep.

6.
OBJECTIVE: Statistical models, such as linear or logistic regression or survival analysis, are frequently used as a means to answer scientific questions in psychosomatic research. Many who use these techniques, however, apparently fail to appreciate fully the problem of overfitting, i.e., capitalizing on the idiosyncrasies of the sample at hand. Overfitted models will fail to replicate in future samples, thus creating considerable uncertainty about the scientific merit of the finding. The present article is a nontechnical discussion of the concept of overfitting and is intended to be accessible to readers with varying levels of statistical expertise. The notion of overfitting is presented in terms of asking too much from the available data. Given a certain number of observations in a data set, there is an upper limit to the complexity of the model that can be derived with any acceptable degree of uncertainty. Complexity arises as a function of the number of degrees of freedom expended (the number of predictors including complex terms such as interactions and nonlinear terms) against the same data set during any stage of the data analysis. Theoretical and empirical evidence--with a special focus on the results of computer simulation studies--is presented to demonstrate the practical consequences of overfitting with respect to scientific inference. Three common practices--automated variable selection, pretesting of candidate predictors, and dichotomization of continuous variables--are shown to pose a considerable risk for spurious findings in models. The dilemma between overfitting and exploring candidate confounders is also discussed. Alternative means of guarding against overfitting are discussed, including variable aggregation and the fixing of coefficients a priori. Techniques that account and correct for complexity, including shrinkage and penalization, also are introduced.

7.
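Building tumor prediction and classification models from tumor gene expression data with the methods and techniques of information science is of great significance for the study of tumor gene expression patterns and for the diagnosis and identification of tumors. This study proposes a method that mines classification rules directly from tumor gene expression data to build a tumor prediction classifier. The method first draws an experimental sample set and identifies the classification features that mark tumor samples and normal tissue samples respectively; from these it generates classification rules that can predict the class of a sample. For each sample of unknown class, the classification rule with the highest confidence is selected as the prediction. The experimental data come from the Broad Institute prostate cancer gene expression data. The results show that the prediction accuracy of the method is above 90% and that it also yields a large number of structurally transparent classification rules, indicating that the method is feasible and effective.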

8.
The robust sib-pair method introduced by Haseman & Elston (1972) is one of the most widely circulated allele-sharing methods for linkage analysis. The procedure evaluates linkage by significance testing of a regression coefficient and, hence, a standard t-test has traditionally been applied despite known violations of the statistical assumptions underlying the test. We present a permutation based reference distribution for the estimate of the regression coefficient that is motivated by genetic principles rather than by standard regression testing procedures. The permutation test approximates Mendelian co-segregation under the null hypothesis of no linkage, making it a very natural approach. Theory and simulations show that the conventional t-test approximates the permutation test quite well, even when dependent sib pairs are used for analysis. These results thus indirectly address concerns over the t-test. To illustrate the permutation test using real data we applied the procedure to two lipoprotein systems that have been well characterized.

9.
Sigma-Delta analogue-to-digital converters allow acquiring the full dynamic range of biomedical signals at the electrodes, resulting in less complex hardware and increased measurement robustness. However, the increased data size per sample (typically 24 bits) demands the transmission of extremely large volumes of data across the isolation barrier, thus increasing power consumption on the patient side. This problem is accentuated when a large number of channels is used, as in current 128–256-electrode biopotential acquisition systems, which usually opt for an optic fibre link to the computer. An analogous problem occurs for simpler low-power acquisition platforms that transmit data through a wireless link to a computing platform. In this paper, a low-complexity encoding method is presented to decrease sample data size without losses, while preserving the full DC-coupled signal. The method achieved an average compression ratio of 2.3, evaluated over an ECG and EMG signal bank acquired with equipment based on Sigma-Delta converters. It demands a very low processing load: a C language implementation is presented that resulted in an average execution time of 110 clock cycles on an 8-bit microcontroller.
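The abstract does not describe the encoder itself, so the following is only a generic illustration of how a low-complexity lossless scheme for slowly varying 24-bit samples can work: store the difference between consecutive samples, map signed differences to unsigned integers (zigzag), and emit them as variable-length bytes. All function names are hypothetical, and this is an assumed stand-in, not the method of the paper.

```python
def zigzag(n):
    # Map signed to unsigned so small magnitudes stay small:
    # 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(u):
    # Inverse of zigzag().
    return (u >> 1) if (u & 1) == 0 else -((u + 1) >> 1)


def encode(samples):
    """Delta + zigzag + 7-bit variable-length encoding (lossless)."""
    out = bytearray()
    prev = 0
    for s in samples:
        u = zigzag(s - prev)
        prev = s
        # Emit 7 bits per byte; the high bit marks "more bytes follow".
        while u >= 0x80:
            out.append((u & 0x7F) | 0x80)
            u >>= 7
        out.append(u)
    return bytes(out)


def decode(data):
    """Exact inverse of encode()."""
    samples = []
    prev = 0
    u = 0
    shift = 0
    for byte in data:
        u |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += unzigzag(u)
            samples.append(prev)
            u = 0
            shift = 0
    return samples
```

Because the first sample is stored in full and every later value is recovered by summing differences, the DC level of the signal is preserved exactly. How much the data shrinks depends entirely on how smooth the signal is; this sketch makes no claim about reproducing the reported compression ratio.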

10.
AIM: The aim of this study was to develop predictive equations based on anthropometric data to estimate stature in people 60 years and older in Latin America. DESIGN: Population-based cross-sectional study in three Latin American cities. SUBJECTS: Sample sizes were n=1657 (Sao Paulo, Brazil), n=1004 (Santiago, Chile) and n=995 (Mexico City, Mexico). METHOD: The prediction equations were fitted by stepwise linear regression analysis. For each country and sex, samples were randomly split into two sub-samples (training and validation sub-samples) using the cross-validation method. RESULTS: Stepwise regression analysis in the training sample revealed that only knee-height and age had a significant effect on the prediction of height. The values of the shrinkage statistic were below 0.1, indicating the reliability of the prediction equations. The regression equations had standard errors of estimate of 3.3 cm (Chile), 3.6 cm (Brazil) and 4.0 cm (Mexico) for women, and 3.7 cm (Mexico and Chile) and 3.8 cm (Brazil) for men. CONCLUSIONS: Sex-specific stature prediction equations based on knee-height and age were obtained from large representative samples from selected cities of Latin America.

11.
New technology for large-scale genotyping has created new challenges for statistical analysis. Correcting for multiple comparison without discarding true positive results and extending methods to triad studies are among the important problems facing statisticians. We present a one-sample permutation test for testing transmission disequilibrium hypotheses in triad studies, and show how this test can be used for multiple single nucleotide polymorphism (SNP) testing. The resulting multiple comparison procedure is shown in the case of the transmission disequilibrium test to control the familywise error. Furthermore, this procedure can handle multiple possible modes of risk inheritance per SNP. The resulting permutational procedure is shown through simulation of SNP data to be more powerful than the Bonferroni procedure when the SNPs are in linkage disequilibrium. Moreover, permutations implicitly avoid any multiple comparison correction penalties when the SNP has a rare allele. The method is illustrated by analyzing a large candidate gene study of neural tube defects and an independent study of oral clefts, where the smallest adjusted p-values using the permutation procedure are approximately half those of the Bonferroni procedure. We conclude that permutation tests are more powerful for identifying disease-associated SNPs in candidate gene studies and are useful for analysis of triad studies.
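One common way to build a one-sample permutation test with familywise error control is the single-step max-statistic approach: flip the sign of each family's whole score vector, recompute every SNP statistic, and compare each observed statistic with the permutation distribution of the maximum. The sketch below assumes this construction and a hypothetical input `d[i][j]` (a signed transmission score for family i at SNP j); the abstract does not confirm that the authors' procedure takes exactly this form.

```python
from itertools import product


def maxt_adjusted_pvalues(d):
    """Single-step max-statistic adjusted p-values by exact sign flipping.

    d[i][j] is an assumed signed score for family i at SNP j, e.g.
    transmitted minus untransmitted copies of one allele.
    """
    n_fam = len(d)
    n_snp = len(d[0])
    observed = [abs(sum(d[i][j] for i in range(n_fam))) for j in range(n_snp)]
    exceed = [0] * n_snp
    total = 0
    # Enumerate every sign assignment; feasible only for small n_fam.
    for signs in product((1, -1), repeat=n_fam):
        # Flip a family's entire row so that correlation between SNPs
        # (linkage disequilibrium) is preserved under the null.
        max_stat = max(
            abs(sum(signs[i] * d[i][j] for i in range(n_fam)))
            for j in range(n_snp)
        )
        total += 1
        for j in range(n_snp):
            if max_stat >= observed[j]:
                exceed[j] += 1
    return [e / total for e in exceed]
```

Full enumeration grows as 2 to the power of the number of families, so for realistic sample sizes the loop would be replaced by a random sample of sign assignments. Because correlated SNPs tend to reach their extremes together, the maximum is less extreme than Bonferroni assumes, which is the intuition behind the power gain under linkage disequilibrium reported above.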

12.
Linkage analysis in a multivariate or longitudinal context presents both statistical and computational challenges. The permutation test can be used to avoid some of the statistical challenges, but it substantially adds to the computational burden. Utilizing the distributional dependencies between the IBD proportion (defined as the proportion of alleles at a given locus that are identical by descent (IBD) for a pair of relatives) and the permutation test, we report a new method of efficient permutation. In summary, the distribution of the IBD proportion for a sample of relatives at locus x is estimated as a weighted mixture of IBD proportions drawn from a pool of ‘representative’ distributions observed at other loci. This weighting scheme is then used to sample from the distribution of the permutation tests at the representative loci to obtain an empirical P-value at locus x (which is asymptotically distributed as the permutation test at locus x). This weighted mixture approach greatly reduces the number of permutation tests required for genome-wide scanning, making it suitable for use in multivariate and other computationally intensive linkage analyses. In addition, because the distribution of the IBD proportion is a property of the genotypic data for a given sample and is independent of the phenotypic data, the weighting scheme can be applied to any phenotype (or combination of phenotypes) collected from that sample. We demonstrate the validity of this approach through simulation. Edited by David Allison.

13.
Additional information about risk genes or risk pathways for diseases can be extracted from genome-wide association studies through analyses of groups of markers. The most commonly employed approaches involve combining individual marker data by adding the test statistics, or summing the logarithms of their P-values, and then using permutation testing to derive empirical P-values that allow for the statistical dependence of single-marker tests arising from linkage disequilibrium (LD). In the present study, we use simulated data to show that these approaches fail to reflect the structure of the sampling error, and the effect of this is to give undue weight to correlated markers. We show that the results obtained are internally inconsistent in the presence of strong LD, and are externally inconsistent with the results derived from multi-locus analysis. We also show that the results obtained from regression and multivariate Hotelling T² (H-T2) testing, but not those obtained from permutations, are consistent with the theoretically expected distributions, and that the H-T2 test has greater power to detect gene-wide associations in real datasets. Finally, we show that while the results from permutation testing can be made to approximate those from regression and multivariate Hotelling T² testing through aggressive LD pruning of markers, this comes at the cost of loss of information. We conclude that when conducting multi-locus analyses of sets of single-nucleotide polymorphisms, regression or multivariate Hotelling T² testing, which give equivalent results, are preferable to the other more commonly applied approaches.
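"Summing the logarithms of their P-values" is the statistic of Fisher's combination method. As a small illustration (not the analysis of the paper), the sketch below computes that statistic and its reference P-value under the assumption that the markers are independent, which is exactly the assumption LD violates. The function name is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Fisher's statistic -2 * sum(ln p) and its P-value under independence.

    Under independence the statistic follows a chi-square distribution
    with 2k degrees of freedom, whose survival function has a closed
    form because the degrees of freedom are even.
    """
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    half = stat / 2.0
    # P(chi2_{2k} > stat) = exp(-half) * sum_{i=0}^{k-1} half**i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total
```

For example, a single marker with P = 0.01 combined with an exact duplicate of itself gives a combined P-value of about 0.001 under the independence reference, even though the duplicate carries no new information. This is why a reference distribution that accounts for dependence, by permutation or by a multivariate test, is needed when markers are in LD.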

14.
Artificial neural networks (ANNs) are very popular as classification or regression mechanisms in medical decision support systems despite the fact that they are unstable predictors. This instability means that small changes in the training data used to build the model (i.e. train the ANN) may result in very different models. A central implication of this is that different sets of training data may produce models with very different generalisation accuracies. In this paper, we show in detail how this can happen in a prediction system for use in in-vitro fertilisation. We argue that claims for the generalisation performance of ANNs used in such a scenario should only be based on k-fold cross-validation tests. We also show how the accuracy of such a predictor can be improved by aggregating the output of several predictors.
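The two ideas in this abstract, k-fold cross-validation and aggregating the output of several predictors, can be sketched independently of any particular network. The helpers below are hypothetical names written for illustration; the aggregation shown is simple majority voting, which is one possible choice and not necessarily the one used in the paper.

```python
import random
from collections import Counter


def kfold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def majority_vote(predictions):
    """Combine class labels from several predictors, sample by sample.

    predictions[m][s] is the label predictor m assigns to sample s.
    """
    n_samples = len(predictions[0])
    return [
        Counter(p[s] for p in predictions).most_common(1)[0][0]
        for s in range(n_samples)
    ]
```

Each sample is used for testing exactly once, so the accuracy averaged over the k test folds does not depend on one lucky or unlucky split. With an odd number of binary predictors, majority voting never produces a tie.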

15.
BACKGROUND: Accurate gender determination is crucial in many scientific disciplines, but especially so in prenatal diagnosis of X-linked diseases and forensic investigations. Today, molecular techniques, especially typing for a length variation in the X-Y homologous amelogenin gene (AMELX and AMELY), are used for gender assignation. This amelogenin test is an integral part of most PCR multiplex kits that are used for DNA profiling, but in 1998 there was a report of two normal males being typed as female with this test. Subsequently, a small number of amelogenin negative (or AMELY null) males have been reported in various populations but little data are available characterising these deletions. AIMS: The study aims to determine the size of the deletion in five AMELY null males by typing DNA samples for markers surrounding this gender-determining locus. The possible relationships among the AMELY null samples are examined through analysis of their deletion size and associated Y-chromosome microsatellite haplotypes. We also attempt to determine the frequency of AMELY negative males in Australia. SUBJECTS AND METHODS: DNA samples from five AMELY null males, from different geographical regions, were made available for this study. The samples were typed for eight sites, all located on the short arm of the Y chromosome, using PCR and gel electrophoresis. Eleven Y-chromosome specific microsatellites were also typed on each sample in order to generate haplotypes for phylogenetic analysis. A questionnaire was sent to all Australian forensic centres requesting information on the frequency of AMELY negative males observed in their laboratories. RESULTS: Two different sized deletions were seen in the five AMELY null samples. One deletion (in two samples) has a size of between 304 and 731 kbp, whereas the other (in three samples) ranges between 712 and 1001 kbp. Y-microsatellite haplotypes indicate that the smaller deletion is probably identical in the two samples, but this is not the case with the larger deletion. AMELY negative males are rare in Australia, with an overall frequency of 0.02%. CONCLUSION: Comparisons of both deletion size and haplotypes with published data suggest that most AMELY nulls are the result of independent evolutionary events, even in those populations where the frequency is relatively high. Although AMELY null males are extremely rare in most populations, typing an additional gender-determining locus should be considered in forensic investigations where the reference sample is of unknown gender.

16.
It has been shown that parametric analysis of linkage disequilibrium conditional on linkage using an overly deterministic model can be optimal for family-based association analysis. However, if one applies this strategy carelessly, there is a risk of false inference. We analyse properties of such likelihood ratio tests when the assumed disease mode of inheritance is inaccurate. Under some conditions, problems result if one is not careful to consider what null hypothesis is being tested. We show that: (a) tests for which the null hypothesis assumes the absence of both linkage and association are independent of the true mode of inheritance; (b) likelihood ratio tests assuming either linkage or association under the null hypothesis may depend on the true mode of inheritance, leading to inconsistent parameter estimates, in particular under extremely deterministic models; (c) this problem cannot be eliminated by increasing sample size or adding population controls--as sample size increases, the chance of false positive inference goes to 100%; (d) this issue can lead to systematic false positive inference of association in regions of linkage. This is important because highly deterministic models are often used intentionally in model-based analyses because they can have more power than the true model, and are implicit in many model-free analysis methods.

17.
CDKN2A codes for two oncosuppressors by alternative splicing of two first exons: p16INK4a and p14ARF. Germline mutations are found in about 40% of melanoma‐prone families, and most of them are missense mutations mainly affecting p16INK4a. A growing number of p16INK4a variants of uncertain significance (VUS) are being identified but, unless their pathogenic role can be demonstrated, they cannot be used for identification of carriers at risk. Predicting the effect of these VUS by either a “standard” in silico approach, or functional tests alone, is rather difficult. Here, we report a protocol for the assessment of any p16INK4a VUS, which combines experimental and computational tools in an integrated approach. We analyzed p16INK4a VUS from melanoma patients as well as variants derived through permutation of conserved p16INK4a amino acids. Variants were expressed in a p16INK4a‐null cell line (U2‐OS) and tested for their ability to block proliferation. In parallel, these VUS underwent in silico prediction analysis and molecular dynamics simulations. Evaluation of in silico and functional data disclosed a high agreement for 15/16 missense mutations, suggesting that this approach could represent a pilot study for the definition of a protocol applicable to VUS in general, involved in other diseases, as well.

18.
Predicting gene function from patterns of annotation (cited 7 times: 0 self-citations, 7 by others)
The Gene Ontology (GO) Consortium has produced a controlled vocabulary for annotation of gene function that is used in many organism-specific gene annotation databases. This allows the prediction of gene function based on patterns of annotation. For example, if annotations for two attributes tend to occur together in a database, then a gene holding one attribute is likely to hold the other as well. We modeled the relationships among GO attributes with decision trees and Bayesian networks, using the annotations in the Saccharomyces Genome Database (SGD) and in FlyBase as training data. We tested the models using cross-validation, and we manually assessed 100 gene-attribute associations that were predicted by the models but that were not present in the SGD or FlyBase databases. Of the 100 manually assessed associations, 41 were judged to be true, and another 42 were judged to be plausible.

19.
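Much about the neuropathological mechanisms of neuropsychiatric disorders remains unknown, and objective clinical diagnostic criteria are largely lacking, so their diagnosis and prognosis face great challenges. With the rapid development of neuroimaging techniques, neuroimaging data have been widely used to explore the neuropathological mechanisms of neuropsychiatric disorders and to discover potential biomarkers. In contrast to traditional univariate methods, which support group-level analysis, machine learning models use neuroimaging data to achieve individualized, intelligent prediction of neuropsychiatric disorders. This article reviews recent...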

20.
Experimental approaches for identifying T-cell epitopes are time-consuming, costly and not applicable to large-scale screening. Computer modeling methods can help to minimize the number of experiments required, enable a systematic scanning for candidate major histocompatibility complex (MHC) binding peptides and thus speed up vaccine development. We developed a prediction system based on a novel data representation of peptide/MHC interaction and support vector machines (SVM) for prediction of peptides that promiscuously bind to multiple Human Leukocyte Antigen (HLA, human MHC) alleles belonging to a HLA supertype. Ten-fold cross-validation results showed that the overall performance of SVM models is improved in comparison to our previously published methods based on hidden Markov models (HMM) and artificial neural networks (ANN), also confirmed by blind testing. At specificity 0.90, sensitivity values of SVM models were 0.90 and 0.92 for the HLA-A2 and -A3 datasets respectively. The average areas under the receiver operating curve (AROC) of SVM models in blind testing were 0.89 and 0.92 for the HLA-A2 and -A3 datasets. AROC values of the HLA-A2 and -A3 SVM models were 0.94 and 0.95, validated using a full overlapping study of 9-mer peptides from human papillomavirus type 16 E6 and E7 proteins. In addition, a large-scale experimental dataset has been used to validate HLA-A2 and -A3 SVM models. The SVM prediction models were integrated into a web-based computational system MULTIPRED1, accessible at antigen.i2r.a-star.edu.sg/multipred1/.
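The two performance measures quoted in this abstract, sensitivity at a fixed specificity and area under the receiver operating curve, can both be computed directly from the scores a classifier assigns to known binders and non-binders. The following is a minimal sketch with hypothetical function names; it assumes that a higher score means "predicted binder" and does not depend on how the scores were produced.

```python
import math


def auc(pos_scores, neg_scores):
    """Area under the ROC curve as the probability that a random positive
    outscores a random negative (ties count as one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def sensitivity_at_specificity(pos_scores, neg_scores, specificity):
    """Sensitivity when the threshold is set so that at least the requested
    fraction of negatives scores at or below it."""
    ordered = sorted(neg_scores)
    # Smallest rank that covers the requested fraction of negatives;
    # the small epsilon guards against floating-point round-up.
    idx = max(0, math.ceil(specificity * len(ordered) - 1e-9) - 1)
    threshold = ordered[idx]
    hits = sum(1 for p in pos_scores if p > threshold)
    return hits / len(pos_scores)
```

Fixing specificity first and then reading off sensitivity, as the abstract does at 0.90, makes models comparable at the same false-positive rate, while the area under the curve summarizes ranking quality over all thresholds.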
