Similar literature
20 similar documents found (search time: 31 ms)
1.
This report puts into perspective a series of exploratory statistical analyses carried out on the major genotoxicity data bases. While large compilations of data, even though computerized, suffer from their own size and are quite intractable to scientific reflection and judgement, the multivariate data analysis methods used by us are specifically designed for reorganising the information in a rational way and highlighting the underlying regularities of the data. The analyses reported here refer to the following data bases: the International Program for the Evaluation of Short-Term Tests for Carcinogens, the International Program on Chemical Safety Collaborative Study on In Vitro Assays, the Gene-Tox data base, and a subset of the U.S. National Toxicology Program data. Although the various data bases consisted of different sets of chemicals and had different underlying rationales, a number of invariant associations among short-term test performances were highlighted. The overall evidence indicated that the traditional classification of assays (according to the criteria of genetic end-point and phylogenetic position of the assays) was in contrast with the actual, operational similarities among assay performances, in that the experimental responses of the tests to the large variety of chemicals under consideration pointed to an alternative classification scheme. This consisted of three major classes: 1) a class comprising the in vivo assays; 2) a class grouping together many of the most widely used in vitro assays (Salmonella, chromosomal aberrations, and sister chromatid exchanges in Chinese hamster ovary cells, the various mutation tests in mammalian cell systems, etc.); 3) a second in vitro assay class (with Syrian hamster embryo cell transformation, Saccharomyces cerevisiae XV185-14C, B. subtilis rec-, Escherichia coli pol A). Such classes had clearly differentiated features with respect to carcinogenicity prediction. 
The implications of these findings for the current debate on mutagenicity testing are discussed.

2.
Recent studies have shown that under specific conditions such as high sample sizes and Hardy–Weinberg equilibrium, bone marrow donor registry data can be used to describe HLA molecular variation across a specific geographic area, thus providing excellent data sets to infer human migration history. The province of Quebec is known to have experienced a complex history of settlement, characterized by multiple migrations and demographic changes. We thus analysed the data of more than 13 000 unrelated individuals acting as volunteer bone marrow donors who were molecularly typed for HLA‐A, B and DRB1 polymorphisms in the Héma‐Quebec registry. HLA allelic and haplotypic frequencies were estimated and compared among regions. The results indicate that, despite an overall low genetic diversity in Quebec, genetic variation is correlated with geography, compatible with isolation‐by‐distance across the province. However, some localities also harbour contrasting genetic profiles, namely a highly diversified genetic pool in the two main urban centres (Montréal and Laval) and a more pronounced genetic divergence of two specific regions characterized by a peculiar peopling history (Saguenay‐Lac‐St‐Jean and Gaspésie‐Îles‐De‐La‐Madeleine). In agreement with other independent molecular markers, the observations based on HLA data thus account for the main demographic mechanisms that shaped the genetic structure of the present day Quebecer population. In addition, the detailed analysis of the Héma‐Quebec registry provides key genetic information on which an efficient bone marrow transplantation recruitment strategy can be based.

3.
Global analyses of RNA expression levels are useful for classifying genes and overall phenotypes. Often these classification problems are linked, and one wants to find "marker genes" that are differentially expressed in particular sets of "conditions." We have developed a method that simultaneously clusters genes and conditions, finding distinctive "checkerboard" patterns in matrices of gene expression data, if they exist. In a cancer context, these checkerboards correspond to genes that are markedly up- or downregulated in patients with particular types of tumors. Our method, spectral biclustering, is based on the observation that checkerboard structures in matrices of expression data can be found in eigenvectors corresponding to characteristic expression patterns across genes or conditions. In addition, these eigenvectors can be readily identified by commonly used linear algebra approaches, in particular the singular value decomposition (SVD), coupled with closely integrated normalization steps. We present a number of variants of the approach, depending on whether the normalization over genes and conditions is done independently or in a coupled fashion. We then apply spectral biclustering to a selection of publicly available cancer expression data sets, and examine the degree to which the approach is able to identify checkerboard structures. Furthermore, we compare the performance of our biclustering methods against a number of reasonable benchmarks (e.g., direct application of SVD or normalized cuts to raw data).
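The idea of reading a checkerboard off the singular vectors of a normalized matrix can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes a strictly positive expression matrix, uses one plausible reading of the "independent" normalization (dividing each entry by the square root of its row sum times its column sum), and replaces a library SVD with a plain power iteration. The function name `spectral_bicluster` and the toy matrix are hypothetical.

```python
import math


def spectral_bicluster(a, iters=200):
    """Split rows and columns of a positive matrix into two groups each.

    Assumption: the matrix has a checkerboard component, i.e. it is not rank 1.
    """
    n_rows, n_cols = len(a), len(a[0])
    row_sum = [sum(row) for row in a]
    col_sum = [sum(a[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # Independent rescaling of rows and columns: A_n = R^(-1/2) A C^(-1/2).
    an = [[a[i][j] / math.sqrt(row_sum[i] * col_sum[j]) for j in range(n_cols)]
          for i in range(n_rows)]

    # The leading right singular vector of A_n is trivial: sqrt(column sums).
    total = math.sqrt(sum(col_sum))
    v1 = [math.sqrt(c) / total for c in col_sum]

    # Power iteration on A_n^T A_n, deflating the trivial vector each round,
    # converges to the second right singular vector.
    v = [float(j + 1) for j in range(n_cols)]
    for _ in range(iters):
        dot = sum(x * y for x, y in zip(v, v1))
        v = [x - dot * y for x, y in zip(v, v1)]
        u = [sum(an[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(an[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]

    # Matching left singular vector (up to scale); signs define the two groups.
    u = [sum(an[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return [x > 0 for x in u], [x > 0 for x in v]


# Toy checkerboard: genes 0-1 are high in conditions 0-1, genes 2-3 in 2-3.
toy = [[10, 10, 1, 1],
       [10, 10, 1, 1],
       [1, 1, 10, 10],
       [1, 1, 10, 10]]
gene_groups, condition_groups = spectral_bicluster(toy)
```

On real data one would typically cluster several singular vectors rather than threshold a single one by sign; the sign split is used here only to keep the sketch short.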

4.
Independent component analysis (ICA) has been widely applied to the analysis of microarray datasets. Although it has been pointed out that, after ICA transformation, different independent components (ICs) are of different biological significance, the IC selection problem is still far from fully explored. In this paper, we propose a genetic algorithm (GA) based ensemble independent component selection (EICS) system. In this system, GA is applied to select a set of optimal IC subsets, which are then used to build diverse and accurate base classifiers. Finally, all base classifiers are combined with the majority vote rule. To show the validity of the proposed method, we apply it to classify three DNA microarray data sets involving various human normal and tumor tissue samples. The experimental results show that our ensemble method obtains stable and satisfying classification results when compared with several existing methods.
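The final combination step, the majority vote rule, can be illustrated with a short sketch. The abstract does not say how ties are handled; here, as an assumption, a tie goes to the label that appears first among the votes. The function name and the labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists (all the same length) sample by sample."""
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest count; on a tie,
        # the label encountered first among the votes wins (an assumption).
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Three hypothetical base classifiers voting on three samples.
base_predictions = [
    ["tumor", "normal", "normal"],
    ["tumor", "tumor", "normal"],
    ["normal", "normal", "tumor"],
]
ensemble_labels = majority_vote(base_predictions)
```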

5.
We studied the efficiency of multilayer perceptron networks to classify eight different medical data sets with typical problems connected to their strongly non-uniform distributions between output classes and relatively small sizes of training sets. We studied especially the possibility mentioned in the literature of balancing a class distribution by artificially extending small classes of a data set. The results obtained supported our hypothesis that principally this does somewhat improve the classification accuracy of small classes, but is also inclined to impair the classification accuracy of majority classes.
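One simple way to "artificially extend" small classes is random oversampling with replacement. The abstract does not specify which extension scheme was used, so the sketch below is an assumption chosen for illustration; the function name `oversample` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def oversample(samples, labels, seed=0):
    """Duplicate random members of small classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    target = max(len(members) for members in by_class.values())
    out_samples, out_labels = [], []
    for y, members in by_class.items():
        # Draw extra copies with replacement from the class's own members.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for x in members + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Toy data set: five majority-class cases and one minority-class case.
xs, ys = oversample([1, 2, 3, 4, 5, 6], ["a", "a", "a", "a", "a", "b"])
```

Only the training set should be extended in this way; duplicating cases before splitting off a test set would leak copies into the evaluation.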

6.
To provide a resource for assessing continental ancestry in a wide variety of genetic studies, we identified, validated, and characterized a set of 128 ancestry informative markers (AIMs). The markers were chosen for informativeness, genome-wide distribution, and genotype reproducibility on two platforms (TaqMan assays and Illumina arrays). We analyzed genotyping data from 825 subjects with diverse ancestry, including European, East Asian, Amerindian, African, South Asian, Mexican, and Puerto Rican. A comprehensive set of 128 AIMs and subsets as small as 24 AIMs are shown to be useful tools for ascertaining the origin of subjects from particular continents, and to correct for population stratification in admixed population sample sets. Our findings provide general guidelines for the application of specific AIM subsets as a resource for wide application. We conclude that investigators can use TaqMan assays for the selected AIMs as a simple and cost efficient tool to control for differences in continental ancestry when conducting association studies in ethnically diverse populations.

7.
Genome-wide association studies of clinically defined cases against controls have transformed our understanding of the genetic causes of many diseases. However, there are limitations to the simple clinical definitions used in these studies, and GWAS analyses are beginning to explore more refined phenotypes in subgroups of the existing data sets. These analyses are often performed ad hoc without considering the power requirements to justify such analyses. Here we derive expressions for the relative power of such subgroup analyses and determine the genotypic relative risks (GRRs) required to achieve equivalent power to a full analysis for relevant scenarios. We show that only modest increases in GRRs may be required to offset the reduction in power from analysing fewer cases, implying that analyses of more genetically homogenous case subgroups may have the potential to identify further associations. We find that, for lower genotypic relative risks in the full sample, subgroup analyses of more homogeneous cases have relatively more power than for higher index genotypic relative risks and that this effect is stronger for rare as opposed to common variants. As GWA studies are likely to have now identified the majority of SNPs with stronger effects, these results strongly advocate a renewed effort to identify phenotypically homogeneous disease groups, in which power to detect genetic variants with small effects will be greater. These results suggest that analysis of case subsets could be a powerful strategy to uncover some of the hidden heritability for common complex disorders, particularly in identifying rarer variants of modest effect.

8.
We consider the problem of fitting mathematical models for bacterial growth and decline to experimental data. Using models which represent the phases of the growth and decline cycle in a piecewise manner, we describe how least-squares fitting can lead to potentially misleading parameter estimates. We show how these difficulties can be overcome by extending a data set to include hypothetical observations (dummy data points) which reflect biological beliefs, and the resulting stabilization of parameter estimates is analysed mathematically. The techniques are illustrated using real and simulated data sets.

9.
Numerous studies have shown there is consistent evidence implicating genetic factors in the etiology of autism. In some cases chromosomal abnormalities have been identified. One type of these abnormalities is gaps and breaks nonrandomly located in chromosomes, denominated fragile sites (FS). We cytogenetically analyzed a group of autistic individuals and a normal population, and we examined the FS found in both samples with the aim of (1) comparing their FS expression, (2) ascertaining whether any FS could be associated with our autistic sample, and (3) examining if there are differences between individual and pooled-data analyses. Different statistical methods were used to analyse the FS of pooled and individual data. Our results show that there are statistically significant differences in the spontaneous expression of breakages between patients and controls, with a minimal sex difference. Using the method for pooled data, eight autosomal FS have preferential expression in patients and five patients were found to be positive at FS Xq27.3. With the per-individual analysis method, four FS emerged as specific in our autistic sample. Inferences of FS from pooled data were different from those of individual data. The findings suggest that although analysis of pooled data is necessitated by the problem of sparse data, analysis of single individuals is essential to know the significance of FS in autism.

10.
For most complex trait association studies using next-generation sequencing, in addition to the primary phenotype of interest, many clinically important secondary traits are also available, which can be analyzed to map susceptibility genes. Owing to high sequencing costs, most studies use selected samples, and the sampling mechanisms of these studies can be complicated. When the primary and secondary traits are correlated, analyses of secondary phenotypes can cause spurious associations in selected samples and existing methods are inadequate to adjust for them. To address this problem, a likelihood-based method, MULTI-TRAIT-ASSOCIATION (MTA) was developed. MTA is flexible and can be applied to any study with known sampling mechanisms. It also allows efficient inferences of genetic parameters. To investigate the power of MTA and different study designs, extensive simulations were performed under rigorous population genetic and phenotypic models. It is demonstrated that there are great benefits for analyzing secondary phenotypes in selected samples. In particular, using case-control samples and samples with extreme primary phenotypes can be more powerful than analyzing random samples of equivalent size. One major challenge for sequence-based association studies is that most data sets are not of sufficient size to be adequately powered. By applying MTA, data sets ascertained under distinct mechanisms or targeted at different primary traits can be jointly analyzed to map common phenotypes and greatly increase power. The combined analysis can be performed using freely available data sets from public repositories, for example, dbGaP. In conclusion, MTA will have an important role in dissecting the etiology of complex traits.

11.
Comparative analysis of multiple genome-scale data sets   Total citations: 4, self-citations: 0, citations by others: 4
The ongoing analysis of published genome-scale data sets is evidence that different approaches are required to completely mine this data. We report the use of novel tools for both visualization and data set comparison to analyze yeast gene-expression (cell cycle and exit from stationary phase/G(0)) and protein-interaction studies. This analysis led to new insights about each data set. For example, G(1)-regulated genes are not co-regulated during exit from stationary phase, indicating that the cells are not synchronized. The tight clustering of other genes during exit from stationary-phase data set further indicates the physiological responses during G(0) exit are separable from cell-cycle events. Comparison of the two data sets showed that ribosomal-protein genes cluster tightly during exit from stationary phase, but are found in three significantly different clusters in the cell-cycle data set. Two protein-interaction data sets were also compared with the gene-expression data. Visual analysis of the complete data sets showed no clear correlation between co-expression of genes and protein interactions, in contrast to published reports examining subsets of the protein-interaction data. Neither two-hybrid study identified a large number of interactions between ribosomal proteins, consistent with recent structural data, indicating that for both data sets, the identification of false-positive interactions may be lower than previously thought.

12.
The Major Histocompatibility Complex (MHC) is a genomic region encoding immune loci that are important and frequently used markers in studies of adaptive genetic variation and disease resistance. Given the primary role of infectious diseases in contributing to global amphibian declines, we characterized the hypervariable exon 2 and flanking introns of the MHC Class IIβ chain for 17 species of frogs in the Ranidae, a speciose and cosmopolitan family facing widespread pathogen infections and declines. We find high levels of genetic variation concentrated in the Peptide Binding Region (PBR) of the exon. Ten codons are under positive selection, nine of which are located in the mammal-defined PBR. We hypothesize that the tenth codon (residue 21) is an amphibian-specific PBR site that may be important in disease resistance. Trans-species and trans-generic polymorphisms are evident from exon-based genealogies, and co-phylogenetic analyses between intron, exon and mitochondrial based reconstructions reveal incongruent topologies, likely due to different locus histories. We developed two sets of barcoded adapters that reliably amplify a single and likely functional locus in all screened species using both 454 and Illumina based sequencing methods. These primers provide a resource for multiplexing and directly sequencing hundreds of samples in a single sequencing run, avoiding the labour and chimeric sequences associated with cloning, and enabling MHC population genetic analyses. Although the primers are currently limited to the 17 species we tested, these sequences and protocols provide a useful genetic resource and can serve as a starting point for future disease, adaptation and conservation studies across a range of anuran taxa.

13.
Objectives: Today, hospitals and other health care-related institutions are accumulating a growing bulk of real world clinical data. Such data offer new possibilities for the generation of disease models for the health economic evaluation. In this article, we propose a new approach to leverage cancer registry data for the development of Markov models. Records of breast cancer patients from a clinical cancer registry were used to construct a real world data driven disease model. Methods: We describe a model generation process which maps database structures to disease state definitions based on medical expert knowledge. Software was programmed in Java to automatically derive a model structure and transition probabilities. We illustrate our method with the reconstruction of a published breast cancer reference model derived primarily from clinical study data. In doing so, we exported longitudinal patient data from a clinical cancer registry covering eight years. The patient cohort (n = 892) comprised HER2-positive and HER2-negative women treated with or without Trastuzumab. Results: The models generated with this method for the respective patient cohorts were comparable to the reference model in their structure and treatment effects. However, our computed disease models reflect a more detailed picture of the transition probabilities, especially for disease free survival and recurrence. Conclusions: Our work presents an approach to extract Markov models semi-automatically using real world data from a clinical cancer registry. Health care decision makers may benefit from more realistic disease models to improve health care-related planning and actions based on their own data.
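Deriving transition probabilities from longitudinal records can be sketched as counting observed state changes and normalizing each row. The authors' software was written in Java and its details are not given here, so the Python below is only an illustration under assumptions: each patient is reduced to a sequence of disease states observed at equally spaced cycles, and the state names (`DFS`, `REC`, `DEAD`) are hypothetical.

```python
from collections import defaultdict


def transition_probabilities(sequences):
    """Estimate Markov transition probabilities from per-patient state sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        # Count every observed move from one cycle's state to the next.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1

    probs = {}
    for state, row in counts.items():
        total = sum(row.values())
        probs[state] = {nxt: n / total for nxt, n in row.items()}
    return probs


# Two hypothetical patients followed over three cycles each.
patients = [
    ["DFS", "DFS", "REC"],
    ["DFS", "REC", "DEAD"],
]
model = transition_probabilities(patients)
```

States that are never left in the data (here `DEAD`) simply get no outgoing row; a full model would treat them as absorbing.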

14.
Genomic inversions are an increasingly recognized source of genetic variation. However, a lack of reliable high-throughput genotyping assays for these structures has precluded a full understanding of an inversion's phylogenetic, phenotypic, and population genetic properties. We characterize these properties for one of the largest polymorphic inversions in man (the ~4.5-Mb 8p23.1 inversion), a structure that encompasses numerous signals of natural selection and disease association. We developed and validated a flexible bioinformatics tool that utilizes SNP data to enable accurate, high-throughput genotyping of the 8p23.1 inversion. This tool was applied retrospectively to diverse genome-wide data sets, revealing significant population stratification that largely follows a clinal "serial founder effect" distribution model. Phylogenetic analyses establish the inversion's ancestral origin within the Homo lineage, indicating that 8p23.1 inversion has occurred independently in the Pan lineage. The human inversion breakpoint was localized to an inverted pair of human endogenous retrovirus elements within the large, flanking low-copy repeats; experimental validation of this breakpoint confirmed these elements as the likely intermediary substrates that sponsored inversion formation. In five data sets, mRNA levels of disease-associated genes were robustly associated with inversion genotype. Moreover, a haplotype associated with systemic lupus erythematosus was restricted to the derived inversion state. We conclude that the 8p23.1 inversion is an evolutionarily dynamic structure that can now be accommodated into the understanding of human genetic and phenotypic diversity.

15.
Parameters of quantitative genetic models have traditionally been estimated by either algebraic manipulation of familial correlations (or familial mean squares), biometric model fitting, or multiple-group covariance structure analysis. With few exceptions, researchers who have used these methods for the analysis of twin data have assumed that their data were multinormal and, consequently, have used normal-theory estimation methods. It is shown that normal-theory methods produce biased genetic and environmental parameter estimates when data are censored. Specifically, with censored data, (1) normal-theory estimates of narrow-sense heritability are either positively or negatively biased, whereas (2) estimates of shared-familial environmental variance are always biased downward. An alternative method for estimating genetic and environmental parameters from censored twin data is proposed. The method is called genetic Tobit factor analysis (GTFA) and is an extension of the Tobit factor analysis model developed by Muthén (Br. J. Math. Stat. Psychol. 42, 241–250, 1989). Using a Monte Carlo design, the performance of GTFA is compared to traditional quantitative genetic methods in both large and small data sets. The results of this study suggest that GTFA is the preferred method for the genetic modeling of censored data obtained from twins.

16.
The problem of identifying motifs comprising nucleotides at a set of polymorphic DNA sites, not necessarily contiguous, arises in many human genetic problems. However, when the sites are not contiguous, no efficient algorithm exists for polymorphic motif identification. A search based on complete enumeration is computationally inefficient. We have developed probabilistic search algorithms to discover motifs of known or unknown lengths. We have developed statistical tests of significance for assessing a motif discovery, and a statistical criterion for simultaneously estimating motif length and discovering it. We have tested these algorithms on various synthetic data sets and have shown that they are very efficient, in the sense that the "true" motifs can be detected in the vast majority of replications and in a small number of iterations. Additionally, we have applied them to some real data sets and have shown that they are able to identify known motifs. In certain applications, it is pertinent to find motifs that contain contrasting nucleotides at the sites included in the motif (e.g., motifs identified in case-control association studies). For this, we have suggested appropriate modifications. Using simulations, we have discovered that the success rate of identification of the correct motif is high in case-control studies except when relative risks are small. Our analyses of evolutionary data sets resulted in the identification of some motifs that appear to have important implications on human evolutionary inference. These algorithms can easily be implemented to discover motifs from multilocus genotype data by simple numerical recoding of genotypes.

17.
Shape is data and data is shape. Biologists are accustomed to thinking about how the shape of biomolecules, cells, tissues, and organisms arise from the effects of genetics, development, and the environment. Less often do we consider that data itself has shape and structure, or that it is possible to measure the shape of data and analyze it. Here, we review applications of topological data analysis (TDA) to biology in a way accessible to biologists and applied mathematicians alike. TDA uses principles from algebraic topology to comprehensively measure shape in data sets. Using a function that relates the similarity of data points to each other, we can monitor the evolution of topological features: connected components, loops, and voids. This evolution, a topological signature, concisely summarizes large, complex data sets. We first provide a TDA primer for biologists before exploring the use of TDA across biological sub-disciplines, spanning structural biology, molecular biology, evolution, and development. We end by comparing and contrasting different TDA approaches and the potential for their use in biology. The vision of TDA, that data are shape and shape is data, will be relevant as biology transitions into a data-driven era where the meaningful interpretation of large data sets is a limiting factor.
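The simplest of these topological features, connected components, can be tracked with a few lines of code. The sketch below is a minimal illustration under assumptions, not a TDA library: similarity is taken to be Euclidean distance, two points are linked when their distance is at most a threshold `eps`, and loops and voids are not computed. The function name and the toy points are hypothetical.

```python
import math


def count_components(points, eps):
    """Count connected components when points within distance eps are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})


# Two well-separated pairs of points on a line.
cloud = [(0, 0), (1, 0), (5, 0), (6, 0)]
# Component counts as the threshold grows: a crude topological signature.
signature = [count_components(cloud, eps) for eps in (0.5, 1, 4)]
```

The thresholds at which the count drops mark the scales at which clusters merge, which is the information a persistence diagram records for connected components.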

18.
Increasingly, frequent temporal patterns discovered in longitudinal patient records are proposed as features for classification and prediction, and as means to cluster patient clinical trajectories. However, to justify that, we must demonstrate that most frequent temporal patterns are indeed consistently discoverable within the records of different patient subsets within similar patient populations. We have developed several measures for the consistency of the discovery of temporal patterns. We focus on time-interval relations patterns (TIRPs) that can be discovered within different subsets of the same patient population. We expect the discovered TIRPs (1) to be frequent in each subset, (2) to preserve their “local” metrics - the absolute frequency of each pattern, measured by a Proportion Test, and (3) to preserve their “global” characteristics - their overall distribution, measured by a Kolmogorov-Smirnov test. We also wanted to examine the effect on consistency, over a variety of settings, of varying the minimal frequency threshold for TIRP discovery, and of using a TIRP-filtering criterion that we previously introduced, the Semantic Adjacency Criterion (SAC). We applied our methodology to three medical domains (oncology, infectious hepatitis, and diabetes). We found that, within the minimal frequency ranges we had examined, 70–95% of the discovered TIRPs were consistently discoverable; 40–48% of them maintained their local frequency. TIRP global distribution similarity varied widely, from 0% to 65%. Increasing the threshold usually increased the percentage of TIRPs that were repeatedly discovered across different patient subsets within the same domain, and the probability of a similar TIRP distribution. Using the SAC principle enhanced, for most minimal support levels, the percentage of repeating TIRPs, their local consistency and their global consistency. The effect of using the SAC was further strengthened as the minimal frequency threshold was raised.
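The "local" consistency check, whether a pattern keeps the same frequency in two patient subsets, can be illustrated with a two-proportion z-test. The abstract names only a "Proportion Test", so the pooled two-sided z-test below is an assumption about its form, and the function name and counts are hypothetical.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided pooled z-test for equality of two proportions.

    Assumes the pooled proportion is strictly between 0 and 1.
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# A hypothetical pattern found in 30 of 100 patients in each of two subsets.
z_same, p_same = two_proportion_z_test(30, 100, 30, 100)
# The same pattern found in 60 of 100 versus 30 of 100 patients.
z_diff, p_diff = two_proportion_z_test(60, 100, 30, 100)
```

A pattern would count as locally consistent when the test does not reject equality at the chosen significance level.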

19.
BACKGROUND: In the assessment of stressful life events (SLEs), researchers have often tried to evaluate whether individual events are dependent or independent of the respondent's behaviour. We sought to validate this evaluation using a twin methodology. We predicted that dependent SLEs would be more heritable than independent SLEs. METHODS: We explored, by twin modelling, the resemblance in two pairs of past-year personal and network SLEs rated individually, by trained interviewers, on a four-point dependence-independence scale. We examined results from two waves of interviews with 785 female-female twin pairs ascertained from a population based registry. RESULTS: Twin model-fitting found no evidence for genetic effects on personal or network independent SLEs. However, familial-environmental factors played an important role in the aetiology of network independent SLEs. For personal and network dependent SLEs, by contrast, three of four analyses suggested a significant aetiological role for genetic factors with estimated heritabilities ranging from 19 to 51%. CONCLUSIONS: Our results support the validity of interviewer assessments of dependence versus independence of SLEs. As predicted, these assessments were relatively successful at distinguishing SLEs that were influenced by genetic factors from those that were not.

20.
Linkage- and association-based approaches have been applied to attempt to unravel the genetic predisposition for complex diseases. However, studies often report contradictory results even when similar population backgrounds are investigated. Unrecognized population substructures could possibly explain these inconsistencies. In an apparently homogeneous German sample of 612 patients with type 2 diabetes and end-stage diabetic nephropathy and 214 healthy controls, we tested for hidden population substructures and their possible effects on association. Using a genetic vector space analysis of genotypes of 20 microsatellite markers, we identified four distinct subsets of cases and controls. The significance of these substructures was demonstrated by subsequent association analyses, using three genetic markers (UCSNP-43,-19,-63; intron 3 of the calpain-10 gene). In the undivided sample, we found no association between individual SNPs or any haplogenotypes (i.e., the genotype combination of two multilocus haplotypes) and type 2 diabetes. In contrast, when analyzing the four groups separately, we found that there was evidence for association of the common C allele of UCSNP-63 with the trait in the largest group (n=547 cases/101 controls; P=0.002). In this subset haplotype 112 was more frequent in controls than in cases (P=0.006; haplogenotype 112/121: odds ratio (OR)=0.27, 95% confidence intervals (CI)=0.13-0.57), indicating a protective effect against the development of type 2 diabetes. Our study demonstrates that unconsidered population substructures (ethnicity-dependent factors) can severely bias association studies.
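An odds ratio with a 95% confidence interval of the kind reported above can be computed from a 2x2 table. The abstract does not give the underlying counts or the interval method, so the sketch below uses hypothetical counts and assumes the common log-odds (Woolf) approximation; it does not attempt to reproduce the reported OR of 0.27.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI for a 2x2 table with non-zero cells.

    a: carrier cases, b: carrier controls,
    c: non-carrier cases, d: non-carrier controls.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf (logit) approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to show the calculation.
or_value, ci_low, ci_high = odds_ratio_ci(10, 20, 30, 40)
```

An interval that lies entirely below 1, as in the abstract, indicates that carriers are under-represented among cases, which is what the authors read as a protective effect.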
