Segmenting the human genome based on states of neutral genetic divergence

Authors:	Prabhani Kuruppumullage Don, Guruprasad Ananda, Francesca Chiaromonte, Kateryna D. Makova

Affiliation:	Departments of (a) Statistics and (c) Biology, and (b) Huck Institute of the Life Sciences, The Pennsylvania State University, University Park, PA 16802

Abstract:	Many studies have demonstrated that divergence levels generated by different mutation types vary and covary across the human genome. To improve our still-incomplete understanding of the mechanistic basis of this phenomenon, we analyze several mutation types simultaneously, anchoring their variation to specific regions of the genome. Using hidden Markov models on insertion, deletion, nucleotide substitution, and microsatellite divergence estimates inferred from human–orangutan alignments of neutrally evolving genomic sequences, we segment the human genome into regions corresponding to different divergence states, each uniquely characterized by a specific combination of divergence levels. We then parse the mutagenic contributions of various biochemical processes by associating divergence states with a broad range of genomic landscape features. We find that high divergence states inhabit guanine- and cytosine (GC)-rich, highly recombining subtelomeric regions; low divergence states cover inner parts of autosomes; chromosome X forms its own state with the lowest divergence; and a state of elevated microsatellite mutability is interspersed across the genome. These general trends are mirrored in human diversity data from the 1000 Genomes Project, and departures from them highlight the evolutionary history of primate chromosomes. We also find that genes and noncoding functional marks [annotations from the Encyclopedia of DNA Elements (ENCODE)] are concentrated in high divergence states. Our results provide a powerful tool for biomedical data analysis: segmentations can be used to screen personal genome variants (including those associated with cancer and other diseases) and to improve computational predictions of noncoding functional elements.

Whole-genome sequencing studies have demonstrated that divergence estimates for several mutation types (e.g., nucleotide substitutions, insertions, and deletions) vary substantially across the human genome.
This phenomenon has been studied at various genomic scales and evolutionary distances (reviewed in ref. 1) and, although initially of interest solely to evolutionary biologists, is now entering the purview of mainstream biomedical research. Specifically, human population (e.g., ref. 2) and cancer (3, 4) genome resequencing projects have revealed that incidences of single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and copy number variants (CNVs) vary across the genome. Divergence estimates for different mutation types also covary across the genome (5, 6); e.g., substitution rates increase in regions with high indel rates (7). This suggests that regional variation is an important and general characteristic of mutations.

Variation in divergence is often linked to genomic landscape features such as base composition, replication timing, and recombination rates (1). For instance, nucleotide substitution rates are elevated in late-replicating regions because of an accumulation of single-stranded DNA susceptible to endogenous damage (8) and are affected by chromatin structure (9) and by recombination, as a result of either biased gene conversion (BGC) (10) or the mutagenicity of recombination (2). Moreover, nucleotide substitution rates depend nonlinearly on guanine and cytosine (GC) content (11) and are affected by methylation levels and GC content at cytosine-phosphate-guanine (CpG) sites (12) and by replication timing and distance to telomeres at non-CpG sites (13). Covariation in divergence among different mutation types can also be at least partly attributed to the influence of a common genomic landscape (5). Importantly, we note that, whereas selection may indeed operate in noncoding regions, it is unlikely to explain the large degree of variation and covariation in divergence estimates computed from putatively neutral sequences (1, and references therein).
Divergence computed for neutral DNA ought to reflect mutation, BGC, and (for relatively distant species) only a minimal amount of diversity.

Anchoring variation and covariation of divergence estimates for different mutation types to specific regions of the genome is crucial for elucidating how biochemical processes (e.g., replication and recombination) drive mutagenesis (8, 10) and for understanding genome evolution. Such a “geographic” characterization may correlate with the spatial distribution of genes; for instance, cellular receptor and housekeeping genes tend to be located in regions of high and low nucleotide substitution rate, respectively (14). It may also aid prediction of noncoding functional elements (15, 16) and studies of the genetic basis of disease. For instance, it could assist in (i) discerning whether a locus exhibits an excess of mutations because it resides in a hotspot, thus preventing false-positive associations with a disease; and (ii) identifying loci with mutational signatures typical of a disease, e.g., explaining the frequent coincidence in tumors of sites prone to DNA damage and chromosomal instability (17).

With this motivation, we used hidden Markov models (HMMs) (18), a well-established statistical tool, to analyze human divergence for different mutation types. An HMM models a sequence of observations as governed by underlying states that are not directly observable (hidden) but can be inferred from the data. These states alternate along the sequence following a Markovian structure, i.e., the state governing a given observation may depend on the state governing the preceding observation. Maximum likelihood techniques are used to select an appropriate number of states, characterize them, and partition the sequence into contiguous segments governed by each state.

In genomics, HMMs have been used to model stretches of DNA (the sequences of observations) in a variety of applications ranging from gene finders (19) to epigenomic segmentations (20, 21).
In our study, we compute divergence estimates for four mutation types (substitutions, insertions, deletions, and microsatellite repeat number alterations) in nonoverlapping windows of neutrally evolving sequences present in human–orangutan genomic alignments. Modeling the resulting observations with HMMs, we identify distinct divergence states characterized by biologically meaningful combinations of elevated, average, or depressed divergence levels for the four mutation types, e.g., a state where only microsatellite mutations are elevated, whereas the other divergence types are average, or one where substitutions, insertions, and deletions are all elevated, whereas microsatellite divergence is average. Correspondingly, we partition the genome into chromosomal segments governed by these states. Our analysis departs from previous applications in considering several mutation types simultaneously, thus accounting for their interdependencies, and in generating segmentations not on a small 100-bp scale (22) but on a larger 1-Mb scale that robustly captures variation in divergence for mammalian genomes (11, 23). Additionally, we investigate whether divergence states differ in genomic landscape features that proxy underlying mutagenic processes, correlate with the spatial organization of functional elements, and persist when assessed from human diversity estimates or at varying genomic scales.
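
The idea of assigning each window to a hidden divergence state and then merging adjacent windows into segments can be illustrated with a small sketch. This is not the authors' implementation. It assumes three hypothetical states ("low", "high", "msat"), independent Gaussian emissions for the four mutation types, and hand-picked means, standard deviations, and transition probabilities, whereas the study estimates the states and their number by maximum likelihood. The six windows of standardized divergence values are invented for illustration; only the 1-Mb window size comes from the text. Decoding uses the Viterbi algorithm, a standard way to recover the most likely state path.

```python
import math
from itertools import groupby


def log_gauss(x, mean, sd):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(obs, states, log_start, log_trans, emit_params):
    """Most likely hidden-state path for a sequence of multivariate windows.

    obs: list of tuples, one tuple of divergence values per window.
    emit_params[s]: list of (mean, sd) pairs, one per mutation type
    (independent Gaussian emissions are a simplifying assumption).
    """
    def log_emit(s, x):
        return sum(log_gauss(v, m, sd) for v, (m, sd) in zip(x, emit_params[s]))

    score = {s: log_start[s] + log_emit(s, obs[0]) for s in states}
    back = []  # back[i][s] = best predecessor of state s at window i + 1
    for x in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            score[s] = prev[best] + log_trans[best][s] + log_emit(s, x)
            pointers[s] = best
        back.append(pointers)

    # Trace the best path backwards from the highest-scoring final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def segments(path, window_size=1_000_000):
    """Collapse per-window states into contiguous (start, end, state) segments."""
    out, pos = [], 0
    for state, run in groupby(path):
        n = len(list(run))
        out.append((pos * window_size, (pos + n) * window_size, state))
        pos += n
    return out


# Hypothetical states and parameters (NOT taken from the paper).
# Value order: substitution, insertion, deletion, microsatellite (z-scores).
states = ["low", "high", "msat"]
emit_params = {
    "low": [(-1.0, 1.0)] * 3 + [(0.0, 1.0)],   # depressed subs/indels
    "high": [(1.5, 1.0)] * 3 + [(0.0, 1.0)],   # elevated subs/indels
    "msat": [(0.0, 1.0)] * 3 + [(2.0, 1.0)],   # only microsatellites elevated
}
log_start = {s: math.log(1 / 3) for s in states}
log_trans = {
    r: {s: math.log(0.9 if r == s else 0.05) for s in states} for r in states
}

# Invented divergence values for six consecutive 1-Mb windows.
windows = [
    (-1.0, -0.8, -1.2, 0.1),
    (-0.9, -1.1, -0.7, -0.2),
    (0.1, -0.1, 0.2, 2.2),
    (0.0, 0.2, -0.1, 1.9),
    (1.6, 1.4, 1.7, 0.0),
    (1.4, 1.8, 1.3, 0.2),
]

path = viterbi(windows, states, log_start, log_trans, emit_params)
for start, end, state in segments(path):
    print(f"{start}-{end}\t{state}")
```

The high self-transition probability (0.9 here, an arbitrary choice) is what favors contiguous runs of windows in the same state over window-by-window switching. In a real analysis the emission and transition parameters would be fitted to the data rather than fixed by hand.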
