As populations boom and bust, the accumulation of genetic diversity is modulated, encoding histories of living populations in present-day variation. Many methods exist to decode these histories, and all must make strong model assumptions. It is typical to assume that mutations accumulate uniformly across the genome at a constant rate that does not vary between closely related populations. However, recent work shows that mutational processes in human and great ape populations vary across genomic regions and evolve over time. This perturbs the mutation spectrum (relative mutation rates in different local nucleotide contexts). Here, we develop theoretical tools in the framework of Kingman’s coalescent to accommodate mutation spectrum dynamics. We present mutation spectrum history inference (mushi), a method to perform nonparametric inference of demographic and mutation spectrum histories from allele frequency data. We use mushi to reconstruct trajectories of effective population size and mutation spectrum divergence between human populations, identify mutation signatures and their dynamics in different human populations, and calibrate the timing of a previously reported mutational pulse in the ancestors of Europeans. We show that mutation spectrum histories can be placed in a well-studied theoretical setting and rigorously inferred from genomic variation data, like other features of evolutionary history.

Over the past decade, population geneticists have developed many sophisticated methods for inferring population demography and have consistently found that simple isolated populations of constant size are far from the norm (reviewed in refs. 1–3). Population expansions and founder events, as well as migration between species and geographic regions, have been inferred from virtually all high-resolution genetic datasets. We now recognize that inferring these nonequilibrium demographies is often essential for understanding the histories of adaptation and global migration. Population genetics has uncovered many features of human history that were once virtually unknowable by other means, revealing a complex series of migrations, population replacements, and admixture networks among human groups and extinct hominoids.

Although demographic inference methods can model complex population histories, the germline mutation process that creates variation has long received a comparatively simple treatment. A single parameter, μ, is used to represent the mutation rate per generation at all loci, in all individuals, and at all times. In humans, μ is estimated from parent–child trio sequencing studies, and modest variation in μ can have major effects on the interpretation of inferred parameters, such as times of admixture and population divergence. In other organisms, for which trio sequence data are usually unavailable, μ is estimated from sequence divergence between species with a fossil-calibrated divergence time, and these estimates come with still higher uncertainty.

A growing body of evidence indicates that simple, constant mutation rate models may not adequately describe how variation accumulates on either inter- or intraspecific timescales (4–7). Germline mutation rates appear to have evolved during the speciation of great apes and the divergence of modern human populations (reviewed in ref. 8). Much of this evolution might be caused by nearly neutral drift (9), but a contributing factor could be selection on traits, like life history and chromatin structure, that indirectly affect mutation accumulation. Because mutation is intimately tied to the basic housekeeping processes of cell division, gamete production, and embryonic development, the accumulation of mutations is likely to be complexly coupled to other biological processes (10–12).

It is difficult to disentangle past changes in mutation rate from past changes in effective population size, which modulate levels of polymorphism even when the mutation rate stays constant. However, evolution of the mutation process can be indirectly detected by measuring its effects on the mutation spectrum: the relative mutation rates among different local nucleotide contexts. Hwang and Green (13) modeled the triplet context dependence of the substitution process in a mammalian phylogeny, finding varying contributions from replication errors, cytosine deamination, and biased gene conversion and showing that the relative rates of these processes varied between different mammalian lineages. Many cancers also exhibit somatic hypermutability of certain triplet motifs due to different DNA damage agents and failure points in the DNA repair process (14, 15). Harris (6) and Harris and Pritchard (7) examined the variation of triplet spectra between closely related populations, counting single-nucleotide variants in each triplet mutation type as a proxy for mutational input. They found that human triplet spectra distinctly cluster by continental ancestry group and that historical pulses in mutation activity influence the distribution of allele frequencies in certain mutation types. The divergence of mutation spectra among human continental groups has been replicated in independently generated datasets (7, 16), and similar patterns have been observed in other species, including great apes (17), mice (18), and yeast (19). Some of the mutation spectrum divergence between mice and yeast lineages has been mapped to mutator alleles (19, 20).

Emerging from the literature is a picture of a mutation process evolving within and between populations, anchored to genomic features and accented by spectra of local nucleotide context. If probabilistic models of population genetic processes are to keep pace with these empirical findings, mutation deserves a richer treatment in state-of-the-art inference tools. In this paper, we build on classical theoretical tools to introduce fast nonparametric inference of population-level mutation spectrum history (MuSH), the relative mutation rate in different local nucleotide contexts across time, alongside inference of demographic history. Whereas previous work has uncovered mutation spectrum evolution using summary statistics of standing variation, we shift perspective to focus on inference of the MuSH, which we model on the same footing as demography.

Demographic inference requires us to invert the map that takes population history to the patterns of genetic diversity observable today. This task is often simplified by first compressing these genetic diversity data into a summary statistic such as the sample frequency spectrum (SFS), the distribution of derived allele frequencies among sampled haplotypes. The SFS is a well-studied population genetic summary statistic that is sensitive to demographic history. Inverting the map from demographic history to SFS is a notoriously ill-posed problem, in that many different population histories can have identical expected SFS (21–25). One way to deal with the ill-posedness of demographic inference is to specify a parametric model of population size change, usually piecewise linear or piecewise exponential. An alternative, which generalizes to other inverse problems, is to allow a more general space of solutions but to regularize by penalizing histories that contain biologically unrealistic features (e.g., high-frequency population size oscillations). Both approaches shrink the set of feasible solutions to the inverse problem so that it becomes well posed and can be thought of as leveraging prior knowledge. In particular, regularization constrains the population size from changing on arbitrarily small timescales since significant population size change usually takes at least a few generations.

In this paper, we extend a coalescent framework for demographic inference to accommodate inference of the MuSH from an SFS that is resolved into different local k-mer nucleotide contexts. This is a richer summary statistic that we call the k-SFS where, for example, k = 3 means triplet context. We show using coalescent theory that the k-SFS is related to the MuSH by a linear transformation while depending nonlinearly on the demographic history. We infer both demographic history and MuSH by optimizing a cost that balances a data-fitting term, which uses the forward map from coalescent theory, against regularization terms that favor solutions with low complexity. Our open-source software mushi (mutation spectrum history inference) is available in ref. 26 as a Python package with extensive documentation. Using default settings and modest hardware, mushi takes only a few seconds to infer histories from population-scale sample frequency data.

The recovered MuSH is a rich object that illuminates dimensions of population history and addresses biological questions about the evolution of the mutation process. After validating with data simulated under known histories, we use mushi to independently infer histories for each of the 26 populations (from 5 superpopulations defined by continental ancestry) from the 1000 Genomes Project (1KG) Consortium (27) using recent high-coverage sequencing data (28). We demonstrate that mushi is a powerful tool for demographic inference that has several advantages over existing demographic inference methods and then go on to describe the illuminated features of human MuSH.

We recover demographic features that are robust to regularization parameter choices, including the out-of-Africa event and the more recent bottleneck in the ancestors of modern Finns, and we find that effective population sizes converge ancestrally within each superpopulation, despite being inferred independently. Decomposing human MuSH into mutation signatures varying through time in each population, we see global divergence in the mutation process that impacts many mutation types and reflects population and superpopulation relatedness. Finally, we revisit the timing of a previously reported ancient pulse of elevated TCC→TTC mutation rate, active primarily in the ancestors of Europeans and absent in East Asians (6, 7, 29, 30). We find that the extent of the pulse into the ancient past is sensitive to the choice of demographic history model but that all demographic models that fit the k-SFS yield a pulse timing that is significantly older than previously thought, seemingly arising near the divergence time of East Asians and Europeans.

With mushi, we can quickly reconstruct demographic history and MuSH without strong model specification requirements. This adds an approach to the toolbox for researchers interested only in demographic inference. For researchers studying the mutation spectrum, demographic history is necessary for time calibration of events in mutation history, so we expect that jointly modeling demography and MuSH will be important for studying mutational spectrum evolution in population genetics.
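
To make the summary statistic described above concrete, the following is a minimal sketch of how an SFS resolved by triplet mutation type (a k-SFS with k = 3) could be tabulated from per-variant data. It is an illustration under stated assumptions, not the interface of mushi: the function name `ksfs_from_variants`, the `ANC>DER` labels for mutation types, and the toy counts are all hypothetical.

```python
import numpy as np


def ksfs_from_variants(derived_counts, mutation_types, n):
    """Tabulate a k-SFS from per-variant data (hypothetical helper).

    derived_counts: derived allele count of each variant among n haplotypes
    mutation_types: mutation type label of each variant, e.g. "TCC>TTC"
    Returns an (n - 1) x (number of types) count matrix and the column labels.
    Row i holds variants whose derived allele appears in i + 1 haplotypes.
    """
    types = sorted(set(mutation_types))
    column = {t: j for j, t in enumerate(types)}
    ksfs = np.zeros((n - 1, len(types)), dtype=int)
    for count, mut_type in zip(derived_counts, mutation_types):
        # Keep only sites that are segregating in the sample (0 < count < n).
        if 0 < count < n:
            ksfs[count - 1, column[mut_type]] += 1
    return ksfs, types


# Toy data (invented for illustration): 8 variants among n = 5 haplotypes.
n = 5
derived_counts = [1, 1, 2, 4, 1, 3, 5, 0]
mutation_types = [
    "TCC>TTC", "ACG>ATG", "TCC>TTC", "ACG>ATG",
    "TCC>TTC", "GAA>GCA", "TCC>TTC", "ACG>ATG",
]

X, types = ksfs_from_variants(derived_counts, mutation_types, n)

# Summing over mutation types recovers the ordinary SFS.
sfs = X.sum(axis=1)  # [3, 1, 1, 1] for this toy data

print(types)
print(X)
print(sfs)
```

Each column of the matrix is the SFS of one mutation type, so summing across columns recovers the ordinary SFS. In this toy example the last two variants are dropped because their derived allele is either fixed or absent in the sample.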