Robust analysis of prokaryotic pangenome gene gain and loss rates with Panstripe
Authors:Gerry Tonkin-Hill, Rebecca A. Gladstone, Anna K. Pöntinen, Sergio Arredondo-Alonso, Stephen D. Bentley, Jukka Corander
Affiliation:1.Department of Biostatistics, University of Oslo, 0372 Blindern, Norway;2.Parasites and Microbes, Wellcome Sanger Institute, Cambridge CB10 1RQ, United Kingdom;3.Helsinki Institute for Information Technology HIIT, Department of Mathematics and Statistics, University of Helsinki, 00014 Helsinki, Finland
Abstract:Horizontal gene transfer (HGT) plays a critical role in the evolution and diversification of many microbial species. The resulting dynamics of gene gain and loss can have important implications for the development of antibiotic resistance and the design of vaccine and drug interventions. Methods for the analysis of gene presence/absence patterns typically do not account for errors introduced in the automated annotation and clustering of gene sequences. In particular, methods adapted from ecological studies, including the pangenome gene accumulation curve, can be misleading as they may reflect the underlying diversity in the temporal sampling of genomes rather than a difference in the dynamics of HGT. Here, we introduce Panstripe, a method based on generalized linear regression that is robust to population structure, sampling bias, and errors in the predicted presence/absence of genes. We show using simulations that Panstripe can effectively identify differences in the rate and number of genes involved in HGT events, and illustrate its capability by analyzing several diverse bacterial genome data sets representing major human pathogens.

Genetic variation within microbial populations is shaped both by the accumulation of point mutations and by the acquisition and loss of genetic material through horizontal gene transfer (HGT). HGT can occur via the uptake of DNA from the environment, with the help of mobile genetic elements (MGEs; phages, integrative conjugative elements, and plasmids), or from direct contact between bacterial cells (Thomas and Nielsen 2005). Genes are also frequently duplicated and lost vertically upon cell division (Arnold et al. 2022). The influence of these sources of variation varies by species. Clonal species such as Mycobacterium tuberculosis (Mtb) typically accumulate variation nearly entirely through point mutations, whereas naturally transformable species such as Streptococcus pneumoniae and Neisseria meningitidis have very high rates of homologous recombination (Dubnau 1999). In other species such as Salmonella enterica, horizontal exchange is generally restricted to the movement of MGEs (Harris et al. 2010). Although HGT does not always have an impact on a microbe's fitness, it can lead to critical phenotypic changes such as the acquisition of antimicrobial resistance, virulence factors, and vaccine escape (Croucher et al. 2013; Wyres et al. 2019).

A common approach to analyzing horizontal exchange in microbial genomics is to group homologous gene sequences into orthologous and paralogous gene clusters. The union of these clusters within a particular species or group is commonly referred to as the pangenome (Medini et al. 2005). Genes are often further classified into either the “core” genome, which is found in almost all members of the group, or the “accessory” genome, which is only found in a subset of genomes.
Species with a limited accessory genome such that all genes are likely to have already been observed are often described as “closed,” whereas species with a diverse accessory genome are described as “open.”

A number of tools have been developed to infer a pangenome given a collection of annotated genomes (Page et al. 2015; Ding et al. 2018; Bayliss et al. 2019; Gautreau et al. 2020; Tonkin-Hill et al. 2020; Zhou et al. 2020). A common output of these algorithms is a binary gene presence/absence matrix where genomes are represented by rows and orthologous gene clusters by columns. After generating a gene presence/absence matrix, researchers are often interested in comparing the size of pangenomes between data sets, determining the rate of horizontal gene exchange, and identifying whether a pangenome is “open” or “closed.”

A gene accumulation curve, of the kind commonly constructed in ecological studies of species diversity, is often used to investigate these questions (Ugland et al. 2003; Medini et al. 2005). Here, the number of unique gene clusters identified is plotted against the number of genomes. Random permutations are often used to account for the variation caused by the order in which genomes are considered in the plot. In some cases, a power law such as Heaps' or Zipf's law is fit to this curve to give a parameter estimate of the diminishing number of new genes found with each additional genome and to determine whether the pangenome is open or closed (Tettelin et al. 2008).

A neglected problem with this approach is that it fails to account for the underlying diversity of the set of sampled genomes. For example, a set of genomes taken from within an outbreak is likely to involve far fewer gene exchange events than a diverse sample from a species with thousands of years of evolution separating isolates.
Methods that make use of a phylogeny constructed from the genetic diversity present in genes found in all the genomes (the “core” genome) help to address this issue by controlling for the underlying diversity of the sample. The branch lengths of the core genome phylogeny indicate the evolutionary time over which gene gain and loss events could have occurred. Shorter branch lengths separating more closely related taxa would be expected to have fewer associated gene exchange events. Methods that rely on the construction of such a phylogeny include those based on maximum parsimony (Mirkin et al. 2003), maximum likelihood (Hao and Golding 2006; Cohen and Pupko 2010; Han et al. 2013), and Bayesian phylogenetics (Liu et al. 2011). Two notable models that use this approach are the infinitely many genes (IMG) model and the finitely many genes (FMG) model (Baumdicker et al. 2010, 2012; Collins and Higgs 2012; Zamani-Dahaj et al. 2016). The IMG model assumes an infinite pool of genes and that a particular gene can only be gained once, whereas the FMG model assumes that genes belong to a finite pool and that multiple gene gain and loss events of the same gene can occur. Many models also collapse paralogous clusters into gene families before the inference of gene gain and loss rates (Mirkin et al. 2003; Cohen and Pupko 2010; Han et al. 2013).

A significant limitation of these approaches is that they generally assume that there is no error in the inferred pangenome presence/absence matrix. We and others have shown that gene annotation errors and the complexities of clustering genes into orthologous families can introduce substantial numbers of erroneous gene clusters (Han et al. 2013; Salzberg 2019; Tonkin-Hill et al. 2020; Zhou et al. 2020). Although a subset of models do account for errors in the predicted presence/absence of genes, these have mostly been optimized for the analysis of eukaryotes and focus on a small number of gene families involving multiple genes (Han et al. 2013). Most models also make the simplifying assumption that genes are gained or lost individually, which can significantly bias estimates of the rate of gene exchange, particularly when the exchange of MGEs is frequent (Baumdicker et al. 2012; Zamani-Dahaj et al. 2016).

To address these limitations, we have developed Panstripe, an approach that compares the rates of core and accessory genome evolution to account for both population structure and errors in the pangenome gene presence/absence matrix. Using extensive simulations and by analyzing a diverse range of bacterial genome data sets, we show that Panstripe can effectively identify the rate of gene exchange in pangenomes, detect the presence of a temporal signal in the accessory genome, and discern whether the size of gene exchange events varies between pangenomes.
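
For readers unfamiliar with the gene accumulation curve criticized earlier in this introduction, the following minimal sketch shows how such a curve might be computed from a binary presence/absence matrix (genomes in rows, gene clusters in columns), averaging over random genome orderings, and how a power law of the form new_genes(N) ≈ κ·N^(−α) could be fit to the number of new genes per added genome. This is not Panstripe and not the code of any cited tool. The function names, the toy simulated matrix, and the choice of a log-log least-squares fit are all illustrative assumptions.

```python
import numpy as np

def accumulation_curve(pa, n_perm=100, seed=0):
    """Mean number of distinct gene clusters seen after adding 1..n genomes.

    pa: binary presence/absence matrix, genomes in rows, clusters in columns.
    The genome order is shuffled n_perm times and the curves are averaged.
    """
    rng = np.random.default_rng(seed)
    pa = np.asarray(pa, dtype=bool)
    n_genomes = pa.shape[0]
    curves = np.empty((n_perm, n_genomes))
    for p in range(n_perm):
        order = rng.permutation(n_genomes)
        # Row i of `seen` marks clusters present in any of the first i+1 genomes.
        seen = np.logical_or.accumulate(pa[order], axis=0)
        curves[p] = seen.sum(axis=1)
    return curves.mean(axis=0)

def fit_new_gene_decay(curve):
    """Fit new_genes(N) ~ kappa * N**(-alpha) by least squares on a log-log scale.

    Hypothetical helper: positions with zero new genes are dropped because
    their logarithm is undefined.
    """
    curve = np.asarray(curve, dtype=float)
    new_genes = np.diff(curve)               # genes contributed by genome N = 2..n
    n = np.arange(2, len(curve) + 1)
    keep = new_genes > 0
    slope, intercept = np.polyfit(np.log(n[keep]), np.log(new_genes[keep]), 1)
    return np.exp(intercept), -slope         # (kappa, alpha)

if __name__ == "__main__":
    # Toy data (assumption): 50 genomes, 200 core genes present in every
    # genome, and 300 accessory genes each present with probability 0.05.
    rng = np.random.default_rng(1)
    core = np.ones((50, 200), dtype=bool)
    accessory = rng.random((50, 300)) < 0.05
    pa = np.hstack([core, accessory])

    curve = accumulation_curve(pa, n_perm=200)
    kappa, alpha = fit_new_gene_decay(curve)
    print(f"genes after 1 genome: {curve[0]:.1f}, after all: {curve[-1]:.1f}")
    print(f"fitted kappa = {kappa:.2f}, alpha = {alpha:.2f}")
```

In the convention usually attributed to Tettelin et al. (2008), a fitted α of at most 1 is read as an open pangenome and a larger α as a closed one. As the introduction argues, the fitted value depends on how diverse the sampled genomes are, so the same species can yield different values under different sampling schemes.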