Similar articles
Found 20 similar articles (search time: 46 ms)
1.
The human gut microbiome is a complex ecosystem composed mainly of uncultured bacteria. It plays an essential role in the catabolism of dietary fibers, the part of plant material in our diet that is not metabolized in the upper digestive tract, because the human genome does not encode adequate carbohydrate active enzymes (CAZymes). We describe a multi-step functionally based approach to guide the in-depth pyrosequencing of specific regions of the human gut metagenome encoding the CAZymes involved in dietary fiber breakdown. High-throughput functional screens were first applied to a library covering 5.4 × 10^9 bp of metagenomic DNA, allowing the isolation of 310 clones showing beta-glucanase, hemicellulase, galactanase, amylase, or pectinase activities. Based on the results of refined secondary screens, sequencing efforts were reduced to 0.84 Mb of nonredundant metagenomic DNA, corresponding to 26 clones that were particularly efficient for the degradation of raw plant polysaccharides. Seventy-three CAZymes from 35 different families were discovered. This corresponds to a fivefold target-gene enrichment compared to random sequencing of the human gut metagenome. Thirty-three of these CAZyme-encoding genes are highly homologous to prevalent genes found in the gut microbiome of at least 20 individuals for whom metagenomic data are available. Moreover, 18 multigenic clusters encoding complementary enzyme activities for plant cell wall degradation were also identified. Gene taxonomic assignment is consistent with horizontal gene transfer events in dominant gut species and provides new insights into the human gut functional trophic chain. The human intestinal microbiome is the dense and complex ecosystem that resides in the distal part of our digestive tract. Its role in metabolizing dietary constituents (Sonnenburg et al. 2005; Flint et al. 2008; Ley et al. 2008) and in protecting the host against pathogens (Rakoff-Nahoum et al. 
2004) is crucial to human health (Macdonald and Monteleone 2005; McGarr et al. 2005; Manichanh et al. 2006; Turnbaugh and Gordon 2009). It is mainly composed of commensal bacteria from the Bacteroidetes, Firmicutes, Proteobacteria, and Actinobacteria phyla, and of several archaeal and eukaryotic species. With up to 10^12 cells per gram of feces, the bacterial diversity is estimated to reach 1000 operational taxonomic units (OTUs) per individual, 70% to 80% of the most dominant ones being subject-specific (Zoetendal et al. 1998; Tap et al. 2009). However, only 20% of the bacterial species have been successfully cultured so far (Eckburg et al. 2005). Large-scale analyses of genomic and metagenomic sequences have provided gene catalogs and statistical evidence on protein families involved in the predominant functions of the human gut microbiome (Gill et al. 2006; Kurokawa et al. 2007; Flint et al. 2008; Turnbaugh et al. 2009; Qin et al. 2010), among which the catabolism of dietary fibers is of particular interest in human nutrition and health. Dietary fibers are the components of vegetables, cereals, leguminous seeds, and fruits that are not digested in the stomach or in the small intestine, but are fermented in the colon by the gut microbiome and/or excreted in feces (Grabitske and Slavin 2008). Chemically, dietary fibers are mainly composed of complex plant cell wall polysaccharides and their associated lignin (Selvendran 1984), along with storage polysaccharides such as fructans and resistant starch (Institute of Medicine 2005). Dietary fibers have been identified as a strong positive dietary factor in the prevention of obesity, diabetes, and cardiovascular diseases (World Health Organization 2003). 
Because of the wide structural diversity of dietary fibers, the human gut bacteria produce a huge panel of carbohydrate active enzymes (CAZymes), with widely different substrate specificities, to degrade these compounds into metabolizable monosaccharides and disaccharides. The functions and the evolutionary relationships of CAZyme-encoding genes of the human gut microbiome are being extensively studied through functional and structural genomics investigations (Flint et al. 2008; Lozupone et al. 2008; Mahowald et al. 2009; Martens et al. 2009), which are nevertheless restricted to cultivated bacterial species. CAZyme diversity has also been described in three metagenomics studies focused on this microbiome (Gill et al. 2006; Turnbaugh et al. 2009, 2010), and these revealed the presence of at least 81 families of glycoside hydrolases, making the human gut metagenome one of the richest sources of CAZymes (Li et al. 2009). However, the proof of function of annotated genes derived from metagenomes still constitutes a goal for enzyme discovery. This can be addressed by functional screening of metagenomic libraries, in order to retrieve genes of interest. Numerous studies have provided conclusive evidence on the potential of such an approach for the identification of novel glycoside hydrolases from various ecosystems such as soil (Rondon et al. 2000; Richardson et al. 2002; Voget et al. 2003; Pang et al. 2009), lakes (Rees et al. 2003), hot springs (Tang et al. 2006, 2008), rumen (Ferrer et al. 2005; Guo et al. 2008; Liu et al. 2008; Duan et al. 2009), rabbit (Feng et al. 2007), and insect guts (Brennan et al. 2004; for review, see Ferrer et al. 2009; Li et al. 2009; Simon and Daniel 2009; Uchiyama and Miyazaki 2009). In all cases, the identification of the gene responsible for the screened activity was carried out by sequencing only a few kilobases of metagenomic DNA. 
Collectively, these studies have established an experimental proof of function for 35 glycoside hydrolases (from eight families) derived from metagenomes (data from the CAZy database; http://www.cazy.org/), a number that is very small considering the known CAZy diversity. Here, we examined the potential of high-throughput functional screening of large insert libraries to guide in-depth pyrosequencing of specific regions of the human gut metagenome that encode the enzymatic machinery involved in dietary fiber catabolism.
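As a rough check on the figures quoted in this abstract, the gene density can be worked out directly. The sketch below assumes (the abstract does not say so) that the fivefold enrichment is a ratio of CAZyme genes per megabase of sequenced DNA; the function and variable names are illustrative only.

```python
def genes_per_mb(gene_count: int, sequenced_mb: float) -> float:
    """Gene density in genes per megabase of sequenced DNA."""
    if sequenced_mb <= 0:
        raise ValueError("sequenced_mb must be positive")
    return gene_count / sequenced_mb


# Figures quoted in the abstract: 73 CAZymes in 0.84 Mb of nonredundant DNA.
targeted_density = genes_per_mb(73, 0.84)

# ASSUMPTION: "fivefold enrichment" is read as a ratio of gene densities, so
# the implied random-sequencing baseline is one fifth of the targeted value.
implied_baseline = targeted_density / 5

print(f"targeted: {targeted_density:.1f} CAZymes/Mb")
print(f"implied baseline: {implied_baseline:.1f} CAZymes/Mb")
```

Under that reading, the targeted clones carry roughly 87 CAZymes per Mb, which would put the random-sequencing baseline near 17 per Mb. The authors may have computed the enrichment differently.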

3.
Next-generation massively parallel DNA sequencing technologies provide ultrahigh throughput at a substantially lower unit data cost; however, the data consist of very short reads, making de novo assembly extremely challenging. Here, we describe a novel method for de novo assembly of large genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving N50 contig sizes of 7.4 and 5.9 kilobases (kb) and N50 scaffold sizes of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way. The development and commercialization of next-generation massively parallel DNA sequencing technologies, including Illumina Genome Analyzer (GA) (Bentley 2006), Applied Biosystems SOLiD System, and Helicos BioSciences HeliScope (Harris et al. 2008), have revolutionized genomic research. Compared to traditional Sanger capillary-based electrophoresis systems, these new technologies provide ultrahigh throughput with two orders of magnitude lower unit data cost. However, they all share a common intrinsic characteristic of providing very short read length, currently 25–75 base pairs (bp), which is substantially shorter than the Sanger sequencing reads (500–1000 bp) (Shendure et al. 2004). This has raised concern about their ability to accurately assemble large genomes. Illumina GA technology has been shown to be feasible for use in human whole-genome resequencing and can be used to identify single nucleotide polymorphisms (SNPs) accurately by mapping the short reads onto the known reference genome (Bentley et al. 2008; Wang et al. 2008). 
But to thoroughly annotate insertions, deletions, and structural variations, de novo assembly of each individual genome from these raw short reads is required. Currently, Sanger sequencing technology remains the dominant method for building a reference genome sequence for a species. It is, however, expensive, and this prevents many genome sequencing projects from being put into practice. Over the past 10 yr, only a limited number of plant and animal genomes have been completely sequenced (http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html), including human (Lander et al. 2001; Venter et al. 2001) and mouse (Mouse Genome Sequencing Consortium 2002), but accurate understanding of evolutionary history and biological processes at a nucleotide level requires substantially more. The development of a de novo short read assembly method would allow the building of reference sequences for these unexplored genomes in a very cost-effective way, opening the door for carrying out numerous substantial new analyses. Several programs, such as phrap (http://www.phrap.org), Celera assembler (Myers et al. 2000), ARACHNE (Batzoglou et al. 2002), Phusion (Mullikin and Ning 2003), RePS (Wang et al. 2002), PCAP (Huang et al. 2003), and Atlas (Havlak et al. 2004), have been successfully used for de novo assembly of whole-genome shotgun (WGS) sequencing reads in the projects applying the Sanger technology. These are based on an overlap-layout strategy, but for very short reads, this approach is unsuitable because it is hard to distinguish correct assembly from repetitive sequence overlap when the overlaps between reads are themselves very short. Also, in practice, it is unrealistic to hold in computer memory all the sequence overlap information from deep sequencing data that are made up of huge numbers of short reads. The de Bruijn graph data structure, introduced in the EULER (Pevzner et al. 
2001) assembler, is particularly suitable for representing the short read overlap relationship. The advantage of the data structure is that it uses K-mers as vertices and the read paths along the K-mers as edges of the graph. Hence, the graph size is determined by the genome size and repeat content of the sequenced sample, and in principle, will not be affected by the high redundancy of deep read coverage. A few short read assemblers, including Velvet (Zerbino and Birney 2008), ALLPATHS (Butler et al. 2008), and EULER-SR (Chaisson and Pevzner 2008), have adopted this approach, explicitly or implicitly, and have shown very promising performance. Some other short read assemblers have applied the overlap and extension strategy, such as SSAKE (Warren et al. 2007), VCAKE (Jeck et al. 2007) (the follower of SSAKE which can handle sequencing errors), SHARCGS (Dohm et al. 2007), and Edena (Hernandez et al. 2008). However, all these assemblers were designed to handle bacteria- or fungi-sized genomes, and cannot be applied to the assembly of large genomes, such as the human, given the limits of the available memory of current supercomputers. Recently, ABySS (Simpson et al. 2009) used a distributed de Bruijn graph algorithm that can split data and parallelize the job on a Linux cluster with message passing interface (MPI) protocol, allowing communication between nodes. Thus, it is able to handle a whole short read data set of a human individual; however, the assembly is very fragmented with an N50 length of ∼1.5 kilobases (kb). This is not long enough for structural variation detection between human individuals, nor is it good enough for gene annotation and further analysis of the genomes of novel species. Here, we present a novel short read assembly method that can build a de novo draft assembly for the human genome. 
We previously sequenced the complete genome of an Asian individual using a resequencing method, producing a total of 117.7 gigabases (Gb) of data, and now have an additional 82.5 Gb of paired-end short reads, achieving a 71× sequencing depth of the NCBI human reference sequence. We used this substantial amount of data to test our de novo assembly method, as well as the data from the African genome sequence (Bentley et al. 2008; Wang et al. 2008; Li et al. 2009a). We compared the de novo assemblies to the NCBI reference genome and demonstrated the capability of this method to accurately identify structural variations, especially small deletions and insertions that are difficult to detect using the resequencing method. This software has been integrated into the short oligonucleotide alignment program (SOAP) (Li et al. 2008, 2009b,c) package and named SOAPdenovo to indicate its functionality.
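Two ideas in this entry lend themselves to a small illustration: the de Bruijn graph, whose size does not grow with redundant read coverage, and the N50 statistic used to report contig and scaffold sizes. The following is a minimal sketch under simplifying assumptions (error-free reads, one strand, no paired-end information). It is not SOAPdenovo's implementation, and the function names are made up for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer (vertex) to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Consecutive k-mers in a read overlap by k-1 bases and form one edge.
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def n50(lengths):
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=3)
# Sequencing the same reads ten times over adds no vertices or edges.
deep_graph = build_de_bruijn(reads * 10, k=3)
print(len(graph), len(deep_graph))
print(n50([7, 4, 4, 3]))
```

Because vertices and edges are stored as sets keyed by sequence, repeated observations of the same k-mer collapse into one entry, which is the property the text attributes to the de Bruijn representation.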

7.
Microbial virulence is a complex and often multifactorial phenotype, intricately linked to a pathogen’s evolutionary trajectory. Toxicity, the ability to destroy host cell membranes, and adhesion, the ability to adhere to human tissues, are the major virulence factors of many bacterial pathogens, including Staphylococcus aureus. Here, we assayed the toxicity and adhesiveness of 90 MRSA (methicillin resistant S. aureus) isolates and found that while there was remarkably little variation in adhesion, toxicity varied by over an order of magnitude between isolates, suggesting different evolutionary selection pressures acting on these two traits. We performed a genome-wide association study (GWAS) and identified a large number of loci, as well as a putative network of epistatically interacting loci, that significantly associated with toxicity. Despite this apparent complexity in toxicity regulation, a predictive model based on a set of significant single nucleotide polymorphisms (SNPs) and insertion and deletion events (indels) showed a high degree of accuracy in predicting an isolate’s toxicity solely from the genetic signature at these sites. Our results thus highlight the potential of using sequence data to determine clinically relevant parameters and have further implications for understanding the microbial virulence of this opportunistic pathogen. A key factor affecting the severity and outcome of any infection is the virulence potential of the infecting organism. If the virulence phenotype could be determined directly from its genome sequence, next generation sequencing technology would provide for the first time an opportunity to make predictions of virulence at an early stage of infection. Since the first whole-genome sequence of a free-living organism, Haemophilus influenzae, was published (Fleischmann et al. 
1995), sequencing technology has advanced to a stage where a bacterial genome can be sequenced in a matter of hours (Parkhill and Wren 2011; Didelot et al. 2012a; Eyre et al. 2012; Köser et al. 2012a). This has led to an explosion of genomic data that has allowed us to monitor outbreaks in hospitals (Köser et al. 2012b; Young et al. 2012; Harris et al. 2013; Sherry et al. 2013; Walker et al. 2013), track strains transitioning from carrier to invasive status (Young et al. 2012), and perform detailed epidemiological studies to understand aspects of pathogen biology (Castillo-Ramírez et al. 2011, 2012; Didelot et al. 2012b; McAdam et al. 2012; Holden et al. 2013). While some success has also been achieved in predicting phenotype from genotype, such as antimicrobial resistance (Farhat et al. 2013; Holden et al. 2013), for more complex phenotypes, such as virulence, involving the contribution of several genes, this has not yet been possible. Furthermore, complex interactions between genes (epistasis) are not apparent from genome sequences alone, nor is the effect of epigenetics (Borrell and Gagneux 2011; Jelier et al. 2011; Beltrao et al. 2012; Bierne et al. 2012). Staphylococcus aureus is a major human pathogen, the treatment of which has been complicated by the worldwide emergence of multiple lineages that have acquired resistance to methicillin (methicillin resistant S. aureus, MRSA) (Lowy 1998; Gordon and Lowy 2008; Otto 2010). Its virulence is conferred by the activity of many effector molecules, which can be broadly grouped into either toxins (Lowy 1998; Gordon and Lowy 2008; Otto 2010), which cause specific tissue damage in the host, or adhesins, which facilitate adherence to and invasion of host tissues (Foster et al. 2014). The ability of toxins to lyse human cells causes local tissue damage, facilitating immune evasion, release of nutrients, dissemination within a host, and transmission to others (Lowy 1998; Gordon and Lowy 2008; Otto 2010). 
A complex network of regulatory proteins controls the expression of many individual toxins (Priest et al. 2012), such that various sites on the S. aureus chromosome contribute to the overall toxicity of an individual isolate. The ability of S. aureus cells to bind human glycoproteins, such as fibrinogen and fibronectin, is another critical determinant in disease outcome. It facilitates attachment to and damage of host tissues, host cell invasion, and systemic dissemination (Foster et al. 2014). Several genes encode fibronectin- and fibrinogen-binding proteins (e.g., fnbA, fnbB, clfA, clfB, eap, isdA, emp, ebh), whose expression is again controlled by a complex regulatory network (Priest et al. 2012). Similar to toxicity, many sites on the chromosome can therefore contribute to the overall adhesiveness of S. aureus, with many regulators common to both adhesion and toxicity (Priest et al. 2012). The success of epidemic MRSA clones such as USA300 and sequence type (ST) 239 is attributed to variation in their expression of either toxins or adhesins (Li et al. 2010; Otto 2010; Li et al. 2012). In response to the prevalence of the highly toxic USA300 clone, guidelines exist that recommend treating suspected infections with vancomycin and a second antibiotic such as clindamycin or linezolid to reduce toxin expression and the associated disease severity (http://www.hpa.org.uk/webc/HPAwebFile/HPAweb_C/1242630044068). It is therefore clear that the ability to predict whether an infecting isolate is either highly adhesive or highly toxic could allow clinicians to adapt treatment approaches and increase their index of suspicion for disease complications for infected individuals. To address this, we adopted a genome-wide association study (GWAS) and a machine learning approach to determine the feasibility of predicting virulence from the genome sequences of 90 MRSA isolates. 
Our findings demonstrate that using whole-genome sequence data for large collections of isolates to identify genetic signatures associated with a specific trait can be used to infer complex phenotypes from genotype.
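The idea of linking variant sites to a quantitative trait such as toxicity can be illustrated with a deliberately simple ranking. The sketch below scores each biallelic site by the absolute difference in mean toxicity between isolates carrying the two alleles. It is an assumption-laden toy, not the GWAS or machine-learning model used in the study: it ignores population structure, multiple testing, and epistasis, and the data are invented.

```python
from statistics import mean


def rank_sites(genotypes, toxicity):
    """Rank variant sites by absolute difference in mean toxicity between alleles.

    genotypes: one list of 0/1 allele calls per isolate (same site order in each).
    toxicity:  one toxicity measurement per isolate.
    Returns site indices, strongest contrast first.
    """
    n_sites = len(genotypes[0])
    scores = []
    for site in range(n_sites):
        ref = [t for g, t in zip(genotypes, toxicity) if g[site] == 0]
        alt = [t for g, t in zip(genotypes, toxicity) if g[site] == 1]
        if not ref or not alt:
            continue  # monomorphic site: no contrast to estimate
        scores.append((abs(mean(alt) - mean(ref)), site))
    return [site for _, site in sorted(scores, reverse=True)]


# Hypothetical toy data: four isolates, three sites.
genotypes = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
toxicity = [1.0, 1.2, 9.0, 9.4]
print(rank_sites(genotypes, toxicity))
```

In this toy set, site 0 separates low- and high-toxicity isolates cleanly, site 1 barely does, and site 2 is skipped because every isolate carries the same allele.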

9.
Short tandem repeats are among the most polymorphic loci in the human genome. These loci play a role in the etiology of a range of genetic diseases and have been frequently utilized in forensics, population genetics, and genetic genealogy. Despite this plethora of applications, little is known about the variation of most STRs in the human population. Here, we report the largest-scale analysis of human STR variation to date. We collected information for nearly 700,000 STR loci across more than 1000 individuals in Phase 1 of the 1000 Genomes Project. Extensive quality controls show that reliable allelic spectra can be obtained for close to 90% of the STR loci in the genome. We utilize this call set to analyze determinants of STR variation, assess the human reference genome’s representation of STR alleles, find STR loci with common loss-of-function alleles, and obtain initial estimates of the linkage disequilibrium between STRs and common SNPs. Overall, these analyses further elucidate the scale of genetic variation beyond classical point mutations. Short tandem repeats (STRs) are abundant repetitive elements composed of recurring DNA motifs of 2–6 bases. These loci are highly prone to mutations due to their susceptibility to slippage events during DNA replication (Ellegren 2004). To date, STR mutations have been linked to at least 40 monogenic disorders (Pearson et al. 2005; Mirkin 2007), including a range of neurological conditions such as Huntington’s disease, amyotrophic lateral sclerosis, and certain types of ataxia. Some disorders, such as Huntington’s disease, are triggered by the expansion of a large number of repeat units. In other cases, such as oculopharyngeal muscular dystrophy, a pathogenic allele is only two repeat units from the wild-type allele (Brais et al. 1998; Amiel et al. 2004). In addition to Mendelian conditions, multiple studies have suggested that STR variations contribute to an array of complex traits (Gemayel et al. 
2010), ranging from the period of the circadian clock in Drosophila (Sawyer et al. 1997) to gene expression in yeast (Vinces et al. 2009) and splicing in humans (Hefferon et al. 2004; Sathasivam et al. 2013). Beyond their importance to medical genetics, STR variations convey high information content due to their rapid mutations and multiallelic spectra. Population genetics studies have utilized STRs in a wide range of methods to find signatures of selection and to elucidate mutation patterns in nearby SNPs (Tishkoff et al. 2001; Sun et al. 2012). In DNA forensics, STRs play a significant role as both the United States and the European forensic DNA databases rely solely on these loci to create genetic fingerprints (Kayser and de Knijff 2011). Finally, the vibrant genetic genealogy community extensively uses these loci to develop impressive databases containing lineages for hundreds of thousands of individuals (Khan and Mittelman 2013). Despite the utility of STRs, systematic data about their variation in the human population are far from comprehensive. Currently, most of the genetic information concerns a few thousand loci that were part of STR linkage and association panels in the pre-SNP-array era (Broman et al. 1998; Tamiya et al. 2005) and several hundred loci involved in forensic analysis, genetic genealogy, or genetic diseases (Ruitberg et al. 2001; Pearson et al. 2005). In total, there are only 5500 loci under the microsatellite category in dbSNP139. For the vast majority of STR loci, little is known about their normal allelic ranges, frequency spectra, and population differences. This knowledge gap largely stems from the absence of high-throughput genotyping techniques for these loci (Jorgenson and Witte 2007). Capillary electrophoresis offers the most reliable method to probe these loci, but this technology scales poorly. 
More recently, several studies have begun to genotype STR loci with whole-genome sequencing data sets obtained from long read platforms such as Sanger sequencing (Payseur et al. 2011) and 454 Life Sciences (Roche) (Molla et al. 2009; Duitama et al. 2014). However, due to the relatively low throughput of these platforms, these studies analyzed STR variations in only a few genomes. Illumina sequencing has the potential to profile STR variations on a population scale. However, STR variations present significant challenges for standard sequence analysis frameworks (Treangen and Salzberg 2012). In order to reduce computation time, most alignment algorithms use heuristics that reduce their tolerance to large indels, hampering alignment of STRs with large contractions or expansions. In addition, due to the repetitive nature of STRs, the PCR steps involved in sample preparation induce in vitro slippage events (Hauge and Litt 1993). These events, called stutter noise, generate erroneous reads that mask the true genotypes. Because of these issues, previous large-scale efforts to catalog genetic variation have omitted STRs from their analyses (The 1000 Genomes Project Consortium 2012; Tennessen et al. 2012; Montgomery et al. 2013), and early attempts to analyze STRs using the 1000 Genomes Project data mainly focused on exonic regions (McIver et al. 2013) or extremely short STR regions in a relatively small number of individuals based on the native indel call set (Ananda et al. 2013). In our previous studies, we created publicly available programs that specialize in STR profiling using Illumina whole-genome sequencing data (Gymrek et al. 2012; Highnam et al. 2013). Recently, we employed one of these tools (lobSTR) to accurately genotype STRs on the Y chromosome of anonymous individuals in the 1000 Genomes Project to infer their surnames (Gymrek et al. 2013), demonstrating the potential utility of STR analysis from Illumina sequencing. 
Here, we used these tools to conduct a genome-wide analysis of STR variation in the human population using sequencing data from Phase 1 of the 1000 Genomes Project.
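The definition of an STR given in this entry (a recurring motif of 2–6 bases) can be made concrete with a short scanner. The sketch below finds only perfect, uninterrupted repeats in a single sequence. The three-copy minimum and the exclusion of single-base runs are assumptions chosen for the example, and nothing here reflects how lobSTR or the other cited tools genotype STRs from reads.

```python
import re

# A motif of 2-6 bases followed by at least two further copies of itself,
# i.e. three or more copies in total (an arbitrary threshold for this sketch).
STR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{2,}")


def find_strs(sequence):
    """Return (start, motif, copies) for each perfect tandem repeat found."""
    hits = []
    for match in STR_PATTERN.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAA
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


print(find_strs("TTGACACACACAGG"))
print(find_strs("TTCAGCAGCAGTT"))
```

The lazy quantifier makes the scanner prefer the shortest motif that explains a run, so ACACACAC is reported as four copies of AC and not two copies of ACAC. A repeat is reported in whatever phase the scan first encounters it, so the same locus may appear as CAG or GCA depending on the flanking bases.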

20.
The relationship between genotype mutations and phenotype variations determines health in the short term and evolution over the long term, and it hinges on the action of mutations on fitness. A fundamental difficulty in determining this action, however, is that it depends on the unique context of each mutation, which is complex and often cryptic. As a result, the effect of most genome variations on molecular function and overall fitness remains unknown and stands apart from population genetics theories linking fitness effect to polymorphism frequency. Here, we hypothesize that evolution is a continuous and differentiable physical process coupling genotype to phenotype. This leads to a formal equation for the action of coding mutations on fitness that can be interpreted as a product of the evolutionary importance of the mutated site with the difference in amino acid similarity. Approximations for these terms are readily computable from phylogenetic sequence analysis, and we show mutational, clinical, and population genetic evidence that this action equation predicts the effect of point mutations in vivo and in vitro in diverse proteins, correlates disease-causing gene mutations with morbidity, and determines the frequency of human coding polymorphisms, respectively. Thus, elementary calculus and phylogenetics can be integrated into a perturbation analysis of the evolutionary relationship between genotype and phenotype that quantitatively links point mutations to function and fitness and that opens a new analytic framework for equations of biology. In practice, this work explicitly bridges molecular evolution with population genetics with applications from protein redesign to the clinical assessment of human genetic variations. Each birth introduces about 70 new human genetic mutations (Keightley 2012) that have led, over generations, to the current four million DNA differences among randomly chosen individuals. 
Besides insertions, deletions, copy number variations, and chromosomal rearrangements, genetic alterations include single nucleotide substitutions that translate into nearly 10,000 amino acid substitutions per human exome (Ng et al. 2008; Lupski et al. 2010). These protein-coding variants can affect fitness (Eyre-Walker and Keightley 2007), account for 85% of known disease mutations (Choi et al. 2009), and are associated with more than 2500 ailments (Botstein and Risch 2003; Bodmer and Bonilla 2008). Nevertheless, association studies explain only a fraction of disease susceptibility (McCarthy and Hirschhorn 2008), and the role of both private and common mutations remains unclear (Ng et al. 2008). Computational approaches therefore aim to identify which coding variations cause disease (Ng and Henikoff 2001; Stone and Sidow 2005; Adzhubei et al. 2010) within the limitations of biophysical, statistical, and machine-learning models of protein function (Chun and Fay 2009; Hicks et al. 2011). In parallel, a large body of theory models the spread and fixation of mutations (Orr 2005), their distribution for various population sizes and fitness effects (Eyre-Walker and Keightley 2007), and whether selection or drift dominates their fate (Nei 2007). However, without a practical measure of the action of mutations on fitness, the theory cannot be applied to the massive inflow of genetic information (Orr 2005; Losos et al. 2013). Here, we follow the perspective that evolution proceeds in infinitesimal mutational steps (Fisher 1930; Orr 2005) to propose an equation for the Evolutionary Action of a mutation on fitness. This action equation is derived from a model of the genotype-phenotype relationship that is simpler than current models (Choi et al. 2008; Kleinman et al. 2010; Grahnen et al. 
2011) and that is compatible with the theory of nearly neutral evolution (Ohta 1992) and with fundamental variational principles of physics describing how physical systems evolve to follow paths of least action. The computed Evolutionary Action consistently outperformed the most sophisticated homology-based or machine-learning methods that predict the impact of mutations in both retrospective and prospective assessments. Retrospective validation included large data sets of (1) experimental assays of molecular function; (2) human disease association; and (3) population-wide polymorphisms. Prospective validation involved the CAGI (Critical Assessment of Genome Interpretation) community contest, which challenged predictors to estimate the impact of 84 mutations on the enzymatic activity of cystathionine beta-synthase.
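Read literally, the verbal description of the action equation in this entry can be written compactly as a first-order perturbation. The notation below is an assumption introduced here for illustration, since the abstract gives no symbols: f maps a genotype γ to a fitness phenotype φ, r_i is the residue at sequence position i, and X → Y is the amino acid substitution at that position.

```latex
\varphi = f(\gamma)
\quad\Longrightarrow\quad
\Delta\varphi \;\approx\; \frac{\partial f}{\partial r_i}\,\Delta r_{i,\,X \to Y}
```

In this reading, the partial derivative plays the role of the evolutionary importance of the mutated site, and the second factor is the size of the substitution, measured by how dissimilar the two amino acids are. Both terms are the quantities the abstract says can be approximated from phylogenetic sequence analysis.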


Copyright©北京勤云科技发展有限公司  京ICP备09084417号