Similar articles
20 similar articles found.
1.
Long-range and highly accurate de novo assembly from short-read data is one of the most pressing challenges in genomics. Recently, it has been shown that read pairs generated by proximity ligation of DNA in chromatin of living tissue can address this problem, dramatically increasing the scaffold contiguity of assemblies. Here, we describe a simpler approach (“Chicago”) based on in vitro reconstituted chromatin. We generated two Chicago data sets with human DNA and developed a statistical model and a new software pipeline (“HiRise”) that can identify poor quality joins and produce accurate, long-range sequence scaffolds. We used these to construct a highly accurate de novo assembly and scaffolding of a human genome with scaffold N50 of 20 Mbp. We also demonstrated the utility of Chicago for improving existing assemblies by reassembling and scaffolding the genome of the American alligator. With a single library and one lane of Illumina HiSeq sequencing, we increased the scaffold N50 of the American alligator from 508 kbp to 10 Mbp.A “holy grail” of genomics is the accurate reconstruction of full-length haplotype-resolved chromosome sequences with low effort and cost. High-throughput sequencing methods have sparked a revolution in the field of genomics. By generating data from millions of short fragments of DNA at once, the cost of resequencing genomes has fallen dramatically, rapidly approaching $1000 per human genome (Sheridan 2014). Substantial obstacles remain, however, in transforming short read sequences into long, contiguous genomic assemblies.Currently accessible and affordable high-throughput sequencing methods are best suited to the characterization of short-range sequence contiguity and genomic variation. Achieving long-range linkage and haplotype phasing requires either the ability to directly and accurately read long (i.e., tens of kilobase) sequences or the capture of linkage and phase relationships through paired or grouped sequence reads.A number of methods for increasing the contiguity and accuracy of de novo assemblies have recently been developed. Broadly, they attempt either to increase the read lengths generated from sequencing or to increase the insert size between paired short reads that can subsequently be used to scaffold genome assemblies. For example, the PacBio RS II chemistry updated in 2014 is advertised as producing raw reads with mean lengths of 15 kbp but suffers from error rates as high as ∼15% and remains about 100-fold more expensive than high-throughput short reads (Koren et al. 2012; Quail et al. 2012). Commercially available long-reads from Oxford Nanopore are promising but have even higher error rates and lower throughput (Goodwin et al. 2015). These long-read technologies greatly simplify the process of assembly since, in many cases, repetitive or otherwise ambiguous regions of a genome are traversed in single reads. Illumina''s TruSeq synthetic long-read technology (formerly Moleculo) is limited to 10-kbp reads maximum (Voskoboynik et al. 2013). CPT-seq is somewhat similar in approach but does not rely on long-range PCR amplification (Adey et al. 2014; Amini et al. 2014). Despite a number of improvements, fosmid library creation (Williams et al. 2012; Wu et al. 2012) remains time-consuming and expensive. 
To date, the community has not settled on a consistently superior technology for large inserts or long reads that is available at the scale and cost needed for large-scale projects like the sequencing of thousands of vertebrate species (Genome 10K Community of Scientists 2009) or hundreds of thousands of humans (Torjesen 2013).The challenge of creating reference-quality assemblies from low-cost sequence data is evident in the comparison of the quality of assemblies generated with today''s technologies and the human reference assembly (Alkan et al. 2011). Many techniques, including BAC clone sequencing, physical maps, and Sanger sequencing, were used to create the high-quality and highly contiguous human reference standard with an 38.5-Mbp N50 length (the size of the scaffold at which at least half of the genome assembly can be found on scaffolds at least that large) and error rate of one per 100,000 bases (International Human Genome Sequencing Consortium 2004). In contrast, a recent comparison of the performance of whole-genome shotgun (WGS) assembly software pipelines, each run by their developers on very high coverage data sets from libraries with multiple insert sizes, produced assemblies with N50 scaffold length ranging up to 4.5 Mbp on a fish genome and 4.0 Mbp on a snake genome (Bradnam et al. 2013).High coverage of sequence with short reads is rarely enough to attain a high-quality and highly contiguous assembly. This is due primarily to repetitive content on both large and small scales, including the repetitive structure near centromeres and telomeres, large paralogous gene families like zinc finger genes, and the distribution of interspersed nuclear elements such as LINEs and SINEs. Such difficult-to-assemble content composes large portions of many eukaryotic genomes, for example, 60%–70% of the human genome (de Koning et al. 2011). When such repeats cannot be spanned by the input sequence data, fragmented and incorrect assemblies result. In general, the starting point for de novo assembly combines deep-coverage (50×–200× minimum), short-range (300–500 bp) paired-end “shotgun” data with intermediate range “mate-pair” libraries with insert sizes between 2 and 8 kbp and longer range (35-kbp) fosmid end pairs (Gnerre et al. 2011; Salzberg et al. 2012). However, even mate-pair data spanning these distances is often not completely adequate for generating megabase scale assembles.Recently, high-throughput short-read sequencing has been used to characterize the three-dimensional structure of chromosomes in living cells. Proximity ligation–based methods like Hi-C (Lieberman-Aiden et al. 2009) and other chromatin capture–based methods (Dixon et al. 2012; Kalhor et al. 2012) rely on the fact that, after fixation, segments of DNA in close proximity in the nucleus are more likely to be ligated together, and thus sequenced as pairs, than are distant regions. As a result, the number of read pairs between intrachromosomal regions is a slowly decreasing function of the genomic distance between them. Several approaches have been developed that exploit this information for the purpose of genome assembly scaffolding and haplotype phasing (Burton et al. 2013; Kaplan and Dekker 2013; Selvaraj et al. 2013; Marie-Nelly et al. 2014).While Hi-C and related methods can identify biologically mediated long-range chromatin contacts at multi-megabase length scales, most of the data describe DNA–DNA proximity on the scale of tens or hundreds of kilobases. 
These contacts arise from the polymer physics of the nucleosome-wound DNA fiber rather than from chromatin biology. In fact, the large-scale organization of chromosomes in nuclei provides a confounding signal for assembly since, for example, telomeres of different chromosomes are often associated in cells.We demonstrate here that DNA linkages up to several hundred kilobases can be produced in vitro using reconstituted chromatin rather than living chromosomes as the substrate for the production of proximity ligation libraries. The resulting libraries share many of the characteristics of Hi-C data that are useful for long-range genome assembly, including a regular relationship between within–read pair distance and read count. By combining this in vitro long-range linking library with standard WGS and jumping libraries, we generated a de novo human genome assembly with long-range accuracy and contiguity comparable to more expensive methods for a fraction of the cost and effort. This method, called “Chicago,” depends only on the availability of modest amounts of high-molecular-weight DNA and is generally applicable to any species. Here we demonstrate the value of this Chicago data not only for de novo genome assembly using human and alligator but also as an efficient tool for the identification and phasing of structural variants.  相似文献   
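Scaffold N50, as defined in the abstract above (the length at which scaffolds of that size or larger cover at least half of the assembly), is straightforward to compute from a list of scaffold lengths. A minimal Python sketch with invented lengths, not data from the paper:

```python
def n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together cover at least half of the total assembly span."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example: five scaffolds totaling 100 kbp; half the assembly (50 kbp)
# is first reached inside the 30-kbp scaffold, so N50 = 30,000.
print(n50([40_000, 30_000, 15_000, 10_000, 5_000]))  # -> 30000
```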

2.
Accurate evaluation of microbial communities is essential for understanding global biogeochemical processes and can guide bioremediation and medical treatments. Metagenomics is most commonly used to analyze microbial diversity and metabolic potential, but assemblies of the short reads generated by current sequencing platforms may fail to recover heterogeneous strain populations and rare organisms. Here we used short (150-bp) and long (multi-kb) synthetic reads to evaluate strain heterogeneity and study microorganisms at low abundance in complex microbial communities from terrestrial sediments. The long-read data revealed multiple (probably dozens of) closely related species and strains from previously undescribed Deltaproteobacteria and Aminicenantes (candidate phylum OP8). Notably, these are the most abundant organisms in the communities, yet short-read assemblies achieved only partial genome coverage, mostly in the form of short scaffolds (N50 = ∼2200 bp). Genome architecture and metabolic potential for these lineages were reconstructed using a new synteny-based method. Analysis of long-read data also revealed thousands of species whose abundances were <0.1% in all samples. Most of the organisms in this “long tail” of rare organisms belong to phyla that are also represented by abundant organisms. Genes encoding glycosyl hydrolases are significantly more abundant than expected in rare genomes, suggesting that rare species may augment the capability for carbon turnover and confer resilience to changing environmental conditions. Overall, the study showed that a diversity of closely related strains and rare organisms account for a major portion of the communities. These are probably common features of many microbial communities and can be effectively studied using a combination of long and short reads.Metagenomics is a cultivation-independent approach for studying microbial communities. The dramatic increase in DNA sequencing throughput, accompanied by the development of new bioinformatics approaches for the assembly (Peng et al. 2012) and binning (Dick et al. 2009; Baran and Halperin 2012; Sharon et al. 2013) of metagenomic data facilitate the study of microbial communities based on the genomes of their members (Tyson et al. 2004; Goltsman et al. 2009; Wrighton et al. 2012; Brown et al. 2013; Sharon et al. 2013). One major drawback of most commonly used sequencing platforms is read length, which is typically in the range of a few hundred base pairs (bp). Short-read length is compensated by high throughput, which provides high coverage for the abundant genomes in the community, allowing assembly and sometimes even complete genome recovery (Iverson et al. 2012; Albertsen et al. 2013; Castelle et al. 2013; Di Rienzi et al. 2013; Kantor et al. 2013). Short-read assemblers sometimes fail to assemble similar repeating regions (Miller et al. 2011). For de Bruijn graph assemblers such as IDBA-UD (Peng et al. 2012) and Ray Meta (Boisvert et al. 2012), the presence of multiple similar regions should result in bubbles and short paths in the de Bruijn graph. Consequently, assembly of these regions will result in short assembled contigs or elimination of the regions altogether. Assemblers are therefore expected to perform poorly in the presence of multiple similar genomes from closely related species and strains. In addition, the assembly of rare genomes fails due to insufficient sequencing coverage. 
While these issues can limit understanding of the true community composition, the extent to which they do so is currently unknown.Previously, we used short-read (150-bp) metagenomic data to study microbial communities in terrestrial sediments from a site near Rifle, Colorado (Castelle et al. 2013; Hug et al. 2013). Terrestrial sediments are major reservoirs of organic matter on Earth, and microbes play key roles in carbon turnover in these environments (Whitman et al. 1998), thus affecting the global carbon cycle. Recent studies showed that many of the microbes living in terrestrial sediments are fermenters that belong to candidate phyla or to deep branches with no close cultured representatives of other phyla. Communities in these environments are complex, with no single organism''s share typically exceeding 1% (Wrighton et al. 2012; Castelle et al. 2013; Hug et al. 2013; Kantor et al. 2013). Nearly two hundred different species per community were detected in samples recovered from the same site (Wrighton et al. 2012, 2014; Castelle et al. 2013); however, the true community complexity of sediment communities is currently unknown.Recently, a new sequencing technology (previously licensed to Moleculo, acquired by Illumina in 2012 and name changed to Illumina TruSeq Synthetic Long-Reads) that enables the sequencing of long multi-kb synthetic reads was introduced (Voskoboynik et al. 2013). Here we used the new synthetic long-read technology in tandem with the previously sequenced short (150-bp) reads to metagenomically study the sediment microbial communities described in Castelle et al. (2013) and Hug et al. (2013). The main objectives of the study were to (1) test the efficacy of assembling the reads and using them to improve the scaffolding of contigs generated by short-read assembly, (2) evaluate the accuracy of genomes reconstructed through curation of short-read assemblies, (3) provide insight into organisms present at very low abundance levels, and (4) evaluate levels of sequence variation and genomic content in populations of closely related species and strains.  相似文献   
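The abundance figures quoted above (for example, organisms below 0.1% in all samples) come down to length-normalized read counts per genome bin. The sketch below shows only that generic calculation; it is not the authors' pipeline, and the counts and genome sizes are invented:

```python
def relative_abundance(read_counts, genome_sizes):
    """Length-normalized relative abundance from reads mapped to genome bins.
    read_counts and genome_sizes are dicts keyed by bin name; all numbers in
    the example below are invented for illustration."""
    depth = {g: read_counts[g] / genome_sizes[g] for g in read_counts}
    total = sum(depth.values())
    return {g: depth[g] / total for g in depth}

counts = {"Deltaproteobacteria_bin1": 120_000, "Aminicenantes_bin1": 90_000,
          "rare_bin42": 150}
sizes = {"Deltaproteobacteria_bin1": 4_500_000, "Aminicenantes_bin1": 3_800_000,
         "rare_bin42": 2_900_000}
for name, frac in relative_abundance(counts, sizes).items():
    print(f"{name}\t{frac:.4%}")
```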

3.
Next-generation massively parallel DNA sequencing technologies provide ultrahigh throughput at a substantially lower unit data cost; however, the data are very short read length sequences, making de novo assembly extremely challenging. Here, we describe a novel method for de novo assembly of large genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving an N50 contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.The development and commercialization of next-generation massively parallel DNA sequencing technologies, including Illumina Genome Analyzer (GA) (Bentley 2006), Applied Biosystems SOLiD System, and Helicos BioSciences HeliScope (Harris et al. 2008), have revolutionized genomic research. Compared to traditional Sanger capillary-based electrophoresis systems, these new technologies provide ultrahigh throughput with two orders of magnitude lower unit data cost. However, they all share a common intrinsic characteristic of providing very short read length, currently 25–75 base pairs (bp), which is substantially shorter than the Sanger sequencing reads (500–1000 bp) (Shendure et al. 2004). This has raised concern about their ability to accurately assemble large genomes. Illumina GA technology has been shown to be feasible for use in human whole-genome resequencing and can be used to identify single nucleotide polymorphisms (SNPs) accurately by mapping the short reads onto the known reference genome (Bentley et al. 2008; Wang et al. 2008). But to thoroughly annotate insertions, deletions, and structural variations, de novo assembly of each individual genome from these raw short reads is required.Currently, Sanger sequencing technology remains the dominant method for building a reference genome sequence for a species. It is, however, expensive, and this prevents many genome sequencing projects from being put into practice. Over the past 10 yr, only a limited number of plant and animal genomes have been completely sequenced, (http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html), including human (Lander et al. 2001; Venter et al. 2001) and mouse (Mouse Genome Sequencing Consortium 2002), but accurate understanding of evolutionary history and biological processes at a nucleotide level requires substantially more. The development of a de novo short read assembly method would allow the building of reference sequences for these unexplored genomes in a very cost-effective way, opening the door for carrying out numerous substantial new analyses.Several programs, such as phrap (http://www.phrap.org), Celera assembler (Myers et al. 2000), ARACHNE (Batzoglou et al. 2002), Phusion (Mullikin and Ning 2003), RePS (Wang et al. 2002), PCAP (Huang et al. 2003), and Atlas (Havlak et al. 2004), have been successfully used for de novo assembly of whole-genome shotgun (WGS) sequencing reads in the projects applying the Sanger technology. These are based on an overlap-layout strategy, but for very short reads, this approach is unsuitable because it is hard to distinguish correct assembly from repetitive sequence overlap due to there being only a very short sequence overlap between these short reads. 
Also, in practice, it is unrealistic to record into a computer memory all the sequence overlap information from deep sequencing that are made up of huge numbers of short reads.The de Bruijn graph data structure, introduced in the EULER (Pevzner et al. 2001) assembler, is particularly suitable for representing the short read overlap relationship. The advantage of the data structure is that it uses K-mer as vertex, and read path along the K-mers as edges on the graph. Hence, the graph size is determined by the genome size and repeat content of the sequenced sample, and in principle, will not be affected by the high redundancy of deep read coverage. A few short read assemblers, including Velvet (Zerbino and Birney 2008), ALLPATHS (Butler et al. 2008), and EULER-SR (Chaisson and Pevzner 2008), have adopted this algorithm, explicitly or implicitly, and have been implemented and shown very promising performances. Some other short read assemblers have applied the overlap and extension strategy, such as SSAKE (Warren et al. 2007), VCAKE (Jeck et al. 2007) (the follower of SSAKE which can handle sequencing errors), SHARCGS (Dohm et al. 2007), and Edena (Hernandez et al. 2008). However, all these assemblers were designed to handle bacteria- or fungi-sized genomes, and cannot be applied for assembly of large genomes, such as the human, given the limits of the available memory of current supercomputers. Recently, ABySS (Simpson et al. 2009) used a distributed de Bruijn graph algorithm that can split data and parallelize the job on a Linux cluster with message passing interface (MPI) protocol, allowing communication between nodes. Thus, it is able to handle a whole short read data set of a human individual; however, the assembly is very fragmented with an N50 length of ∼1.5 kilobases (kb). This is not long enough for structural variation detection between human individuals, nor is it good enough for gene annotation and further analysis of the genomes of novel species.Here, we present a novel short read assembly method that can build a de novo draft assembly for the human genome. We previously sequenced the complete genome of an Asian individual using a resequencing method, producing a total of 117.7 gigabytes (Gb) of data, and have now an additional 82.5 Gb of paired-end short reads, achieving a 71× sequencing depth of the NCBI human reference sequence. We used this substantial amount of data to test our de novo assembly method, as well as the data from the African genome sequence (Bentley et al. 2008; Wang et al. 2008; Li et al. 2009a). We compared the de novo assemblies to the NCBI reference genome and demonstrated the capability of this method to accurately identify structural variations, especially small deletions and insertions that are difficult to detect using the resequencing method. This software has been integrated into the short oligonucleotide alignment program (SOAP) (Li et al. 2008, 2009b,c) package and named SOAPdenovo to indicate its functionality.  相似文献   
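The de Bruijn graph structure described above, with k-mers as vertices and read paths along overlapping k-mers as edges, can be illustrated in a few lines. This toy version omits error correction, reverse complements, and the memory engineering that large-genome assemblers such as SOAPdenovo actually require:

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Toy de Bruijn graph: nodes are (k-1)-mers, and each k-mer in a read
    contributes one edge from its prefix to its suffix. No error correction,
    no reverse complements -- just the core data structure."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

reads = ["ACGTAC", "CGTACG"]  # invented 6-bp "reads"
for prefix, suffixes in de_bruijn_edges(reads, k=4).items():
    print(prefix, "->", suffixes)
```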

4.
Eliminating the bacterial cloning step has been a major factor in the vastly improved efficiency of massively parallel sequencing approaches. However, this also has made it a technical challenge to produce the modern equivalent of the Fosmid- or BAC-end sequences that were crucial for assembling and analyzing complex genomes during the Sanger-based sequencing era. To close this technology gap, we developed Fosill, a method for converting Fosmids to Illumina-compatible jumping libraries. We constructed Fosmid libraries in vectors with Illumina primer sequences and specific nicking sites flanking the cloning site. Our family of pFosill vectors allows multiplex Fosmid cloning of end-tagged genomic fragments without physical size selection and is compatible with standard and multiplex paired-end Illumina sequencing. To excise the bulk of each cloned insert, we introduced two nicks in the vector, translated them into the inserts, and cleaved them. Recircularization of the vector via coligation of insert termini followed by inverse PCR generates a jumping library for paired-end sequencing with 101-base reads. The yield of unique Fosmid-sized jumps is sufficiently high, and the background of short, incorrectly spaced and chimeric artifacts sufficiently low, to enable applications such as mapping of structural variation and scaffolding of de novo assemblies. We demonstrate the power of Fosill to map genome rearrangements in a cancer cell line and identified three fusion genes that were corroborated by RNA-seq data. Our Fosill-powered assembly of the mouse genome has an N50 scaffold length of 17.0 Mb, rivaling the connectivity (16.9 Mb) of the Sanger-sequencing based draft assembly.Paired-end sequencing of large DNA fragments cloned in Fosmid (Kim et al. 1992) or BAC (Shizuya et al. 1992) vectors were a mainstay of genome projects during the Sanger-based sequencing era. The large spans, particularly of BAC ends, helped resolve long repeats and segmental duplications and provided long-range connectivity in shotgun assemblies of complex genomes (Adams et al. 2000; Venter et al. 2001; Waterston et al. 2002). Fosmids are shorter than BACs but much easier to generate. Their consistent, narrow insert-size distribution centered around 35–40 kb enabled the scanning of individual human genomes with read pairs to detect structural variation such as insertions, deletions, and inversions (International Human Genome Sequencing Consortium 2004; Tuzun et al. 2005; Kidd et al. 2008).Massively parallel genome-sequencing technologies no longer rely on cloning DNA fragments in a bacterial host. The platforms currently on the market (454, Illumina, SOLiD, Ion Torrent) replaced vectors with synthetic adapters and bacterial colonies with PCR-amplified “clones” of DNA fragments tethered to a bead (Margulies et al. 2005; McKernan et al. 2009) or with “colonies” of identical molecules grown by bridge PCR amplification on a solid surface (Bentley et al. 2008).However, none of these platforms can handle DNA molecules much longer than 1 kb. Consequently, paired-end sequencing of DNA fragments >1 kb by these technologies requires “jumping” constructs (Collins and Weissman 1984; Poustka et al. 
1987): the ends of size-selected genomic DNA fragments are brought together by circularization, the bulk of the intervening DNA is excised, and the coligated junction fragments are isolated and end-sequenced.Suitable protocols exist for converting sheared and size-selected DNA samples to jumping libraries and for generating read pairs that span several kb of genomic distance which is generally sufficient to fashion accurate and highly contiguous de novo assemblies of microbial genomes from massively parallel short sequencing reads (MacCallum et al. 2009; Nelson et al. 2010; Nowrousian et al. 2010). However, early short-read assemblies of complex genomes, including human genomes, turned out fragmented—despite jumps up to ∼12 kb in length (Li et al. 2010a,b; Schuster et al. 2010; Yan et al. 2011). Without the equivalent of Fosmid or BAC end sequences, the N50 scaffold length (a measure of long-range connectivity) of these assemblies was <1.3 Mb. By comparison, largely owing to paired-end reads from large-insert clones, some of the best traditional Sanger-based mammalian draft assemblies had N50 scaffold lengths of >40 Mb (Lindblad-Toh et al. 2005; Mikkelsen et al. 2007).Constructing a jumping library entails numerous physical and enzymatic DNA manipulations. Several steps, notably size selection and circularization of genomic DNA fragments in vitro, become increasingly difficult and inefficient as the desired jump length, and hence, fragment length, goes up. In contrast, Fosmid cloning employs a sophisticated biological machinery to carry out these critical steps: Large fragments are size-selected (and short competing fragments excluded) by packaging in bacteriophage λ; once inside the Escherichia coli host, cohesive ends mediate efficient circularization—aided by the cellular machinery and a powerful selection for circular amplicons.To our knowledge, no jumping library constructed to date from uncloned DNA fragments has approached the average span (35–40 kb) and complexity (>105 independent clones per μg of input DNA) of a traditional Fosmid library. To close this technology gap, we and others have taken a hybrid approach wherein Fosmid libraries are constructed first and then converted to Fosmid-size jumps in vitro (Gnerre et al. 2011; Hampton et al. 2011).Here, we present the experimental details of the “Fosill” concept (Gnerre et al. 2011) as well as extensive improvements of the original protocol. The term Fosill stands for paired-end sequencing of Fosmid libraries by Illumina, though we note that this approach should work for any massively parallel sequencing technology that can generate paired reads. We describe the methodology and novel cloning vectors that enable molecular barcoding of DNA inserts and multiplex Fosmid library construction without physical size selection of sheared genomic DNA. We demonstrate the power of Fosill to detect structural abnormalities in cancer genomes and to improve de novo assemblies of mammalian genomes from short reads.  相似文献   
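Once a jumping library is sequenced and aligned, read pairs are typically triaged by orientation and implied span to separate genuine Fosmid-sized jumps from short-insert and chimeric artifacts. The sketch below shows one plausible triage rule; the span window and category names are illustrative assumptions, not thresholds from the Fosill protocol:

```python
def classify_jump(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  expected_span=(30_000, 45_000)):
    """Rough triage of a mapped jumping-library read pair. The expected span
    window (~35-40 kb for Fosmid jumps) and the category names are
    illustrative, not taken from the Fosill pipeline."""
    if chrom1 != chrom2:
        return "candidate_rearrangement_or_chimera"
    span = abs(pos2 - pos1)
    if strand1 == strand2:
        return "inverted_orientation_artifact"
    if span < 1_000:
        return "short_insert_contaminant"  # non-jump fragment
    lo, hi = expected_span
    return "fosmid_sized_jump" if lo <= span <= hi else "incorrectly_spaced"

print(classify_jump("chr1", 1_000_000, "+", "chr1", 1_037_500, "-"))  # fosmid_sized_jump
```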

5.
We present the discovery of genes recurrently involved in structural variation in nasopharyngeal carcinoma (NPC) and the identification of a novel type of somatic structural variant. We identified the variants with high complexity mate-pair libraries and a novel computational algorithm specifically designed for tumor-normal comparisons, SMASH. SMASH combines signals from split reads and mate-pair discordance to detect somatic structural variants. We demonstrate a >90% validation rate and a breakpoint reconstruction accuracy of 3 bp by Sanger sequencing. Our approach identified three in-frame gene fusions (YAP1-MAML2, PTPLB-RSRC1, and SP3-PTK2) that had strong levels of expression in corresponding NPC tissues. We found two cases of a novel type of structural variant, which we call “coupled inversion,” one of which produced the YAP1-MAML2 fusion. To investigate whether the identified fusion genes are recurrent, we performed fluorescent in situ hybridization (FISH) to screen 196 independent NPC cases. We observed recurrent rearrangements of MAML2 (three cases), PTK2 (six cases), and SP3 (two cases), corresponding to a combined rate of structural variation recurrence of 6% among tested NPC tissues.Nasopharyngeal carcinoma (NPC) is a malignant neoplasm of the head and neck originating in the epithelial lining of the nasopharynx. It has a high incidence among the native people of the American Arctic and Greenland and in southern Asia (Yu and Yuan 2002). NPC is strongly linked to consumption of Cantonese salted fish (Ning et al. 1990) and infection with Epstein-Barr virus (EBV) (Raab-Traub 2002), which is almost invariably present within the cancer cells and is thought to promote oncogenic transformation (zur Hausen et al. 1970). A challenging feature of NPC genome sequencing is that significant lymphocyte infiltration (e.g., 80% of cells in a sample) (Jayasurya et al. 2000) is common, requiring special laboratory and bioinformatic approaches not necessary for higher-purity tumors (Mardis et al. 2009).Currently, cancer genomes are analyzed by reading short sequences (usually 100 bases) from the ends of library DNA inserts 200–500 bp in length (Meyerson et al. 2010). For technical reasons, it is difficult to obtain deep genome coverage with inserts exceeding 600 bp using this approach. Much larger inserts can be produced by circularizing large DNA fragments of up to 10 kb (Fullwood et al. 2009; Hillmer et al. 2011), and subsequent isolation of a short fragment that contains both ends (mate pairs). Large-insert and fosmid mate-pair libraries offer several attractive features that make them well-suited for analysis of structural variation (Raphael et al. 2003; International Human Genome Sequencing Consortium 2004; Tuzun et al. 2005; Kidd et al. 2008; Hampton et al. 2011; Williams et al. 2012). First, mate pairs inherently capture genomic structure in that discordantly aligning mate-pair reads occur at sites of genomic rearrangements, exposing underlying lesions (Supplemental Fig. S1a). Second, large-insert mate-pair libraries deliver deep physical coverage of the genome (100–1000×), reliably revealing somatic structural variants even in specimens with low tumor content (Supplemental Fig. S1a,b). Third, variant-supporting mate-pair reads from large inserts may align up to several kilobases away from a breakpoint, beyond the repeats that often catalyze structural variants (Supplemental Fig. S1c). 
For these reasons, large insert mate-pair libraries have been used extensively for de novo assembly of genomes and for identification of inherited structural variants (International Human Genome Sequencing Consortium 2001). In principle, they should also be well-suited for analysis of “difficult” low-tumor purity cancer tissues such as NPCs. In a recent study, paired-end fosmid sequencing libraries with nearly 40 kb inserts were adapted to Illumina sequencing and applied to the K562 cell line (Williams et al. 2012). Due to low library complexity, of 33.9 million sequenced read pairs, only about 7 million were unique (corresponding to about 0.5× genome sequence coverage), but nonetheless facilitated structural variation detection.Mate-pair techniques have not yet been applied to produce truly deep sequencing data sets of tumor-normal samples (30× or greater sequence coverage), presumably due to the difficulty of retaining sufficient library complexity to support deep sequencing. We have improved the efficiency of library preparation by combining two existing protocols (Supplemental Fig. S2), and so were able to generate 3.5-kb insert libraries with sufficient genomic complexity to enable deep sequencing of two NPC genomes. To take full advantage of unique features offered by large-insert libraries, such as the large footprints of breakpoint-spanning inserts (Supplemental Fig. S1c) and the correlation between the two ends of alignment coordinates of breakpoint-spanning inserts, we also developed a novel somatic structural variant caller. SMASH (Somatic Mutation Analysis by Sequencing and Homology detection) is specifically designed to accurately map somatic structural lesions, including deletions, duplications, translocations, and large duplicative insertions via direct comparison of tumor and normal data sets.Structural variation methods, such as GASV (Sindi et al. 2009), SegSeq (Chiang et al. 2009), DELLY (Rausch et al. 2012b), HYDRA (Quinlan et al. 2010), AGE (Abyzov and Gerstein 2011), and others (Lee et al. 2008; for review, see Snyder et al. 2010; Alkan et al. 2011), generally utilize (1) read-pair (RP) discordance, (2) increase or reduction in sequence coverage, (3) split reads that span breakpoints, and (4) exact assembly of breakpoint sequences. These tools were primarily designed for variant detection from a single data set, such as a normal genome, and are suited for cataloguing structural polymorphisms in the human population (Kidd et al. 2008, 2010; Mills et al. 2011). However, specific detection of somatic structural variants in cancer using these tools typically requires additional downstream custom analysis to enable “subtraction” of germline variants from the tumor variant calls (Rausch et al. 2012a). This limits the general utility of such tools for somatic variant detection. Recently, as an increasing number of studies specifically focused on somatic mutations, dedicated somatic variant callers such as CREST (Wang et al. 2011) have been developed. CREST relies on detection of partially aligned reads (known as “soft clipping”) across a breakpoint.SMASH adopts a hybrid approach to somatic variant detection. It relies on read-pair discordance to discover somatic breakpoints and then uses split reads to refine their coordinates. Furthermore, SMASH incorporates a number of important quality measures and filters, which are critical for minimizing the rate of false positive somatic SV calls (Supplemental Material). 
As we demonstrate here with both simulated and real sequence data, such a hybrid approach delivers high sensitivity of somatic SV detection due to the read-pair discordance, high accuracy of breakpoint coordinates enabled by split reads, and overall low false discovery rate due to extensive use of quality measures.  相似文献   
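The first stage of the hybrid strategy described above, discovery of candidate breakpoints from read-pair discordance, can be illustrated independently of SMASH itself. The following sketch flags discordant mate pairs using invented library statistics and is not the SMASH implementation:

```python
def is_discordant(chrom1, pos1, chrom2, pos2, orientation,
                  mean_insert=3_500, sd_insert=350, n_sd=4):
    """Flag a mate pair as discordant if it maps across chromosomes, has an
    unexpected orientation, or implies an insert size far from the library
    mean. The numbers mimic a 3.5-kb mate-pair library but are illustrative."""
    if chrom1 != chrom2:
        return True
    if orientation != "expected":  # i.e., not the library's usual convention
        return True
    implied_insert = abs(pos2 - pos1)
    return abs(implied_insert - mean_insert) > n_sd * sd_insert

print(is_discordant("chr11", 101_000, "chr11", 104_600, "expected"))  # False
print(is_discordant("chr11", 101_000, "chr5", 204_600, "expected"))   # True
```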

6.
Second-generation sequencing technology can now be used to sequence an entire human genome in a matter of days and at low cost. Sequence read lengths, initially very short, have rapidly increased since the technology first appeared, and we now are seeing a growing number of efforts to sequence large genomes de novo from these short reads. In this Perspective, we describe the issues associated with short-read assembly, the different types of data produced by second-gen sequencers, and the latest assembly algorithms designed for these data. We also review the genomes that have been assembled recently from short reads and make recommendations for sequencing strategies that will yield a high-quality assembly.As genome sequencing technology has evolved, methods for assembling genomes have changed with it. Genome sequencers have never been able to “read” more than a relatively short stretch of DNA at once, with read lengths gradually increasing over time. Reconstructing a complete genome from a set of reads requires an assembly program, and a variety of genome assemblers have been used for this task. In 1995, when the first bacterial genome was published (Haemophilus influenzae), read lengths were ∼460 base pairs (bp), and that whole-genome shotgun (WGS) sequencing project generated 24,304 reads (Fleischmann et al. 1995). The human genome project required ∼30 million reads, with lengths up to 800 bp, using Sanger sequencing technology and automated capillary sequencers (International Human Genome Sequencing Consortium 2001; Venter et al. 2001). This corresponded to 24 billion bases (Gb), or approximately eightfold coverage of the 3-Gb human genome. Redundant coverage, in which on average every nucleotide is sequenced many times over, is required to produce a high-quality assembly. Another benefit of redundancy is greatly increased accuracy compared with a single read: Where a single read might have an error rate of 1%, eightfold coverage has an error rate as low as 10−16 when eight high-quality reads agree with one another. High coverage is also necessary to sequence polymorphic alleles within diploid or polyploid genomes.Current second-generation sequencing (SGS) technologies produce read lengths ranging from 35 to 400 bp, at far greater speed and much lower cost than Sanger sequencing. However, as reads get shorter, coverage needs to increase to compensate for the decreased connectivity and produce a comparable assembly. Certain problems cannot be overcome by deeper coverage: If a repetitive sequence is longer than a read, then coverage alone will never compensate, and all copies of that sequence will produce gaps in the assembly. These gaps can be spanned by paired reads—consisting of two reads generated from a single fragment of DNA and separated by a known distance—as long as the pair separation distance is longer than the repeat. Paired-end sequencing is available from most of the SGS machines, although it is not yet as flexible or as reliable as paired-end sequencing using traditional methods.After the successful assembly of the human (International Human Genome Sequencing Consortium 2001; Venter et al. 2001) and mouse (Waterston et al. 2002) genomes by whole-genome shotgun sequencing, most large-scale genome projects quickly moved to adopt the WGS approach, which has subsequently been used for dozens of eukaryotic genomes. Today, thanks to changes in sequencing technology, a major question confronting genome projects is, can we sequence a large genome (>100 Mbp) using short reads? 
If so, what are the limitations on read length, coverage, and error rates? How much paired-end sequencing is necessary? And what will the assembly look like? In this perspective we take a look at each of these questions and describe the solutions available today. Although we provide some answers, we have no doubt that the solutions will change rapidly over the next few years, as both the sequencing methods and the computational solutions improve.  相似文献   
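The consensus-accuracy figure quoted above follows from assuming independent per-read errors; the arithmetic is simply:

```python
# If each read is wrong at a site with probability 0.01, the chance that all
# eight reads are wrong there is bounded by 0.01**8 (requiring them to agree
# on the same wrong base makes it smaller still), i.e. about 1e-16 as quoted.
per_read_error = 0.01
reads_agreeing = 8
print(per_read_error ** reads_agreeing)  # ~1e-16
```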

7.
Copy number variants (CNVs) underlie a significant amount of genetic diversity and disease. CNVs can be detected by a number of means, including chromosomal microarray analysis (CMA) and whole-genome sequencing (WGS), but these approaches suffer from either limited resolution (CMA) or are highly expensive for routine screening (both CMA and WGS). As an alternative, we have developed a next-generation sequencing-based method for CNV analysis termed SMASH, for short multiply aggregated sequence homologies. SMASH utilizes random fragmentation of input genomic DNA to create chimeric sequence reads, from which multiple mappable tags can be parsed using maximal almost-unique matches (MAMs). The SMASH tags are then binned and segmented, generating a profile of genomic copy number at the desired resolution. Because fewer reads are necessary relative to WGS to give accurate CNV data, SMASH libraries can be highly multiplexed, allowing large numbers of individuals to be analyzed at low cost. Increased genomic resolution can be achieved by sequencing to higher depth.Analysis of CNVs on a genomic scale is useful for assessing cancer progression and identifying congenital genetic abnormalities (Hicks et al. 2006; Sebat et al. 2007; Marshall et al. 2008; Xu et al. 2008; Levy et al. 2011; Stadler et al. 2012; Warburton et al. 2014; for review, see Malhotra and Sebat 2012; Weischenfeldt et al. 2013). CNVs are typically identified by microarray hybridization (Iafrate et al. 2004; Sebat et al. 2004) but can also be detected by next-generation sequencing (NGS). This is generally done using algorithms that measure the number of sequence reads mapping to specific regions (Alkan et al. 2009); consequently, the resolution of sequence-based copy number methods depends largely on the number of independent mappings. The current trend in NGS technologies is to increase the number of bases read per unit cost. This is accomplished by increasing the total number of sequence reads per lane of a flow cell, as well as increasing the number of bases within each read. Because the accuracy of copy number methods is driven by the quantity of unique reads, increasing the length of reads does not improve the resolution or decrease the cost of copy number analysis.Most of the human genome is mapped well by short reads, on the order of 35–40 bp (Supplemental Fig. S1). At the moment, high-throughput sequencers with the greatest per base cost effectiveness are generating paired-end read lengths of 150 bp, well in excess of what suffices for unique mapping. In fact, variability in insert size and “mappability” of paired-end reads suggest that paired-end mapping is a poor choice for read-depth–based copy number analysis of WGS. To take advantage of current (and future) increases in read length and optimally utilize paired-end reads, we have developed SMASH to “pack” multiple independent mappings into every read pair. We accomplish this by breaking genomic DNA into small fragments with a mean length of ∼40 bp. These fragments are joined together into chimeric stretches of DNA with lengths suitable for creating NGS libraries (300–700 bp). SMASH is conceptually similar to serial analysis of gene expression (SAGE) (Velculescu et al. 1995), which utilized the generation of chimeric molecules of short cDNA-derived tags to provide a digital readout of gene expression. SMASH differs in that it (1) requires significantly longer tags than SAGE and its later variants (e.g., SuperSAGE) (Matsumura et al. 
2008) due to the complexity of genomic DNA, and (2) utilizes mechanical shearing and/or enzymatic digestion to counteract restriction enzyme bias, creating highly variable fragments of genomic DNA.The chimeric sequence reads generated by SMASH are processed using a time-efficient, memory-intensive mapping algorithm that performs a conservative partition of the long read into constituent fragments. The fragment maps are utilized in the same manner as read maps in downstream copy number analysis. For 125-bp paired-end reads, whole-genome sequencing (WGS) averages less than one map per read pair, whereas SMASH yields four to five. The quality of SMASH maps, i.e., the nonuniformities introduced by the sample preparation and sequencer and mapping bias, is of the same order as those seen with WGS mapping. Using correction and testing protocols optimized for WGS data, we show that on a map-for-map basis, SMASH generates read-depth copy number data that is virtually equivalent to WGS at a small fraction of the cost.  相似文献   
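The read-depth step described above, binning mapped tags and scaling depth to copy number, can be sketched generically. This illustration uses invented tag positions and omits the MAM-based parsing of chimeric reads and the mappability/GC corrections that SMASH applies:

```python
import statistics

def copy_number_profile(tag_positions, chrom_length, bin_size=100_000, ploidy=2):
    """Generic read-depth copy number sketch: count mapped tags per fixed bin
    and scale so the median bin corresponds to the expected ploidy. SMASH
    additionally parses several tags per read and corrects for mappability
    and GC bias, none of which is shown here."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in tag_positions:
        counts[pos // bin_size] += 1
    median = statistics.median(counts) or 1
    return [ploidy * c / median for c in counts]

# Invented toy data: uniform background plus extra tags in one bin, which
# shows up as an apparent ~6-copy amplification in bin 3.
tags = list(range(0, 1_000_000, 500)) + list(range(300_000, 400_000, 250))
print([round(x, 1) for x in copy_number_profile(tags, 1_000_000)])
```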

8.
9.
Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. Although this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of “marker” genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate-, single-cell-, and metagenome-derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.Recent advances in high-throughput sequencing combined with improving computational methods are enabling the rapid, cost effective recovery of genomes from cultivated and uncultivated microorganisms across a wide range of host-associated and environmental samples. Large-scale initiatives, such as the Genomic Encyclopedia of Bacteria and Archaea (GEBA) (Wu et al. 2009), aim to provide reference genomes from isolated species across the Tree of Life, whereas targeted efforts such as the Human Microbiome Project (HMP) (Turnbaugh et al. 2007) and the GEBA-Root Nodulating Bacteria (GEBA-RNB; http://jgi.doe.gov/) initiatives are providing reference genomes necessary for understanding the role of microorganisms in specific habitats. These efforts are complemented by initiatives such as the GEBA-Microbial Dark Matter (GEBA-MDM) project, which used single-cell genomics to obtain genomes from uncultivated bacterial and archaeal lineages (Rinke et al. 2013). Several studies have also demonstrated the successful recovery of high-quality population genomes directly from metagenomic data (Tyson et al. 2004; Wrighton et al. 2012; Albertsen et al. 2013; Sharon et al. 2013). Together these initiatives have produced thousands of draft genomes and stand to provide tens of thousands more as sequencing technology and computational methodologies continue to improve. Although this rapid recovery of genomes stands to greatly improve our understanding of the microbial world, it is outpacing our ability to manually assess the quality of individual genomes.In order to make robust inferences from the increasing availability of draft genomes, it is critical to distinguish between genomes of varying quality (Mardis et al. 2002; Chain et al. 2009). 
In particular, genomes recovered from single cells or metagenomic data require careful scrutiny due to the additional complications inherent in obtaining genomes using these approaches (Dick et al. 2010; Albertsen et al. 2013). The quality of isolate genomes has traditionally been evaluated using assembly statistics such as N50 (Salzberg et al. 2012; Gurevich et al. 2013), whereas single-cell and metagenomic studies have relied on the presence and absence of universal single-copy “marker” genes for estimating genome completeness (Wrighton et al. 2012; Haroon et al. 2013; Rinke et al. 2013; Sharon et al. 2013). However, the accuracy of this completeness estimate has not been evaluated, and the approach is likely to be limited by both the uneven distribution of universal marker genes across a genome and their low number, typically accounting for <10% of all genes (Sharon and Banfield 2013). These limitations have been partially addressed by identifying genes that are ubiquitous and single copy within a specific phylum, which increases the number of marker genes used in the estimate (Swan et al. 2013). Single-copy marker genes present multiple times within a recovered genome have also been used to estimate potential contamination (Albertsen et al. 2013; Soo et al. 2014; Sekiguchi et al. 2015).Here we describe CheckM, an automated method for estimating the completeness and contamination of a genome using marker genes that are specific to a genome''s inferred lineage within a reference genome tree. Using simulated genomes of varying degrees of quality, we demonstrate that lineage-specific marker genes provide refined estimates of genome completeness and contamination compared to the universal or domain-level marker genes commonly used. Marker genes that are consistently collocated within a lineage do not provide independent evidence of a genome''s quality, so collocated marker genes were grouped into marker sets in order to further refine estimates of genome quality. We show that lineage-specific collocated marker sets provide robust estimates across all bacterial and archaeal lineages, with completeness and contamination estimates generally having a low absolute error even when genomes are relatively incomplete (70%) with medium contamination (10%). We also propose a fixed vocabulary for defining genome quality based on estimates of completeness and contamination that is suitable for automated screening of genomes from large-scale sequencing initiatives and for annotating genomes in reference databases. We envisage that CheckM will help identify problematic genomes before they are deposited in public databases. For single-cell genomes and population genomes recovered from metagenomic data, the improved quality estimates provided by CheckM allow biological inferences to be made in the context of genome quality and highlight genomes that would benefit from further refinement.  相似文献   
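The completeness and contamination estimates described above are built from collocated, lineage-specific marker sets. A simplified sketch of the idea, using invented marker sets and counts rather than CheckM's actual model:

```python
def completeness_contamination(marker_sets, observed_counts):
    """Simplified CheckM-style estimate. marker_sets: lists of collocated
    marker gene IDs; observed_counts: dict of marker -> copies found in the
    genome bin. Completeness = average fraction of each set observed at least
    once; contamination = average fraction of extra copies. This illustrates
    the idea only and is not CheckM's exact model."""
    comp, cont = 0.0, 0.0
    for markers in marker_sets:
        present = sum(1 for m in markers if observed_counts.get(m, 0) > 0)
        extra = sum(max(observed_counts.get(m, 0) - 1, 0) for m in markers)
        comp += present / len(markers)
        cont += extra / len(markers)
    n = len(marker_sets)
    return 100 * comp / n, 100 * cont / n

# Invented example: two marker sets; one marker missing, one duplicated.
sets = [["rpoB", "gyrB", "recA"], ["rplA", "rplB"]]
counts = {"rpoB": 1, "gyrB": 2, "recA": 1, "rplA": 1}  # rplB missing
print(completeness_contamination(sets, counts))  # -> (75.0, ~16.7)
```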

10.
11.
Emerging next-generation sequencing technologies have revolutionized the collection of genomic data for applications in bioforensics, biosurveillance, and for use in clinical settings. However, to make the most of these new data, new methodology needs to be developed that can accommodate large volumes of genetic data in a computationally efficient manner. We present a statistical framework to analyze raw next-generation sequence reads from purified or mixed environmental or targeted infected tissue samples for rapid species identification and strain attribution against a robust database of known biological agents. Our method, Pathoscope, capitalizes on a Bayesian statistical framework that accommodates information on sequence quality, mapping quality, and provides posterior probabilities of matches to a known database of target genomes. Importantly, our approach also incorporates the possibility that multiple species can be present in the sample and considers cases when the sample species/strain is not in the reference database. Furthermore, our approach can accurately discriminate between very closely related strains of the same species with very little coverage of the genome and without the need for multiple alignment steps, extensive homology searches, or genome assembly—which are time-consuming and labor-intensive steps. We demonstrate the utility of our approach on genomic data from purified and in silico “environmental” samples from known bacterial agents impacting human health for accuracy assessment and comparison with other approaches.The accurate and rapid identification of species and strains of pathogens is an essential component of biosurveillance from both human health and biodefense perspectives (Vaidyanathan 2011). For example, misidentification was among the issues that resulted in a 3-wk delay in accurate diagnosis of the recent outbreak of hemorrhagic Escherichia coli being due to strain O104:H4, resulting in over 3800 infections across 13 countries in Europe with 54 deaths (Frank et al. 2011). The most accurate diagnostic information, necessary for species identification and strain attribution, comes from the most refined level of biological data—genomic DNA sequences (Eppinger et al. 2011). Advances in DNA-sequencing technologies allows for the rapid collection of extraordinary amounts of genomic data, yet robust approaches to analyze this volume of data are just developing, from both statistical and algorithmic perspectives.Next-generation sequencing approaches have revolutionized the way we collect DNA sequence data, including for applications in pathology, bioforensics, and biosurveillance. Given a particular clinical or metagenomic sample, our goal is to identify the specific species, strains, or substrains present in the sample, as well as accurately estimate the proportions of DNA originating from each source genome in the sample. Current approaches for next-gen sequencing usually have read lengths between 25 and 1000 bp; however, these sequencing technologies include error rates that vary by approach and by samples. Such variation is typically less important for species identification given the relatively larger genetic divergences among species than among individuals within species. 
But for strain attribution, sequencing error has the potential to swamp out discriminatory signal in a data set, necessitating highly sensitive and refined computational models and a robust database for both species identification and strain attribution.Current methods for classifying metagenomic samples rely on one or more of three general approaches: composition or pattern matching (McHardy et al. 2007; Brady and Salzberg 2009; Segata et al. 2012), taxonomic mapping (Huson et al. 2007; Meyer et al. 2008; Monzoorul Haque et al. 2009; Gerlach and Stoye 2011; Patil et al. 2012; Segata et al. 2012), and whole-genome assembly (Kostic et al. 2011; Bhaduri et al. 2012). Composition and pattern-matching algorithms use predetermined patterns in the data, such as taxonomic clade markers (Segata et al. 2012), k-mer frequency, or GC content, often coupled with sophisticated classification algorithms such as support vector machines (McHardy et al. 2007; Patil et al. 2012) or interpolated Markov Models (Brady and Salzberg 2009) to classify reads to the species of interest. These approaches require intensive preprocessing of the genomic database before application. In addition, the classification rule and results can often change dramatically depending on the size and composition of the genome database.Taxonomy-based approaches typically rely on a “lowest common ancestor” approach (Huson et al. 2007), meaning that they identify the most specific taxonomic group for each read. If a read originates from a genomic region that shares homology with other organisms in the database, the read is assigned to the lowest taxonomic group that contains all of the genomes that share the homologous region. These methods are typically highly accurate for higher-level taxonomic levels (e.g., phylum and family), but experience reduced accuracy at lower levels (e.g., species and strain) (Gerlach and Stoye 2011). Furthermore, these approaches are not informative when the reads originate from one or more species or strains that are closely related to each other or different organisms in the database. In these cases, all of the reads can be reassigned to higher-level taxonomies, thus failing to identify the specific species or strains contained in the sample.Assembly-based algorithms can often lead to the most accurate strain identification. However, these methods also require the assembly of a whole genome from a sample, which is a computationally difficult and time-consuming process that requires large numbers of reads to achieve an adequate accuracy—often on the order of 50–100× coverage of the target genome (Schatz et al. 2010). Given current sequencing depths, obtaining this level of coverage is usually possible for purified samples, but coverage levels may not be sufficient for mixed samples or in multiplexed sequencing runs. Assembly approaches are further complicated by the fact that data collection at a crime scene or hospital might include additional environmental components in the biological sample (host genome or naturally occurring bacterial and viral species), thus requiring multiple filtering and alignment steps in order to obtain reads specific to the pathogen of interest.Here we describe an accurate and efficient approach to analyze next-generation sequence data for species identification and strain attribution that capitalizes on a Bayesian statistical framework implemented in the new software package Pathoscope v1.0. 
Our approach accommodates information on sequence quality, mapping quality, and provides posterior probabilities of matches to a known database of reference genomes. Importantly, our approach incorporates the possibility that multiple species can be present in the sample or that the target strain is not even contained within the reference database. It also accurately discriminates between very closely related strains of the same species with much less than 1× coverage of the genome and without the need for sequence assembly or complex preprocessing of the database or taxonomy. No other method in the literature can identify species or substrains in such a direct and automatic manner and without the need for large numbers of reads. We demonstrate our approach through application to next-generation DNA sequence data from a recent outbreak of the hemorrhagic E. coli (O104:H4) strain in Europe (Frank et al. 2011; Rohde et al. 2011; Turner 2011) and on purified and in silico mixed samples from several other known bacterial agents that impact human health. Software and data examples for our approach are freely available for download at https://sourceforge.net/projects/pathoscope/.  相似文献   
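The core of the reassignment described above, posterior responsibility of each candidate genome for each read updated together with the genome mixing proportions, can be sketched as a bare-bones EM loop. This omits Pathoscope's priors, penalty terms, and quality weighting, and the per-read likelihoods are invented:

```python
def reassign_reads(read_likelihoods, n_iter=50):
    """Bare-bones EM over genome mixing proportions. read_likelihoods is a
    list of dicts {genome: P(read | genome)} per read (values here invented).
    Returns (proportions, per-read posteriors). Pathoscope's actual model
    adds priors, penalties, and read/mapping quality information."""
    genomes = {g for read in read_likelihoods for g in read}
    pi = {g: 1.0 / len(genomes) for g in genomes}
    for _ in range(n_iter):
        counts = {g: 0.0 for g in genomes}
        posteriors = []
        for read in read_likelihoods:
            weights = {g: pi[g] * lik for g, lik in read.items()}
            z = sum(weights.values())
            post = {g: w / z for g, w in weights.items()}
            posteriors.append(post)
            for g, p in post.items():
                counts[g] += p
        total = sum(counts.values())
        pi = {g: c / total for g, c in counts.items()}
    return pi, posteriors

reads = [{"strainA": 0.9, "strainB": 0.8},   # ambiguous read
         {"strainA": 0.95},                  # read unique to strain A
         {"strainA": 0.9, "strainB": 0.85}]
pi, post = reassign_reads(reads)
print({g: round(p, 2) for g, p in pi.items()})
```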

12.
Saccharomyces cerevisiae, a well-established model for species as diverse as humans and pathogenic fungi, is more recently a model for population and quantitative genetics. S. cerevisiae is found in multiple environments—one of which is the human body—as an opportunistic pathogen. To aid in the understanding of the S. cerevisiae population and quantitative genetics, as well as its emergence as an opportunistic pathogen, we sequenced, de novo assembled, and extensively manually edited and annotated the genomes of 93 S. cerevisiae strains from multiple geographic and environmental origins, including many clinical origin strains. These 93 S. cerevisiae strains, the genomes of which are near-reference quality, together with seven previously sequenced strains, constitute a novel genetic resource, the “100-genomes” strains. Our sequencing coverage, high-quality assemblies, and annotation provide unprecedented opportunities for detailed interrogation of complex genomic loci, examples of which we demonstrate. We found most phenotypic variation to be quantitative and identified population, genotype, and phenotype associations. Importantly, we identified clinical origin associations. For example, we found that an introgressed PDR5 was present exclusively in clinical origin mosaic group strains; that the mosaic group was significantly enriched for clinical origin strains; and that clinical origin strains were much more copper resistant, suggesting that copper resistance contributes to fitness in the human host. The 100-genomes strains are a novel, multipurpose resource to advance the study of S. cerevisiae population genetics, quantitative genetics, and the emergence of an opportunistic pathogen.Research on Saccharomyces cerevisiae, the most extensively characterized model eukaryote, has historically focused on a very small number of strains, or genetic backgrounds. In particular, most research has focused on the laboratory strain S288c, the first eukaryotic genome to be completely sequenced, assembled, and annotated (Goffeau et al. 1996) and thus the reference S. cerevisiae genome (Engel et al. 2014). However, as with all species, there is more to S. cerevisiae than one strain. For example, array analyses (Muller and McCusker 2009b, 2011; Schacherer et al. 2009; Muller et al. 2011; Dunn et al. 2012), low coverage sequencing (Liti et al. 2009), and higher coverage sequencing (Wei et al. 2007; Doniger et al. 2008; Dowell et al. 2010; Skelly et al. 2013; Bergstrom et al. 2014) of a limited number of additional S. cerevisiae strains identified extensive sequence variation. Studies of S. cerevisiae genetic variation and its influence on phenotypic variation have been limited by the modest number of high quality, complete, assembled, and annotated genome sequences. To address these limitations, we describe here the sequencing, and subsequent de novo, high quality, and extensively manually edited assembly and annotation of the genomes of 93 S. cerevisiae strains of multiple geographic and environmental origins.In addition to isolation from traditional, often human-associated environments (Mortimer and Johnston 1986; Mortimer and Polsinelli 1999; Sniegowski et al. 2002; Cromie et al. 2013), S. cerevisiae is isolated clinically, consistent with its being an emerging opportunistic pathogen (Murphy and Kavanagh 1999; Ponton et al. 2000; Silva et al. 2004; Enache-Angoulvant and Hennequin 2005; Munoz et al. 2005; McCusker 2006; Skovgaard 2007; Pfaller and Diekema 2010; Miceli et al. 2011; Chitasombat et al. 
2012). Because a reasonable hypothesis is that human environment-associated S. cerevisiae give rise to clinical S. cerevisiae, we compare 57 nonclinical, mostly human environment-associated strains with 43 clinical strains to gain insight into the emergence of S. cerevisiae as an opportunistic pathogen. These 93 highly accurate, assembled, and annotated genome sequences, together with the genome sequences of S288c (Goffeau et al. 1996), YJM789 (Wei et al. 2007), RM11-1a (RM11 2004), SK1 (Nishant et al. 2010), Σ1278b (Dowell et al. 2010), YPS163 (Doniger et al. 2008), and M22 (Doniger et al. 2008), constitute a novel, multipurpose genetic resource, the “100-genomes” strains. In addition to describing the sequences of the 93 genomes, we describe for the 100-genomes strains their population structure, multiple types of polymorphisms, chromosome rearrangements, aneuploidy, specific phenotypes, and genotype-phenotype associations, as well as phenotypic differentiation between strains varying in population ancestry and in nonclinical vs. clinical origin.
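The clinical-origin enrichment reported above is the kind of association that can be checked with a simple contingency-table test. The sketch below uses Fisher's exact test; the 43 clinical vs. 57 nonclinical totals follow the text, but the split across groups is a hypothetical placeholder, not data from the study.

```python
# Sketch of the kind of enrichment test behind "the mosaic group was
# significantly enriched for clinical origin strains". The group breakdown
# below is invented for illustration only.
from scipy.stats import fisher_exact

clinical_in_mosaic, nonclinical_in_mosaic = 20, 5        # hypothetical counts
clinical_elsewhere, nonclinical_elsewhere = 23, 52       # hypothetical counts

table = [[clinical_in_mosaic, nonclinical_in_mosaic],
         [clinical_elsewhere, nonclinical_elsewhere]]
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, one-sided P = {p_value:.2g}")
```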

13.
The nematode Caenorhabditis briggsae is a model for comparative developmental evolution with C. elegans. Worldwide collections of C. briggsae have implicated an intriguing history of divergence among genetic groups separated by latitude, or by restricted geography, that is being exploited to dissect the genetic basis to adaptive evolution and reproductive incompatibility; yet, the genomic scope and timing of population divergence is unclear. We performed high-coverage whole-genome sequencing of 37 wild isolates of the nematode C. briggsae and applied a pairwise sequentially Markovian coalescent (PSMC) model to 703 combinations of genomic haplotypes to draw inferences about population history, the genomic scope of natural selection, and to compare with 40 wild isolates of C. elegans. We estimate that a diaspora of at least six distinct C. briggsae lineages separated from one another approximately 200,000 generations ago, including the “Temperate” and “Tropical” phylogeographic groups that dominate most samples worldwide. Moreover, an ancient population split in its history approximately 2 million generations ago, coupled with only rare gene flow among lineage groups, validates this system as a model for incipient speciation. Low versus high recombination regions of the genome give distinct signatures of population size change through time, indicative of widespread effects of selection on highly linked portions of the genome owing to extreme inbreeding by self-fertilization. Analysis of functional mutations indicates that genomic context, owing to selection that acts on long linkage blocks, is a more important driver of population variation than are the functional attributes of the individually encoded genes.The record of natural selection in shaping the genetic basis to organismal form and function of a species is inscribed in the genomes of its constituent individuals. Comparisons of genome sequences for each of the related nematodes Caenorhabditis elegans and C. briggsae have revealed powerful insights into the evolution of functional novelty and constraint in phenotypes and genetic pathways (Cutter et al. 2009; Marri and Gupta 2009; Thomas et al. 2012; Haag and Liu 2013; Verster et al. 2014). The high quality C. briggsae genome assembly facilitated such analysis (Stein et al. 2003; Hillier et al. 2007; Ross et al. 2011), and genomic analysis of populations of individuals provides a powerful means to further characterize evolution on contemporary timescales to understand the origins of novelty and constraint (The 1000 Genomes Project Consortium 2012; Langley et al. 2012; Brandvain et al. 2014). Indeed, key questions remain to be solved: How do genomes respond to the simultaneous pressures of mutation, natural selection, and genetic linkage—especially when a novel reproductive mode, facultative self-fertilization, has evolved in the ancestry of a species?C. briggsae is similar to C. elegans in many ways, most notably in their streamlined morphology, amenability to genetic and experimental manipulation, and in both being comprised primarily of self-fertilizing hermaphrodites that are found around the globe. However, C. briggsae is distinctive in having more molecular and phenotypic wild diversity, which is divided along latitudinal phylogeographic lines (Cutter 2006; Raboin et al. 2010; Félix et al. 2013), and by being partly interfertile with its male–female (dioecious) sister species C. nigoni (Woodruff et al. 2010; Kozlowska et al. 2012; Félix et al. 2014). 
Some strain combinations within C. briggsae also appear to show reproductive incompatibilities and outbreeding depression (Dolgin et al. 2008; Ross et al. 2011; Baird and Stonesifer 2012). However, the extent of genetic exchange and admixture across the genome within this species, as well as a full depiction of its evolutionary history, has remained elusive. Moreover, the extensive linkage disequilibrium conferred on the genome by self-fertilizing reproduction is thought to interact with selection to shape chromosome-scale patterns of genetic diversity (Cutter and Choi 2010; Cutter and Payseur 2013). Consequently, selection, self-fertilization, and gene flow all likely interact to control diversity and divergence in ways that require genomic-scale population information to discern. These features make C. briggsae a powerful tool for dissecting evolutionary pattern and process in connection with trait divergence in nature, especially in combination with its deep experimental toolkit (Koboldt et al. 2010; Ross et al. 2011; Frøkjaer-Jensen 2013). Indeed, this species is now an active target of research into the molecular basis of trait variation and adaptation (Baird et al. 2005; Prasad et al. 2011; Ross et al. 2011; Stegeman et al. 2013), the evolution of development (Delattre and Félix 2001; Hill et al. 2006; Guo et al. 2009; Marri and Gupta 2009), and speciation (Woodruff et al. 2010; Baird and Stonesifer 2012; Kozlowska et al. 2012; Yan et al. 2012). Yet, our limited genome-scale understanding of its natural variation constrains our ability to fully exploit it. Here we provide the population genomic framework for relating evolutionary pressures and demographic histories to their genomic signatures in a global sample of C. briggsae.
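The 703 haplotype combinations analyzed with PSMC are consistent with all unordered pairs drawn from 38 haploid genomes, since C(38, 2) = 703. The sketch below only illustrates that bookkeeping; the assumption that the 38 sequences are the 37 wild isolates plus a reference strain is ours, and the pseudo-diploid construction and the PSMC run itself are left as placeholders.

```python
# Bookkeeping behind "703 combinations of genomic haplotypes": 703 = C(38, 2),
# i.e., all unordered pairs of 38 haploid sequences (assumed here to be the
# 37 wild isolates plus a reference strain).
from itertools import combinations

haplotypes = [f"isolate_{i:02d}" for i in range(1, 38)] + ["reference"]  # 38 names
pairs = list(combinations(haplotypes, 2))
print(len(pairs))  # 703

for h1, h2 in pairs[:3]:
    # In a real analysis, each pair would be merged into a pseudo-diploid
    # consensus (e.g., a .psmcfa file) and passed to PSMC to infer Ne through time.
    print(f"would build pseudo-diploid for {h1} + {h2}")
```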

14.
15.
Structural variation (SV) is a rich source of genetic diversity in mammals, but due to the challenges associated with mapping SV in complex genomes, basic questions regarding their genomic distribution and mechanistic origins remain unanswered. We have developed an algorithm (HYDRA) to localize SV breakpoints by paired-end mapping, and a general approach for the genome-wide assembly and interpretation of breakpoint sequences. We applied these methods to two inbred mouse strains: C57BL/6J and DBA/2J. We demonstrate that HYDRA accurately maps diverse classes of SV, including those involving repetitive elements such as transposons and segmental duplications; however, our analysis of the C57BL/6J reference strain shows that incomplete reference genome assemblies are a major source of noise. We report 7196 SVs between the two strains, more than two-thirds of which are due to transposon insertions. Of the remainder, 59% are deletions (relative to the reference), 26% are insertions of unlinked DNA, 9% are tandem duplications, and 6% are inversions. To investigate the origins of SV, we characterized 3316 breakpoint sequences at single-nucleotide resolution. We find that ∼16% of non-transposon SVs have complex breakpoint patterns consistent with template switching during DNA replication or repair, and that this process appears to preferentially generate certain classes of complex variants. Moreover, we find that SVs are significantly enriched in regions of segmental duplication, but that this effect is largely independent of DNA sequence homology and thus cannot be explained by non-allelic homologous recombination (NAHR) alone. This result suggests that the genetic instability of such regions is often the cause rather than the consequence of duplicated genomic architecture.In the six years since the first genome-wide analyses revealed extensive DNA copy number variation (CNV) among human individuals (Iafrate et al. 2004; Sebat et al. 2004), numerous studies have extended this observation in scope and scale with increasingly powerful genomic tools. It is now widely recognized that structural variation (SV), which includes duplications, deletions, inversions, transpositions, and other genomic rearrangements, is an abundant and functionally important class of genetic variation in mammals (Zhang et al. 2009a). Besides the emerging role of inherited variants in complex disease, new structural mutations contribute to sporadic human disorders, are a hallmark of tumor genomes, and drive the evolution of genes and species. For all of these reasons, it is important to generate accurate SV maps in many different organisms and cellular contexts, so that the biological consequences of SV may be assessed, and so that the molecular mechanisms that generate new variation may be fully understood.Several technical challenges have precluded a more complete understanding of the patterns and origins of SV. First, most studies have used array comparative genome hybridization (aCGH), which has limited resolution, cannot detect balanced rearrangements or reconstruct locus architecture, and has limited ability to detect SVs composed of multi-copy elements such as segmental duplications (SDs) or transposable elements (TEs). Second, sequence-based methods such as paired-end mapping (PEM) have emerged as a potent alternative to aCGH (Raphael et al. 2003; Tuzun et al. 2005; Korbel et al. 2007; Lee et al. 
2008), but their practical utility has been limited by the high cost of “long-read” sequencing, and the computational difficulties associated with interpreting “short-read” sequence data from complex genomes. Thus, while a number of PEM-based algorithms have been developed to identify SV from short-read sequence data (Chen et al. 2009; Hormozdiari et al. 2009; Korbel et al. 2009; Medvedev et al. 2009) and newer methods have been devised to map SVs at higher resolution (Lee et al. 2009; Sindi et al. 2009), all short-read PEM studies except one (Hormozdiari et al. 2009) have restricted their analyses to paired-end reads that map uniquely to the reference genome. This approach is not ideal given that SVs often involve repeated sequences such as segmental duplications and transposons. Finally, it has been difficult to evaluate structural mutation mechanisms in an unbiased way because genome-wide studies have thus far characterized relatively few breakpoints at single-nucleotide resolution (Korbel et al. 2007; Kidd et al. 2008; Kim et al. 2008; Perry et al. 2008), and the relative contribution of different molecular mechanisms remains a matter of debate. Despite rapid advances in DNA sequencing technologies, affordable and accurate assembly of entire mammalian genomes remains years away. Indeed, even traditional methods have difficulty resolving complex genomic regions. In the interim, we argue that the optimal solution for breakpoint detection is a hybrid approach that combines PEM and local de novo assembly. Here we describe a general approach for unbiased detection, assembly, and mechanistic interpretation of SV breakpoints using both short and long reads, and apply it to whole-genome sequence data from two widely used inbred mouse strains. We show that our algorithms accurately identify diverse classes of SV, capture an unprecedented number of variants, and reveal novel breakpoint features. Of mechanistic significance, we report an abundance of complex SVs that appear to be derived from template switching during DNA replication or repair, and a propensity for duplicated genomic regions to generate new variants through mechanism(s) other than non-allelic homologous recombination (NAHR). A unique strength of this study is our choice of the mouse genome; because the reference genome is derived from an established inbred line (C57BL/6J), we were able to sequence an animal whose genome should be essentially identical to the reference. This important methodological control, which has not been present in any other PEM study, allowed us to distinguish true genetic variation from technical “noise” and poorly assembled genomic regions.
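The SV classes tallied above (deletions, insertions, tandem duplications, inversions) map onto standard paired-end mapping signatures based on the span and orientation of discordant read pairs. The following is a generic sketch of that classification logic, not a reproduction of HYDRA; the thresholds and field names are illustrative only.

```python
# Generic paired-end mapping (PEM) signatures in the spirit of, but not
# reproducing, HYDRA: a read pair mapping with unexpected span or orientation
# suggests a candidate SV class. Thresholds and fields are illustrative.
from dataclasses import dataclass

@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str               # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str
    expected_span: int = 500   # library insert size
    tolerance: int = 150       # allowed deviation from the expected span

def classify(p: ReadPair) -> str:
    if p.chrom1 != p.chrom2:
        return "interchromosomal / insertion of unlinked DNA"
    span = abs(p.pos2 - p.pos1)
    if p.strand1 == p.strand2:
        return "inversion"                      # same-orientation (FF/RR) pair
    if p.strand1 == "-" and p.strand2 == "+" and p.pos1 < p.pos2:
        return "tandem duplication"             # everted (RF) orientation
    if span > p.expected_span + p.tolerance:
        return "deletion (relative to the reference)"
    if span < p.expected_span - p.tolerance:
        return "insertion"
    return "concordant"

print(classify(ReadPair("chr1", 1000, "+", "chr1", 5000, "-")))  # deletion signature
```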

16.
We describe a statistical and comparative-genomic approach for quantifying error rates of genome sequence assemblies. The method exploits not substitutions but the pattern of insertions and deletions (indels) in genome-scale alignments for closely related species. Using two- or three-way alignments, the approach estimates the amount of aligned sequence containing clusters of nucleotides that were wrongly inserted or deleted during sequencing or assembly. Thus, the method is well-suited to assessing fine-scale sequence quality within single assemblies, between different assemblies of a single set of reads, and between genome assemblies for different species. When applying this approach to four primate genome assemblies, we found that average gap error rates per base varied considerably, by up to sixfold. As expected, bacterial artificial chromosome (BAC) sequences contained lower, but still substantial, predicted numbers of errors, arguing for caution in regarding BACs as the epitome of genome fidelity. We then mapped short reads, at approximately 10-fold statistical coverage, from a Bornean orangutan onto the Sumatran orangutan genome assembly originally constructed from capillary reads. This resulted in a reduced gap error rate and a separation of error-prone from high-fidelity sequence. Over 5000 predicted indel errors in protein-coding sequence were corrected in a hybrid assembly. Our approach contributes a new fine-scale quality metric for assemblies that should facilitate development of improved genome sequencing and assembly strategies.Genome sequence assemblies form the bedrock of genome research. Any errors within them directly impair genomic and comparative genomic predictions and inferences based upon them. The prediction of functional elements or the elucidation of the evolutionary provenance of genomic sequence, for example, relies on the fidelity and completeness of these assemblies. Imperfections, such as erroneous nucleotide substitutions, insertions or deletions, or larger-scale translocations, may misinform genome annotations or analyses (Salzberg and Yorke 2005; Choi et al. 2008; Phillippy et al. 2008). Insertion and deletion (indel) errors are particularly hazardous to the prediction of protein-coding genes since many introduce frame-shifts to otherwise open reading frames. Noncoding yet functional sequence can be identified from a deficit of indels (Lunter et al. 2006), but only where this evolutionary signal has not been obscured by indel errors. Several high-quality reference genomes currently exist, and many errors in initial draft genome sequence assemblies have been rectified in later more finished assemblies. However, because of the substantial costs involved, among the mammals only the genomes of human, mouse, and dog have been taken (or are being taken) toward “finished” quality, defined as fewer than one error in 104 bases and no gaps (International Human Genome Sequencing Consortium 2004; Church et al. 2009). It is likely that other draft genome assemblies will remain in their unfinished states until technological improvements substantially reduce the cost of attaining finished genome quality.Genome assemblies have been constructed from sequence data produced by different sequencing platforms and strategies, and using a diverse array of assembly algorithms (e.g., PCAP [Huang et al. 2003], ARACHNE [Jaffe et al. 2003], Atlas [Havlak et al. 2004], PHUSION [Mullikin and Ning 2003], Jazz [Aparicio et al. 2002], and the Celera Assembler [Myers et al. 2000]). 
The recent introduction of new sequencing technologies (Mardis 2008) further complicates genome assemblies, as each platform exhibits read lengths and error characteristics very different from those of Sanger capillary sequencing reads. These new technologies have also spawned additional assembly and mapping algorithms, such as Velvet (Zerbino and Birney 2008) and MAQ (Li et al. 2008). Considering the methodological diversity of sequence generation and assembly, and the importance of high-quality primary data to biologists, there is a clear need for an objective and quantitative assessment of the fine-scale fidelity of the different assemblies.One frequently discussed property of genome assemblies is the N50 value (Salzberg and Yorke 2005). This is defined as the weighted median contig size, so that half of the assembly is covered by contigs of size N50 or larger. While the N50 value thus quantifies the ability of the assembler algorithm to combine reads into large seamless blocks, it fails to capture all aspects of assembly quality. For example, artefactually high N50 values can be obtained by lowering thresholds for amalgamating smaller blocks of—often repetitive—contiguous reads, resulting in misassembled contigs, although approaches to ameliorate such problems are being developed (Bartels et al. 2005; Dew et al. 2005; Schatz et al. 2007). Some validation of the global assembly accuracy, as summarized by N50, can be achieved by comparison with physical or genetic maps or by alignment to related genomes. Contiguity can also be quantified from the alignment of known cDNAs or ESTs. More regional errors can be indicated by fragmentation, incompleteness, or exon noncollinearity of gene models, or by unexpectedly high read depths that often reflect collapse of virtually identical segmental duplications.In addition to these problems, N50 values fail to reflect fine-scale inaccuracies, such as substitution and indel errors. Quality at the nucleotide level is summarized as a phred score, with scores exceeding 40 indicating finished sequence (Ewing and Green 1998) and corresponding to an error rate of less than one base in 10,000. Once assembled, a base is assigned a consensus quality score (CQS) depending on its read depth and the quality of each base contributing to that position (Huang and Madan 1999). Finally, assessing sequence error has traditionally relied on comparison with bacterial artificial chromosome (BAC) sequence. Discrepancies between assembly and BAC sequences are assumed to reflect errors in the draft sequence, although a minority may remain in the finished BAC sequence.Here, we introduce a statistical and comparative genomics method that quantifies the fine-scale quality of a genome assembly and that has the merit of being complementary to the aforementioned approaches. Instead of considering rates of nucleotide substitution errors in an assembly, which are already largely indicated by CQSs, the method quantifies genome assembly quality by the rate of insertion and deletion errors in alignments. This approach estimates the abundance of indel errors between aligned genome pairs, by separating these from true evolutionary indels.Previously, we demonstrated that in the absence of selection, indel mutations leave a precise and determinable fingerprint on the distribution of ungapped alignment block lengths (Lunter et al. 2006). 
We refer to these block lengths, the distances between successive indel mutations (which appear as gaps within genome alignments), as intergap segment (IGS) lengths. Under the neutral indel model, these IGS lengths are expected to follow a geometric frequency distribution whenever sequence has been free of selection. There is substantial evidence that the large majority of mammalian genome sequence has evolved neutrally (Mouse Genome Sequencing Consortium 2002; Lunter et al. 2006). More specifically, virtually all transposable elements (TEs) have, upon insertion, subsequently been free of purifying selection (Lunter et al. 2006; Lowe et al. 2007). This absence of selection manifests itself in IGS in ancestral repeats (those TEs that were inserted before the common ancestor of two species), which closely follow the geometric frequency distribution expected of neutral sequence (Fig. 1A).

Figure 1. Genomic distribution of intergap segment lengths in mouse-rat alignments for ancestral repeats (A) and whole-genome sequences (B). Frequencies of IGS lengths are shown on a natural log scale. The black line represents the prediction of the neutral indel model, a geometric distribution of IGS lengths; observed counts (blue circles) are accumulated in 5-bp bins of IGS lengths. Within mouse-rat ancestral repeat sequence, the observations fit the model accurately for IGS between 10 bp and 300 bp. For whole-genome data, a similarly close fit is observed for IGS between 10 bp and 100 bp. Beyond 100 bp, an excess of longer IGSs (green) above the quantities predicted by the neutral indel model can be observed, representing functional sequence that has been conserved with regard to indel mutations. The depletion of short (<10 bp) IGS reflects a “gap attraction” phenomenon (Lunter et al. 2008).

Within conserved functional sequence, on the other hand, deleterious indels will tend to have been purged; hence, IGS lengths will frequently be more extended than in neutral sequence. This results in a departure of the observed IGS length distribution from the geometric distribution (Fig. 1B), the extent of which allows the amount of functional sequence shared between genome pairs to be estimated accurately (for further details, see Lunter et al. 2006). In any alignment, a proportion of gaps will represent true evolutionary events, whereas the remainder represent “gap errors” that were inadvertently introduced during sequencing and assembly. Causes of assembly errors, such as insufficient read coverage or misassembly, are often regional and thus may be expected to result in clustering of errors. In contrast, from the results of comparisons between species such as human and mouse, true evolutionary indel events appear to be only weakly clustered, for instance, through a dependence of indel rate on G+C content (Lunter et al. 2006). Indels may cluster because of recurrent and regional positive selection of nucleotide insertions and/or deletions. Nevertheless, these effects are unlikely to be sufficiently widespread to explain the high rates of indel clustering (up to one indel per 4 kb) that we discuss later. Indels may also cluster because of mutational biases that are independent of G+C, although we know of no such short-distance effects (see Discussion). This reasoning provided the rationale for seeking to exploit the neutral indel model to estimate the number of gap errors in alignments of two assemblies.
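The geometric expectation invoked here can be written out explicitly; the notation below is ours, not the paper's.

```latex
% Neutral indel model expectation (our notation). Let p be the per-base
% probability that a gap opens between two adjacent aligned positions in
% neutral sequence. The length L of an ungapped intergap segment is then
% geometrically distributed:
\[
  P(L = k) = (1 - p)^{k}\, p, \qquad k = 0, 1, 2, \ldots
\]
% so, with N intergap segments in total, the expected count at length k obeys
\[
  \log E[n_k] = \log (N p) + k \log (1 - p),
\]
% a straight line on the natural-log scale of Figure 1. An excess above this
% line at large k reflects indel-purifying selection; an excess at small k
% reflects clustered gap errors.
```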
Purifying selection on indels and clustered indel errors contribute to largely distinct parts of the observed IGS histogram: The former increases the representation of long IGS (Fig. 1B), whereas the latter cause short IGS to become more prevalent than expected.Nevertheless, owing to the considerable divergence between human and mouse, the probability of a true indel greatly exceeds assembly indel error rates (5 × 10−2 versus 10−3 to 10−4 per nucleotide) (see below) (Lunter et al. 2006). In short, the large number of true indel events renders the proportion of gap errors so low as to be inestimable. Even for more closely related species, such as mouse and rat (Fig. 1A), neutral sequence is estimated to contain one true indel per 50 bases, which is also approximately 100-fold higher than the frequency of indel errors we will report later. Consequently, indel errors will be most easily discerned between genome assemblies from yet more closely related species. Few species pairs, whose divergence within neutral sequence is low (<5%), have yet been sequenced. Nevertheless, recent reductions in sequencing costs are likely to result in substantial numbers of closely related genomes being sequenced in the near future.For this analysis, we took advantage of the newly available genome assembly of the Sumatran orangutan (Pongo pygmaeus abelii), sequenced using a conventional capillary sequencing approach (Orangutan Genome Sequencing Consortium, in prep.; D Locke, pers. comm.), and its alignment to other closely related great ape genome assemblies, namely, those of human (Homo sapiens) and chimpanzee (Pan troglodytes). The latter two genomes have been sequenced to finished quality and sixfold coverage, respectively (see Methods) (International Human Genome Sequencing Consortium 2004; The Chimpanzee Sequencing and Analysis Consortium 2005), whereas the effective coverage of the Sumatran orangutan is lower at approximately fourfold (Orangutan Genome Sequencing Consortium, in prep.).We were able to take advantage of a data set of short reads at approximately 10-fold statistical coverage from a single Bornean orangutan (Pongo pygmaeus pygmaeus) that was shotgun-sequenced using the Illumina short read platform as part of the orangutan sequencing project (Orangutan Genome Sequencing Consortium, in prep.). This substantial read depth afforded us an opportunity to quantify the improvement to traditional capillary-read assemblies from the mapping of short sequence reads. Using a sequence mapper (Stampy) that was specifically designed for high sensitivity and accuracy in the presence of indels as well as substitution mutations (see Methods) (GA Lunter and M Goodson, in prep.), we placed these reads onto the Sumatran orangutan genome assembly. Using this assembly as a template, we called indels and substitutions and, from these, derived a templated assembly of the Bornean individual. This assembly is expected to contain polymorphisms specific to the Bornean individual and also to correct many fine-scale substitution and indel errors present in the Sumatran capillary-read assembly. The assembly will be syntenic with the Sumatran assembly, rather than following the Bornean genome where structural variants exist. Moreover, in regions where the Sumatran genome is divergent or contains many errors, reads will not be mapped; such regions will be excluded from the templated assembly. 
Using our indel error statistics, we show that this templated assembly improves on the original assembly in terms of accuracy by effectively separating low-fidelity from high-fidelity sequence.
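As a rough illustration of the indel error statistics used above, one can fit the neutral geometric expectation to the IGS length histogram over a window where the fit is good and treat the excess of short segments above that fit as candidate clustered gap errors. This is a minimal sketch of the idea, not the authors' statistical procedure; the window boundaries are taken loosely from the Figure 1 description.

```python
# Minimal sketch of the gap-error idea (not the authors' exact procedure):
# fit the neutral geometric expectation to the long-IGS part of the histogram,
# then count the excess of short segments above the fitted line as candidate
# clustered gap errors. Bins below 10 bp are skipped because of the
# "gap attraction" artifact noted in the Figure 1 legend.
import numpy as np

def excess_short_igs(igs_lengths, fit_range=(100, 300), short_range=(10, 100)):
    max_len = fit_range[1]
    counts = np.bincount(np.asarray(igs_lengths), minlength=max_len + 1)[: max_len + 1]

    # Fit log(count) = a + b * length over the well-behaved (long-IGS) window.
    ks = np.arange(fit_range[0], fit_range[1] + 1)
    observed = counts[ks]
    mask = observed > 0
    b, a = np.polyfit(ks[mask], np.log(observed[mask]), deg=1)

    # Observed-minus-expected counts for short IGS under the fitted line.
    short_ks = np.arange(short_range[0], short_range[1])
    expected = np.exp(a + b * short_ks)
    excess = counts[short_ks] - expected
    return float(np.clip(excess, 0, None).sum())

rng = np.random.default_rng(0)
neutral_igs = rng.geometric(p=0.02, size=200_000)   # toy neutral IGS lengths
print(excess_short_igs(neutral_igs))                # noise-level excess for purely neutral input
```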

17.
Next-generation massively parallel sequencing technologies provide ultrahigh throughput at two orders of magnitude lower unit cost than capillary Sanger sequencing technology. One of the key applications of next-generation sequencing is studying genetic variation between individuals using whole-genome or target region resequencing. Here, we have developed a consensus-calling and SNP-detection method for sequencing-by-synthesis Illumina Genome Analyzer technology. We designed this method by carefully considering the data quality, alignment, and experimental errors common to this technology. All of this information was integrated into a single quality score for each base under Bayesian theory to measure the accuracy of consensus calling. We tested this methodology using a large-scale human resequencing data set of 36× coverage and assembled a high-quality nonrepetitive consensus sequence for 92.25% of the diploid autosomes and 88.07% of the haploid X chromosome. Comparison of the consensus sequence with Illumina human 1M BeadChip genotyped alleles from the same DNA sample showed that 98.6% of the 37,933 genotyped alleles on the X chromosome and 98% of 999,981 genotyped alleles on autosomes were covered at 99.97% and 99.84% consistency, respectively. At a low sequencing depth, we used prior probability of dbSNP alleles and were able to improve coverage of the dbSNP sites significantly as compared to that obtained using a nonimputation model. Our analyses demonstrate that our method has a very low false call rate at any sequencing depth and excellent genome coverage at a high sequencing depth.Genetic polymorphisms contribute to variations in phenotypes, risk to certain diseases, and response to drugs and the environment. Genome-wide linkage analysis and positional cloning have been tremendously successful for mapping human disease genes that underlie monogenic Mendelian diseases (Jimenez-Sanchez et al. 2001). But most common diseases (such as diabetes, cardiovascular disease, and cancer) and clinically important quantitative traits have complex genetic architectures; a combination of multiple genes and interactions with environmental factors is believed to determine these phenotypes. Linkage analysis has significant limitations in its ability to identify common genetic variations that have modest effects on disease (Wang et al. 2005). In contrast, genome-wide association studies offer a promising approach for mapping associated loci. The completion of the human genome sequence (Lander et al. 2001; Venter et al. 2001) enabled the identification of millions of single nucleotide polymorphisms (SNPs) (Sachidanandam et al. 2001) and the construction of a high-density haplotype map (International HapMap Consortium 2005; International HapMap Consortium et al. 2007). These advances have set the stage for large-scale genome-wide SNP surveys for seeking genetic variations associated with or causative of a wide variety of human diseases.For more than two decades, Sanger sequencing and fluorescence-based electrophoresis technologies have dominated the DNA sequencing field. And DNA sequencing is the method of choice for novel SNP detection, using either a random shotgun strategy or PCR amplification of regions of interest. Most of the SNPs deposited in dbSNP were identified by these methods (Sherry et al. 2001). A key advantage of the utility of traditional Sanger sequencing is the availability of the universal standard of phred scores (Ewing and Green 1998; Ewing et al. 
1998) for defining SNP detection accuracy, in which the phred program assigns a score to each base of the raw sequence to estimate an error probability. With high-throughput clone sequencing of shotgun libraries, a standard method for SNP detection (such as ssahaSNP; Ning et al. 2001) is to align the reads onto a reference genome and filter low-quality mismatches according to their phred score, known as the “neighborhood quality standard” (NQS) (Altshuler et al. 2000). With direct sequencing of PCR-amplified sequences from diploid samples, software, including SNPdetector (Zhang et al. 2005), novoSNP (Weckx et al. 2005), PolyPhred (Stephens et al. 2006), and PolyScan (Chen et al. 2007), has been developed to examine chromatogram files to detect heterozygous polymorphisms. New DNA sequencing technologies, which have recently been developed and implemented, such as the Illumina Genome Analyzer (GA), Roche/454 FLX system, and AB SOLiD system, have significantly improved throughput and dramatically reduced the cost as compared to capillary-based electrophoresis systems (Shendure et al. 2004). In a single experiment using one Illumina GA, the sequence of approximately 100 million reads of up to 50 bases in length can be determined. This ultrahigh throughput makes next-generation sequencing technologies particularly suitable for carrying out genetic variation studies by using large-scale resequencing of sizeable cohorts of individuals with a known reference (Bentley 2006). Currently, using these technologies, three human individuals have been sequenced: James Watson's genome by 454 Life Sciences (Roche) FLX sequencing technology (Wheeler et al. 2008), an Asian genome (Wang et al. 2008), and an African genome (Bentley et al. 2008) sequenced by Illumina GA technology. Additionally, given such sequencing advances, an international research consortium has formed to sequence the genomes of at least 1000 individuals from around the world to create the most detailed human genetic variation map to date. As noted, SNP detection methods for standard sequencing technologies are well developed; however, given distinct differences in the sequence data output from and analyses of next-generation sequencing, novel methods for accurate SNP detection are essential. To meet these needs, we have developed a method of consensus calling and SNP detection for the massively parallel Illumina GA technology. The Illumina platform uses a phred-like quality score system to measure the accuracy of each sequenced base pair. Using this, we calculated the likelihood of each genotype at each site based on the alignment of short reads to a reference genome together with the corresponding sequencing quality scores. We then inferred the genotype with highest posterior probability at each site using a Bayesian statistical method. The Bayesian method has been used for SNP calling for traditional Sanger sequencing technology (Marth et al. 1999) and has also been introduced for the analysis of next-generation sequencing data (Li et al. 2008a). In the method presented here, we have taken into account the intrinsic bias or errors that are common in Illumina GA sequencing data and recalibrated the quality values for use in inferring consensus sequence. We evaluated this SNP detection method using the Asian genome sequence, which has 36× high-quality data (Wang et al. 2008).
The evaluation demonstrated that our method has a very low false call rate at any sequencing depth and excellent genome coverage at high sequencing depth, making it well suited for SNP detection in Illumina GA resequencing data. This methodology and the software described in this report have been integrated into the Short Oligonucleotide Alignment Program (SOAP) package (Li et al. 2008b) and named “SOAPsnp” to indicate its functionality for SNP detection using SOAP short-read alignment results as input.
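The Bayesian genotype inference described above combines phred-scaled base error probabilities with genotype priors and reports the genotype with the highest posterior. The sketch below illustrates that general scheme only; it omits SOAPsnp's quality recalibration and platform-specific terms, and the priors and toy pileup are invented for illustration.

```python
# Generic illustration of phred-based Bayesian genotype calling (the flavor of
# model described above, not SOAPsnp's exact likelihoods).
from itertools import combinations_with_replacement
import math

BASES = "ACGT"

def base_likelihood(observed: str, genotype: tuple, phred: int) -> float:
    e = 10 ** (-phred / 10)                      # per-base error probability
    def p_given(allele: str) -> float:
        return 1 - e if observed == allele else e / 3
    # A read from a diploid site samples either allele with probability 1/2.
    return 0.5 * p_given(genotype[0]) + 0.5 * p_given(genotype[1])

def call_genotype(pileup, het_prior=1e-3):
    """pileup: list of (base, phred_quality) tuples covering one site."""
    log_posteriors = {}
    for gt in combinations_with_replacement(BASES, 2):      # 10 diploid genotypes
        prior = het_prior if gt[0] != gt[1] else (1 - 6 * het_prior) / 4
        loglik = sum(math.log(base_likelihood(b, gt, q)) for b, q in pileup)
        log_posteriors[gt] = math.log(prior) + loglik       # unnormalized log posterior
    best = max(log_posteriors, key=log_posteriors.get)
    return best, log_posteriors[best]

site = [("A", 30)] * 7 + [("G", 25)] * 6   # toy pileup at a heterozygous-looking site
print(call_genotype(site))                 # expect ('A', 'G') to score highest
```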

18.
Unbiased next-generation sequencing (NGS) approaches enable comprehensive pathogen detection in the clinical microbiology laboratory and have numerous applications for public health surveillance, outbreak investigation, and the diagnosis of infectious diseases. However, practical deployment of the technology is hindered by the bioinformatics challenge of analyzing results accurately and in a clinically relevant timeframe. Here we describe SURPI (“sequence-based ultrarapid pathogen identification”), a computational pipeline for pathogen identification from complex metagenomic NGS data generated from clinical samples, and demonstrate use of the pipeline in the analysis of 237 clinical samples comprising more than 1.1 billion sequences. Deployable on both cloud-based and standalone servers, SURPI leverages two state-of-the-art aligners for accelerated analyses, SNAP and RAPSearch, which are as accurate as existing bioinformatics tools but orders of magnitude faster in performance. In fast mode, SURPI detects viruses and bacteria by scanning data sets of 7–500 million reads in 11 min to 5 h, while in comprehensive mode, all known microorganisms are identified, followed by de novo assembly and protein homology searches for divergent viruses in 50 min to 16 h. SURPI has also directly contributed to real-time microbial diagnosis in acutely ill patients, underscoring its potential key role in the development of unbiased NGS-based clinical assays in infectious diseases that demand rapid turnaround times.There is great interest in the use of unbiased next-generation sequencing (NGS) technology for comprehensive detection of pathogens from clinical samples (Dunne et al. 2012; Wylie et al. 2012; Chiu 2013; Firth and Lipkin 2013). Conventional diagnostic testing for pathogens is narrow in scope and fails to detect the etiologic agent in a significant percentage of cases (Barnes et al. 1998; Louie et al. 2005; van Gageldonk-Lafeber et al. 2005; Bloch and Glaser 2007; Denno et al. 2012). Failure to accurately diagnose and treat infection in a timely fashion contributes to continued transmission and increased mortality in hospitalized patients (Kollef et al. 2008). Ongoing discovery of novel pathogens, such as Bas-Congo rhabdovirus (Grard et al. 2012) and MERS (Middle East Respiratory Syndrome) coronavirus (Zaki et al. 2012), also underscores the need for rapid, broad-spectrum diagnostic assays that are able to recognize these emerging agents.Unbiased NGS holds the promise of identifying all potential pathogens in a single assay without a priori knowledge of the target. Given sufficiently long read lengths, multiple hits to the microbial genome, and a well-annotated reference database, nearly all microorganisms can be uniquely identified on the basis of their specific nucleic acid sequence. Thus, NGS has widespread microbiological applications, including infectious disease diagnosis in clinical laboratories (Dunne et al. 2012), pathogen discovery in acute and chronic illnesses of unknown origin (Chiu 2013), and outbreak investigation on a global level (Firth and Lipkin 2013). However, the latest NGS laboratory workflows incur minimum turnaround times exceeding 8 h from clinical sample to sequence (Quail et al. 2012). Thus, it is critical that subsequent computational analyses of NGS data be performed within a timeframe suitable for actionable responses in clinical medicine and public health (i.e., minutes to hours). 
Such pipelines must also retain sensitivity, accuracy, and throughput in detecting a broad range of clinically relevant pathogenic microorganisms. Computational analysis of metagenomic NGS data for pathogen identification remains challenging for several reasons. First, alignment/classification algorithms must contend with massive amounts of sequence data. Recent advances in NGS technologies have resulted in instruments that are capable of producing >100 gigabases (Gb) of reads in a day (Loman et al. 2012). Reference databases of host and pathogen sequences range in size from 2 Gb for viruses to 3.1 Gb for the human genome to 42 Gb for all nucleotide sequences in the National Center for Biotechnology Information (NCBI) nucleotide (nt) collection (NCBI nt DB) as of January 2013. Second, only a small fraction of short NGS reads in clinical metagenomic data typically correspond to pathogens (a “needle-in-a-haystack” problem) (Kostic et al. 2012; Wylie et al. 2012; Yu et al. 2012), and such sparse reads often do not overlap sufficiently to permit de novo assembly into longer contiguous sequences (contigs) (Kostic et al. 2011). Thus, individual reads, typically only 100–300 nucleotides (nt) in length, must be classified to a high degree of accuracy. Finally, novel microorganisms with divergent genomes, particularly viruses, are not adequately represented in existing reference databases and often can only be identified on the basis of remote amino acid homology (Xu et al. 2011; Grard et al. 2012). To address these challenges, the most widely used approach is computational subtraction of reads corresponding to the host (e.g., human), followed by alignment to reference databases that contain sequences from candidate pathogens (MacConaill and Meyerson 2008; Greninger et al. 2010; Kostic et al. 2011; Zhao et al. 2013). Traditionally, the BLAST algorithm (Altschul et al. 1990) is used for classification of human and nonhuman reads at the nucleotide level (BLASTn), followed by low-stringency protein alignments using a translated nucleotide query (BLASTx) for detection of divergent sequences from novel pathogens (Delwart 2007; Briese et al. 2009; Xu et al. 2011; Grard et al. 2012; Chiu 2013). However, BLAST is too slow for routine analysis of NGS metagenomic data (Niu et al. 2011), and end-to-end processing times, even on multicore computational servers, can take several days to weeks. Analysis pipelines that use faster, albeit less sensitive, algorithms upfront for host computational subtraction, such as PathSeq (Kostic et al. 2011), still rely on traditional BLAST approaches for final pathogen determination. In addition, whereas PathSeq works well for tissue samples in which the vast majority of reads are host-derived and thus subject to subtraction, the pipeline becomes computationally prohibitive when analyzing complex clinical metagenomic samples open to the environment, such as respiratory secretions or stool (Fig. 1B; Supplemental Table S1). Other published pipelines are focused solely on limited detection of specific types of microorganisms, are unable to identify highly divergent novel pathogens, and/or utilize computationally taxing algorithms such as BLAST (Bhaduri et al. 2012; Borozan et al. 2012; Dimon et al. 2013; Naeem et al. 2013; Wang et al. 2013; Zhao et al. 2013). Furthermore, there is hitherto little reported data on the real-life performance of these pipelines for pathogen identification in clinical samples.

Figure 1. The SURPI pipeline for pathogen detection. (A) A schematic overview of the SURPI pipeline. Raw NGS reads are preprocessed by removal of adapter, low-quality, and low-complexity sequences, followed by computational subtraction of human reads using SNAP. In fast mode, viruses and bacteria are identified by SNAP alignment to viral and bacterial nucleotide databases. In comprehensive mode, reads are aligned using SNAP to all nucleotide sequences in the NCBI nt collection, enabling identification of bacteria, fungi, parasites, and viruses. For pathogen discovery of divergent microorganisms, unmatched reads and contigs generated from de novo assembly are then aligned to a viral protein database or all protein sequences in the NCBI nr collection using RAPSearch. SURPI reports include a list of all classified reads with taxonomic assignments, a summary table of read counts, and both viral and bacterial genomic coverage maps. (B) Relative proportion of NGS reads classified as human, bacterial, viral, or other in different clinical sample types. (C) The SNAP nucleotide aligner (Zaharia et al. 2011). SNAP aligns reads by generating a hash table of sequences of length “s” from the reference database and then comparing the hash index with “n” seeds of length “s” generated from the query sequence, producing a match based on the edit distance “d.” (D) The RAPSearch protein similarity search tool (Zhao et al. 2012). RAPSearch aligns translated nucleotide queries to a protein database using a compressed amino acid alphabet at the level of chemical similarity for greatly increased processing speed.

Here we describe SURPI (“sequence-based ultrarapid pathogen identification”), a cloud-compatible bioinformatics analysis pipeline that provides extensive classification of reads against viral and bacterial databases in fast mode and against the entire NCBI nt DB in comprehensive mode (Fig. 1A). Novel pathogens are also identified in comprehensive mode by amino acid alignment to viral and/or NCBI nr protein databases. Notably, SURPI generates results in a clinically actionable timeframe of minutes to hours by leveraging two alignment tools, SNAP (Fig. 1C; Zaharia et al. 2011) and RAPSearch (Fig. 1D; Zhao et al. 2012), which have computational times that are orders of magnitude faster than other available algorithms. Here we evaluate the performance of these tools for pathogen detection using both in silico-generated and clinical data and describe use of the SURPI pipeline in the analysis of 15 independent NGS data sets consisting of 157 clinical samples multiplexed across 47 barcodes and including over 1.1 billion reads. These data sets encompass a variety of clinical infections, detected pathogens, sample types, and depths of coverage. We also demonstrate use of the pipeline for detection of emerging novel outbreak viruses and for clinical diagnosis of a case of unknown fever in a returning traveler.
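The Figure 1C legend summarizes SNAP's strategy as hashing length-s substrings of the reference and verifying seed hits by edit distance. The toy sketch below illustrates that seed-and-verify idea only; it is not SNAP's implementation, and the reference string and parameters are made up.

```python
# Toy illustration of the seed-and-verify idea summarized in the Figure 1C
# legend: index every length-s substring of the reference in a hash table,
# look up several seeds from each read, and verify candidate positions by
# edit distance. A teaching sketch, not SNAP's implementation.
from collections import defaultdict

def build_index(reference: str, s: int) -> dict:
    index = defaultdict(list)
    for i in range(len(reference) - s + 1):
        index[reference[i:i + s]].append(i)
    return index

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def align(read: str, reference: str, index: dict, s: int, n_seeds: int = 4, d: int = 2):
    step = max(1, (len(read) - s) // max(1, n_seeds - 1))
    best = None
    for offset in range(0, len(read) - s + 1, step):
        for hit in index.get(read[offset:offset + s], []):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            dist = edit_distance(read, reference[start:start + len(read)])
            if dist <= d and (best is None or dist < best[1]):
                best = (start, dist)
    return best  # (position, edit distance) or None

ref = "ACGTTGCAAGGCTTACGATCGATCGGCTAAGCTTACG"
idx = build_index(ref, s=6)
print(align("ACGATCGATCGGCTA", ref, idx, s=6))  # exact match expected at position 14
```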

19.
20.