Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome
Authors: Sara Goodwin, James Gurtowski, Scott Ethe-Sayers, Panchajanya Deshpande, Michael C Schatz, W Richard McCombie
Institution: Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
Abstract: Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used this for sequencing the Saccharomyces cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm Nanocorr specifically for Oxford Nanopore reads, because existing packages were incapable of assembling the long read lengths (5–50 kbp) at such high error rates (between ∼5% and 40% error). With this new method, we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: The contig N50 length is more than ten times greater than an Illumina-only assembly (678 kbp versus 59.9 kbp) and has >99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.
Most DNA sequencing methods are based on either chemical cleavage of DNA molecules (Maxam and Gilbert 1977) or synthesis of new DNA strands (Sanger et al. 1977), which are used in the majority of today's sequencing routines. In the more common synthesis-based methods, base analogs of one form or another are incorporated into a nascent DNA strand that is labeled either on the primer from which it originates or on the newly incorporated bases. This is the basis of the sequencing method used for most current sequencers, including Illumina, Ion Torrent, and Pacific Biosciences (PacBio) sequencing, and their earlier predecessors (Mardis 2008).
Alternatively, it has been observed that individual DNA molecules could be sequenced by monitoring their progress through various types of pores (Kasianowicz et al. 1996; Venkatesan and Bashir 2011) originally envisioned as being pores derived from bacteriophage particles (Sanger et al. 1980). The advantages of this approach include potentially very long and unbiased sequence reads, because neither amplification nor chemical reactions are necessary for sequencing (Yang et al. 2013).
Recently we began testing a sequencing device using nanopore technology from Oxford Nanopore Technologies (ONT) through their early access program (Eisenstein 2012). This device, the MinION, is a nanopore-based device in which pores are embedded in a membrane placed over an electrical detection grid. As DNA molecules pass through the pores, they create measurable alterations in the ionic current. The fluctuations are sequence dependent and thus can be used by a base-calling algorithm to infer the sequence of nucleotides in each molecule (Stoddart et al. 2009; Yang et al. 2013). As part of the library preparation protocol, a hairpin adapter is ligated to one end of a double-stranded DNA sample, while a “motor” protein is bound to the other to unwind the DNA and control the rate of nucleotides passing through the pore (Clarke et al. 2009). Under ideal conditions the leading template strand passes through the pore, followed by the hairpin adapter and then the complement strand. 
In such a run where both strands are sequenced, a consensus sequence of the molecule can be produced; these consensus reads are termed “2D reads” and have generally higher accuracy than reads from only a single pass of the molecule (“1D reads”).
The ability to generate very long read lengths from a handheld sequencer opens the potential for many important applications in genomics, including de novo genome assembly of novel genomes, structural variation analysis of healthy or diseased samples, or even isoform resolution when applied to cDNA sequencing. However, both the “1D” and “2D” read types currently have a high error rate that limits their direct application to these problems and necessitates a new suite of algorithms. Here we report our experiences sequencing the Saccharomyces cerevisiae (yeast) genome with the instrument, including an in-depth analysis of the data characteristics and error model. We also describe our new hybrid error correction algorithm, Nanocorr, which leverages high-quality short-read MiSeq sequencing to computationally “polish” the long nanopore reads. After error correction, we then de novo assemble the genome using just the error-corrected long reads to produce a very high-quality assembly of the genome with each chromosome assembled into a small number of contigs at very high sequence identity. We further demonstrate that our error correction is nearly optimal: Our results with the error-corrected real data approach those produced using idealized simulated reads extracted directly from the reference genome itself. Finally, we validate these results by error correcting long Oxford Nanopore reads of the E. coli K12 genome sequenced at a different institution and produce an essentially perfect de novo assembly of the genome. As such, we believe our hybrid error correction and assembly approach will be generally applicable to many other sequencing projects.
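The contig N50 statistic quoted in the abstract can be illustrated with a minimal sketch. It assumes the conventional definition (the largest length L such that contigs of length L or greater contain at least half of all assembled bases); the function name `n50` and the toy contig lengths are hypothetical and are not taken from the paper or from Nanocorr.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Assumed definition: the largest length L such that contigs of
    length >= L together hold at least half of the assembled bases.
    """
    ordered = sorted(lengths, reverse=True)   # longest contigs first
    half_total = sum(ordered) / 2             # half of all assembled bases
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one contig length")


# Toy contig lengths in bp, made up for illustration (not data from the paper).
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))  # prints 80
```

In the toy example the two longest contigs (100 + 80 = 180 bp) already cover at least half of the 300 bp total, so the N50 is 80. A more contiguous assembly concentrates more bases in fewer, longer contigs and therefore reports a larger N50.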