De novo assembly of human genomes with massively parallel short read sequencing

Authors:	Ruiqiang Li, Hongmei Zhu, Jue Ruan, Wubin Qian, Xiaodong Fang, Zhongbin Shi, Yingrui Li, Shengting Li, Gao Shan, Karsten Kristiansen, Songgang Li, Huanming Yang, Jian Wang, Jun Wang

Affiliation:	1. Beijing Genomics Institute at Shenzhen, Shenzhen 518083, China; 2. Department of Biology, University of Copenhagen, Copenhagen DK-2200, Denmark

Abstract:	Next-generation massively parallel DNA sequencing technologies provide ultrahigh throughput at a substantially lower unit data cost; however, the reads they produce are very short, making de novo assembly extremely challenging. Here, we describe a novel method for de novo assembly of large genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving N50 contig sizes of 7.4 and 5.9 kilobases (kb) and N50 scaffold sizes of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.

The development and commercialization of next-generation massively parallel DNA sequencing technologies, including Illumina Genome Analyzer (GA) (Bentley 2006), Applied Biosystems SOLiD System, and Helicos BioSciences HeliScope (Harris et al. 2008), have revolutionized genomic research. Compared to traditional Sanger capillary-based electrophoresis systems, these new technologies provide ultrahigh throughput at a unit data cost two orders of magnitude lower. However, they all share the intrinsic characteristic of producing very short reads, currently 25–75 base pairs (bp), which is substantially shorter than Sanger sequencing reads (500–1000 bp) (Shendure et al. 2004). This has raised concern about whether such reads can be used to assemble large genomes accurately. Illumina GA technology has been shown to be feasible for human whole-genome resequencing and can identify single nucleotide polymorphisms (SNPs) accurately by mapping the short reads onto the known reference genome (Bentley et al. 2008; Wang et al. 2008).
But to thoroughly annotate insertions, deletions, and structural variations, de novo assembly of each individual genome from these raw short reads is required.

Currently, Sanger sequencing technology remains the dominant method for building a reference genome sequence for a species. It is, however, expensive, and this prevents many genome sequencing projects from being put into practice. Over the past 10 years, only a limited number of plant and animal genomes have been completely sequenced (http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html), including human (Lander et al. 2001; Venter et al. 2001) and mouse (Mouse Genome Sequencing Consortium 2002), but an accurate understanding of evolutionary history and biological processes at the nucleotide level requires substantially more. The development of a de novo short read assembly method would allow reference sequences to be built for these unexplored genomes in a very cost-effective way, opening the door to numerous substantial new analyses.

Several programs, such as phrap (http://www.phrap.org), Celera assembler (Myers et al. 2000), ARACHNE (Batzoglou et al. 2002), Phusion (Mullikin and Ning 2003), RePS (Wang et al. 2002), PCAP (Huang et al. 2003), and Atlas (Havlak et al. 2004), have been successfully used for de novo assembly of whole-genome shotgun (WGS) sequencing reads in projects using Sanger technology. These programs are based on an overlap-layout strategy. For very short reads this approach is unsuitable, because the overlap between such reads is so short that a correct overlap is hard to distinguish from one arising from repetitive sequence. Also, in practice, it is unrealistic to hold in computer memory all the sequence overlap information from deep sequencing data sets that are made up of huge numbers of short reads.

The de Bruijn graph data structure, introduced in the EULER (Pevzner et al. 2001) assembler, is particularly suitable for representing the overlap relationships among short reads. The advantage of the data structure is that it uses K-mers as vertices and the read paths along the K-mers as edges of the graph. Hence, the graph size is determined by the genome size and repeat content of the sequenced sample and, in principle, is not affected by the high redundancy of deep read coverage. A few short read assemblers, including Velvet (Zerbino and Birney 2008), ALLPATHS (Butler et al. 2008), and EULER-SR (Chaisson and Pevzner 2008), have adopted this approach, explicitly or implicitly, and have shown very promising performance. Some other short read assemblers have applied the overlap and extension strategy, such as SSAKE (Warren et al. 2007), VCAKE (Jeck et al. 2007) (a successor to SSAKE that can handle sequencing errors), SHARCGS (Dohm et al. 2007), and Edena (Hernandez et al. 2008). However, all these assemblers were designed to handle bacteria- or fungi-sized genomes and cannot be applied to the assembly of large genomes, such as the human genome, given the limits of the available memory of current supercomputers. Recently, ABySS (Simpson et al. 2009) used a distributed de Bruijn graph algorithm that can split the data and parallelize the job on a Linux cluster using the message passing interface (MPI) protocol for communication between nodes. Thus, it is able to handle the whole short read data set of a human individual; however, the assembly is very fragmented, with an N50 length of ∼1.5 kilobases (kb). This is not long enough for detecting structural variation between human individuals, nor is it good enough for gene annotation and further analysis of the genomes of novel species.

Here, we present a novel short read assembly method that can build a de novo draft assembly for the human genome.
We previously sequenced the complete genome of an Asian individual using a resequencing method, producing a total of 117.7 gigabases (Gb) of data, and have now generated an additional 82.5 Gb of paired-end short reads, achieving a 71× sequencing depth of the NCBI human reference sequence. We used this substantial data set, together with the data from the African genome sequence (Bentley et al. 2008; Wang et al. 2008; Li et al. 2009a), to test our de novo assembly method. We compared the de novo assemblies to the NCBI reference genome and demonstrated the capability of this method to accurately identify structural variations, especially small deletions and insertions that are difficult to detect using the resequencing method. This software has been integrated into the short oligonucleotide alignment program (SOAP) (Li et al. 2008, 2009b,c) package and named SOAPdenovo to indicate its functionality.
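
The de Bruijn graph representation described above (K-mers as vertices, read paths along the K-mers as edges) can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the SOAPdenovo implementation: the function names `build_de_bruijn` and `sample_reads`, the 16-base example sequence, and the choice of K = 4 are all hypothetical, and the sketch ignores reverse complements and sequencing errors. It shows why the graph size depends on the sequence rather than on read depth.

```python
def build_de_bruijn(reads, k):
    """Toy de Bruijn graph: each K-mer is a vertex, and each pair of
    consecutive K-mers along a read adds a directed edge."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # register the vertex
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # duplicate edges collapse in the set
    return graph


def sample_reads(genome, read_len, step):
    """Hypothetical error-free reads taken every `step` bases."""
    last_start = len(genome) - read_len
    return [genome[i:i + read_len] for i in range(0, last_start + 1, step)]


genome = "ACGTTGCATGGACCTA"  # made-up sequence for illustration only

low = sample_reads(genome, read_len=8, step=4)   # 3 reads
high = sample_reads(genome, read_len=8, step=1)  # 9 reads

g_low = build_de_bruijn(low, k=4)
g_high = build_de_bruijn(high, k=4)

print(len(low), len(high))      # number of reads differs
print(len(g_low), len(g_high))  # number of vertices does not
print(g_low == g_high)
```

On this toy input, tripling the number of reads leaves the graph unchanged at 13 vertices, because redundant reads only revisit K-mers and edges that are already present. A real assembler additionally has to cope with errors, repeats, and both DNA strands.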

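The N50 contig and scaffold sizes quoted in the abstract follow the usual definition of the statistic: the length L such that sequences of length L or longer contain at least half of all assembled bases. The text itself does not spell this out, so the sketch below assumes that standard definition; the function name `n50` and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths,
    assuming the standard definition: the length at which the running
    total, taken from longest to shortest, first reaches half the sum."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one positive length")


# Made-up contig lengths (in bases) for illustration only.
contigs = [2, 3, 4, 5, 6]
print(n50(contigs))
```

With the made-up lengths above, the total is 20 bases; the two longest contigs (6 and 5) together reach 11 bases, which is at least half, so the N50 is 5.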