Similar documents
1.
Next-generation massively parallel DNA sequencing technologies provide ultrahigh throughput at a substantially lower unit data cost; however, the data are very short read length sequences, making de novo assembly extremely challenging. Here, we describe a novel method for de novo assembly of large genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving an N50 contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.The development and commercialization of next-generation massively parallel DNA sequencing technologies, including Illumina Genome Analyzer (GA) (Bentley 2006), Applied Biosystems SOLiD System, and Helicos BioSciences HeliScope (Harris et al. 2008), have revolutionized genomic research. Compared to traditional Sanger capillary-based electrophoresis systems, these new technologies provide ultrahigh throughput with two orders of magnitude lower unit data cost. However, they all share a common intrinsic characteristic of providing very short read length, currently 25–75 base pairs (bp), which is substantially shorter than the Sanger sequencing reads (500–1000 bp) (Shendure et al. 2004). This has raised concern about their ability to accurately assemble large genomes. Illumina GA technology has been shown to be feasible for use in human whole-genome resequencing and can be used to identify single nucleotide polymorphisms (SNPs) accurately by mapping the short reads onto the known reference genome (Bentley et al. 2008; Wang et al. 2008). But to thoroughly annotate insertions, deletions, and structural variations, de novo assembly of each individual genome from these raw short reads is required.Currently, Sanger sequencing technology remains the dominant method for building a reference genome sequence for a species. It is, however, expensive, and this prevents many genome sequencing projects from being put into practice. Over the past 10 yr, only a limited number of plant and animal genomes have been completely sequenced, (http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html), including human (Lander et al. 2001; Venter et al. 2001) and mouse (Mouse Genome Sequencing Consortium 2002), but accurate understanding of evolutionary history and biological processes at a nucleotide level requires substantially more. The development of a de novo short read assembly method would allow the building of reference sequences for these unexplored genomes in a very cost-effective way, opening the door for carrying out numerous substantial new analyses.Several programs, such as phrap (http://www.phrap.org), Celera assembler (Myers et al. 2000), ARACHNE (Batzoglou et al. 2002), Phusion (Mullikin and Ning 2003), RePS (Wang et al. 2002), PCAP (Huang et al. 2003), and Atlas (Havlak et al. 2004), have been successfully used for de novo assembly of whole-genome shotgun (WGS) sequencing reads in the projects applying the Sanger technology. These are based on an overlap-layout strategy, but for very short reads, this approach is unsuitable because it is hard to distinguish correct assembly from repetitive sequence overlap due to there being only a very short sequence overlap between these short reads. 
Also, in practice, it is unrealistic to record into a computer memory all the sequence overlap information from deep sequencing that are made up of huge numbers of short reads.The de Bruijn graph data structure, introduced in the EULER (Pevzner et al. 2001) assembler, is particularly suitable for representing the short read overlap relationship. The advantage of the data structure is that it uses K-mer as vertex, and read path along the K-mers as edges on the graph. Hence, the graph size is determined by the genome size and repeat content of the sequenced sample, and in principle, will not be affected by the high redundancy of deep read coverage. A few short read assemblers, including Velvet (Zerbino and Birney 2008), ALLPATHS (Butler et al. 2008), and EULER-SR (Chaisson and Pevzner 2008), have adopted this algorithm, explicitly or implicitly, and have been implemented and shown very promising performances. Some other short read assemblers have applied the overlap and extension strategy, such as SSAKE (Warren et al. 2007), VCAKE (Jeck et al. 2007) (the follower of SSAKE which can handle sequencing errors), SHARCGS (Dohm et al. 2007), and Edena (Hernandez et al. 2008). However, all these assemblers were designed to handle bacteria- or fungi-sized genomes, and cannot be applied for assembly of large genomes, such as the human, given the limits of the available memory of current supercomputers. Recently, ABySS (Simpson et al. 2009) used a distributed de Bruijn graph algorithm that can split data and parallelize the job on a Linux cluster with message passing interface (MPI) protocol, allowing communication between nodes. Thus, it is able to handle a whole short read data set of a human individual; however, the assembly is very fragmented with an N50 length of ∼1.5 kilobases (kb). This is not long enough for structural variation detection between human individuals, nor is it good enough for gene annotation and further analysis of the genomes of novel species.Here, we present a novel short read assembly method that can build a de novo draft assembly for the human genome. We previously sequenced the complete genome of an Asian individual using a resequencing method, producing a total of 117.7 gigabytes (Gb) of data, and have now an additional 82.5 Gb of paired-end short reads, achieving a 71× sequencing depth of the NCBI human reference sequence. We used this substantial amount of data to test our de novo assembly method, as well as the data from the African genome sequence (Bentley et al. 2008; Wang et al. 2008; Li et al. 2009a). We compared the de novo assemblies to the NCBI reference genome and demonstrated the capability of this method to accurately identify structural variations, especially small deletions and insertions that are difficult to detect using the resequencing method. This software has been integrated into the short oligonucleotide alignment program (SOAP) (Li et al. 2008, 2009b,c) package and named SOAPdenovo to indicate its functionality.  相似文献   
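To make the de Bruijn idea summarized above concrete, the toy Python below builds a graph with k-mers as vertices and consecutive k-mers within a read as edges, and computes an N50 from a list of contig lengths. This is a minimal illustration, not the SOAPdenovo implementation; the example reads, the value of k, and the contig lengths are arbitrary.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """K-mers as vertices; each read contributes edges between its consecutive k-mers."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for a, b in zip(kmers, kmers[1:]):
            graph[a].add(b)
    return graph

def n50(lengths):
    """Smallest length L such that contigs of length >= L cover half of the total bases."""
    total = sum(lengths)
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0

# Toy example: overlapping 9-bp "reads" and a made-up contig length distribution.
graph = build_de_bruijn(["AGCTGACCT", "CTGACCTTA", "GACCTTAGC"], k=5)
print(len(graph), "k-mer vertices with outgoing edges")  # graph size tracks genome content, not read depth
print(n50([7400, 5900, 3000, 1200]), "bp")               # -> 5900 bp
```

As the abstract notes, the graph grows with genome size and repeat content rather than with sequencing depth, which is what makes the representation attractive for deep short-read data.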

2.
Widespread adoption of massively parallel deoxyribonucleic acid (DNA) sequencing instruments has prompted the recent development of de novo short read assembly algorithms. A common shortcoming of the available tools is their inability to efficiently assemble vast amounts of data generated from large-scale sequencing projects, such as the sequencing of individual human genomes to catalog natural genetic variation. To address this limitation, we developed ABySS (Assembly By Short Sequences), a parallelized sequence assembler. As a demonstration of the capability of our software, we assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc. Approximately 2.76 million contigs ≥100 base pairs (bp) in length were created with an N50 size of 1499 bp, representing 68% of the reference human genome. Analysis of these contigs identified polymorphic and novel sequences not present in the human reference assembly, which were validated by alignment to alternate human assemblies and to other primate genomes.Massively parallel sequencing platforms, such as the Illumina, Inc. Genome Analyzer, Applied Biosystems SOLiD System, and 454 Life Sciences (Roche) GS FLX, have provided an unprecedented increase in DNA sequencing throughput. Currently, these technologies produce high-quality short reads from 25 to 500 bp in length, which is substantially shorter than the capillary-based sequencing technology. However, the total number of base pairs sequenced in a given run is orders of magnitude higher. These two factors introduce a number of new informatics challenges, including the ability to perform de novo assembly of millions or even billions of short reads.The field of short read de novo assembly developed from pioneering work on de Bruijn graphs by Pevzner et al. (Pevzner and Tang 2001; Pevzner et al. 2001). The de Bruijn graph representation is prevalent in current short read assemblers, with Velvet (Zerbino and Birney 2008), ALLPATHS (Butler et al. 2008), and EULER-SR (Chaisson and Pevzner 2008) all following this approach. As an alternative, a prefix tree-based approach was introduced by Warren et al. (2007) with their early work on SSAKE. This paradigm was also followed in the VCAKE algorithm by Jeck et al. (2007), and in the SHARCGS algorithm by Dohm et al. (2007). On a third branch, Edena (Hernandez et al. 2008) was an adaptation of the traditional overlap-layout-consensus model to short reads.These short read de novo assemblers are single-threaded applications designed to run on a single processor. However, computation time and memory constraints limit the practical use of these implementations to genomes on the order of a megabase in size. On the other hand, as the next generation sequencing technologies have matured, and read lengths and throughput increase, the application of these technologies to structural analysis of large, complex genomes has become feasible. Notably, the 1000 Genomes Project (www.1000genomes.org) is undertaking the identification and cataloging of human genetic variation by sequencing the genomes of 1000 individuals from a diverse range of populations using short read platforms. Up to this point however, analysis of short read sequences from mammalian-sized genomes has been limited to alignment-based methods (Korbel et al. 2007; Bentley et al. 2008; Campbell et al. 2008; Wheeler et al. 
2008) due to the lack of de novo assembly tools able to handle the vast amount of data generated by these projects.

To assemble the very large data sets produced by sequencing individual human genomes, we have developed ABySS (Assembly By Short Sequences). The primary innovation in ABySS is a distributed representation of a de Bruijn graph, which allows parallel computation of the assembly algorithm across a network of commodity computers. We demonstrate the ability of our assembler to quickly and accurately assemble 3.5 billion short sequence reads generated from whole-genome sequencing of a Yoruban male (NA18507) on the Illumina Genome Analyzer platform.
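ABySS's key idea, as described above, is to distribute the de Bruijn graph over a cluster so that every process can compute which node owns a given k-mer. The sketch below shows that kind of deterministic assignment; the hash function and canonicalization are illustrative assumptions, not the actual ABySS scheme.

```python
import hashlib

def canonical(kmer):
    """Treat a k-mer and its reverse complement as the same vertex (strand-independent)."""
    rc = kmer.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return min(kmer, rc)

def owning_node(kmer, num_nodes):
    """Any process can compute, without communication, which cluster node stores a k-mer."""
    digest = hashlib.md5(canonical(kmer).encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Adjacent k-mers usually live on different nodes, so extending a contig becomes
# a sequence of messages (e.g., over MPI) between the owning nodes.
print(owning_node("GATTACAGATTACAGATTACAGATT", num_nodes=8))
```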

3.
We have developed a novel approach for using massively parallel short-read sequencing to generate fast and inexpensive de novo genomic assemblies comparable to those generated by capillary-based methods. The ultrashort (<100 base) sequences generated by this technology pose specific biological and computational challenges for de novo assembly of large genomes. To account for this, we devised a method for experimentally partitioning the genome using reduced representation (RR) libraries prior to assembly. We use two restriction enzymes independently to create a series of overlapping fragment libraries, each containing a tractable subset of the genome. Together, these libraries allow us to reassemble the entire genome without the need of a reference sequence. As proof of concept, we applied this approach to sequence and assembled the majority of the 125-Mb Drosophila melanogaster genome. We subsequently demonstrate the accuracy of our assembly method with meaningful comparisons against the current available D. melanogaster reference genome (dm3). The ease of assembly and accuracy for comparative genomics suggest that our approach will scale to future mammalian genome-sequencing efforts, saving both time and money without sacrificing quality.Genomes are the fundamental unit by which a species can be defined and form the foundation for deciphering how an organism develops, lives, dies, and is affected by disease. In addition, comparisons of genomes from related species have become a powerful method for finding functional sequences (Eddy 2005; Xie et al. 2005, 2007; Pennacchio et al. 2006; The ENCODE Project Consortium 2007). However, the high cost and effort needed to produce draft genomes with capillary-based sequencing technologies have limited genome-based biological exploration and evolutionary sequence comparisons to dozens of species (Margulies et al. 2007; Stark et al. 2007). This limitation is particularly true for mammalian-sized genomes, which are gigabases in size.With the advent of massively parallel short-read sequencing technologies (Bentley et al. 2008), the cost of sequencing DNA has been reduced by orders of magnitude, now making it possible to sequence hundreds or thousands of genomes. However, the reduced length of the sequence reads, compared with capillary-based approaches, poses new challenges in genome assembly. Here, we sought to address those experimental and bioinformatics hurdles by combining classical biochemical methodologies with new algorithms specifically tailored to handle massive quantities of short-read sequences.To date, the whole-genome shotgun (WGS) approach using massively parallel short-read sequencing has shown significant promise in silico (Butler et al. 2008) and has been applied to de novo sequencing and assembly of small genomes that do not contain an overabundance of low-complexity repetitive sequence (Hernandez et al. 2008). This presents a challenge when scaled to larger more complex genomes, where the information contained in a single short read cannot unambiguously place that read in the genome. Additionally, de novo assembly from WGS short-read sequencing currently requires large computational resources, on the order of hundreds of gigabytes of RAM, when scaled to larger genomes. As a compromise, current mammalian genomic analyses utilizing short-read sequencing technology either use alignments of individual reads against a reference genome (Ley et al. 2008; Wang et al. 
2008) or require elaborate parallelization schemes across a large compute farm (Simpson et al. 2009) for assembly. Regardless of computational improvements, effectively handling repetitive sequences in a whole-genome assembly still remains a challenge.Our goal was to establish wet-lab and bioinformatics methods to rapidly sequence and assemble mammalian-sized genomes in a de novo fashion. Importantly, we wanted an approach that (1) did not rely on existing reference assemblies, (2) could be accomplished using commodity computational hardware, and (3) would yield functional assemblies useful for comparative sequence analyses at a fraction of the time and cost of existing capillary-based methods. To accomplish this, we propose a generic genome partitioning approach to solve both the biological and computational challenges of short-read assembly.Traditionally, partitioning of genomic libraries was accomplished through the use of clonal libraries of bacterial artificial chromosomes (BACs) (or yeast artificial chromosomes). This method accurately partitions genomes into more manageable subregions for sequencing and assembly. However, the high financial cost and overhead associated with creating and maintaining these libraries make this method unattractive for scaling to hundreds of genomes. In addition, a single BAC clone, which contains ∼200 kb of sequence, is not large enough to leverage the amount of sequence obtained from a single lane of Illumina data (currently ∼2.5 Gb of sequence), requiring the need for various pooling or indexing strategies (Meyer et al. 2007). Furthermore, virtually all BAC libraries exhibit some degree of variable cloning bias, with some regions overrepresented and others not at all. In silico studies have investigated the cost-saving potential of using a randomized BAC clone library with short-read sequencing (Sundquist et al. 2007); however, even with this shortcut, the clonal library concept does not lend itself to fast and cheap whole-genome partitioning.We propose a novel partitioning approach using restriction enzymes to create a series of reduced representation (RR) libraries by size fractionation. This method was originally described for single nucleotide polymorphism (SNP) discovery using Sanger-based sequencing methods on AB377 machines (Altshuler et al. 2000) and was subsequently used with massively parallel short-read sequencing (Van Tassell et al. 2008). Importantly, this method allows for the selection of a smaller reproducible subset of the genome for assembly. We extended this concept to create a series of distinct RR libraries consisting of similarly sized restriction fragments. Individually, these libraries represent a tractable subset of the genome for sequencing and assembly; when taken together, they represent virtually the entire genome. Using two separate restriction enzymes generates overlapping libraries, which allow for assembly of the genome without using a reference sequence.As proof of concept, we present here a de novo Drosophila melanogaster genomic assembly, equivalent in utility to a comparative grade assembly (Blakesley et al. 2004). Two enzymes were used to create a total of eight libraries. Short reads (∼36 bp) from each library were sequenced on the Illumina Genome Analyzer and assembled using the short-read assembler Velvet (Zerbino and Birney 2008). Contigs assembled from each library were merged into a single nonoverlapping meta assembly using the lightweight assembly program Minimus (Sommer et al. 2007). 
Furthermore, we sequenced genomic paired-end libraries with short and long inserts to order and orient the contigs into larger genomic scaffolds. When compared with a whole-genome shotgun assembly of the same data, we produce a higher quality assembly more rapidly by reducing the biological complexity and computational cost to assemble each library. Finally, we compare this assembly to the dm3 fly reference to highlight the accuracy of our assembly and utility for comparative sequence analyses. Our results demonstrate that this method is a rapid and cost-effective means to generate high-quality de novo assemblies of large genomes.
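The partitioning step described above can be mimicked in silico: digest a sequence at a restriction recognition site and group the fragments into size-fractionated libraries. The sketch below is illustrative only; the recognition site, the toy sequence, and the size bins are placeholders rather than the enzymes or fractions used in the study.

```python
import re

def digest(sequence, site):
    """Cut the sequence at every occurrence of a restriction recognition site (toy model)."""
    cuts = [m.start() for m in re.finditer(site, sequence)]
    starts = [0] + cuts
    ends = cuts + [len(sequence)]
    return [sequence[s:e] for s, e in zip(starts, ends) if e > s]

def size_fractionate(fragments, bins):
    """Group fragments into reduced representation 'libraries' by length."""
    libraries = {b: [] for b in bins}
    for frag in fragments:
        for lo, hi in bins:
            if lo <= len(frag) < hi:
                libraries[(lo, hi)].append(frag)
                break
    return libraries

fragments = digest("GAATTC".join(["ACGT" * 50, "TTGACC" * 30, "CCGGA" * 80]), site="GAATTC")
libraries = size_fractionate(fragments, bins=[(0, 250), (250, 500)])
print({b: len(frs) for b, frs in libraries.items()})  # fragment counts per size fraction
```

Using a second enzyme in the same way yields a second set of fragments whose boundaries fall elsewhere, which is what gives the overlapping libraries needed to tile the genome without a reference.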

4.
We present a new approach to indel calling that explicitly exploits that indel differences between a reference and a sequenced sample make the mapping of reads less efficient. We assign all unmapped reads with a mapped partner to their expected genomic positions and then perform extensive de novo assembly on the regions with many unmapped reads to resolve homozygous, heterozygous, and complex indels by exhaustive traversal of the de Bruijn graph. The method is implemented in the software SOAPindel and provides a list of candidate indels with quality scores. We compare SOAPindel to Dindel, Pindel, and GATK on simulated data and find similar or better performance for short indels (<10 bp) and higher sensitivity and specificity for long indels. A validation experiment suggests that SOAPindel has a false-positive rate of ∼10% for long indels (>5 bp), while still providing many more candidate indels than other approaches.Calling indels from the mapping of short paired-end sequences to a reference genome is much more challenging than SNP calling because the indel by itself interferes with accurate mapping and therefore indels up to a few base pairs in size are allowed in the most popular mapping approaches (Li et al. 2008; Li and Durbin 2009; Li et al. 2009). The most powerful indel calling approach would be to perform de novo assembly of each genome and identify indels by alignment of genomes. However, this is computationally daunting and requires very high sequencing coverage. Therefore, local approaches offer more promise. Recent approaches exploit the paired-end information to perform local realignment of poorly mapped pairs, thus allowing for longer indels (Ye et al. 2009; Homer and Nelson 2010; McKenna et al. 2010; Albers et al. 2011). One such approach, Dindel, maps reads to a set of candidate haplotypes obtained from mapping or from external information. It uses a probabilistic framework that naturally integrates various sources of sequencing errors and was found to have high specificity for identification of indels of sizes up to half the read length (Albers et al. 2011). Deletions longer than that can be called using split read approaches such as implemented in Pindel (Ye et al. 2009). Long insertions remain problematic because short reads will not span them and a certain amount of de novo assembly is required.Our approach, implemented in SOAPindel, performs full local de novo assembly of regions where reads appear to map poorly as indicated by an excess of paired-end reads where only one of the mates maps. The idea is to collect all unmapped reads at their expected genomic positions, then perform a local assembly of the regions with a high density of such reads and finally align these assemblies to the reference. A related idea has recently been published by Carnevali et al. (2012), but their approach is designed for a different sequencing method, and software is not available for comparison.While conceptually simple, our approach is sensitive to various sources of errors, e.g., false mate pairs, sequencing errors, nonunique mapping, and repetitive sequences. We deal with these complexities by examining all the paths in an extended de Bruijn graph (Zerbino and Birney 2008) and choose those that anchor at some points on the reference genome sequence. In this way, we can detect heterozygous indels as two different paths in the de Bruijn graph and, in principle, call multiallelic indels in polyploid samples or pools of individuals. 
Unlike Pindel, for example, the approach treats insertions and deletions in the same way and has no constraint on indel length other than that determined by the local assembly.

We explore the specificity and sensitivity of SOAPindel through extensive simulations based on the human genome and by indel calling on one of the high-coverage samples of the 1000 Genomes Project. We estimate a low false-positive rate for the de novo indel calls by direct Sanger resequencing, as well as from simulated read data based on the Venter genome and the chimpanzee genome mapped against the reference genome. We benchmark SOAPindel against Dindel, Pindel, and GATK; it shows similar or better specificity and sensitivity for short indels and much higher sensitivity for long indels.
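The core bookkeeping in the approach described above is to place each unmapped read at a putative position inferred from its mapped mate and the library insert size, and then to target regions where such reads pile up for local assembly. The sketch below illustrates only that bookkeeping; the window size, read-count threshold, and coordinate conventions are assumptions, not SOAPindel's actual parameters.

```python
from collections import Counter

def expected_position(mate_pos, mate_on_forward_strand, insert_size, read_len):
    """Place the unmapped read where the library insert size says it should start."""
    if mate_on_forward_strand:
        return mate_pos + insert_size - read_len
    return mate_pos - insert_size + read_len

def assembly_targets(expected_positions, window=500, min_reads=5):
    """Windows with many placed-but-unmapped reads become local de novo assembly targets."""
    counts = Counter(pos // window for pos in expected_positions)
    return [(w * window, (w + 1) * window) for w, c in sorted(counts.items()) if c >= min_reads]

placed = [expected_position(10_000 + i * 7, True, insert_size=500, read_len=100) for i in range(8)]
print(assembly_targets(placed, window=500, min_reads=5))  # one candidate window containing ~10,400-10,450
```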

5.
Recent studies show that along with single nucleotide polymorphisms and small indels, larger structural variants among human individuals are common. The Human Genome Structural Variation Project aims to identify and classify deletions, insertions, and inversions (>5 Kbp) in a small number of normal individuals with a fosmid-based paired-end sequencing approach using traditional sequencing technologies. The realization of new ultra-high-throughput sequencing platforms now makes it feasible to detect the full spectrum of genomic variation among many individual genomes, including cancer patients and others suffering from diseases of genomic origin. Unfortunately, existing algorithms for identifying structural variation (SV) among individuals have not been designed to handle the short read lengths and the errors implied by the “next-gen” sequencing (NGS) technologies. In this paper, we give combinatorial formulations for the SV detection between a reference genome sequence and a next-gen-based, paired-end, whole genome shotgun-sequenced individual. We describe efficient algorithms for each of the formulations we give, which all turn out to be fast and quite reliable; they are also applicable to all next-gen sequencing methods (Illumina, 454 Life Sciences [Roche], ABI SOLiD, etc.) and traditional capillary sequencing technology. We apply our algorithms to identify SV among individual genomes very recently sequenced by Illumina technology.Recent introduction of the next-generation sequencing technologies has significantly changed how genomics research is conducted (Mardis 2008). High-throughput, low-cost sequencing technologies such as pyrosequencing (454 Life Sciences [Roche]), sequencing-by-synthesis (Illumina and Helicos), and sequencing-by-ligation (ABI SOLiD) methods produce shorter reads than the traditional capillary sequencing, but they also increase the redundancy by 10- to 100-fold or more (Shendure et al. 2004; Mardis 2008). With the arrival of these new sequencing technologies, along with the capability of sequencing paired-ends (or “matepairs”) of a clone insert that follows a tight length distribution (Raphael et al. 2003; Volik et al. 2003; Dew et al. 2005; Tuzun et al. 2005; Korbel et al. 2007; Bashir et al. 2008; Kidd et al. 2008; Lee et al. 2008), it is becoming feasible to perform detailed and comprehensive genome variation and rearrangement studies.The genetic variation among human individuals has been traditionally analyzed at the single nucleotide polymorphism (SNP) level as demonstrated by the HapMap Project (International HapMap Consortium 2003, 2005), where the genomes of 270 individuals were systematically genotyped for 3.1 million SNPs. However, human genetic variation extends beyond SNPs. The Human Genome Structural Variation Project (Eichler et al. 2007) has been initiated to identify and catalog structural variation (SV). In the broadest sense, SV can be defined as the genomic changes among individuals that are not single nucleotide variants (Tuzun et al. 2005; Eichler et al. 2007). These include insertions, deletions, duplications, inversions, and translocations (Feuk et al. 2006; Sharp et al. 2006) (see Supplemental material for details on types of SV).End-sequence profiling (ESP) was first presented by Volik et al. (2003) and Raphael et al. (2003) to discover SV events using bacterial artificial chromosome (BAC) end sequences to map structural rearrangements in cancer cell line genomes, and it was used by Tuzun et al. 
(2005) to systematically discover structural variants in the genome of a human individual. Several other genome-wide studies (Iafrate et al. 2004; Sebat et al. 2004; Redon et al. 2006; Cooper et al. 2007; Korbel et al. 2007) demonstrated that SV among normal individuals is common and ubiquitous. More recently, Kidd et al. (2008) detected, experimentally validated, and sequenced SV from eight different individuals. The ESP method was also utilized by Dew et al. (2005) to evaluate and compare assemblies and detect assembly breakpoints.As the promise of these next-generation sequencing (NGS) technologies became reality with the publication of the first three human genomes sequenced with NGS platforms (Bentley et al. 2008; Wang et al. 2008; Wheeler et al. 2008), the sequencing of more than 1000 individuals (http://www.1000genomes.org), computational methods for analyzing and managing the massive numbers of the short-read pairs produced by these platforms are urgently needed to effectively detect SNPs, SVs, and copy-number variants (Pop and Salzberg 2008). Since most SV events are found in the duplicated regions (Eichler et al. 2007; Kidd et al. 2008), the algorithms must also be able to discover variation in the repetitive regions of the human genome.Detection of SVs in the human genome using NGS technologies was first presented by Korbel et al. (2007). In this study, paired-end sequences generated with the 454 Life Sciences (Roche) platform were employed to detect SVs in two human individuals; however, the same algorithms and heuristics designed for capillary-based sequencing presented by Tuzun et al. (2005) were used, and no further optimizations for NGS were introduced. Campbell et al. (2008) employed Illumina sequencing to discover genome rearrangements in cancer cell lines; however, they considered one “best” paired map location per insert, by the use of the alignment tool MAQ (Li et al. 2008), and thus did not utilize the full information produced by high-throughput sequencing methods. In the first study on the genome sequenced with a NGS platform (Illumina) that produced paired-end sequences, Bentley et al. (2008) also detected SVs using the same methods and unique map locations of the sequenced reads.More recently, Lee et al. (2008) presented a probabilistic method for detecting SV. In this work, a scoring function for each SV was defined as a weighted sum of (1) sequence similarity, (2) length of SV, and (3) the square of the number of paired-end reads supporting the SV. The scoring function was computed via a hill-climbing strategy to assign paired-end reads to SVs. In theory, the method of Lee et al. (2008) can be applied to data generated by new sequencing technologies; however, the experiments presented in this work were based on capillary sequencing (Levy et al. 2007). In another study, Bashir et al. (2008) presented a computational framework to evaluate the use of paired-end sequences to detect genome rearrangements and fusion genes in cancer; note that no NGS data were utilized in this study due to lack of availability of sequences at the time of publication.In this paper, we present novel combinatorial algorithms for SV detection using the paired-end, NGS methods. In comparison to “naïve” methods for SV detection, our algorithms evaluate all potential mapping locations of each paired-end read and decide on the final mapping and the SVs they imply interdependently. 
We define two alternative formulations for the problem of computationally predicting the SV between a reference genome sequence (i.e., human genome assembly) and a set of paired-end reads from a whole genome shotgun (WGS) sequence library obtained via an NGS method from an individual genome sequence. The first formulation, which aims to obtain the most parsimonious mapping of paired-end reads to the potential structural variants, is called Maximum Parsimony Structural Variation Problem (MPSV). MPSV problem turns out to be NP-hard; we give a simple O(log n) approximation algorithm to solve this problem in polynomial time. This algorithm is based on the classical approximation algorithm to solve the “Set-Cover” problem from the combinatorial algorithms literature and thus is called the VariationHunter-Set Cover method (abbreviated VariationHunter-SC). The second formulation aims to calculate the probability of each SV. For this variant we give expressions for (1) the probability of each possible SV conditioned on other SVs and the paired-end reads that “support them,” and (2) the probability of mapping each paired-end read to a particular location, conditioned on the set of SVs that are “realized.” We show how to obtain a consistent set of solutions to these expressions iteratively. The resulting algorithm is called VariationHunter-Probabilistic (VariationHunter-Pr). We test our algorithms (VariationHunter-SC and VariationHunter-Pr) on a paired-end WGS library generated with Illumina technology and compare the results with the validated SV set from the genome of the same individual, obtained via fosmid-based capillary end-sequencing (Kidd et al. 2008). We compare our results with the SV calls reported earlier on the same data set (Bentley et al. 2008), which was based on mapping each paired-end read to a single location (with the minimum number of mismatches) and clustering the mappings greedily to obtain the SVs.  相似文献   
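The MPSV formulation described above is approximated with the classical greedy set-cover heuristic, which is what gives the stated O(log n) guarantee. The sketch below shows only that greedy core, under the assumption that discordant paired-end mappings have already been grouped into candidate SV clusters; the clustering and the probabilistic scoring of VariationHunter-Pr are not shown.

```python
def greedy_sv_cover(candidates):
    """candidates: dict mapping a candidate SV (cluster id) to the set of discordant
    paired-end reads it would explain. Greedily pick a parsimonious explaining subset."""
    unexplained = set().union(*candidates.values())
    chosen = []
    while unexplained:
        best = max(candidates, key=lambda sv: len(candidates[sv] & unexplained))
        gain = candidates[best] & unexplained
        if not gain:          # remaining reads support no candidate; stop
            break
        chosen.append(best)
        unexplained -= gain
    return chosen

candidates = {
    "del_chr1_A": {"p1", "p2", "p3", "p4"},
    "del_chr1_B": {"p3", "p4"},
    "inv_chr2_C": {"p5", "p6"},
}
print(greedy_sv_cover(candidates))  # ['del_chr1_A', 'inv_chr2_C']
```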

6.
Sequencing a genome to great depth can be highly informative about heterogeneity within an individual or a population. Here we address the problem of how to visualize the multiple layers of information contained in deep sequencing data. We propose an interactive AJAX-based web viewer for browsing large data sets of aligned sequence reads. By enabling seamless browsing and fast zooming, the LookSeq program assists the user to assimilate information at different levels of resolution, from an overview of a genomic region to fine details such as heterogeneity within the sample. A specific problem, particularly if the sample is heterogeneous, is how to depict information about structural variation. LookSeq provides a simple graphical representation of paired sequence reads that is more revealing about potential insertions and deletions than are conventional methods.New technologies for massively parallel DNA sequencing allow a genome to be sequenced to great depth, i.e., with many sequence reads covering each nucleotide position (Margulies et al. 2005; Shendure et al. 2005; Bentley et al. 2008; Branton et al. 2008; Campbell et al. 2008a; Harris et al. 2008; Hillier et al. 2008; Shendure and Ji 2008). Deep sequencing data can be valuable for many different purposes. It has been estimated that approximately 30-fold depth of paired 35-base reads is needed to discover 99% of true variants in non-repeat regions of the genome of a diploid individual (Bentley et al. 2008; H Li et al. 2008). In excess of 100-fold coverage may be needed for assembling short sequence reads in highly variable regions of the genome. And the possibility of sequencing to a much greater depth, e.g., more than 104-fold, creates unprecedented opportunities to investigate the genetic basis of heterogeneity within individual biological samples. This has many potential clinical and biological applications, e.g., to analyze viral mutation rates (Harris et al. 2008) or to investigate the genetic driving forces that determine the evolution of a cancer cell population within an individual patient (Campbell et al. 2008b).The first stage in analysis of deep sequencing data is to align sequence reads against a reference genome with sufficient confidence to identify true variants. This can be an exceptionally complex analytical problem, particularly when attempting to align short sequence reads of imperfect quality to a highly variable or repetitive genome. A good example of how this problem may be addressed is the popular MAQ software (H Li et al. 2008), which assigns to each genotype call an error probability that is based on a number of factors, including raw sequence quality and mapping quality, a measure of the confidence that individual reads have been mapped to the correct location in the genome (H Li et al. 2008). The development and optimization of alignment algorithms for short sequence reads is currently a very active area of research (Bentley et al. 2008; Hillier et al. 2008; H Li et al. 2008; R Li et al. 2008; Lin et al. 2008; Schatz et al. 2007b; Smith et al. 2008; Langmead et al. 2009). Many algorithms allow the user to define a number of parameters that effectively alter the stringency of alignment or filter the output to achieve the optimal trade-off between false positive and false negative results. Thus a common problem confronting an investigator is to visually inspect the data to compare the alignments and variant calls made by different algorithms or by a specific algorithm at different parameter settings. 
If sequencing has been performed to great depth, and particularly if the biological sample is heterogeneous, then the process of data visualization is nontrivial.

Many of the available tools for visualizing sequence read alignments derive from pioneering work on genome sequence assembly based on capillary sequencing data (Dear and Staden 1991; Bonfield et al. 1995; Gordon et al. 1998; Schatz et al. 2007a). This approach has been usefully extended for next-generation sequencing, e.g., in EagleView, client-installed software that can handle a large volume of sequence reads and that allows assemblies to be constructed from multiple technology platforms (Huang and Marth 2008).

Here, we explore a different approach to the visualization of deep sequencing data. We address a general problem inherent in very large data sets, i.e., that too detailed a view tends to drown the user in data, whereas too condensed a view may lose important details in the abundance of data. Specifically, we address the problem that a simple pile up of sequence alignments is a useful way of displaying the detail of a conventional genome assembly, but it is impractical for deep sequencing data as only a proportion of reads can be viewed at a time, whereas a collapsed view that summarizes information across all reads might obscure potentially important details such as heterogeneity, outliers, and haplotypic relationships. Our proposed solution aims to make it as easy as possible for an investigator to browse across a large genomic region and to zoom in to inspect interesting features at any desired level of resolution.
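The paired-read display mentioned above is informative because the apparent span of a properly oriented pair shifts when it straddles an indel. The sketch below flags such pairs from their mapped coordinates; the expected insert size and tolerance are arbitrary example values, and this is an illustration of the underlying signal rather than LookSeq's rendering code.

```python
def flag_indel_candidates(pairs, expected_insert=500, tolerance=150):
    """pairs: (leftmost start, rightmost end) of properly oriented read pairs.
    A stretched span suggests a deletion in the sample; a compressed span, an insertion."""
    flags = []
    for start, end in pairs:
        span = end - start
        if span > expected_insert + tolerance:
            flags.append((start, end, "possible deletion"))
        elif span < expected_insert - tolerance:
            flags.append((start, end, "possible insertion"))
    return flags

print(flag_indel_candidates([(1000, 1510), (2000, 2950), (5000, 5220)]))
# -> [(2000, 2950, 'possible deletion'), (5000, 5220, 'possible insertion')]
```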

7.
We present the first Korean individual genome sequence (SJK) and analysis results. The diploid genome of a Korean male was sequenced to 28.95-fold redundancy using the Illumina paired-end sequencing method. SJK covered 99.9% of the NCBI human reference genome. We identified 420,083 novel single nucleotide polymorphisms (SNPs) that are not in the dbSNP database. Despite a close similarity, significant differences were observed between the Chinese genome (YH), the only other Asian genome available, and SJK: (1) 39.87% (1,371,239 out of 3,439,107) SNPs were SJK-specific (49.51% against Venter''s, 46.94% against Watson''s, and 44.17% against the Yoruba genomes); (2) 99.5% (22,495 out of 22,605) of short indels (< 4 bp) discovered on the same loci had the same size and type as YH; and (3) 11.3% (331 out of 2920) deletion structural variants were SJK-specific. Even after attempting to map unmapped reads of SJK to unanchored NCBI scaffolds, HGSV, and available personal genomes, there were still 5.77% SJK reads that could not be mapped. All these findings indicate that the overall genetic differences among individuals from closely related ethnic groups may be significant. Hence, constructing reference genomes for minor socio-ethnic groups will be useful for massive individual genome sequencing.In 1977, the first full viral genome sequence was published (Sanger et al. 1977), and 3 yr later the same group (Anderson et al. 1981) sequenced the complete human mitochondrial genome. These early and subsequent genome projects lay the foundation for sequencing the first human genome that was completed in 2004 (International Human Genome Sequencing) (Lander et al. 2001; Venter et al. 2001; International Human Genome Sequencing Consortium 2004). Since then, we have seen astounding progress in sequencing technology which has opened a way for personal genomics (Church 2005; Shendure and Ji 2008; von Bubnoff 2008). The first personal genome (HuRef, Venter) was sequenced by the conventional Sanger dideoxy method, which is still the method of choice for de novo sequencing due to its long read-lengths of up to ∼1000 bp and per-base accuracies as high as 99.999% (Shendure and Ji 2008). Using this method, Levy et al. (2007) assembled diploid sequences with phase information that has not been performed in other genomes published. Despite limitations in read length, which is extremely important for the assembly of contigs and final genomes (Sundquist et al. 2007), it is the next-generation sequencing (NGS) technology that has made personal genomics possible by dramatically reducing the cost and increasing the efficiency (Mardis 2008; Shendure and Ji 2008). To date, at least four individual genome sequences, analyzed by NGS, have been published (Bentley et al. 2008; Ley et al. 2008; Wang et al. 2008; Wheeler et al. 2008). Using NGS for resequencing, researchers can simply map short read NGS data to known reference genomes, avoiding expensive and laborious long fragment based de novo assembly. As demonstrated by a large percentage of unmapped data in previous human genome resequencing projects, however, it should be noted that a resequenced genome may not fully reflect ethnic and individual genetic differences because its assembly is dependent on the previously sequenced genome. 
After the introduction of NGS, the bottleneck in sequencing the genomes of a whole population is no longer the sequencing process itself, but the bioinformatics: fast and accurate mapping to known data, structural variation analysis, phylogenetic analysis, association studies, and application to phenotypes such as diseases.

The full analysis of a human genome is far from complete, in contrast to the case of phi X 174 sequenced by the Sanger group in the 1970s. For example, the NCBI human reference genome, an essential tool for genome resequencing by NGS, does not reflect an ideal picture of a human genome in terms of the number of base pairs sequenced and of genes determined. Furthermore, a recent study reported that 13% of sequence reads were not mapped to the NCBI reference genome (Wang et al. 2008). This is one of the reasons the Korean reference genome construction project was initiated. Koreans and Chinese are thought to have originated from the same ancestors and to have admixed for thousands of years. Comparing genome-scale variation between the two, in relation to other already known individual genomes, has given us insight into how distinct they are from each other.

Here, we report SJK, the first full-length Korean individual genome sequence (SJK, from Seong-Jin Kim, the genome donor), accompanied by genotype information from the donor and his mother. The SJK sequence was first released in December 2008 as the result of the Korean reference genome construction project and has been freely available at ftp://ftp.kobic.kr/pub/KOBIC-KoreanGenome/.

9.
Although many de novo genome assembly projects have recently been conducted using high-throughput sequencers, assembling highly heterozygous diploid genomes is a substantial challenge due to the increased complexity of the de Bruijn graph structure predominantly used. To address the increasing demand for sequencing of nonmodel and/or wild-type samples, in most cases inbred lines or fosmid-based hierarchical sequencing methods are used to overcome such problems. However, these methods are costly and time consuming, forfeiting the advantages of massive parallel sequencing. Here, we describe a novel de novo assembler, Platanus, that can effectively manage high-throughput data from heterozygous samples. Platanus assembles DNA fragments (reads) into contigs by constructing de Bruijn graphs with automatically optimized k-mer sizes followed by the scaffolding of contigs based on paired-end information. The complicated graph structures that result from the heterozygosity are simplified during not only the contig assembly step but also the scaffolding step. We evaluated the assembly results on eukaryotic samples with various levels of heterozygosity. Compared with other assemblers, Platanus yields assembly results that have a larger scaffold NG50 length without any accompanying loss of accuracy in both simulated and real data. In addition, Platanus recorded the largest scaffold NG50 values for two of the three low-heterozygosity species used in the de novo assembly contest, Assemblathon 2. Platanus therefore provides a novel and efficient approach for the assembly of gigabase-sized highly heterozygous genomes and is an attractive alternative to the existing assemblers designed for genomes of lower heterozygosity.With the rapid progress in sequencing technologies, the throughput of sequencers has approached hundreds of billions of base pairs per run. Despite the drawbacks of short read lengths, a number of draft genomes have been constructed solely from these short-read data at an increasingly accelerated pace (Li et al. 2009b; Al-Dous et al. 2011; Jex et al. 2011; Kim et al. 2011; The Potato Genome Sequencing Consortium 2011; Murchison et al. 2012). The draft genome assemblies from high-throughput short reads primarily use de Bruijn-graph-based algorithms (Pevzner et al. 2001; Vinson et al. 2005; Zerbino and Birney 2008; Gnerre et al. 2011). During de novo assembly, the nodes of the de Bruijn graphs represent k-mers in the reads, and the edges represent (k − 1) overlaps between the k-mers. The graph can be simplified in a variety of ways; and as a consequence, assembled contigs or scaffolds are constructed from subgraphs lacking junctions. The most distinctive advantage of this approach is the computational efficiency that results from omitting the costly pairwise alignment steps that are required in traditional overlap-layout-consensus algorithms (Kurtz et al. 2004). The de Bruijn graph is constructed from information derived from precise k-mer overlaps; therefore, its calculation cost is relatively low. Although mismatches between k-mers caused by sequencing errors may occur, their distributions are expected to be random, such that sufficient sequence coverage would resolve the sequence error by removing the short, thin tips. Therefore, this approach is suitable for the assembly of a huge number of short reads from a massively parallel sequencer.Despite its strong functionality, several obstacles remain in applying de Bruijn-graph-based assembly to the data from massively parallel sequencers. 
One of the primary difficulties to overcome is the existence of heterozygosity between diploid chromosomes (Vinson et al. 2005; Velasco et al. 2007; The Potato Genome Sequencing Consortium 2011; Star et al. 2011; Takeuchi et al. 2012; Zhang et al. 2012; Nystedt et al. 2013; You et al. 2013; Zheng et al. 2013). In cases in which a de Bruijn graph is built up from a diploid sample, different k-mers derived from the heterozygous regions corresponding to each homologous chromosome are created and used in the graph structures. As a result, junctions are created in the graph, which represent the borders between homozygous and heterozygous regions. This phenomenon leads to bubble structures in the graph, and most of the existing de Bruijn-graph-based assemblers attempt to simplify such structures by cutting the edge surrounding the junctions and splitting them into multiple straight graphs (Pevzner et al. 2001; Zerbino and Birney 2008; Li et al. 2010; Gnerre et al. 2011). To overcome this problem, many assemblers have developed a common solution by removing one of the similar sequences in a bubble structure with a pairwise alignment. This approach is effective for genome sequences with lower rates of nonstructural variations; however, the assembly of highly heterozygous organisms may encounter more serious problems caused by a high density of single nucleotide variants (SNVs) and structural variations (e.g., repeat sequences and coverage gaps). Algorithms to simply remove bubbles, which are used by the existing de Bruijn-graph-based assemblers, may not be sufficient to resolve these problems.Thus, several advanced techniques have been used to sequence highly heterozygous genomes. The establishment of inbred lines is the most popular method for targeting highly heterozygous genomes, but this method is both time consuming and costly. Inconveniently, in some cases inbreeding methods can fail to eliminate high levels of heterozygosity; thus, these inbred samples can be unsuitable for use with existing whole-genome shotgun assembly methods (Zhang et al. 2012; You et al. 2013). In contrast, in the Potato Genome Project (The Potato Genome Sequencing Consortium 2011) a homozygous doubled-monoploid clone was first generated using classical tissue culture techniques and then sequenced. However, this method can also be fairly costly and is not always technically possible. Consequently, the fosmid-based hierarchical sequencing method has been increasingly used for sequencing highly heterozygous samples, such as oyster (Zhang et al. 2012), diamondback moth (You et al. 2013), and Norway spruce (Nystedt et al. 2013). Although these approaches have been successful in meeting the functional goals of each sequencing project, all are costly compared with a simple whole-genome shotgun sequencing strategy. Model organisms whose lineages have been maintained in laboratories have long been the main targets of genome sequencing. However, various wild-type organisms that may have highly heterozygous genomes are now targets; thus, a more efficient method to assemble such genomes is needed to further accelerate the genome sequencing of a wide range of organisms.Here we describe a novel de novo sequence assembler, called Platanus, that can reconstruct genomic sequences of highly heterozygous diploids from massively parallel shotgun sequencing data. 
Like other de Bruijn-graph-based assemblers, Platanus first constructs contigs from a de Bruijn graph and then builds up scaffolds from the contigs using paired-end or mate-pair libraries. However, various improvements (e.g., k-mer auto-extension) have been implemented to allow Platanus to efficiently handle gigabase-sized and relatively repetitive genomes. In addition, Platanus efficiently captures heterozygous regions containing structural variations, repeats, and/or low-coverage sites; it can merge haplotypes during not only the contig assembly step but also the scaffolding step to overcome the challenge of heterozygosity. The key algorithms of Platanus and the results of an intensive evaluation of Platanus using both simulated and real data, including data from highly heterozygous genomes and data used in the de novo assembly contest Assemblathon 2 (Bradnam et al. 2013), are described here.
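Heterozygosity appears in the graph as the bubble structures discussed above. The sketch below finds the simplest such bubbles (two single-node branches that reconverge immediately) and collapses them by keeping one branch; Platanus's actual bubble handling and scaffold-level haplotype merging are considerably more involved, so treat this purely as an illustration of the structure being removed.

```python
def simple_bubbles(graph):
    """graph: dict node -> list of successor nodes.
    Report nodes whose two branches reconverge one step later (SNV-sized heterozygous bubble)."""
    found = []
    for node, succs in graph.items():
        if len(succs) == 2:
            a, b = succs
            shared = set(graph.get(a, [])) & set(graph.get(b, []))
            if shared:
                found.append((node, a, b, shared.pop()))
    return found

def collapse_bubble(graph, bubble, keep_first=True):
    """Merge haplotypes by discarding one branch of the bubble."""
    node, a, b, _sink = bubble
    drop = b if keep_first else a
    graph[node] = [s for s in graph[node] if s != drop]
    graph.pop(drop, None)
    return graph

g = {"x": ["h1", "h2"], "h1": ["y"], "h2": ["y"], "y": ["z"]}
for bubble in simple_bubbles(g):
    g = collapse_bubble(g, bubble)
print(g)  # {'x': ['h1'], 'h1': ['y'], 'y': ['z']}
```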

10.
Emerging next-generation sequencing technologies have revolutionized the collection of genomic data for applications in bioforensics, biosurveillance, and for use in clinical settings. However, to make the most of these new data, new methodology needs to be developed that can accommodate large volumes of genetic data in a computationally efficient manner. We present a statistical framework to analyze raw next-generation sequence reads from purified or mixed environmental or targeted infected tissue samples for rapid species identification and strain attribution against a robust database of known biological agents. Our method, Pathoscope, capitalizes on a Bayesian statistical framework that accommodates information on sequence quality, mapping quality, and provides posterior probabilities of matches to a known database of target genomes. Importantly, our approach also incorporates the possibility that multiple species can be present in the sample and considers cases when the sample species/strain is not in the reference database. Furthermore, our approach can accurately discriminate between very closely related strains of the same species with very little coverage of the genome and without the need for multiple alignment steps, extensive homology searches, or genome assembly—which are time-consuming and labor-intensive steps. We demonstrate the utility of our approach on genomic data from purified and in silico “environmental” samples from known bacterial agents impacting human health for accuracy assessment and comparison with other approaches.The accurate and rapid identification of species and strains of pathogens is an essential component of biosurveillance from both human health and biodefense perspectives (Vaidyanathan 2011). For example, misidentification was among the issues that resulted in a 3-wk delay in accurate diagnosis of the recent outbreak of hemorrhagic Escherichia coli being due to strain O104:H4, resulting in over 3800 infections across 13 countries in Europe with 54 deaths (Frank et al. 2011). The most accurate diagnostic information, necessary for species identification and strain attribution, comes from the most refined level of biological data—genomic DNA sequences (Eppinger et al. 2011). Advances in DNA-sequencing technologies allows for the rapid collection of extraordinary amounts of genomic data, yet robust approaches to analyze this volume of data are just developing, from both statistical and algorithmic perspectives.Next-generation sequencing approaches have revolutionized the way we collect DNA sequence data, including for applications in pathology, bioforensics, and biosurveillance. Given a particular clinical or metagenomic sample, our goal is to identify the specific species, strains, or substrains present in the sample, as well as accurately estimate the proportions of DNA originating from each source genome in the sample. Current approaches for next-gen sequencing usually have read lengths between 25 and 1000 bp; however, these sequencing technologies include error rates that vary by approach and by samples. Such variation is typically less important for species identification given the relatively larger genetic divergences among species than among individuals within species. 
But for strain attribution, sequencing error has the potential to swamp out discriminatory signal in a data set, necessitating highly sensitive and refined computational models and a robust database for both species identification and strain attribution.Current methods for classifying metagenomic samples rely on one or more of three general approaches: composition or pattern matching (McHardy et al. 2007; Brady and Salzberg 2009; Segata et al. 2012), taxonomic mapping (Huson et al. 2007; Meyer et al. 2008; Monzoorul Haque et al. 2009; Gerlach and Stoye 2011; Patil et al. 2012; Segata et al. 2012), and whole-genome assembly (Kostic et al. 2011; Bhaduri et al. 2012). Composition and pattern-matching algorithms use predetermined patterns in the data, such as taxonomic clade markers (Segata et al. 2012), k-mer frequency, or GC content, often coupled with sophisticated classification algorithms such as support vector machines (McHardy et al. 2007; Patil et al. 2012) or interpolated Markov Models (Brady and Salzberg 2009) to classify reads to the species of interest. These approaches require intensive preprocessing of the genomic database before application. In addition, the classification rule and results can often change dramatically depending on the size and composition of the genome database.Taxonomy-based approaches typically rely on a “lowest common ancestor” approach (Huson et al. 2007), meaning that they identify the most specific taxonomic group for each read. If a read originates from a genomic region that shares homology with other organisms in the database, the read is assigned to the lowest taxonomic group that contains all of the genomes that share the homologous region. These methods are typically highly accurate for higher-level taxonomic levels (e.g., phylum and family), but experience reduced accuracy at lower levels (e.g., species and strain) (Gerlach and Stoye 2011). Furthermore, these approaches are not informative when the reads originate from one or more species or strains that are closely related to each other or different organisms in the database. In these cases, all of the reads can be reassigned to higher-level taxonomies, thus failing to identify the specific species or strains contained in the sample.Assembly-based algorithms can often lead to the most accurate strain identification. However, these methods also require the assembly of a whole genome from a sample, which is a computationally difficult and time-consuming process that requires large numbers of reads to achieve an adequate accuracy—often on the order of 50–100× coverage of the target genome (Schatz et al. 2010). Given current sequencing depths, obtaining this level of coverage is usually possible for purified samples, but coverage levels may not be sufficient for mixed samples or in multiplexed sequencing runs. Assembly approaches are further complicated by the fact that data collection at a crime scene or hospital might include additional environmental components in the biological sample (host genome or naturally occurring bacterial and viral species), thus requiring multiple filtering and alignment steps in order to obtain reads specific to the pathogen of interest.Here we describe an accurate and efficient approach to analyze next-generation sequence data for species identification and strain attribution that capitalizes on a Bayesian statistical framework implemented in the new software package Pathoscope v1.0. 
Our approach accommodates information on sequence quality and mapping quality, and provides posterior probabilities of matches to a known database of reference genomes. Importantly, our approach incorporates the possibility that multiple species can be present in the sample or that the target strain is not even contained within the reference database. It also accurately discriminates between very closely related strains of the same species with much less than 1× coverage of the genome and without the need for sequence assembly or complex preprocessing of the database or taxonomy. No other method in the literature can identify species or substrains in such a direct and automatic manner and without the need for large numbers of reads. We demonstrate our approach through application to next-generation DNA sequence data from a recent outbreak of the hemorrhagic E. coli (O104:H4) strain in Europe (Frank et al. 2011; Rohde et al. 2011; Turner 2011) and on purified and in silico mixed samples from several other known bacterial agents that impact human health. Software and data examples for our approach are freely available for download at https://sourceforge.net/projects/pathoscope/.
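The reassignment idea described above can be illustrated with a bare-bones EM mixture model: alternately compute per-read posterior genome assignments and re-estimate genome proportions. This sketch omits Pathoscope's sequence- and mapping-quality terms and its "unknown source" category, and the alignment likelihoods below are made-up numbers.

```python
def reassign_reads(read_likelihoods, n_iter=50):
    """read_likelihoods: one dict per read, {genome: alignment likelihood}.
    Returns (estimated genome proportions, per-read posterior probabilities)."""
    genomes = sorted({g for read in read_likelihoods for g in read})
    proportions = {g: 1.0 / len(genomes) for g in genomes}
    posteriors = []
    for _ in range(n_iter):
        posteriors = []
        mass = {g: 0.0 for g in genomes}
        for read in read_likelihoods:
            weights = {g: proportions[g] * lik for g, lik in read.items()}
            total = sum(weights.values())
            post = {g: w / total for g, w in weights.items()}
            posteriors.append(post)
            for g, p in post.items():
                mass[g] += p
        proportions = {g: mass[g] / len(read_likelihoods) for g in genomes}
    return proportions, posteriors

reads = [{"O104:H4": 0.9, "O157:H7": 0.8}, {"O104:H4": 0.9}, {"O104:H4": 0.7, "O157:H7": 0.1}]
props, posts = reassign_reads(reads)
print({g: round(p, 2) for g, p in props.items()})  # most of the mixture mass shifts to O104:H4
```

Reads with genuinely ambiguous alignments keep split posteriors, while reads that align well to only one strain pull the estimated proportions, and hence other reads' assignments, toward it.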

11.
Eliminating the bacterial cloning step has been a major factor in the vastly improved efficiency of massively parallel sequencing approaches. However, this also has made it a technical challenge to produce the modern equivalent of the Fosmid- or BAC-end sequences that were crucial for assembling and analyzing complex genomes during the Sanger-based sequencing era. To close this technology gap, we developed Fosill, a method for converting Fosmids to Illumina-compatible jumping libraries. We constructed Fosmid libraries in vectors with Illumina primer sequences and specific nicking sites flanking the cloning site. Our family of pFosill vectors allows multiplex Fosmid cloning of end-tagged genomic fragments without physical size selection and is compatible with standard and multiplex paired-end Illumina sequencing. To excise the bulk of each cloned insert, we introduced two nicks in the vector, translated them into the inserts, and cleaved them. Recircularization of the vector via coligation of insert termini followed by inverse PCR generates a jumping library for paired-end sequencing with 101-base reads. The yield of unique Fosmid-sized jumps is sufficiently high, and the background of short, incorrectly spaced and chimeric artifacts sufficiently low, to enable applications such as mapping of structural variation and scaffolding of de novo assemblies. We demonstrate the power of Fosill to map genome rearrangements in a cancer cell line, identifying three fusion genes that were corroborated by RNA-seq data. Our Fosill-powered assembly of the mouse genome has an N50 scaffold length of 17.0 Mb, rivaling the connectivity (16.9 Mb) of the Sanger-sequencing-based draft assembly. Paired-end sequencing of large DNA fragments cloned in Fosmid (Kim et al. 1992) or BAC (Shizuya et al. 1992) vectors was a mainstay of genome projects during the Sanger-based sequencing era. The large spans, particularly of BAC ends, helped resolve long repeats and segmental duplications and provided long-range connectivity in shotgun assemblies of complex genomes (Adams et al. 2000; Venter et al. 2001; Waterston et al. 2002). Fosmids are shorter than BACs but much easier to generate. Their consistent, narrow insert-size distribution centered around 35–40 kb enabled the scanning of individual human genomes with read pairs to detect structural variation such as insertions, deletions, and inversions (International Human Genome Sequencing Consortium 2004; Tuzun et al. 2005; Kidd et al. 2008). Massively parallel genome-sequencing technologies no longer rely on cloning DNA fragments in a bacterial host. The platforms currently on the market (454, Illumina, SOLiD, Ion Torrent) replaced vectors with synthetic adapters and bacterial colonies with PCR-amplified “clones” of DNA fragments tethered to a bead (Margulies et al. 2005; McKernan et al. 2009) or with “colonies” of identical molecules grown by bridge PCR amplification on a solid surface (Bentley et al. 2008). However, none of these platforms can handle DNA molecules much longer than 1 kb. Consequently, paired-end sequencing of DNA fragments >1 kb by these technologies requires “jumping” constructs (Collins and Weissman 1984; Poustka et al.
1987): the ends of size-selected genomic DNA fragments are brought together by circularization, the bulk of the intervening DNA is excised, and the coligated junction fragments are isolated and end-sequenced.Suitable protocols exist for converting sheared and size-selected DNA samples to jumping libraries and for generating read pairs that span several kb of genomic distance which is generally sufficient to fashion accurate and highly contiguous de novo assemblies of microbial genomes from massively parallel short sequencing reads (MacCallum et al. 2009; Nelson et al. 2010; Nowrousian et al. 2010). However, early short-read assemblies of complex genomes, including human genomes, turned out fragmented—despite jumps up to ∼12 kb in length (Li et al. 2010a,b; Schuster et al. 2010; Yan et al. 2011). Without the equivalent of Fosmid or BAC end sequences, the N50 scaffold length (a measure of long-range connectivity) of these assemblies was <1.3 Mb. By comparison, largely owing to paired-end reads from large-insert clones, some of the best traditional Sanger-based mammalian draft assemblies had N50 scaffold lengths of >40 Mb (Lindblad-Toh et al. 2005; Mikkelsen et al. 2007).Constructing a jumping library entails numerous physical and enzymatic DNA manipulations. Several steps, notably size selection and circularization of genomic DNA fragments in vitro, become increasingly difficult and inefficient as the desired jump length, and hence, fragment length, goes up. In contrast, Fosmid cloning employs a sophisticated biological machinery to carry out these critical steps: Large fragments are size-selected (and short competing fragments excluded) by packaging in bacteriophage λ; once inside the Escherichia coli host, cohesive ends mediate efficient circularization—aided by the cellular machinery and a powerful selection for circular amplicons.To our knowledge, no jumping library constructed to date from uncloned DNA fragments has approached the average span (35–40 kb) and complexity (>105 independent clones per μg of input DNA) of a traditional Fosmid library. To close this technology gap, we and others have taken a hybrid approach wherein Fosmid libraries are constructed first and then converted to Fosmid-size jumps in vitro (Gnerre et al. 2011; Hampton et al. 2011).Here, we present the experimental details of the “Fosill” concept (Gnerre et al. 2011) as well as extensive improvements of the original protocol. The term Fosill stands for paired-end sequencing of Fosmid libraries by Illumina, though we note that this approach should work for any massively parallel sequencing technology that can generate paired reads. We describe the methodology and novel cloning vectors that enable molecular barcoding of DNA inserts and multiplex Fosmid library construction without physical size selection of sheared genomic DNA. We demonstrate the power of Fosill to detect structural abnormalities in cancer genomes and to improve de novo assemblies of mammalian genomes from short reads.  相似文献   
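The N50 scaffold lengths quoted above are computed from the distribution of scaffold sizes. The short sketch below shows the standard calculation, assuming only a list of scaffold lengths as input; the 17.0-Mb figure itself comes from the authors' assembly, not from this code.

    def n50(lengths):
        """Return the N50 of a set of contig or scaffold lengths: the length L
        such that scaffolds of size >= L together cover at least half of the
        total assembly span."""
        total = sum(lengths)
        running = 0
        for length in sorted(lengths, reverse=True):
            running += length
            if 2 * running >= total:
                return length
        return 0

    # Toy example: total span is 20, so N50 is 7 (8 + 7 >= 10).
    print(n50([8, 7, 3, 2]))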

12.
Obtaining high-quality sequence continuity of complex regions of recent segmental duplication remains one of the major challenges of finishing genome assemblies. In the human and mouse genomes, this was achieved by targeting large-insert clones using costly and laborious capillary-based sequencing approaches. Sanger shotgun sequencing of clone inserts, however, has now been largely abandoned, leaving most of these regions unresolved in newer genome assemblies generated primarily by next-generation sequencing hybrid approaches. Here we show that it is possible to resolve regions that are complex in a genome-wide context but simple in isolation for a fraction of the time and cost of traditional methods using long-read single molecule, real-time (SMRT) sequencing and assembly technology from Pacific Biosciences (PacBio). We sequenced and assembled BAC clones corresponding to a 1.3-Mbp complex region of chromosome 17q21.31, demonstrating 99.994% identity to Sanger assemblies of the same clones. We targeted 44 differences using Illumina sequencing and find that PacBio and Sanger assemblies share a comparable number of validated variants, albeit with different sequence context biases. Finally, we targeted a poorly assembled 766-kbp duplicated region of the chimpanzee genome and resolved the structure and organization for a fraction of the cost and time of traditional finishing approaches. Our data suggest a straightforward path for upgrading genomes to a higher quality finished state.Complete high-quality sequence assembly remains a difficult problem for the de novo assembly of genomes (Alkan et al. 2011b; Church et al. 2011; Salzberg et al. 2012). Finishing of the human and mouse genomes involved selecting large-insert BAC clones and subjecting them to capillary-based shotgun sequence and assembly (English et al. 2012). Sanger-based assembly of large-insert clones has been typically a time-consuming and expensive operation requiring the infrastructure of large genome sequencing centers and specialists focused on particular problematic or repetitive regions (Zody et al. 2008; Dennis et al. 2012; Hughes et al. 2012). Such activities can significantly improve the quality of genomes, including the discovery of missing genes and gene families. A recent effort to upgrade the mouse genome assembly, for example, resulted in the correction or addition of 2185 genes, 61% of which corresponded to lineage-specific segmental duplications (Church et al. 2009). Within the human genome, there are >900 annotated genes mapping to large segmental duplications. About half of these map to particularly problematic regions of the genome where annotation and genetic variation are poorly understood (Sudmant et al. 2010). Such genes are typically missing or misassembled in working draft assemblies of genomes. These include genes such as the SRGAP2 family, which evolved specifically in the human lineage and is thought to be important in the development of the human brain (Charrier et al. 2012; Dennis et al. 2012). Other regions (e.g., 17q21.31 inversion) show incredible structural diversity, predispose specific populations to disease, and have been the target of remarkable selection in the human lineage (Stefansson et al. 2005; Zody et al. 2008; Steinberg et al. 2012). 
Such structurally complex regions were not resolved within the human reference sequence until large-insert clones were recovered and completely sequenced.The widespread adoption of next-generation sequencing methods for de novo genome assemblies has complicated the assembly of repetitive sequences and their organization. Although we can generate much more sequence, the short sequence read data and inability to scaffold across repetitive structures translates into more gaps, missing data, and more incomplete reference assemblies (Alkan et al. 2011a; Salzberg et al. 2012). Due to budgetary constraints, traditional capillary-based sequencing capacity as well as genome finishing efforts have dwindled in sequencing centers leaving most of the complex regions of working draft genomes unresolved. Clone-based hierarchical approaches remain important for reducing the complexity of genomes, but even targeted sequencing of these clones using short-read data fails to completely resolve and assemble these regions due to the presence of highly identical repeat sequences common in mammalian genomes. Here, we tested the efficacy of a method developed for finishing microbial genomes (Chin et al. 2013) to a 1.3-Mbp complex region of human chromosome 17q21.31 previously sequenced and assembled using traditional Sanger-based approaches. We directly compared sequenced and assembled clones and validated differences to highlight advantages and limitations of the different technologies. We then applied the approach to a previously uncharacterized, highly duplicated region of the chimpanzee genome and show that we can rapidly resolve the structure and organization of the region using this approach.  相似文献   
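As a rough illustration of the concordance comparison described above, the sketch below computes percent identity and collects differing columns from a pairwise alignment of two assemblies of the same clone. It assumes the alignment has already been produced by an external aligner and is not the authors' validation pipeline; the difference list is simply what one would hand off for targeted validation.

    def assembly_concordance(aln_a, aln_b):
        """Compare two aligned assembly sequences (equal length, '-' for gaps).
        Returns percent identity over aligned columns and the differing columns
        (substitutions or indels) for follow-up validation."""
        assert len(aln_a) == len(aln_b)
        matches, diffs = 0, []
        for i, (a, b) in enumerate(zip(aln_a, aln_b)):
            if a == b and a != '-':
                matches += 1
            elif a != b:
                diffs.append((i, a, b))
        aligned_cols = sum(1 for a, b in zip(aln_a, aln_b) if a != '-' or b != '-')
        identity = 100.0 * matches / aligned_cols
        return identity, diffs

    # Toy example: one substitution and one single-base indel.
    print(assembly_concordance("ACGT-ACGT", "ACGTTACGA"))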

13.
We present the discovery of genes recurrently involved in structural variation in nasopharyngeal carcinoma (NPC) and the identification of a novel type of somatic structural variant. We identified the variants with high complexity mate-pair libraries and a novel computational algorithm specifically designed for tumor-normal comparisons, SMASH. SMASH combines signals from split reads and mate-pair discordance to detect somatic structural variants. We demonstrate a >90% validation rate and a breakpoint reconstruction accuracy of 3 bp by Sanger sequencing. Our approach identified three in-frame gene fusions (YAP1-MAML2, PTPLB-RSRC1, and SP3-PTK2) that had strong levels of expression in corresponding NPC tissues. We found two cases of a novel type of structural variant, which we call “coupled inversion,” one of which produced the YAP1-MAML2 fusion. To investigate whether the identified fusion genes are recurrent, we performed fluorescent in situ hybridization (FISH) to screen 196 independent NPC cases. We observed recurrent rearrangements of MAML2 (three cases), PTK2 (six cases), and SP3 (two cases), corresponding to a combined rate of structural variation recurrence of 6% among tested NPC tissues.Nasopharyngeal carcinoma (NPC) is a malignant neoplasm of the head and neck originating in the epithelial lining of the nasopharynx. It has a high incidence among the native people of the American Arctic and Greenland and in southern Asia (Yu and Yuan 2002). NPC is strongly linked to consumption of Cantonese salted fish (Ning et al. 1990) and infection with Epstein-Barr virus (EBV) (Raab-Traub 2002), which is almost invariably present within the cancer cells and is thought to promote oncogenic transformation (zur Hausen et al. 1970). A challenging feature of NPC genome sequencing is that significant lymphocyte infiltration (e.g., 80% of cells in a sample) (Jayasurya et al. 2000) is common, requiring special laboratory and bioinformatic approaches not necessary for higher-purity tumors (Mardis et al. 2009).Currently, cancer genomes are analyzed by reading short sequences (usually 100 bases) from the ends of library DNA inserts 200–500 bp in length (Meyerson et al. 2010). For technical reasons, it is difficult to obtain deep genome coverage with inserts exceeding 600 bp using this approach. Much larger inserts can be produced by circularizing large DNA fragments of up to 10 kb (Fullwood et al. 2009; Hillmer et al. 2011), and subsequent isolation of a short fragment that contains both ends (mate pairs). Large-insert and fosmid mate-pair libraries offer several attractive features that make them well-suited for analysis of structural variation (Raphael et al. 2003; International Human Genome Sequencing Consortium 2004; Tuzun et al. 2005; Kidd et al. 2008; Hampton et al. 2011; Williams et al. 2012). First, mate pairs inherently capture genomic structure in that discordantly aligning mate-pair reads occur at sites of genomic rearrangements, exposing underlying lesions (Supplemental Fig. S1a). Second, large-insert mate-pair libraries deliver deep physical coverage of the genome (100–1000×), reliably revealing somatic structural variants even in specimens with low tumor content (Supplemental Fig. S1a,b). Third, variant-supporting mate-pair reads from large inserts may align up to several kilobases away from a breakpoint, beyond the repeats that often catalyze structural variants (Supplemental Fig. S1c). 
For these reasons, large insert mate-pair libraries have been used extensively for de novo assembly of genomes and for identification of inherited structural variants (International Human Genome Sequencing Consortium 2001). In principle, they should also be well-suited for analysis of “difficult” low-tumor purity cancer tissues such as NPCs. In a recent study, paired-end fosmid sequencing libraries with nearly 40 kb inserts were adapted to Illumina sequencing and applied to the K562 cell line (Williams et al. 2012). Due to low library complexity, of 33.9 million sequenced read pairs, only about 7 million were unique (corresponding to about 0.5× genome sequence coverage), but nonetheless facilitated structural variation detection.Mate-pair techniques have not yet been applied to produce truly deep sequencing data sets of tumor-normal samples (30× or greater sequence coverage), presumably due to the difficulty of retaining sufficient library complexity to support deep sequencing. We have improved the efficiency of library preparation by combining two existing protocols (Supplemental Fig. S2), and so were able to generate 3.5-kb insert libraries with sufficient genomic complexity to enable deep sequencing of two NPC genomes. To take full advantage of unique features offered by large-insert libraries, such as the large footprints of breakpoint-spanning inserts (Supplemental Fig. S1c) and the correlation between the two ends of alignment coordinates of breakpoint-spanning inserts, we also developed a novel somatic structural variant caller. SMASH (Somatic Mutation Analysis by Sequencing and Homology detection) is specifically designed to accurately map somatic structural lesions, including deletions, duplications, translocations, and large duplicative insertions via direct comparison of tumor and normal data sets.Structural variation methods, such as GASV (Sindi et al. 2009), SegSeq (Chiang et al. 2009), DELLY (Rausch et al. 2012b), HYDRA (Quinlan et al. 2010), AGE (Abyzov and Gerstein 2011), and others (Lee et al. 2008; for review, see Snyder et al. 2010; Alkan et al. 2011), generally utilize (1) read-pair (RP) discordance, (2) increase or reduction in sequence coverage, (3) split reads that span breakpoints, and (4) exact assembly of breakpoint sequences. These tools were primarily designed for variant detection from a single data set, such as a normal genome, and are suited for cataloguing structural polymorphisms in the human population (Kidd et al. 2008, 2010; Mills et al. 2011). However, specific detection of somatic structural variants in cancer using these tools typically requires additional downstream custom analysis to enable “subtraction” of germline variants from the tumor variant calls (Rausch et al. 2012a). This limits the general utility of such tools for somatic variant detection. Recently, as an increasing number of studies specifically focused on somatic mutations, dedicated somatic variant callers such as CREST (Wang et al. 2011) have been developed. CREST relies on detection of partially aligned reads (known as “soft clipping”) across a breakpoint.SMASH adopts a hybrid approach to somatic variant detection. It relies on read-pair discordance to discover somatic breakpoints and then uses split reads to refine their coordinates. Furthermore, SMASH incorporates a number of important quality measures and filters, which are critical for minimizing the rate of false positive somatic SV calls (Supplemental Material). 
As we demonstrate here with both simulated and real sequence data, such a hybrid approach delivers high sensitivity of somatic SV detection due to the read-pair discordance, high accuracy of breakpoint coordinates enabled by split reads, and overall low false discovery rate due to extensive use of quality measures.  相似文献   
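The read-pair discordance signal at the heart of this strategy can be illustrated with a simplified sketch (not SMASH itself, which adds split-read breakpoint refinement and many quality filters): mate pairs whose ends land on different chromosomes or at an implausible distance are clustered, and clusters also supported in the matched normal are discarded as germline. The insert-size, bin, and support thresholds below are placeholders.

    from collections import defaultdict

    def discordant(pair, mean_insert=3500, tolerance=1000):
        """A read pair is treated as discordant if its ends map to different
        chromosomes or the implied insert size deviates strongly from the
        library mean (orientation checks omitted for brevity)."""
        (chrom1, pos1), (chrom2, pos2) = pair
        if chrom1 != chrom2:
            return True
        return abs(abs(pos2 - pos1) - mean_insert) > tolerance

    def somatic_clusters(tumor_pairs, normal_pairs, bin_size=10000, min_support=3):
        """Cluster discordant tumor pairs by the genomic bins of both ends and keep
        clusters with no discordant support in the matched normal, a crude form of
        germline subtraction."""
        def key(pair):
            (c1, p1), (c2, p2) = sorted(pair)
            return (c1, p1 // bin_size, c2, p2 // bin_size)

        tumor = defaultdict(int)
        for p in tumor_pairs:
            if discordant(p):
                tumor[key(p)] += 1
        normal = {key(p) for p in normal_pairs if discordant(p)}
        return [k for k, n in tumor.items() if n >= min_support and k not in normal]

Because large-insert libraries give very deep physical coverage, even a low tumor fraction contributes several discordant pairs per true breakpoint, which is what makes this style of clustering workable for low-purity samples.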

14.
15.
Next-generation sequencing technologies have made it possible to sequence targeted regions of the human genome in hundreds of individuals. Deep sequencing represents a powerful approach for the discovery of the complete spectrum of DNA sequence variants in functionally important genomic intervals. Current methods for single nucleotide polymorphism (SNP) detection are designed to detect SNPs from single individual sequence data sets. Here, we describe a novel method, SNIP-Seq (single nucleotide polymorphism identification from population sequence data), that leverages sequence data from a population of individuals to detect SNPs and assign genotypes to individuals. To evaluate our method, we utilized sequence data from a 200-kilobase (kb) region on chromosome 9p21 of the human genome. This region was sequenced in 48 individuals (five sequenced in duplicate) using the Illumina GA platform. Using this data set, we demonstrate that our method is highly accurate for detecting variants and can filter out false SNPs that are attributable to sequencing errors. The concordance of sequencing-based genotype assignments between duplicate samples was 98.8%. The 200-kb region was independently sequenced to a high depth of coverage using two sequence pools containing the 48 individuals. Many of the novel SNPs identified by SNIP-Seq from the individual sequencing were validated by the pooled sequencing data and were subsequently confirmed by Sanger sequencing. We estimate that SNIP-Seq achieves a low false-positive rate of ∼2%, improving upon the higher false-positive rate for existing methods that do not utilize population sequence data. Collectively, these results suggest that analysis of population sequencing data is a powerful approach for the accurate detection of SNPs and the assignment of genotypes to individual samples. With the availability of several next-generation sequencing platforms, the cost of DNA sequencing has dropped dramatically over the past few years and improvements in technology are expected to decrease the cost further (Shendure and Ji 2008). Next-generation sequencers, such as the Illumina Genome Analyzer (GA), can generate gigabases of nucleotides per day and have enabled the sequencing of complete individual human genomes (Bentley et al. 2008; Ley et al. 2008; Wang et al. 2008; Wheeler et al. 2008; McKernan et al. 2009). While the resequencing of complete human genomes still remains quite expensive, the targeted sequencing of specific genomic intervals in a large population of individuals is now feasible in an individual laboratory. Resequencing of coding sequences of genes in large populations has previously been shown to be useful for identifying multiple rare variants affecting quantitative traits (Cohen et al. 2004, 2006; Ji et al. 2008). Resequencing of genomic regions identified by genome-wide association studies in healthy and diseased populations represents a powerful strategy for assessing the contribution of rare variants to disease etiology. Nejentsev et al. (2009) have used this approach to identify four rare variants protective for type 1 diabetes. For harnessing the capacity of next-generation sequencers for deep population resequencing, the first challenge is to selectively capture DNA from the region of interest. Recently, Craig et al. (2008) used long-range PCR and DNA barcodes to sequence specific regions of the human genome in multiple samples simultaneously using the Illumina GA. Harismendy et al.
(2009) also used long-range PCR to sequence targeted regions of the human genome using multiple sequencing platforms to evaluate the feasibility of targeted population sequencing and the concordancy of variant calling between the different platforms. However, traditional sequence capture methods, such as long-range polymerase chain reaction (LR-PCR), are not adequate for capturing thousands of noncontiguous regions of the genome, e.g., all exons, in a large number of samples. Several high-throughput target capture methods have been developed (Hodges et al. 2007; Okou et al. 2007; Porreca et al. 2007; Turner et al. 2009).After millions of reads have been generated by the sequencer, the next challenge is to identify genetic variants by mapping the reads to a reference sequence. A variety of tools have been developed that can efficiently align hundreds of millions of short reads to a reference sequence even in the presence of multiple errors in the reads (Li et al. 2008; Langmead et al. 2009; Li and Durbin 2009; Li et al. 2009b). Each base mismatch in an aligned read represents either a sequencing error or a single nucleotide variant in the diploid individual. To compensate for the high sequencing error rates of next-generation sequencing platforms, one requires the presence of multiple overlapping reads, each with a base different from the reference base for single nucleotide polymorphism (SNP) calling. Base quality values—probability estimates of the correctness of a base call—are particularly useful for distinguishing sequencing errors from SNPs. The Illumina sequencing system generates a phred-like quality score for each base call using various predictors of the sequencing errors. SNP calling methods for Illumina sequence data utilize these base quality values to compute the likelihood of different genotypes at each position using Bayesian or statistical models (Li et al. 2008, 2009a). Positions for which the most likely genotype is different from the reference genotype and which satisfy additional filters on neighborhood sequence quality, read alignment quality, etc. are reported as SNPs. However, sequencing errors for the Illumina GA are not completely random and are dependent on the local sequence context of the base being read, the position of the base in the read, etc. (Dohm et al. 2008; Erlich et al. 2008). Therefore, assuming independence between multiple base calls, each with a non-reference base, results in overcalling of SNPs, i.e., increased number of false-positives. To reduce the number of false variant calls, the MAQ SNP caller (Li et al. 2008) uses a dependency model to estimate an average error rate using all base quality scores.MAQ and other SNP calling methods have enabled fairly accurate detection of SNPs from resequencing of individual human genomes (Bentley et al. 2008; Wang et al. 2008). However, there is potential for developing more accurate SNP detection methods, in particular, by taking advantage of sequence information from a population of sequenced individuals. Comparison of sequenced reads for a potential variant site across multiple individuals has the potential to differentiate systematic sequencing errors from real SNPs. Patterns of mismatched bases (bases not matching the reference base) resulting from systematic sequencing errors are likely to be shared across individuals. On the other hand, the profiles of mismatched bases between individuals with and without a SNP are likely to be distinct. 
Comparison of read alignments across multiple individuals also has the potential to filter out SNPs that are an artifact of inaccurate read alignments. We present a probabilistic model that leverages sequence data from a population of individuals, each sequenced separately, for detecting single nucleotide variants and also assigning genotypes to each individual in the population. Our method recalibrates each base quality value by adding a population error correction to the Illumina base error probability. This correction is computed using the distribution of mismatched bases across all sequenced individuals. The recalibrated base quality values are then used to compute genotype probabilities for each individual under a simple Bayesian model that assumes independence between base calls. Finally, positions in the sequence with one or more individuals showing evidence for harboring a non-reference allele are identified as SNPs. Craig et al. (2008) described a similar approach for SNP detection using sequence data from multiple individuals where they used Bayes factors to compare the fraction of reads with the alternate allele across multiple individuals. Sites at which one or more individuals have a fraction of reads with the alternate allele sufficiently greater than the average were identified as SNPs. Our model is much more general and can take advantage of the complete information about each base call, i.e., base quality value, position in the read containing the base, and the strand to which the read aligns. To evaluate our population SNP detection method, we analyzed sequence data from a 200-kilobase (kb)-long region on chromosome 9p21 that was sequenced to a median depth of 45× in 48 individuals using the Illumina Genome Analyzer (O Harismendy, V Bansal, N Rahim, X Wang, N Heintzman, B Ren, EJ Topol, and KA Frazer, in prep.). We demonstrate that our method can accurately detect SNPs with a low false-positive rate (∼2%) and a low false-negative rate in comparison to SNP detection from individual sequence data using MAQ. By comparing genotype calls between replicate samples, we show a 98.8% accuracy for sequence-based genotyping using our method.  相似文献
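A minimal sketch of the two ideas just described, written from the description rather than taken from the SNIP-Seq code: a per-base error estimate that folds in the population-wide mismatch rate at the site, and a simple Bayesian genotype posterior for one individual computed from the recalibrated error probabilities. The prior, pseudocount, and cap values are illustrative.

    import math

    def recalibrated_error(phred_q, mismatch_count, depth_across_pop, pseudocount=1.0):
        """Add an empirical population error rate (fraction of mismatched bases at
        this position across all sequenced individuals) to the sequencer's
        phred-based error estimate; a crude stand-in for the published recalibration."""
        machine_err = 10 ** (-phred_q / 10.0)
        pop_err = (mismatch_count + pseudocount) / (depth_across_pop + 2 * pseudocount)
        return min(machine_err + pop_err, 0.75)   # capped at the random-call error rate

    def genotype_posteriors(bases, errors, ref, alt, prior_het=0.001):
        """Posterior over {ref/ref, ref/alt, alt/alt} for one individual at one site,
        assuming independent base calls given the genotype."""
        priors = {"RR": 1 - 1.5 * prior_het, "RA": prior_het, "AA": prior_het / 2}
        allele_prob = {"RR": (1.0, 0.0), "RA": (0.5, 0.5), "AA": (0.0, 1.0)}
        logpost = {}
        for g, (p_ref, p_alt) in allele_prob.items():
            ll = math.log(priors[g])
            for b, e in zip(bases, errors):
                p_b = (p_ref * ((1 - e) if b == ref else e / 3)
                       + p_alt * ((1 - e) if b == alt else e / 3))
                ll += math.log(max(p_b, 1e-300))
            logpost[g] = ll
        m = max(logpost.values())
        z = sum(math.exp(v - m) for v in logpost.values())
        return {g: math.exp(v - m) / z for g, v in logpost.items()}

    # Toy example: six reads at one site, two carrying the alternate allele.
    errs = [recalibrated_error(30, 2, 2000) for _ in range(6)]
    print(genotype_posteriors("AAAAGG", errs, ref="A", alt="G"))

A site where most individuals show the same low-level mismatch picks up a large population error correction and is damped, whereas a genuine heterozygote stands out against the population background.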

16.
17.
Methods for the direct detection of copy number variation (CNV) genome-wide have become effective instruments for identifying genetic risk factors for disease. The application of next-generation sequencing platforms to genetic studies promises to improve sensitivity to detect CNVs as well as inversions, indels, and SNPs. New computational approaches are needed to systematically detect these variants from genome sequence data. Existing sequence-based approaches for CNV detection are primarily based on paired-end read mapping (PEM) as reported previously by Tuzun et al. and Korbel et al. Due to limitations of the PEM approach, some classes of CNVs are difficult to ascertain, including large insertions and variants located within complex genomic regions. To overcome these limitations, we developed a method for CNV detection using read depth of coverage. Event-wise testing (EWT) is a method based on significance testing. In contrast to standard segmentation algorithms that typically operate by performing likelihood evaluation for every point in the genome, EWT works on intervals of data points, rapidly searching for specific classes of events. Overall false-positive rate is controlled by testing the significance of each possible event and adjusting for multiple testing. Deletions and duplications detected in an individual genome by EWT are examined across multiple genomes to identify polymorphism between individuals. We estimated error rates using simulations based on real data, and we applied EWT to the analysis of chromosome 1 from paired-end shotgun sequence data (30×) on five individuals. Our results suggest that analysis of read depth is an effective approach for the detection of CNVs, and it captures structural variants that are refractory to established PEM-based methods.Structural variants (SVs) in the human genome (Iafrate et al. 2004; Sebat et al. 2004; Feuk et al. 2006a), including copy number variants (CNVs) and balanced rearrangements such as inversions and translocations, play an important role in the genetics of complex disease. Analysis of CNV in diseases such as cancer (Lucito et al. 2000; Pollack et al. 2002; Albertson and Pinkel 2003), and in developmental and neuropsychiatric disorders (Feuk et al. 2006b; Sebat et al. 2007; Kirov et al. 2008, 2009; Marshall et al. 2008; Mefford et al. 2008; Rujescu et al. 2008; Stefansson et al. 2008; Stone et al. 2008; Walsh et al. 2008; Zhang et al. 2008), has led to the identification of novel disease-causing mutations, thus contributing important new insights into the genetics of these disorders.Our current power to detect SVs in disease studies is limited by the resolution of microarray analysis. Currently available array platforms that consist of more than 1 million probes have a lower limit of detection of ∼10–25 kb (McCarroll et al. 2008; Cooper et al. 2008). More comprehensive studies of individual genomes using sequencing-based approaches are capable of detecting CNVs <1 kb in size (Tuzun et al. 2005; Korbel et al. 2007; Bentley et al. 2008; Wang et al. 2008). Thus, new sequencing technologies promise to enable more comprehensive detection of SVs as well as indels and point mutations (Mardis 2008).New computational methods are needed that can reliably identify SVs using next-generation sequencing platforms. 
To date, multiple approaches have been developed for the detection of SVs that are based on paired-end read mapping (PEM), which detects insertions and deletions by comparing the distance between mapped read pairs to the average insert size of the genomic library (Tuzun et al. 2005; Korbel et al. 2007). Advantages of this approach include the sensitivity for detecting deletions <1 kb in size, and localizing the breakpoint within the region of a small fragment. This approach also has certain limitations. In particular, PEM-based methods have poor ascertainment of SVs in complex genomic regions rich in segmental duplications and have limited ability to detect insertions larger than the average insert size of the library (Tuzun et al. 2005). We sought to develop an alternative approach to the detection of SVs from sequence data that complements existing methods. Here we used the depth of coverage in sequence data from the Illumina Genome Analyzer to look for genomic regions that differ in copy number between individuals. This method is based on the depth of single reads and, hence, is orthogonal to methods that are based on the mapping of paired-end sequences. To detect CNVs based on read depth (RD), we developed a pipeline consisting of three steps, as illustrated in Figure 1: (1) First, we estimated the coverage or RD in nonoverlapping intervals across an individual genome, (2) we implemented a novel CNV-calling algorithm to detect events, and (3) we compared data from multiple individuals to distinguish events that are polymorphic (i.e., CNVs) from those that show similarly increased or decreased copy number in all individuals in this study (i.e., monomorphic events). Here we demonstrate the feasibility of this approach and its unique advantages in comparison with other methods of SV detection.
Figure 1. Pipeline for the detection of CNVs based on analysis of read depth (RD). (A) RD was determined by counting the start position of reads in nonoverlapping windows of 100 bp. (B) Events were detected using a custom CNV-calling algorithm, event-wise testing (EWT). (C) Each event was examined in multiple genomes in order to distinguish polymorphic events (CNVs) from the majority of events that were found to show a similar copy number change in all five genomes in this study (i.e., monomorphic events).  相似文献
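The sketch below mirrors steps A and B of the pipeline in Figure 1 in simplified form: read starts are counted in nonoverlapping 100-bp windows, and runs of consecutive windows with unusually high or low depth are reported. It is a stand-in for event-wise testing, which instead tests interval significance directly and corrects for multiple testing; the z-score and run-length cutoffs here are arbitrary.

    import math

    def window_read_depth(read_starts, chrom_length, window=100):
        """Count read start positions in nonoverlapping windows (step A above)."""
        n_windows = chrom_length // window
        counts = [0] * n_windows
        for pos in read_starts:
            if pos < n_windows * window:
                counts[pos // window] += 1
        return counts

    def call_events(counts, min_windows=3, z_cut=2.0):
        """Flag candidate deletions/duplications as runs of consecutive windows
        whose depth deviates from the chromosome average."""
        mean = sum(counts) / len(counts)
        sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts)) or 1.0
        events, start, sign = [], None, 0
        for i, c in enumerate(counts + [mean]):      # sentinel closes any open run
            z = (c - mean) / sd
            s = 1 if z > z_cut else (-1 if z < -z_cut else 0)
            if s != sign:
                if sign != 0 and i - start >= min_windows:
                    events.append((start, i, "duplication" if sign > 0 else "deletion"))
                start, sign = i, s
        return events

Comparing the flagged intervals across several genomes (step C) then separates polymorphic CNVs from intervals that deviate in every individual, which more often reflect reference or mappability artifacts than true variation.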

18.
We describe a statistical and comparative-genomic approach for quantifying error rates of genome sequence assemblies. The method exploits not substitutions but the pattern of insertions and deletions (indels) in genome-scale alignments for closely related species. Using two- or three-way alignments, the approach estimates the amount of aligned sequence containing clusters of nucleotides that were wrongly inserted or deleted during sequencing or assembly. Thus, the method is well-suited to assessing fine-scale sequence quality within single assemblies, between different assemblies of a single set of reads, and between genome assemblies for different species. When applying this approach to four primate genome assemblies, we found that average gap error rates per base varied considerably, by up to sixfold. As expected, bacterial artificial chromosome (BAC) sequences contained lower, but still substantial, predicted numbers of errors, arguing for caution in regarding BACs as the epitome of genome fidelity. We then mapped short reads, at approximately 10-fold statistical coverage, from a Bornean orangutan onto the Sumatran orangutan genome assembly originally constructed from capillary reads. This resulted in a reduced gap error rate and a separation of error-prone from high-fidelity sequence. Over 5000 predicted indel errors in protein-coding sequence were corrected in a hybrid assembly. Our approach contributes a new fine-scale quality metric for assemblies that should facilitate development of improved genome sequencing and assembly strategies.Genome sequence assemblies form the bedrock of genome research. Any errors within them directly impair genomic and comparative genomic predictions and inferences based upon them. The prediction of functional elements or the elucidation of the evolutionary provenance of genomic sequence, for example, relies on the fidelity and completeness of these assemblies. Imperfections, such as erroneous nucleotide substitutions, insertions or deletions, or larger-scale translocations, may misinform genome annotations or analyses (Salzberg and Yorke 2005; Choi et al. 2008; Phillippy et al. 2008). Insertion and deletion (indel) errors are particularly hazardous to the prediction of protein-coding genes since many introduce frame-shifts to otherwise open reading frames. Noncoding yet functional sequence can be identified from a deficit of indels (Lunter et al. 2006), but only where this evolutionary signal has not been obscured by indel errors. Several high-quality reference genomes currently exist, and many errors in initial draft genome sequence assemblies have been rectified in later more finished assemblies. However, because of the substantial costs involved, among the mammals only the genomes of human, mouse, and dog have been taken (or are being taken) toward “finished” quality, defined as fewer than one error in 104 bases and no gaps (International Human Genome Sequencing Consortium 2004; Church et al. 2009). It is likely that other draft genome assemblies will remain in their unfinished states until technological improvements substantially reduce the cost of attaining finished genome quality.Genome assemblies have been constructed from sequence data produced by different sequencing platforms and strategies, and using a diverse array of assembly algorithms (e.g., PCAP [Huang et al. 2003], ARACHNE [Jaffe et al. 2003], Atlas [Havlak et al. 2004], PHUSION [Mullikin and Ning 2003], Jazz [Aparicio et al. 2002], and the Celera Assembler [Myers et al. 2000]). 
The recent introduction of new sequencing technologies (Mardis 2008) further complicates genome assemblies, as each platform exhibits read lengths and error characteristics very different from those of Sanger capillary sequencing reads. These new technologies have also spawned additional assembly and mapping algorithms, such as Velvet (Zerbino and Birney 2008) and MAQ (Li et al. 2008). Considering the methodological diversity of sequence generation and assembly, and the importance of high-quality primary data to biologists, there is a clear need for an objective and quantitative assessment of the fine-scale fidelity of the different assemblies.One frequently discussed property of genome assemblies is the N50 value (Salzberg and Yorke 2005). This is defined as the weighted median contig size, so that half of the assembly is covered by contigs of size N50 or larger. While the N50 value thus quantifies the ability of the assembler algorithm to combine reads into large seamless blocks, it fails to capture all aspects of assembly quality. For example, artefactually high N50 values can be obtained by lowering thresholds for amalgamating smaller blocks of—often repetitive—contiguous reads, resulting in misassembled contigs, although approaches to ameliorate such problems are being developed (Bartels et al. 2005; Dew et al. 2005; Schatz et al. 2007). Some validation of the global assembly accuracy, as summarized by N50, can be achieved by comparison with physical or genetic maps or by alignment to related genomes. Contiguity can also be quantified from the alignment of known cDNAs or ESTs. More regional errors can be indicated by fragmentation, incompleteness, or exon noncollinearity of gene models, or by unexpectedly high read depths that often reflect collapse of virtually identical segmental duplications.In addition to these problems, N50 values fail to reflect fine-scale inaccuracies, such as substitution and indel errors. Quality at the nucleotide level is summarized as a phred score, with scores exceeding 40 indicating finished sequence (Ewing and Green 1998) and corresponding to an error rate of less than one base in 10,000. Once assembled, a base is assigned a consensus quality score (CQS) depending on its read depth and the quality of each base contributing to that position (Huang and Madan 1999). Finally, assessing sequence error has traditionally relied on comparison with bacterial artificial chromosome (BAC) sequence. Discrepancies between assembly and BAC sequences are assumed to reflect errors in the draft sequence, although a minority may remain in the finished BAC sequence.Here, we introduce a statistical and comparative genomics method that quantifies the fine-scale quality of a genome assembly and that has the merit of being complementary to the aforementioned approaches. Instead of considering rates of nucleotide substitution errors in an assembly, which are already largely indicated by CQSs, the method quantifies genome assembly quality by the rate of insertion and deletion errors in alignments. This approach estimates the abundance of indel errors between aligned genome pairs, by separating these from true evolutionary indels.Previously, we demonstrated that in the absence of selection, indel mutations leave a precise and determinable fingerprint on the distribution of ungapped alignment block lengths (Lunter et al. 2006). 
We refer to these block lengths, which represent the distances between successive indel mutations (appearing as gaps within genome alignments), as intergap segment (IGS) lengths. Under the neutral indel model, these IGS lengths are expected to follow a geometric frequency distribution whenever sequence has been free of selection. There is substantial evidence that the large majority of mammalian genome sequence has evolved neutrally (Mouse Genome Sequencing Consortium 2002; Lunter et al. 2006). More specifically, virtually all transposable elements (TEs) have, upon insertion, subsequently been free of purifying selection (Lunter et al. 2006; Lowe et al. 2007). This absence of selection manifests itself in IGS in ancestral repeats (those TEs that were inserted before the common ancestor of two species), closely following the geometric frequency distribution expected of neutral sequence (Fig. 1A).
Figure 1. Genomic distribution of intergap segment lengths in mouse-rat alignments for ancestral repeats (A) and whole-genome sequences (B). Frequencies of IGS lengths are shown on a natural log scale. The black line represents the prediction of the neutral indel model, a geometric distribution of IGS lengths; observed counts (blue circles) are accumulated in 5 bp bins of IGS lengths. Within mouse-rat ancestral repeat sequence, the observations fit the model accurately for IGS between 10 bp and 300 bp. For whole-genome data, a similarly close fit is observed for IGS between 10 bp and 100 bp. Beyond 100 bp, an excess of longer IGSs (green) above the quantities predicted by the neutral indel model can be observed, representing functional sequence that has been conserved with regard to indel mutations. The depletion of short (<10 bp) IGS reflects a “gap attraction” phenomenon (Lunter et al. 2008).
Within conserved functional sequence, on the other hand, deleterious indels will tend to have been purged, hence IGS lengths frequently will be more extended compared with neutral sequence. This results in a departure of the observed IGS length distribution from the geometric distribution (Fig. 1B), the extent of which allows the amount of functional sequence shared between genome pairs to be estimated accurately (for further details, see Lunter et al. 2006). In any alignment, a proportion of gaps will represent true evolutionary events, whereas the remainder represent “gap errors” that inadvertently have been introduced during sequencing and assembly. Causes of assembly errors, such as insufficient read coverage or mis-assembly, are often regional and thus may be expected to result in clustering of errors. In contrast, from the results of comparisons between species such as human and mouse, true evolutionary indel events appear to be only weakly clustered, for instance, through a dependence of indel rate on G+C content (Lunter et al. 2006). Indels may cluster because of recurrent and regional positive selection of nucleotide insertions and/or deletions. Nevertheless, these effects are unlikely to be sufficiently widespread to explain the high rates of indel clustering (up to one indel per 4 kb) that we discuss later. Indels may also cluster because of mutational biases that are independent of G+C, although we know of no such short-distance effects (see Discussion). This reasoning provided the rationale for seeking to exploit the neutral indel model to estimate the number of gap errors in alignments of two assemblies.
Purifying selection on indels and clustered indel errors contribute to largely distinct parts of the observed IGS histogram: The former increases the representation of long IGS (Fig. 1B), whereas the latter cause short IGS to become more prevalent than expected.Nevertheless, owing to the considerable divergence between human and mouse, the probability of a true indel greatly exceeds assembly indel error rates (5 × 10−2 versus 10−3 to 10−4 per nucleotide) (see below) (Lunter et al. 2006). In short, the large number of true indel events renders the proportion of gap errors so low as to be inestimable. Even for more closely related species, such as mouse and rat (Fig. 1A), neutral sequence is estimated to contain one true indel per 50 bases, which is also approximately 100-fold higher than the frequency of indel errors we will report later. Consequently, indel errors will be most easily discerned between genome assemblies from yet more closely related species. Few species pairs, whose divergence within neutral sequence is low (<5%), have yet been sequenced. Nevertheless, recent reductions in sequencing costs are likely to result in substantial numbers of closely related genomes being sequenced in the near future.For this analysis, we took advantage of the newly available genome assembly of the Sumatran orangutan (Pongo pygmaeus abelii), sequenced using a conventional capillary sequencing approach (Orangutan Genome Sequencing Consortium, in prep.; D Locke, pers. comm.), and its alignment to other closely related great ape genome assemblies, namely, those of human (Homo sapiens) and chimpanzee (Pan troglodytes). The latter two genomes have been sequenced to finished quality and sixfold coverage, respectively (see Methods) (International Human Genome Sequencing Consortium 2004; The Chimpanzee Sequencing and Analysis Consortium 2005), whereas the effective coverage of the Sumatran orangutan is lower at approximately fourfold (Orangutan Genome Sequencing Consortium, in prep.).We were able to take advantage of a data set of short reads at approximately 10-fold statistical coverage from a single Bornean orangutan (Pongo pygmaeus pygmaeus) that was shotgun-sequenced using the Illumina short read platform as part of the orangutan sequencing project (Orangutan Genome Sequencing Consortium, in prep.). This substantial read depth afforded us an opportunity to quantify the improvement to traditional capillary-read assemblies from the mapping of short sequence reads. Using a sequence mapper (Stampy) that was specifically designed for high sensitivity and accuracy in the presence of indels as well as substitution mutations (see Methods) (GA Lunter and M Goodson, in prep.), we placed these reads onto the Sumatran orangutan genome assembly. Using this assembly as a template, we called indels and substitutions and, from these, derived a templated assembly of the Bornean individual. This assembly is expected to contain polymorphisms specific to the Bornean individual and also to correct many fine-scale substitution and indel errors present in the Sumatran capillary-read assembly. The assembly will be syntenic with the Sumatran assembly, rather than following the Bornean genome where structural variants exist. Moreover, in regions where the Sumatran genome is divergent or contains many errors, reads will not be mapped; such regions will be excluded from the templated assembly. 
Using our indel error statistics, we show that this templated assembly improves on the original assembly in terms of accuracy by effectively separating low-fidelity from high-fidelity sequence.  相似文献   
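A compact sketch of the quantities this approach is built on, written from the description above rather than from the authors' code: intergap segment lengths are collected from a pairwise alignment, a geometric distribution is fitted to a length range assumed to be dominated by neutral sequence, and the excess of short segments over the geometric expectation serves as a rough proxy for clustered gap errors. The fit range and the simple 1/mean estimate of the geometric parameter are simplifications.

    def intergap_segments(aln_a, aln_b):
        """Lengths of ungapped runs (intergap segments) in a pairwise alignment
        given as two equal-length strings with '-' for gaps."""
        lengths, run = [], 0
        for a, b in zip(aln_a, aln_b):
            if a == '-' or b == '-':
                if run:
                    lengths.append(run)
                run = 0
            else:
                run += 1
        if run:
            lengths.append(run)
        return lengths

    def short_igs_excess(igs, fit_range=(10, 300), short_max=10):
        """Fit a geometric distribution to IGS lengths in a range assumed to be
        dominated by neutral indels, then report how many short IGS are observed
        beyond the geometric expectation, a rough proxy for clustered gap errors."""
        fitted = [l for l in igs if fit_range[0] <= l <= fit_range[1]]
        assert fitted, "fit range must contain data"
        p = 1.0 / (sum(fitted) / len(fitted))            # crude geometric parameter
        n = len(igs)
        expected_short = n * (1 - (1 - p) ** short_max)  # P(length <= short_max)
        observed_short = sum(1 for l in igs if l <= short_max)
        return observed_short - expected_short

Applied to alignments of two very closely related genomes, where true indels are rare, a positive excess of short segments points to clustered gaps introduced during sequencing or assembly rather than to evolution.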

19.
A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content show this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicate misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.The production of a reference sequence assembly for the human genome was a milestone in biology and clearly has impacted many areas of biomedical research (McPherson et al. 2001; International Human Genome Sequencing 2004). The availability of this resource allows us to investigate genomic structure and variation at a depth previously unavailable (Kidd et al. 2008; The 1000 Genomes Project Consortium 2012). These studies have helped make clear the shortcomings of our initial assembly models and the difficulty of comprehensive genome analysis. While the current human reference assembly is of extremely high quality and is still the benchmark by which all other human assemblies must be compared, it is far from perfect. Technical and biological complexity lead to both missing sequences as well as misassembled sequence in the current reference, GRCh38 (Robledo et al. 2002; Eichler et al. 2004; International Human Genome Sequencing 2004; Church et al. 2011; Genovese et al. 2013).The two most vexing biological problems affecting assembly are (1) complex genomic architecture seen in large regions with highly homologous duplicated sequences and (2) excess allelic diversity (Bailey et al. 2001; Mills et al. 2006; Korbel et al. 2007; Kidd et al. 2008; Zody et al. 2008). Assembling these regions is further complicated due to the fact that regions of segmental duplication (SD) are often correlated with copy-number variants (CNVs) (Sharp et al. 2005). Regions harboring large CNV SDs have been misrepresented in the reference assembly because assembly algorithms aim to produce a haploid consensus. Highly identical paralogous and structurally polymorphic regions frequently lead to nonallelic sequences being collapsed into a single contig or allelic sequences being improperly represented as duplicates. 
Because of this complexity, a single, haploid reference is insufficient to fully represent human diversity (Church et al. 2011).The availability of at least one accurate allelic representation at loci with complex genomic architecture facilitates the understanding of the genomic architecture and diversity in these regions (Watson et al. 2013). To enable the assembly of these regions, we have developed a suite of resources from CHM1, a DNA source containing a single human haplotype (Taillon-Miller et al. 1997; Fan et al. 2002). A complete hydatidiform mole (CHM) is an abnormal product of conception in which there is a very early fetal demise and overgrowth of the placental tissue. Most CHMs are androgenetic and contain only paternally derived autosomes and sex chromosomes resulting either from dispermy or duplication of a single sperm genome. The phenotype is thought to be a result of abnormal parental contribution leading to aberrant genomic imprinting (Hoffner and Surti 2012). The absence of allelic variation in monospermic CHM makes it an ideal candidate for producing a single haplotype representation of the human genome. There are a number of existing resources associated with the “CHM1” sample, including a BAC library with end sequences generated with Sanger sequencing using ABI 3730 technology (https://bacpac.chori.org/), an optical map (Teague et al. 2010), and a BioNano genomic map (see Data access), some of which have previously been used to improve regions of the reference human genome assembly.BAC clones have historically been used to resolve difficult genomic regions and identify structural variants (Barbouti et al. 2004; Carvalho and Lupski 2008). A BAC library constructed from CHM1 DNA (CHORI-17, CH17) has also been utilized to resolve several very difficult genomic regions, including human-specific duplications at the SRGAP2 gene family on Chromosome 1 (Dennis et al. 2012). Additionally, the CHM1 BAC clones were used to generate single haplotype assemblies of regions that were previously misrepresented because of haplotype mixing (Watson et al. 2013). Both of these efforts contributed to the improvement of the GRCh38 reference human genome assembly, adding hundreds of kilobases of sequence missing in GRCh37, in addition to providing an accurate single haplotype representation of complex genome regions.Because of the previously established utility of sequence data derived from the CHM1 resource, we wished to develop a complete assembly of a single human haplotype. To this end, we produced a short read-based (Illumina) reference-guided assembly of CHM1 with integrated high-quality finished fully sequenced BAC clones to further improve the assembly. This assembly has been annotated using the NCBI annotation process and has been aligned to other human assemblies in GenBank, including both GRCh37 and GRCh38. Here we present evidence that the CHM1 genome assembly is a high-quality draft with respect to gene and repetitive element content as well as by comparison to other individual genome assemblies. We will also discuss current plans for developing a fully finished genome assembly based on this resource.  相似文献   
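One of the assembly-independent checks mentioned above, comparison against BAC clone end sequences, can be illustrated with the following sketch: each mapped end pair is screened for landing on a single scaffold, in convergent orientation, and at a spacing consistent with the clone insert size. The insert-size window and input format are hypothetical, and this is not the evaluation pipeline used for CHM1_1.1.

    def flag_misassemblies(bac_end_pairs, insert_range=(150_000, 250_000)):
        """Screen mapped BAC end pairs against an assembly. A pair whose ends land
        on different scaffolds, in unexpected relative orientation, or at a distance
        outside the expected insert-size window points at a candidate misassembly.

        bac_end_pairs: iterable of ((scaffold, position, strand), (scaffold, position, strand))."""
        flagged = []
        for pair in bac_end_pairs:
            (scaf1, pos1, strand1), (scaf2, pos2, strand2) = pair
            if scaf1 != scaf2:
                flagged.append((pair, "ends on different scaffolds"))
            elif strand1 == strand2:
                flagged.append((pair, "unexpected orientation"))
            elif not insert_range[0] <= abs(pos2 - pos1) <= insert_range[1]:
                flagged.append((pair, "insert size out of range"))
        return flagged

Regions where many independent end pairs are flagged, or where long reads fail to tile across a junction, are the ones most likely to need targeted curation.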

20.
Next-generation massively parallel sequencing technologies provide ultrahigh throughput at two orders of magnitude lower unit cost than capillary Sanger sequencing technology. One of the key applications of next-generation sequencing is studying genetic variation between individuals using whole-genome or target region resequencing. Here, we have developed a consensus-calling and SNP-detection method for the sequencing-by-synthesis Illumina Genome Analyzer technology. We designed this method by carefully considering the data quality, alignment, and experimental errors common to this technology. All of this information was integrated into a single quality score for each base, under Bayesian theory, to measure the accuracy of consensus calling. We tested this methodology using a large-scale human resequencing data set of 36× coverage and assembled a high-quality nonrepetitive consensus sequence for 92.25% of the diploid autosomes and 88.07% of the haploid X chromosome. Comparison of the consensus sequence with Illumina human 1M BeadChip genotyped alleles from the same DNA sample showed that 98.6% of the 37,933 genotyped alleles on the X chromosome and 98% of the 999,981 genotyped alleles on the autosomes were covered, at 99.97% and 99.84% consistency, respectively. At low sequencing depth, we used the prior probabilities of dbSNP alleles and were able to improve coverage of dbSNP sites significantly compared with a nonimputation model. Our analyses demonstrate that our method has a very low false call rate at any sequencing depth and excellent genome coverage at high sequencing depth. Genetic polymorphisms contribute to variation in phenotypes, risk of certain diseases, and response to drugs and the environment. Genome-wide linkage analysis and positional cloning have been tremendously successful for mapping human disease genes that underlie monogenic Mendelian diseases (Jimenez-Sanchez et al. 2001). But most common diseases (such as diabetes, cardiovascular disease, and cancer) and clinically important quantitative traits have complex genetic architectures; a combination of multiple genes and interactions with environmental factors is believed to determine these phenotypes. Linkage analysis has significant limitations in its ability to identify common genetic variations that have modest effects on disease (Wang et al. 2005). In contrast, genome-wide association studies offer a promising approach for mapping associated loci. The completion of the human genome sequence (Lander et al. 2001; Venter et al. 2001) enabled the identification of millions of single nucleotide polymorphisms (SNPs) (Sachidanandam et al. 2001) and the construction of a high-density haplotype map (International HapMap Consortium 2005; International HapMap Consortium et al. 2007). These advances have set the stage for large-scale genome-wide SNP surveys seeking genetic variations associated with or causative of a wide variety of human diseases. For more than two decades, Sanger sequencing and fluorescence-based electrophoresis technologies have dominated the DNA sequencing field, and DNA sequencing is the method of choice for novel SNP detection, using either a random shotgun strategy or PCR amplification of regions of interest. Most of the SNPs deposited in dbSNP were identified by these methods (Sherry et al. 2001). A key advantage of traditional Sanger sequencing is the availability of the universal standard of phred scores (Ewing and Green 1998; Ewing et al. 1998) for defining SNP detection accuracy, in which the phred program assigns a score to each base of the raw sequence to estimate an error probability. With high-throughput clone sequencing of shotgun libraries, a standard method for SNP detection (such as ssahaSNP; Ning et al. 2001) is to align the reads onto a reference genome and filter low-quality mismatches according to their phred scores, known as the “neighborhood quality standard” (NQS) (Altshuler et al. 2000). With direct sequencing of PCR-amplified sequences from diploid samples, software including SNPdetector (Zhang et al. 2005), novoSNP (Weckx et al. 2005), PolyPhred (Stephens et al. 2006), and PolyScan (Chen et al. 2007) has been developed to examine chromatogram files and detect heterozygous polymorphisms. New DNA sequencing technologies that have recently been developed and implemented, such as the Illumina Genome Analyzer (GA), Roche/454 FLX system, and AB SOLiD system, have significantly improved throughput and dramatically reduced cost compared to capillary-based electrophoresis systems (Shendure et al. 2004). A single experiment on one Illumina GA can determine the sequence of approximately 100 million reads of up to 50 bases in length. This ultrahigh throughput makes next-generation sequencing technologies particularly suitable for genetic variation studies based on large-scale resequencing of sizeable cohorts of individuals against a known reference (Bentley 2006). Using these technologies, three human individuals have so far been sequenced: James Watson's genome by 454 Life Sciences (Roche) FLX sequencing technology (Wheeler et al. 2008), and an Asian genome (Wang et al. 2008) and an African genome (Bentley et al. 2008) by Illumina GA technology. Additionally, given such sequencing advances, an international research consortium has formed to sequence the genomes of at least 1000 individuals from around the world to create the most detailed human genetic variation map to date. As noted, SNP detection methods for standard sequencing technologies are well developed; however, given the distinct differences in the sequence data output from, and analyses of, next-generation sequencing, novel methods for accurate SNP detection are essential. To meet these needs, we have developed a method for consensus calling and SNP detection for the massively parallel Illumina GA technology. The Illumina platform uses a phred-like quality score system to measure the accuracy of each sequenced base. Using this, we calculated the likelihood of each genotype at each site based on the alignment of short reads to a reference genome together with the corresponding sequencing quality scores. We then inferred the genotype with the highest posterior probability at each site using a Bayesian statistical method. The Bayesian method has been used for SNP calling with traditional Sanger sequencing technology (Marth et al. 1999) and has also been introduced for the analysis of next-generation sequencing data (Li et al. 2008a). In the method presented here, we have taken into account the intrinsic biases and errors that are common in Illumina GA sequencing data and recalibrated the quality values for use in inferring the consensus sequence. We evaluated this SNP detection method using the Asian genome sequence, which has 36× high-quality data (Wang et al. 2008). The evaluation demonstrated that our method has a very low false call rate at any sequencing depth and excellent genome coverage with high-depth data, making it well suited for SNP detection in Illumina GA resequencing data. This methodology and the software described in this report have been integrated into the Short Oligonucleotide Alignment Program (SOAP) package (Li et al. 2008b) and named “SOAPsnp” to indicate its functionality for SNP detection using SOAP short read alignment results as input.  相似文献
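The consensus-calling logic summarized above can be made concrete with a toy Bayesian genotype caller: each phred-style quality Q is converted to an error probability p = 10^(-Q/10), every observed base contributes a likelihood to each candidate diploid genotype, and the genotype with the highest posterior is reported. The sketch below is a simplified illustration under uniform priors and independent, uniformly distributed sequencing errors; it is not the SOAPsnp implementation (which also recalibrates quality values and can use dbSNP-informed priors), and all function names are illustrative.

import itertools
from math import prod

def phred_to_error(q: int) -> float:
    """Convert a phred-style quality score Q into an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def base_likelihood(observed: str, true_base: str, q: int) -> float:
    """P(observed base | true base), assuming errors hit the three other bases uniformly."""
    e = phred_to_error(q)
    return 1 - e if observed == true_base else e / 3

def genotype_posteriors(pileup, priors=None):
    """
    pileup: list of (base, quality) tuples observed at one site.
    Returns {genotype: posterior} over the 10 unordered diploid genotypes,
    assuming each read samples one of the two alleles with probability 1/2.
    """
    genotypes = [a + b for a, b in itertools.combinations_with_replacement("ACGT", 2)]
    priors = priors or {g: 1 / len(genotypes) for g in genotypes}
    unnormalized = {
        g: priors[g] * prod(
            0.5 * base_likelihood(b, g[0], q) + 0.5 * base_likelihood(b, g[1], q)
            for b, q in pileup
        )
        for g in genotypes
    }
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}

# Toy usage: six reads covering one site, mostly 'A' with one low-quality 'G'.
site = [("A", 30), ("A", 30), ("G", 10), ("A", 25), ("A", 30), ("A", 20)]
post = genotype_posteriors(site)
best = max(post, key=post.get)
print(best, round(post[best], 4))  # expected to favor the homozygous 'AA' genotype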
