Similar literature
1.
CLONEPICKER is a software pipeline that integrates sequence data with BAC clone fingerprints to dynamically select a minimal overlapping clone set covering the whole genome. In the Rat Genome Sequencing Project (RGSP), a hybrid of the "clone by clone" and "whole genome shotgun" (WGS) approaches was used to combine the merits of both. As with the "clone by clone" method, one key challenge for this strategy was to select a low-redundancy clone set that covered the whole genome while sequencing was in progress. The CLONEPICKER pipeline met this challenge using restriction enzyme fingerprint data, BAC end sequence data, and sequences generated from individual BAC clones as well as WGS reads. In the RGSP, an average of 7.5 clones was identified from each side of a seed clone, and the minimal overlapping clones were reliably selected. Combined with the assembled BAC fingerprint map, a set of BAC clones that covered >97% of the genome was identified and used in the RGSP.
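The idea of selecting a minimal overlapping clone set can be illustrated with a classic greedy interval cover. This is only a sketch under stated assumptions and not the CLONEPICKER algorithm itself: the `Clone` tuple, the function name `pick_minimal_tiling`, and the use of simple map coordinates are all hypothetical, and a real pipeline would also require a minimum overlap between neighbors and would weigh fingerprint and end-sequence evidence.

```python
from typing import List, Tuple

# Hypothetical representation: (clone name, start, end) in map coordinates.
Clone = Tuple[str, int, int]


def pick_minimal_tiling(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Greedy interval cover.

    Repeatedly pick, among clones that start at or before the position
    covered so far, the one that extends furthest to the right.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    picked: List[Clone] = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Scan every not-yet-seen clone that starts inside the covered part.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in coverage at position {covered}")
        picked.append(best)
        covered = best[2]
    return picked


# Toy example (assumed data): five clones over a 300-unit region.
toy_clones = [("A", 0, 100), ("B", 20, 90), ("C", 80, 220), ("D", 90, 180), ("E", 200, 300)]
tiling = pick_minimal_tiling(toy_clones, 0, 300)
print([name for name, _, _ in tiling])
```

The greedy choice yields the fewest clones for a fixed set of intervals; it raises an error when the clones leave a gap, which in practice is where additional seed clones or map data would be needed.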

2.
Physical map-assisted whole-genome shotgun sequence assemblies (total citations: 2; self-citations: 0; citations by others: 2)
We describe a targeted approach to improve the contiguity of whole-genome shotgun sequence (WGS) assemblies at run-time, using information from Bacterial Artificial Chromosome (BAC)-based physical maps. Clone sizes and overlaps derived from clone fingerprints are used to calculate length constraints between any two BAC neighbors sharing 40% of their size. These constraints are used to promote the linkage and guide the arrangement of sequence contigs within a sequence scaffold at the layout phase of WGS assemblies. This process is facilitated by FASSI, a stand-alone application that calculates BAC end and BAC overlap length constraints from clone fingerprint map contigs created by the FPC package. FASSI is designed to work with the assembly tool PCAP, but its output can be formatted to work with other WGS assembly algorithms able to use length constraints for individual clones. The FASSI method is simple to implement, potentially cost-effective, and increased scaffold contiguity for both the Drosophila melanogaster and Cryptococcus gattii genomes when compared to a control assembly without map-derived constraints. A 6.5-fold coverage draft DNA sequence of the Pan troglodytes (chimpanzee) genome was assembled using map-derived constraints and resulted in a 26.1% increase in scaffold contiguity.

3.
Genome Research, 2015, 25(3): 445-458
Drosophila melanogaster plays an important role in molecular, genetic, and genomic studies of heredity, development, metabolism, behavior, and human disease. The initial reference genome sequence reported more than a decade ago had a profound impact on progress in Drosophila research, and improving the accuracy and completeness of this sequence continues to be important to further progress. We previously described improvement of the 117-Mb sequence in the euchromatic portion of the genome and 21 Mb in the heterochromatic portion, using a whole-genome shotgun assembly, BAC physical mapping, and clone-based finishing. Here, we report an improved reference sequence of the single-copy and middle-repetitive regions of the genome, produced using cytogenetic mapping to mitotic and polytene chromosomes, clone-based finishing and BAC fingerprint verification, ordering of scaffolds by alignment to cDNA sequences, incorporation of other map and sequence data, and validation by whole-genome optical restriction mapping. These data substantially improve the accuracy and completeness of the reference sequence and the order and orientation of sequence scaffolds into chromosome arm assemblies. Representation of the Y chromosome and other heterochromatic regions is particularly improved. The new 143.9-Mb reference sequence, designated Release 6, effectively exhausts clone-based technologies for mapping and sequencing. Highly repeat-rich regions, including large satellite blocks and functional elements such as the ribosomal RNA genes and the centromeres, are largely inaccessible to current sequencing and assembly methods and remain poorly represented. Further significant improvements will require sequencing technologies that do not depend on molecular cloning and that produce very long reads.
The genome sequence of the fruit fly Drosophila melanogaster was first reported in 2000 (Adams et al. 2000). 
This sequence assembly, designated Release 1, represented the single-copy fraction of the genome in 116.2 megabases (Mb) of sequence in 134 large mapped scaffolds containing 1299 sequence gaps and an additional 3.8 Mb in 704 small (<64 kb) unmapped scaffolds. Release 1 was produced by combining a de novo whole-genome shotgun (WGS) sequence assembly, designated WGS1 (Myers et al. 2000), with sequences of mapped BAC and P1 genomic clones, including 29.7 Mb of finished sequences and draft sequences of a tiling path of BAC and P1 clones spanning the euchromatic portion of the genome (Adams et al. 2000).
WGS1 and Release 1 were validated by comparison to the available finished genomic sequences and to a BAC-based physical map of the major autosomes (Hoskins et al. 2000).
WGS1 was the first shotgun assembly of a eukaryotic genome and served as a model for sequencing mammalian genomes (Venter et al. 2001; Stark et al. 2007). WGS remains the method of choice in genome sequencing because it is rapid and efficient. However, because eukaryotic genomes typically contain a large fraction of repetitive sequences with complex structures, current WGS sequencing strategies produce fragmented assemblies in which the location, order, and orientation of sequence scaffolds along the chromosomes are poorly determined. Furthermore, tandem and dispersed repetitive sequences including gene families, pseudogenes, transposable elements (TEs), segmental duplications, and simple sequence repeats are poorly represented. This leads to misassembled regions, unmapped regions, and numerous gaps, particularly in heterochromatic regions which often span many megabases of the genome and include vital protein-coding genes and other essential loci. 
Therefore, physical mapping, cytogenetic mapping, and sequence finishing to improve genome sequence assemblies remain a priority, especially for human (International Human Genome Sequencing Consortium 2004) and model organisms of particular importance in biomedical research.
Because D. melanogaster is a widely used research organism, we have continued to improve the reference genome sequence. Late in 2000, the Release 2 sequence corrected the order and orientation of a few small sequence scaffolds and filled a few hundred small sequence gaps. In 2002, we reported BAC-based finishing of 116.9 Mb of genome sequence in 13 scaffolds spanning the euchromatic portions of the six chromosome arms (Celniker et al. 2002) and an improved WGS assembly (WGS3) including 20.7 Mb of draft-quality sequence in larger scaffolds in the heterochromatic portion of the genome (Celniker et al. 2002; Hoskins et al. 2002). This Release 3 assembly had high sequence accuracy (estimated error rate < 1 in 100,000) and contiguity (37 sequence gaps; seven physical map gaps) in the euchromatic portion of the assembly, and the order and orientation of sequences within the assembly were confirmed by in situ hybridization of 915 BACs to salivary gland polytene chromosomes, representing 96% of the BACs in a tiling path spanning the euchromatic portion of the assembly (Hoskins et al. 2000; Celniker et al. 2002). The euchromatic sequence went through two unpublished revisions in 2004 and 2006 (Releases 4 and 5; http://www.fruitfly.org) to further improve accuracy and completeness. In 2007, we reported on further physical and cytogenetic mapping, and sequence finishing of 15 Mb in the heterochromatic portion of the genome, including essentially all single-copy regions (Hoskins et al. 2007). However, gaps and assembly errors remained due to the difficulties of mapping and finishing in repeat-rich regions. 
The remaining physical map gaps resulted from the absence of genomic regions from BAC libraries, likely due to incompatibility with molecular cloning or clone instability in E. coli. Sequence gaps within clone-based assemblies resulted from failure of assembly in complex nested repetitive regions. The remaining sequence assembly errors were due to incorrect but self-consistent clone-based sequence assemblies or clone rearrangements. Particularly in heterochromatin, errors in the physical and cytogenetic maps existed due to the presence of repeat-rich sequences.
Despite impressive developments in high-throughput sequencing technology, the production of high-quality finished genome sequences has remained laborious and inefficient. Furthermore, highly repeat-rich genomic regions such as those in centric heterochromatin have remained inaccessible to mapping, sequencing, and assembly. We define the “centric heterochromatin” as the repeat-rich sequences found at the functional centromeres (Sun et al. 2003). “Pericentric heterochromatin” refers to the Mb-scale regions that flank the centromeres and contain large blocks of satellite DNA and other simple-sequence repeats (Supplemental Fig. S1) interspersed with large regions of transposable-element and other middle-repetitive sequences and including essential protein-coding genes. “Telomeric heterochromatin” refers to the subtelomeric regions composed of tandem repeats (Mason and Villasante 2014) and the arrays of telomeric retrotransposons at the most distal chromosome ends (Abad et al. 2004b). By these definitions, the Y chromosome is composed entirely of centric, pericentric, and telomeric heterochromatin.
Here, we report the Release 6 assembly of the D. melanogaster reference genome sequence. Much of the improvement in the sequence is in the mapping, finishing, and assembly of repeat-rich regions in the heterochromatic portions of the genome. 
Release 6 incorporates (1) additional BAC-based cytogenetic mapping of previously unmapped, unordered, and unoriented sequence scaffolds by fluorescent in situ hybridization (FISH) to mitotic and polytene chromosomes, (2) BAC-based sequence finishing of clones spanning the remainder of the genome physical map guided by comparison to high-resolution BAC restriction fingerprints, and sequence finishing of 10-kb genomic plasmid clones spanning the remainder of the WGS3 assembly, (3) use of cDNA sequences to order and orient scaffolds, (4) incorporation of map and sequence data from other sources, and (5) validation of the sequence assembly by comparison to a whole-genome optical restriction map (Zhou et al. 2007). The resulting genome sequence assembly is a substantially improved reference that spans 143.9 Mb and represents the practical limit of established technologies. Relative to Release 5, Release 6 closes 628 gaps, extends the chromosome arm assemblies into telomeric and pericentric heterochromatin by 5.4 Mb, and increases the Y chromosome assembly 10-fold from ∼242 kb to 3.4 Mb. Further substantial improvement to the reference genome sequence will require new technologies that do not depend on standard molecular cloning. Emerging very-long-read WGS sequencing and assembly technologies will permit efficient production of more complete genome sequences for D. melanogaster and other species.

4.
We have developed a novel approach for using massively parallel short-read sequencing to generate fast and inexpensive de novo genomic assemblies comparable to those generated by capillary-based methods. The ultrashort (<100 base) sequences generated by this technology pose specific biological and computational challenges for de novo assembly of large genomes. To account for this, we devised a method for experimentally partitioning the genome using reduced representation (RR) libraries prior to assembly. We use two restriction enzymes independently to create a series of overlapping fragment libraries, each containing a tractable subset of the genome. Together, these libraries allow us to reassemble the entire genome without the need of a reference sequence. As proof of concept, we applied this approach to sequence and assemble the majority of the 125-Mb Drosophila melanogaster genome. We subsequently demonstrate the accuracy of our assembly method with meaningful comparisons against the currently available D. melanogaster reference genome (dm3). The ease of assembly and accuracy for comparative genomics suggest that our approach will scale to future mammalian genome-sequencing efforts, saving both time and money without sacrificing quality.
Genomes are the fundamental unit by which a species can be defined and form the foundation for deciphering how an organism develops, lives, dies, and is affected by disease. In addition, comparisons of genomes from related species have become a powerful method for finding functional sequences (Eddy 2005; Xie et al. 2005, 2007; Pennacchio et al. 2006; The ENCODE Project Consortium 2007). However, the high cost and effort needed to produce draft genomes with capillary-based sequencing technologies have limited genome-based biological exploration and evolutionary sequence comparisons to dozens of species (Margulies et al. 2007; Stark et al. 2007). 
This limitation is particularly true for mammalian-sized genomes, which are gigabases in size.
With the advent of massively parallel short-read sequencing technologies (Bentley et al. 2008), the cost of sequencing DNA has been reduced by orders of magnitude, now making it possible to sequence hundreds or thousands of genomes. However, the reduced length of the sequence reads, compared with capillary-based approaches, poses new challenges in genome assembly. Here, we sought to address those experimental and bioinformatics hurdles by combining classical biochemical methodologies with new algorithms specifically tailored to handle massive quantities of short-read sequences.
To date, the whole-genome shotgun (WGS) approach using massively parallel short-read sequencing has shown significant promise in silico (Butler et al. 2008) and has been applied to de novo sequencing and assembly of small genomes that do not contain an overabundance of low-complexity repetitive sequence (Hernandez et al. 2008). This presents a challenge when scaled to larger more complex genomes, where the information contained in a single short read cannot unambiguously place that read in the genome. Additionally, de novo assembly from WGS short-read sequencing currently requires large computational resources, on the order of hundreds of gigabytes of RAM, when scaled to larger genomes. As a compromise, current mammalian genomic analyses utilizing short-read sequencing technology either use alignments of individual reads against a reference genome (Ley et al. 2008; Wang et al. 2008) or require elaborate parallelization schemes across a large compute farm (Simpson et al. 2009) for assembly. Regardless of computational improvements, effectively handling repetitive sequences in a whole-genome assembly still remains a challenge.
Our goal was to establish wet-lab and bioinformatics methods to rapidly sequence and assemble mammalian-sized genomes in a de novo fashion. 
Importantly, we wanted an approach that (1) did not rely on existing reference assemblies, (2) could be accomplished using commodity computational hardware, and (3) would yield functional assemblies useful for comparative sequence analyses at a fraction of the time and cost of existing capillary-based methods. To accomplish this, we propose a generic genome partitioning approach to solve both the biological and computational challenges of short-read assembly.
Traditionally, partitioning of genomic libraries was accomplished through the use of clonal libraries of bacterial artificial chromosomes (BACs) (or yeast artificial chromosomes). This method accurately partitions genomes into more manageable subregions for sequencing and assembly. However, the high financial cost and overhead associated with creating and maintaining these libraries make this method unattractive for scaling to hundreds of genomes. In addition, a single BAC clone, which contains ∼200 kb of sequence, is not large enough to leverage the amount of sequence obtained from a single lane of Illumina data (currently ∼2.5 Gb of sequence), necessitating various pooling or indexing strategies (Meyer et al. 2007). Furthermore, virtually all BAC libraries exhibit some degree of variable cloning bias, with some regions overrepresented and others not represented at all. In silico studies have investigated the cost-saving potential of using a randomized BAC clone library with short-read sequencing (Sundquist et al. 2007); however, even with this shortcut, the clonal library concept does not lend itself to fast and cheap whole-genome partitioning.
We propose a novel partitioning approach using restriction enzymes to create a series of reduced representation (RR) libraries by size fractionation. This method was originally described for single nucleotide polymorphism (SNP) discovery using Sanger-based sequencing methods on AB377 machines (Altshuler et al. 
2000) and was subsequently used with massively parallel short-read sequencing (Van Tassell et al. 2008). Importantly, this method allows for the selection of a smaller reproducible subset of the genome for assembly. We extended this concept to create a series of distinct RR libraries consisting of similarly sized restriction fragments. Individually, these libraries represent a tractable subset of the genome for sequencing and assembly; when taken together, they represent virtually the entire genome. Using two separate restriction enzymes generates overlapping libraries, which allow for assembly of the genome without using a reference sequence.
As proof of concept, we present here a de novo Drosophila melanogaster genomic assembly, equivalent in utility to a comparative grade assembly (Blakesley et al. 2004). Two enzymes were used to create a total of eight libraries. Short reads (∼36 bp) from each library were sequenced on the Illumina Genome Analyzer and assembled using the short-read assembler Velvet (Zerbino and Birney 2008). Contigs assembled from each library were merged into a single nonoverlapping meta assembly using the lightweight assembly program Minimus (Sommer et al. 2007). Furthermore, we sequenced genomic paired-end libraries with short and long inserts to order and orient the contigs into larger genomic scaffolds. When compared with a whole-genome shotgun assembly of the same data, we produce a higher quality assembly more rapidly by reducing the biological complexity and computational cost to assemble each library. Finally, we compare this assembly to the dm3 fly reference to highlight the accuracy of our assembly and utility for comparative sequence analyses. Our results demonstrate that this method is a rapid and cost-effective means to generate high-quality de novo assemblies of large genomes.
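The partitioning idea described above (digest with a restriction enzyme, then size-fractionate the fragments into libraries) can be mimicked in silico. The following is a minimal sketch, not the authors' pipeline: the abstract does not name the two enzymes, so the EcoRI recognition site (G^AATTC) is used purely as an illustrative stand-in, and the function names `digest` and `size_select`, the toy sequence, and the size window are all hypothetical.

```python
from typing import List


def digest(sequence: str, site: str, cut_offset: int) -> List[str]:
    """Cut `sequence` at every occurrence of `site`.

    `cut_offset` is how many bases into the recognition site the cut falls
    (1 for EcoRI, which cuts G^AATTC). Only one strand is modeled.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return [f for f in fragments if f]  # drop empty fragments


def size_select(fragments: List[str], min_len: int, max_len: int) -> List[str]:
    """Keep fragments whose length falls in one size fraction (one RR library)."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Toy sequence (assumed data) containing two EcoRI sites.
toy_genome = "AAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TT"
toy_fragments = digest(toy_genome, "GAATTC", 1)
toy_library = size_select(toy_fragments, 5, 8)
print([len(f) for f in toy_fragments], toy_library)
```

Running the same digest with a second enzyme's site would produce fragments with different boundaries, which is what lets contigs from one set of libraries bridge the cut sites of the other.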

5.
A fosmid library representing 10-fold coverage of the Histoplasma capsulatum G217B genome was used to construct a restriction-based physical map. The data obtained from three restriction endonuclease fingerprints, generated from each clone using BamHI, HindIII, and PstI endonucleases, were combined and used in FPC for automatic and manual contig assembly builds. Concomitantly, whole-genome shotgun (WGS) paired-end reads from plasmids and fosmids were assembled with PCAP, providing a predicted genome size of up to 43.5 Mbp and 17% repetitive DNA. Fosmid paired-end sequences in the WGS assembly provide anchoring information to the physical map and result in joining of existing physical map contigs into 84 clusters containing 9551 fosmid clones. Here, we detail mapping the Histoplasma capsulatum genome comprehensively in fosmids, resulting in an efficient paradigm for de novo sequencing that uses a map-assisted whole genome shotgun approach.

6.
Second-generation sequencing technology can now be used to sequence an entire human genome in a matter of days and at low cost. Sequence read lengths, initially very short, have rapidly increased since the technology first appeared, and we now are seeing a growing number of efforts to sequence large genomes de novo from these short reads. In this Perspective, we describe the issues associated with short-read assembly, the different types of data produced by second-gen sequencers, and the latest assembly algorithms designed for these data. We also review the genomes that have been assembled recently from short reads and make recommendations for sequencing strategies that will yield a high-quality assembly.
As genome sequencing technology has evolved, methods for assembling genomes have changed with it. Genome sequencers have never been able to “read” more than a relatively short stretch of DNA at once, with read lengths gradually increasing over time. Reconstructing a complete genome from a set of reads requires an assembly program, and a variety of genome assemblers have been used for this task. In 1995, when the first bacterial genome was published (Haemophilus influenzae), read lengths were ∼460 base pairs (bp), and that whole-genome shotgun (WGS) sequencing project generated 24,304 reads (Fleischmann et al. 1995). The human genome project required ∼30 million reads, with lengths up to 800 bp, using Sanger sequencing technology and automated capillary sequencers (International Human Genome Sequencing Consortium 2001; Venter et al. 2001). This corresponded to 24 billion bases (Gb), or approximately eightfold coverage of the 3-Gb human genome. Redundant coverage, in which on average every nucleotide is sequenced many times over, is required to produce a high-quality assembly. 
Another benefit of redundancy is greatly increased accuracy compared with a single read: Where a single read might have an error rate of 1%, eightfold coverage has an error rate as low as 10^-16 when eight high-quality reads agree with one another. High coverage is also necessary to sequence polymorphic alleles within diploid or polyploid genomes.
Current second-generation sequencing (SGS) technologies produce read lengths ranging from 35 to 400 bp, at far greater speed and much lower cost than Sanger sequencing. However, as reads get shorter, coverage needs to increase to compensate for the decreased connectivity and produce a comparable assembly. Certain problems cannot be overcome by deeper coverage: If a repetitive sequence is longer than a read, then coverage alone will never compensate, and all copies of that sequence will produce gaps in the assembly. These gaps can be spanned by paired reads (two reads generated from a single fragment of DNA and separated by a known distance), as long as the pair separation distance is longer than the repeat. Paired-end sequencing is available from most of the SGS machines, although it is not yet as flexible or as reliable as paired-end sequencing using traditional methods.
After the successful assembly of the human (International Human Genome Sequencing Consortium 2001; Venter et al. 2001) and mouse (Waterston et al. 2002) genomes by whole-genome shotgun sequencing, most large-scale genome projects quickly moved to adopt the WGS approach, which has subsequently been used for dozens of eukaryotic genomes. Today, thanks to changes in sequencing technology, a major question confronting genome projects is, can we sequence a large genome (>100 Mbp) using short reads? If so, what are the limitations on read length, coverage, and error rates? How much paired-end sequencing is necessary? And what will the assembly look like? 
In this Perspective, we take a look at each of these questions and describe the solutions available today. Although we provide some answers, we have no doubt that the solutions will change rapidly over the next few years, as both the sequencing methods and the computational solutions improve.
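The coverage and error-rate figures quoted in this abstract follow from simple arithmetic, sketched below. The function names are hypothetical, the 800-bp value is the stated maximum read length (so the 24-Gb total is an upper bound), and the error calculation assumes that errors in different reads are independent, which is an idealization.

```python
import math


def fold_coverage(n_reads: int, read_len_bp: int, genome_size_bp: int) -> float:
    """Average number of times each base is sequenced: total bases / genome size."""
    return n_reads * read_len_bp / genome_size_bp


def agreement_error(per_read_error: float, depth: int) -> float:
    """Chance that `depth` independent reads are all wrong at the same position."""
    return per_read_error ** depth


# ~30 million reads of up to 800 bp over a 3-Gb genome.
total_bases = 30_000_000 * 800                      # 24 billion bases
coverage = fold_coverage(30_000_000, 800, 3_000_000_000)

# A 1% per-read error rate with eight agreeing reads.
consensus_error = agreement_error(0.01, 8)

print(total_bases, coverage, consensus_error)
```

In practice coverage is uneven across a genome and sequencing errors are not fully independent, so these numbers are best read as idealized bounds rather than as measured quantities.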

7.
Multi-species sequence comparisons are a very efficient way to reveal conserved genes. Because sequence finishing is expensive and time consuming, many genome sequences are likely to stay incomplete. A challenge is to use these fragmented data for understanding the human genome. Methods for using cross-species whole-genome shotgun sequence (WGS) for genome annotation are described in this paper. About one-half million high-quality rat WGS reads (covering 7.5% of the rat genome) generated at the Baylor College of Medicine Human Genome Sequencing Center were compared with the human genome. Using computer-generated random reads as a negative control, a set of parameters was determined for reliable interpretation of BLAST search results. About 10% of the rat reads contain regions that are conserved in the human genomic sequence and about one-third of these include known gene-coding regions. Mapping the conserved regions to human chromosomes showed a 23-fold enrichment for coding regions compared with noncoding regions. This approach can also be applied to other mammalian genomes for gene finding. These data predicted approximately 42,500 genes in the human, slightly more than reported previously.

8.
New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.

9.
Sorghum is an important target for plant genomic mapping because of its adaptation to harsh environments, diverse germplasm collection, and value for comparing the genomes of grass species such as corn and rice. The construction of an integrated genetic and physical map of the sorghum genome (750 Mbp) is a primary goal of our sorghum genome project. To help accomplish this task, we have developed a new high-throughput PCR-based method for building BAC contigs and locating BAC clones on the sorghum genetic map. This task involved pooling 24,576 sorghum BAC clones (approximately 4x genome equivalents) in six different matrices to create 184 pools of BAC DNA. DNA fragments from each pool were amplified using amplified fragment length polymorphism (AFLP) technology, resolved on a LI-COR dual-dye DNA sequencing system, and analyzed using Bionumerics software. On average, each set of AFLP primers amplified 28 single-copy DNA markers that were useful for identifying overlapping BAC clones. Data from 32 different AFLP primer combinations identified approximately 2400 BACs and ordered approximately 700 BAC contigs. Analysis of a sorghum RIL mapping population using the same primer pairs located approximately 200 of the BAC contigs on the sorghum genetic map. Restriction endonuclease fingerprinting of the entire collection of sorghum BAC clones was applied to test and extend the contigs constructed using this PCR-based methodology. Analysis of the fingerprint data allowed for the identification of 3366 contigs each containing an average of 5 BACs. BACs in approximately 65% of the contigs aligned by AFLP analysis had sufficient overlap to be confirmed by DNA fingerprint analysis. In addition, 30% of the overlapping BACs aligned by AFLP analysis provided information for merging contigs and singletons that could not be joined using fingerprint data alone. 
Thus, the combination of fingerprinting and AFLP-based contig assembly and mapping provides a reliable, high-throughput method for building an integrated genetic and physical map of the sorghum genome.

10.
11.
The Phusion assembler (total citations: 14; self-citations: 2; citations by others: 14)
The Phusion assembler has assembled the mouse genome from the whole-genome shotgun (WGS) dataset collected by the Mouse Genome Sequencing Consortium, at ~7.5x sequence coverage, producing a high-quality draft assembly 2.6 gigabases in size, with 90% of the bases in 479 scaffolds. For the mouse genome, which is a large and repeat-rich genome, the input dataset was designed to include a high proportion of paired end sequences of various size-selected inserts, from 2 to 200 kbp in length, in various host vector templates. Phusion uses sequence data, called reads, and information about reads that share common templates, called read pairs, to drive the assembly of this large genome to highly accurate results. The preassembly stage, which clusters the reads into sensible groups, is a key element of the entire assembler, because it permits a simple approach to parallelization of the assembly stage, as each cluster can be treated independently of the others. In addition to the application of Phusion to the mouse genome, we will also present results from the WGS assembly of Caenorhabditis briggsae sequenced to about 11x coverage. The C. briggsae assembly was accessioned through EMBL, http://www.ebi.ac.uk/services/index.html, using the series CAAC01000001-CAAC01000578; however, the Phusion mouse assembly described here was not accessioned. The mouse data were generated by the Mouse Genome Sequencing Consortium. The C. briggsae sequence was generated at The Wellcome Trust Sanger Institute and the Genome Sequencing Center, Washington University School of Medicine.

12.
A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.
The production of a reference sequence assembly for the human genome was a milestone in biology and clearly has impacted many areas of biomedical research (McPherson et al. 2001; International Human Genome Sequencing 2004). 
The availability of this resource allows us to investigate genomic structure and variation at a depth previously unavailable (Kidd et al. 2008; The 1000 Genomes Project Consortium 2012). These studies have helped make clear the shortcomings of our initial assembly models and the difficulty of comprehensive genome analysis. While the current human reference assembly is of extremely high quality and is still the benchmark against which all other human assemblies must be compared, it is far from perfect. Technical and biological complexity lead to both missing and misassembled sequence in the current reference, GRCh38 (Robledo et al. 2002; Eichler et al. 2004; International Human Genome Sequencing 2004; Church et al. 2011; Genovese et al. 2013). The two most vexing biological problems affecting assembly are (1) complex genomic architecture seen in large regions with highly homologous duplicated sequences and (2) excess allelic diversity (Bailey et al. 2001; Mills et al. 2006; Korbel et al. 2007; Kidd et al. 2008; Zody et al. 2008). Assembling these regions is further complicated because regions of segmental duplication (SD) are often correlated with copy-number variants (CNVs) (Sharp et al. 2005). Regions harboring large CNV SDs have been misrepresented in the reference assembly because assembly algorithms aim to produce a haploid consensus. Highly identical paralogous and structurally polymorphic regions frequently lead to nonallelic sequences being collapsed into a single contig or allelic sequences being improperly represented as duplicates. Because of this complexity, a single, haploid reference is insufficient to fully represent human diversity (Church et al. 2011). The availability of at least one accurate allelic representation at loci with complex genomic architecture facilitates the understanding of the genomic architecture and diversity in these regions (Watson et al. 2013). 
To enable the assembly of these regions, we have developed a suite of resources from CHM1, a DNA source containing a single human haplotype (Taillon-Miller et al. 1997; Fan et al. 2002). A complete hydatidiform mole (CHM) is an abnormal product of conception in which there is a very early fetal demise and overgrowth of the placental tissue. Most CHMs are androgenetic and contain only paternally derived autosomes and sex chromosomes resulting either from dispermy or duplication of a single sperm genome. The phenotype is thought to be a result of abnormal parental contribution leading to aberrant genomic imprinting (Hoffner and Surti 2012). The absence of allelic variation in monospermic CHM makes it an ideal candidate for producing a single haplotype representation of the human genome. There are a number of existing resources associated with the “CHM1” sample, including a BAC library with end sequences generated with Sanger sequencing using ABI 3730 technology (https://bacpac.chori.org/), an optical map (Teague et al. 2010), and a BioNano genomic map (see Data access), some of which have previously been used to improve regions of the reference human genome assembly. BAC clones have historically been used to resolve difficult genomic regions and identify structural variants (Barbouti et al. 2004; Carvalho and Lupski 2008). A BAC library constructed from CHM1 DNA (CHORI-17, CH17) has also been utilized to resolve several very difficult genomic regions, including human-specific duplications at the SRGAP2 gene family on Chromosome 1 (Dennis et al. 2012). Additionally, the CHM1 BAC clones were used to generate single haplotype assemblies of regions that were previously misrepresented because of haplotype mixing (Watson et al. 2013). 
Both of these efforts contributed to the improvement of the GRCh38 reference human genome assembly, adding hundreds of kilobases of sequence missing in GRCh37, in addition to providing an accurate single haplotype representation of complex genome regions. Because of the previously established utility of sequence data derived from the CHM1 resource, we wished to develop a complete assembly of a single human haplotype. To this end, we produced a short read-based (Illumina) reference-guided assembly of CHM1 and integrated high-quality, finished, fully sequenced BAC clones to further improve the assembly. This assembly has been annotated using the NCBI annotation process and has been aligned to other human assemblies in GenBank, including both GRCh37 and GRCh38. Here we present evidence that the CHM1 genome assembly is a high-quality draft with respect to gene and repetitive element content as well as by comparison to other individual genome assemblies. We also discuss current plans for developing a fully finished genome assembly based on this resource.

13.
Comparison is a fundamental tool for analyzing DNA sequence. Interspecies sequence comparison is particularly powerful for inferring genome function and is based on the simple premise that conserved sequences are likely to be important. Thus, the comparison of a genomic sequence with its orthologous counterpart from another species is increasingly becoming an integral component of genome analysis. In ideal situations, such comparisons are performed with orthologous sequences from multiple species. To facilitate multispecies comparative sequence analysis, a robust and scalable strategy for simultaneously constructing sequence-ready bacterial artificial chromosome (BAC) contig maps from targeted genomic regions has been developed. Central to this approach is the generation and utilization of "universal" oligonucleotide-based hybridization probes ("overgo" probes), which are designed from sequences that are highly conserved between distantly related species. Large collections of these probes are used en masse to screen BAC libraries from multiple species in parallel, with the isolated clones assembled into physical contig maps. To validate the effectiveness of this strategy, efforts were focused on the construction of BAC-based physical maps from multiple mammalian species (chimpanzee, baboon, cat, dog, cow, and pig). Using available human and mouse genomic sequence and a newly developed computer program to design the requisite probes, sequence-ready maps were constructed in all species for a series of targeted regions totaling approximately 16 Mb in the human genome. The described approach can be used to facilitate the multispecies comparative sequencing of targeted genomic regions and can be adapted for constructing BAC contig maps in other vertebrates.

14.
Efficient sequencing of animal and plant genomes by next-generation technology should allow many neglected organisms of biological and medical importance to be better understood. As a test case, we have assembled a draft genome of Caenorhabditis sp. 3 PS1010 through a combination of direct sequencing and scaffolding with RNA-seq. We first sequenced genomic DNA and mixed-stage cDNA using paired 75-nt reads from an Illumina GAII. A set of 230 million genomic reads yielded an 80-Mb assembly, with a supercontig N50 of 5.0 kb, covering 90% of 429 kb from previously published genomic contigs. Mixed-stage poly(A)(+) cDNA gave 47.3 million mappable 75-mers (including 5.1 million spliced reads), which separately assembled into 17.8 Mb of cDNA, with an N50 of 1.06 kb. By further scaffolding our genomic supercontigs with cDNA, we increased their N50 to 9.4 kb, nearly double the average gene size in C. elegans. We predicted 22,851 protein-coding genes, and detected expression in 78% of them. Multigenome alignment and data filtering identified 2672 DNA elements conserved between PS1010 and C. elegans that are likely to encode regulatory sequences or previously unknown ncRNAs. Genomic and cDNA sequencing followed by joint assembly is a rapid and useful strategy for biological analysis.
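
The abstract above reports assembly quality as N50 values. As a rough illustration only, the Python sketch below computes N50 under one common definition: the length of the shortest contig in the smallest set of longest contigs that together contain at least half of the assembled bases. The function name `n50` and the example lengths are assumptions made for this illustration; they are not taken from the paper, which may have used a different tool or convention.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Assumed definition: walk the contigs from longest to shortest and
    return the length at which the running total first reaches at
    least half of the total assembled bases.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("n50() needs at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in kb, purely for illustration.
example_lengths = [2, 3, 4, 5, 6]
print(n50(example_lengths))
```

Because N50 depends only on the length distribution, joining contigs into longer scaffolds raises it even when the total assembled length stays the same, which is the kind of gain the abstract attributes to cDNA-based scaffolding.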

15.
We present whole genome profiling (WGP), a novel next-generation sequencing-based physical mapping technology for construction of bacterial artificial chromosome (BAC) contigs of complex genomes, using Arabidopsis thaliana as an example. WGP leverages short read sequences derived from restriction fragments of two-dimensionally pooled BAC clones to generate sequence tags. These sequence tags are assigned to individual BAC clones, followed by assembly of BAC contigs based on shared regions containing identical sequence tags. Following in silico analysis of WGP sequence tags and simulation of a map of Arabidopsis chromosome 4 and maize, a WGP map of Arabidopsis thaliana ecotype Columbia was constructed de novo using a six-genome equivalent BAC library. Validation of the WGP map using the Columbia reference sequence confirmed that 350 BAC contigs (98%) were assembled correctly, spanning 97% of the 102-Mb calculated genome coverage. We demonstrate that WGP maps can also be generated for more complex plant genomes and will serve as excellent scaffolds to anchor genetic linkage maps and integrate whole genome sequence data.

16.
A BAC- and BIBAC-based physical map of the soybean genome
Genome-wide physical maps are crucial to many aspects of advanced genome research. We report a genome-wide, bacterial artificial chromosome (BAC) and plant-transformation-competent binary large-insert plasmid clone (hereafter BIBAC)-based physical map of the soybean genome. The map was constructed from 78001 clones from five soybean BAC and BIBAC libraries representing 9.6 haploid genomes and three cultivars, and consisted of 2905 BAC/BIBAC contigs, estimated to span 1408 Mb in physical length. We evaluated the reliability of the map contigs using different contig assembly strategies, independent contig building methods, DNA marker hybridization, and different fingerprinting methods, and the results showed that the contigs were assembled properly. Furthermore, we tested the feasibility of integrating the physical map with the existing soybean composite genetic map using 388 DNA markers. The results further confirmed the nature of the ancient tetraploid origin of soybean and indicated that it is feasible to integrate the physical map with the linkage map even though greater efforts are needed. This map represents the first genome-wide, BAC/BIBAC-based physical map of the soybean genome and would provide a platform for advanced genome research of soybean and other legume species. The inclusion of BIBACs in the map would streamline the utility of the map for positional cloning of genes and QTLs, and functional analysis of soybean genomic sequences.

17.
The success of the ongoing Human Genome Project has resulted in accelerated plans for completing the human genome sequence and the earlier-than-anticipated initiation of efforts to sequence the mouse genome. As a complement to these efforts, we are utilizing the available human sequence to refine human-mouse comparative maps and to assemble sequence-ready mouse physical maps. Here we describe how the first glimpses of genomic sequence from human chromosome 7 are directly facilitating these activities. Specifically, we are actively enhancing the available human-mouse comparative map by analyzing human chromosome 7 sequence for the presence of orthologs of mapped mouse genes. Such orthologs can then be precisely positioned relative to mapped human STSs and other genes. The chromosome 7 sequence generated to date has allowed us to more than double the number of genes that can be placed on the comparative map. The latter effort reveals that human chromosome 7 is represented by at least 20 orthologous segments of DNA in the mouse genome. A second component of our program involves systematically analyzing the evolving human chromosome 7 sequence for the presence of matching mouse genes and expressed-sequence tags (ESTs). Mouse-specific hybridization probes are designed from such sequences and used to screen a mouse bacterial artificial chromosome (BAC) library, with the resulting data used to assemble BAC contigs based on probe-content data. Nascent contigs are then expanded using probes derived from newly generated BAC-end sequences. This approach produces BAC-based sequence-ready maps that are known to contain a gene(s) and are homologous to segments of the human genome for which sequence is already available. Our ongoing efforts have thus far resulted in the isolation and mapping of >3,800 mouse BACs, which have been assembled into >100 contigs. 
These contigs include >250 genes and represent approximately 40% of the mouse genome that is homologous to human chromosome 7. Together, these approaches illustrate how the availability of genomic sequence directly facilitates studies in comparative genomics and genome evolution.

18.
Bacterial artificial chromosome clones (BACs) are widely used at present in human genome physical mapping projects. To extend the utility of these clones for functional genomic studies, we have devised a method to modify BACs using Cre recombinase to introduce a gene cassette into the loxP sequence, which is present in the vector portion of the BAC clone. Cre-mediated integration is site specific and thus maintains the integrity of the genomic insert sequences, while eliminating the steps that are involved in restriction digest-based DNA cloning strategies. The success of this method depends on the use of a DNA construct, RETRObac, which contains the reporter marker green fluorescent protein (GFP) and the selectable marker neomycin phosphotransferase (neo), but does not contain a bacterial origin of replication. BAC clones have been modified successfully using this method and the genomic insert shows no signs of deletions or rearrangements. Transfection efficiencies of the modified BACs into human or murine cell lines ranged from 1% to 6%. After culture in media containing G418 for 3 weeks, ~0.1% of cells previously sorted for GFP expression acquired stable antibiotic resistance. Introduction of a human BAC clone that contains genomic p53 sequences into murine NIH3T3 cells led to expression of human p53 mRNA as determined by RT–PCR, demonstrating that sequences contained on the BAC are expressed. We believe that GFP–neo modified BAC clones will be a valuable resource in efforts to study biological effects of known genes as well as in efforts to clone and analyze new genes and regulatory regions.

19.
Comparative genomic hybridization (CGH) has proved to be a powerful tool for the detection of genome copy number changes in human cancers and in other diseases caused by segmental aneusomies. Array versions of CGH allow the definition of these aberrations, with resolution determined by the size and distribution of the array elements. Resolution approaching 100 kb can be achieved by use of arrays comprising bacterial artificial chromosomes (BACs) distributed contiguously across regions of interest. We describe here a computer program that automatically assembles contigs of minimally overlapping BAC clones, using information about BAC end-sequences and the normal genome DNA sequence. We demonstrate the characteristics of contigs assembled and annotated by use of this approach for regions of recurrent abnormality in human ovarian and breast cancers at chromosome bands 3q25-q27 and 8q24 and chromosome arm 20q. We also show illustrative array CGH analyses of amplification within these regions in breast and ovarian tumor cell lines, using arrays comprising contiguous BACs.
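
The abstract above does not say how its program picks minimally overlapping clones. As a hedged illustration of the general idea only, the Python sketch below assumes each BAC has already been placed on the reference sequence (for example, from its end-sequence alignments) as a `(name, start, end)` tuple, and uses a standard greedy interval-cover strategy to choose a small set of clones spanning a region. The function name `minimal_tiling_path`, the tuple layout, and the example coordinates are all hypothetical and are not drawn from the paper.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedily choose clones that cover [region_start, region_end).

    clones: iterable of (name, start, end) tuples giving each clone's
    assumed placement on the reference sequence.
    Returns the chosen clones in left-to-right order.
    Raises ValueError if the clones leave part of the region uncovered.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered = region_start  # everything left of this position is covered
    i = 0
    while covered < region_end:
        best = None
        # Among clones starting at or before the covered edge,
        # keep the one that reaches furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best)
        covered = best[2]
    return path


# Hypothetical clone placements in kb, purely for illustration.
example_clones = [
    ("bac_A", 0, 100),
    ("bac_B", 50, 180),
    ("bac_C", 90, 200),
    ("bac_D", 150, 300),
]
for name, start, end in minimal_tiling_path(example_clones, 0, 300):
    print(name, start, end)
```

The greedy choice yields the fewest clones needed to span the region under these assumptions. A real pipeline would also have to cope with ambiguous or discordant end-sequence placements, which this sketch ignores.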

20.
The pericentromeric regions of human chromosomes pose particular problems for both mapping and sequencing. These difficulties are due, in large part, to the presence of duplicated genomic segments that are distributed among multiple human chromosomes. To ensure contiguity of genomic sequence in these regions, we designed a sequence-based strategy to characterize different pericentromeric regions using a single (162 kb) 2p11 seed sequence as a point of reference. Molecular and cytogenetic techniques were first used to construct a paralogy map that delineated the interchromosomal distribution of duplicated segments throughout the human genome. Monochromosomal hybrid DNAs were PCR amplified by primer pairs designed to the 2p11 reference sequence. The PCR products were directly sequenced and used to develop a catalog of sequence tags for each duplicon for each chromosome. A total of 685 paralogous sequence variants were generated by sequencing 34.7 kb of paralogous pericentromeric sequence. Using PCR products as hybridization probes, we were able to identify 702 human BAC clones, of which a subset, 107 clones, were analyzed at the sequence level. We used diagnostic paralogous sequence variants to assign 65 of these BACs to at least 9 chromosomal pericentromeric regions: 1q12, 2p11, 9p11/q12, 10p11, 14q11, 15q11, 16p11, 17p11, and 22q11. Comparisons with existing sequence and physical maps for the human genome suggest that many of these BACs map to regions of the genome with sequence gaps. Our analysis indicates that large portions of pericentromeric DNA are virtually devoid of unique sequences. Instead, they consist of a mosaic of different genomic segments that have had different propensities for duplication. These biologic properties may be exploited for the rapid characterization not only of pericentromeric DNA but also of other complex paralogous regions of the human genome.
