Similar Articles
20 similar articles found (search time: 125 ms)
1.
2.
The analysis of next-generation sequencing data is computationally and statistically challenging because of the massive volume of data and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. GotCloud automates sequence alignment, sample-level quality control, variant calling, filtering of likely artifacts using machine-learning techniques, and genotype refinement using haplotype information. The pipeline can process thousands of samples in parallel and requires fewer computational resources than current alternatives. Experiments with whole-genome and exome-targeted sequence data generated by the 1000 Genomes Project show that the pipeline provides effective filtering against false positive variants and high power to detect true variants. Our pipeline has already contributed to variant detection and genotyping in several large-scale sequencing projects, including the 1000 Genomes Project and the NHLBI Exome Sequencing Project. We hope it will now prove useful to many medical sequencing studies.

The cost of human genome sequencing has declined rapidly, powered by advances in massively parallel sequencing technologies. This has made possible the collection of genomic information on an unprecedented scale and made large-scale sequencing a practical strategy for biological and medical studies. An initial step for nearly all sequencing studies is to detect variant sites among sampled individuals and genotype them. This analysis is challenging because errors in high-throughput sequence data are much more common than true genomic variation. There are diverse sources of trouble (base-calling errors, alignment artifacts, contaminant reads derived from other samples), and the resulting errors are often correlated. The analysis is also computationally and statistically challenging because of the volume of data involved. Using standard formats, raw sequence reads for a single deeply (30×) sequenced human genome require >100 gigabytes (GB) of storage.

Several tools are now available to process next-generation sequencing data. For example, the Genome Analysis Toolkit (GATK) (DePristo et al. 2011), SAMtools (Li 2011), and SNPTools (Wang et al. 2013) are used for variant discovery and genotyping from small to moderate numbers of sequenced samples. However, as the number of sequenced genomes grows, analysis becomes increasingly challenging, requiring complex data processing steps, division of sequence data into many small regions, management and scheduling of analysis jobs, and, often, prohibitive demands on computing resources. A tempting approach to alleviate the computational burden is to process samples in small batches, but this can lead to reduced power for rare variant discovery and systematic differences between samples processed in different batches.

There is a pressing need for software pipelines that support the large-scale medical sequencing studies made possible by decreased sequencing costs. Desirable features for such pipelines include (1) scalability to tens of thousands of samples; (2) the ability to easily stop and resume analyses; (3) the option to carry out incremental analyses as new samples are sequenced; (4) flexibility to accommodate different study designs: shallow and deep sequencing, and whole-genome, whole-exome, or small targeted experiments; and, of course, (5) high-quality genotyping and variant discovery.

Here, we describe and evaluate our flexible and efficient sequence analysis software pipeline, Genomes on the Cloud (GotCloud). We show that GotCloud delivers high-quality variant sites and accurate genotypes across thousands of samples. We describe the strategies used to systematically divide the processing of very large data sets into manageable pieces. We also demonstrate novel automated frameworks for filtering sequencing and alignment artifacts from variant calls, as well as for accurate genotyping using haplotype information.
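The region-wise parallelization strategy this abstract describes is easy to picture concretely. Below is a minimal, hypothetical Python sketch of the idea — tile the genome into fixed-size windows and fan per-region jobs out to a worker pool — not GotCloud's actual implementation; the chromosome length, window size, and `call_region` placeholder are assumptions for illustration.

```python
from multiprocessing import Pool

CHROM_LENGTHS = {"chr20": 63_025_520}  # hypothetical subset; real lengths come from the reference index
REGION_SIZE = 5_000_000

def make_regions(chrom_lengths, size):
    """Yield (chrom, start, end) windows tiling each chromosome."""
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, size):
            yield chrom, start, min(start + size, length)

def call_region(region):
    chrom, start, end = region
    # Placeholder: a real pipeline would run variant calling for all samples
    # restricted to this window, writing a per-region output for later merging.
    return f"{chrom}:{start}-{end}", "ok"

if __name__ == "__main__":
    regions = list(make_regions(CHROM_LENGTHS, REGION_SIZE))
    with Pool(processes=4) as pool:
        for region_id, status in pool.imap_unordered(call_region, regions):
            print(region_id, status)
```

Because each window is independent, finished regions can be checkpointed, which is what makes stop/resume and incremental analysis cheap in this design.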

3.
Using conventional Sanger sequencing as a reference standard, we compared the sensitivity, specificity, and capacity of the Illumina GA II platform for the detection of TP53, BRCA1, and BRCA2 mutations in established tumor cell lines and DNA from patients with germline mutations. A total of 656 coding variants were identified in four cell lines and 65 patient DNAs. All of the known pathogenic mutations (including point mutations and insertions/deletions of up to 16 nucleotides) were identified, using a combination of the Illumina data analysis pipeline with custom and commercial sequence alignment software. In our configuration, clonal sequencing outperforms current diagnostic methods, providing a reduction in analysis times and in reagent costs compared with conventional sequencing. These improvements open the possibility of BRCA1/2 testing for a wider spectrum of at‐risk women, and will allow the genetic classification of tumors prior to the use of novel PARP inhibitors to treat BRCA‐deficient breast cancers. Hum Mutat 31:1–8, 2010. © 2010 Wiley‐Liss, Inc.

4.
Novel high-throughput DNA sequencing technologies allow researchers to characterize a bacterial genome during a single experiment and at a moderate cost. However, the increase in sequencing throughput allowed by such platforms is obtained at the expense of individual sequence read length, which must be assembled into longer contigs to be exploitable. This study focuses on the Illumina sequencing platform, which produces millions of very short sequences of 35 bases in length. We propose de novo assembly software dedicated to processing such data. Based on a classical overlap graph representation and on the detection of potentially spurious reads, our software generates a set of accurate contigs of several kilobases that cover most of the bacterial genome. The assembly results were validated by comparing data sets obtained experimentally for Staphylococcus aureus strain MW2 and Helicobacter acinonychis strain Sheeba with their published genomes, acquired by conventional sequencing of 1.5- to 3.0-kb fragments. We also provide indications that the broad coverage achieved by high-throughput sequencing might allow for the detection of clonal polymorphisms in the set of DNA molecules being sequenced.
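As a rough illustration of the overlap-graph representation the abstract mentions, the hedged Python sketch below finds exact suffix-prefix overlaps between short reads and records them as graph edges. Real assemblers index reads rather than using this quadratic scan, and the reads and minimum-overlap threshold here are invented for illustration.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (>= min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def overlap_graph(reads, min_len):
    """Edges (i, j) -> overlap length; node i's read can be extended by node j's."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_len)
                if k:
                    edges[(i, j)] = k
    return edges

# Toy reads (hypothetical); a path through the graph spells out a contig.
reads = ["ATGCCGTAACGT", "CGTAACGTTTGA", "ACGTTTGACCGA"]
print(overlap_graph(reads, min_len=4))
```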

5.
CLONEPICKER is a software pipeline that integrates sequence data with BAC clone fingerprints to dynamically select a minimal overlapping clone set covering the whole genome. In the Rat Genome Sequencing Project (RGSP), a hybrid strategy of "clone by clone" and "whole genome shotgun" approaches was used to maximize the merits of both. As with the "clone by clone" method, one key challenge for this strategy was to select a low-redundancy clone set that covered the whole genome while sequencing was in progress. The CLONEPICKER pipeline met this challenge using restriction enzyme fingerprint data, BAC end sequence data, and sequences generated from individual BAC clones as well as WGS reads. In the RGSP, an average of 7.5 clones was identified from each side of a seed clone, and the minimal overlapping clones were reliably selected. Combined with the assembled BAC fingerprint map, a set of BAC clones covering >97% of the genome was identified and used in the RGSP.
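To make the clone-selection problem concrete: if clones could be placed as clean intervals on the genome, picking a minimal overlapping set covering it is the classic greedy interval-cover problem, sketched below in Python. CLONEPICKER itself works from fingerprint, BAC-end, and sequence evidence rather than known coordinates, so this is only a conceptual toy with hypothetical intervals.

```python
def minimal_tiling_path(clones, genome_end):
    """Greedy interval cover: among clones overlapping the covered prefix,
    always take the one extending coverage furthest."""
    clones = sorted(clones)          # (start, end) pairs
    chosen, covered, i = [], 0, 0
    while covered < genome_end:
        best = None
        while i < len(clones) and clones[i][0] <= covered:
            if best is None or clones[i][1] > best[1]:
                best = clones[i]
            i += 1
        if best is None or best[1] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        chosen.append(best)
        covered = best[1]
    return chosen

# Hypothetical clone placements on a 500-unit "genome".
print(minimal_tiling_path([(0, 150), (100, 320), (90, 300), (310, 500)], 500))
```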

6.
7.
Next-generation massively parallel sequencing technologies provide ultrahigh throughput at two orders of magnitude lower unit cost than capillary Sanger sequencing technology. One of the key applications of next-generation sequencing is studying genetic variation between individuals using whole-genome or target region resequencing. Here, we have developed a consensus-calling and SNP-detection method for sequencing-by-synthesis Illumina Genome Analyzer technology. We designed this method by carefully considering the data quality, alignment, and experimental errors common to this technology. All of this information was integrated into a single quality score for each base under Bayesian theory to measure the accuracy of consensus calling. We tested this methodology using a large-scale human resequencing data set of 36× coverage and assembled a high-quality nonrepetitive consensus sequence for 92.25% of the diploid autosomes and 88.07% of the haploid X chromosome. Comparison of the consensus sequence with Illumina human 1M BeadChip genotyped alleles from the same DNA sample showed that 98.6% of the 37,933 genotyped alleles on the X chromosome and 98% of 999,981 genotyped alleles on autosomes were covered, at 99.97% and 99.84% consistency, respectively. At a low sequencing depth, we used the prior probability of dbSNP alleles and were able to improve coverage of the dbSNP sites significantly as compared to that obtained using a nonimputation model. Our analyses demonstrate that our method has a very low false call rate at any sequencing depth and excellent genome coverage at a high sequencing depth.

Genetic polymorphisms contribute to variations in phenotypes, risk of certain diseases, and response to drugs and the environment. Genome-wide linkage analysis and positional cloning have been tremendously successful for mapping human disease genes that underlie monogenic Mendelian diseases (Jimenez-Sanchez et al. 2001). But most common diseases (such as diabetes, cardiovascular disease, and cancer) and clinically important quantitative traits have complex genetic architectures; a combination of multiple genes and interactions with environmental factors is believed to determine these phenotypes. Linkage analysis has significant limitations in its ability to identify common genetic variations that have modest effects on disease (Wang et al. 2005). In contrast, genome-wide association studies offer a promising approach for mapping associated loci. The completion of the human genome sequence (Lander et al. 2001; Venter et al. 2001) enabled the identification of millions of single nucleotide polymorphisms (SNPs) (Sachidanandam et al. 2001) and the construction of a high-density haplotype map (International HapMap Consortium 2005; International HapMap Consortium et al. 2007). These advances have set the stage for large-scale genome-wide SNP surveys for seeking genetic variations associated with or causative of a wide variety of human diseases.

For more than two decades, Sanger sequencing and fluorescence-based electrophoresis technologies have dominated the DNA sequencing field. And DNA sequencing is the method of choice for novel SNP detection, using either a random shotgun strategy or PCR amplification of regions of interest. Most of the SNPs deposited in dbSNP were identified by these methods (Sherry et al. 2001). A key advantage of traditional Sanger sequencing is the availability of the universal standard of phred scores (Ewing and Green 1998; Ewing et al. 1998) for defining SNP detection accuracy, in which the phred program assigns a score to each base of the raw sequence to estimate an error probability.

With high-throughput clone sequencing of shotgun libraries, a standard method for SNP detection (such as ssahaSNP; Ning et al. 2001) is to align the reads onto a reference genome and filter low-quality mismatches according to their phred score, known as the “neighborhood quality standard” (NQS) (Altshuler et al. 2000). With direct sequencing of PCR-amplified sequences from diploid samples, software including SNPdetector (Zhang et al. 2005), novoSNP (Weckx et al. 2005), PolyPhred (Stephens et al. 2006), and PolyScan (Chen et al. 2007) has been developed to examine chromatogram files to detect heterozygous polymorphisms.

New DNA sequencing technologies, which have recently been developed and implemented, such as the Illumina Genome Analyzer (GA), Roche/454 FLX system, and AB SOLiD system, have significantly improved throughput and dramatically reduced cost as compared to capillary-based electrophoresis systems (Shendure et al. 2004). In a single experiment using one Illumina GA, the sequence of approximately 100 million reads of up to 50 bases in length can be determined. This ultrahigh throughput makes next-generation sequencing technologies particularly suitable for carrying out genetic variation studies by large-scale resequencing of sizeable cohorts of individuals with a known reference (Bentley 2006). Currently, using these technologies, three human individuals have been sequenced: James Watson's genome by 454 Life Sciences (Roche) FLX sequencing technology (Wheeler et al. 2008), and an Asian genome (Wang et al. 2008) and an African genome (Bentley et al. 2008) sequenced by Illumina GA technology. Additionally, given such sequencing advances, an international research consortium has formed to sequence the genomes of at least 1000 individuals from around the world to create the most detailed human genetic variation map to date.

As noted, SNP detection methods for standard sequencing technologies are well developed; however, given distinct differences in the sequence data output from and analyses of next-generation sequencing, novel methods for accurate SNP detection are essential. To meet these needs, we have developed a method of consensus calling and SNP detection for the massively parallel Illumina GA technology. The Illumina platform uses a phred-like quality score system to measure the accuracy of each sequenced base pair. Using this, we calculated the likelihood of each genotype at each site based on the alignment of short reads to a reference genome together with the corresponding sequencing quality scores. We then inferred the genotype with the highest posterior probability at each site using a Bayesian statistical method. The Bayesian method has been used for SNP calling with traditional Sanger sequencing technology (Marth et al. 1999) and has also been introduced for the analysis of next-generation sequencing data (Li et al. 2008a). In the method presented here, we have taken into account the intrinsic biases and errors that are common in Illumina GA sequencing data and recalibrated the quality values for use in inferring the consensus sequence.

We evaluated this SNP detection method using the Asian genome sequence, which has 36× high-quality coverage (Wang et al. 2008). The evaluation demonstrated that our method has a very low false call rate at any sequencing depth, and excellent genome coverage in high-depth data, making it very useful for SNP detection in Illumina GA resequencing data at any sequencing depth. This methodology and the software described in this report have been integrated into the Short Oligonucleotide Alignment Program (SOAP) package (Li et al. 2008b) and named “SOAPsnp” to indicate its functionality for SNP detection using SOAP short-read alignment results as input.
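The Bayesian core of this kind of consensus calling is compact enough to sketch. The hedged Python example below converts phred qualities to error probabilities, computes diploid genotype likelihoods at one site, and returns posteriors; it omits the recalibration and correlated-error handling that SOAPsnp layers on top, and the pileup and priors are hypothetical.

```python
def phred_to_perr(q):
    """Convert a phred quality score to an error probability."""
    return 10.0 ** (-q / 10.0)

def site_posteriors(pileup, priors):
    """pileup: list of (base, phred_quality) at one site; priors: {genotype: prior}."""
    post = {}
    for gt, prior in priors.items():
        lik = prior
        for base, q in pileup:
            e = phred_to_perr(q)
            # P(observed base | allele) is 1-e on a match, e/3 on a mismatch;
            # a diploid genotype averages over its two alleles.
            lik *= sum((1 - e) if base == a else e / 3 for a in gt) / 2
        post[gt] = lik
    total = sum(post.values())
    return {gt: p / total for gt, p in post.items()}

pileup = [("A", 30), ("A", 25), ("G", 28), ("A", 32), ("G", 20)]
priors = {"AA": 0.995, "AG": 0.004, "GG": 0.001}  # hypothetical, favoring reference allele A
best = max(site_posteriors(pileup, priors).items(), key=lambda kv: kv[1])
print(best)  # the heterozygote AG wins despite the strong reference prior
```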

8.
Leukemias are currently subclassified based on the presence of recurrent cytogenetic abnormalities and gene mutations. These molecular findings are the basis for risk-adapted therapy; however, such data are generally obtained by disparate methods in the clinical laboratory, and often rely on low-resolution techniques such as fluorescent in situ hybridization. Using targeted next-generation sequencing, we demonstrate that the full spectrum of prognostically significant gene mutations, including translocations, single nucleotide variants (SNVs), and insertions/deletions (indels), can be identified simultaneously in multiplexed sequence data. As proof of concept, we performed hybrid capture using a panel of 20 genes implicated in leukemia prognosis (covering a total of 1 Mbp) from five leukemia cell lines: K562, NB4, OCI-AML3, Kasumi-1, and MV4-11. Captured DNA was then sequenced in multiplex on an Illumina HiSeq. Using an analysis pipeline based on freely available software, we correctly identified DNA-level translocations in all three of the cell lines where translocations were covered by our capture probes. Furthermore, we found all published gene mutations in commonly tested genes including NPM1, FLT3, and KIT. The same methodology was applied to DNA extracted from the bone marrow of a patient with acute myeloid leukemia, and identified a t(9;11) translocation with single-base accuracy as well as other gene mutations. These results indicate that targeted next-generation sequencing can be successfully applied in the clinical laboratory to identify a full spectrum of DNA mutations ranging from SNVs and indels to translocations. Such methods have the potential to both greatly streamline and improve the accuracy of DNA-based diagnostics.
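A conceptual sketch of the translocation-detection step: paired-end reads whose two ends map to different chromosomes are clustered, and clusters with enough supporting pairs become candidate breakpoints. The Python toy below (with hypothetical coordinates and thresholds, loosely echoing a t(9;11) event) is not the authors' pipeline, which works from hybrid-capture data and existing tools.

```python
from collections import Counter

def discordant_clusters(read_pairs, bin_size=1000, min_support=3):
    """read_pairs: (chrom1, pos1, chrom2, pos2) for pairs mapping to two chromosomes.
    Bin both ends and keep bins supported by at least min_support pairs."""
    bins = Counter(
        (c1, p1 // bin_size, c2, p2 // bin_size) for c1, p1, c2, p2 in read_pairs
    )
    return [(f"{c1}:{b1 * bin_size}", f"{c2}:{b2 * bin_size}", n)
            for (c1, b1, c2, b2), n in bins.items() if n >= min_support]

# Five pairs supporting one hypothetical chr9/chr11 junction, plus one stray pair.
pairs = [("chr9", 133_620_100 + i, "chr11", 118_350_200 + i * 40) for i in range(5)]
pairs.append(("chr1", 1_000, "chr2", 2_000))  # isolated pair: ignored as noise
print(discordant_clusters(pairs))
```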

9.
Target enrichment strategies are a very common approach to sequencing a predefined part of an individual's genome using second-generation sequencing technologies. While highly dependent on the technology and the target sequences selected, the performance of the various assays also varies between samples and is influenced by how the libraries are handled in the laboratory. Here, we show how to obtain detailed information about enrichment performance using a novel software package called NGSrich, which we developed as part of a whole-exome resequencing pipeline in a medium-sized genomics center. Our software is suitable for high-throughput use, and the results can be shared using HTML and a web server. Finally, we sequenced exome-enriched DNA libraries from 18 human individuals using three different enrichment products and used our new software for a comparative analysis of their performance.
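The headline number such enrichment reports revolve around is the on-target fraction. As a back-of-envelope illustration (not NGSrich's actual computation, which produces much richer per-target statistics), the sketch below counts the share of aligned read positions falling inside hypothetical capture intervals.

```python
def on_target_fraction(read_positions, targets):
    """read_positions: (chrom, pos) pairs; targets: {chrom: [(start, end), ...]}."""
    hits = sum(
        any(start <= pos < end for start, end in targets.get(chrom, ()))
        for chrom, pos in read_positions
    )
    return hits / len(read_positions)

# Hypothetical capture design and read placements.
targets = {"chr1": [(1000, 1200), (5000, 5400)]}
reads = [("chr1", 1050), ("chr1", 3000), ("chr1", 5100), ("chr2", 700)]
print(f"on-target: {on_target_fraction(reads, targets):.0%}")  # 50%
```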

10.
11.
Clinical Microbiology and Infection, 2021, 27(9):1350.e1–1350.e5
Objectives: SARS-CoV-2 has evolved rapidly into several genetic clusters. However, data on mutations during the course of infection are scarce. This study aims to determine viral genome diversity in serial samples of COVID-19 patients.
Methods: Targeted deep sequencing of the spike gene was performed on serial respiratory specimens from COVID-19 patients using nanopore and Illumina sequencing. Sanger sequencing was then performed to confirm the single nucleotide polymorphisms.
Results: A total of 28 serial respiratory specimens from 12 patients were successfully sequenced using nanopore and Illumina sequencing. A 75-year-old patient with severe disease had a mutation, G22017T, identified in the second specimen. The frequency of G22017T increased from ≤5% (nanopore: 3.8%; Illumina: 5%) in the first respiratory tract specimen (sputum) to ≥60% (nanopore: 67.7%; Illumina: 60.4%) in the second specimen (saliva; collected 2 days after the first specimen). The difference in G22017T frequency was also confirmed by Sanger sequencing. G22017T corresponds to the W152L amino acid mutation in the spike protein, which was found in only <0.03% of the sequences deposited in a public database. Spike amino acid residue 152 is located within the N-terminal domain, which mediates the binding of a neutralizing antibody.
Discussion: A spike protein amino acid mutation, W152L, located within a neutralizing epitope has appeared naturally in a patient. Our study demonstrates that monitoring of serial specimens is important for identifying hotspots of mutations, especially those occurring at neutralizing epitopes, which may affect the therapeutic efficacy of monoclonal antibodies.
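The frequencies reported above are variant allele fractions: the share of reads carrying the alternate base at the site. A trivial Python sketch, with hypothetical read counts chosen only to echo the reported nanopore percentages:

```python
def allele_frequency(alt_reads, total_reads):
    """Variant allele fraction at one site from read counts."""
    if total_reads == 0:
        raise ValueError("no coverage at this site")
    return alt_reads / total_reads

# Hypothetical (alt, total) read counts per serial specimen.
serial_counts = {"sputum, day 0": (19, 500), "saliva, day 2": (333, 492)}
for specimen, (alt, total) in serial_counts.items():
    print(f"{specimen}: G22017T frequency = {allele_frequency(alt, total):.1%}")
```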

12.
Huang W, Marth G. Genome Research, 2008, 18(9):1538–1543
The emergence of high-throughput next-generation sequencing technologies (e.g., 454 Life Sciences [Roche] and Illumina sequencing [formerly Solexa sequencing]) has dramatically sped up whole-genome de novo sequencing and resequencing. While the low cost of these sequencing technologies provides an unparalleled opportunity for genome-wide polymorphism discovery, the analysis of the new data types and huge data volume poses formidable informatics challenges for base calling, read alignment and genome assembly, polymorphism detection, as well as data visualization. We introduce a new data integration and visualization tool, EagleView, to facilitate data analyses, visual validation, and hypothesis generation. EagleView can handle a large genome assembly of millions of reads. It supports a compact assembly view, multiple navigation modes, and a pinpoint view of technology-specific trace information. Moreover, EagleView supports viewing coassembly of mixed-type reads from different technologies and supports integrating genome feature annotations into genome assemblies. EagleView has been used in our own lab and by over 100 research labs worldwide for next-generation sequence analyses. The EagleView software is freely available for not-for-profit use at http://bioinformatics.bc.edu/marthlab/EagleView.

13.
Targeted sequencing using next‐generation sequencing technologies is currently being rapidly adopted for clinical sequencing and cancer marker tests. However, no existing bioinformatics tool is available for the analysis and visualization of multiple targeted sequencing datasets. In the present study, we use cancer panel targeted sequencing datasets generated by the Life Technologies Ion Personal Genome Machine Sequencer as an example to illustrate how to develop an automated pipeline for the comparative analyses of multiple datasets. Cancer Panel Analysis Pipeline (CPAP) uses standard output files from variant calling software to generate a distribution map of SNPs among all of the samples in a circular diagram generated by Circos. The diagram is hyperlinked to a dynamic HTML table that allows the users to identify target SNPs by using different filters. CPAP also integrates additional information about the identified SNPs by linking to an integrated SQL database compiled from SNP‐related databases, including dbSNP, 1000 Genomes Project, COSMIC, and dbNSFP. CPAP only takes 17 min to complete a comparative analysis of 500 datasets. CPAP not only provides an automated platform for the analysis of multiple cancer panel datasets but can also serve as a model for any customized targeted sequencing project.

14.
Massively parallel sequencing technology and the associated rapidly decreasing sequencing costs have enabled systematic analyses of somatic mutations in large cohorts of cancer cases. Here we introduce a comprehensive mutational analysis pipeline that uses standardized sequence-based inputs along with multiple types of clinical data to establish correlations among mutation sites, affected genes, and pathways, and ultimately to separate the commonly abundant passenger mutations from the truly significant events. In other words, we aim to determine the Mutational Significance in Cancer (MuSiC) for these large data sets. The integration of analytical operations in the MuSiC framework is widely applicable to a broad set of tumor types and offers the benefits of automation as well as standardization. Herein, we describe the computational structure and statistical underpinnings of the MuSiC pipeline and demonstrate its performance using 316 ovarian cancer samples from the TCGA ovarian cancer project. MuSiC correctly confirms many expected results and identifies several potentially novel avenues for discovery.
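The passenger-versus-driver separation rests on asking whether a gene's mutation count exceeds what the background mutation rate predicts. As a hedged simplification of MuSiC's significantly-mutated-gene test (which uses per-category rates and a convolution of several statistics), the sketch below applies a plain binomial test with hypothetical numbers.

```python
def binomial_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via a stable term recurrence."""
    term = (1 - p) ** n          # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= (n - i) / (i + 1) * p / (1 - p)   # P(X = i) -> P(X = i + 1)
    return 1.0 - cdf

covered_bases = 30_000       # hypothetical: bases of one gene covered across the cohort
observed_mutations = 12      # hypothetical observed somatic mutation count
background_rate = 1e-4       # hypothetical background mutations per covered base

p_value = binomial_sf(observed_mutations, covered_bases, background_rate)
print(f"P(>= {observed_mutations} mutations by chance) = {p_value:.2e}")
```

Genes whose p-value survives multiple-testing correction across the genome are the candidates reported as significantly mutated.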

15.
To provide comprehensive data on the prevalence of mutations in Leber congenital amaurosis (LCA) candidate genes in a large Indian cohort, ninety-two unrelated subjects were recruited after complete ophthalmic examination and informed consent. Targeted re-sequencing of 20 candidate genes was performed using the Agilent HaloPlex target enrichment assay and sequenced on the Illumina MiSeq platform. The data were analyzed using a standard bioinformatics pipeline, and variants were annotated, validated, and segregated. Genotype-phenotype correlation was performed for the mutation-positive cases. Targeted next-generation sequencing (NGS) of the 20 candidate genes generated data with an average sequence coverage and depth of 99.03% and 134×, respectively. Mutations were identified in 61% (56/92) of the cases; these were validated, segregated in the families, and absent in 200 control chromosomes. The mutations were observed in 14/20 candidate genes, and 39% (21/53) were novel. Distinct phenotypes were observed with respect to genotypes. To our knowledge, this study presents the first comprehensive mutation spectrum of LCA in a large Indian cohort. The mutation-negative cases indicate scope for finding novel candidate gene(s), although mutations in deep intronic and regulatory regions cannot be ruled out.

16.
17.
18.
With the increasing popularity of whole-genome shotgun sequencing (WGSS) via high-throughput sequencing technologies, it is becoming highly desirable to perform comparative studies involving multiple individuals (from a specific population, race, or a group sharing a particular phenotype). The conventional approach for a comparative genome variation study involves two key steps: (1) each paired-end high-throughput sequenced genome is compared with a reference genome and its (structural) differences are identified; (2) the lists of structural variants in each genome are compared against each other. In this study we propose to move away from this two-step approach to a novel one in which all genomes are compared with the reference genome simultaneously, to obtain much higher accuracy in structural variation detection. For this purpose, we introduce the maximum parsimony-based simultaneous structural variation discovery problem for a set of high-throughput sequenced genomes and provide efficient algorithms to solve it. We compare the proposed framework with the conventional framework on the genomes of the Yoruban mother-father-child trio, as well as the CEU trio of European ancestry (both sequenced on Illumina platforms). We observed that the conventional framework predicts an unexpectedly high number of de novo variations in the child in comparison to the parents and misses some of the known variations. Our proposed framework, on the other hand, not only significantly reduces the number of incorrectly predicted de novo variations but also predicts more of the known (true) variations.
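The maximum-parsimony idea can be pictured as a set-cover problem: choose the fewest SV events that together explain every supporting observation across all genomes at once, so a shared inherited event is preferred over separate per-genome (apparently de novo) calls. The Python sketch below uses the classic greedy set-cover heuristic with invented events; the paper's algorithms are considerably more refined.

```python
def parsimonious_events(candidates, observations):
    """candidates: {event: set of observation ids it explains}.
    Greedily pick events until all observations are explained."""
    uncovered = set(observations)
    chosen = []
    while uncovered:
        event, explained = max(candidates.items(),
                               key=lambda kv: len(kv[1] & uncovered))
        if not explained & uncovered:
            break  # remaining observations cannot be explained by any candidate
        chosen.append(event)
        uncovered -= explained
    return chosen, uncovered

# Hypothetical discordant-read clusters from a trio; one deletion call
# explains evidence in all three genomes, so the near-duplicate per-child
# "de novo" candidate is never selected.
candidates = {
    "DEL@chr1:10500": {"mother:c1", "father:c3", "child:c2"},
    "DEL@chr1:10510": {"child:c2"},
    "INS@chr2:40000": {"father:c4"},
}
print(parsimonious_events(candidates, {"mother:c1", "father:c3", "child:c2", "father:c4"}))
```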

19.
Multilocus sequence typing (MLST) analysis for semi-routine applications is hindered by the downstream, manually intensive steps of processing the raw sequence data files. This report describes the development of an MLST pipeline that automates DNA sequence editing and analysis in order to significantly reduce the time required for processing data. Validation using a pneumococcal dataset revealed complete agreement between the results generated by manual and automated workflows. The MLST pipeline was developed for both double-strand and single-strand sequencing.
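The step such a pipeline ultimately automates is small to state: map each locus's edited sequence to an allele number, then look the resulting profile up in a sequence-type table. A minimal Python sketch with hypothetical alleles and profiles (a real pneumococcal scheme uses seven loci):

```python
# Hypothetical stand-ins for a real MLST scheme's allele and profile databases.
ALLELES = {
    "aroE": {"ATGGCA": 1, "ATGGCG": 2},
    "gdh":  {"TTGACC": 1, "TTGACT": 4},
}
ST_PROFILES = {(1, 1): 81, (2, 4): 90}   # (aroE, gdh, ...) -> sequence type
LOCUS_ORDER = ("aroE", "gdh")

def assign_st(sample_seqs):
    """Map per-locus sequences to allele numbers, then to a sequence type."""
    profile = tuple(ALLELES[locus][sample_seqs[locus]] for locus in LOCUS_ORDER)
    return profile, ST_PROFILES.get(profile, "novel ST")

print(assign_st({"aroE": "ATGGCA", "gdh": "TTGACC"}))  # -> ((1, 1), 81)
```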

20.
The effective control of multidrug-resistant tuberculosis (MDR-TB) relies upon the timely diagnosis and correct treatment of all tuberculosis cases. Whole genome sequencing (WGS) has great potential as a method for the rapid diagnosis of drug-resistant Mycobacterium tuberculosis (Mtb) isolates. This method overcomes most of the problems associated with current phenotypic drug susceptibility testing. However, the application of WGS in the clinical setting has been deterred by data complexities and the skill requirements for implementing the technologies, as well as by clinical interpretation of the next-generation sequencing (NGS) data. The proposed diagnostic application draws upon recent discoveries of patterns of Mtb clade-specific genetic polymorphisms associated with antibiotic resistance. A catalogue of genetic determinants of resistance to thirteen anti-TB drugs was created for each phylogenetic clade. A computational algorithm for the identification of states of diagnostic polymorphisms was implemented as an online software tool, Resistance Sniffer (http://resistance-sniffer.bi.up.ac.za/), and as a stand-alone software tool to predict drug resistance in Mtb isolates using complete or partial genome datasets in different file formats, including raw Illumina fastq read files. The program was validated on sequenced Mtb isolates with antibiotic resistance trial data available from the GMTV database and from the TB Platform of the South African Medical Research Council (SAMRC), Pretoria. The program proved suitable for probabilistic prediction of the drug resistance profiles of individual strains and of large sequence data sets.
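At its simplest, catalogue-based prediction intersects an isolate's called variants with a per-clade table of resistance-conferring polymorphisms. The sketch below uses two well-known resistance mutations (rpoB S450L, katG S315T) in an otherwise invented table; Resistance Sniffer's probabilistic, clade-aware model is more involved than this plain lookup.

```python
# Per-clade catalogue of resistance determinants: (gene, codon, mutant aa) -> drug.
# The clade names and table contents are illustrative, not the tool's real data.
CATALOGUE = {
    "Beijing": {("rpoB", 450, "L"): "rifampicin", ("katG", 315, "T"): "isoniazid"},
    "LAM":     {("rpoB", 450, "L"): "rifampicin"},
}

def predict_resistance(clade, variants):
    """variants: set of (gene, codon, mutant_aa) calls for one isolate."""
    table = CATALOGUE.get(clade, {})
    return sorted({drug for var, drug in table.items() if var in variants})

print(predict_resistance("Beijing", {("rpoB", 450, "L"), ("gyrA", 90, "V")}))
# -> ['rifampicin']; the gyrA variant is absent from this toy catalogue
```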
