Similar articles
20 similar articles found (search time: 15 ms)
1.
This work presents more precise computational methods for improving the diagnosis of Parkinson's disease based on the detection of dysphonia. New methods are presented for enhanced evaluation and for recognizing patients affected by Parkinson's disease at an early stage. The analysis is performed with a significant level of error tolerance, and the results are established with a corrected t-test. New ensembles and other machine learning methods are used, including a multinomial logistic regression classifier with a Haar wavelet transformation as a projection filter, which outperforms logistic regression. Finally, a novel and reliable inference system is presented for the early recognition of people affected by this disease, together with a new measure of disease severity. Feature selection is based on Support Vector Machines and a ranker search method. The performance of each model is compared with that of existing methods, the main advancements are examined, and the results are promising. Reliable methods are proposed for Parkinson's disease, including sparse multinomial logistic regression, Bayesian networks, Support Vector Machines, Artificial Neural Networks, boosting methods, and their ensembles. The study aims to improve the quality of Parkinson's disease treatment by tracking patients and to reinforce the viability of cost-effective, regular, and precise telemonitoring applications.

2.
3.
4.
Insights on biology and evolution from microbial genome sequencing   Cited by: 9 (self-citations: 0, citations by others: 9)
No field of research has embraced and applied genomic technology more than the field of microbiology. Comparative analysis of nearly 300 microbial species has demonstrated that the microbial genome is a dynamic entity shaped by multiple forces. Microbial genomics has provided a foundation for a broad range of applications, from understanding basic biological processes, host-pathogen interactions, and protein-protein interactions, to discovering DNA variations that can be used in genotyping or forensic analyses, the design of novel antimicrobial compounds and vaccines, and the engineering of microbes for industrial applications. Most recently, metagenomics approaches are allowing us to begin to probe complex microbial communities for the first time, and they hold great promise in helping to unravel the relationships between microbial species.

5.
6.
As whole-genome sequencing becomes commoditized and we begin to sequence and analyze personal genomes for clinical and diagnostic purposes, it is necessary to understand what constitutes a complete sequencing experiment for determining genotypes and detecting single-nucleotide variants. Here, we show that the current recommendation of ~30× coverage is not adequate to produce genotype calls across a large fraction of the genome with acceptably low error rates. Our results are based on analyses of a clinical sample sequenced on two related Illumina platforms, GAII(x) and HiSeq 2000, to a very high depth (126×). We used these data to establish genotype-calling filters that dramatically increase accuracy. We also empirically determined how the callable portion of the genome varies as a function of the amount of sequence data used. These results help provide a "sequencing guide" for future whole-genome sequencing decisions and metrics by which coverage statistics should be reported.

7.
Genetics in Medicine, 2020, 22(4): 803-808
Purpose: Uniparental disomy (UPD) is the rare occurrence of two homologous chromosomes originating from the same parent and is typically identified by marker analysis or single-nucleotide polymorphism (SNP)-based microarrays. UPDs may lead to disease due to imprinting effects, underlying homozygous pathogenic variants, or low-level mosaic aneuploidies. In this study we detected clinically relevant UPD events in both trio and single exome sequencing (ES) data.
Methods: UPD was detected by applying a method based on Mendelian inheritance errors to a cohort of 4912 ES trios (all UPD types) and by using median absolute deviation–scaled regions of homozygosity to a cohort of 29,723 single ES samples (isodisomy only).
Results: As positive controls, we accurately identified three mixed UPD, three isodisomy, as well as two segmental UPD events that were all previously reported by SNP-based microarrays. In addition, we identified three segmental UPD and 11 isodisomy events. This resulted in a novel diagnosis based on imprinting for one patient, and adjusted genetic counseling for another patient.
Conclusion: UPD can easily be identified using both single and trio ES and may be clinically relevant to patients. UPD analysis should become routine in clinical ES, because it increases the diagnostic yield and could affect genetic counseling.
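
The trio-based idea in the Methods, flagging sites where the child's genotype cannot be explained by one allele from each parent, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the authors' algorithm, and the function names, the genotype representation (allele tuples), and the example sites are all hypothetical.

```python
from collections import Counter

def is_consistent(child, father, mother):
    """True if the child can have received one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

def inheritance_error(child, father, mother):
    """Classify one site: None when Mendelian-consistent, otherwise report
    which parent alone could have supplied both of the child's alleles."""
    if is_consistent(child, father, mother):
        return None
    a, b = child
    if a in mother and b in mother:
        return "maternal_only"
    if a in father and b in father:
        return "paternal_only"
    return "other"  # e.g. an allele seen in neither parent

def summarize_errors(sites):
    """Count error classes per chromosome.
    sites: iterable of (chromosome, child, father, mother) genotypes."""
    summary = {}
    for chrom, child, father, mother in sites:
        label = inheritance_error(child, father, mother)
        if label is not None:
            summary.setdefault(chrom, Counter())[label] += 1
    return summary

# Hypothetical trio genotypes, for illustration only.
sites = [
    ("chr15", ("G", "G"), ("A", "A"), ("G", "G")),
    ("chr15", ("T", "T"), ("C", "C"), ("C", "T")),
    ("chr15", ("A", "C"), ("A", "A"), ("C", "C")),
    ("chr1", ("A", "G"), ("A", "A"), ("A", "G")),
]
print(summarize_errors(sites))
```

A chromosome that accumulates errors of a single class (here, only `maternal_only` on chr15) is the kind of pattern a UPD caller would follow up; a real pipeline would also need genotype-quality filters and a statistical threshold, which this sketch omits.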

8.
European Journal of Clinical Microbiology & Infectious Diseases - Rickettsia and Coxiella burnetii are zoonotic tick-borne pathogens that cause febrile illnesses in humans. Metagenomic...

9.
Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM, short tandem repeat profiling using flank-based mapping, a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and genomes of organelles). We used STR-FM to study STR error rates and patterns in publicly available human and in-house generated ultradeep plasmid sequencing data sets. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Utilizing this pipeline, for the first time we determined the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCR-free protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping, and thus is of wide interest to researchers investigating disease-related STRs and STR evolution.

Short tandem repeats (STRs) of 1–6 base pairs per motif constitute ∼3% of the human genome (Lander 2001). Due to the high incidence of polymerase slippage at STRs (Levinson and Gutman 1987; Abdulovic et al.
2011; Baptiste and Eckert 2012), these repeats have elevated germline mutation and polymorphism rates. After a certain threshold length, STRs are termed microsatellites (Kelkar et al. 2010; Ananda et al. 2013). The high level of polymorphism makes microsatellites attractive markers for population and conservation genetics studies (Jarne and Lagoda 1996; Sunnucks 2000; Wan et al. 2004; Kim and Sappington 2013) and for identifying individuals in forensics (Hagelberg et al. 1991; Chambers et al. 2014). Many STRs are involved in gene regulation and protein function (Li et al. 2004), with ∼17% of human genes containing STRs in their open reading frames (Gemayel et al. 2010). Although long microsatellites have attracted much attention, length alterations even within relatively short repeat tracts are sometimes associated with disease (Li et al. 2004). For instance, differences in the number of repeats at the (TG)10-13(T)5-9 STR located within the splicing branch/acceptor site of the CFTR gene (exon 9) can affect in-frame exon skipping and, as a result, can influence the severity of cystic fibrosis (Cuppens et al. 1990; Chu et al. 1991). The purity of STRs (the degree to which the perfect STR sequence remains uninterrupted) also has a functional effect. Interrupted STRs have lower mutation rates (Ananda et al. 2014), and this can diminish disease risk. For instance, ∼6% of Ashkenazi Jews have a T to A mutation in the APC gene (encoding for a tumor suppressor) that alters an interrupted STR (A)3T(A)4 into a perfect (A)8 (Laken et al. 1997). This increases the probability of somatic frameshift mutation within the STR, leading to APC protein inactivation. As a result, Ashkenazi Jews have a higher colorectal cancer risk (Gryfe et al. 1999). 
Since even small changes in STR length and purity can have functional effects, accurate STR profiling is crucial.

Despite the importance of STRs in evolution and disease, their accurate genotyping from next generation sequencing (NGS) data has been challenging (for review, see Treangen and Salzberg 2012). Sequencing library construction frequently includes polymerase chain reaction (PCR) steps during which a polymerase might undergo slippage at STRs, leading to amplicons that differ in length due to expansion and contraction of repeat units (Ellegren 2004; Wang et al. 2011). Additionally, base calling by NGS instruments at repetitive regions is frequently imprecise. These factors result in high sequencing errors at homopolymer runs produced by the 454 (Roche) and Illumina instruments (Balzer et al. 2010; Albers et al. 2011).

From a bioinformatics perspective, if STR-containing reads are mapped in their entirety, some reads cannot be mapped because of high mismatch/indel penalties associated with STR lengths different than those at the corresponding positions in the reference genome. This obscures accurate estimation of allele frequency and underestimates the real level of STR variation in the genome. To alleviate this problem, a short-read alignment approach using nonrepetitive flanks of STR-containing reads has been proposed recently (lobSTR) (Gymrek et al. 2012). This tool has fast running time and takes into account PCR stutter noise during the genotyping stage. However, the entropy scanning implemented by lobSTR to detect STRs has low sensitivity for mononucleotide STRs and short STRs (<25 bp), which constitute a large proportion of STRs in the genome. Additionally, the allele frequency at STRs for genetically heterogeneous samples, for which a simple 1:1 ratio in allele frequency present in heterozygous diploids is not expected (e.g., for tumors, viral populations, and organelles), cannot be determined.
Furthermore, lobSTR uses a fixed (embedded in the program) mapping algorithm. Novel short-read mapping and STR detection algorithms (Pellegrini et al. 2010; Lim et al. 2013) are constantly being developed; an STR-profiling tool that can be customized to incorporate emerging mapping algorithms is needed.

The recently released PCR-free Illumina library preparation protocol (hereafter called “PCR−”) is expected to improve STR genotyping accuracy. The direct advantage of limiting PCR steps during NGS is the increased uniformity of the sequencing depth (Kozarewa et al. 2009). Also, this protocol eliminates duplicate reads that obscure allele frequency profiling for heterogeneous genetic samples. Importantly, the degree to which the accuracy of calling STR alleles is improved using the PCR-free protocol has not been evaluated previously. Moreover, massive amounts of data have already been generated by the NGS technology with the PCR-containing library preparation protocol (hereafter called “PCR+”), and some such data cannot be regenerated due to the scarcity of samples and/or time and cost constraints. Therefore, universal methods are urgently needed that can evaluate and correct STR errors generated by NGS technology (both PCR− and PCR+) and accommodate evolving protocols and sequencing techniques.

Some efforts have been made to evaluate errors generated by NGS at STRs. For instance, errors at STRs sequenced with the PCR+ protocol vary with repeat number and motif size (Luo et al. 2012). However, an explicit quantitation of various sources of STR-related sequencing errors has been lacking, which hinders an unambiguous estimation of STR mutational properties. Indeed, as both mutation and sequencing error rates increase with STR length (Kelkar et al. 2008; Luo et al. 2012; Highnam et al. 2013), one cannot confidently decipher mutation rates without accounting for sequencing error rates.
Recently, a tool to guide genotyping of STRs using informed error profiles from inbred Drosophila lines (RepeatSeq) has been released (Highnam et al. 2013). This tool utilizes reads mapped by other programs, such as BWA (Li and Durbin 2009) and Bowtie (Langmead et al. 2009), and predicts the most probable genotype at a locus based on STR motif, length, and base quality. However, RepeatSeq uses the whole-read mapping approach, which introduces a bias toward the STR length in the reference genome (Gymrek et al. 2012) and thus might obscure the true STR variation spectrum. Such biases can be accounted for by an error correction model based on the STR flank-based method.

To profile the full spectrum of STR lengths in the human and other genomes, and to correct for NGS-associated STR errors, we developed STR-FM (short tandem repeat profiling using a flank-based mapping approach), a flexible pipeline for detecting and genotyping STRs from short-read sequencing data. Our pipeline can detect STRs of any length, including short ones (as short as only two repeats), includes an error-correcting module, and can incorporate any NGS mapping algorithm with paired-end mapping capability, making it adaptable to new mapping methods as they become available. Applying this pipeline, we asked the following questions. First, what are the rates and patterns of sequencing errors associated with STRs of different motif sizes (mono-, di-, tri-, and tetranucleotides), motif compositions, and repeat numbers? These were contrasted between publicly available genome-wide data sets sequenced with PCR+ and PCR− protocols and validated with in-house generated, ultradeep sequencing of plasmids harboring individual STR sequences. Second, do technical errors have different patterns from true STR mutations? Third, based on the detailed knowledge of the error profiles, what is the minimum sequencing depth required for producing reliable STR genotypes for PCR+ and PCR− protocols?
As a result, we provide the scientific community with STR-FM, a reproducible and versatile pipeline for genotyping STRs that incorporates an error correction model. To illustrate the utility of STR-FM, we applied it to the completely sequenced human genomes from the Platinum Genomes Project (Ajay et al. 2011) and determined human genome-wide germline mutation rates at STRs.
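
The first step of a flank-based approach, locating an STR inside a read and separating it from its nonrepetitive flanks so that only the flanks are handed to a mapper, can be sketched as below. This is a simplified illustration and not STR-FM's actual detection code: the regular expression, the function names, and the example read are assumptions, and the mapping step itself is left out.

```python
import re

# A motif of 1-6 bases (shortest motif tried first) followed by at least
# one more copy of itself, i.e. two or more tandem repeats in total.
STR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1+")

def find_strs(read):
    """Return (motif, repeat_count, start, end) for each tandem repeat found."""
    hits = []
    for m in STR_PATTERN.finditer(read):
        motif = m.group(1)
        hits.append((motif, len(m.group(0)) // len(motif), m.start(), m.end()))
    return hits

def split_on_longest_str(read):
    """Split a read into (left_flank, motif, repeat_count, right_flank)
    around its longest tandem repeat; None if no repeat is present."""
    hits = find_strs(read)
    if not hits:
        return None
    motif, count, start, end = max(hits, key=lambda h: h[3] - h[2])
    return read[:start], motif, count, read[end:]

# Hypothetical read: two 8-bp flanks around five copies of CAG.
read = "GATTACAT" + "CAG" * 5 + "TCGATGCA"
print(split_on_longest_str(read))
```

In a full pipeline the two flanks would be aligned as a pair, and the repeat count observed in the read would be compared with the reference length between the mapped flanks. Because the read is never aligned across the repeat itself, an allele that differs in length from the reference incurs no indel penalty.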

10.
Genetics in Medicine, 2016, 18(7): 712-719
Purpose: To develop and validate VisCap, a software program targeted to clinical laboratories for inference and visualization of germ-line copy-number variants (CNVs) from targeted next-generation sequencing data.
Methods: VisCap calculates the fraction of overall sequence coverage assigned to genomic intervals and computes log2 ratios of these values to the median of reference samples profiled using the same test configuration. Candidate CNVs are called when log2 ratios exceed user-defined thresholds.
Results: We optimized VisCap using 14 cases with known CNVs, followed by prospective analysis of 1,104 cases referred for diagnostic DNA sequencing. To verify calls in the prospective cohort, we used droplet digital polymerase chain reaction (PCR) to confirm 10/27 candidate CNVs and 72/72 copy-neutral genomic regions scored by VisCap. We also used a genome-wide bead array to confirm the absence of CNV calls across panels applied to 10 cases. To improve specificity, we instituted a visual scoring system that enabled experienced reviewers to differentiate true-positive from false-positive calls with minimal impact on laboratory workflow.
Conclusions: VisCap is a sensitive method for inferring CNVs from targeted sequence data from targeted gene panels. Visual scoring of data underlying CNV calls is a critical step to reduce false-positive calls for follow-up testing.
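
The normalization described in the Methods (per-interval coverage fractions, log2 ratios against the median of reference samples, and threshold-based calls) can be sketched as follows. This is not VisCap's own code: the function names, the toy read counts, and the thresholds of ±0.4 are assumptions chosen for illustration, and every interval is assumed to have nonzero coverage.

```python
import math
from statistics import median

def coverage_fractions(counts):
    """Fraction of a sample's total coverage assigned to each interval."""
    total = sum(counts)
    return [c / total for c in counts]

def log2_ratios(sample_counts, reference_counts):
    """log2 of each interval's coverage fraction relative to the median
    fraction observed for that interval across the reference samples."""
    sample = coverage_fractions(sample_counts)
    refs = [coverage_fractions(r) for r in reference_counts]
    ratios = []
    for i, frac in enumerate(sample):
        ref_median = median(r[i] for r in refs)
        ratios.append(math.log2(frac / ref_median))
    return ratios

def call_cnvs(ratios, gain=0.4, loss=-0.4):
    """Label each interval using user-defined (here hypothetical) thresholds."""
    calls = []
    for r in ratios:
        if r >= gain:
            calls.append("gain")
        elif r <= loss:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls

# Toy data: four intervals, three reference samples, one test sample
# with half the expected coverage in the second interval.
references = [[100, 100, 100, 100]] * 3
sample = [100, 50, 100, 100]
print(call_cnvs(log2_ratios(sample, references)))
```

Working with fractions of total coverage rather than raw counts makes samples sequenced to different depths comparable. A side effect, visible in the toy data, is that a loss in one interval slightly raises the ratios of all other intervals.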

11.
Next-generation sequencing technologies have made it possible to sequence targeted regions of the human genome in hundreds of individuals. Deep sequencing represents a powerful approach for the discovery of the complete spectrum of DNA sequence variants in functionally important genomic intervals. Current methods for single nucleotide polymorphism (SNP) detection are designed to detect SNPs from single individual sequence data sets. Here, we describe a novel method SNIP-Seq (single nucleotide polymorphism identification from population sequence data) that leverages sequence data from a population of individuals to detect SNPs and assign genotypes to individuals. To evaluate our method, we utilized sequence data from a 200-kilobase (kb) region on chromosome 9p21 of the human genome. This region was sequenced in 48 individuals (five sequenced in duplicate) using the Illumina GA platform. Using this data set, we demonstrate that our method is highly accurate for detecting variants and can filter out false SNPs that are attributable to sequencing errors. The concordance of sequencing-based genotype assignments between duplicate samples was 98.8%. The 200-kb region was independently sequenced to a high depth of coverage using two sequence pools containing the 48 individuals. Many of the novel SNPs identified by SNIP-Seq from the individual sequencing were validated by the pooled sequencing data and were subsequently confirmed by Sanger sequencing. We estimate that SNIP-Seq achieves a low false-positive rate of ∼2%, improving upon the higher false-positive rate for existing methods that do not utilize population sequence data. 
Collectively, these results suggest that analysis of population sequencing data is a powerful approach for the accurate detection of SNPs and the assignment of genotypes to individual samples.

With the availability of several next-generation sequencing platforms, the cost of DNA sequencing has dropped dramatically over the past few years and improvements in technology are expected to decrease the cost further (Shendure and Ji 2008). Next-generation sequencers, such as the Illumina Genome Analyzer (GA), can generate gigabases of nucleotides per day and have enabled the sequencing of complete individual human genomes (Bentley et al. 2008; Ley et al. 2008; Wang et al. 2008; Wheeler et al. 2008; McKernan et al. 2009). While the resequencing of complete human genomes still remains quite expensive, the targeted sequencing of specific genomic intervals in a large population of individuals is now feasible in an individual laboratory. Resequencing of coding sequences of genes in large populations has previously been shown to be useful for identifying multiple rare variants affecting quantitative traits (Cohen et al. 2004, 2006; Ji et al. 2008). Resequencing of genomic regions identified by genome-wide association studies in healthy and diseased populations represents a powerful strategy for assessing the contribution of rare variants to disease etiology. Nejentsev et al. (2009) have used this approach to identify four rare variants protective for type 1 diabetes.

For harnessing the capacity of next-generation sequencers for deep population resequencing, the first challenge is to selectively capture DNA from the region of interest. Recently, Craig et al. (2008) used long-range PCR and DNA barcodes to sequence specific regions of the human genome in multiple samples simultaneously using the Illumina GA. Harismendy et al.
(2009) also used long-range PCR to sequence targeted regions of the human genome using multiple sequencing platforms to evaluate the feasibility of targeted population sequencing and the concordancy of variant calling between the different platforms. However, traditional sequence capture methods, such as long-range polymerase chain reaction (LR-PCR), are not adequate for capturing thousands of noncontiguous regions of the genome, e.g., all exons, in a large number of samples. Several high-throughput target capture methods have been developed (Hodges et al. 2007; Okou et al. 2007; Porreca et al. 2007; Turner et al. 2009).

After millions of reads have been generated by the sequencer, the next challenge is to identify genetic variants by mapping the reads to a reference sequence. A variety of tools have been developed that can efficiently align hundreds of millions of short reads to a reference sequence even in the presence of multiple errors in the reads (Li et al. 2008; Langmead et al. 2009; Li and Durbin 2009; Li et al. 2009b). Each base mismatch in an aligned read represents either a sequencing error or a single nucleotide variant in the diploid individual. To compensate for the high sequencing error rates of next-generation sequencing platforms, one requires the presence of multiple overlapping reads, each with a base different from the reference base for single nucleotide polymorphism (SNP) calling. Base quality values, probability estimates of the correctness of a base call, are particularly useful for distinguishing sequencing errors from SNPs. The Illumina sequencing system generates a phred-like quality score for each base call using various predictors of the sequencing errors. SNP calling methods for Illumina sequence data utilize these base quality values to compute the likelihood of different genotypes at each position using Bayesian or statistical models (Li et al. 2008, 2009a).
Positions for which the most likely genotype is different from the reference genotype and which satisfy additional filters on neighborhood sequence quality, read alignment quality, etc. are reported as SNPs. However, sequencing errors for the Illumina GA are not completely random and are dependent on the local sequence context of the base being read, the position of the base in the read, etc. (Dohm et al. 2008; Erlich et al. 2008). Therefore, assuming independence between multiple base calls, each with a non-reference base, results in overcalling of SNPs, i.e., increased number of false-positives. To reduce the number of false variant calls, the MAQ SNP caller (Li et al. 2008) uses a dependency model to estimate an average error rate using all base quality scores.

MAQ and other SNP calling methods have enabled fairly accurate detection of SNPs from resequencing of individual human genomes (Bentley et al. 2008; Wang et al. 2008). However, there is potential for developing more accurate SNP detection methods, in particular, by taking advantage of sequence information from a population of sequenced individuals. Comparison of sequenced reads for a potential variant site across multiple individuals has the potential to differentiate systematic sequencing errors from real SNPs. Patterns of mismatched bases (bases not matching the reference base) resulting from systematic sequencing errors are likely to be shared across individuals. On the other hand, the profiles of mismatched bases between individuals with and without a SNP are likely to be distinct. Comparison of read alignments across multiple individuals also has the potential to filter out SNPs that are an artifact of inaccurate read alignments.
We present a probabilistic model that leverages sequence data from a population of individuals, each sequenced separately, for detecting single nucleotide variants and also assigning genotypes to each individual in the population. Our method recalibrates each base quality value by adding a population error correction to the Illumina base error probability. This correction is computed using the distribution of mismatched bases across all sequenced individuals. The recalibrated base quality values are then used to compute genotype probabilities for each individual under a simple Bayesian model that assumes independence between base calls. Finally, positions in the sequence with one or more individuals showing evidence for harboring a non-reference allele are identified as SNPs. Craig et al. (2008) described a similar approach for SNP detection using sequence data from multiple individuals where they used Bayes factors to compare the fraction of reads with the alternate allele across multiple individuals. Sites at which one or more individuals have a fraction of reads with the alternate allele sufficiently greater than the average were identified as SNPs. Our model is much more general and can take advantage of the complete information about each base call, i.e., base quality value, position in the read containing the base, and the strand to which the read aligns.

To evaluate our population SNP detection method, we analyzed sequence data from a 200-kilobase (kb)-long region on chromosome 9p21 that was sequenced to a median depth of 45× in 48 individuals using the Illumina Genome Analyzer (O Harismendy, V Bansal, N Rahim, X Wang, N Heintzman, B Ren, EJ Topol, and KA Frazer, in prep.). We demonstrate that our method can accurately detect SNPs with a low false-positive rate (∼2%) and a low false-negative rate in comparison to SNP detection from individual sequence data using MAQ.
By comparing genotype calls between replicate samples, we show a 98.8% accuracy for sequence-based genotyping using our method.
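
The genotype-probability step described above (base error probabilities derived from Phred-like quality values, an additive error correction, and independence between base calls) can be sketched as follows. This is a generic diploid genotype-likelihood calculation, not SNIP-Seq's implementation: the function names are hypothetical, and `correction` is a placeholder for the population error correction, whose computation from the cross-individual mismatch distribution is not reproduced here.

```python
import math

def phred_to_error(q):
    """Convert a Phred-scaled quality Q into an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)

def genotype_log_likelihoods(bases, quals, ref, alt, correction=0.0):
    """Log-likelihood of the observed bases for 0, 1, or 2 copies of `alt`,
    treating base calls as independent.
    `correction` is a hypothetical additive adjustment to each base's
    error probability (assumed given, not estimated here)."""
    lls = []
    for n_alt in (0, 1, 2):
        ll = 0.0
        for base, q in zip(bases, quals):
            e = min(phred_to_error(q) + correction, 0.75)
            # Probability of this call if the read came from a ref or alt allele;
            # an error is assumed equally likely to give any of the other 3 bases.
            p_if_ref = (1 - e) if base == ref else e / 3
            p_if_alt = (1 - e) if base == alt else e / 3
            ll += math.log((1 - n_alt / 2) * p_if_ref + (n_alt / 2) * p_if_alt)
        lls.append(ll)
    return lls

def most_likely_genotype(bases, quals, ref, alt, correction=0.0):
    """Number of alt alleles (0, 1, or 2) with the highest likelihood."""
    lls = genotype_log_likelihoods(bases, quals, ref, alt, correction)
    return max(range(3), key=lambda g: lls[g])

# Hypothetical pileup: eight reads at one site, all with quality 30.
print(most_likely_genotype("AAAAGGGG", [30] * 8, ref="A", alt="G"))
```

Raising `correction` for a site where mismatches recur across many individuals lowers the weight of each non-reference call, which is the intuition behind using population data to suppress systematic errors. A complete caller would also combine these likelihoods with genotype priors.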

12.
With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal that is difficult to pick up with small sample sizes. Lastly, we use our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing data set of tens of thousands of individuals assayed at a few hundred genic regions.

The demography of an evolving population strongly influences the genetic variation found within it, and understanding the intricate interplay between natural selection, genetic drift, and demography is a key aim of population genomics. For example, the human census population has expanded more than 1000-fold in the last 400 generations (Keinan and Clark 2012), resulting in a state that is profoundly out of equilibrium with respect to genetic variation. Recently, there has been much interest in studying the consequences of such rapid expansion on mutation load and the genetic architecture of complex traits (Gazave et al. 2013; Lohmueller 2014; Simons et al.
2014). Estimating the population demography is necessary for developing more accurate null models of neutral evolution in order to identify genomic regions subject to natural selection (Williamson et al. 2005; Boyko et al. 2008; Lohmueller et al. 2008). The problem of inferring demography from genomic data also has several other important applications. In particular, the population demography is needed to correct for spurious genotype-phenotype associations in genome-wide association studies due to hidden population substructure (Marchini et al. 2004; Campbell et al. 2005; Clayton et al. 2005), to date historical population splits, migrations, admixture, and introgression events (Gravel et al. 2011; Li and Durbin 2011; Lukić and Hey 2012; Sankararaman et al. 2012), and to compute random match probabilities accurately in forensic applications (Balding and Nichols 1997; Graham et al. 2000), for example.

A commonly used null model in population genetics assumes that individuals are randomly sampled from a well-mixed population of constant size that evolves neutrally according to some model of random mating (Ewens 2004). However, several recent large-sample sequencing studies in humans (Coventry et al. 2010; Fu et al. 2012; Nelson et al. 2012; Tennessen et al. 2012) have found an excess of single nucleotide variants (SNVs) that have very low minor allele frequency (MAF) in the sample compared to that predicted by coalescent models with a constant effective population size. For example, in a sample of ∼12,500 individuals of European descent analyzed by Nelson et al. (2012), >74% of the SNVs have only one or two copies of the minor allele, and >95% of the SNVs have an MAF <0.5%. On the other hand, assuming a constant population size over time, Kingman’s coalescent predicts that the number of neutral SNVs is inversely proportional to the sample frequency of the variant (Fu 1995).
Keinan and Clark (2012) have suggested that such an excess of sites segregating with low MAF can be explained by recent exponential population growth. In particular, a rapid population expansion produces genealogical trees that have long branch lengths at the tips of the trees, leading to a large fraction of mutations being limited to a single individual in the sample. Motivated by these findings and rapidly increasing sample sizes in population genomics, we here tackle the problem of developing an efficient algorithm for inferring historical effective population sizes and locus-specific mutation rates using a very large sample, with tens or hundreds of thousands of individuals.

At the coarsest level, previous approaches to inferring demography from genomic variation data can be divided according to the representation of the data that they operate on. Full sequence-based approaches for inferring the historical population size such as the works of Li and Durbin (2011) and Sheehan et al. (2013) use between two and a dozen genomes to infer piecewise constant models of historical population sizes. Since these approaches operate genome wide, they can take into account linkage information between neighboring SNVs. On the other hand, they are computationally very expensive and cannot be easily applied to infer recent demographic events from large numbers of whole genomes. A slightly more tractable approach to inferring potentially complex demographies involves comparing the length distribution of identical-by-descent and identical-by-state tracts between pairs of sequences (Palamara et al. 2012; Harris and Nielsen 2013).

The third class of methods, and the one that our approach also belongs to, summarizes the variation in the genome sequences by the sample frequency spectrum (SFS). The SFS of a sample of size n counts the number of SNVs as a function of their mutant allele frequency in the sample.
Since the SFS is a very efficient dimensional reduction of large-scale population genomic data that summarizes the variation in n sequences by n − 1 numbers, it is naturally attractive for computational and statistical purposes. Furthermore, the expected SFS of a random sample drawn from the population strongly depends on the underlying demography, and there have been several previous approaches that exploit this relationship for demographic inference. Nielsen (2000) developed a method based on coalescent tree simulations to infer exponential population growth from single nucleotide polymorphisms that are far enough apart to be in linkage equilibrium. Coventry et al. (2010) developed a similar coalescent simulation-based method that additionally infers per-locus mutation rates and applied this method to exome-sequencing data from ∼10,000 individuals at two genes. Nelson et al. (2012) have also applied this method to a larger data set of 11,000 individuals of European ancestry (CEU) sequenced at 185 genes to infer a recent epoch of exponential population growth. The common feature of all these methods is that they use Monte Carlo simulations to empirically estimate the expected SFS under a given demographic model, and then they compute a pseudo-likelihood function for the demographic model by comparing the expected and observed SFS. The optimization over the demographic models is then performed via grid search procedures. More recently, Excoffier et al. (2013) have developed a software package that employs coalescent tree simulations to estimate the expected joint SFS of multiple subpopulations for inferring potentially very complex demographic scenarios from multipopulation genomic data. The problem of demographic inference has also been approached from the perspective of diffusion processes. Given a demographic model, one can derive a partial differential equation (PDE) for the density of segregating sites at a given derived allele frequency as a function of time. 
Gutenkunst et al. (2009) used numerical methods to approximate the solution to this PDE, while Lukić et al. (2011) approximated this solution using an orthogonal polynomial expansion. The coalescent-based method of Excoffier et al. (2013), fastsimcoal, and the diffusion-based method of Gutenkunst et al. (2009), ∂a∂i, can infer the joint demography of multiple subpopulations with changing population sizes and complex patterns of migration between subpopulations.

In this paper, we focus on the problem of inferring the effective population size as a function of time for a single randomly mating population. As mentioned above, our method is based on the SFS. By restricting our inference to a single population, we are able to compute the expected SFS exactly, rather than using Monte Carlo simulations or solving PDEs numerically. Briefly, we utilize the theoretical work of Polanski et al. (2003) and Polanski and Kimmel (2003), which relate the expected SFS for a sample of size n from a single population to the expected waiting times to the first coalescence event for all sample sizes ≤n. We show that the latter quantities can be computed efficiently and numerically stably for very large sample sizes and for an arbitrary piecewise-exponential model of the historical effective population size. Further, our method utilizes the technique of automatic differentiation to compute exact gradients of the likelihood with respect to the parameters of the effective population size function, thereby facilitating optimization over the space of demographic parameters. These techniques result in our method being both more accurate and more computationally efficient than ∂a∂i and fastsimcoal. In what follows, we carry out an extensive simulation study to demonstrate that our method can infer multiple recent epochs of rapid exponential growth and estimate locus-specific mutation rates with a high accuracy. We then apply our method to analyze data from recent sequencing studies.
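
As a rough illustration of the SFS-based likelihood idea described above, the following Python sketch tabulates the SFS of n sequences as n − 1 counts and scores it against an expected spectrum. It is not the method of the paper: all function names are hypothetical, and the expected spectrum uses the standard constant-population-size neutral result (entry i proportional to 1/i) as a stand-in for the piecewise-exponential model, which is an assumption made only to keep the example short.

```python
from math import log


def site_frequency_spectrum(haplotypes):
    """Count segregating sites by derived-allele count.

    haplotypes: list of n equal-length lists of 0/1
                (0 = ancestral allele, 1 = derived allele).
    Returns a list of n - 1 counts; entry i - 1 is the number of
    sites where exactly i of the n sequences carry the derived allele.
    """
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    sfs = [0] * (n - 1)
    for site in range(num_sites):
        derived = sum(h[site] for h in haplotypes)
        if 0 < derived < n:  # skip monomorphic and fixed sites
            sfs[derived - 1] += 1
    return sfs


def neutral_expected_sfs(n):
    """Expected SFS proportions for a constant-size population (assumed model).

    Under the standard neutral coalescent the expected count at
    derived-allele count i is proportional to 1 / i.
    """
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


def multinomial_log_likelihood(observed, expected_props):
    """Log-likelihood of the observed SFS, up to an additive constant."""
    return sum(k * log(p) for k, p in zip(observed, expected_props) if k > 0)


# Toy data: 4 sequences, 4 sites (hypothetical values).
toy = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
]
observed = site_frequency_spectrum(toy)
expected = neutral_expected_sfs(len(toy))
score = multinomial_log_likelihood(observed, expected)
```

In a demographic inference setting, `neutral_expected_sfs` would be replaced by the expected SFS under each candidate demographic model, and the model parameters would be chosen to maximize the resulting score.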

13.
The hypoxic environment imposes severe selective pressure on species living at high altitude. To understand the genetic bases of adaptation to high altitude in dogs, we performed whole-genome sequencing of 60 dogs including five breeds living at continuous altitudes along the Tibetan Plateau from 800 to 5100 m as well as one European breed. More than 150× sequencing coverage for each breed provides us with a comprehensive assessment of the genetic polymorphisms of the dogs, including Tibetan Mastiffs. Comparison of the breeds from different altitudes reveals strong signals of population differentiation at the locus of hypoxia-related genes including endothelial Per-Arnt-Sim (PAS) domain protein 1 (EPAS1) and beta hemoglobin cluster. Notably, four novel nonsynonymous mutations specific to high-altitude dogs are identified at EPAS1, one of which occurred at a quite conserved site in the PAS domain. The association testing between EPAS1 genotypes and blood-related phenotypes on additional high-altitude dogs reveals that the homozygous mutation is associated with decreased blood flow resistance, which may help to improve hemorheologic fitness. Interestingly, EPAS1 was also identified as a selective target in Tibetan highlanders, though no amino acid changes were found. Thus, our results not only indicate parallel evolution of humans and dogs in adaptation to high-altitude hypoxia, but also provide a new opportunity to study the role of EPAS1 in the adaptive processes.

The mechanisms of organismal adaptation to high-altitude hypoxia have been of great interest in recent years. Highland wild animals have a long life history at high altitude, and the whole genomes of yak (descendants of wild yak) (Qiu et al. 2012), Tibetan antelope (Ge et al. 2013), snow leopard (Cho et al. 2013), and wild boar (Li et al. 2013) have been sequenced. In contrast, the human settlement history on highland is rather short, which dates from ∼25,000 yr ago (Zhao et al. 2009).
Whole-genome genotyping and re-sequencing have been performed for three typical highland populations including Tibetans (Beall et al. 2010; Bigham et al. 2010; Simonson et al. 2010; Yi et al. 2010; Peng et al. 2011; Xu et al. 2011), Andeans (Bigham et al. 2009, 2010), and Ethiopians (Alkorta-Aranburu et al. 2012; Scheinfeldt et al. 2012).

The increased oxygen uptake and delivery are physiological hallmarks of high-altitude adaptation. On one hand, the capacity of oxygen uptake is determined by hemoglobin concentration and oxygen affinity. For example, the Andean highlanders display a high level of hemoglobin concentration (Beall et al. 2002; Beall 2007). The high oxygen affinity of hemoglobin is found in many highland animals such as yak (Weber et al. 1988), alpaca (Piccinini et al. 1990), deer mice (Storz et al. 2007; Storz et al. 2009), bar-headed goose (Zhang et al. 1996; Liang et al. 2001), and Andean goose (Jessen et al. 1991). On the other hand, the rate of oxygen delivery is determined by blood flow. For example, although Tibetans maintain a nearly normal level of hemoglobin concentration and a low level of oxygen saturation, they display a high level of blood flow, resulting in the increase of oxygen delivery (Beall et al. 2001; Erzurum et al. 2007).

Whole-genome scans revealed that positive selection for human high-altitude adaptation occurred in the hypoxia-inducible factor (HIF) pathway (Bigham et al. 2009; Beall et al. 2010; Bigham et al. 2010; Simonson et al. 2010; Yi et al. 2010; Peng et al. 2011; Xu et al. 2011; Alkorta-Aranburu et al. 2012; Scheinfeldt et al. 2012), which regulates genes associated with blood physiology. In addition, metabolic pathways may also be involved in the adaptive process of yak (Qiu et al. 2012) and Tibetan antelope (Ge et al. 2013).

Although many studies have focused on wildlife and human highlanders, no research has been performed on domesticated animals that migrated to the plateau with humans, which represent an adaptation pattern on a short evolutionary time scale of thousands of years. For example, the Tibetan Mastiff is a native dog living in the Tibetan Plateau with an altitude of 3000–6000 m. It is also an ancient breed (Li and Zhang 2012). However, the genetic and physiological mechanisms of its adaptation to high-altitude environments remain elusive.

In this study, we sampled five dog breeds including the Tibetan Mastiff from continuous altitudes along the Ancient Tea Horse Road in southwestern China as well as one European breed. We performed whole-genome sequencing for the dogs and identified candidate genes for high-altitude adaptation using selective sweep mapping. We also measured the hematologic and hemorheologic parameters of the dogs and tested the association between the candidate alleles and blood physiology.

14.
Genetic diagnostics of phenylketonuria (PKU) and tetrahydrobiopterin (BH4) deficient hyperphenylalaninemia (BH4DH) rely on methods that scan for known mutations or on laborious molecular tools that use Sanger sequencing. We have implemented a novel and much more efficient strategy based on high-throughput multiplex-targeted resequencing of four genes (PAH, GCH1, PTS, and QDPR) that, when affected by loss-of-function mutations, cause PKU and BH4DH. We have validated this approach in a cohort of 95 samples with the previously known PAH, GCH1, PTS, and QDPR mutations and one control sample. Pooled barcoded DNA libraries were enriched using a custom NimbleGen SeqCap EZ Choice array and sequenced using a HiSeq2000 sequencer. The combination of several robust bioinformatics tools allowed us to detect all known pathogenic mutations (point mutations, short insertions/deletions, and large genomic rearrangements) in the 95 samples, without detecting spurious calls in these genes in the control sample. We then used the same capture assay in a discovery cohort of 11 uncharacterized HPA patients using a MiSeq sequencer. In addition, we report the precise characterization of the breakpoints of four genomic rearrangements in PAH, including a novel deletion of 899 bp in intron 3. Our study is a proof-of-principle that high-throughput-targeted resequencing is ready to substitute classical molecular methods to perform differential genetic diagnosis of hyperphenylalaninemias, allowing the establishment of specifically tailored treatments a few days after birth.

15.
16.
《Genetics in medicine》2019,21(3):663-675
Purpose: Defects in the cohesin pathway are associated with cohesinopathies, notably Cornelia de Lange syndrome (CdLS). We aimed to delineate pathogenic variants in known and candidate cohesinopathy genes from a clinical exome perspective.
Methods: We retrospectively studied patients referred for clinical exome sequencing (CES, N = 10,698). Patients with causative variants in novel or recently described cohesinopathy genes were enrolled for phenotypic characterization.
Results: Pathogenic or likely pathogenic single-nucleotide and insertion/deletion variants (SNVs/indels) were identified in established disease genes including NIPBL (N = 5), SMC1A (N = 14), SMC3 (N = 4), RAD21 (N = 2), and HDAC8 (N = 8). The phenotypes in this genetically defined cohort skew towards the mild end of CdLS spectrum as compared with phenotype-driven cohorts. Candidate or recently reported cohesinopathy genes were supported by de novo SNVs/indels in STAG1 (N = 3), STAG2 (N = 5), PDS5A (N = 1), and WAPL (N = 1), and one inherited SNV in PDS5A. We also identified copy-number deletions affecting STAG1 (two de novo, one of unknown inheritance) and STAG2 (one of unknown inheritance). Patients with STAG1 and STAG2 variants presented with overlapping features yet without characteristic facial features of CdLS.
Conclusion: CES effectively identified disease-causing alleles at the mild end of the cohesinopathy spectrum and enabled characterization of candidate disease genes.

17.
Psoriasis is a chronic and complex inflammatory skin disease with lesions displaying dramatically altered mRNA expression profiles. However, much less is known about the expression of small RNAs. Here, we describe a comprehensive analysis of the normal and psoriatic skin miRNAome with next-generation sequencing in a large patient cohort. We generated 6.7 × 10^8 small RNA reads representing 717 known and 284 putative novel microRNAs (miRNAs). We also observed widespread expression of isomiRs and miRNA*s derived from known and novel miRNA loci, and a low frequency of miRNA editing in normal and psoriatic skin. The expression and processing of selected novel miRNAs were confirmed with qRT-PCR in skin and other human tissues or cell lines. Eighty known and 18 novel miRNAs were 2- to 42-fold differentially expressed in psoriatic skin. Of particular significance was the 2.7-fold upregulation of a validated novel miRNA derived from the antisense strand of the miR-203 locus, which plays a role in epithelial differentiation. Other differentially expressed miRNAs included hematopoietic-specific miRNAs such as miR-142-3p and miR-223/223*, and angiogenic miRNAs such as miR-21, miR-378, miR-100 and miR-31, which was the most highly upregulated miRNA in psoriatic skin. The functions of these miRNAs are consistent with the inflammatory and hyperproliferative phenotype of psoriatic lesions. In situ hybridization of differentially expressed miRNAs revealed stratified epidermal expression of an uncharacterized keratinocyte-derived miRNA, miR-135b, as well as the epidermal infiltration of the hematopoietic-specific miRNA, miR-142-3p, in psoriatic lesions. This study lays a critical framework for functional characterization of miRNAs in healthy and diseased skin.

18.
Lymphoma is the most common hematological malignancy in developed countries. Outcome is strongly determined by molecular subtype, reflecting a need for new and improved treatment options. Dogs spontaneously develop lymphoma, and the predisposition of certain breeds indicates genetic risk factors. Using the dog breed structure, we selected three lymphoma predisposed breeds developing primarily T-cell (boxer), primarily B-cell (cocker spaniel), and with equal distribution of B- and T-cell lymphoma (golden retriever), respectively. We investigated the somatic mutations in B- and T-cell lymphomas from these breeds by exome sequencing of tumor and normal pairs. Strong similarities were evident between B-cell lymphomas from golden retrievers and cocker spaniels, with recurrent mutations in TRAF3-MAP3K14 (28% of all cases), FBXW7 (25%), and POT1 (17%). The FBXW7 mutations recurrently occur in a specific codon; the corresponding codon is recurrently mutated in human cancer. In contrast, T-cell lymphomas from the predisposed breeds, boxers and golden retrievers, show little overlap in their mutation pattern, sharing only one of their 15 most recurrently mutated genes. Boxers, which develop aggressive T-cell lymphomas, are typically mutated in the PTEN-mTOR pathway. T-cell lymphomas in golden retrievers are often less aggressive, and their tumors typically showed mutations in genes involved in cellular metabolism. We identify genes with known involvement in human lymphoma and leukemia, genes implicated in other human cancers, as well as novel genes that could allow new therapeutic options.

Diffuse large B-cell lymphoma (DLBCL) is a genetically heterogeneous, aggressive form of non-Hodgkin lymphoma (NHL) and the most prevalent form of B-cell NHL in people. Approximately 30,000 DLBCL cases are diagnosed each year in the United States, of which only half are curable (Abramson and Shipp 2005).
Molecular subclassifications of DLBCL are highly predictive of treatment outcome (Alizadeh et al. 2000). NHL can also arise from T cells, making up 10%–15% of all NHL cases. T-cell lymphomas are highly heterogeneous and several subtypes exist (Evens and Gartenhaus 2003; Iqbal et al. 2010).

Dogs have previously been proven useful in determining predisposing genetic markers for human diseases due to the breed structure caused by artificial breeding for phenotypic factors (Dodman et al. 2010; Wilbe et al. 2010; Shearin et al. 2012; Karlsson et al. 2013; Tang et al. 2014). A recent lymphoma study in dogs identified a gene (TRAF3) as being commonly mutated in both dog and human B-cell lymphomas (Bushell et al. 2015). However, this study did not separate canine tumors based on breed, preventing the discovery of somatic mutations reflecting genetic background. Since dog breeds represent genetic isolates, comparing dog breeds with differential predispositions to cancer can indicate the role of the genetic background and allow for better detection of mutations influenced by that genetic background. In dogs, malignant lymphoma is the most common tumor treated with chemotherapy, affecting dogs of all ages and breeds (Valli et al. 2013); however, the high rate of lymphoma in certain breeds and preferential cells of origin among different breeds indicates genetic risk factors (Modiano et al. 2005).

Approximately 70% of all canine lymphomas arise from B cells (Modiano et al. 2005; Ponce et al. 2010). The most common form of canine B-cell lymphoma is the clinical and histological equivalent of human DLBCL (Vail and MacEwen 2000; Modiano et al. 2005; Valli et al. 2011; Ito et al. 2014). As in humans, the CHOP-based chemotherapy protocols are the most effective treatment for canine B-cell lymphomas. T-cell lymphomas are less common in dogs, and the most aggressive types have a higher risk of relapse and early death (Ruslander et al. 1997; Dobson et al. 2001).
Many T-cell lymphomas also have a low long-term survival frequency in humans.

Exome sequencing efforts in human lymphomas have shown that certain mutations are specific for lymphoma subtypes or shared between only a few subtypes (Zhang et al. 2014). For example, DLBCL typically have mutations affecting B-cell receptor signaling or subunits, e.g., CD79A/B, and in histone modifiers like EZH2 and MLL2, PRDM1, TP53, CARD11, and MYD88 (Lenz et al. 2008; Davis et al. 2010; Mandelbaum et al. 2010; Morin et al. 2010; Ngo et al. 2011; Lohr et al. 2012). MYD88 is also recurrently mutated in other B-cell lymphomas like primary central nervous system lymphoma (PCNSL) (Gonzalez-Aguilar et al. 2012), and MLL2 and TP53 mutations have been reported for mantle cell lymphoma (MCL) (Zhang et al. 2014). The two main human DLBCL subtypes, activated B cell (ABC) and germinal center B cell (GCB), which can be distinguished based on differential gene expression and prognosis (Alizadeh et al. 2000), also have unique recurrent mutations. ABC DLBCLs are characterized by mutations in B-cell differentiation genes and often show constitutive activation of NF-κB signaling, whereas BCL2 and MYC translocations are characteristic of GCB DLBCLs (summarized in Pasqualucci et al. 2011b). Canine B-cell lymphomas can also be separated into germinal and post-germinal center types using gene expression data, sharing pathways with their human counterparts (Richards et al. 2013), although more studies are needed to fully elucidate this. T-cell lymphomas are much less studied compared with B-cell lymphomas, particularly DLBCL.

The treatment outcome of canine lymphoma is predictive of the human response to the same treatment (Honigberg et al. 2010; Marconato et al. 2013; London et al. 2014), and canine clinical trials, although highly regulated, are easier to complete compared with human trials.
Hence, a better understanding of canine lymphoma offers great potential to accelerate development of new treatments for human patients. Here, we have compared the somatic mutations of B- and T-cell canine lymphoma in three dog breeds with different lymphoma immunophenotype predispositions. The identified breed-specific patterns are a good opportunity to study interventions targeting the significant mutations.

19.
Three-dimensional tumor models have emerged as valuable in vitro research tools, though the power of such systems as quantitative reporters of tumor growth and treatment response has not been adequately explored. We introduce an approach combining a 3-D model of disseminated ovarian cancer with high-throughput processing of image data for quantification of growth characteristics and cytotoxic response. We developed custom MATLAB routines to analyze longitudinally acquired dark-field microscopy images containing thousands of 3-D nodules. These data reveal a reproducible bimodal log-normal size distribution. Growth behavior is driven by migration and assembly, causing an exponential decay in spatial density concomitant with increasing mean size. At day 10, cultures are treated with either carboplatin or photodynamic therapy (PDT). We quantify size-dependent cytotoxic response for each treatment on a nodule by nodule basis using automated segmentation combined with ratiometric batch-processing of calcein and ethidium bromide fluorescence intensity data (indicating live and dead cells, respectively). Both treatments reduce viability, though carboplatin leaves micronodules largely structurally intact with a size distribution similar to untreated cultures. In contrast, PDT treatment disrupts micronodular structure, causing punctate regions of toxicity, shifting the distribution toward smaller sizes, and potentially increasing vulnerability to subsequent chemotherapeutic treatment.
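
The ratiometric, size-dependent readout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not give the exact ratio used, so the live fraction calcein / (calcein + ethidium) is assumed here, and the function names, the nodule tuple layout, and the size threshold are all hypothetical. The authors' own routines were written in MATLAB.

```python
from bisect import bisect_right


def viability_ratio(calcein, ethidium):
    """Assumed live fraction for one nodule: live / (live + dead) signal."""
    total = calcein + ethidium
    if total <= 0:
        return None  # no usable fluorescence signal in this nodule
    return calcein / total


def mean_viability_by_size(nodules, size_edges):
    """Average the viability ratio within size bins.

    nodules:    iterable of (area, calcein_intensity, ethidium_intensity)
    size_edges: sorted list of bin boundaries; bin k holds areas with
                size_edges[k - 1] <= area < size_edges[k]
    Returns a dict mapping bin index to mean viability.
    """
    sums = {}
    counts = {}
    for area, calcein, ethidium in nodules:
        ratio = viability_ratio(calcein, ethidium)
        if ratio is None:
            continue
        k = bisect_right(size_edges, area)
        sums[k] = sums.get(k, 0.0) + ratio
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


# Hypothetical segmented nodules: (area, calcein, ethidium).
nodules = [(10, 3.0, 1.0), (12, 1.0, 1.0), (200, 1.0, 3.0)]
by_size = mean_viability_by_size(nodules, size_edges=[100])
```

Binning by size before averaging keeps the many small nodules from dominating the summary, which matters when the size distribution is bimodal as reported.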

20.
We have sequenced the genomes of 18 isolates of the closely related human pathogenic fungi Coccidioides immitis and Coccidioides posadasii to more clearly elucidate population genomic structure, bringing the total number of sequenced genomes for each species to 10. Our data confirm earlier microsatellite-based findings that these species are genetically differentiated, but our population genomics approach reveals that hybridization and genetic introgression have recently occurred between the two species. The directionality of introgression is primarily from C. posadasii to C. immitis, and we find more than 800 genes exhibiting strong evidence of introgression in one or more sequenced isolates. We performed PCR-based sequencing of one region exhibiting introgression in 40 C. immitis isolates to confirm and better define the extent of gene flow between the species. We find more coding sequence than expected by chance in the introgressed regions, suggesting that natural selection may play a role in the observed genetic exchange. We find notable heterogeneity in repetitive sequence composition among the sequenced genomes and present the first detailed genome-wide profile of a repeat-induced point mutation (RIP) process distinctly different from what has been observed in Neurospora. We identify promiscuous HLA-I and HLA-II epitopes in both proteomes and discuss the possible implications of introgression and population genomic data for public health and vaccine candidate prioritization. This study highlights the importance of population genomic data for detecting subtle but potentially important phenomena such as introgression.

Coccidioides spp. are dimorphic fungal pathogens, existing alternately as saprobes in the soil and as pathogens of living mammals. Coccidioidomycosis in humans typically begins as a pulmonary infection resulting from inhalation of airborne spores (arthroconidia) created by the asexual development of the soil-dwelling mycelial form of the fungus.
It is estimated that 60% of Coccidioides infections are asymptomatic, but some patients exhibit influenza-like symptoms that may last many months (Arizona Department of Health Services 2007). Less than 1% of patients develop disseminated disease, which in some cases may require lifelong chemotherapeutic treatment with anti-fungal medication. The historic incidence of symptomatic Coccidioides infections in the United States has been approximately 30,000 cases per year, but recent increases in coccidioidomycosis in southern California and Arizona have been reported (Komatsu et al. 2003; Sunenshine et al. 2007; Kim et al. 2009; Vugla et al. 2009). Coccidioides was classified as a “Select Agent” of bioterrorism in response to the U.S. Antiterrorism and Effective Death Penalty Act of 1996 (Dixon 2001).

Coccidioides fungi are endemic to arid regions of the southwestern United States and northern Mexico, and patchily distributed through Central and South America. Once thought to be a monotypic genus, multilocus sequencing testing (MLST) and other genetic analyses have identified two genetically distinct cryptic species: Coccidioides immitis (San Joaquin Valley of California and Southern California) and Coccidioides posadasii (Arizona, Northern Mexico, Southern California, Texas, and parts of Central and South America) (Koufopanou et al. 1997; Fisher et al. 2001, 2002). This taxonomic splitting encountered resistance but has gained increasing acceptance in part due to concordant genetic evidence from MLST and evidence of differential thermotolerance between the species (BM Barker, C Wendel, JN Galgiani, and MJ Orbach, unpubl.).

A comparative genomics analysis of the first reference genome sequences for C. immitis and C. posadasii and related Ascomycota detected changes in gene family size, gene gain and loss, variation in rate of gene evolution, and nucleotide substitutions that alter protein sequence (Sharpton et al. 2009).
These analyses assessed macroevolutionary events, including a shift from plant to animal hosts in the ancestral Coccidioides lineage, that could be detected among a group of closely related species using a single representative genome from each species. For assessment of microevolutionary events occurring on a more recent timescale, analysis of multiple genomes from related species may be more informative. For example, a population genomic analysis of more than 70 domestic and wild isolates of Saccharomyces cerevisiae and Saccharomyces paradoxus demonstrated the power of this approach for understanding functional and geographic variation not evident from the initial analyses of reference genomes for those species (Liti et al. 2009). The falling cost of genome sequencing is quickly making a population genomic analysis approach accessible for a broad array of organisms, including Drosophila (Sackton et al. 2009), Arabidopsis (Clark et al. 2007), and humans (Richard Durbin and David Altshuler, http://www.1000genomes.org/).

Here, by increasing the number of sequenced Coccidioides genomes to 20, we extend our evolutionary analysis within species to provide insights into the population biology, evolution, life cycle, and virulence of Coccidioides. We describe important new findings including the discovery of ample but unequal genetic diversity within both taxa, a novel form of transposon control, and a clear signal of recent introgression between C. immitis and C. posadasii, suggesting that these sister species are not reproductively isolated despite having diverged 5 million years ago (Sharpton et al. 2009). The genome-wide mapping of population-level diversity in both species will assist in the prioritization of vaccine candidates and improve our understanding of gene flow within the genus.
