Accurate typing of short tandem repeats from genome-wide sequencing data and its applications

Authors:	Arkarachai Fungtammasan, Guruprasad Ananda, Suzanne E. Hile, Marcia Shu-Wei Su, Chen Sun, Robert Harris, Paul Medvedev, Kristin Eckert, Kateryna D. Makova

Abstract:	Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM, short tandem repeat profiling using flank-based mapping, a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and genomes of organelles). We used STR-FM to study STR error rates and patterns in publicly available human and in-house generated ultradeep plasmid sequencing data sets. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Utilizing this pipeline, for the first time we determined the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCR-free protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping, and thus is of wide interest to researchers investigating disease-related STRs and STR evolution.

Short tandem repeats (STRs) of 1–6 base pairs per motif constitute ∼3% of the human genome (Lander 2001). Due to the high incidence of polymerase slippage at STRs (Levinson and Gutman 1987; Abdulovic et al. 2011; Baptiste and Eckert 2012), these repeats have elevated germline mutation and polymorphism rates. After a certain threshold length, STRs are termed microsatellites (Kelkar et al. 2010; Ananda et al. 2013). The high level of polymorphism makes microsatellites attractive markers for population and conservation genetics studies (Jarne and Lagoda 1996; Sunnucks 2000; Wan et al. 2004; Kim and Sappington 2013) and for identifying individuals in forensics (Hagelberg et al. 1991; Chambers et al. 2014). Many STRs are involved in gene regulation and protein function (Li et al. 2004), with ∼17% of human genes containing STRs in their open reading frames (Gemayel et al. 2010). Although long microsatellites have attracted much attention, length alterations even within relatively short repeat tracts are sometimes associated with disease (Li et al. 2004). For instance, differences in the number of repeats at the (TG)₁₀₋₁₃(T)₅₋₉ STR located within the splicing branch/acceptor site of the CFTR gene (exon 9) can affect in-frame exon skipping and, as a result, can influence the severity of cystic fibrosis (Cuppens et al. 1990; Chu et al. 1991). The purity of STRs (the degree to which the perfect STR sequence remains uninterrupted) also has a functional effect. Interrupted STRs have lower mutation rates (Ananda et al. 2014), and this can diminish disease risk. For instance, ∼6% of Ashkenazi Jews have a T to A mutation in the APC gene (encoding for a tumor suppressor) that alters an interrupted STR (A)₃T(A)₄ into a perfect (A)₈ (Laken et al. 1997). This increases the probability of somatic frameshift mutation within the STR, leading to APC protein inactivation. As a result, Ashkenazi Jews have a higher colorectal cancer risk (Gryfe et al. 1999).
Since even small changes in STR length and purity can have functional effects, accurate STR profiling is crucial.

Despite the importance of STRs in evolution and disease, their accurate genotyping from next generation sequencing (NGS) data has been challenging (for review, see Treangen and Salzberg 2012). Sequencing library construction frequently includes polymerase chain reaction (PCR) steps during which a polymerase might undergo slippage at STRs, leading to amplicons that differ in length due to expansion and contraction of repeat units (Ellegren 2004; Wang et al. 2011). Additionally, base calling by NGS instruments at repetitive regions is frequently imprecise. These factors result in high sequencing errors at homopolymer runs produced by the 454 (Roche) and Illumina instruments (Balzer et al. 2010; Albers et al. 2011).

From a bioinformatics perspective, if STR-containing reads are mapped in their entirety, some reads cannot be mapped because of high mismatch/indel penalties associated with STR lengths different than those at the corresponding positions in the reference genome. This obscures accurate estimation of allele frequency and underestimates the real level of STR variation in the genome. To alleviate this problem, a short-read alignment approach using nonrepetitive flanks of STR-containing reads has been proposed recently (lobSTR) (Gymrek et al. 2012). This tool has fast running time and takes into account PCR stutter noise during the genotyping stage. However, the entropy scanning implemented by lobSTR to detect STRs has low sensitivity for mononucleotide STRs and short STRs (<25 bp), which constitute a large proportion of STRs in the genome. Additionally, the allele frequency at STRs for genetically heterogeneous samples, for which a simple 1:1 ratio in allele frequency present in heterozygous diploids is not expected (e.g., for tumors, viral populations, and organelles), cannot be determined.
Furthermore, lobSTR uses a fixed (embedded in the program) mapping algorithm. Novel short-read mapping and STR detection algorithms (Pellegrini et al. 2010; Lim et al. 2013) are constantly being developed; an STR-profiling tool that can be customized to incorporate emerging mapping algorithms is needed.

The recently released PCR-free Illumina library preparation protocol (hereafter called “PCR−”) is expected to improve STR genotyping accuracy. The direct advantage of limiting PCR steps during NGS is the increased uniformity of the sequencing depth (Kozarewa et al. 2009). Also, this protocol eliminates duplicate reads that obscure allele frequency profiling for heterogeneous genetic samples. Importantly, the degree to which the accuracy of calling STR alleles is improved using the PCR-free protocol has not been evaluated previously. Moreover, massive amounts of data have already been generated by the NGS technology with the PCR-containing library preparation protocol (hereafter called “PCR+”), and some such data cannot be regenerated due to the scarcity of samples and/or time and cost constraints. Therefore, universal methods are urgently needed that can evaluate and correct STR errors generated by NGS technology (both PCR− and PCR+) and accommodate evolving protocols and sequencing techniques.

Some efforts have been made to evaluate errors generated by NGS at STRs. For instance, errors at STRs sequenced with the PCR+ protocol vary with repeat number and motif size (Luo et al. 2012). However, an explicit quantitation of various sources of STR-related sequencing errors has been lacking, which hinders an unambiguous estimation of STR mutational properties. Indeed, as both mutation and sequencing error rates increase with STR length (Kelkar et al. 2008; Luo et al. 2012; Highnam et al. 2013), one cannot confidently decipher mutation rates without accounting for sequencing error rates.
Recently, a tool to guide genotyping of STRs using informed error profiles from inbred Drosophila lines (RepeatSeq) has been released (Highnam et al. 2013). This tool utilizes reads mapped by other programs, such as BWA (Li and Durbin 2009) and Bowtie (Langmead et al. 2009), and predicts the most probable genotype at a locus based on STR motif, length, and base quality. However, RepeatSeq uses the whole-read mapping approach, which introduces a bias toward the STR length in the reference genome (Gymrek et al. 2012) and thus might obscure the true STR variation spectrum. Such biases can be accounted for by an error correction model based on the STR flank-based method.

To profile the full spectrum of STR lengths in the human and other genomes, and to correct for NGS-associated STR errors, we developed STR-FM (short tandem repeat profiling using a flank-based mapping approach), a flexible pipeline for detecting and genotyping STRs from short-read sequencing data. Our pipeline can detect STRs of any length, including short ones (as short as only two repeats), includes an error-correcting module, and can incorporate any NGS mapping algorithm with paired-end mapping capability, making it adaptable to new mapping methods as they become available. Applying this pipeline, we asked the following questions. First, what are the rates and patterns of sequencing errors associated with STRs of different motif sizes (mono-, di-, tri-, and tetranucleotides), motif compositions, and repeat numbers? These were contrasted between publicly available genome-wide data sets sequenced with PCR+ and PCR− protocols and validated with in-house generated, ultradeep sequencing of plasmids harboring individual STR sequences. Second, do technical errors have different patterns from true STR mutations? Third, based on the detailed knowledge of the error profiles, what is the minimum sequencing depth required for producing reliable STR genotypes for PCR+ and PCR− protocols?
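
The third question can be made concrete with a toy calculation. The sketch below is not the authors' error model or depth-recommendation tool. It assumes a homozygous locus, statistically independent reads, a hypothetical per-read probability `p_correct` of reporting the true repeat number, and a simple strict-majority call; the function names are illustrative only. Under those assumptions it searches for the smallest depth at which the majority call is right with a chosen probability.

```python
from math import comb
from typing import Optional


def majority_correct_prob(depth: int, p_correct: float) -> float:
    """Probability that strictly more than half of `depth` reads are correct.

    Toy assumption: each read independently reports the true repeat
    number with probability `p_correct` (a hypothetical value, not a
    measured error rate).
    """
    need = depth // 2 + 1  # smallest strict majority
    return sum(
        comb(depth, k) * p_correct**k * (1.0 - p_correct) ** (depth - k)
        for k in range(need, depth + 1)
    )


def min_depth(p_correct: float, target: float = 0.99,
              max_depth: int = 1000) -> Optional[int]:
    """Smallest depth whose majority call is correct with probability >= target."""
    for depth in range(1, max_depth + 1):
        if majority_correct_prob(depth, p_correct) >= target:
            return depth
    return None  # target not reachable within max_depth


if __name__ == "__main__":
    # Hypothetical per-read accuracies; lower values stand in for
    # longer STRs or noisier library protocols.
    for p in (0.95, 0.9, 0.8):
        print(p, min_depth(p))
```

In this toy model `min_depth(0.9)` returns 5, and lowering `p_correct` raises the required depth, which mirrors the qualitative statement in the abstract that required depth grows with STR length and is lower for a PCR-free protocol.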
As a result, we provide the scientific community with STR-FM, a reproducible and versatile pipeline for genotyping STRs that incorporates an error correction model. To illustrate the utility of STR-FM, we applied it to the completely sequenced human genomes from the Platinum Genomes Project (Ajay et al. 2011) and determined human genome-wide germline mutation rates at STRs.
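
The flank-based idea described above can be illustrated with a minimal sketch. This is not STR-FM's actual implementation: the names `split_read_at_str`, `FlankedSTR`, and `allele_spectrum`, the regular expression, and the `min_flank` threshold are all assumptions made for illustration, and only perfect (uninterrupted) repeats are detected. The sketch finds an STR inside a read, keeps the nonrepetitive flanks on both sides (the parts a mapper would align), and reads the repeat number from the read itself rather than from the reference. Mapping the flanks is left out.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class FlankedSTR(NamedTuple):
    left_flank: str
    motif: str
    repeats: int
    right_flank: str


def split_read_at_str(read: str, min_repeats: int = 3, max_motif: int = 4,
                      min_flank: int = 5) -> Optional[FlankedSTR]:
    """Return the longest perfect STR in `read` that has both flanks, or None."""
    # Lazy motif match so that AAAA is reported as (A)4 rather than (AA)2.
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    best = None
    for m in pattern.finditer(read):
        left, right = read[:m.start()], read[m.end():]
        if len(left) < min_flank or len(right) < min_flank:
            continue  # STR not fully spanned by the read; length unknown
        if best is None or len(m.group(0)) > len(best.group(0)):
            best = m
    if best is None:
        return None
    motif = best.group(1)
    return FlankedSTR(read[:best.start()], motif,
                      len(best.group(0)) // len(motif), read[best.end():])


def allele_spectrum(reads: Iterable[str]) -> Counter:
    """Tally repeat numbers observed across reads (no 1:1 ratio assumed)."""
    counts = Counter()
    for read in reads:
        hit = split_read_at_str(read)
        if hit is not None:
            counts[(hit.motif, hit.repeats)] += 1
    return counts


if __name__ == "__main__":
    left, right = "ACGTACCGGA", "CCATTGACGA"
    reads = [left + "TG" * 11 + right, left + "TG" * 12 + right]
    print(allele_spectrum(reads))
```

Because the tally is a plain count per repeat number, it makes no assumption about the expected allele ratio, which is the property that matters for heterogeneous samples such as tumors or viral populations.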
