Abstract: | High-throughput reporter assays such as self-transcribing active regulatory region sequencing (STARR-seq) have made it possible to measure regulatory element activity across the entire human genome at once. The resulting data, however, present substantial analytical challenges. Here, we identify technical biases that explain most of the variance in STARR-seq data. We then develop a statistical model to correct those biases and to improve detection of regulatory elements. This approach substantially improves precision and recall over current methods, improves detection of both activating and repressive regulatory elements, and controls for false discoveries despite strong local correlations in signal.
Gene regulation is of foundational importance to nearly all biological processes, and variation in gene regulatory activity plays a major role in human disease risk (Lee and Young 2013; Parker et al. 2013; Finucane et al. 2015). A major step toward measuring regulatory activity across the human genome has been the development of high-throughput reporter assays such as STARR-seq (Arnold et al. 2013) that allow regulatory element activity to be quantified with high-throughput sequencing rather than with optical detection of a fluorescent or luminescent signal.
High-throughput reporter assays create substantial analytical challenges that are distinct from other sequencing-based genomic assays. There is significant local variation in high-throughput reporter assay signal. We show here that, across data from several laboratories, most of that variation can be explained by features of the underlying genomic sequence and experimental procedures rather than by regulatory element activity. For example, nucleotide composition can alter PCR efficiency, leading to under- and overrepresentation of some sequences. Meanwhile, highly repetitive sequences often do not align uniquely to the human reference genome, also biasing signal estimates.
Additional analytical challenges are that STARR-seq signals can be both positive and negative, reflecting activation and repression, and that the boundaries of regulatory elements are typically unknown and must therefore be estimated from the data. Those challenges together distort signal representations, hinder estimation of regulatory element activity, and cause false positives and false negatives when left unaddressed.
Key requirements of statistical methods for analyzing STARR-seq data are therefore the ability to identify and estimate the effect of both activating and repressing regulatory elements while also correcting for underlying sequence biases in high-throughput reporter assays. A statistical model was recently introduced that corrects technical biases and detects regulatory elements in STARR-seq data, but the model is limited to detecting only activating regulatory elements (Lee et al. 2020). Because repression is a crucial gene regulation mechanism (Courey and Jia 2001), overlooking repressive elements may limit understanding of gene regulation with STARR-seq. To overcome that challenge, our correcting reads and analysis of differentially active elements (CRADLE) model takes a two-step approach. First, CRADLE uses a generalized linear regression model to estimate and correct major biases that we have identified in STARR-seq data. Next, CRADLE detects regions with statistically significant regulatory activity from the bias-corrected signals while rigorously controlling FDR. In doing so, CRADLE substantially improves the use of STARR-seq by providing a robust estimation of regulatory activity and improved visualization of raw signals. |
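The two-step idea described above (regress out technical bias, then call significant regions under FDR control) can be illustrated with a deliberately simplified sketch. This is not the CRADLE implementation. It assumes a single hypothetical covariate (for example, a per-bin GC fraction), substitutes ordinary least squares for the generalized linear regression model, and uses the Benjamini-Hochberg procedure as a stand-in for FDR control, since the text above does not name the procedure. All function names and values are illustrative.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x (one covariate)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def correct_bias(signal, covariate):
    """Step 1 (simplified): subtract the signal predicted by the bias covariate.

    Residuals can be positive or negative, so both activation and
    repression remain visible after correction.
    """
    intercept, slope = fit_line(covariate, signal)
    return [s - (intercept + slope * c) for s, c in zip(signal, covariate)]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step 2 (assumed procedure): flag p-values rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Toy data: the signal here is driven entirely by the covariate,
# so the corrected signal should be close to zero everywhere.
gc_fraction = [0.3, 0.4, 0.5, 0.6, 0.7]
raw_signal = [2.0 * g + 1.0 for g in gc_fraction]
corrected = correct_bias(raw_signal, gc_fraction)

# Made-up per-region p-values, only to exercise the FDR step.
calls = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], alpha=0.05)
```

In a real analysis the regression would include several bias covariates at once, and the p-values would come from a test on the bias-corrected signal in each candidate region; both are outside the scope of this sketch.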