DifferentialRegulation: a novel approach to identify differentially regulated genes
Author(s): Simone Tiberi, Joel Meili, Charlotte Soneson, Dongze He, Hirak Sarkar, Robert Patro, Mark Robinson
Affiliation(s): Department of Statistical Sciences, University of Bologna, Bologna, Italy
Social media: https://twitter.com/tiberi_simone
Background
Technological developments have led to an explosion of high-throughput data, which reveal unprecedented perspectives on cell identity. Recently, significant attention has focused on studying dynamic cellular processes, such as cell differentiation, cell (de)activation, and gene regulation.

Aim and impact
We introduce DifferentialRegulation, a novel approach to investigate gene regulation from bulk and single-cell RNA-seq data. DifferentialRegulation performs differential regulation analyses between experimental conditions (e.g., healthy vs. disease or treated vs. untreated) by discovering differences in the balance (i.e., relative abundance) of spliced and unspliced mRNA. Intuitively, a higher proportion of unspliced (spliced) mRNA in a condition suggests that a gene is currently being up- (down-) regulated compared to the other condition. On single-cell data, our method targets cell-type-specific changes in regulation (at the gene level); conversely, on bulk data, it identifies changes in individual transcripts within a gene (across all cells). DifferentialRegulation enables scientists to deepen the current understanding of gene regulation; for instance, it has been shown that c-Myc-regulated genes are subject to up-regulation in cancer. Our method is well suited to investigating such a scenario and identifying the specific genes and transcripts that are differentially regulated.

Methodology
The abundance of spliced and unspliced mRNA reads can be inferred with pseudo-aligners, such as salmon, kallisto, alevin, and kallisto-bustools. However, many reads are compatible with i) multiple genes and transcripts, or ii) both spliced and unspliced versions. Therefore, estimated spliced and unspliced counts carry a significant degree of uncertainty, which should be accounted for.
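To make the mapping ambiguity concrete, the following minimal Python sketch uses made-up read counts for a single gene in a single sample and computes the range of unspliced fractions that the data cannot rule out. The counts, the labels "S" and "U", and the function name are illustrative assumptions; none of them come from the DifferentialRegulation package.

```python
from fractions import Fraction

# Hypothetical read counts per equivalence class for one gene in one sample.
# "S" = spliced version, "U" = unspliced version of the same gene.
equivalence_class_counts = {
    ("S",): 40,      # reads compatible only with the spliced version
    ("U",): 20,      # reads compatible only with the unspliced version
    ("S", "U"): 40,  # ambiguous reads, compatible with both versions
}


def unspliced_fraction_bounds(counts):
    """Return the smallest and largest unspliced fraction consistent with the counts."""
    total = sum(counts.values())
    only_unspliced = counts.get(("U",), 0)
    ambiguous = counts.get(("S", "U"), 0)
    # Lower bound: every ambiguous read is attributed to the spliced version.
    lower = Fraction(only_unspliced, total)
    # Upper bound: every ambiguous read is attributed to the unspliced version.
    upper = Fraction(only_unspliced + ambiguous, total)
    return lower, upper


lower, upper = unspliced_fraction_bounds(equivalence_class_counts)
print(f"unspliced fraction lies between {float(lower):.2f} and {float(upper):.2f}")
# prints: unspliced fraction lies between 0.20 and 0.60
```

With 40% of the reads ambiguous, a single point estimate of the unspliced fraction hides a wide range of compatible values, which is why the uncertainty needs to be modelled rather than ignored.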
Here, we propose a Bayesian hierarchical model that takes as input equivalence classes of reads, i.e., the list of genes and transcripts (and their respective spliced/unspliced versions) each read is compatible with. Our method treats gene and transcript allocations of reads as latent states. Furthermore, to account for the variability between biological replicates, we embed multiple samples in a hierarchical model, which enables sharing of information across replicates while allowing for sample-specific parameters. Additionally, sharing of information across genes is performed via a (mild) empirical Bayes approach. The posterior distributions of the parameters are inferred via MCMC techniques, where model parameters and latent states are alternately sampled. Overall, our method explicitly models two major sources of variability: i) the sample-to-sample variability between biological replicates, and ii) the mapping uncertainty. From a computational perspective, although it relies on MCMC algorithms, our method is coded in C++ and runs efficiently, completing an analysis (~100k independent MCMC runs) in less than 1 hour on a laptop.

Benchmarking
We performed extensive benchmarks of our method and several competitors; in particular, starting from real data as anchor data, we designed realistic simulations for bulk and single-cell RNA-seq data. Overall, our tool displays significantly higher sensitivity and specificity than the competing methods. We also show that false positive rates are well calibrated in null (simulated and real) data.

Availability
DifferentialRegulation is available as a Bioconductor R package: https://bioconductor.org/packages/DifferentialRegulation
The bulk implementation will be uploaded in the coming weeks. A pre-print should follow in spring.
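As an illustration of the alternating sampling of latent states and parameters described in the Methodology, the toy Python sketch below runs a data-augmentation (Gibbs) sampler for the unspliced proportion of one gene in one sample. It is not the model implemented in DifferentialRegulation: it has no hierarchical structure across samples and no empirical Bayes step, and it assumes a Beta(1, 1) prior and that an ambiguous read is equally likely under the spliced and unspliced versions. The function name and the read counts are hypothetical.

```python
import random


def gibbs_unspliced_proportion(n_spliced, n_unspliced, n_ambiguous,
                               n_iter=5000, burn_in=500, seed=1):
    """Toy data-augmentation sampler for the unspliced proportion of one gene.

    Assumes a Beta(1, 1) prior and that an ambiguous read is equally likely
    under the spliced and unspliced versions (simplifying assumptions).
    """
    rng = random.Random(seed)
    pi_u = 0.5  # initial value of the unspliced proportion
    draws = []
    for iteration in range(n_iter):
        # Step 1: sample the latent allocation of each ambiguous read given pi_u.
        z_unspliced = sum(rng.random() < pi_u for _ in range(n_ambiguous))
        # Step 2: sample pi_u given the completed (allocated) counts.
        pi_u = rng.betavariate(1 + n_unspliced + z_unspliced,
                               1 + n_spliced + n_ambiguous - z_unspliced)
        if iteration >= burn_in:
            draws.append(pi_u)
    return draws


# Hypothetical counts: 40 spliced-only, 20 unspliced-only, 40 ambiguous reads.
draws = gibbs_unspliced_proportion(n_spliced=40, n_unspliced=20, n_ambiguous=40)
posterior_mean = sum(draws) / len(draws)
print(f"posterior mean of the unspliced proportion: {posterior_mean:.3f}")
```

Under these simplifying assumptions the ambiguous reads carry no information about the proportion, so the exact posterior is Beta(21, 41) and the sampled mean should be close to 21/62 (about 0.339), which gives a simple sanity check. In a more realistic setting, ambiguous reads would have different likelihoods under each version, and the allocation step would weight them accordingly.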