Juggling with offsets unlocks bulk RNA-seq tools for fast and scalable differential usage and aberrant splicing analyses
Author(s): Alexandre Segers, Jeroen Gilis, Mattias Van Heetvelde, Elfride De Baere, Lieven Clement
Affiliation(s): Ghent
Millions of patients suffer from rare Mendelian diseases. Whole exome sequencing (WES) and whole genome sequencing (WGS) currently achieve a diagnostic rate of 15-75% for their pathogenic variants [1]. There is growing evidence that the diagnostic rate can be improved further by discovering mutations in intronic and other non-coding regions that contribute to disease by disrupting transcriptional regulation. WGS is therefore increasingly complemented with RNA-seq profiling, which boosts the diagnostic rate by identifying aberrant expression (AE), aberrant splicing (AS) or mono-allelic expression [2]. AE and AS discovery, however, is not possible with default bulk RNA-seq workflows: testing for differential expression by comparing each sample against the rest of the cohort is statistically invalid. In this respect, OUTRIDER [3] and FRASER [4] have disrupted the field by providing formal count-based outlier tests that pick up AE and AS, respectively, while automatically controlling for latent confounders. However, both methods are slow, and FRASER discards much information because it uses only junction reads. We argue that conventional bulk methods can be unlocked for AE and AS discovery: they can estimate the mean and dispersion of the negative binomial distribution, which we then use in count-based outlier tests. The ASpli tool [5] is the starting point for our AS workflows. ASpli integrates exon and intron bin reads with junction reads, which gives higher power to identify AS, novel intron retention or splice sites than junction reads alone. However, its parameter estimation relies on edgeR’s diffSpliceDGE, which performs worse than DEXSeq, while DEXSeq itself scales poorly to the large cohorts in Mendelian disease studies. In this contribution, we show how juggling with offsets can effectively unlock conventional bulk RNA-seq workflows for fast and scalable differential usage (DU) and AS analyses.
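To make the idea of a count-based outlier test concrete, the following minimal Python sketch computes a two-sided negative binomial p-value for one observed count, given an estimated mean and dispersion. It is an illustration only, not the authors' implementation: the function names `nb_pmf` and `nb_outlier_pvalue` are hypothetical, the parameterisation var = mu + phi * mu^2 is assumed, and the "twice the smaller tail" rule is one common two-sided construction chosen here for simplicity.

```python
import math


def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu > 0 and dispersion phi > 0.

    Assumed parameterisation: var = mu + phi * mu**2.
    """
    size = 1.0 / phi            # "size" parameter of the negative binomial
    p = size / (size + mu)      # success probability implied by mean and size
    log_pmf = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def nb_outlier_pvalue(k, mu, phi):
    """Two-sided p-value: twice the smaller tail probability, capped at 1."""
    lower = sum(nb_pmf(i, mu, phi) for i in range(k + 1))   # P(X <= k)
    upper = 1.0 - lower + nb_pmf(k, mu, phi)                # P(X >= k)
    return min(1.0, 2.0 * min(lower, upper))


# Made-up numbers: estimated mean 100 and dispersion 0.1 for one gene.
p_typical = nb_outlier_pvalue(100, mu=100.0, phi=0.1)   # count close to the mean
p_low = nb_outlier_pvalue(0, mu=100.0, phi=0.1)         # no reads at all
p_high = nb_outlier_pvalue(400, mu=100.0, phi=0.1)      # far above the mean
```

In such a sketch, the mean and dispersion would come from a bulk RNA-seq model fit across the cohort, and the resulting p-values would still need multiple-testing correction.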
By replacing the conventional library-size offset in DESeq2 or edgeR transcript or exon analyses with the log of the total gene count, the parameters of the mean model directly estimate the average transcript or exon usage, respectively. We further develop workflows that combine different ASpli counts with the appropriate offsets to infer aberrant junction usage and intron retention. We also provide an unbiased and fast parameter estimation procedure for assessing AE and AS that scales better to the large number of covariates included in Mendelian disease studies. In simulation studies and real case studies, we show that our workflows vastly outperform the existing state-of-the-art tools DEXSeq, OUTRIDER and FRASER in terms of computational speed and scalability. They also dramatically boost the performance for aberrant splicing (cf. FRASER) while maintaining a similar performance for differential usage (cf. DEXSeq) and aberrant expression detection (cf. OUTRIDER).

[1] Turro, Ernest et al. “Whole-genome sequencing of patients with rare diseases in a national health system.” Nature vol. 583,7814 (2020): 96-102. doi:10.1038/s41586-020-2434-2
[2] Cummings, Beryl B et al. “Improving genetic diagnosis in Mendelian disease with transcriptome sequencing.” Science translational medicine vol. 9,386 (2017): eaal5209. doi:10.1126/scitranslmed.aal5209
[3] Brechtmann, Felix et al. “OUTRIDER: A Statistical Method for Detecting Aberrantly Expressed Genes in RNA Sequencing Data.” American journal of human genetics vol. 103,6 (2018): 907-917. doi:10.1016/j.ajhg.2018.10.025
[4] Mertes, Christian et al. “Detection of aberrant splicing events in RNA-seq data using FRASER.” Nature communications vol. 12,1 (2021): 529. doi:10.1038/s41467-020-20573-7
[5] Mancini, Estefania et al. “ASpli: Integrative analysis of splicing landscapes through RNA-Seq assays.” Bioinformatics (Oxford, England) (2021): btab141. doi:10.1093/bioinformatics/btab141
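The offset substitution described in the abstract can be illustrated with a small Python sketch. It fits an intercept-only count model with a log link and offset log(gene total), so that the exponentiated intercept is the average exon usage. Several things here are assumptions made for brevity: a Poisson likelihood stands in for the negative binomial models that DESeq2 and edgeR actually fit, the function name `fit_log_usage` is hypothetical, and the counts are made up.

```python
import math


def fit_log_usage(bin_counts, gene_totals, n_iter=50):
    """Intercept-only Poisson GLM with log link and offset log(gene total).

    Model: log(mu_i) = log(gene_total_i) + beta, so exp(beta) is the usage.
    Assumes all gene totals are positive and at least one bin count is positive.
    """
    offsets = [math.log(t) for t in gene_totals]
    beta = 0.0
    for _ in range(n_iter):
        mu = [math.exp(o + beta) for o in offsets]
        score = sum(y - m for y, m in zip(bin_counts, mu))   # gradient of log-likelihood
        info = sum(mu)                                       # Fisher information
        beta += score / info                                 # Newton (Fisher scoring) step
    return beta


# Made-up exon counts and total gene counts for four samples.
exon_counts = [20, 45, 30, 55]
gene_totals = [100, 200, 150, 300]

beta_hat = fit_log_usage(exon_counts, gene_totals)
usage_hat = math.exp(beta_hat)   # Poisson MLE: sum(exon_counts) / sum(gene_totals)
```

With a library-size offset, the same intercept would instead describe expression relative to sequencing depth; swapping in the gene total is what turns it into a usage parameter.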