Differential detection workflows for multi-patient single-cell RNA-seq data

Author(s): Jeroen Gilis, Laura Perin, Milan Malfait, Koen Van den Berge, Bie Verbist, Davide Risso, Lieven Clement

Affiliation(s): Ghent University

Social media: https://twitter.com/GilisJeroen

Single-cell RNA-sequencing (scRNA-seq) has improved our understanding of complex biological processes by elucidating cell-level heterogeneity in gene expression. One of the key tasks in the downstream analysis of scRNA-seq data is studying differential gene expression (DGE). Traditional DGE analyses aim to identify genes for which the average expression differs between biological groups of interest, e.g., between cell types or between diseased and healthy cells. These traditional DGE analyses only allow for assessing one aspect of the gene expression distribution: the mean. However, in scRNA-seq data, other differences between count distributions can be observed, such as differences in the number of modes and differential variability. This has recently prompted the development of a variety of frameworks that allow for comparing multiple aspects of the expression distribution (1–3).

One particularly interesting distributional characteristic of gene expression that is not explicitly captured by the aforementioned frameworks is the fraction of cells in a group in which the gene is detected. It has been reported repeatedly that gene expression profiles may exhibit characteristic bimodal expression patterns, in which the expression of otherwise abundant genes is either strongly positive or undetected within individual cells (4). Undetected genes or zero counts may arise from technical artefacts or the stochastic nature of gene expression, but they can also reflect actual biological differences between samples. In this context, Qiu (5) has demonstrated that binarising counts allows for obtaining an expression profile that still accurately reflects biological variation. This was confirmed by the work of Bouland et al. (6), which showed that the frequencies of zero counts suffice for capturing biological variability and even claimed that a binarised representation of the single-cell expression data allows for a more robust description of the relative abundance of transcripts than counts.

In this work, we show the potential of differential detection (DD) strategies for scRNA-seq data analysis. First, we benchmark several DD strategies: we start with a simple logistic regression model on the binarised scRNA-seq expression matrix, and gradually increase the model complexity to account for overdispersion and allow for model-based normalisation. In the context of multi-patient datasets, we additionally assess the effect of pseudobulking on model performance and type 1 error control. Second, we combine results from our differential detection tests and a traditional DGE analysis on the same data. Using the two-stage testing paradigm from Van den Berge et al. (7), we identify differential genes by using an omnibus test for differential detection and differential expression (DE) in the first stage. In the second stage, we perform post-hoc tests on the differential genes from stage one to unravel whether they are DD, DE or both. The two-stage approach increases statistical power and provides better type 1 error control. Finally, we show the added value of our two-stage test for DD and DE on a large multi-patient case study, where we identify genes relevant to the biological system that would be missed when only using a traditional DGE test.

1. Korthauer, K. D. et al. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol. 17, 222 (2016).
2. Zhang, M. et al. IDEAS: individual level differential expression analysis for single-cell RNA-seq data. Genome Biol. 23, 33 (2022).
3. Tiberi, S., Crowell, H. L., Samartsidis, P., Weber, L. M. & Robinson, M. D. distinct: a novel approach to differential distribution analyses. Ann. Appl. Stat. (2023) doi:10.1101/2020.11.24.394213.
4. Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).
5. Qiu, P. Embracing the dropouts in single-cell RNA-seq analysis. Nat. Commun. 11, 1169 (2020).
6. Bouland, G. A., Mahfouz, A. & Reinders, M. J. T. Differential analysis of binarized single-cell RNA sequencing data captures biological variation. NAR Genomics Bioinforma. 3, lqab118 (2021).
7. Van den Berge, K., Soneson, C., Robinson, M. D. & Clement, L. stageR: a general stage-wise method for controlling the gene-level false discovery rate in differential expression and differential transcript usage. Genome Biol. 18, 151 (2017).
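
As a minimal illustration of the binarisation and pseudobulking steps mentioned in the abstract, the sketch below turns a toy count matrix into per-patient detection counts. It is not the authors' implementation: the function name `pseudobulk_detection`, the toy counts and the patient labels are all hypothetical, and the only operations assumed are "count > 0" binarisation and summing over the cells of each patient. The resulting counts and cell totals could, for instance, serve as the successes and trials of a binomial-type regression.

```python
import numpy as np

def pseudobulk_detection(counts, patient_ids):
    """Binarise a genes x cells count matrix and aggregate per patient.

    Hypothetical helper for illustration only. Returns:
      detected: genes x patients, number of cells in which each gene is detected
      n_cells:  number of cells per patient
      patients: sorted unique patient labels (column order of `detected`)
    """
    counts = np.asarray(counts)
    patient_ids = np.asarray(patient_ids)
    # A gene counts as "detected" in a cell when its count is non-zero.
    binary = (counts > 0).astype(int)
    patients = np.unique(patient_ids)
    # Sum the binary indicators over the cells belonging to each patient.
    detected = np.stack(
        [binary[:, patient_ids == p].sum(axis=1) for p in patients], axis=1
    )
    n_cells = np.array([(patient_ids == p).sum() for p in patients])
    return detected, n_cells, patients

# Toy data (made up): 2 genes x 6 cells, 3 cells from each of 2 patients.
counts = np.array([
    [0, 3, 0, 1, 2, 5],
    [0, 0, 0, 4, 0, 0],
])
patient_ids = ["p1", "p1", "p1", "p2", "p2", "p2"]

detected, n_cells, patients = pseudobulk_detection(counts, patient_ids)
# Fraction of cells per patient in which each gene is detected.
detection_fraction = detected / n_cells
print(detected)
print(detection_fraction)
```

Working with per-patient detection counts rather than individual cells keeps the patient as the unit of replication, which is the motivation usually given for pseudobulking in multi-patient designs.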
