Data fission for post-clustering differential analysis using dearseq
Author(s): Benjamin Hivert, Denis Agniel, Rodolphe Thiébaut, Boris P Hejblum
Affiliation(s): Univ. Bordeaux, INSERM, INRIA, SISTM team, BPH, U1219, F-33000 Bordeaux, France
Differential expression analysis of gene expression data is crucial for describing the biological phenomena that discriminate between groups of samples at the gene level. Many statistical tests for differential analysis have been proposed. Recently, dearseq, a variance component score test, was developed to provide better control of the False Discovery Rate in large-sample studies than state-of-the-art methods for differential analysis. Differential analysis tools are traditionally used to compare groups or conditions known a priori. However, in some exploratory analyses, groups of samples are identified from the data themselves using clustering algorithms. Performing differential analysis between those data-driven clusters violates the traditional inference setting, which assumes that hypotheses (i.e. groups) are known a priori. Because the same data are used both to define the groups and to test for differences between them, this two-step process can lead to false discoveries, even for well-calibrated tests. We implemented data fission, a new approach for post-clustering differential analysis, inside the dearseq package. By separating the information contained in each sample into two datasets, data fission allows the clustering step and the differential analysis to be performed on two independent datasets. Data fission thus preserves all the known properties of dearseq for post-clustering differential analysis, such as efficient control of the False Discovery Rate. We illustrate how the data fission implemented in dearseq helps to answer biological questions using a real log2-cpm normalized RNA-seq dataset of 54 patients from a COVID-19 study, in which clustering identified different clusters linked to COVID-19 severity.
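The splitting step described above can be illustrated with a minimal sketch. It assumes Gaussian observations with a known standard deviation and uses the standard Gaussian decomposition from the data fission literature: each value x is split into x + tau*z and x - z/tau, with z drawn from N(0, sigma^2). The names gaussian_fission, pearson, sigma and tau are hypothetical and chosen for illustration; this is not the dearseq interface, and in practice the variance would have to be estimated.

```python
import random


def gaussian_fission(values, sigma, tau=1.0, seed=None):
    """Split each Gaussian observation x into two parts.

    Assumes x ~ N(mu, sigma^2) with sigma known. With z ~ N(0, sigma^2),
    the parts x + tau*z and x - z/tau are uncorrelated, hence independent
    under the Gaussian assumption.
    """
    rng = random.Random(seed)
    part_cluster, part_test = [], []
    for x in values:
        z = rng.gauss(0.0, sigma)
        part_cluster.append(x + tau * z)   # would be used for clustering
        part_test.append(x - z / tau)      # would be used for testing
    return part_cluster, part_test


def pearson(a, b):
    """Empirical Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


# Toy check on simulated values (not real expression data).
sigma = 2.0
data_rng = random.Random(0)
x = [data_rng.gauss(5.0, sigma) for _ in range(20000)]
x_cluster, x_test = gaussian_fission(x, sigma=sigma, tau=1.0, seed=1)

# The empirical correlation between the two parts should be close to zero.
print(round(pearson(x_cluster, x_test), 3))
```

In this sketch, tau sets how the information is shared: a larger tau adds more noise to the clustering part and leaves more information in the testing part. The original value can always be recovered as (x_cluster + tau^2 * x_test) / (1 + tau^2), so no information is lost by the split.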