ScalablePCA: Benchmarking principal component analysis for large-scale single-cell RNA-sequencing data
Author(s): Ilaria Billato,Chiara Romualdi,Gabriele Sales,Davide Risso
Affiliation(s): Department of Biology, University of Padova
With the advances in sequencing technology, the size and complexity of single-cell RNA-seq data are increasing, to the point that standard workflows are becoming too computationally demanding. In fact, datasets with millions of cells are becoming routine and they require workflows that operate out-of-memory. However, existing software tools for single cells do not scale well to such large datasets. As an exemplary step, we consider dimensionality reduction, which is often one of the first steps of a typical analysis. Popular algorithms, such as principal component analysis (PCA), are based on singular value decomposition (SVD). The classic implementations of SVD can be slow with large data matrices. Furthermore, they require the data to be loaded entirely into memory and therefore can be impossible to run with large datasets. To address this problem, the use of more efficient algorithms and out-of-memory data representations, such as HDF5, become essential for the analysis. Recently, the DelayedArray Bioconductor package implemented a new framework that makes it easy to apply (delayed) operations to matrix-like objects in the R language. Several new packages, such as HDF5Array, have been implemented to interface DelayedArray with these particular data formats. While Bioconductor provides the core package infrastructure to deal with large out-of-memory matrices, it is still unclear what are the benefits and challenges to apply such approaches to single-cell RNA-seq data. Recently, Hicks et al. have proposed a new clustering algorithm that can group millions of cells in minutes. In their workflow, the bottleneck was PCA, which took 96 hours to compute the top 50 principal components of data stored in a HDF5 file, using the IRLBA algorithm. Hence, an efficient implementation of SVD becomes paramount for the scalable analysis of single-cell data. In this work, we compare SVD algorithms, applied to the computation of the first 50 principal components of single-cell RNA-seq data. We compare IRLBA with the randomized SVD algorithms proposed in Halko et al., as implemented in the R/Bioconductor BiocSingular, Python Scanpy and ScikitLearn. We tested the above methods on a real single-cell RNA-seq dataset from 10X Genomics that contains approximately 1.3 million cells and 30,000 genes isolated from the mouse brain. We also created downsampled datasets (of sizes 100k, 500k, and 1M) to check the scalability of the methods (combination of input and algorithm) both in terms of time and memory consumption. We found that randomized PCA, as implemented in the BiocSingular package, was the most memory efficient which takes 7.05 GB of RAM to compute the top 50 principal components, using a single core. Instead, the faster one was IRLBA PCA, as implemented in the BiocSingular package, which takes 7.48 minutes to compute the top 50 principal components, using a single core. In this work, we want to write a guideline to find out the best trade-off regarding time and memory consumption. Moreover, observing how computational times and costs change based on the use of the GPU and CPU is interesting. This is the object of ongoing and future work.