demuxSNP: supervised demultiplexing of scRNAseq data using cell hashing and SNPs
Author(s): Michael P Lynch, Laurent Gatto, Aedin C Culhane
Affiliation(s): University of Limerick
Sequencing at single-cell resolution allows an unprecedented understanding of biologically relevant differences between individual cells compared to previous bulk methods. The cost of sequencing has dropped considerably in recent years. Multiplexing, that is, the loading of multiple biological samples into each sequencing lane, is widely used to further reduce costs. The resulting sequencing reads must then be demultiplexed, or computationally separated back into their original groups. Both experimental and computational methods have been proposed to facilitate demultiplexing. We present our approach and its corresponding R package ‘demuxSNP’, which overcomes current challenges in demultiplexing scRNAseq reads and can be applied to genetically distinct biological samples. Demultiplexing is usually done either through cell hashing (tagging) or by exploiting genetic differences between biological samples using single nucleotide polymorphisms (SNPs). Tagging methods work by experimentally barcoding the cells in each biological sample with a different HTO (hashtag oligonucleotide) or LMO (lipid modified oligonucleotide) tag prior to sequencing. These tags are then sequenced to form a counts matrix where rows are barcode tags and columns are cells. The counts of a given tag form a bimodal distribution, and the signal needs to be distinguished from the background caused by non-specific binding. The performance of tag-based demultiplexing algorithms is sensitive to tagging quality: poor tagging leads to greater overlap between the signal and noise distributions, a lower signal-to-noise ratio and other artifacts. A second class of methods uses SNP variation between biological samples to perform computational demultiplexing, either with genotype information (Demuxlet) or without it (Vireo, Souporcell). SNP-based methods require no cell tagging, but they do require genetically distinct samples and sufficient sequencing depth, carry a higher computational cost, and perform worse in the presence of high levels of ambient RNA.
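As an illustration of the tag-based calling described above, the following is a minimal sketch in Python. It is not taken from demuxSNP or from any published tagging algorithm. It assumes a toy counts matrix (rows are tags, columns are cells), and the function names `two_means_threshold` and `call_cells` are hypothetical. Each tag's bimodal counts are split into background and signal by a simple one-dimensional two-means pass, and a cell is called a singlet, doublet or negative according to how many tags exceed their threshold.

```python
def two_means_threshold(counts, iterations=20):
    """Split 1-D counts into background and signal; return the midpoint of the two means."""
    low, high = float(min(counts)), float(max(counts))
    for _ in range(iterations):
        mid = (low + high) / 2
        background = [c for c in counts if c <= mid]
        signal = [c for c in counts if c > mid]
        if not background or not signal:
            break  # counts are not separable into two groups
        low = sum(background) / len(background)
        high = sum(signal) / len(signal)
    return (low + high) / 2


def call_cells(tag_counts):
    """tag_counts maps each tag to its list of counts, one entry per cell."""
    thresholds = {tag: two_means_threshold(row) for tag, row in tag_counts.items()}
    n_cells = len(next(iter(tag_counts.values())))
    calls = []
    for cell in range(n_cells):
        positive = [tag for tag, row in tag_counts.items() if row[cell] > thresholds[tag]]
        if len(positive) == 1:
            calls.append(positive[0])   # singlet: exactly one tag above background
        elif len(positive) > 1:
            calls.append("doublet")     # more than one tag above background
        else:
            calls.append("negative")    # no tag above background
    return calls


# Toy data (invented for illustration): two tags, five cells.
tag_counts = {
    "HTO_A": [120, 5, 110, 3, 4],
    "HTO_B": [4, 98, 105, 6, 2],
}
calls = call_cells(tag_counts)
print(calls)  # ['HTO_A', 'HTO_B', 'doublet', 'negative', 'negative']
```

In this toy example the two modes are well separated. With poor tagging the modes overlap, and cells near the threshold are the ones that cannot be called confidently from tags alone.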
Methods requiring genotype information (Demuxlet) perform well, but obtaining the genotype of each biological sample incurs additional cost. To address this, genotype-free approaches (Souporcell, Vireo) were developed which do not require prior knowledge of the expected SNPs in a biological sample. These genotype-free methods, however, struggle to identify biological samples with low cell numbers. We propose a method, demuxSNP, which uses data from both tags and SNPs to assess the quality of cell tagging assignment and to increase both the number of confidently called cells and overall performance, even when the cell tagging quality is low. We train a classifier on the SNP profiles of singlet cells from the genetically distinct biological samples that were assigned with high confidence using cell tagging methods. In addition to these high-confidence singlets, demuxSNP combines SNP profiles of singlets from different groups to simulate doublets and includes these in the training data. The classifier is then used to assign the low-confidence cells (doublets, or singlets which could not be confidently called using cell tagging methods alone). demuxSNP uses a subset of SNPs which are well observed in the dataset. This pre-selection of SNPs reduces computational cost, mitigates classification bias due to cell type and allows demultiplexing results from this and other algorithms to be inspected directly. The demuxSNP package has been submitted to Bioconductor.
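The workflow described above (select well-observed SNPs, train on high-confidence singlets plus simulated doublets, then assign low-confidence cells) can be sketched in Python as follows. This is an illustration under stated assumptions, not the demuxSNP implementation or its API. The SNP encoding (1 = alternate allele observed, 0 = only reference observed, `None` = SNP not observed), the nearest-neighbour classifier, the mismatch distance, the 50% observation cutoff and all function names are assumptions made for the example; the abstract does not specify them.

```python
from collections import Counter
from itertools import combinations

MISSING = None  # assumed encoding: SNP not covered by any read in this cell


def select_snps(profiles, min_fraction=0.5):
    """Keep indices of SNPs observed in at least min_fraction of cells."""
    keep = []
    for j in range(len(profiles[0])):
        observed = sum(p[j] is not MISSING for p in profiles)
        if observed / len(profiles) >= min_fraction:
            keep.append(j)
    return keep


def merge_profiles(a, b):
    """Simulate a doublet: alternate allele if either cell shows it."""
    merged = []
    for x, y in zip(a, b):
        if x is MISSING and y is MISSING:
            merged.append(MISSING)
        else:
            merged.append(1 if 1 in (x, y) else 0)
    return merged


def distance(a, b):
    """Fraction of mismatches over SNPs observed in both cells."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return 1.0
    return sum(x != y for x, y in shared) / len(shared)


def knn_predict(train, labels, query, k=3):
    """Majority vote among the k nearest training profiles."""
    ranked = sorted(range(len(train)), key=lambda i: distance(train[i], query))
    return Counter(labels[i] for i in ranked[:k]).most_common(1)[0][0]


# Toy high-confidence singlets from two samples (invented data, 5 SNPs each).
singlets = {
    "A": [[1, 1, 0, 0, None], [1, 1, 0, None, None], [1, None, 0, 0, None]],
    "B": [[0, 0, 1, 1, None], [0, 0, 1, None, 1], [None, 0, 1, 1, None]],
}
keep = select_snps([p for group in singlets.values() for p in group])


def subset(profile):
    return [profile[j] for j in keep]


train, labels = [], []
for name, group in singlets.items():
    for p in group:
        train.append(subset(p))
        labels.append(name)
# Simulated doublets: every pair of singlets drawn from different samples.
for group_a, group_b in combinations(singlets.values(), 2):
    for a in group_a:
        for b in group_b:
            train.append(merge_profiles(subset(a), subset(b)))
            labels.append("doublet")

# Low-confidence cells to be assigned.
queries = [[1, 1, 0, None, None], [1, 1, 1, 1, None], [0, None, 1, 1, None]]
predictions = [knn_predict(train, labels, subset(q)) for q in queries]
print(keep)         # [0, 1, 2, 3]
print(predictions)  # ['A', 'doublet', 'B']
```

The fifth SNP is observed in only one of six training cells and is dropped before training, which is the role the pre-selection step plays in this sketch. Restricting the distance to SNPs observed in both cells is one possible way to handle the sparsity of single-cell SNP data; other choices are equally plausible.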