A novel statistical method for single isoform proteogenomics inference
Author(s): Jordy Bollon, Michael Shortreed, Ben T Jordan, Rachel Miller, Colin Dewey, Gloria M Sheynkman, Simone Tiberi
Affiliation(s): Department of Statistical Sciences, University of Bologna, Bologna, Italy
Social media: https://twitter.com/tiberi_simone
Background: Currently, the main strategy to infer proteins is "bottom-up" proteomics, where proteins are measured only indirectly, via peptides. However, most peptides (called shared peptides) map to multiple proteins in the database. This results in ambiguous protein identifications, where protein isoforms cannot be distinguished from one another, and protein inference is typically abstracted to the gene level (NB: most genes are associated with multiple isoforms). Although a few methods have been proposed to perform inference at the isoform level, their protein detection is affected by low statistical power; furthermore, they only identify proteins (presence vs. absence) and do not estimate further measures such as protein abundance.
Aim and impact: We propose a novel statistical method for enhanced proteomics inference via integration of sample-matched mRNA expression, which is a prerequisite and correlate of protein abundance. We jointly model transcriptomics and proteomics data in a Bayesian probabilistic framework, and perform inference on individual protein isoforms. Our approach infers the presence/absence and the abundance of protein isoforms, and quantifies the uncertainty of both estimates via a posterior probability and a credible interval, respectively. Additionally, our framework detects isoforms whose mRNA and protein relative abundances differ, which indicates alterations in the regulation of transcripts and proteins. Our tool could be of great utility to the field of computational biology, by enabling scientists to gain deeper insight into translational regulation. For instance, it was shown that proteins of the transcription factor MITF display changes in abundance (at the gene level) between different subtypes of melanoma; our approach could allow estimating the presence and abundance of individual protein isoforms and studying how they vary across cancer subtypes, hence enabling a deeper understanding of cancer-driving mechanisms.
Methodology: We developed a Bayesian model in which the abundance of shared peptides is allocated to the protein isoforms of origin via a latent variable model. Transcriptomics data is embedded in the form of an informative prior for the relative abundance of protein isoforms. Parameters and latent states are sampled alternately via Markov chain Monte Carlo schemes. Our algorithm is efficiently coded in C++, and runs in about 1 minute on a laptop. Note that our method can also be used on proteomics data alone, but results are more accurate when transcriptomics data is also provided.
Benchmarking: We designed various benchmarks, on both real and simulated data, where we evaluated the performance of our tool and of four competitors that perform inference on protein isoforms (EPIFANY, FIDO, ProteinProphet and PIA). In particular, we collected proteomics measurements of the same cell line obtained with six distinct proteases: with a leave-one-out approach, we analyzed the data from one protease at a time, and used the remaining five to validate results. Even when using proteomics data alone, our approach displays higher sensitivity and specificity than competitors at detecting present protein isoforms; this gap increases significantly when adding transcriptomics data. Furthermore, our estimated abundances correlate highly with the corresponding ground truth.
Availability: In spring, we plan to distribute our tool as a Bioconductor R package, and release a pre-print of our work.
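Illustrative sketch: To make the idea in the Methodology paragraph concrete, the following toy Gibbs sampler alternates between allocating shared peptide counts to compatible isoforms (latent states) and sampling isoform relative abundances (parameters) under a transcript-informed Dirichlet prior. It is a simplified illustration under stated assumptions, not the authors' model or their C++ implementation: the function names, the `prior_strength` parameter, the multinomial/Dirichlet structure, the input numbers, and the use of "at least one allocated count" as a proxy for presence are all hypothetical choices made for this example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def gibbs_allocate(peptides, tpm, n_iter=2000, burn_in=500,
                   prior_strength=10.0, seed=0):
    """Toy Gibbs sampler (hypothetical, for illustration only).

    peptides: list of (compatible_isoform_indices, count) pairs.
    tpm: transcript abundance per isoform, used to build the prior
         (assumed to have a positive sum).
    Returns (posterior mean relative abundance, presence proxy) per isoform.
    """
    rng = random.Random(seed)
    n_iso = len(tpm)
    tpm_total = sum(tpm)
    # Assumption: Dirichlet prior centred on the mRNA relative abundances.
    alpha = [1.0 + prior_strength * t / tpm_total for t in tpm]
    pi = [1.0 / n_iso] * n_iso
    mean_pi = [0.0] * n_iso
    present = [0] * n_iso
    kept = 0
    for it in range(n_iter):
        # Step 1 (latent states): allocate each peptide's counts to its
        # compatible isoforms, proportionally to the current abundances.
        counts = [0] * n_iso
        for isoforms, y in peptides:
            weights = [pi[i] for i in isoforms]
            for i in rng.choices(isoforms, weights=weights, k=y):
                counts[i] += 1
        # Step 2 (parameters): conjugate Dirichlet update given allocations.
        pi = sample_dirichlet([a + c for a, c in zip(alpha, counts)], rng)
        if it >= burn_in:
            kept += 1
            for i in range(n_iso):
                mean_pi[i] += pi[i]
                present[i] += counts[i] > 0
    return [m / kept for m in mean_pi], [p / kept for p in present]


# Made-up input: isoform 0 has a unique peptide; isoforms 1 and 2 are
# observed only through shared peptides.
peptides = [((0,), 30), ((0, 1), 20), ((1, 2), 5)]
tpm = [50.0, 40.0, 0.5]
mean_pi, presence = gibbs_allocate(peptides, tpm)
for i, (m, p) in enumerate(zip(mean_pi, presence)):
    print(f"isoform {i}: mean relative abundance {m:.3f}, presence proxy {p:.3f}")
```

In this sketch, the low transcript abundance of isoform 2 shrinks its prior weight, so the counts of the peptide it shares with isoform 1 are mostly attributed to isoform 1; this is the sense in which transcriptomics data can help resolve shared peptides.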