Characterizing the impacts of dataset imbalance on single-cell data integration

Abstract

Single-cell transcriptomic data measured across multiple samples and conditions has led to a surge in computational methods for data integration. Few studies have explicitly examined the common case of cell-type imbalance between the datasets to be integrated, and none have characterized its impact on downstream analyses. Such imbalance is typical of developmental and tumor datasets, and it has implications for the associated downstream tasks and biological conclusions. To address this gap, we developed the Iniquitate pipeline, which assesses the stability of single-cell RNA sequencing (scRNA-seq) integration results after perturbing the degree of imbalance between datasets. By benchmarking five state-of-the-art scRNA-seq integration techniques in 2600 integration experiments, we show that sample imbalance has significant impacts on downstream analyses and on the biological interpretation of integration results. We observed significant variation in clustering, cell-type classification, marker-gene-based annotation, query-to-reference mapping, and trajectory inference after imbalance perturbation. Our analysis quantifies the biologically relevant effects of dataset imbalance in integration scenarios and introduces guidelines and novel metrics for the integration of disparate datasets.

Date
Jun 12, 2024 12:00 AM
Location
Hinxton Hall, Wellcome Genome Campus, Hinxton, England