SCArray – Large-scale single-cell omics data manipulation with GDS files

Dr. Xiuwen Zheng
Genomics Research Center, AbbVie

August 2021

Introduction

Workflow & Data Structure

[Figure: workflow diagram. Components: GDS files (input file); gdsfmt and SCArray / GDSArray (R packages); DelayedMatrix and SingleCellExperiment (R objects); scater ... (R packages).]

DelayedMatrix wraps an on-disk matrix and allows common operations on it without loading the whole object into memory.

Sparse matrix in GDS (e.g., counts):
1. Nonzero numbers are stored and compressed sequentially (column-major order).
2. Row-specific functions in DelayedMatrixStats (e.g., rowVars) are overridden for a faster implementation.

[Figure: panels (a)-(c) show a matrix with rows as genes and columns as cells; (b) is labeled "Read data efficiently" and (c) "Inefficiently".]
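The column-major layout described above can be illustrated with a small in-memory sketch. It uses the dgCMatrix class from the Matrix package as a stand-in, since it also stores nonzero values column by column; it is not the GDS on-disk format itself. The toy matrix and the helpers read_cell() and read_gene() are hypothetical names introduced only for this illustration.

```r
library(Matrix)  # recommended package shipped with R

# Toy count matrix: 4 genes (rows) x 3 cells (columns), mostly zeros
counts <- sparseMatrix(
  i = c(2, 4, 1, 3),   # row (gene) indices of the nonzero entries
  j = c(1, 1, 2, 3),   # column (cell) indices of the nonzero entries
  x = c(3, 1, 5, 2),   # nonzero values
  dims = c(4, 3),
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Column-compressed storage: values laid out one column after another
counts@x   # nonzero values in column-major order
counts@i   # 0-based row index of each stored value
counts@p   # 0-based offset where each column starts in @x / @i

# Reading one cell touches a single contiguous slice of the stored values
read_cell <- function(m, j) {
  idx <- seq_len(m@p[j + 1] - m@p[j]) + m@p[j]
  setNames(m@x[idx], rownames(m)[m@i[idx] + 1])
}
cell1 <- read_cell(counts, 1)

# Reading one gene has to scan the row index of every stored value
read_gene <- function(m, i) {
  hits <- which(m@i == i - 1)
  cols <- findInterval(hits - 1, m@p)  # map storage position to its column
  setNames(m@x[hits], colnames(m)[cols])
}
gene3 <- read_gene(counts, 3)
```

In this sketch, read_cell() only visits the entries of one column, while read_gene() inspects all stored entries. This is the same asymmetry that makes column-wise access cheap and row-wise access expensive for a column-major sparse matrix.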

Key Functions in SCArray

Functions        Description
scConvGDS()      Saves a SingleCellExperiment object to a GDS file
scExperiment()   Returns an instance of SingleCellExperiment from a GDS file
scArray()        Gets a DelayedArray instance from a GDS file

    library(scRNAseq)
    library(SCArray)

    sce <- MacoskoRetinaData()  # Macosko et al. (2016)
    scConvGDS(sce, "macosko_retina_data.gds")

    sce2 <- scExperiment("macosko_retina_data.gds")
    sce2
    ## class: SingleCellExperiment
    ## dim: 24658 49300
    ## assays(1): counts
    ## rownames(24658): KITL TMTC3 ... 1110059M19RIK GM20861
    ## rowData names(0):
    ## colnames(49300): r1_GGCCGCAGTCCG r1_CTTGTGCGGGAA ... p1_TAACGCGCTCCT
    ## colData names(2): cell.id cluster
    ## reducedDimNames(0):
    ## mainExpName: NULL
    ## altExpNames(0):

GDS file structure:

    File: macosko_retina_data.gds (40.4M)
    +    [  ] *
    |--+ feature.id   { Str8 24658 LZMA_ra(46.2%), 82.0K }
    |--+ sample.id   { Str8 49300 LZMA_ra(22.5%), 173.6K }
    |--+ counts   { SparseReal32 24658x49300 LZMA_ra(14.1%), 40.0M }
    |--+ feature.data   [  ]
    |--+ sample.data   [  ]
    |  |--+ cell.id   { Str8 49300 LZMA_ra(22.5%), 173.6K }
    |  \--+ cluster   { Int32 49300 LZMA_ra(10.1%), 19.5K }
    \--+ meta.data   [  ]

Example: Small-size Dataset

Dataset:

    # of genes                  24,658
    # of cells                  49,300
    % of nonzero                3.1%
    Size (sparse, in-memory)    447.2MB
    GDS (compressed)            40.4MB
    GDS (no compression)        285.5MB
    Citation                    Macosko et al. (2016)

In-memory workflow:

    library(scRNAseq)
    library(scater)

    # Load scRNA-seq data
    sce <- MacoskoRetinaData()
    # Or sce <- readRDS("macosko_retina_data.rds")

    # Quality control
    is.mito <- grepl("^MT-", rownames(sce))
    qcstats <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
    filtered <- quickPerCellQC(qcstats, percent_subsets="subsets_Mito_percent")
    sce <- sce[, !filtered$discard]

    sce <- logNormCounts(sce)  # normalization

    # Dimensionality reduction using the scater package
    # using logcounts in PCA & UMAP
    sce <- runPCA(sce)
    sce <- runUMAP(sce)

    reducedDims(sce)
    ## List of length 2
    ## names(2): PCA UMAP

The same example as in https://bioconductor.org/books/release/OSCA/ (Amezquita et al. 2020), with the data loading step replaced by GDS:

    # Load a GDS file using DelayedArray
    sce <- scExperiment("macosko_retina_data.gds")

    counts(sce)
    ## <24658 x 49300> sparse matrix of class SC_GDSMatrix and type "double":
    ##       r1_GGCCGCAGTCCG r1_CTTGTGCGGGAA ... p1_TAACGCGCTCCT
    ## KITL                0               0   .               0
    ## TMTC3               3               0   .               0
    ## ...                 .               .   .               .

    logcounts(sce)
    ## <24658 x 49300> sparse matrix of class SC_GDSMatrix and type "double":
    ##       r1_GGCCGCAGTCCG r1_CTTGTGCGGGAA ... p1_TAACGCGCTCCT
    ## KITL       0.00000000      0.00000000   .               0
    ## TMTC3      0.14147024      0.00000000   .               0
    ## ...                 .               .   .               .

Example: Large-size Dataset (1.3M mouse brain cells)

Dataset:

    # of genes                  27,998
    # of cells                  1,306,127
    % of nonzero                7.2%
    Size (sparse, in-memory)    ~30GB [a]
    GDS (compressed)            2.7GB
    GDS (no compression)        19.0GB
    Citation                    Zheng et al. (2017)

[a] Cannot load the sparse form into memory using Matrix::dgCMatrix.

GDS file structure:

    File: 1M_sc_neurons.gds (2.7G)
    +    [  ]
    |--+ feature.id   { Str8 27998 LZMA_ra(12.4%), 64.5K }
    |--+ sample.id   { Str8 1306127 LZMA_ra(10.8%), 2.7M }
    \--+ counts   { SparseReal32 27998x1306127 LZMA_ra(14.4%), 2.7G }

Workflow:

    library(SCArray)
    library(scater)
    library(BiocSingular)  # RandomParam() is defined in BiocSingular

    # Load the data of 1.3M cells
    sce <- scExperiment("1M_sc_neurons.gds")

    sce <- logNormCounts(sce)  # normalization
    logcounts(sce)

    # Dimensionality reduction with scater
    # using logcounts in PCA & UMAP
    sce <- scObj(sce)
    # ntop: number of genes with the highest variances used in PCA/UMAP
    # BSPARAM=RandomParam(): use a randomized PCA algorithm
    sce <- runPCA(sce, ntop=500, BSPARAM=RandomParam())
    sce <- runUMAP(sce)

    reducedDims(sce)
    ## List of length 2
    ## names(2): PCA UMAP

    logcounts(sce)  # no log count data is stored in memory or a file
    ## <27998 x 1306127> sparse matrix of class SC_GDSMatrix and type "double":
    ##                    AAACCTGAGATAGGAG-1 ... TTTGTCATCTGAAAGA-133
    ## ENSMUSG00000051951                  0   .                    0
    ## ENSMUSG00000089699                  0   .                    0
    ## ENSMUSG00000095041           1.143348   .                    0
    ## ...                                 .   .                    .

Discussion

Acknowledgements