Data from different experimental platforms and/or batches exhibit systematic variation – i.e., batch effects. Therefore, when conducting joint analysis of data from different batches, a key first step is to align the datasets.
corralm is a multi-table adaptation of correspondence analysis designed for single-cell data, which applies multi-dimensional optimized scaling and matrix factorization to compute integrated embeddings across the datasets. These embeddings can then be used in downstream analyses, such as clustering, cell type classification, and trajectory analysis.
For dimensionality reduction of a single matrix of single-cell data, see the corral vignette.
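As background, the single-table idea behind correspondence analysis can be sketched in a few lines of base R. This is only an illustration of the textbook procedure (standardized residuals followed by a singular value decomposition) on a made-up count matrix; it is not the corral or corralm implementation, and all object names here are hypothetical.

```r
# Toy count matrix: 4 genes (rows) x 3 cells (columns); the values are made up.
counts <- matrix(c(10, 0, 3,
                    2, 8, 1,
                    0, 5, 9,
                    4, 4, 4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

# Convert counts to proportions and compute the row and column masses.
P <- counts / sum(counts)
row_mass <- rowSums(P)
col_mass <- colSums(P)

# Standardized (Pearson) residuals: (observed - expected) / sqrt(expected),
# where "expected" is the proportion under row/column independence.
expected <- outer(row_mass, col_mass)
resid <- (P - expected) / sqrt(expected)

# Factorize the residual matrix. Taking the leading right singular vectors
# is one simple choice of low-dimensional cell embedding (an assumption of
# this sketch, not necessarily what corral returns).
decomp <- svd(resid)
cell_embedding <- decomp$v[, 1:2]
rownames(cell_embedding) <- colnames(counts)
cell_embedding
```

Because the residuals are centered with respect to the row and column masses, the residual matrix has at most two non-trivial dimensions for three cells, which is why only two components are kept here.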
We will use the SCMixology datasets from the CellBench package (Tian et al. 2019).
library(corral)
library(SingleCellExperiment)
library(ggplot2)
library(CellBench)
library(MultiAssayExperiment)
scmix_dat <- load_all_data()[1:3]
These datasets contain a mixture of three lung cancer cell lines, sequenced using three platforms: 10X, CELseq2, and Dropseq.
scmix_dat
## $sc_10x
## class: SingleCellExperiment
## dim: 16468 902
## metadata(3): scPipe Biomart log.exprs.offset
## assays(2): counts logcounts
## rownames(16468): ENSG00000272758 ENSG00000154678 ... ENSG00000054219
## ENSG00000137691
## rowData names(0):
## colnames(902): CELL_000001 CELL_000002 ... CELL_000955 CELL_000965
## colData names(14): unaligned aligned_unmapped ... cell_line sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
##
## $sc_celseq
## class: SingleCellExperiment
## dim: 28204 274
## metadata(3): scPipe Biomart log.exprs.offset
## assays(2): counts logcounts
## rownames(28204): ENSG00000281131 ENSG00000227456 ... ENSG00000148143
## ENSG00000226887
## rowData names(0):
## colnames(274): A1 A10 ... P8 P9
## colData names(15): unaligned aligned_unmapped ... cell_line sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
##
## $sc_dropseq
## class: SingleCellExperiment
## dim: 15127 225
## metadata(3): scPipe Biomart log.exprs.offset
## assays(2): counts logcounts
## rownames(15127): ENSG00000223849 ENSG00000225355 ... ENSG00000133789
## ENSG00000146674
## rowData names(0):
## colnames(225): CELL_000001 CELL_000002 ... CELL_000249 CELL_000302
## colData names(14): unaligned aligned_unmapped ... cell_line sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Each sequencing platform captures a different set of genes. To apply this method, the matrices must be matched by features (i.e., genes). We’ll find the intersection of the genes across the three datasets, then subset to it as we proceed.
First, we will prepare the data by:
1. adding the sequencing platform to the colData of each SCE (as the Method column), and
2. subsetting by the intersect of the genes.
platforms <- c('10X','CELseq2','Dropseq')
for(i in seq_along(scmix_dat)) {
  colData(scmix_dat[[i]])$Method <- rep(platforms[i], ncol(scmix_dat[[i]]))
}
scmix_mae <- as(scmix_dat,'MultiAssayExperiment')
scmix_dat <- as.list(MultiAssayExperiment::experiments(MultiAssayExperiment::intersectRows(scmix_mae)))
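To make the feature-matching step concrete, here is a minimal base-R sketch of the same idea on toy matrices: find the genes shared by every table, then subset each table to them. The matrices, gene names, and the helper make_toy are hypothetical and used only for illustration; the vignette itself relies on intersectRows from MultiAssayExperiment.

```r
# Hypothetical helper: build a toy genes-by-cells matrix with made-up values.
make_toy <- function(genes, n_cells) {
  matrix(seq_len(length(genes) * n_cells),
         nrow = length(genes),
         dimnames = list(genes, paste0("cell", seq_len(n_cells))))
}

# Three toy tables with overlapping but non-identical gene sets.
toy_list <- list(
  a = make_toy(c("g1", "g2", "g3", "g4"), 2),
  b = make_toy(c("g2", "g3", "g4", "g5"), 3),
  c = make_toy(c("g3", "g4", "g6"), 2)
)

# Genes present in every table.
shared_genes <- Reduce(intersect, lapply(toy_list, rownames))

# Subset each table to the shared genes, in the same order.
toy_matched <- lapply(toy_list, function(m) m[shared_genes, , drop = FALSE])

# Every table now has the same number of rows; the cell counts are unchanged.
sapply(toy_matched, dim)
```

After this step each table has identical row names, so the tables can be compared or combined feature by feature.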