1 Overview

The mumosa package implements a variety of utilities for analyses of single-cell data with multiple modalities. This usually refers to single-cell RNA-seq experiments with proteomics, epigenomics or other data collected from the same cells. The aim is to investigate how different modalities relate to each other via analyses of correlations, and to combine data together from multiple modalities for integrated downstream analyses. The actual analyses of each individual modality are deferred to other packages; the scope of mumosa is limited to the sharing of information across modalities.

2 Setting up the dataset

To demonstrate the functionalities of this package, we will use a subset of a CITE-seq experiment from the scRNAseq package. The main Experiment contains the RNA-seq counts while the adt alternative Experiment contains the CITE-seq counts - see the SingleCellExperiment package documentation for more details.

library(scater)
library(scran)
library(scRNAseq)
sce <- KotliarovPBMCData()
sce <- sce[,1:1000] # subset for speed.
sce
## class: SingleCellExperiment 
## dim: 32738 1000 
## metadata(0):
## assays(1): counts
## rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
## rowData names(0):
## colnames(1000): AAACCTGAGAGCCCAA_H1B1ln1 AAACCTGAGGCGTACA_H1B1ln1 ...
##   ATAACGCGTTTGGGCC_H1B1ln1 ATAACGCTCAACGCTA_H1B1ln1
## colData names(24): nGene nUMI ... dmx_hto_match timepoint
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(1): adt

We perform some cursory analyses on the RNA component:

stats <- perCellQCMetrics(sce, subsets=list(Mito=grep("^MT-", rownames(sce))))
filter <- quickPerCellQC(stats, sub.fields="subsets_Mito_percent")
sce <- sce[,!filter$discard]
sce <- logNormCounts(sce)
dec <- modelGeneVar(sce)
set.seed(10000)
sce <- runPCA(sce, ncomponents=25, subset_row=getTopHVGs(dec, n=5000))

And again on the protein component.

# TODO: sync this with the book.
sceA <- altExp(sce)
statsA <- perCellQCMetrics(sceA)
keep <- statsA$detected > 50 
sceA <- sceA[,keep]

library(DropletUtils)
amb <- inferAmbience(assay(sceA))
sceA <- computeMedianFactors(sceA, reference=amb)
sceA <- logNormCounts(sceA)

set.seed(10000)
sceA <- runPCA(sceA, ncomponents=15)

Finally, we put everything back into the same object, only considering the cells passing QC in both modalities.

sce2 <- sce[,keep]
altExp(sce2) <- sceA

3 Rescaling by neighbors

The easiest way to combine data for the same set of cells is to simply cbind their matrices together prior to downstream analyses like clustering. However, this requires some rescaling to adjust for the differences in the number of features and variation of each modality; otherwise, the modality with more features or stronger (technical) variation would just dominate later calculations. rescaleByNeighbors() quantifies the “noise” of each modality using on the median distance to the \(k\)-th nearest neighbor, and then rescales each expression/PC matrix by that distance to equalize the noise across modalities.

library(mumosa)
output <- rescaleByNeighbors(list(reducedDim(sce2), reducedDim(altExp(sce2))))
dim(output)
## [1] 911  40

The result is a cbinded matrix that can be used directly in downstream analyses like clustering and dimensionality reduction.

set.seed(100001)
library(bluster)
sce2$combined.clustering <- clusterRows(output, NNGraphParam())

reducedDim(sce2, "combined") <- output
sce2 <- runTSNE(sce2, dimred="combined")
plotTSNE(sce2, colour_by="combined.clustering")