# 1 Detecting differentially abundant subpopulations in mass cytometry data

Package: cydar
Author: Aaron Lun (alun@wehi.edu.au)
Compilation date: 2019-05-02

# 2 Introduction

Mass cytometry is a technique that allows simultaneous profiling of many (> 30) protein markers on each of millions of cells. This is frequently used to characterize cell subpopulations based on unique combinations of markers. One way to analyze this data is to identify subpopulations that change in abundance between conditions, e.g., with or without drug treatment, before and after stimulation. This vignette will describe the steps necessary to perform this “differential abundance” (DA) analysis.

# 3 Setting up the data

## 3.1 Mocking up a data set

The analysis starts from a set of Flow Cytometry Standard (FCS) files containing intensities for each cell. For the purposes of this vignette, we will simulate some data to demonstrate the methods below. This experiment will assay 30 markers plus one additional QC marker, and contain 3 replicate samples in each of 2 biological conditions. We add two small differentially abundant subpopulations to ensure that we get something to look at later.

ncells <- 20000
nda <- 200
nmarkers <- 31
down.pos <- 1.8
up.pos <- 1.2
conditions <- rep(c("A", "B"), each=3)
combined <- rbind(matrix(rnorm(ncells*nmarkers, 1.5, 0.6), ncol=nmarkers),
    matrix(rnorm(nda*nmarkers, down.pos, 0.3), ncol=nmarkers),
    matrix(rnorm(nda*nmarkers, up.pos, 0.3), ncol=nmarkers))
combined[,31] <- rnorm(nrow(combined), 1, 0.5) # last marker is a QC marker.
combined <- 10^combined # raw intensity values
sample.id <- c(sample(length(conditions), ncells, replace=TRUE),
    sample(which(conditions=="A"), nda, replace=TRUE),
    sample(which(conditions=="B"), nda, replace=TRUE))
colnames(combined) <- paste0("Marker", seq_len(nmarkers))
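
As a quick check on the simulated design, the sketch below re-creates only the sample assignments and tabulates where the spiked-in cells end up. The seed and the `origin` labels are assumptions added for illustration and are not part of the simulation above. The first `nda` spiked-in cells should appear only in the condition A samples, and the second `nda` only in the condition B samples.

```r
set.seed(100)  # arbitrary seed, used only to make this sketch reproducible
ncells <- 20000
nda <- 200
conditions <- rep(c("A", "B"), each=3)

# Same sampling scheme as above: background cells can go to any sample,
# while each DA subpopulation is confined to the samples of one condition.
sample.id <- c(sample(length(conditions), ncells, replace=TRUE),
    sample(which(conditions=="A"), nda, replace=TRUE),
    sample(which(conditions=="B"), nda, replace=TRUE))

# Hypothetical labels describing where each simulated cell came from.
origin <- rep(c("background", "A-only", "B-only"), c(ncells, nda, nda))

# Rows are cell origins, columns are samples 1 to 6.
composition <- table(origin, factor(sample.id, levels=seq_along(conditions)))
composition
```

The "A-only" row should be zero in samples 4 to 6 and the "B-only" row should be zero in samples 1 to 3, which is what produces the differential abundance between conditions.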

We use this to construct a ncdfFlowSet for our downstream analysis.

library(ncdfFlow)
collected.exprs <- list()
for (i in seq_along(conditions)) {
    stuff <- list(combined[sample.id==i,,drop=FALSE])
    names(stuff) <- paste0("Sample", i)
    collected.exprs[[i]] <- poolCells(stuff)
}
names(collected.exprs) <- paste0("Sample", seq_along(conditions))
collected.exprs <- ncdfFlowSet(as(collected.exprs, "flowSet"))

In practice, we can use the read.ncdfFlowSet function to load intensities from FCS files into the R session. The ncdfFlowSet object can replace all instances of collected.exprs in the downstream steps.

## 3.2 Pre-processing of intensities

### 3.2.1 Pooling cells together

The intensities need to be transformed and gated prior to further analysis. We first pool all cells together into a single flowFrame, which will be used for construction of the transformation and gating functions for all samples. This avoids spurious differences from using sample-specific functions.

pool.ff <- poolCells(collected.exprs)

### 3.2.2 Estimating transformation parameters

We use the estimateLogicle method from the flowCore package to obtain a transformation function, and apply it to pool.ff. This performs a biexponential transformation with parameters estimated for optimal display.

library(flowCore)
trans <- estimateLogicle(pool.ff, colnames(pool.ff))
proc.ff <- transform(pool.ff, trans)
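
To give a feel for what a biexponential-style transformation does, the sketch below uses the inverse hyperbolic sine as a simpler stand-in. This is not the logicle function fitted by estimateLogicle; the helper name is hypothetical, and the cofactor of 5 is an assumed value (a common choice for mass cytometry data). The point is only that such transformations are roughly linear near zero and roughly logarithmic at high intensities.

```r
# Stand-in for a biexponential transformation: arcsinh with a fixed cofactor.
# NOTE: this is NOT the logicle transform; the cofactor of 5 is an assumption.
asinh.transform <- function(x, cofactor=5) {
    asinh(x / cofactor)
}

raw <- c(0, 0.1, 1, 10, 100, 1000, 10000)
transformed <- asinh.transform(raw)

# Near zero the output is close to x/cofactor (linear regime), while at
# high intensities it approaches log(2*x/cofactor) (logarithmic regime).
data.frame(raw=raw, transformed=transformed,
    linear=raw/5, log.approx=log(2*raw/5))
```

Comparing the `transformed` column against `linear` for small values and against `log.approx` for large values shows the two regimes.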

### 3.2.3 Gating out uninteresting cells

The next step is to construct gates to remove uninteresting cells. Several gates are commonly used in mass cytometry data analysis, typically in the following order:

• Gating out calibration beads (high in Ce140) used to correct for intensity shifts in the mass spectrometer.
• Gating on moderate intensities for the DNA markers, to remove debris and doublets. This can be done using the dnaGate function.
• Gating on a dead/alive marker, to remove dead cells. Whether high or low values should be removed depends on the marker (e.g., high values to be removed for cisplatin).
• Gating out low values for selected markers, e.g., CD45 when studying leukocytes, CD3 when studying T cells.

To demonstrate, we will construct a gate to remove outlying high values for the last marker, using the outlierGate function with type="upper". The constructed gate is then applied to the flowFrame, only retaining cells falling within the gated region.

gate.31 <- outlierGate(proc.ff, "Marker31", type="upper")
gate.31
## Rectangular gate 'Marker31_outlierGate' with dimensions:
##   Marker31: (-Inf,4.00732898173722)
filter.31 <- filter(proc.ff, gate.31)
summary(filter.31@subSet)
##    Mode   FALSE    TRUE
## logical      35   20011

We apply the gate before proceeding to the next marker to be gated.

proc.ff <- Subset(proc.ff, gate.31)
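
The idea behind an outlier-based gate can be sketched in base R: place the threshold a fixed number of median absolute deviations (MADs) away from the median, and keep the cells on the acceptable side. The helper below is hypothetical, with an assumed default of 3 MADs. It is not the outlierGate implementation, which returns a gate object rather than a logical vector.

```r
# Hypothetical helper: keep cells below an upper outlier threshold.
# The threshold is median + nmads * MAD; nmads=3 is an assumed default.
upper.outlier.keep <- function(x, nmads=3) {
    threshold <- median(x) + nmads * mad(x)
    list(threshold=threshold, keep=(x < threshold))
}

set.seed(42)  # arbitrary seed for reproducibility of the sketch
# Toy intensities for one marker, with three obvious high outliers appended.
intensity <- c(rnorm(1000, mean=2, sd=0.5), 8, 9, 10)
out <- upper.outlier.keep(intensity)
out$threshold
table(out$keep)
```

Because the median and MAD are robust statistics, the three extreme values have almost no influence on where the threshold is placed.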

### 3.2.4 Applying functions to the original data

Applying the transformation functions to the original data is simple.

processed.exprs <- transform(collected.exprs, trans)

Applying the gates is similarly easy. See the flowViz package for methods to visualize gating diagnostics.

processed.exprs <- Subset(processed.exprs, gate.31)

Markers used for gating are generally ignored in the rest of the analysis. For example, as long as all cells contain DNA, we are generally not interested in differences in the amount of DNA. This is achieved by discarding those markers (in this case, marker 31).

processed.exprs <- processed.exprs[,1:30]

## 3.3 Normalizing intensities across batches

By default, we do not perform any normalization of intensities between samples. This is because we assume that barcoding was used with multiplexed staining and mass cytometry. Thus, technical biases that might affect intensity should be the same in all samples, which means that they cancel out when comparing between samples.

In data sets containing multiple batches of separately barcoded samples, we provide the normalizeBatch function to adjust the intensities. This uses range-based normalization to equalize the dynamic range between batches for each marker. Alternatively, it can use warping functions to eliminate non-linear distortions due to batch effects.
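
To illustrate what equalizing the dynamic range means, the sketch below linearly rescales one marker in each of two batches so that the 1st and 99th percentiles match the average of those percentiles across batches. The function name, the choice of percentiles and the use of a plain average as the target are all assumptions made for illustration. normalizeBatch has its own interface and options, which are described in its documentation.

```r
# Hypothetical helper: linearly map each batch so that its lower and upper
# quantiles coincide with the across-batch average of those quantiles.
rescale.range <- function(batches, probs=c(0.01, 0.99)) {
    # One column per batch; rows are the lower and upper quantiles.
    ranges <- vapply(batches, function(x) unname(quantile(x, probs)), numeric(2))
    target <- rowMeans(ranges)
    out <- batches
    for (b in seq_along(batches)) {
        scale <- (target[2] - target[1]) / (ranges[2, b] - ranges[1, b])
        out[[b]] <- (batches[[b]] - ranges[1, b]) * scale + target[1]
    }
    out
}

set.seed(1)  # arbitrary seed for the sketch
batch1 <- rnorm(5000, mean=2, sd=0.5)
batch2 <- rnorm(5000, mean=2, sd=0.5) * 1.5 + 0.3  # stretched and shifted batch
normed <- rescale.range(list(batch1=batch1, batch2=batch2))

# After rescaling, both batches share the same 1st and 99th percentiles.
sapply(normed, quantile, probs=c(0.01, 0.99))
```

A linear rescaling of this kind only makes sense when the batches are expected to have comparable distributions for the marker in question.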

The problem of normalization is much harder to solve in data sets with no barcoding at all. In such cases, the best solution is to expand the sizes of the hyperspheres to “smooth over” any batch effects. See the expandRadius function for more details.

# 4 Counting cells into hyperspheres

We quantify abundance by assigning cells to hyperspheres in the high-dimensional marker space, and counting the number of cells from each sample in each hypersphere. To do this, we first convert the intensity data into a format that is more amenable for counting. The prepareCellData function works with either a list of matrices or directly with a ncdfFlowSet object, and generates a CyData object containing the reformatted intensities.

cd <- prepareCellData(processed.exprs)

We then assign cells to hyperspheres using the countCells function. Each hypersphere is centred at a cell to restrict ourselves to non-empty hyperspheres, and has radius equal to 0.5 times the square root of the number of markers. The square root function adjusts for increased sparsity of the data at higher dimensions, while the 0.5 scaling factor allows cells with 10-fold differences in marker intensity (due to biological variability or technical noise) to be counted into the same hypersphere. Also see the neighborDistances function for guidance on choosing a value of tol.

cd <- countCells(cd, tol=0.5)

The output is another CyData object with extra information added to various fields. In particular, the reported count matrix contains the set of counts for each hypersphere (row) from each sample (column).

head(assay(cd))
##      Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
## [1,]       3       2       2       3       1       3
## [2,]       5       3       3       1       3       4
## [3,]      18      12      10      20      20      17
## [4,]       8      10      10       8      10      11
## [5,]       4       3       2       2       6       4
## [6,]      15      11      11       9      11      15

Also reported are the “positions” of the hyperspheres, defined for each marker as the median intensity for all cells assigned to each hypersphere. This will be required later for interpretation, as the marker intensities are required for defining the function of each subpopulation. Shown below are the positions of the first six hyperspheres, each represented by its set of median intensities across all markers.

head(intensities(cd))
##       Marker1  Marker2  Marker3  Marker4  Marker5  Marker6  Marker7  Marker8
## [1,] 1.935271 1.621715 2.325860 2.359797 1.860706 2.554898 1.584440 2.177339
## [2,] 2.048290 1.888639 1.801296 2.305013 2.323972 2.983556 1.893312 1.577262
## [3,] 1.880087 1.749390 2.040243 1.913854 1.960262 2.450824 1.815228 2.014238
## [4,] 2.152659 2.091876 2.130330 1.884311 1.734820 2.385827 2.213045 1.557434
## [5,] 1.987980 2.040192 1.689138 1.821578 2.029730 2.230265 1.812931 1.577035
## [6,] 1.681563 1.644541 2.120475 2.067052 2.373371 2.615281 2.020534 1.879639
##       Marker9 Marker10 Marker11 Marker12 Marker13 Marker14 Marker15 Marker16
## [1,] 1.965217 2.504569 1.987041 1.615124 2.277062 1.864430 1.985323 2.622484
## [2,] 1.895843 2.250674 2.278541 2.136949 2.450958 1.775300 2.475793 2.354322
## [3,] 2.186760 2.194524 1.794452 1.925373 1.876132 1.215054 2.175622 2.116142
## [4,] 1.957296 1.982114 1.666668 1.837383 1.913268 1.729039 2.057938 2.502681
## [5,] 1.647814 2.250674 2.164280 1.599091 2.059319 1.445530 2.209087 1.758848
## [6,] 2.221902 2.198835 1.990981 2.235947 2.076680 1.432520 2.310074 2.131157
##      Marker17 Marker18 Marker19 Marker20 Marker21 Marker22 Marker23 Marker24
## [1,] 1.994050 2.126200 2.070206 1.759896 2.383431 2.101017 2.450558 2.402547
## [2,] 2.551383 2.007046 2.238959 1.827292 1.834961 2.797987 2.241913 2.631473
## [3,] 2.053675 2.092680 2.331300 1.543087 2.121845 2.414022 2.088776 2.320335
## [4,] 2.200909 2.114198 2.416717 1.626354 1.834237 2.133142 1.976356 2.652829
## [5,] 2.223575 1.932704 2.001990 1.520431 2.103909 2.449779 1.682073 2.088892
## [6,] 2.136425 2.128450 2.227813 1.418250 2.542713 2.656262 1.748727 2.362882
##      Marker25 Marker26 Marker27 Marker28 Marker29 Marker30
## [1,] 2.224684 2.051308 2.552803 2.304294 1.851418 1.806884
## [2,] 1.935623 2.161547 2.026939 2.195970 2.235876 1.539637
## [3,] 2.304625 1.281958 2.209503 2.418481 2.035186 1.913879
## [4,] 2.322714 1.742567 2.273046 2.212864 2.100933 2.000857
## [5,] 2.191442 1.391203 1.758744 1.834644 1.584498 2.123303
## [6,] 2.005661 2.379569 2.224199 2.394712 1.876099 2.085435

There is some light filtering in countCells to improve memory efficiency, which can be adjusted with the filter argument.
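
To make the hypersphere definition concrete, the sketch below counts cells around a single centre by brute force in base R, and computes the per-marker median as the position. This only illustrates the definitions in this section on toy data (the simulated cells, seed and variable names are assumptions); it is not how countCells is implemented.

```r
set.seed(10)  # arbitrary seed for the sketch
nmarkers <- 30
tol <- 0.5
radius <- tol * sqrt(nmarkers)  # radius shared by every hypersphere

# A cell that is 0.5 units from the centre in every marker lies exactly on
# the boundary, so the distance below matches the radius.
offset.dist <- sqrt(sum(rep(tol, nmarkers)^2))

# Toy data: 600 cells on the transformed scale, 100 from each of 6 samples.
cells <- matrix(rnorm(600 * nmarkers, mean=2, sd=0.4), ncol=nmarkers)
sample.id <- rep(1:6, each=100)

# Centre a hypersphere on the first cell and find all cells inside it.
centre <- cells[1, ]
dists <- sqrt(rowSums(sweep(cells, 2, centre)^2))
inside <- dists <= radius

# Count per sample, and take the per-marker median as the "position".
counts <- tabulate(sample.id[inside], nbins=6)
position <- apply(cells[inside, , drop=FALSE], 2, median)
counts
```

Because the hypersphere is centred on a cell, the count vector always has at least one non-zero entry.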

# 5 Testing for significant differences in abundance

We can use a number of methods to test the count data for differential abundance. Here, we will use the quasi-likelihood (QL) method from the edgeR package. This allows us to model discrete count data with overdispersion due to biological variability.

library(edgeR)
y <- DGEList(assay(cd), lib.size=cd$totals)

First, we do some filtering to remove low-abundance hyperspheres with average counts below 5. These are mostly uninteresting as they do not provide enough evidence to reject the null hypothesis. Removing them also reduces computational work and the severity of the multiple testing correction. Lower values can also be used, but we do not recommend going below 1.

keep <- aveLogCPM(y) >= aveLogCPM(5, mean(cd$totals))
cd <- cd[keep,]
y <- y[keep,]
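
The effect of an abundance filter can be seen on a toy count matrix. The sketch below uses a plain mean count of at least 5 as a simplified stand-in, which is only a rough approximation when total cell counts are similar across samples. It is not equivalent to the aveLogCPM comparison above, which accounts for differences in total cell counts between samples. The counts and variable names are made up for illustration.

```r
# Toy count matrix: rows are hyperspheres, columns are samples.
counts <- rbind(c(3, 2, 2, 3, 1, 3),
                c(18, 12, 10, 20, 20, 17),
                c(8, 10, 10, 8, 10, 11),
                c(0, 1, 0, 2, 1, 0))
colnames(counts) <- paste0("Sample", 1:6)

# Simplified filter: keep hyperspheres whose mean count is at least 5.
min.count <- 5
keep.simple <- rowMeans(counts) >= min.count
keep.simple
counts[keep.simple, , drop=FALSE]
```

Only the second and third hyperspheres pass, as the other two have mean counts well below the threshold.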

We then apply the QL framework to estimate the dispersions, fit a generalized linear model and test for significant differences between conditions. We refer interested readers to the edgeR user’s guide for more details.

design <- model.matrix(~factor(conditions))
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef=2)
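
To see what coef=2 refers to, it helps to inspect the design matrix on its own. The sketch below rebuilds it from the same `conditions` vector used in the simulation, using only base R.

```r
conditions <- rep(c("A", "B"), each=3)
design <- model.matrix(~factor(conditions))
design

# The second column is 1 for samples in condition B and 0 otherwise, so its
# coefficient represents the log-fold change of B over A.
colnames(design)
```

Testing the second coefficient against zero therefore asks whether abundance differs between conditions B and A, with positive log-fold changes indicating higher abundance in B.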

Note that normalization by total cell count per sample is implicitly performed by setting lib.size=cd$totals. We do not recommend using calcNormFactors in this context, as its assumptions may not be applicable to mass cytometry data.

# 6 Controlling the spatial FDR

To correct for multiple testing, we aim to control the spatial false discovery rate (FDR). This refers to the FDR across areas of the high-dimensional space. We do this using the spatialFDR function, given the p-values and positions of all tested hyperspheres.

qvals <- spatialFDR(intensities(cd), res$table$PValue)

Hyperspheres with significant differences in abundance are defined as those detected at a spatial FDR of, say, 5%.

is.sig <- qvals <= 0.05
summary(is.sig)
##    Mode   FALSE    TRUE
## logical     117      79

This approach is a bit more sophisticated than simply applying the Benjamini-Hochberg (BH) method to the hypersphere p-values. Such a simple approach would fail to account for the different densities of hyperspheres in different parts of the high-dimensional space.

# 7 Visualizing and interpreting the results

## 7.1 With static plots

To interpret the DA hyperspheres, we use dimensionality reduction to visualize them in a convenient two-dimensional representation. This is done here with PCA, though for more complex data sets, we suggest using something like Rtsne.

sig.coords <- intensities(cd)[is.sig,]
sig.res <- res$table[is.sig,]
coords <- prcomp(sig.coords)
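
A two-dimensional PCA plot is only a faithful summary if the first two components capture most of the variance. The sketch below checks this on toy coordinates; the two simulated clusters are assumptions for illustration and are not output from the analysis above. The same calculation can be applied to the prcomp output for the real hypersphere positions.

```r
set.seed(5)  # arbitrary seed for the sketch
nmarkers <- 30

# Toy positions: two clusters of hyperspheres centred at different intensities.
toy.coords <- rbind(
    matrix(rnorm(50 * nmarkers, mean=1.8, sd=0.05), ncol=nmarkers),
    matrix(rnorm(50 * nmarkers, mean=1.2, sd=0.05), ncol=nmarkers))
toy.pca <- prcomp(toy.coords)

# Proportion of the total variance explained by each principal component.
var.explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
round(head(var.explained, 5), 3)
sum(var.explained[1:2])
```

If the first two components explain only a small fraction of the variance, a non-linear method may give a more useful picture.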

Each DA hypersphere is represented as a point on the plot below, coloured according to its log-fold change between conditions. We can see that we’ve recovered the two DA subpopulations that we put in at the start. One subpopulation increases in abundance (red) while the other decreases (blue) in the second condition relative to the first.

plotSphereLogFC(coords$x[,1], coords$x[,2], sig.res$logFC)

This plot should be interpreted by examining the marker intensities, in order to determine what each area of the plot represents. We suggest using the plotSphereIntensity function to make a series of plots for all markers, as shown below. Colours represent the median marker intensities of each hypersphere, mapped onto the viridis colour scale.

par(mfrow=c(6,5), mar=c(2.1, 1.1, 3.1, 1.1))
limits <- intensityRanges(cd, p=0.05)
all.markers <- markernames(cd)
for (i in order(all.markers)) {
    plotSphereIntensity(coords$x[,1], coords$x[,2], sig.coords[,i],
        irange=limits[,i], main=all.markers[i])
}

We use the intensityRanges function to define the bounds of the colour scale. This caps the minimum and maximum intensities at the 5th and 95th percentiles, respectively, to avoid colours being skewed by outliers.

Note that both plotSphereLogFC and plotSphereIntensity return a vector of colours, named with the corresponding numeric value of the log-fold change or intensity. This can be used to construct a colour bar; see ?plotSphereLogFC for more details.

## 7.2 Using a Shiny app

An alternative approach to interpretation is to examine each hypersphere separately, and to determine the cell type corresponding to the hypersphere’s intensities. First, we prune down the number of hyperspheres to be examined in this manner. This is done by identifying “non-redundant” hyperspheres, i.e., hyperspheres that do not overlap hyperspheres with lower p-values.

nonred <- findFirstSphere(intensities(cd), res$table$PValue)
summary(nonred)
##    Mode   FALSE    TRUE
## logical     194       2

We pass these hyperspheres to the interpretSpheres function, which creates a Shiny app where the intensities are displayed. The idea is to allow users to inspect each hypersphere, annotate it and then save the labels to R once annotation is complete. See the documentation for more details.

all.coords <- prcomp(intensities(cd))
app <- interpretSpheres(cd, select=nonred, metrics=res$table, run=FALSE,
    red.coords=all.coords$x[,1:2], red.highlight=is.sig)
# Set run=TRUE if you want the app to run automatically.

Users wanting to identify specific subpopulations may consider using the selectorPlot function from scran. This provides an interactive framework by which hyperspheres can be selected and saved to an R session for further examination. The best markers that distinguish cells in one subpopulation from all others can also be identified using pickBestMarkers.

# 9 Session information

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
##  [1] edgeR_3.26.0                limma_3.40.0
##  [3] ncdfFlow_2.30.0             BH_1.69.0-1
##  [7] cydar_1.8.0                 SingleCellExperiment_1.6.0
##  [9] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
## [11] matrixStats_0.54.0          Biobase_2.44.0
## [13] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0
## [15] IRanges_2.18.0              S4Vectors_0.22.0
## [17] BiocGenerics_0.30.0         BiocParallel_1.18.0
## [19] knitr_1.22                  BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1          splines_3.6.0          viridisLite_0.3.0
##  [4] shiny_1.3.2            assertthat_0.2.1       statmod_1.4.30
##  [7] BiocManager_1.30.4     latticeExtra_0.6-28    GenomeInfoDbData_1.2.1
## [10] yaml_2.2.0             robustbase_0.93-4      pillar_1.3.1
## [13] lattice_0.20-38        glue_1.3.1             digest_0.6.18
## [16] RColorBrewer_1.1-2     promises_1.0.1         XVector_0.24.0
## [19] colorspace_1.4-1       htmltools_0.3.6        httpuv_1.5.1
## [22] Matrix_1.2-17          plyr_1.8.4             pcaPP_1.9-73
## [25] pkgconfig_2.0.2        bookdown_0.9           zlibbioc_1.30.0
## [28] purrr_0.3.2            xtable_1.8-4           corpcor_1.6.9
## [31] mvtnorm_1.0-10         scales_1.0.0           later_0.8.0
## [34] tibble_2.1.1           ggplot2_3.1.1          flowViz_1.48.0
## [37] hexbin_1.27.2          lazyeval_0.2.2         IDPmisc_1.1.19
## [40] magrittr_1.5           crayon_1.3.4           mime_0.6
## [43] evaluate_0.13          MASS_7.3-51.4          graph_1.62.0
## [46] tools_3.6.0            stringr_1.4.0          locfit_1.5-9.1
## [49] munsell_0.5.0          cluster_2.0.9          compiler_3.6.0
## [52] rlang_0.3.4            grid_3.6.0             RCurl_1.95-4.12
## [55] BiocNeighbors_1.2.0    bitops_1.0-6           rmarkdown_1.12
## [58] gtable_0.3.0           rrcov_1.4-7            R6_2.4.0
## [61] gridExtra_2.3          dplyr_0.8.0.1          KernSmooth_2.23-15
## [64] stringi_1.4.3          Rcpp_1.0.1             DEoptimR_1.0-8
## [67] tidyselect_0.2.5       xfun_0.6