Contents

1 Overview

The HDCytoData data package contains a set of publicly available high-dimensional flow cytometry and mass cytometry (CyTOF) data sets, formatted into SummarizedExperiment and flowSet Bioconductor object formats. The data objects are hosted on the Bioconductor ExperimentHub web resource.

The objects contain the cell-level expression values, as well as row and column meta-data, including sample IDs, group IDs, true cell population labels or cluster labels (where available), channel names, protein marker names, and protein marker classes (cell type or cell state).

These data sets have been used for benchmarking purposes in our previous work and publications, e.g. to evaluate the performance of clustering algorithms. They are provided here in the SummarizedExperiment and flowSet formats to make them easier to access.

2 Data sets

Currently, the package contains the following data sets:

Additional details on each data set are included in the help files for the data sets. For each data set, this includes a description of the data set (biological context, number of samples, number of cells, number and classes of protein markers, etc.), as well as an explanation of the object structures, and references and raw data sources.

The help files can be accessed by the data set names, e.g. ?Bodenmiller_BCR_XL.

3 How to load data

First, we show how to load the data sets.

The data sets can be loaded either with named functions referring directly to the object names, or by using the ExperimentHub interface. Both methods are demonstrated below, using the Bodenmiller_BCR_XL data set as an example.

See the help file (?Bodenmiller_BCR_XL) for details about the structure of the SummarizedExperiment or flowSet objects for this data set.

Load the data sets using named functions:

suppressPackageStartupMessages(library(HDCytoData))
## snapshotDate(): 2018-04-27
# Load 'SummarizedExperiment' object using named function
Bodenmiller_BCR_XL_SE()
## snapshotDate(): 2018-04-27
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1119'
## class: SummarizedExperiment 
## dim: 172791 35 
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(4): group_id patient_id sample_id population_id
## colnames(35): Time Cell_length ... DNA-1 DNA-2
## colData names(3): channel_name marker_name marker_class
# Load 'flowSet' object using named function
Bodenmiller_BCR_XL_flowSet()
## snapshotDate(): 2018-04-27
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1120'
## A flowSet with 16 experiments.
## 
##   column names:
##   Time Cell_length CD3 CD45 BC1 BC2 pNFkB pp38 CD4 BC3 CD20 CD33 pStat5 CD123 pAkt pStat1 pSHP2 pZap70 pStat3 BC4 CD14 pSlp76 BC5 pBtk pPlcg2 pErk BC6 pLat IgM pS6 HLA-DR BC7 CD7 DNA-1 DNA-2 group_id patient_id sample_id population_id

Alternatively, load the data sets using the ExperimentHub interface:

# Create an ExperimentHub instance
ehub <- ExperimentHub()
## snapshotDate(): 2018-04-27
# Query ExperimentHub instance to find data sets
query(ehub, "HDCytoData")
## ExperimentHub with 2 records
## # snapshotDate(): 2018-04-27 
## # $dataprovider: NA
## # $species: Homo sapiens
## # $rdataclass: SummarizedExperiment, flowSet
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH1119"]]' 
## 
##            title                     
##   EH1119 | Bodenmiller_BCR_XL_SE     
##   EH1120 | Bodenmiller_BCR_XL_flowSet
# Load 'SummarizedExperiment' object using index of data set
ehub[["EH1119"]]
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1119'
## class: SummarizedExperiment 
## dim: 172791 35 
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(4): group_id patient_id sample_id population_id
## colnames(35): Time Cell_length ... DNA-1 DNA-2
## colData names(3): channel_name marker_name marker_class
# Load 'flowSet' object using index of data set
ehub[["EH1120"]]
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1120'
## A flowSet with 16 experiments.
## 
##   column names:
##   Time Cell_length CD3 CD45 BC1 BC2 pNFkB pp38 CD4 BC3 CD20 CD33 pStat5 CD123 pAkt pStat1 pSHP2 pZap70 pStat3 BC4 CD14 pSlp76 BC5 pBtk pPlcg2 pErk BC6 pLat IgM pS6 HLA-DR BC7 CD7 DNA-1 DNA-2 group_id patient_id sample_id population_id

4 Example workflow

We demonstrate the use of the HDCytoData package with a short example analysis workflow for one of the data sets.

4.1 Load data

We load the data in SummarizedExperiment format for the example workflow.

# Load data set in 'SummarizedExperiment' format
d_SE <- Bodenmiller_BCR_XL_SE()
## snapshotDate(): 2018-04-27
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1119'
# Inspect the object
d_SE
## class: SummarizedExperiment 
## dim: 172791 35 
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(4): group_id patient_id sample_id population_id
## colnames(35): Time Cell_length ... DNA-1 DNA-2
## colData names(3): channel_name marker_name marker_class
assay(d_SE)[1:6, 1:6]
##       Time Cell_length        CD3     CD45       BC1        BC2
## [1,] 33073          30 120.823265 454.6009  576.8983 10.0057297
## [2,] 36963          35 135.106171 624.6824  564.6299  5.5991135
## [3,] 37892          30  -1.664619 601.0125 3077.2668  1.7105789
## [4,] 41345          58 115.290245 820.7125 6088.5967 22.5641403
## [5,] 42475          35  14.373802 326.6405 4606.6929 -0.6584854
## [6,] 44620          31  37.737877 557.0137 4854.1519 -0.4517288
rowData(d_SE)
## DataFrame with 172791 rows and 4 columns
##         group_id patient_id          sample_id population_id
##         <factor>   <factor>           <factor>      <factor>
## 1         BCR-XL   patient1    patient1_BCR-XL   CD4 T-cells
## 2         BCR-XL   patient1    patient1_BCR-XL   CD4 T-cells
## 3         BCR-XL   patient1    patient1_BCR-XL      NK cells
## 4         BCR-XL   patient1    patient1_BCR-XL   CD4 T-cells
## 5         BCR-XL   patient1    patient1_BCR-XL   CD8 T-cells
## ...          ...        ...                ...           ...
## 172787 Reference   patient8 patient8_Reference   CD8 T-cells
## 172788 Reference   patient8 patient8_Reference   CD4 T-cells
## 172789 Reference   patient8 patient8_Reference   CD4 T-cells
## 172790 Reference   patient8 patient8_Reference   CD4 T-cells
## 172791 Reference   patient8 patient8_Reference   CD8 T-cells
colData(d_SE)
## DataFrame with 35 rows and 3 columns
##                channel_name marker_name marker_class
##                 <character> <character>     <factor>
## Time                   Time        Time         none
## Cell_length     Cell_length Cell_length         none
## CD3          CD3(110:114)Dd         CD3         type
## CD45          CD45(In115)Dd        CD45         type
## BC1            BC1(La139)Dd         BC1         none
## ...                     ...         ...          ...
## HLA-DR      HLA-DR(Yb174)Dd      HLA-DR         type
## BC7            BC7(Lu175)Dd         BC7         none
## CD7            CD7(Yb176)Dd         CD7         type
## DNA-1        DNA-1(Ir191)Dd       DNA-1         none
## DNA-2        DNA-2(Ir193)Dd       DNA-2         none

4.2 Transform data

Flow and mass cytometry data should be transformed before performing any analysis. For mass cytometry (CyTOF), a commonly used standard transformation is the arcsinh with parameter cofactor = 5. (For flow cytometry, cofactor = 150 can be used.) This brings the marker expression profiles closer to normal distributions, which improves clustering performance and visualizations.

We apply the arcsinh transform with cofactor = 5 to all protein marker columns.

# Apply transformation to marker columns
cofactor <- 5
cols <- colData(d_SE)$marker_class != "none"
assay(d_SE)[, cols] <- asinh(assay(d_SE)[, cols] / cofactor)

summary(assay(d_SE)[, 1:10])
##       Time          Cell_length         CD3               CD45       
##  Min.   :    156   Min.   :11.00   Min.   :-3.2073   Min.   :0.5989  
##  1st Qu.: 316454   1st Qu.:29.00   1st Qu.: 0.6857   1st Qu.:4.8019  
##  Median : 830931   Median :35.00   Median : 2.4079   Median :5.2392  
##  Mean   :1006144   Mean   :36.36   Mean   : 2.0589   Mean   :5.1288  
##  3rd Qu.:1715812   3rd Qu.:43.00   3rd Qu.: 3.3871   3rd Qu.:5.6240  
##  Max.   :2399632   Max.   :68.00   Max.   : 8.2302   Max.   :7.6076  
##       BC1                 BC2                pNFkB        
##  Min.   : -100.880   Min.   : -100.409   Min.   :-2.5389  
##  1st Qu.:    1.029   1st Qu.:    0.128   1st Qu.: 0.6641  
##  Median :   30.826   Median :  566.548   Median : 1.7580  
##  Mean   :  980.926   Mean   : 1340.369   Mean   : 1.6895  
##  3rd Qu.: 1817.560   3rd Qu.: 2359.330   3rd Qu.: 2.5995  
##  Max.   :14304.919   Max.   :15963.296   Max.   : 5.9574  
##       pp38               CD4                BC3           
##  Min.   :-3.32423   Min.   :-2.52144   Min.   : -100.060  
##  1st Qu.:-0.10257   1st Qu.:-0.07151   1st Qu.:   -0.402  
##  Median : 0.05176   Median : 0.86322   Median :  292.778  
##  Mean   : 0.57622   Mean   : 1.46421   Mean   : 1088.787  
##  3rd Qu.: 1.11095   3rd Qu.: 3.02456   3rd Qu.: 1945.990  
##  Max.   : 7.81873   Max.   : 6.25088   Max.   :14317.106

4.3 Clustering

We perform clustering to define cell populations, and compare the clustering results with the reference cell population labels. We use the FlowSOM clustering algorithm (Van Gassen et al., 2015), available from Bioconductor.

We use t-SNE plots to visualize the clustering results. Note that the t-SNE plots show only a subset of cells, to speed up runtime.

Since we are interested in cell populations (not cell states), we use only ‘cell type’ markers for the clustering and t-SNE plots.

suppressPackageStartupMessages(library(FlowSOM))
suppressPackageStartupMessages(library(Rtsne))
suppressPackageStartupMessages(library(ggplot2))


# --------------
# Run clustering
# --------------

d_FlowSOM <- as.matrix(assay(d_SE)[, colData(d_SE)$marker_class == "type"])
d_FlowSOM <- flowFrame(d_FlowSOM)

set.seed(123)
out <- ReadInput(d_FlowSOM, transform = FALSE, scale = FALSE)
out <- BuildSOM(out, colsToUse = NULL)
## Building SOM
## Mapping data to SOM
labels_pre <- out$map$mapping[, 1]

# number of meta-clusters
k <- 20
out <- metaClustering_consensus(out$map$codes, k = k, seed = 123)

labels <- out[labels_pre]

# check meta-cluster labels
table(labels)
## labels
##     1     2     3     4     5     6     7     8     9    10    11    12 
##  2550 38400 17831  3070  2542  2978  3007  2768  7088  1164   716 23829 
##    13    14    15    16    17    18    19    20 
## 13409 10686  2341  1770 24726  3025  4312  6579
# ------------------------------
# t-SNE plot with cluster labels
# ------------------------------

d_Rtsne <- as.matrix(assay(d_SE)[, colData(d_SE)$marker_class == "type"])
colnames(d_Rtsne) <- colData(d_SE)$marker_name[colData(d_SE)$marker_class == "type"]

# subsampling
n_sub <- 1000
set.seed(123)
ix <- sample(seq_along(labels), n_sub)

d_Rtsne <- d_Rtsne[ix, ]
labels <- labels[ix]

# remove any duplicate rows (required by Rtsne)
dups <- duplicated(d_Rtsne)

d_Rtnse <- d_Rtsne[!dups, ]
labels <- labels[!dups]

# run Rtsne
set.seed(123)
out_Rtsne <- Rtsne(d_Rtsne, pca = FALSE, verbose = TRUE)
## Read the 1000 x 10 data matrix successfully!
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 1000
## Done in 0.30 seconds (sparsity = 0.125964)!
## Learning embedding...
## Iteration 50: error is 69.531932 (50 iterations in 0.67 seconds)
## Iteration 100: error is 65.737428 (50 iterations in 0.52 seconds)
## Iteration 150: error is 65.196893 (50 iterations in 0.39 seconds)
## Iteration 200: error is 65.197083 (50 iterations in 0.51 seconds)
## Iteration 250: error is 65.197058 (50 iterations in 0.42 seconds)
## Iteration 300: error is 1.243463 (50 iterations in 0.35 seconds)
## Iteration 350: error is 1.113915 (50 iterations in 0.37 seconds)
## Iteration 400: error is 1.080398 (50 iterations in 0.38 seconds)
## Iteration 450: error is 1.067168 (50 iterations in 0.39 seconds)
## Iteration 500: error is 1.058012 (50 iterations in 0.37 seconds)
## Iteration 550: error is 1.051614 (50 iterations in 0.35 seconds)
## Iteration 600: error is 1.047412 (50 iterations in 0.34 seconds)
## Iteration 650: error is 1.043370 (50 iterations in 0.37 seconds)
## Iteration 700: error is 1.041953 (50 iterations in 0.36 seconds)
## Iteration 750: error is 1.039327 (50 iterations in 0.40 seconds)
## Iteration 800: error is 1.037391 (50 iterations in 0.36 seconds)
## Iteration 850: error is 1.036243 (50 iterations in 0.37 seconds)
## Iteration 900: error is 1.034768 (50 iterations in 0.35 seconds)
## Iteration 950: error is 1.033678 (50 iterations in 0.36 seconds)
## Iteration 1000: error is 1.032335 (50 iterations in 0.40 seconds)
## Fitting performed in 8.02 seconds.
d_plot <- as.data.frame(out_Rtsne$Y)
colnames(d_plot) <- c("tSNE_1", "tSNE_2")
d_plot$cluster <- as.factor(labels)

# create plot
ggplot(d_plot, aes(x = tSNE_1, y = tSNE_2, color = cluster)) +
  geom_point() +
  ggtitle("t-SNE plot: clustering") +
  xlab("t-SNE dimension 1") +
  ylab("t-SNE dimension 2") +
  theme_classic()

# -------------------------------------------
# t-SNE plot with reference population labels
# -------------------------------------------

d_plot$population <- rowData(d_SE)$population_id[ix][!dups]

# create plot
ggplot(d_plot, aes(x = tSNE_1, y = tSNE_2, color = population)) +
  geom_point() +
  ggtitle("t-SNE plot: reference populations") +
  xlab("t-SNE dimension 1") +
  ylab("t-SNE dimension 2") +
  theme_classic()

4.4 Data exploration using ‘iSEE’ package

Additional visualizations to explore the data can be generated using the iSEE (“Interactive SummarizedExperiment Explorer”) package, available from Bioconductor (Soneson, Lun, Marini, and Rue-Albrecht, 2018), which provides a Shiny-based graphical user interface for exploring single-cell data sets stored in SummarizedExperiment objects.

For more details on exploring data with iSEE, see the iSEE package vignette.