HDCytoData 1.0.0
The HDCytoData
data package contains a set of publicly available high-dimensional flow cytometry and mass cytometry (CyTOF) data sets, formatted into SummarizedExperiment
and flowSet
Bioconductor object formats. The data objects are hosted on the Bioconductor ExperimentHub web resource.
The objects contain the cell-level expression values, as well as row and column meta-data, including sample IDs, group IDs, true cell population labels or cluster labels (where available), channel names, protein marker names, and protein marker classes (cell type or cell state).
These data sets have been used for benchmarking purposes in our previous work and publications, e.g. to evaluate the performance of clustering algorithms. They are provided here in the SummarizedExperiment
and flowSet
formats to make them easier to access.
Currently, the package contains the following data sets:
Additional details on each data set are included in the help files for the data sets. For each data set, this includes a description of the data set (biological context, number of samples, number of cells, number and classes of protein markers, etc.), as well as an explanation of the object structures, and references and raw data sources.
The help files can be accessed by the data set names, e.g. ?Bodenmiller_BCR_XL
.
First, we show how to load the data sets.
The data sets can be loaded either with named functions referring directly to the object names, or by using the ExperimentHub
interface. Both methods are demonstrated below, using the Bodenmiller_BCR_XL
data set as an example.
See the help file (?Bodenmiller_BCR_XL
) for details about the structure of the SummarizedExperiment
or flowSet
objects for this data set.
Load the data sets using named functions:
suppressPackageStartupMessages(library(HDCytoData))
## snapshotDate(): 2018-04-27
# Load 'SummarizedExperiment' object using named function
Bodenmiller_BCR_XL_SE()
## snapshotDate(): 2018-04-27
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache
## '/home/biocbuild//.ExperimentHub/1119'
## class: SummarizedExperiment
## dim: 172791 35
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(4): group_id patient_id sample_id population_id
## colnames(35): Time Cell_length ... DNA-1 DNA-2
## colData names(3): channel_name marker_name marker_class
# Load 'flowSet' object using named function
Bodenmiller_BCR_XL_flowSet()
## snapshotDate(): 2018-04-27
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache
## '/home/biocbuild//.ExperimentHub/1120'
## A flowSet with 16 experiments.
##
## column names:
## Time Cell_length CD3 CD45 BC1 BC2 pNFkB pp38 CD4 BC3 CD20 CD33 pStat5 CD123 pAkt pStat1 pSHP2 pZap70 pStat3 BC4 CD14 pSlp76 BC5 pBtk pPlcg2 pErk BC6 pLat IgM pS6 HLA-DR BC7 CD7 DNA-1 DNA-2 group_id patient_id sample_id population_id
Alternatively, load the data sets using the ExperimentHub
interface:
# Create an ExperimentHub instance
ehub <- ExperimentHub()
## snapshotDate(): 2018-04-27
# Query ExperimentHub instance to find data sets
query(ehub, "HDCytoData")
## ExperimentHub with 2 records
## # snapshotDate(): 2018-04-27
## # $dataprovider: NA
## # $species: Homo sapiens
## # $rdataclass: SummarizedExperiment, flowSet
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH1119"]]'
##
## title
## EH1119 | Bodenmiller_BCR_XL_SE
## EH1120 | Bodenmiller_BCR_XL_flowSet
# Load 'SummarizedExperiment' object using index of data set
ehub[["EH1119"]]
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache
## '/home/biocbuild//.ExperimentHub/1119'
## class: SummarizedExperiment
## dim: 172791 35
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(4): group_id patient_id sample_id population_id
## colnames(35): Time Cell_length ... DNA-1 DNA-2
## colData names(3): channel_name marker_name marker_class
# Load 'flowSet' object using index of data set
ehub[["EH1120"]]
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache
## '/home/biocbuild//.ExperimentHub/1120'
## A flowSet with 16 experiments.
##
## column names:
## Time Cell_length CD3 CD45 BC1 BC2 pNFkB pp38 CD4 BC3 CD20 CD33 pStat5 CD123 pAkt pStat1 pSHP2 pZap70 pStat3 BC4 CD14 pSlp76 BC5 pBtk pPlcg2 pErk BC6 pLat IgM pS6 HLA-DR BC7 CD7 DNA-1 DNA-2 group_id patient_id sample_id population_id
We demonstrate the use of the HDCytoData
package with a short example analysis workflow for one of the data sets.
We load the data in SummarizedExperiment
format for the example workflow.
# Load data set in 'SummarizedExperiment' format
d_SE <- Bodenmiller_BCR_XL_SE()
## snapshotDate(): 2018-04-27
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache
## '/home/biocbuild//.ExperimentHub/1119'
# Inspect the object
d_SE
## class: SummarizedExperiment
## dim: 172791 35
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(4): group_id patient_id sample_id population_id
## colnames(35): Time Cell_length ... DNA-1 DNA-2
## colData names(3): channel_name marker_name marker_class
assay(d_SE)[1:6, 1:6]
## Time Cell_length CD3 CD45 BC1 BC2
## [1,] 33073 30 120.823265 454.6009 576.8983 10.0057297
## [2,] 36963 35 135.106171 624.6824 564.6299 5.5991135
## [3,] 37892 30 -1.664619 601.0125 3077.2668 1.7105789
## [4,] 41345 58 115.290245 820.7125 6088.5967 22.5641403
## [5,] 42475 35 14.373802 326.6405 4606.6929 -0.6584854
## [6,] 44620 31 37.737877 557.0137 4854.1519 -0.4517288
rowData(d_SE)
## DataFrame with 172791 rows and 4 columns
## group_id patient_id sample_id population_id
## <factor> <factor> <factor> <factor>
## 1 BCR-XL patient1 patient1_BCR-XL CD4 T-cells
## 2 BCR-XL patient1 patient1_BCR-XL CD4 T-cells
## 3 BCR-XL patient1 patient1_BCR-XL NK cells
## 4 BCR-XL patient1 patient1_BCR-XL CD4 T-cells
## 5 BCR-XL patient1 patient1_BCR-XL CD8 T-cells
## ... ... ... ... ...
## 172787 Reference patient8 patient8_Reference CD8 T-cells
## 172788 Reference patient8 patient8_Reference CD4 T-cells
## 172789 Reference patient8 patient8_Reference CD4 T-cells
## 172790 Reference patient8 patient8_Reference CD4 T-cells
## 172791 Reference patient8 patient8_Reference CD8 T-cells
colData(d_SE)
## DataFrame with 35 rows and 3 columns
## channel_name marker_name marker_class
## <character> <character> <factor>
## Time Time Time none
## Cell_length Cell_length Cell_length none
## CD3 CD3(110:114)Dd CD3 type
## CD45 CD45(In115)Dd CD45 type
## BC1 BC1(La139)Dd BC1 none
## ... ... ... ...
## HLA-DR HLA-DR(Yb174)Dd HLA-DR type
## BC7 BC7(Lu175)Dd BC7 none
## CD7 CD7(Yb176)Dd CD7 type
## DNA-1 DNA-1(Ir191)Dd DNA-1 none
## DNA-2 DNA-2(Ir193)Dd DNA-2 none
Flow and mass cytometry data should be transformed before performing any analysis. For mass cytometry (CyTOF), a commonly used standard transformation is the arcsinh
with parameter cofactor = 5
. (For flow cytometry, cofactor = 150
can be used.) This brings the marker expression profiles closer to normal distributions, which improves clustering performance and visualizations.
We apply the arcsinh
transform with cofactor = 5
to all protein marker columns.
# Apply transformation to marker columns
cofactor <- 5
cols <- colData(d_SE)$marker_class != "none"
assay(d_SE)[, cols] <- asinh(assay(d_SE)[, cols] / cofactor)
summary(assay(d_SE)[, 1:10])
## Time Cell_length CD3 CD45
## Min. : 156 Min. :11.00 Min. :-3.2073 Min. :0.5989
## 1st Qu.: 316454 1st Qu.:29.00 1st Qu.: 0.6857 1st Qu.:4.8019
## Median : 830931 Median :35.00 Median : 2.4079 Median :5.2392
## Mean :1006144 Mean :36.36 Mean : 2.0589 Mean :5.1288
## 3rd Qu.:1715812 3rd Qu.:43.00 3rd Qu.: 3.3871 3rd Qu.:5.6240
## Max. :2399632 Max. :68.00 Max. : 8.2302 Max. :7.6076
## BC1 BC2 pNFkB
## Min. : -100.880 Min. : -100.409 Min. :-2.5389
## 1st Qu.: 1.029 1st Qu.: 0.128 1st Qu.: 0.6641
## Median : 30.826 Median : 566.548 Median : 1.7580
## Mean : 980.926 Mean : 1340.369 Mean : 1.6895
## 3rd Qu.: 1817.560 3rd Qu.: 2359.330 3rd Qu.: 2.5995
## Max. :14304.919 Max. :15963.296 Max. : 5.9574
## pp38 CD4 BC3
## Min. :-3.32423 Min. :-2.52144 Min. : -100.060
## 1st Qu.:-0.10257 1st Qu.:-0.07151 1st Qu.: -0.402
## Median : 0.05176 Median : 0.86322 Median : 292.778
## Mean : 0.57622 Mean : 1.46421 Mean : 1088.787
## 3rd Qu.: 1.11095 3rd Qu.: 3.02456 3rd Qu.: 1945.990
## Max. : 7.81873 Max. : 6.25088 Max. :14317.106
We perform clustering to define cell populations, and compare the clustering results with the reference cell population labels. We use the FlowSOM clustering algorithm (Van Gassen et al., 2015), available from Bioconductor.
We use t-SNE plots to visualize the clustering results. Note that the t-SNE plots show only a subset of cells, to speed up runtime.
Since we are interested in cell populations (not cell states), we use only ‘cell type’ markers for the clustering and t-SNE plots.
suppressPackageStartupMessages(library(FlowSOM))
suppressPackageStartupMessages(library(Rtsne))
suppressPackageStartupMessages(library(ggplot2))
# --------------
# Run clustering
# --------------
d_FlowSOM <- as.matrix(assay(d_SE)[, colData(d_SE)$marker_class == "type"])
d_FlowSOM <- flowFrame(d_FlowSOM)
set.seed(123)
out <- ReadInput(d_FlowSOM, transform = FALSE, scale = FALSE)
out <- BuildSOM(out, colsToUse = NULL)
## Building SOM
## Mapping data to SOM
labels_pre <- out$map$mapping[, 1]
# number of meta-clusters
k <- 20
out <- metaClustering_consensus(out$map$codes, k = k, seed = 123)
labels <- out[labels_pre]
# check meta-cluster labels
table(labels)
## labels
## 1 2 3 4 5 6 7 8 9 10 11 12
## 2550 38400 17831 3070 2542 2978 3007 2768 7088 1164 716 23829
## 13 14 15 16 17 18 19 20
## 13409 10686 2341 1770 24726 3025 4312 6579
# ------------------------------
# t-SNE plot with cluster labels
# ------------------------------
d_Rtsne <- as.matrix(assay(d_SE)[, colData(d_SE)$marker_class == "type"])
colnames(d_Rtsne) <- colData(d_SE)$marker_name[colData(d_SE)$marker_class == "type"]
# subsampling
n_sub <- 1000
set.seed(123)
ix <- sample(seq_along(labels), n_sub)
d_Rtsne <- d_Rtsne[ix, ]
labels <- labels[ix]
# remove any duplicate rows (required by Rtsne)
dups <- duplicated(d_Rtsne)
d_Rtnse <- d_Rtsne[!dups, ]
labels <- labels[!dups]
# run Rtsne
set.seed(123)
out_Rtsne <- Rtsne(d_Rtsne, pca = FALSE, verbose = TRUE)
## Read the 1000 x 10 data matrix successfully!
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
## - point 0 of 1000
## Done in 0.30 seconds (sparsity = 0.125964)!
## Learning embedding...
## Iteration 50: error is 69.531932 (50 iterations in 0.67 seconds)
## Iteration 100: error is 65.737428 (50 iterations in 0.52 seconds)
## Iteration 150: error is 65.196893 (50 iterations in 0.39 seconds)
## Iteration 200: error is 65.197083 (50 iterations in 0.51 seconds)
## Iteration 250: error is 65.197058 (50 iterations in 0.42 seconds)
## Iteration 300: error is 1.243463 (50 iterations in 0.35 seconds)
## Iteration 350: error is 1.113915 (50 iterations in 0.37 seconds)
## Iteration 400: error is 1.080398 (50 iterations in 0.38 seconds)
## Iteration 450: error is 1.067168 (50 iterations in 0.39 seconds)
## Iteration 500: error is 1.058012 (50 iterations in 0.37 seconds)
## Iteration 550: error is 1.051614 (50 iterations in 0.35 seconds)
## Iteration 600: error is 1.047412 (50 iterations in 0.34 seconds)
## Iteration 650: error is 1.043370 (50 iterations in 0.37 seconds)
## Iteration 700: error is 1.041953 (50 iterations in 0.36 seconds)
## Iteration 750: error is 1.039327 (50 iterations in 0.40 seconds)
## Iteration 800: error is 1.037391 (50 iterations in 0.36 seconds)
## Iteration 850: error is 1.036243 (50 iterations in 0.37 seconds)
## Iteration 900: error is 1.034768 (50 iterations in 0.35 seconds)
## Iteration 950: error is 1.033678 (50 iterations in 0.36 seconds)
## Iteration 1000: error is 1.032335 (50 iterations in 0.40 seconds)
## Fitting performed in 8.02 seconds.
d_plot <- as.data.frame(out_Rtsne$Y)
colnames(d_plot) <- c("tSNE_1", "tSNE_2")
d_plot$cluster <- as.factor(labels)
# create plot
ggplot(d_plot, aes(x = tSNE_1, y = tSNE_2, color = cluster)) +
geom_point() +
ggtitle("t-SNE plot: clustering") +
xlab("t-SNE dimension 1") +
ylab("t-SNE dimension 2") +
theme_classic()
# -------------------------------------------
# t-SNE plot with reference population labels
# -------------------------------------------
d_plot$population <- rowData(d_SE)$population_id[ix][!dups]
# create plot
ggplot(d_plot, aes(x = tSNE_1, y = tSNE_2, color = population)) +
geom_point() +
ggtitle("t-SNE plot: reference populations") +
xlab("t-SNE dimension 1") +
ylab("t-SNE dimension 2") +
theme_classic()