1 Overview

The HDCytoData data package contains a set of publicly available high-dimensional flow cytometry and mass cytometry (CyTOF) datasets, provided as SummarizedExperiment and flowSet Bioconductor objects. The data objects are hosted on the Bioconductor ExperimentHub web resource.

The objects contain the cell-level expression values, as well as row and column metadata, including sample IDs, group IDs, true cell population labels or cluster labels (where available), channel names, protein marker names, and protein marker classes (cell type or cell state).

These datasets have been used for benchmarking purposes in our previous work and publications, e.g. to benchmark clustering algorithms or methods for differential analysis. They are provided here in the SummarizedExperiment and flowSet formats to make them easier to access.

2 Datasets

The datasets in the package can be grouped into those useful for benchmarking either (i) clustering algorithms or (ii) methods for differential analysis.

(For more datasets, see the updated development version of the HDCytoData package, available in the development version of Bioconductor.)

Additional details on each dataset are included in the help files for the datasets. For each dataset, this includes a description of the dataset (biological context, number of samples, number of cells, number of manually gated cell populations, number and classes of protein markers, etc.), as well as an explanation of the object structures, and references and raw data sources.

The help files can be accessed by the dataset names, e.g. ?Bodenmiller_BCR_XL.

3 How to load data

This section shows how to load the datasets, using one of the datasets (Bodenmiller_BCR_XL) as an example.

The datasets can be loaded either with named functions referring directly to the object names, or by using the ExperimentHub interface. Both methods are demonstrated below.

See the help files (e.g. ?Bodenmiller_BCR_XL) for details about the structure of the SummarizedExperiment or flowSet objects.

Load the datasets using named functions:

suppressPackageStartupMessages(library(HDCytoData))

# Load 'SummarizedExperiment' object using named function
Bodenmiller_BCR_XL_SE()
## class: SummarizedExperiment 
## dim: 172791 35 
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(4): group_id patient_id sample_id population_id
## colnames(35): Time Cell_length ... DNA-1 DNA-2
## colData names(3): channel_name marker_name marker_class
# Load 'flowSet' object using named function
Bodenmiller_BCR_XL_flowSet()
## A flowSet with 16 experiments.
## 
##   column names:
##   Time Cell_length CD3(110:114)Dd CD45(In115)Dd BC1(La139)Dd BC2(Pr141)Dd pNFkB(Nd142)Dd pp38(Nd144)Dd CD4(Nd145)Dd BC3(Nd146)Dd CD20(Sm147)Dd CD33(Nd148)Dd pStat5(Nd150)Dd CD123(Eu151)Dd pAkt(Sm152)Dd pStat1(Eu153)Dd pSHP2(Sm154)Dd pZap70(Gd156)Dd pStat3(Gd158)Dd BC4(Tb159)Dd CD14(Gd160)Dd pSlp76(Dy164)Dd BC5(Ho165)Dd pBtk(Er166)Dd pPlcg2(Er167)Dd pErk(Er168)Dd BC6(Tm169)Dd pLat(Er170)Dd IgM(Yb171)Dd pS6(Yb172)Dd HLA-DR(Yb174)Dd BC7(Lu175)Dd CD7(Yb176)Dd DNA-1(Ir191)Dd DNA-2(Ir193)Dd group_id patient_id sample_id population_id

Alternatively, load the datasets using the ExperimentHub interface:

(Note: this option is currently disabled in the release version 3.8 of Bioconductor, since the ExperimentHub IDs for the HDCytoData datasets have been re-numbered. To use this option, use the development version 3.9 of Bioconductor instead.)

# Create an ExperimentHub instance
library(ExperimentHub)
ehub <- ExperimentHub()
## snapshotDate(): 2018-10-30
# Query ExperimentHub instance to find datasets
query(ehub, "HDCytoData")
## ExperimentHub with 2 records
## # snapshotDate(): 2018-10-30 
## # $dataprovider: NA
## # $species: Homo sapiens
## # $rdataclass: SummarizedExperiment, flowSet
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH1119"]]' 
## 
##            title                     
##   EH1119 | Bodenmiller_BCR_XL_SE     
##   EH1120 | Bodenmiller_BCR_XL_flowSet
# Load 'SummarizedExperiment' object using index of dataset
# (commented out here; see the note above about the re-numbered ExperimentHub IDs)
#ehub[["EH2254"]]

# Load 'flowSet' object using index of dataset
#ehub[["EH2255"]]

4 Using the data

Once the datasets have been loaded from ExperimentHub, they can be used as normal within an R session. For example, using the SummarizedExperiment form of the dataset loaded above:

# Load dataset in 'SummarizedExperiment' format
d_SE <- Bodenmiller_BCR_XL_SE()

# Inspect the object
d_SE
## class: SummarizedExperiment 
## dim: 172791 35 
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(4): group_id patient_id sample_id population_id
## colnames(35): Time Cell_length ... DNA-1 DNA-2
## colData names(3): channel_name marker_name marker_class
assay(d_SE)[1:6, 1:6]
##       Time Cell_length        CD3     CD45       BC1        BC2
## [1,] 33073          30 120.823265 454.6009  576.8983 10.0057297
## [2,] 36963          35 135.106171 624.6824  564.6299  5.5991135
## [3,] 37892          30  -1.664619 601.0125 3077.2668  1.7105789
## [4,] 41345          58 115.290245 820.7125 6088.5967 22.5641403
## [5,] 42475          35  14.373802 326.6405 4606.6929 -0.6584854
## [6,] 44620          31  37.737877 557.0137 4854.1519 -0.4517288
rowData(d_SE)
## DataFrame with 172791 rows and 4 columns
##         group_id patient_id          sample_id population_id
##         <factor>   <factor>           <factor>      <factor>
## 1         BCR-XL   patient1    patient1_BCR-XL   CD4 T-cells
## 2         BCR-XL   patient1    patient1_BCR-XL   CD4 T-cells
## 3         BCR-XL   patient1    patient1_BCR-XL      NK cells
## 4         BCR-XL   patient1    patient1_BCR-XL   CD4 T-cells
## 5         BCR-XL   patient1    patient1_BCR-XL   CD8 T-cells
## ...          ...        ...                ...           ...
## 172787 Reference   patient8 patient8_Reference   CD8 T-cells
## 172788 Reference   patient8 patient8_Reference   CD4 T-cells
## 172789 Reference   patient8 patient8_Reference   CD4 T-cells
## 172790 Reference   patient8 patient8_Reference   CD4 T-cells
## 172791 Reference   patient8 patient8_Reference   CD8 T-cells
colData(d_SE)
## DataFrame with 35 rows and 3 columns
##                channel_name marker_name marker_class
##                 <character> <character>     <factor>
## Time                   Time        Time         none
## Cell_length     Cell_length Cell_length         none
## CD3          CD3(110:114)Dd         CD3         type
## CD45          CD45(In115)Dd        CD45         type
## BC1            BC1(La139)Dd         BC1         none
## ...                     ...         ...          ...
## HLA-DR      HLA-DR(Yb174)Dd      HLA-DR         type
## BC7            BC7(Lu175)Dd         BC7         none
## CD7            CD7(Yb176)Dd         CD7         type
## DNA-1        DNA-1(Ir191)Dd       DNA-1         none
## DNA-2        DNA-2(Ir193)Dd       DNA-2         none
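
The row and column metadata shown above can be used to subset the expression matrix. The following is a minimal sketch on a tiny stand-in object rather than the real dataset: the object `toy`, its values, and its marker classes are made up for illustration, and only the layout (cells in rows, markers in columns, `sample_id` in the row data, `marker_class` in the column data) follows the object printed above.

```r
suppressPackageStartupMessages(library(SummarizedExperiment))

# Tiny stand-in object with the same layout as the dataset above
# (cells in rows, markers in columns); all values are hypothetical
toy <- SummarizedExperiment(
  assays = list(exprs = matrix(1:12, nrow = 4,
                               dimnames = list(NULL, c("Time", "CD3", "pS6")))),
  rowData = DataFrame(sample_id = factor(c("s1", "s1", "s2", "s2"))),
  colData = DataFrame(marker_class = factor(c("none", "type", "state")),
                      row.names = c("Time", "CD3", "pS6"))
)

# Number of cells per sample, from the row metadata
n_cells <- table(rowData(toy)$sample_id)

# Keep only the columns annotated as cell type markers
type_cols <- colData(toy)$marker_class == "type"
exprs_type <- assay(toy, "exprs")[, type_cols, drop = FALSE]
```

The same calls should apply to `d_SE` in place of `toy`, assuming its `marker_class` column uses the values shown in the output above.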

5 Transformation of raw data

Note that flow and mass cytometry data should be transformed prior to performing any downstream analyses, such as clustering. A standard choice is the inverse hyperbolic sine (asinh) transform, asinh(x / cofactor), with the cofactor parameter equal to 5 (for mass cytometry data) or 150 (for flow cytometry data).
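
As an illustration of the transform described above, here is a minimal sketch in base R. The helper name `asinh_transform` and the small matrix `raw` are hypothetical and not part of the package; the commented lines at the end show one way this might be applied to the SummarizedExperiment object, assuming that only columns with a marker class other than "none" should be transformed.

```r
# asinh transform with a cofactor: x -> asinh(x / cofactor)
# (hypothetical helper, not provided by HDCytoData)
asinh_transform <- function(x, cofactor = 5) {
  asinh(x / cofactor)
}

# Small made-up matrix standing in for raw expression values
raw <- matrix(c(0, 5, 50, 500), nrow = 2,
              dimnames = list(NULL, c("CD3", "CD45")))

# Cofactor 5 for mass cytometry data; use 150 for flow cytometry data
transformed <- asinh_transform(raw, cofactor = 5)

# Possible application to the dataset loaded earlier (assumption: skip
# non-marker columns such as Time and Cell_length):
# cols <- colData(d_SE)$marker_class != "none"
# assay(d_SE, "exprs")[, cols] <- asinh_transform(assay(d_SE, "exprs")[, cols])
```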

6 Exploring the data

Interactive visualizations to explore the datasets can be generated from the SummarizedExperiment objects using the iSEE (“Interactive SummarizedExperiment Explorer”) package, available from Bioconductor (Soneson, Lun, Marini, and Rue-Albrecht, 2018). iSEE provides a Shiny-based graphical user interface to explore single-cell datasets stored in the SummarizedExperiment format. For more details, see the iSEE package vignettes.