1 Introduction

In this vignette we describe the basic usage of the DuoClustering2018 package: how to retrieve data sets and clustering results, and how to construct various plots summarizing the performance of different methods across several data sets.

2 Load the necessary packages

suppressPackageStartupMessages({
  library(ExperimentHub)
  library(SingleCellExperiment)
  library(DuoClustering2018)
  library(plyr)
})

3 Retrieve a data set

The clustering evaluation (Duò, Robinson, and Soneson 2018) is based on 12 data sets (9 real and 3 simulated), which are all provided via ExperimentHub and retrievable via this package. We include the full data sets (after quality filtering of cells and removal of genes with zero counts across all cells) as well as three filtered versions of each data set (by expression, variability and dropout pattern, respectively), each containing 10% of the genes in the full data set.

To get an overview, we can list all records from this package that are available in ExperimentHub:

eh <- ExperimentHub()
## snapshotDate(): 2018-10-31
query(eh, "DuoClustering2018")
## ExperimentHub with 122 records
## # snapshotDate(): 2018-10-31 
## # $dataprovider: Robinson group (UZH), 10x Genomics, Zheng et al (2017), ...
## # $species: Homo sapiens, Mus musculus, NA
## # $rdataclass: data.frame, SingleCellExperiment, list
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH1499"]]' 
## 
##            title                                               
##   EH1499 | duo_clustering_all_parameter_settings_v1            
##   EH1500 | sce_full_Koh                                        
##   EH1501 | sce_filteredExpr10_Koh                              
##   EH1502 | sce_filteredHVG10_Koh                               
##   EH1503 | sce_filteredM3Drop10_Koh                            
##   ...      ...                                                 
##   EH1651 | clustering_summary_filteredHVG10_SimKumar4hard_v2   
##   EH1652 | clustering_summary_filteredM3Drop10_SimKumar4hard_v2
##   EH1653 | clustering_summary_filteredExpr10_SimKumar8hard_v2  
##   EH1654 | clustering_summary_filteredHVG10_SimKumar8hard_v2   
##   EH1655 | clustering_summary_filteredM3Drop10_SimKumar8hard_v2

The records whose names start with sce_ are the (filtered or unfiltered) data sets, stored as SingleCellExperiment objects. The records whose names start with clustering_summary_ are data.frame objects containing the clustering results for each of the filtered data sets. Finally, the duo_clustering_all_parameter_settings object contains the parameter settings we used for all the clustering methods. For clustering summaries and parameter settings, the version suffix (e.g., _v2) corresponds to the version of the publication.
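The naming convention described above can be used to sort record titles programmatically. The following is a minimal sketch in base R, using a few titles copied from the listing above; the variable names (`titles`, `record_type`, `version`, `parsed`) are illustrative and not part of the package.

```r
# A few record titles copied from the ExperimentHub listing above.
titles <- c(
  "duo_clustering_all_parameter_settings_v1",
  "sce_full_Koh",
  "sce_filteredExpr10_Koh",
  "clustering_summary_filteredHVG10_SimKumar4hard_v2"
)

# Classify each record by its name prefix.
record_type <- ifelse(
  startsWith(titles, "sce_"), "data set",
  ifelse(startsWith(titles, "clustering_summary_"), "clustering summary",
         "parameter settings")
)

# Extract the trailing version tag (e.g. "v2"); NA when the title has none.
has_version <- grepl("_v[0-9]+$", titles)
version <- ifelse(has_version,
                  sub("^.*_(v[0-9]+)$", "\\1", titles),
                  NA_character_)

parsed <- data.frame(title = titles, type = record_type, version = version,
                     stringsAsFactors = FALSE)
print(parsed)
```

Only the clustering summaries and the parameter settings carry a version tag, so the data set records get `NA` in the `version` column.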

The records can be retrieved using their ExperimentHub ID, e.g.:

eh[["EH1500"]]
## see ?DuoClustering2018 and browseVignettes('DuoClustering2018') for documentation
## downloading 0 resources
## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1500'
## class: SingleCellExperiment 
## dim: 48981 531 
## metadata(1): log.exprs.offset
## assays(3): counts logcounts normcounts
## rownames(48981): ENSG00000000003.14 ENSG00000000005.5 ...
##   ENSG00000283122.1 ENSG00000283124.1
## rowData names(8): is_feature_control mean_counts ... total_counts
##   log10_total_counts
## colnames(531): SRR3952323 SRR3952325 ... SRR3952970 SRR3952971
## colData names(14): Run LibraryName ... libsize.drop feature.drop
## reducedDimNames(2): PCA TSNE
## spikeNames(0):

Alternatively, the shortcut functions provided by this package can be used:

sce_filteredExpr10_Koh()
## snapshotDate(): 2018-10-31
## see ?DuoClustering2018 and browseVignettes('DuoClustering2018') for documentation
## downloading 0 resources
## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1501'
## class: SingleCellExperiment 
## dim: 4898 531 
## metadata(1): log.exprs.offset
## assays(3): counts logcounts normcounts
## rownames(4898): ENSG00000198804.2 ENSG00000210082.2 ...
##   ENSG00000072134.15 ENSG00000090061.17
## rowData names(8): is_feature_control mean_counts ... total_counts
##   log10_total_counts
## colnames(531): SRR3952323 SRR3952325 ... SRR3952970 SRR3952971
## colData names(14): Run LibraryName ... pct_counts_top_500_features
##   is_cell_control
## reducedDimNames(2): PCA TSNE
## spikeNames(0):

4 Read a set of clustering results

For each included data set, we have applied a range of clustering methods (see the run_clustering vignette for more details on how this was done, and how to apply additional methods). As mentioned above, the results of these clusterings are also available from ExperimentHub, and can be loaded either by their ExperimentHub ID or using the provided shortcut functions, as above. For simplicity, the results of all methods for a given data set are combined into a single object. As an illustration, we load the clustering summaries for two different data sets (Koh and Zhengmix4eq), each with two different gene filterings (Expr10 and HVG10):

res <- plyr::rbind.fill(
  clustering_summary_filteredExpr10_Koh_v2(),
  clustering_summary_filteredHVG10_Koh_v2(),
  clustering_summary_filteredExpr10_Zhengmix4eq_v2(),
  clustering_summary_filteredHVG10_Zhengmix4eq_v2()
)
## snapshotDate(): 2018-10-31
## see ?DuoClustering2018 and browseVignettes('DuoClustering2018') for documentation
## downloading 0 resources
## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1620'
## snapshotDate(): 2018-10-31
## see ?DuoClustering2018 and browseVignettes('DuoClustering2018') for documentation
## downloading 0 resources
## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1621'
## snapshotDate(): 2018-10-31
## see ?DuoClustering2018 and browseVignettes('DuoClustering2018') for documentation
## downloading 0 resources
## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1638'
## snapshotDate(): 2018-10-31
## see ?DuoClustering2018 and browseVignettes('DuoClustering2018') for documentation
## downloading 0 resources
## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1639'
dim(res)
## [1] 5625885      10

The resulting data.frame contains 10 columns, with one row per cell for each clustering run:

head(res)
##                  dataset    method       cell run k resolution cluster
## 1 sce_filteredExpr10_Koh PCAKmeans SRR3952323   1 2         NA       1
## 2 sce_filteredExpr10_Koh PCAKmeans SRR3952325   1 2         NA       1
## 3 sce_filteredExpr10_Koh PCAKmeans SRR3952326   1 2         NA       1
## 4 sce_filteredExpr10_Koh PCAKmeans SRR3952327   1 2         NA       1
## 5 sce_filteredExpr10_Koh PCAKmeans SRR3952328   1 2         NA       1
## 6 sce_filteredExpr10_Koh PCAKmeans SRR3952329   1 2         NA       1
##   trueclass est_k elapsed
## 1    H7hESC    NA  14.318
## 2    H7hESC    NA  14.318
## 3    H7hESC    NA  14.318
## 4    H7hESC    NA  14.318
## 5    H7hESC    NA  14.318
## 6    H7hESC    NA  14.318
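To illustrate how a table in this long format can be used, the sketch below builds a small toy data.frame with a subset of the columns shown above, extracts a single clustering, and cross-tabulates the estimated clusters against the true classes. All values in `toy` are made up for illustration and are not taken from ExperimentHub.

```r
# Toy stand-in for `res`; the values are invented for illustration only.
toy <- data.frame(
  dataset   = "sce_filteredExpr10_Koh",
  method    = rep(c("PCAKmeans", "PCAHC"), each = 6),
  cell      = rep(paste0("cell", 1:6), times = 2),
  run       = 1,
  k         = 2,
  cluster   = c(1, 1, 1, 2, 2, 2,   1, 1, 2, 2, 2, 2),
  trueclass = rep(c("A", "A", "A", "B", "B", "B"), times = 2),
  stringsAsFactors = FALSE
)

# Pick one clustering, i.e. a single method/run/k combination.
one <- subset(toy, method == "PCAKmeans" & run == 1 & k == 2)

# Cross-tabulate estimated clusters against the true classes.
tab <- table(cluster = one$cluster, trueclass = one$trueclass)
print(tab)

# Count the distinct clusterings (dataset/method/run/k combinations).
n_clusterings <- nrow(unique(toy[, c("dataset", "method", "run", "k")]))
print(n_clusterings)
```

The same pattern should apply to the real `res` object, although with several million rows it may be worth subsetting by data set and method first.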

5 Define consistent method colors

For some of the plots generated below, the points will be colored according to the clustering method. We can enforce a consistent set of colors for the methods by defining a named vector of colors to use for all plots.

method_colors <- c(
  CIDR = "#332288", FlowSOM = "#6699CC", PCAHC = "#88CCEE",
  PCAKmeans = "#44AA99", pcaReduce = "#117733",
  RtsneKmeans = "#999933", Seurat = "#DDCC77", SC3svm = "#661100",
  SC3 = "#CC6677", TSCAN = "grey34", ascend = "orange", SAFE = "black",
  monocle = "red", RaceID2 = "blue"
)
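Because the colors are looked up by name, it can be useful to check that every method in a results table has an entry in the vector. The sketch below does this in base R; `methods_in_results` is a hypothetical stand-in for something like `unique(res$method)`, and "newMethod" is an invented name used to show the missing-color case.

```r
# Copy of the named color vector defined above.
method_colors <- c(
  CIDR = "#332288", FlowSOM = "#6699CC", PCAHC = "#88CCEE",
  PCAKmeans = "#44AA99", pcaReduce = "#117733",
  RtsneKmeans = "#999933", Seurat = "#DDCC77", SC3svm = "#661100",
  SC3 = "#CC6677", TSCAN = "grey34", ascend = "orange", SAFE = "black",
  monocle = "red", RaceID2 = "blue"
)

# Hypothetical set of methods found in a results table.
methods_in_results <- c("PCAKmeans", "SC3", "Seurat", "newMethod")

# Methods that have no color assigned.
missing_colors <- setdiff(methods_in_results, names(method_colors))
if (length(missing_colors) > 0) {
  message("No color defined for: ", paste(missing_colors, collapse = ", "))
}

# Colors for the methods that are present, in the order of method_colors.
used_colors <- method_colors[intersect(names(method_colors), methods_in_results)]
print(used_colors)
```

Keeping the order of `method_colors` rather than the order of the results means that the color assignment does not change with the subset of methods being plotted.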

6 Plot

Each plotting function described below returns a list of ggplot objects. These can be plotted directly, or further modified if desired.

6.1 Performance

The plot_performance() function generates plots related to the performance of the clustering methods. We quantify performance using the adjusted Rand index (ARI) (Hubert and Arabie 1985), which compares the obtained clustering to the true clusters. As we noted in the publication (Duò, Robinson, and Soneson 2018), defining a true partitioning of the cells is difficult, since cells can often be grouped in several different, but still interpretable, ways. We refer to our paper for more information on how the true clusters were defined for each of the data sets.

perf <- plot_performance(res, method_colors = method_colors)
names(perf)
## [1] "median_ari_vs_k"           "scatter_time_vs_ari_truek"
## [3] "median_ari_heatmap_truek"  "median_ari_heatmap_bestk" 
## [5] "median_ari_heatmap_estk"
perf$median_ari_vs_k
## Warning: Removed 4 rows containing missing values (geom_path).
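For readers unfamiliar with the ARI, the sketch below computes it in base R from the contingency table of two label vectors, following the Hubert and Arabie formulation. It is meant as an illustration of the measure and is not necessarily the implementation used inside plot_performance(); the function name `adjusted_rand_index` and the toy label vectors are assumptions made for this example.

```r
# Adjusted Rand index between two label vectors of equal length.
# Illustrative sketch; returns NaN when both partitions are trivial
# (e.g. all cells in a single cluster in both vectors).
adjusted_rand_index <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)

  # Number of cell pairs placed together in both partitions.
  sum_cells <- sum(choose(tab, 2))
  # Number of cell pairs placed together within each partition.
  sum_rows <- sum(choose(rowSums(tab), 2))
  sum_cols <- sum(choose(colSums(tab), 2))

  # Expected agreement under random labeling, and its maximum value.
  expected <- sum_rows * sum_cols / choose(length(x), 2)
  maximum <- (sum_rows + sum_cols) / 2

  (sum_cells - expected) / (maximum - expected)
}

# Toy labels, invented for illustration.
truth     <- c("A", "A", "A", "B", "B", "B")
perfect   <- c(2, 2, 2, 1, 1, 1)   # same partition, different label names
imperfect <- c(1, 1, 2, 2, 2, 2)   # one cell of class A assigned with class B

ari_perfect <- adjusted_rand_index(perfect, truth)
ari_imperfect <- adjusted_rand_index(imperfect, truth)
print(c(perfect = ari_perfect, imperfect = ari_imperfect))
```

The index depends only on which cells are grouped together, not on the cluster labels themselves, so the relabeled partition `perfect` scores 1, while `imperfect` scores lower.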