Contents

1 Introduction

Simulated data sets with known ground truths are often used for developing and comparing computational tools for genomic studies. However, the methods and approaches for simulating complex genomic data are rarely unified across studies. Recognizing this problem in the area of single-cell RNA-sequencing (scRNA-seq), the splatter package provides a uniform API for several “simulators” of scRNA-seq data, including the authors’ own “Splat” simulator (Zappia, Phipson, and Oshlack 2017). In the splatter package, given a set of simulation parameters, each method returns a SingleCellExperiment object of simulated scRNA-seq counts.

Using comparisons presented in (Zappia, Phipson, and Oshlack 2017), we illustrate how the SummarizedBenchmark framework can be used to perform comparisons when the output of each method is more complex than a vector of numbers (e.g. a SingleCellExperiment).

2 Building the BenchDesign

library("SummarizedBenchmark")
library("magrittr")

Parameters for the simulators implemented in splatter can either be manually specified or estimated using existing data. Here, we use RSEM counts for a subset of high coverage samples in the fluidigm data set included in the scRNAseq package. The data is made available as a SummarizedExperiment object.

library("splatter")
library("scRNAseq")
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationDbi'
fluidigm <- ReprocessedFluidigmData()
se <- fluidigm[, colData(fluidigm)[, "Coverage_Type"] == "High"]
assays(se) <- assays(se)["rsem_counts"]
assayNames(se) <- "counts"

For the purposes of this vignette, we only use a subset of the samples and genes.

set.seed(1912)
se <- se[sample(nrow(se), 1e4), sample(ncol(se), 20)]

To make comparisons with the simulated data sets easier, we convert the SummarizedExperiment object to the SingleCellExperiment class.

sce <- as(se, "SingleCellExperiment")

Each of the simulators in the splatter package follow the [prefix]Simulate naming convention, with the corresponding parameter estimation function, [prefix]Estimate. Here, we use three methods included in the comparisons of (Zappia, Phipson, and Oshlack 2017).

bd <- BenchDesign() %>%
    addMethod(label = "splat",
              func = splatSimulate,
              params = rlang::quos(params = splatEstimate(in_data),
                                   verbose = in_verbose,
                                   seed = in_seed)) %>%
    addMethod(label = "simple",
              func = simpleSimulate,
              params = rlang::quos(params = simpleEstimate(in_data),
                                   verbose = in_verbose,
                                   seed = in_seed)) %>%
    addMethod(label = "lun",
              func = lunSimulate,
              params = rlang::quos(params = lunEstimate(in_data),
                                   verbose = in_verbose,
                                   seed = in_seed))

Each simulator returns a single SingleCellExperiment object containing the simulated scRNA-seq counts. However, to fit the SummarizedBenchmark structure, each method in the BenchDesign must return a vector or a list. By default, if methods return objects which cannot be easily coerced to a matrix structure, the outputs from each assay is captured and combined using rbind. To prevent the warnings from printing, the user may specify post = list for all methods. However, note that this will result in all method output being wrapped in a list.

3 Running the Benchmark Experiment

Using the "counts" assay of the fluidigm data set as input, we generate simulated data with the three methods.

fluidigm_dat <- list(in_data = assay(sce, "counts"),
                     in_verbose = FALSE,
                     in_seed = 19120128)
sb <- buildBench(bd, data = fluidigm_dat) 
## Warning in FUN(X[[i]], ...): 
## Method outputs could not be reduced to matrix.
sb
## class: SummarizedBenchmark 
## dim: 1 3 
## metadata(1): sessions
## assays(1): default
## rownames: NULL
## rowData names(1): default
## colnames(3): splat simple lun
## colData names(7): func.pkg func.pkg.vers ... param.seed session.idx

The simulated data sets are returned as a single row in the assay of the SummarizedBenchmark object, with each column containing a list with a single SingleCellExperiment object.

assay(sb)
##      splat                                                                          
## [1,] <S4 class 'SingleCellExperiment' [package "SingleCellExperiment"] with 9 slots>
##      simple                                                                         
## [1,] <S4 class 'SingleCellExperiment' [package "SingleCellExperiment"] with 9 slots>
##      lun                                                                            
## [1,] <S4 class 'SingleCellExperiment' [package "SingleCellExperiment"] with 9 slots>
sapply(assay(sb), class)
## [1] "SingleCellExperiment" "SingleCellExperiment" "SingleCellExperiment"

4 Comparing the Results

Now that we have our set of simulated data sets, we can compare the behavior of each simulator. Fortunately, the splatter package includes two useful functions for comparing SingleCellExperiment objects (compareSCEs and diffSCEs). The assay of the SummarizedBenchmark can be passed directly to these functions. We also concatenate the original fluidigm data set, sce, with the simulated data sets for comparison.

res_compare <- compareSCEs(c(ref = sce, assay(sb)[1, ]))
res_diff <- diffSCEs(c(ref = sce, assay(sb)[1, ]), ref = "ref")

While these functions produce several metrics and plots, we only include two for illustration. More details on the output of these functions can be found in the documentation of the splatter package.

res_compare$Plots$MeanVar

res_diff$Plots$MeanVar