Simulating complex design scRNA-seq data with `muscat`

Helena L Crowell1,2*, Charlotte Soneson1,3, Pierre-Luc Germain1,2 and Mark D Robinson1,2

1Institute for Molecular Life Sciences, University of Zurich, Zurich, Switzerland
2Swiss Institute of Bioinformatics (SIB), Zurich, Switzerland
3Present address: Friedrich Miescher Institute Basel, Switzerland
& Swiss Institute of Bioinformatics (SIB), Basel, Switzerland

*helena.crowell@uzh.ch

May 19, 2021

Abstract

muscat: multi-sample multi-group scRNA-seq analysis tools (Crowell et al. 2019) provides a straightforward but effective simulation framework that is anchored to a labeled multi-sample multi-subpopulation scRNA-seq reference dataset, uses (non-zero-inflated) negative binomial (NB) as the canonical distribution for droplet scRNA-seq datasets, and exposes various parameters to modulate: the number of subpopulations and samples simulated, the number of cells per subpopulation (and sample), and the type and magnitude of a wide range of patterns of differential expression.

This vignette serves to provide the underlying theoretical background, to thoroughly describe the various input arguments, and to demonstrate the simulation framework’s current capabilities using some illustrative (not necessarily realistic) examples.

Package

muscat 1.6.0

Load packages
Data description
1 Simulation framework
2 Quality control
3 Method benchmarking
Session info
References

For details on the concepts presented here, consider having a look at our preprint:

Crowell HL, Soneson C*, Germain P-L*,
Calini D, Collin L, Raposo C, Malhotra D & Robinson MD:
On the discovery of population-specific state transitions from
multi-sample multi-condition single-cell RNA sequencing data.
bioRxiv 713412 (July, 2019). doi: 10.1101/713412

Load packages

library(cowplot)
library(dplyr)
library(reshape2)
library(muscat)
library(purrr)
library(scater)
library(SingleCellExperiment)

Data description

To demonstrate muscat’s simulation framework, we will use a SingleCellExperiment (SCE) containing 10x droplet-based scRNA-seq PBMC data from 8 Lupus patients obtained before and after 6h-treatment with IFN-$\beta$ (Kang et al. 2018). The complete raw data, as well as gene and cell metadata, is available through the NCBI GEO, accession number GSE96583.

1 Simulation framework

muscat’s simulation framework comprises: i) estimation of negative binomial (NB) parameters from a reference multi-subpopulation, multi-sample dataset; ii) sampling of gene and cell parameters to use for simulation; and, iii) simulation of gene expression data as NB distributions or mixtures thereof. See Fig. 1.

Let $Y = (y_{gc})\in\mathbb{N}_0^{G\times C}$ denote the count matrix of a multi-sample multi-subpopulation reference dataset with genes $\mathcal{G} = \{ g_1, \ldots, g_G \}$ and sets of cells $\mathcal{C}_{sk} = \{ c^{sk}_1, \ldots, c^{sk}_{C_{sk}} \}$ for each sample $s$ and subpopulation $k$ ($C_{sk}$ is the number of cells for sample $s$, subpopulation $k$). For each gene $g$, we fit a model to estimate sample-specific means $\beta_g^s$, for each sample $s$, and dispersion parameters $\phi_g$ using edgeR’s estimateDisp function with default parameters. Thus, we model the reference count data as NB distributed:

\[Y_{gc} \sim NB(\mu_{gc}, \phi_g)\]

for gene $g$ and cell $c$, where the mean $\mu_{gc} = \exp(\beta_{g}^{s(c)}) \cdot \lambda_c$. Here, $\beta_{g}^{s(c)}$ is the relative abundance of gene $g$ in sample $s(c)$, $\lambda_c$ is the library size (total number of counts), and $\phi_g$ is the dispersion.
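The model above can be sketched in a few lines of base R. This is only an illustration, not muscat's internal code: the values of `beta`, `lambda` and `phi` are made up, and `rnbinom` is called with `size = 1/phi`, which corresponds to a variance of $\mu + \phi\mu^2$.

```r
set.seed(1)
G <- 5; C <- 8                        # hypothetical numbers of genes and cells
s <- rep(c("s1", "s2"), each = C / 2) # sample label s(c) of each cell
# hypothetical sample-specific log-means (genes x samples)
beta <- matrix(rnorm(G * 2, mean = -8), G, 2,
    dimnames = list(paste0("g", 1:G), c("s1", "s2")))
lambda <- round(runif(C, 1e3, 5e3))   # hypothetical library sizes
phi <- rgamma(G, shape = 2, rate = 4) # hypothetical gene-wise dispersions
# mean of gene g in cell c: exp(beta[g, s(c)]) * lambda[c]
mu <- exp(beta[, s]) * rep(lambda, each = G)
# NB draws; size = 1/phi gives var = mu + phi * mu^2
y <- matrix(rnbinom(G * C, mu = mu, size = 1 / rep(phi, C)), G, C)
```

Each column of `y` is one simulated cell; cells from the same sample share the same column of `beta` and differ only in their library size.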

Figure 1: Schematic overview of muscat’s simulation framework
Given a count matrix of features by cells and, for each cell, pre-determined subpopulation identifiers as well as sample labels (0), dispersion and sample-wise means are estimated from a negative binomial distribution for each gene (for each subpopulation) (1.1); and library sizes are recorded (1.2). From this set of parameters (dispersions, means, library sizes), gene expression is sampled from a negative binomial distribution. Here, genes are selected to be “type” (subpopulation-specifically expressed; e.g., via marker genes), “state” (change in expression in a condition-specific manner) or equally expressed (relatively) across all samples (2). The result is a matrix of synthetic gene expression data (3).

For each subpopulation, we randomly assign each gene to a given differential distribution (DD) category (Korthauer et al. 2016) according to a probability vector p_dd $=(p_{EE},p_{EP},p_{DE},p_{DP},p_{DM},p_{DB})$. For each gene and subpopulation, we draw a vector of fold changes (FCs) from a Gamma distribution with shape parameter $\alpha=4$ and rate $\beta=4/\mu_\text{logFC}$, where $\mu_\text{logFC}$ is the desired average logFC across all genes and subpopulations specified via argument lfc. The direction of differential expression is randomized for each gene, with equal probability of up- and down-regulation.
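The sampling step just described might look as follows in base R. This is a sketch under stated assumptions rather than the package's implementation: the values of `p_dd`, `lfc` and `ng` are illustrative, and a single subpopulation is considered.

```r
set.seed(1)
cats <- c("ee", "ep", "de", "dp", "dm", "db")
p_dd <- c(0.5, rep(0.1, 5))  # hypothetical category probabilities
lfc <- 2                     # desired average (absolute) logFC
ng <- 1e4                    # hypothetical number of genes
# assign each gene to a DD category according to p_dd
category <- sample(cats, ng, replace = TRUE, prob = p_dd)
# Gamma(shape = 4, rate = 4 / lfc) has mean shape / rate = lfc
logFC <- rgamma(ng, shape = 4, rate = 4 / lfc)
# randomize the direction of change with equal probability
logFC <- logFC * sample(c(-1, 1), ng, replace = TRUE)
# the average magnitude is close to lfc, the signed average close to 0
c(mean_abs = mean(abs(logFC)), mean_signed = mean(logFC))
```

Because the mean of a Gamma distribution is shape divided by rate, fixing the shape at 4 and setting the rate to $4/\mu_\text{logFC}$ centres the magnitudes on the requested value.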

Next, we split the cells in a given subpopulation into two sets (representing treatment groups), $\mathcal{T}_A$ and $\mathcal{T}_B$, which are in turn split again into two sets each (representing subpopulations within the given treatment group), $\mathcal{T}_{A_1}/\mathcal{T}_{A_2}$ and $\mathcal{T}_{B_1}/\mathcal{T}_{B_2}$.

For EE genes, counts for $\mathcal{T}_A$ and $\mathcal{T}_B$ are drawn using identical means. For EP genes, we multiply the effective means for identical fractions of cells per group by the sampled FCs, i.e., cells are split such that $\dim\mathcal{T}_{A_1} = \dim\mathcal{T}_{B_1}$ and $\dim\mathcal{T}_{A_2} = \dim\mathcal{T}_{B_2}$. For DE genes, the means of one group, $A$ or $B$, are multiplied with the sampled FCs. DP genes are simulated analogously to EP genes with $\dim\mathcal{T}_{A_1} = a\cdot\dim\mathcal{T}_A$ and $\dim\mathcal{T}_{B_1} = b\cdot\dim\mathcal{T}_B$, where $a+b=1$ and $a\neq b$. For DM genes, 50% of cells from one group are simulated at $\mu\cdot\text{logFC}$. For DB genes, all cells from one group are simulated at $\mu\cdot\text{logFC}/2$, and the second group is split into equal proportions of cells simulated at $\mu$ and $\mu\cdot\text{logFC}$, respectively. See Fig. 2.

Figure 2: Schematic of the various types of differential distributions supported by muscat’s simulation framework
Differential distributions are simulated from a NB distribution or mixtures thereof, according to the definitions of random variables X, Y and Z.
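To make the mixture definitions more concrete, the following base-R sketch draws counts for a single hypothetical gene under the DE, DM and DB categories. It is not muscat's code. As an assumption of this sketch, the change is applied as a multiplicative fold change `fc = 2^logFC` wherever the text writes $\mu\cdot\text{logFC}$, and `mu`, `phi` and `n` are made-up values.

```r
set.seed(1)
n <- 1000              # hypothetical number of cells per group
mu <- 10; phi <- 0.1   # hypothetical baseline mean and dispersion
fc <- 2^2              # assumption: multiplicative fold change for logFC = 2
nb <- function(n, m) rnbinom(n, mu = m, size = 1 / phi)
# DE: all cells of one group are shifted by the fold change
de <- list(A = nb(n, mu), B = nb(n, mu * fc))
# DM: 50% of cells in one group are shifted
dm <- list(A = nb(n, mu), B = c(nb(n / 2, mu), nb(n / 2, mu * fc)))
# DB: one group at the intermediate mean, the other a 50:50 mixture
db <- list(A = nb(n, mu * fc / 2), B = c(nb(n / 2, mu), nb(n / 2, mu * fc)))
# group-wise means per category
sapply(list(de = de, dm = dm, db = db), sapply, mean)
```

Only the DE gene shows a shift in group means of the full magnitude; the DM and DB genes differ mainly in the shape of their distributions, which is why methods that compare means can struggle with them.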

1.1 `prepSim`: Preparing data for simulation

To prepare a reference SingleCellExperiment (SCE) for simulation of multi-sample multi-group scRNA-seq data, prepSim will

perform basic filtering of genes and cells
(optionally) filter for subpopulation-sample instances with a threshold number of cells to assure accurate parameter estimation
estimate cell (library sizes) and gene parameters (dispersions and sample-specific means)

Importantly, we want to introduce known changes in states across conditions; thus, only replicates from a single condition should go into the simulation. The group to be kept for simulation may be specified via group_keep, in which case samples from all other groups (sce$group_id != group_keep) will be dropped. By default (group_keep = NULL), prepSim will select the first group available as reference.

Arguments min_count, min_cells, min_genes and min_size are used to tune the filtering of genes, cells and subpopulation-instances as follows:

only genes with a count > min_count in >= min_cells will be retained
only cells with a count > 0 in >= min_genes will be retained
only subpopulation-sample instances with >= min_size cells will be retained; min_size = NULL will skip this step
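The three filtering rules can be illustrated on a toy count matrix in base R. This is a sketch of the rules as stated above, not of prepSim's internals: the thresholds, the toy data and the labels `kid`/`sid` are hypothetical, and each filter is computed on the unfiltered matrix because the order in which prepSim applies them is not specified here.

```r
set.seed(1)
# hypothetical toy count matrix (200 genes x 50 cells)
y <- matrix(rpois(200 * 50, lambda = 0.5), 200, 50)
# hypothetical subpopulation and sample labels per cell
kid <- sample(c("k1", "k2"), 50, replace = TRUE)
sid <- sample(c("s1", "s2"), 50, replace = TRUE)
# illustrative thresholds
min_count <- 1; min_cells <- 5; min_genes <- 75; min_size <- 10
# genes with a count > min_count in >= min_cells cells
keep_g <- rowSums(y > min_count) >= min_cells
# cells with a count > 0 in >= min_genes genes
keep_c <- colSums(y > 0) >= min_genes
# cells in subpopulation-sample instances with >= min_size cells
keep_ks <- ave(seq_along(kid), kid, sid, FUN = length) >= min_size
y_filt <- y[keep_g, keep_c & keep_ks, drop = FALSE]
dim(y_filt)
```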

# estimate simulation parameters
data(example_sce)
ref <- prepSim(example_sce, verbose = FALSE)
# only samples from `ctrl` group are retained
table(ref$sample_id)

## 
## ctrl101 ctrl107 
##     200     200

# cell parameters: library sizes
sub <- assay(example_sce[rownames(ref), colnames(ref)])
all.equal(exp(ref$offset), as.numeric(colSums(sub)))

## [1] "names for target but not for current"
## [2] "Mean relative difference: 0.4099568"

# gene parameters: dispersions & sample-specific means
head(rowData(ref))

## DataFrame with 6 rows and 4 columns
##                  ENSEMBL      SYMBOL                           beta      disp
##              <character> <character>                    <DataFrame> <numeric>
## ISG15    ENSG00000187608       ISG15 -7.84574:-0.24711310:-1.039480 4.6360532
## AURKAIP1 ENSG00000175756    AURKAIP1 -7.84859: 0.00768812:-1.171896 0.1784345
## MRPL20   ENSG00000242485      MRPL20 -8.31434:-0.58684477:-0.304827 0.6435324
## SSU72    ENSG00000160075       SSU72 -8.05160:-0.17119703:-0.793222 0.0363892
## RER1     ENSG00000157916        RER1 -7.75327: 0.10731331:-1.261821 0.5046166
## RPL22    ENSG00000116251       RPL22 -8.03553:-0.03357193: 0.143506 0.2023632

1.2 `simData`: Simulating complex designs

Provided with a reference SCE as returned by prepSim, a variety of simulation scenarios can be generated using the simData function, which will again return an SCE containing the following elements:

assay counts containing the simulated count data
colData columns cluster/sample/group_id containing each cell’s cluster, sample, and group ID (A or B).
metadata$gene_info containing a data.frame listing, for each gene and cluster
- the simulated DD category
- the sampled logFC; note that this will only approximate log2(sim_mean.B/sim_mean.A) for genes of the de category as other types of state changes use mixtures of NBs, and will consequently not exhibit a shift in means of the same magnitude as logFC
- the reference sim_gene from which dispersion sim_disp and sample-specific means beta.<sample_id> were used
- the simulated expression means sim_mean.A/B for each group

In the code chunk that follows, we run a simple simulation with

p_dd = c(1,0,...0), i.e., 100% of EE genes
nk = 3 subpopulations and ns = 3 replicates for each of 2 groups
ng = 1000 genes and nc = 2000 cells, resulting in 2000/2/ns/nk $\approx111$ cells for 2 groups with 3 samples each and 3 subpopulations

# simulate 100% EE genes
sim <- simData(ref, p_dd = diag(6)[1, ],
    nk = 3, ns = 3, nc = 2e3,
    ng = 1e3, force = TRUE)
# number of cells per sample and subpopulation
table(sim$sample_id, sim$cluster_id)

##            
##             cluster1 cluster2 cluster3
##   sample1.A      120      107      102
##   sample2.A       95      116      103
##   sample3.A      103      117      103
##   sample1.B      108      118      126
##   sample2.B      100      112      115
##   sample3.B      132       98      125

By default, we have drawn a random reference sample from levels(ref$sample_id) for every simulated sample in each group, resulting in an unpaired design:

metadata(sim)$ref_sids

##         A         B        
## sample1 "ctrl107" "ctrl101"
## sample2 "ctrl101" "ctrl107"
## sample3 "ctrl107" "ctrl107"

Alternatively, we can re-run the above simulation with paired = TRUE such that both groups will use the same set of reference samples, resulting in a paired design:

# simulated paired design
sim <- simData(ref, paired = TRUE, 
    nk = 3, ns = 3, nc = 2e3,
    ng = 1e3, force = TRUE)
# same set of reference samples for both groups
ref_sids <- metadata(sim)$ref_sids
all(ref_sids[, 1] == ref_sids[, 2])

## [1] TRUE
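The difference between the two designs comes down to how reference samples are assigned to simulated samples. A minimal base-R sketch of the idea follows. It is not simData's code, and it assumes that reference samples are drawn with replacement, which seems consistent with the ref_sids output shown earlier, where 3 simulated samples per group are drawn from 2 reference samples.

```r
set.seed(1)
ref_ids <- c("ctrl101", "ctrl107")  # reference samples retained by prepSim
ns <- 3                             # simulated samples per group
# unpaired: draw reference samples independently for each group
unpaired <- replicate(2, sample(ref_ids, ns, replace = TRUE))
# paired: draw once and reuse the same references for both groups
one <- sample(ref_ids, ns, replace = TRUE)
paired <- cbind(one, one)
dimnames(unpaired) <- dimnames(paired) <-
    list(paste0("sample", seq_len(ns)), c("A", "B"))
paired
```

In the paired case, sample-to-sample variability is shared between groups, so any difference between `sampleX.A` and `sampleX.B` stems from the simulated state changes and NB sampling noise alone.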

1.2.1 `p_dd`: Simulating differential distributions

Argument p_dd specifies the fraction of genes to simulate for each DD category. Its values should thus lie in $[0,1]$ and sum to 1. Expression densities for an exemplary set of genes simulated from the code below are shown in Fig. 3.

# simulate genes from all DD categories
sim <- simData(ref, p_dd = c(0.5, rep(0.1, 5)),
    nc = 2e3, ng = 1e3, force = TRUE)

We can retrieve the category assigned to each gene in each cluster from the gene_info table stored in the output SCE’s metadata:

gi <- metadata(sim)$gene_info
table(gi$category)

## 
##   ee   ep   de   dp   dm   db 
## 1026  170  192  212  230  170

Figure 3: Expression densities for an exemplary set of 3 genes per differential distribution category
Each density corresponds to one sample, lines are colored by group ID, and panels are split by gene and subpopulation.

1.2.2 `rel_lfc`: Simulating cluster-specific state changes

By default, for each gene and subpopulation, we draw a vector of fold changes (FCs) from a Gamma distribution with rate parameter $\beta\propto1/\mu_\text{logFC}$, where $\mu_\text{logFC}$ is the desired average logFC across all genes and subpopulations specified via argument lfc. This results in state changes that are of the same magnitude for each subpopulation.

Now, suppose we wanted to have a subpopulation that does not exhibit any state changes across conditions, or to vary the magnitude of changes across subpopulations. To this end, argument rel_lfc supplies a subpopulation-specific factor applied to the FCs sampled for each subpopulation. Fig. 4 demonstrates how this manifests in two-dimensional embeddings of the cells: Here, we generate a set of 3 simulations with

equal magnitude of change for all subpopulations: rel_lfc=c(1,1,1)
stronger change for one cluster: rel_lfc=c(1,1,3)
cluster-specific FC factors with no change for one cluster: rel_lfc=c(0,1,2)

Figure 4: t-SNEs of exemplary simulations demonstrating rel_lfc’s effect to induce cluster-specific state changes
Cells are colored by cluster ID (top-row) and group ID (bottom-row), respectively. From left to right: No cluster-specific changes, stronger change for cluster3, different logFC factors for all clusters with no change for cluster1.
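The effect of rel_lfc can be pictured as a column-wise rescaling of the sampled logFCs. The base-R sketch below illustrates this for the third scenario; it is an assumption about how the factors act, based on the description above, and not simData's implementation. The sizes `ng` and `nk` are made up.

```r
set.seed(1)
nk <- 3; ng <- 500; lfc <- 1   # hypothetical sizes and average logFC
rel_lfc <- c(0, 1, 2)          # cluster-specific factors
# baseline logFC magnitudes (genes x clusters)
base <- matrix(rgamma(ng * nk, shape = 4, rate = 4 / lfc), ng, nk,
    dimnames = list(NULL, paste0("cluster", seq_len(nk))))
# scale each cluster's column by its factor
scaled <- sweep(base, 2, rel_lfc, "*")
# average magnitude per cluster after scaling
colMeans(scaled)
```

A factor of 0 removes all state changes from `cluster1`, while a factor of 2 doubles the magnitude of every change in `cluster3`.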

1.2.3 `p_type`: Simulating type features

Differential state (DS) analysis, which tests for subpopulation-specific changes in expression across experimental conditions, is based on the idea that we i) use stable molecular signatures (i.e., type features) to group cells into meaningful subpopulations; and, ii) perform statistical tests on state features that are more transiently expressed and may be subject to changes in expression upon, for example, treatment or during disease.

The fraction of type features introduced into each subpopulation is specified via argument p_type. Note that, without introducing any differential states, a non-zero fraction of type genes will result in separation of cells into clusters. Fig. 5 demonstrates how increasing values for p_type lead to more and more separation of the cells when coloring by cluster ID, but that the lack of state changes leads to homogeneous mixing of cells when coloring by group ID.

Figure 5: t-SNEs of exemplary simulations demonstrating p_type’s effect to introduce type features
Cells are colored by cluster ID (top-row) and group ID (bottom-row), respectively. The percentage of type features increases from left to right (1, 5, 10%). Simulations are pure EE, i.e., all genes are non-differential across groups.

1.3 Simulating a hierarchical cluster structure

under development.

simData contains three parameters that control how subpopulations relate to and differ from one another:

p_type determines the percentage of type genes exclusive to each cluster
phylo_tree represents a phylogenetic tree specifying how clusters relate to one another
phylo_pars controls how branch distances are to be interpreted

Note that, when supplied with a cluster phylogeny, argument nk is ignored and simData extracts the number of clusters to be simulated from phylo_tree.

1.3.1 `p_type`: Introducing type features

To exemplify the effect of the parameter p_type, we simulate a dataset with $\approx5\%$ of type genes per cluster, and one group only via probs = list(..., c(1, 0)) (i.e., $\text{Prob}(\textit{cell is in group 2}) = 0$):

# simulate 5% of type genes; one group only
sim <- simData(ref, p_type = 0.1, 
    nc = 2e3, ng = 1e3, force = TRUE,
    probs = list(NULL, NULL, c(1, 0)))
# do log-library size normalization
sim <- logNormCounts(sim)

For visualizing the above simulation, we select for genes that are of class type (rowData()$class == "type") and have a decent simulated expression mean. Furthermore, we sample a subset of cells for each cluster. The resulting heatmap (Fig. 6) shows that the 3 clusters separate well from one another, but that type genes aren’t necessarily expressed higher in a single cluster. This is the case because a gene selected as reference for a type gene in a given cluster may indeed have a lower expression than the gene used for the remainder of clusters.

# extract gene metadata & number of clusters
rd <- rowData(sim)
nk <- nlevels(sim$cluster_id)
# filter for type genes with high expression mean
is_type <- rd$class == "type"
is_high <- rowMeans(assay(sim, "logcounts")) > 1
gs <- rownames(sim)[is_type & is_high]
# sample 100 cells per cluster for plotting
cs <- lapply(split(seq_len(ncol(sim)), sim$cluster_id), sample, 100)
plotHeatmap(sim[, unlist(cs)], features = gs, center = TRUE,
    colour_columns_by = "cluster_id", cutree_cols = nk)

Figure 6: Exemplary heatmap demonstrating the effect of p_type to introduce cluster-specific type genes
Included are type genes (= rows) with a simulated expression mean > 1, and a random subset of 100 cells (= columns) per cluster; column annotations represent cluster IDs. Bins are colored by expression scaled in row direction, and both genes and cells are hierarchically clustered.

1.3.2 `phylo_tree`: Introducing a cluster phylogeny

The scenario illustrated above is arguably not very realistic. Instead, in a biological setting, subpopulations don’t differ from one another by a specific subset of genes, but may share some of the genes decisive for their biological role. That is, the set of type features is not exclusive to every given subpopulation, and some subpopulations are more similar to one another than others.

To introduce a more realistic subpopulation structure, simData can be supplied with a phylogenetic tree, phylo_tree, that specifies the relationship and distances between clusters. The tree should be written in Newick format as in the following example:

# specify cluster phylogeny 
tree <- "(('cluster1':0.4,'cluster2':0.4):0.4,('cluster3':
    0.5,('cluster4':0.2,'cluster5':0.2,'cluster6':0.2):0.4):0.4);"
# visualize cluster tree
library(phylogram)
plot(read.dendrogram(text = tree))

Figure 7: Exemplary phylogeny
The phylogenetic tree specified via phylo_tree relates 6 clusters such that there are 2 main branches, and clusters 1 and 2 should be more similar to one another than to cluster 3.

# simulate a hierarchical cluster structure; one group only (dd = FALSE)
sim <- simData(ref, 
    phylo_tree = tree, phylo_pars = c(0.1, 1),
    nc = 800, ng = 1e3, dd = FALSE, force = TRUE)
# do log-library size normalization
sim <- logNormCounts(sim)

# extract gene metadata & number of clusters
rd <- rowData(sim)
nk <- nlevels(sim$cluster_id)
# filter for type & shared genes with high expression mean
is_type <- rd$class != "state"
is_high <- rowMeans(assay(sim, "logcounts")) > 1
gs <- rownames(sim)[is_type & is_high]
# sample 50 cells per cluster for plotting
cs <- lapply(split(seq_len(ncol(sim)), sim$cluster_id), sample, 50)
plotHeatmap(sim[, unlist(cs)], features = gs, 
    center = TRUE, show_rownames = FALSE,
    colour_columns_by = "cluster_id")

Figure 8: Exemplary heatmap demonstrating the effect of phylo_tree to introduce a hierarchical cluster structure
Included are 100 randomly sampled non-state, i.e. type or shared, genes (= rows) with a simulated expression mean > 1, and a random subset of 100 cells (= columns) per cluster; column annotations represent cluster IDs. Bins are colored by expression scaled in row direction, and both genes and cells are hierarchically clustered.

1.4 Simulating batch effects

under development.

2 Quality control

As is the case with any simulation, it is crucial to verify the quality of the simulated data; i.e., how well key characteristics of the reference data are captured in the simulation. While we have demonstrated that muscat’s simulation framework is capable of reproducing key features of scRNA-seq datasets at both the single-cell and pseudobulk level (Crowell et al. 2019), simulation quality will vary depending on the reference dataset and could suffer from too extreme simulation parameters. Therefore, we advise anyone interested in using the framework presented herein for any type of method evaluation or comparison to generate a countsimQC report (Soneson and Robinson 2018), as it is extremely simple to make and very comprehensive.

The code chunk below (not evaluated here) illustrates how to generate a report comparing an exemplary simData simulation with the reference data provided in ref. Runtimes are mainly determined by arguments maxNForCorr and maxNForDisp, and computing a full-blown report can be very time intensive. We thus advise using a sufficient but low number of cells/genes for these steps.

# load required packages
library(countsimQC)
library(DESeq2)
# simulate data
sim <- simData(ref, 
    ng = nrow(ref), 
    nc = ncol(ref),
    dd = FALSE)
# construct 'DESeqDataSet's for both, 
# simulated and reference dataset
dds_sim <- DESeqDataSetFromMatrix(
    countData = counts(sim),
    colData = colData(sim),
    design = ~ sample_id)
dds_ref <- DESeqDataSetFromMatrix(
    countData = counts(ref),
    colData = colData(ref),
    design = ~ sample_id)
dds_list <- list(sim = dds_sim, data = dds_ref)
# generate 'countsimQC' report
countsimQCReport(
    ddsList = dds_list,
    outputFile = "<file_name>.html",
    outputDir = "<output_path>",
    outputFormat = "html_document",
    maxNForCorr = 200, 
    maxNForDisp = 500)

3 Method benchmarking

A variety of functions for calculating and visualizing performance metrics for evaluation of ranking and binary classification (assignment) methods is provided in the iCOBRA package (Soneson and Robinson 2016).

We first define a wrapper that takes as input a method passed to pbDS and reformats the results as a data.frame in tidy format, which is in turn left_joined with the simulation gene metadata. As each method may return results for different subsets of gene-subpopulation instances, the latter step ensures that the dimensions of all method results will match.

# 'm' is a character string specifying a valid `pbDS` method
.run_method <- function(m) {
    res <- pbDS(pb, method = m, verbose = FALSE)
    tbl <- resDS(sim, res)
    left_join(gi, tbl, by = c("gene", "cluster_id"))
}
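Why the join equalizes dimensions can be seen on a toy example. The sketch below uses base R's `merge` with `all.x = TRUE` as a stand-in for dplyr's `left_join`; the tables `gi_toy` and `tbl_toy` are hypothetical and only mimic the structure of the gene metadata and a method's results.

```r
# hypothetical gene metadata: every gene-cluster instance
gi_toy <- expand.grid(gene = paste0("g", 1:3),
    cluster_id = c("k1", "k2"), stringsAsFactors = FALSE)
# hypothetical results from one method, missing two instances
tbl_toy <- data.frame(
    gene = c("g1", "g2", "g3", "g1"),
    cluster_id = c("k1", "k1", "k1", "k2"),
    p_val = c(0.01, 0.2, 0.5, 0.03))
# keep all rows of the metadata; untested instances get NA
res_toy <- merge(gi_toy, tbl_toy,
    by = c("gene", "cluster_id"), all.x = TRUE)
res_toy
```

Every method's result then has one row per gene-cluster instance, with NA p-values where the method returned nothing. Note that `merge` sorts rows by the join keys, whereas `left_join` keeps the row order of its first argument; the latter is what allows results from several methods to be combined column-wise.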

Having computed result data.frames for a set of methods, we next define a wrapper that prepares the data for evaluation with iCOBRA using the COBRAData constructor, and calculates any performance measures of interest (specified via aspects) with calculate_performance:

# 'x' is a list of result 'data.frame's
.calc_perf <- function(x, facet = NULL) {
    cd <- COBRAData(truth = gi,
        pval = data.frame(bind_cols(map(x, "p_val"))),
        padj = data.frame(bind_cols(map(x, "p_adj.loc"))))
    perf <- calculate_performance(cd, 
        binary_truth = "is_de", maxsplit = 1e6,
        splv = ifelse(is.null(facet), "none", facet),
        aspects = c("fdrtpr", "fdrtprcurve", "curve"))
}
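For readers unfamiliar with the quantities behind the "fdrtpr" aspect, the following base-R sketch computes the true positive rate (TPR) and false discovery rate (FDR) at a single adjusted p-value threshold. It does not use iCOBRA: the vectors `truth` and `padj` are made up, and treating NA as "not called" is an assumption of this sketch.

```r
# hypothetical truth (1 = truly differential) and adjusted p-values
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)
padj  <- c(0.01, 0.03, 0.20, 0.04, 0.60, 0.90, 0.30, NA)
thr <- 0.05
# assumption: NA (untested instances) counts as not called
called <- !is.na(padj) & padj <= thr
tp <- sum(called & truth == 1)  # true positives
fp <- sum(called & truth == 0)  # false positives
tpr <- tp / sum(truth == 1)     # fraction of true changes detected
fdr <- if (tp + fp > 0) fp / (tp + fp) else 0
c(TPR = tpr, FDR = fdr)
```

Here 2 of the 4 truly differential instances are called (TPR = 0.5), and 1 of the 3 calls is false (FDR = 1/3). An FDR-TPR curve repeats this calculation across thresholds.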

Putting it all together, we can finally simulate some data, run a set of DS analysis methods, calculate their performance, and plot a variety of performance metrics depending on the aspects calculated by .calc_perf:

# simulation with all DD types
sim <- simData(ref, 
    p_dd = c(rep(0.3, 2), rep(0.1, 4)),
    ng = 1e3, nc = 2e3, ns = 3, force = TRUE)
# aggregate to pseudobulks
pb <- aggregateData(sim)
# extract gene metadata
gi <- metadata(sim)$gene_info
# add truth column (must be numeric!)
gi$is_de <- !gi$category %in% c("ee", "ep")
gi$is_de <- as.numeric(gi$is_de) 

# specify methods for comparison & run them
# (must set names for methods to show in visualizations!)
names(mids) <- mids <- c("edgeR", "DESeq2", "limma-trend", "limma-voom")
res <- lapply(mids, .run_method)

# calculate performance measures 
# and prep. for plotting with 'iCOBRA'
library(iCOBRA)
perf <- .calc_perf(res, "cluster_id")
pd <- prepare_data_for_plot(perf)

# plot FDR-TPR curves by cluster
plot_fdrtprcurve(pd) +
    theme(aspect.ratio = 1) +
    scale_x_continuous(trans = "sqrt") +
    facet_wrap(~ splitval, nrow = 1)

Session info

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] iCOBRA_1.20.0               phylogram_2.1.0            
##  [3] reshape2_1.4.4              UpSetR_1.4.0               
##  [5] scater_1.20.0               scuttle_1.2.0              
##  [7] muscData_1.5.0              SingleCellExperiment_1.14.0
##  [9] SummarizedExperiment_1.22.0 Biobase_2.52.0             
## [11] GenomicRanges_1.44.0        GenomeInfoDb_1.28.0        
## [13] IRanges_2.26.0              S4Vectors_0.30.0           
## [15] MatrixGenerics_1.4.0        matrixStats_0.58.0         
## [17] ExperimentHub_2.0.0         AnnotationHub_3.0.0        
## [19] BiocFileCache_2.0.0         dbplyr_2.1.1               
## [21] BiocGenerics_0.38.0         purrr_0.3.4                
## [23] muscat_1.6.0                limma_3.48.0               
## [25] ggplot2_3.3.3               dplyr_1.0.6                
## [27] cowplot_1.1.1               BiocStyle_2.20.0           
## 
## loaded via a namespace (and not attached):
##   [1] shinydashboard_0.7.1          utf8_1.2.1                   
##   [3] tidyselect_1.1.1              lme4_1.1-27                  
##   [5] htmlwidgets_1.5.3             RSQLite_2.2.7                
##   [7] AnnotationDbi_1.54.0          grid_4.1.0                   
##   [9] BiocParallel_1.26.0           Rtsne_0.15                   
##  [11] munsell_0.5.0                 ScaledMatrix_1.0.0           
##  [13] codetools_0.2-18              DT_0.18                      
##  [15] statmod_1.4.36                future_1.21.0                
##  [17] withr_2.4.2                   colorspace_2.0-1             
##  [19] filelock_1.0.2                highr_0.9                    
##  [21] knitr_1.33                    ROCR_1.0-11                  
##  [23] listenv_0.8.0                 labeling_0.4.2               
##  [25] emmeans_1.6.0                 GenomeInfoDbData_1.2.6       
##  [27] pheatmap_1.0.12               farver_2.1.0                 
##  [29] bit64_4.0.5                   glmmTMB_1.0.2.1              
##  [31] coda_0.19-4                   parallelly_1.25.0            
##  [33] vctrs_0.3.8                   generics_0.1.0               
##  [35] TH.data_1.0-10                xfun_0.23                    
##  [37] R6_2.5.0                      doParallel_1.0.16            
##  [39] ggbeeswarm_0.6.0              clue_0.3-59                  
##  [41] rsvd_1.0.5                    locfit_1.5-9.4               
##  [43] bitops_1.0-7                  cachem_1.0.5                 
##  [45] DelayedArray_0.18.0           assertthat_0.2.1             
##  [47] promises_1.2.0.1              scales_1.1.1                 
##  [49] multcomp_1.4-17               beeswarm_0.3.1               
##  [51] gtable_0.3.0                  beachmat_2.8.0               
##  [53] Cairo_1.5-12.2                globals_0.14.0               
##  [55] sandwich_3.0-1                rlang_0.4.11                 
##  [57] genefilter_1.74.0             GlobalOptions_0.1.2          
##  [59] splines_4.1.0                 TMB_1.7.20                   
##  [61] shinyBS_0.61                  broom_0.7.6                  
##  [63] BiocManager_1.30.15           yaml_2.2.1                   
##  [65] backports_1.2.1               httpuv_1.6.1                 
##  [67] tools_4.1.0                   bookdown_0.22                
##  [69] ellipsis_0.3.2                gplots_3.1.1                 
##  [71] jquerylib_0.1.4               RColorBrewer_1.1-2           
##  [73] Rcpp_1.0.6                    plyr_1.8.6                   
##  [75] sparseMatrixStats_1.4.0       progress_1.2.2               
##  [77] zlibbioc_1.38.0               RCurl_1.98-1.3               
##  [79] prettyunits_1.1.1             GetoptLong_1.0.5             
##  [81] viridis_0.6.1                 zoo_1.8-9                    
##  [83] cluster_2.1.2                 colorRamps_2.3               
##  [85] variancePartition_1.22.0      magrittr_2.0.1               
##  [87] magick_2.7.2                  RSpectra_0.16-0              
##  [89] data.table_1.14.0             lmerTest_3.1-3               
##  [91] circlize_0.4.12               mvtnorm_1.1-1                
##  [93] mime_0.10                     hms_1.1.0                    
##  [95] evaluate_0.14                 xtable_1.8-4                 
##  [97] pbkrtest_0.5.1                XML_3.99-0.6                 
##  [99] gridExtra_2.3                 shape_1.4.6                  
## [101] compiler_4.1.0                tibble_3.1.2                 
## [103] KernSmooth_2.23-20            crayon_1.4.1                 
## [105] minqa_1.2.4                   htmltools_0.5.1.1            
## [107] later_1.2.0                   tidyr_1.1.3                  
## [109] geneplotter_1.70.0            DBI_1.1.1                    
## [111] ComplexHeatmap_2.8.0          MASS_7.3-54                  
## [113] rappdirs_0.3.3                boot_1.3-28                  
## [115] Matrix_1.3-3                  pkgconfig_2.0.3              
## [117] numDeriv_2016.8-1.1           foreach_1.5.1                
## [119] annotate_1.70.0               vipor_0.4.5                  
## [121] bslib_0.2.5.1                 blme_1.0-5                   
## [123] XVector_0.32.0                estimability_1.3             
## [125] stringr_1.4.0                 digest_0.6.27                
## [127] RcppAnnoy_0.0.18              sctransform_0.3.2            
## [129] Biostrings_2.60.0             rmarkdown_2.8                
## [131] uwot_0.1.10                   edgeR_3.34.0                 
## [133] DelayedMatrixStats_1.14.0     curl_4.3.1                   
## [135] shiny_1.6.0                   gtools_3.8.2                 
## [137] rjson_0.2.20                  nloptr_1.2.2.2               
## [139] lifecycle_1.0.0               nlme_3.1-152                 
## [141] jsonlite_1.7.2                BiocNeighbors_1.10.0         
## [143] viridisLite_0.4.0             fansi_0.4.2                  
## [145] pillar_1.6.1                  lattice_0.20-44              
## [147] KEGGREST_1.32.0               fastmap_1.1.0                
## [149] httr_1.4.2                    survival_3.2-11              
## [151] interactiveDisplayBase_1.30.0 glue_1.4.2                   
## [153] png_0.1-7                     iterators_1.0.13             
## [155] BiocVersion_3.13.1            bit_4.0.4                    
## [157] stringi_1.6.2                 sass_0.4.0                   
## [159] blob_1.2.1                    DESeq2_1.32.0                
## [161] BiocSingular_1.8.0            caTools_1.18.2               
## [163] memoise_2.0.0                 ape_5.5                      
## [165] irlba_2.3.3                   future.apply_1.7.0

References

Crowell, Helena L, Charlotte Soneson, Pierre-Luc Germain, Daniela Calini, Ludovic Collin, Catarina Raposo, Dheeraj Malhotra, and Mark D Robinson. 2019. “On the Discovery of Population-Specific State Transitions from Multi-Sample Multi-Condition Single-Cell RNA Sequencing Data.” bioRxiv 713412.

Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (1): 89–94.

Korthauer, Keegan D, Li-Fang Chu, Michael A Newton, Yuan Li, James Thomson, Ron Stewart, and Christina Kendziorski. 2016. “A Statistical Approach for Identifying Differential Distributions in Single-Cell RNA-seq Experiments.” Genome Biology 17 (1): 222.

Soneson, Charlotte, and Mark D Robinson. 2016. “iCOBRA: Open, Reproducible, Standardized and Live Method Benchmarking.” Nature Methods 13 (4): 283.

———. 2018. “Towards Unified Quality Verification of Synthetic Count Data with countsimQC.” Bioinformatics 34 (4): 691–92.
