`muscat`

`muscat`

: **mu**lti-sample **mu**lti-group **sc**RNA-seq **a**nalysis **t**ools (Crowell et al. 2019) provides a straightforward but effective simulation framework that is anchored to a labeled multi-sample multi-subpopulation scRNA-seq reference dataset, uses (non-zero-inflated) negative binomial (NB) as the canonical distribution for droplet scRNA-seq datasets, and exposes various parameters to modulate: the number of subpopulations and samples simulated, the number of cells per subpopulation (and sample), and the type and magnitude of a wide range of patterns of differential expression.

This vignette serves to provide the underlying theoretical background, to thoroughly describe the various input arguments, and to demonstrate the simulation framework’s current capabilities using some illustrative (not necessarily realistic) examples.

muscat 1.4.0

For details on the concepts presented here, consider having a look at our preprint:

Crowell HL, Soneson C*, Germain P-L*,

Calini D, Collin L, Raposo C, Malhotra D & Robinson MD:

On the discovery of population-specific state transitions from

multi-sample multi-condition single-cell RNA sequencing data.

bioRxiv713412(July, 2019). doi: 10.1101/713412

```
library(cowplot)
library(dplyr)
library(reshape2)
library(muscat)
library(purrr)
library(scater)
library(SingleCellExperiment)
```

To demonstrate *muscat*’s simulation framework, we will use a *SingleCellExperiment* (SCE) containing 10x droplet-based scRNA-seq PBCM data from 8 Lupus patients obtained befor and after 6h-treatment with IFN-\(\beta\) (Kang et al. 2018). The complete raw data, as well as gene and cell metadata is available through the NCBI GEO, accession number GSE96583.

*muscat*’s simulation framework comprises: i) estimation of negative binomial (NB) parameters from a reference multi-subpopulation, multi-sample dataset; ii) sampling of gene and cell parameters to use for simulation; and, iii) simulation of gene expression data as NB distributions of mixtures thereof. See Fig. 1.

Let \(Y = (y_{gc})\in\mathbb{N}_0^{G\times C}\) denote the count matrix of a multi-sample multi-subpopulation reference dataset with genes \(\mathcal{G} = \{ g_1, \ldots, g_G \}\) and sets of cells \(\mathcal{C}_{sk} = \{ c^{sk}_1, ..., c^{sk}_{C_{sk}} \}\) for each sample \(s\) and subpopulation \(k\) (\(C_{sk}\) is the number of cells for sample \(s\), subpopulation \(k\)). For each gene \(g\), we fit a model to estimate sample-specific means \(\beta_g^s\), for each sample \(s\), and dispersion parameters \(\phi_g\) using ’s function with default parameters. Thus, we model the reference count data as NB distributed:

\[Y_{gc} \sim NB(\mu_{gc}, \phi_g)\]

for gene \(g\) and cell \(c\), where the mean \(\mu_{gc} = \exp(\beta_{g}^{s(c)}) \cdot \lambda_c\). Here, \(\beta_{g}^{s(c)}\) is the relative abundance of gene \(g\) in sample \(s(c)\), \(\lambda_c\) is the library size (total number of counts), and \(\phi_g\) is the dispersion.

For each subpopulation, we randomly assign each gene to a given *differential distribution* (DD) category (Korthauer et al. 2016) according to a probability vector `p_dd`

\(=(p_{EE},p_{EP},p_{DE},p_{DP},p_{DM},p_{DB})\). For each gene and subpopulation, we draw a vector of fold changes (FCs) from a Gamma distribution with shape parameter \(\alpha=4\) and rate \(\beta=4/\mu_\text{logFC}\), where \(\mu_\text{logFC}\) is the desired average logFC across all genes and subpopulations specified via argument `lfc`

. The direction of differential expression is randomized for each gene, with equal probability of up- and down-regulation.

Next, we split the cells in a given subpopulations into two sets (representing treatment groups), \(\mathcal{T}_A\) and \(\mathcal{T}_B\), which are in turn split again into two sets each (representing subpopulations within the given treatment group.), \(\mathcal{T}_{A_1}/\mathcal{T}_{A_2}\) and \(\mathcal{T}_{B_1}/\mathcal{T}_{B_2}\).

For EE genes, counts for \(\mathcal{T}_A\) and \(\mathcal{T}_B\) are drawn using identical means.For EP genes, we multiply the effective means for identical fractions of cells per group by the sample FCs, i.e., cells are split such that \(\dim\mathcal{T}_{A_1} = \dim\mathcal{T}_{B_1}\) and \(\dim\mathcal{T}_{A_2} = \dim\mathcal{T}_{B_2}\). For DE genes, the means of one group, \(A\) or \(B\), are multiplied with the samples FCs. DP genes are simulated analogously to EP genes with \(\dim\mathcal{T}_{A_1} = a\cdot\dim\mathcal{T}_A\) and \(\dim\mathcal{T}_{B_1} = b\cdot\dim\mathcal{T}_B\), where \(a+b=1\) and \(a\neq b\). For DM genes, 50% of cells from one group are simulated at \(\mu\cdot\text{logFC}\). For DB genes, all cells from one group are simulated at \(\mu\cdot\text{logFC}/2\), and the second group is split into equal proportions of cells simulated at \(\mu\) and \(\mu\cdot\text{logFC}\), respectively. See Fig. 2.

`prepSim`

: Preparing data for simulationTo prepare a reference *SingleCellExperiment* (SCE) for simulation of multi-sample multi-group scRNA-seq data, `prepSim`

will

- perform basic filtering of genes and cells
- (optionally) filter for subpopulation-sample instances with a threshold number of cells to assure accurate parameter estimation
- estimate cell (library sizes) and gene parameters (dispersions and sample-specific means)

Importantly, we want to introduce *known* changes in states across conditions; thus, only replicates from a single condition should go into the simulation. The group to be kept for simulation may be specified via `group_keep`

, in which case samples from all other groups (`sce$group_id != group_keep`

) will be droped. By default (`group_keep = NULL`

), `prepSim`

will select the first group available as reference.

Arguments `min_count`

, `min_cells`

, `min_genes`

and `min_size`

are used to tune the filtering of genes, cells and subpopulation-instances as follows:

- only genes with a count
`> min_count`

in`>= min_cells`

will be retained - only cells with a count
`> 0`

in`>= min_genes`

will be retained - only subpopulation-sample instances with
`>= min_size`

cells will be retained;`min_size = NULL`

will skip this step

```
data(sce)
ref <- prepSim(sce, verbose = FALSE)
# only samples from `ctrl` group are retained
table(ref$sample_id)
```

```
##
## ctrl101 ctrl107
## 200 200
```

```
# cell parameters: library sizes
sub <- assay(sce[rownames(ref), colnames(ref)])
all.equal(exp(ref$offset), as.numeric(colSums(sub)))
```

`## [1] TRUE`

```
# gene parameters: dispersions & sample-specific means
head(rowData(ref))
```

```
## DataFrame with 6 rows and 5 columns
## ENSEMBL SYMBOL dispersion beta.ctrl101 beta.ctrl107
## <character> <character> <numeric> <numeric> <numeric>
## ISG15 ENSG00000187608 ISG15 4.8320290 -8.30850 -8.59397
## AURKAIP1 ENSG00000175756 AURKAIP1 0.1681737 -8.37653 -8.37197
## MRPL20 ENSG00000242485 MRPL20 0.6879266 -8.55650 -9.12391
## SSU72 ENSG00000160075 SSU72 0.0449046 -8.48326 -8.63955
## RER1 ENSG00000157916 RER1 0.5048393 -8.31339 -8.20253
## RPL22 ENSG00000116251 RPL22 0.4064407 -8.06758 -8.03934
```

`simData`

: Simulating complex designsProvided with a reference SCE as returned by `prepSim`

, a variery of simulation scenarios can be generated using the `simData`

function, which will again return an SCE containg the following elements:

`assay`

`counts`

containing the simulated count data`colData`

columns`cluster/sample/group_id`

containing each cells cluster, sample, and group ID (A or B).`metadata$gene_info`

containing a`data.frame`

listing, for each gene and cluster- the simulationed DD
`category`

- the sampled
`logFC`

; note that this will only approximate`log2(sim_mean.B/sim_mean.A)`

for genes of the`de`

category as other types of state changes use mixtures for NBs, and will consequently not exhibit a shift in means of the same magnitude as`logFC`

- the reference
`sim_gene`

from which dispersion`sim_disp`

and sample-specific means`beta.<sample_id>`

were used - the simulated expression means
`sim_mean.A/B`

for each group

- the simulationed DD

In the code chunk that follows, we run a simple simulation with

`p_dd = c(1,0,...0)`

, i.e., 10% of EE genes`nk = 3`

subpopulations and`ns = 3`

replicates for each of 2 groups`ng = 1000`

genes and`nc = 2000`

cells, resulting in`2000/2/ns/nk`

\(\approx111\) cells for 2 groups with 3 samples each and 3 subpopulations

```
# simulated 10% EE genes
sim <- simData(ref, p_dd = diag(6)[1, ],
nk = 3, ns = 3, nc = 2e3,
ng = 1e3, force = TRUE)
# number of cells per sample and subpopulation
table(sim$sample_id, sim$cluster_id)
```

```
##
## cluster1 cluster2 cluster3
## sample1.A 95 114 97
## sample2.A 119 111 111
## sample3.A 122 98 103
## sample1.B 105 119 135
## sample2.B 95 100 107
## sample3.B 129 115 125
```

By default, we have drawn a random reference sample from `levels(ref$sample_id)`

for every simulated sample in each group, resulting in an unpaired design:

`metadata(sim)$ref_sids`

```
## A B
## sample1 "ctrl101" "ctrl107"
## sample2 "ctrl101" "ctrl101"
## sample3 "ctrl101" "ctrl107"
```

Alternatively, we can re-run the above simulation with `paired = TRUE`

such that both groups will use the same set of reference samples, resulting in a paired design:

```
# simulated paired design
sim <- simData(ref, paired = TRUE,
nk = 3, ns = 3, nc = 2e3,
ng = 1e3, force = TRUE)
# same set of reference samples for both groups
ref_sids <- metadata(sim)$ref_sids
all(ref_sids[, 1] == ref_sids[, 2])
```

`## [1] TRUE`

`p_dd`

: Simulating differential distributionsArgument `p_dd`

specifies the fraction of cells to simulate for each DD category. Its values should thus lie in \([0,1]\) and sum to 1. Expression densities for an exemplary set of genes simulated from the code below is shown in Fig. 3.

```
# simulare genes from all DD categories
sim <- simData(ref, p_dd = c(0.5, rep(0.1, 5)),
nc = 2e3, ng = 1e3, force = TRUE)
```

We can retrieve the category assigned to each gene in each cluster from the `gene_info`

table stored in the output SCE’s `metadata`

:

```
gi <- metadata(sim)$gene_info
table(gi$category)
```

```
##
## ee ep de dp dm db
## 1419 318 288 300 348 327
```