Welcome to Splatter! Splatter is an R package for the simple simulation of single-cell RNA sequencing data. This vignette gives an overview and introduction to Splatter’s functionality.
Splatter can be installed from Bioconductor:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("splatter")
To install the most recent development version from Github use:
BiocManager::install("Oshlack/splatter", dependencies = TRUE,
build_vignettes = TRUE)
Assuming you already have a matrix of count data similar to that you wish to
simulate there are two simple steps to creating a simulated data set with
Splatter. Here is an example a mock dataset generated with the scater
package:
# Load package
library(splatter)
# Create mock data
library(scater)
set.seed(1)
sce <- mockSCE()
# Estimate parameters from mock data
params <- splatEstimate(sce)
## NOTE: Library sizes have been found to be normally distributed instead of log-normal. You may want to check this is correct.
# Simulate data using estimated parameters
sim <- splatSimulate(params)
## Getting parameters...
## Creating simulation object...
## Simulating library sizes...
## Simulating gene means...
## Simulating BCV...
## Simulating counts...
## Simulating dropout (if needed)...
## Done!
These steps will be explained in detail in the following sections but briefly the first step takes a dataset and estimates simulation parameters from it and the second step takes those parameters and simulates a new dataset.
Before we look at how we estimate parameters let’s first look at how Splatter simulates data and what those parameters are. We use the term ‘Splat’ to refer to the Splatter’s own simulation and differentiate it from the package itself. The core of the Splat model is a gamma-Poisson distribution used to generate a gene by cell matrix of counts. Mean expression levels for each gene are simulated from a gamma distribution and the Biological Coefficient of Variation is used to enforce a mean-variance trend before counts are simulated from a Poisson distribution. Splat also allows you to simulate expression outlier genes (genes with mean expression outside the gamma distribution) and dropout (random knock out of counts based on mean expression). Each cell is given an expected library size (simulated from a log-normal distribution) that makes it easier to match to a given dataset.
Splat can also simulate differential expression between groups of different types of cells or differentiation paths between different cells types where expression changes in a continuous way. These are described further in the simulating counts section.
SplatParams
objectAll the parameters for the Splat simulation are stored in a SplatParams
object. Let’s create a new one and see what it looks like.
params <- newSplatParams()
params
## A Params object of class SplatParams
## Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT'
## Secondary parameters are usually set during simulation
##
## Global:
## (Genes) (Cells) [Seed]
## 10000 100 587425
##
## 28 additional parameters
##
## Batches:
## [Batches] [Batch Cells] [Location] [Scale]
## 1 100 0.1 0.1
##
## Mean:
## (Rate) (Shape)
## 0.3 0.6
##
## Library size:
## (Location) (Scale) (Norm)
## 11 0.2 FALSE
##
## Exprs outliers:
## (Probability) (Location) (Scale)
## 0.05 4 0.5
##
## Groups:
## [Groups] [Group Probs]
## 1 1
##
## Diff expr:
## [Probability] [Down Prob] [Location] [Scale]
## 0.1 0.5 0.1 0.4
##
## BCV:
## (Common Disp) (DoF)
## 0.1 60
##
## Dropout:
## [Type] (Midpoint) (Shape)
## none 0 -1
##
## Paths:
## [From] [Steps] [Skew] [Non-linear] [Sigma Factor]
## 0 100 0.5 0.1 0.8
As well as telling us what type of object we have (“A Params
object of class
SplatParams
”) and showing us the values of the parameter this output gives us
some extra information. We can see which parameters can be estimated by the
splatEstimate
function (those in parentheses), which can’t be estimated
(those in brackets) and which have been changed from their default values (those
in ALL CAPS). For more details about the parameters of the Splat simulation
refer to the Splat parameters vignette.
If we want to look at a particular parameter, for example the number of genes to
simulate, we can extract it using the getParam
function:
getParam(params, "nGenes")
## [1] 10000
Alternatively, to give a parameter a new value we can use the setParam
function:
params <- setParam(params, "nGenes", 5000)
getParam(params, "nGenes")
## [1] 5000
If we want to extract multiple parameters (as a list) or set multiple parameters
we can use the getParams
or setParams
functions:
# Set multiple parameters at once (using a list)
params <- setParams(params, update = list(nGenes = 8000, mean.rate = 0.5))
# Extract multiple parameters as a list
getParams(params, c("nGenes", "mean.rate", "mean.shape"))
## $nGenes
## [1] 8000
##
## $mean.rate
## [1] 0.5
##
## $mean.shape
## [1] 0.6
# Set multiple parameters at once (using additional arguments)
params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
params
## A Params object of class SplatParams
## Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT'
## Secondary parameters are usually set during simulation
##
## Global:
## (GENES) (Cells) [Seed]
## 8000 100 587425
##
## 28 additional parameters
##
## Batches:
## [Batches] [Batch Cells] [Location] [Scale]
## 1 100 0.1 0.1
##
## Mean:
## (RATE) (SHAPE)
## 0.5 0.5
##
## Library size:
## (Location) (Scale) (Norm)
## 11 0.2 FALSE
##
## Exprs outliers:
## (Probability) (Location) (Scale)
## 0.05 4 0.5
##
## Groups:
## [Groups] [Group Probs]
## 1 1
##
## Diff expr:
## [PROBABILITY] [Down Prob] [Location] [Scale]
## 0.2 0.5 0.1 0.4
##
## BCV:
## (Common Disp) (DoF)
## 0.1 60
##
## Dropout:
## [Type] (Midpoint) (Shape)
## none 0 -1
##
## Paths:
## [From] [Steps] [Skew] [Non-linear] [Sigma Factor]
## 0 100 0.5 0.1 0.8
The parameters with have changed are now shown in ALL CAPS to indicate that they been changed form the default.
We can also set parameters directly when we call newSplatParams
:
params <- newSplatParams(lib.loc = 12, lib.scale = 0.6)
getParams(params, c("lib.loc", "lib.scale"))
## $lib.loc
## [1] 12
##
## $lib.scale
## [1] 0.6
Splat allows you to estimate many of it’s parameters from a data set containing
counts using the splatEstimate
function.
# Get the mock counts matrix
counts <- counts(sce)
# Check that counts is an integer matrix
class(counts)
## [1] "matrix"
typeof(counts)
## [1] "double"
# Check the dimensions, each row is a gene, each column is a cell
dim(counts)
## [1] 2000 200
# Show the first few entries
counts[1:5, 1:5]
## Cell_001 Cell_002 Cell_003 Cell_004 Cell_005
## Gene_0001 0 5 7 276 50
## Gene_0002 12 0 0 0 0
## Gene_0003 97 292 58 64 541
## Gene_0004 0 0 0 170 19
## Gene_0005 105 123 174 565 1061
params <- splatEstimate(counts)
## NOTE: Library sizes have been found to be normally distributed instead of log-normal. You may want to check this is correct.
Here we estimated parameters from a counts matrix but splatEstimate
can also
take a SingleCellExperiment
object. The estimation process has the following
steps:
estimateDisp
function from the
edgeR
package.For more details of the estimation procedures see ?splatEstimate
.
Once we have a set of parameters we are happy with we can use splatSimulate
to simulate counts. If we want to make small adjustments to the parameters we
can provide them as additional arguments, alternatively if we don’t supply any
parameters the defaults will be used:
sim <- splatSimulate(params, nGenes = 1000)
## Getting parameters...
## Creating simulation object...
## Simulating library sizes...
## Simulating gene means...
## Simulating BCV...
## Simulating counts...
## Simulating dropout (if needed)...
## Done!
sim
## class: SingleCellExperiment
## dim: 1000 200
## metadata(1): Params
## assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
## rownames(1000): Gene1 Gene2 ... Gene999 Gene1000
## rowData names(4): Gene BaseGeneMean OutlierFactor GeneMean
## colnames(200): Cell1 Cell2 ... Cell199 Cell200
## colData names(3): Cell Batch ExpLibSize
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):
Looking at the output of splatSimulate
we can see that sim
is
SingleCellExperiment
object with 1000 features (genes) and
200 samples (cells). The main part of this object is a features
by samples matrix containing the simulated counts (accessed using counts
),
although it can also hold other expression measures such as FPKM or TPM.
Additionally a SingleCellExperiment
contains phenotype information about
each cell (accessed using colData
) and feature information about each gene
(accessed using rowData
). Splatter uses these slots, as well as assays
, to
store information about the intermediate values of the simulation.
# Access the counts
counts(sim)[1:5, 1:5]
## Cell1 Cell2 Cell3 Cell4 Cell5
## Gene1 58 96 21 36 61
## Gene2 7 53 85 12 82
## Gene3 716 868 963 18 1147
## Gene4 75 293 72 292 132
## Gene5 5 7 19 34 81
# Information about genes
head(rowData(sim))
## DataFrame with 6 rows and 4 columns
## Gene BaseGeneMean OutlierFactor GeneMean
## <factor> <numeric> <numeric> <numeric>
## Gene1 Gene1 34.6332441000976 1 34.6332441000976
## Gene2 Gene2 20.3019738695719 1 20.3019738695719
## Gene3 Gene3 304.225454370614 1 304.225454370614
## Gene4 Gene4 70.0179226428246 1 70.0179226428246
## Gene5 Gene5 13.305089447855 1 13.305089447855
## Gene6 Gene6 483.263595982948 1 483.263595982948
# Information about cells
head(colData(sim))
## DataFrame with 6 rows and 3 columns
## Cell Batch ExpLibSize
## <factor> <character> <numeric>
## Cell1 Cell1 Batch1 365125.510306803
## Cell2 Cell2 Batch1 345681.398513973
## Cell3 Cell3 Batch1 364841.199171529
## Cell4 Cell4 Batch1 350763.493373038
## Cell5 Cell5 Batch1 370658.38435863
## Cell6 Cell6 Batch1 363185.353974474
# Gene by cell matrices
names(assays(sim))
## [1] "BatchCellMeans" "BaseCellMeans" "BCV" "CellMeans"
## [5] "TrueCounts" "counts"
# Example of cell means matrix
assays(sim)$CellMeans[1:5, 1:5]
## Cell1 Cell2 Cell3 Cell4 Cell5
## Gene1 62.176607 122.018232 23.41937 36.21054 76.38050
## Gene2 9.466685 58.699960 106.52211 15.35667 91.74018
## Gene3 692.749409 871.256852 970.39745 17.91015 1123.43688
## Gene4 107.574979 277.839272 75.76843 271.30839 144.36478
## Gene5 5.945762 9.819639 18.92383 26.07896 56.42265
An additional (big) advantage of outputting a SingleCellExperiment
is that we
get immediate access to other analysis packages, such as the plotting functions
in scater
. For example we can make a PCA plot:
# Use scater to calculate logcounts
sim <- normalize(sim)
## Warning: 'normalizeSCE' is deprecated.
## Use 'logNormCounts' instead.
## See help("Deprecated")
## Warning in .local(object, ...): using library sizes as size factors
## Warning: 'centreSizeFactors' is deprecated.
## See help("Deprecated")
# Plot PCA
plotPCA(sim)
## Warning: call 'runPCA' explicitly to compute results
(NOTE: Your values and plots may look different as the simulation is random and produces different results each time it is run.)
For more details about the SingleCellExperiment
object refer to the [vignette]
SCE-vignette. For information about what you can do with scater
refer to the
scater
documentation and vignette.
The splatSimulate
function outputs the following additional information about
the simulation:
colData
)
Cell
- Unique cell identifier.Group
- The group or path the cell belongs to.ExpLibSize
- The expected library size for that cell.Step
(paths only) - How far along the path each cell is.rowData
)
Gene
- Unique gene identifier.BaseGeneMean
- The base expression level for that gene.OutlierFactor
- Expression outlier factor for that gene (1 is not an
outlier).GeneMean
- Expression level after applying outlier factors.DEFac[Group]
- The differential expression factor for each gene
in a particular group (1 is not differentially expressed).GeneMean[Group]
- Expression level of a gene in a particular group after
applying differential expression factors.assays
)
BaseCellMeans
- The expression of genes in each cell adjusted for
expected library size.BCV
- The Biological Coefficient of Variation for each gene in
each cell.CellMeans
- The expression level of genes in each cell adjusted
for BCV.TrueCounts
- The simulated counts before dropout.Dropout
- Logical matrix showing which counts have been dropped in which
cells.Values that have been added by Splatter are named using UpperCamelCase
to
separate them from the underscore_naming
used by scater
and other packages.
For more information on the simulation see ?splatSimulate
.
So far we have only simulated a single population of cells but often we are
interested in investigating a mixed population of cells and looking to see what
cell types are present or what differences there are between them. Splatter is
able to simulate these situations by changing the method
argument Here we are
going to simulate two groups, by specifying the group.prob
parameter and
setting the method
parameter to "groups"
:
(NOTE: We have also set the verbose
argument to FALSE
to stop Splatter
printing progress messages.)
sim.groups <- splatSimulate(group.prob = c(0.5, 0.5), method = "groups",
verbose = FALSE)
sim.groups <- normalize(sim.groups)
## Warning: 'normalizeSCE' is deprecated.
## Use 'logNormCounts' instead.
## See help("Deprecated")
## Warning in .local(object, ...): using library sizes as size factors
## Warning: 'centreSizeFactors' is deprecated.
## See help("Deprecated")
plotPCA(sim.groups, colour_by = "Group")
## Warning: call 'runPCA' explicitly to compute results