scater
scater 1.6.3
This document gives an introduction to and overview of the quality control
functionality of the scater
package.
The scater
package is contains tools to help with the analysis of single-cell
transcriptomic data, with the focus on RNA-seq data. The package features:
SingleCellExperiment
class as a data container for
interoperability with a wide range of other Bioconductor packages;kallisto
and
‘Salmon’ for rapid
quantification of transcript abundance and tight integration with scater
;To get up and running as quickly as possible, see the Quick Start section below. For see the various in-depth sections on various aspects of the functionality that follow.
NB: as of July 2017, scater
has switched from the SCESet
class previously
defined within the package to the more widely applicable SingleCellExperiment
class. The functions toSingleCellExperiment
and updateSCESet
(for backwards
compatibility) can be used to convert an old SCESet
object to a
SingleCellExperiment
object.
Assuming you have a matrix containing expression count data summarised at the
level of some features (gene, exon, region, etc.), then we first need to form a
SingleCellExperiment
object containing the data. A SingleCellExperiment
object is the basic data container used in scater
and many other Bioconductor
packages for single-cell data analysis.
Here we use the example data provided with the package, which gives us two objects, a matrix of counts and a dataframe with information about the cells we are studying:
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")
We use these objects to form a SingleCellExperiment
object containing all of
the necessary information for our analysis:
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts), colData = sc_example_cell_info)
We always expect to have (raw) count data in a SingleCellExperiment
object. In
almost all cases we will also want to have a log2-scale representation of the
data. We expect this to be stored as the exprs
assay.
Here we use log2-counts-per-million with an offset of 1 as the exprs
values.
exprs(example_sce) <- log2(
calculateCPM(example_sce, use.size.factors = FALSE) + 1)
Subsetting is very convenient with this class. For example, we can filter out features (genes) that are not expressed in any cells:
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]
Now we have the expression data neatly stored in a structure that can be used for lots of exciting analyses.
It is straight-forward to compute many quality control metrics. We typically provide one or more sets of “feature controls”, that is sets of genes or features that represent technical features of the expression data or are not of primary biological interest. QC metrics are computed especially for these feature sets are used to assess the quality of cells. Spike-in genes (such as the commonly-used ERCC set) and mitochondrial genes are typically useful as “feature controls”. Here, for demonstration, we just use the first 40 features.
example_sce <- calculateQCMetrics(example_sce, feature_controls = 1:40)
Now you can play around with your data using the graphical user interface (GUI), which opens an interactive dashboard in your browser!
scater_gui(example_sce)
Many plotting functions are available for visualising the data:
plot
: a plot method exists for SingleCellExperiment
objects, which gives
an overview of expression across cells.plotQC
: various methods are available for producing QC diagnostic plots.plotPCA
: produce a principal components plot for the cells.plotTSNE
: produce a t-distributed stochastic neighbour embedding (reduced
dimension) plot for the cells.plotDiffusionMap
: produce a diffusion map (reduced dimension) plot for the
cells.plotMDS
: produce a multi-dimensional scaling plot for the cells.plotReducedDim
: plot a reduced-dimension representation of the cells.plotExpression
: plot expression levels for a defined set of features.plotPlatePosition
: plot cells in their position on a plate, coloured by
cell metadata and QC metrics or feature expression level.plotColData
: plot cell metadata and QC metrics.plotRowData
: plot feature metadata and QC metrics.More detail on the QC plotting methods is given throughout the vignette below.
The many other plotting methods are shown in detail in the data visualisation
vignette. Visualisations can highlight features and cells to be filtered out,
which can be done easily with the subsetting capabilities of scater
.
The QC plotting functions also enable the identification of important experimental variables, which can be conditioned out in the data normalisation step.
After QC and data normalisation (methods are available in scater
), the dataset
is ready for downstream statistical modeling.
It is possible to get an overall view of the dataset by using the plotScater
method available for SingleCellExperiment
objects. (NB: this function replaces
the generic plot
method that was previously available for SCESet
objects.)
This method plots the cumulative proportion of each cell’s library that is accounted for by the top highest-expressed features (by default showing the cumulative proportion across the top 500 features).
This type of plot gives an overall idea of differences in expression distributions for different cells. It is used in the same way as per-sample boxplots are for microarray or bulk RNA-seq data. Due to the large numbers of zeroes in expression values for single-cell RNA-seq data, boxplots are not as useful, so instead we focus on the contributions from the most expressed features for each cell.
With this function, we can split up the cells based on cell metadata variables
to get a finer-grained look at differences between cells. By default, the plot
method will try to use count values for the plot. If these
are not present in the SingleCellExperiment
object, then the values to use
should be specified using the exprs_values
argument.
plot(example_sceset, block1 = "Mutation_Status", block2 = "Treatment",
colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")
This sort of approach can help to pick up large differences in expression distributions across different experimental blocks (e.g. processing batches or similar.)
The scater
package puts a focus on aiding with quality control (QC) and
pre-processing of single-cell RNA-seq data before further downstream analysis.
We see QC as consisting of three distinct steps:
Following QC, we can proceed with data normalisation before downstream analysis and modelling.
In the next few sections we discuss the QC and filtering capabilities available
in scater
.
To compute commonly-used QC metrics we have the function calculateQCMetrics()
:
example_sceset <- calculateQCMetrics(example_sceset, feature_controls = 1:20)
varLabels(example_sceset)
More than one set of feature controls can be defined if desired.
example_sceset <- calculateQCMetrics(
example_sceset, feature_controls = list(controls1 = 1:20, controls2 = 500:1000),
cell_controls = list(set_1 = 1:5, set_2 = 31:40))
varLabels(example_sceset)
This function adds the following columns to pData(object)
:
total_counts
: total number of counts for the cell (aka ‘library size’)log10_total_counts
: total_counts on the log10-scaletotal_features
: the number of features for the cell that have expression above the
detection limit (default detection limit is zero)filter_on_total_counts
: would this cell be filtered out based on its log10-total_counts
being (by default) more than 5 median absolute deviations from the median
log10-total_counts for the dataset?filter_on_total_features
: would this cell be filtered out based on its total_features
being (by default) more than 5 median absolute deviations from the median
total_features for the dataset?counts_feature_controls
: total number of counts for the cell that come from
(a set of user-defined) control features. Defaults to zero if no control features are
indicated.counts_endogenous_features
: total number of counts for the cell that come from
endogenous features (i.e. not control features). Defaults to total_counts
if no control features are
indicated.log10_counts_feature_controls
: total number of counts from control features on
the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite
values) if no control features are indicated.log10_counts_endogenous_features
: total number of counts from endogenous
features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid
infinite values) if no control features are indicated.n_detected_feature_controls
: number of defined feature controls that have
expression greater than the threshold defined in the object.
*pct_counts_feature_controls
: percentage of all counts that come from the
defined control features. Defaults to zero if no control features are defined.If we define multiple sets of feature controls, then the above will be supplied for all feature sets, plus the set of all feature controls combined, as appropriate.
Furthermore, where “counts” appear in the above, the same metrics will also be
computed for “exprs”, “tpm” and “fpkm” (if tpm and fpkm are present in the
SCESet
object).
The function further adds the following columns to fData(object)
:
mean_exprs
: the mean expression level of the gene/feature.exprs_rank
: the rank of the feature’s expression level in the cell.total_feature_counts
: the total number of counts mapped to that feature across all
cells.log10_total_feature_counts
: total feature counts on the log10-scale.pct_total_counts
: the percentage of all counts that are accounted for by the
counts mapping to the feature.is_feature_control
: is the feature a control feature? Default is FALSE
unless
control features are defined by the user.n_cells_exprs
: the number of cells for which the expression level of the feature
is above the detection limit (default detection limit is zero).names(fData(example_sceset))
As above, where “counts” appear in the above, the same metrics will also be
computed for “exprs”, “tpm” and “fpkm” (if tpm and fpkm are present in the
SCESet
object).
Visualising the data and metadata in various ways can be very helpful for QC. We have a suite of plotting functions to produce diagnostic plots for:
pData(object)
).These three QC plots can all be accessed through the function plotQC
(we need
to make sure there are no features with zero or constant expression).
The first step in the QC process is filtering out unwanted features. We will typically filter out features with very low overall expression, and any others that plots or other metrics indicate may be problematic.
First we look at a plot that shows the top 50 (by default) most-expressed
features. By default, “expression” is defined using the feature counts (if
available), but \(tpm\), \(cpm\), \(fpkm\) or the exprs
values can be used instead,
if desired.
keep_feature <- rowSums(counts(example_sceset) > 0) > 4
example_sceset <- example_sceset[keep_feature,]
## Plot QC
plotQC(example_sceset, type = "highest-expression", exprs_values = "counts")
The multiplot
function allows a very simple way to plot multiple ggplot2
plots on the same page. For more sophisticated possibilities for arranging
multiple ggplot2
plots, check out the excellent cowplot
package, available on CRAN. If you have cowplot
installed (highly
recommended), then scater
will automatically use it to create particularly
attractive plots.
It can also be particularly useful to inspect the most-expressed features in
just the cell controls (for example blanks or bulk samples). Subsetting
capabilities for SCESet
objects allow us to do this easily. In the previous
section, we defined two sets of cell controls in the call to
calculateQCMetrics
. That function added the is_cell_control
column to the
phenotype data of the SCESet
object example_sceset
, which indicates if a
cell is defined as a cell control across any of the cell control sets.
The $
operator makes it easy to access the is_cell_control
column and use it
to subset the SCESet
as below. We can compare the most-expressed features in the
cell controls and in the cells of biological interest with this subsetting, as
demonstrated in the code below (plot not shown).
p1 <- plotQC(example_sceset[, !example_sceset$is_cell_control],
type = "highest-expression")
p2 <- plotQC(example_sceset[, example_sceset$is_cell_control],
type = "highest-expression")
multiplot(p1, p2, cols = 2)
Another way to obtain an idea of the level of technical noise in the dataset is
to plot the frequency of expression (that is, number of cells with expression
for the gene above the defined threshold (default is zero)) against mean
expression expression level . A set of specific features to plot can be defined,
but need not be. By default, the function will look for defined feature controls
(as supplied to calculateQCMetrics
). If feature controls are found, then these
will be plotted, if not then all features will be plotted.
plotQC(example_sceset, type = "exprs-freq-vs-mean")
We can also plot just a subset of features with code like that below (plot not shown):
feature_set_1 <- fData(example_sceset)$is_feature_control_controls1
plotQC(example_sceset, type = "exprs-freq-vs-mean", feature_set = feature_set_1)
Beyond these QC plots, we have a neat, general and flexible function for plotting two feature metadata variables:
plotFeatureData(example_sceset, aes(x = n_cells_exprs, y = pct_total_counts))
We can see that there is a small number of features that are ubiquitously
expressed expressed in all cells (n_cells_exprs
) and account for a large
proportion of all counts observed (pct_total_counts
; more than 0.5% of all
counts).
The subsetting of rows of SCESet
objects makes it easy to drop unwanted
features.
See plotPhenoData
and other QC plots below. The subsetting of columns (which
correspond to cells) of SCESet
objects makes it easy to drop unwanted cells.
We also have neat functions to plot two cell metadata variables:
plotPhenoData(example_sceset, aes(x = Mutation_Status, y = total_features,
colour = log10_total_counts))
Note that ggplot aesthetics will work correctly (in general) for everything
except colour
(color
) and fill
, which must be either columns of pData
or feature names (i.e. gene/transcript names).
These sorts of plots can be very useful for finding potentially problematic cells.
plotPhenoData(example_sceset, aes(x = total_counts, y = total_features,
colour = Gene_1000))
plotPhenoData(example_sceset, aes(x = pct_counts_feature_controls,
y = total_features, colour = Gene_0500))
plotPhenoData(example_sceset, aes(x = pct_counts_feature_controls,
y = pct_counts_top_50_features,
colour = Gene_0001))
The output of these functions is a ggplot
object, which can be added to,
amended and altered. For example, if we don’t like the legend position we can change it, and we could also add a trend line for each group (see below).
Tapping into the powerful capabilities of ggplot2
, the possibilities are many.
A particularly useful plot for cell QC is plotting the percentage of expression accounted for by feature controls against total_features.
plotPhenoData(example_sceset, aes(x = total_features,
y = pct_counts_feature_controls,
colour = Mutation_Status)) +
theme(legend.position = "top") +
stat_smooth(method = "lm", se = FALSE, size = 2, fullrange = TRUE)
On real data, we expect to see well-behaved cells with relatively high total_features (number of features with detectable expression) and low percentage of expression from feature controls. High percentage expression from feature controls and low total_features are indicative of blank and failed cells.
The plotPhenoData
function is useful for exploring the relationships between
the many QC metrics computed by calculateQCMetrics
above. Often, problematic
cells can be identified from such plots.
Based on PCA or dimensionality reduction plots (described in detail in the data visualisation vignette) we may identify outlier cells and, if we wish, filter them out of the analysis. There is also an outlier detection option available with the plotPCA function. This performs PCA on QC metrics to highlight cells that differ from other cells based on technical features.
example_sceset <- plotPCA(example_sceset, pca_data_input = "pdata",
detect_outliers = TRUE, return_SCESet = TRUE)
The $outlier
element of the pData (phenotype data) slot of the SCESet
contains indicator values about whether or not each cell has been designated as
an outlier based on the PCA. Here, these values can be accessed for filtering
low quality cells with example_sceset$outlier
. Automatic outlier detection can be informative, but a close inspection of QC metrics and tailored filtering for the specifics of the dataset at hand is strongly recommended.
On this example dataset there are no cells that need filtering, but the
subsetting capabilities of scater
make it easy to filter out unwanted cells.
Column subsetting selects cells, while row subsetting selects features (genes
or transcripts). In particular, there is a function filter
(inspired by the function of the same name in the dplyr
package and operating in exactly the same) that can be used to very conviently subset (i.e. filter) the cells of an SCESet
object based on pData
variables of the object.
See the plotQC
options below. The various plotting functions enable
visualisation of the relationship betwen experimental variables and the
expression data.
We can look at the relative importance of different explanatory variables with
some of the plotQC
function options. We can compute the median marginal \(R^2\)
for each variable in pData(example_sceset)
when fitting a linear model
regressing exprs
values against just that variable.
The default approach looks at all variables in pData(object)
and plots the top
nvars_to_plot
variables (default is 10).
plotQC(example_sceset, type = "expl")
Alternatively, we can choose a subset of variables to plot in this manner.
plotQC(example_sceset, type = "expl",
variables = c("total_features", "total_counts", "Mutation_Status", "Treatment",
"Cell_Cycle"))
We can also easily produce plots to identify PCs that correlate with experimental and QC variables of interest. The function ranks the principal components in decreasing order of \(R^2\) from a linear model regressing PC value against the variable of interest.
We can also produce a pairs plot of potential explanatory variables ranked by their median percentage of expression variance explained in a marginal (only one explanatory variable) linear model.
plotQC(example_sceset, type = "expl", method = "pairs", theme_size = 6)
In this small dataset, total_counts
and total_features
explain a very large proportion of the variance in feature expression. The proportion of variance that they explain for a real dataset should be much smaller (say 1-5%).
The default is to plot six most-associated principal components against the variable of interest.
p1 <- plotQC(example_sceset, type = "find-pcs", variable = "total_features",
plot_type = "pcs-vs-vars")
p2 <- plotQC(example_sceset, type = "find-pcs", variable = "Cell_Cycle",
plot_type = "pcs-vs-vars")
multiplot(p1, p2, cols = 2)
An alternative is to produce a pairs plot of the top five PCs.
plotQC(example_sceset, type = "find-pcs", variable = "total_features",
plot_type = "pairs-pcs")
plotQC(example_sceset, type = "find-pcs", variable = "Cell_Cycle",
plot_type = "pairs-pcs")
Combined with the excellent subsetting capabilities of the
SingleCellExperiment
class, we have convenient tools for conducting QC and
pre-processing (e.g. filtering) data for downstream analysis.
High levels of variability between cells characterise single-cell expression data. In almost all settings, many sources of unwanted variation should be accounted for before proceeding with more sophisticated analysis.
The size-factor normalisation method from the scran
package is tightly integrated with scater
and
strongly recommended as a first normalization of the data before investigating
other sources of variability.
Below, we show some of scater
’s capabilities for normalising data for downstream analyses.
We can use feature controls to help address differences between cells arising from different sets of transcripts being expressed and differences in library composition.
Important experimental variables and latent factors (if used) can be regressed out, so that normalised data has these effects removed.