scater
: Single-cell analysis toolkit for expression in Rscater 1.4.0
This document gives an introduction to and overview of the functionality of the scater
package.
The scater
package is contains tools to help with the analysis of single-cell transcriptomic data, with the focus on RNA-seq data. The package features:
kallisto
and ‘Salmon’ for rapid quantification of transcript abundance and tight integration with scater
;Future versions of scater
may also incorporate:
To get up and running as quickly as possible, see the Quick Start section below. For see the various in-depth sections on various aspects of the functionality that follow.
Assuming you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.), then we first need to form an SCESet
object containing the data. An SCESet
object is the basic data container that we use in scater
.
Here we use the example data provided with the package, which gives us two objects, a matrix of counts and a dataframe with information about the cells we are studying:
library(scater, quietly = TRUE)
data("sc_example_counts")
data("sc_example_cell_info")
We use these objects to form an SCESet
object containing all of the necessary information for our analysis:
pd <- new("AnnotatedDataFrame", data = sc_example_cell_info)
rownames(pd) <- pd$Cell
example_sceset <- newSCESet(countData = sc_example_counts, phenoData = pd)
Subsetting is very convenient with this class. For example, we can filter out features (genes) that are not expressed in any cells:
keep_feature <- rowSums(exprs(example_sceset) > 0) > 0
example_sceset <- example_sceset[keep_feature,]
Now we have the expression data neatly stored in a structure that can be used for lots of exciting analyses.
It is straight-forward to compute many quality control metrics:
example_sceset <- calculateQCMetrics(example_sceset, feature_controls = 1:40)
Now you can play around with your data using the graphical user interface (GUI), which opens an interactive dashboard in your browser!
scater_gui(example_sceset)
Many plotting functions are available for visualising the data:
plot
: a plot method exists for SCESet
objects, which gives an overview of expression across cells.plotQC
: various methods are available for producing QC diagnostic plots.plotPCA
: produce a principal components plot for the cells.plotTSNE
: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.plotDiffusionMap
: produce a diffusion map (reduced dimension) plot for the cells.plotMDS
: produce a multi-dimensional scaling plot for the cells.plotReducedDim
: plot a reduced-dimension representation of the cells.plotExpression
: plot expression levels for a defined set of features.plotPlatePosition
: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level.plotPhenoData
: plot cell metadata and QC metrics.plotFeatureData
: plot feature metadata and QC metrics.More detail on all plotting methods is given throughout the vignette below.
Visualisations can highlight features and cells to be filtered out, which can be done easily with the subsetting capabilities of scater.
The QC plotting functions also enable the identification of important experimental variables, which can be conditioned out in the data normalisation step.
After QC and data normalisation (methods are available in scater
), the dataset is ready for downstream statistical modeling.
The capabilities of scater
are built on top of Bioconductor’s Biobase
, which ties us to some specific terminology. Some of this terminology may be unfamiliar or seem peculiar for the single-cell RNA-seq setting scater
is designed for.
In Bioconductor terminology we assay numerous “features” for a number of “samples”.
Features, in the context of scater
, correspond most commonly to genes or transcripts, but could be any general genomic or transcriptomic regions (e.g. exon) of interest for which we take measurements. Thus, a command such as featureNames
, returns the names of the features defined for an SCESet
object, which in typical scater
usage could correspond to gene IDs. In much of what follows, it may be more intuitive to mentally replace “feature” with “gene” or “transcript” (depending on the context of the study) wherever “feature” appears.
In the scater
context, “samples” refer to individual cells that we have assayed. This differs from common usage of “sample” in other contexts, where we might usually use “sample” to refer to an individual subject, a biological replicate or similar. A “sample” in this sense in scater
may be referred to as a “block” in the more classical statistical sense. Within a “block” (e.g. individual) we may have assayed numerous cells. Thus, the function sampleNames
, when applied to an SCESet
object returns the cell IDs.
In scater
we organise single-cell expression data in objects of the SCESet
class. The class inherits the Bioconductor ExpressionSet
class, which provides a common interface familiar to those who have analyzed microarray experiments with Bioconductor. The class requires three input files:
exprs
, a numeric matrix of expression values, where rows are features, and columns are cellsphenoData
, an AnnotatedDataFrame
object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)featureData
, an AnnotatedDataFrame
object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, gc content, etc.For more details about other features inherited from Bioconductor’s ExpressionSet
class, type ?ExpressionSet
at the R prompt.
The requirements for the SCESet
class (as with other S4 classes in R and Bioconductor) are strict. The idea is that strictness with generating a valid class object ensures that downstream methods applied to the class will work reliably. Thus, the expression value matrix must have the same number of columns as the phenoData
object has rows, and it must have the same number of rows as the featureData
dataframe has rows. Row names of the phenoData
object need to match the column names of the expression matrix. Row names of the featureData
object need to match row names of the expression matrix.
You can create a new SCESet
object using count data as follows. In this case, the exprs
slot in the object will be generated as log2(counts-per-million) using the cpm
function from edgeR
, with a “prior count” value of 1:
pd <- new("AnnotatedDataFrame", data = sc_example_cell_info)
rownames(pd) <- pd$Cell
gene_df <- data.frame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
fd <- new("AnnotatedDataFrame", data = gene_df)
example_sceset <- newSCESet(countData = sc_example_counts, phenoData = pd,
featureData = fd)
example_sceset
## SCESet (storageMode: lockedEnvironment)
## assayData: 2000 features, 40 samples
## element names: counts, exprs
## protocolData: none
## phenoData
## sampleNames: Cell_001 Cell_002 ... Cell_040 (40 total)
## varLabels: Cell Mutation_Status Cell_Cycle Treatment
## varMetadata: labelDescription
## featureData
## featureNames: Gene_0001 Gene_0002 ... Gene_2000 (2000 total)
## fvarLabels: Gene
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
We can also make an SCESet
object with just a matrix of expression values (no count data required; here we compute counts-per-million from the example_sceset
defined above and transform to the log2 scale) like this:
example2 <- newSCESet(exprsData = log2(calculateCPM(example_sceset) + 1))
## Warning in .local(object, ...): 'sizeFactors' have not been set
## Warning in calculateCPM(example_sceset): size factors requested but not
## specified, using library sizes instead
## Warning in newSCESet(exprsData = log2(calculateCPM(example_sceset) + 1)):
## 'logExprsOffset' should be set manually for non-count data
## Warning in newSCESet(exprsData = log2(calculateCPM(example_sceset) + 1)):
## 'lowerDetectionLimit' should be set manually for log-expression values
pData(example2)
## data frame with 0 columns and 40 rows
fData(example2)
## data frame with 0 columns and 2000 rows
We have accessor functions to access elements of the SCESet
object:
counts(object)
: returns the matrix of read counts. As you can see above, if no counts are defined for the object, then the counts matrix slot is simpy NULL
.counts(example2)[1:3, 1:6]
## NULL
exprs(object)
: returns the matrix of feature expression values. Typically these should be log2(counts-per-million) values or log2(reads-per-kilobase-per-million-mapped), appropriately normalised of course. The package will generally assume that these are the values to use for expression.exprs(example2)[1:3, 1:6]
## Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001 0.0000 9.551765 2.918629 0.00000 0.00000 0.000000
## Gene_0002 10.3943 8.633332 3.438547 12.44072 11.46530 9.852595
## Gene_0003 0.0000 0.000000 0.000000 0.00000 10.53582 0.000000
get_exprs
function. We simply supply the function with the SCESet
object and the name of the desired expression matrix:get_exprs(example_sceset, "counts")[1:3, 1:6]
## Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001 0 123 2 0 0 0
## Gene_0002 575 65 3 1561 2311 160
## Gene_0003 0 0 0 0 1213 0
Similarly we can assign a new (say, transformed) expression matrix to an SCESet
object using set_exprs
as follows:
set_exprs(example2, "counts") <- counts(example_sceset)
If you later find some count data that you want to add to the SCESet
object, then you can add it easily with the counts
assignment function directly (and the same goes for exprs
, tpm
, cpm
, fpkm
and versions of these with the prefix “norm_”):
counts(example2) <- sc_example_counts
example2
## SCESet (storageMode: lockedEnvironment)
## assayData: 2000 features, 40 samples
## element names: counts, exprs
## protocolData: none
## phenoData: none
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
counts(example2)[1:3, 1:6]
## Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001 0 123 2 0 0 0
## Gene_0002 575 65 3 1561 2311 160
## Gene_0003 0 0 0 0 1213 0
Handily, it is also easy to replace other data in slots of the SCESet
object using generic accessor and replacement functions.
gene_df <- data.frame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
fd <- new("AnnotatedDataFrame", data = gene_df)
## replace featureData
fData(example_sceset) <- fd
## replace phenotype data
pData(example_sceset) <- pd
## replace expression data to be used
exprs(example_sceset) <- log2(calculateCPM(example_sceset) + 1)
## Warning in .local(object, ...): 'sizeFactors' have not been set
## Warning in calculateCPM(example_sceset): size factors requested but not
## specified, using library sizes instead
It is possible to get an overall view of the dataset by using the plot
method available for SCESet
objects. This method plots the cumulative proportion of each cell’s library that is accounted for by the top highest-expressed features (by default showing the cumulative proportion across the top 500 features).
This type of plot gives an overall idea of differences in expression distributions for different cells. It is used in the same way as per-sample boxplots are for microarray or bulk RNA-seq data. Due to the large numbers of zeroes in expression values for single-cell RNA-seq data, boxplots are not as useful, so instead we focus on the contributions from the most expressed features for each cell.
With this function, we can split up the cells based on phenoData variables to get a finer-grained look at differences between cells. By default, the plot method will try to use transcripts-per-million values for the plot, but if these are not present in the SCESet
object, then the values in exprs(object)
will be used. Other expression values can be used, specified with the exprs_values
argument
plot(example_sceset, block1 = "Mutation_Status", block2 = "Treatment",
colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")
This sort of approach can help to pick up large differences in expression distributions across different experimental blocks (e.g. processing batches or similar.)
In scater
, the plotExpression
function makes it easy to plot expression values for a subset of genes or features. This can be particularly useful when investigating the some features identified as being of interest from differential expression testing or other means.
plotExpression(example_sceset, rownames(example_sceset)[1:6],
x = "Mutation_Status", exprs_values = "exprs", colour = "Treatment")