Contents

This document gives an introduction to and overview of the functionality of the scater package.

The scater package is contains tools to help with the analysis of single-cell transcriptomic data, with the focus on RNA-seq data. The package features:

Future versions of scater may also incorporate:

To get up and running as quickly as possible, see the Quick Start section below. For see the various in-depth sections on various aspects of the functionality that follow.

1 Quick Start

Assuming you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.), then we first need to form an SCESet object containing the data. An SCESet object is the basic data container that we use in scater.

Here we use the example data provided with the package, which gives us two objects, a matrix of counts and a dataframe with information about the cells we are studying:

library(scater, quietly = TRUE)
data("sc_example_counts")
data("sc_example_cell_info")

We use these objects to form an SCESet object containing all of the necessary information for our analysis:

pd <- new("AnnotatedDataFrame", data = sc_example_cell_info)
rownames(pd) <- pd$Cell
example_sceset <- newSCESet(countData = sc_example_counts, phenoData = pd)

Subsetting is very convenient with this class. For example, we can filter out features (genes) that are not expressed in any cells:

keep_feature <- rowSums(exprs(example_sceset) > 0) > 0
example_sceset <- example_sceset[keep_feature,]

Now we have the expression data neatly stored in a structure that can be used for lots of exciting analyses.

It is straight-forward to compute many quality control metrics:

example_sceset <- calculateQCMetrics(example_sceset, feature_controls = 1:40)

Now you can play around with your data using the graphical user interface (GUI), which opens an interactive dashboard in your browser!

scater_gui(example_sceset)

Many plotting functions are available for visualising the data:

More detail on all plotting methods is given throughout the vignette below.

Visualisations can highlight features and cells to be filtered out, which can be done easily with the subsetting capabilities of scater.

The QC plotting functions also enable the identification of important experimental variables, which can be conditioned out in the data normalisation step.

After QC and data normalisation (methods are available in scater), the dataset is ready for downstream statistical modeling.

2 A note on terminology

The capabilities of scater are built on top of Bioconductor’s Biobase, which ties us to some specific terminology. Some of this terminology may be unfamiliar or seem peculiar for the single-cell RNA-seq setting scater is designed for.

In Bioconductor terminology we assay numerous “features” for a number of “samples”.

Features, in the context of scater, correspond most commonly to genes or transcripts, but could be any general genomic or transcriptomic regions (e.g. exon) of interest for which we take measurements. Thus, a command such as featureNames, returns the names of the features defined for an SCESet object, which in typical scater usage could correspond to gene IDs. In much of what follows, it may be more intuitive to mentally replace “feature” with “gene” or “transcript” (depending on the context of the study) wherever “feature” appears.

In the scater context, “samples” refer to individual cells that we have assayed. This differs from common usage of “sample” in other contexts, where we might usually use “sample” to refer to an individual subject, a biological replicate or similar. A “sample” in this sense in scater may be referred to as a “block” in the more classical statistical sense. Within a “block” (e.g. individual) we may have assayed numerous cells. Thus, the function sampleNames, when applied to an SCESet object returns the cell IDs.

3 The SCESet class and methods

In scater we organise single-cell expression data in objects of the SCESet class. The class inherits the Bioconductor ExpressionSet class, which provides a common interface familiar to those who have analyzed microarray experiments with Bioconductor. The class requires three input files:

  1. exprs, a numeric matrix of expression values, where rows are features, and columns are cells
  2. phenoData, an AnnotatedDataFrame object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)
  3. featureData, an AnnotatedDataFrame object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, gc content, etc.

For more details about other features inherited from Bioconductor’s ExpressionSet class, type ?ExpressionSet at the R prompt.

The requirements for the SCESet class (as with other S4 classes in R and Bioconductor) are strict. The idea is that strictness with generating a valid class object ensures that downstream methods applied to the class will work reliably. Thus, the expression value matrix must have the same number of columns as the phenoData object has rows, and it must have the same number of rows as the featureData dataframe has rows. Row names of the phenoData object need to match the column names of the expression matrix. Row names of the featureData object need to match row names of the expression matrix.

You can create a new SCESet object using count data as follows. In this case, the exprs slot in the object will be generated as log2(counts-per-million) using the cpm function from edgeR, with a “prior count” value of 1:

pd <- new("AnnotatedDataFrame", data = sc_example_cell_info)
rownames(pd) <- pd$Cell
gene_df <- data.frame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
fd <- new("AnnotatedDataFrame", data = gene_df)
example_sceset <- newSCESet(countData = sc_example_counts, phenoData = pd,
                            featureData = fd)
example_sceset
## SCESet (storageMode: lockedEnvironment)
## assayData: 2000 features, 40 samples 
##   element names: counts, exprs 
## protocolData: none
## phenoData
##   sampleNames: Cell_001 Cell_002 ... Cell_040 (40 total)
##   varLabels: Cell Mutation_Status Cell_Cycle Treatment
##   varMetadata: labelDescription
## featureData
##   featureNames: Gene_0001 Gene_0002 ... Gene_2000 (2000 total)
##   fvarLabels: Gene
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

We can also make an SCESet object with just a matrix of expression values (no count data required; here we compute counts-per-million from the example_sceset defined above and transform to the log2 scale) like this:

example2 <- newSCESet(exprsData = log2(calculateCPM(example_sceset) + 1))
## Warning in .local(object, ...): 'sizeFactors' have not been set
## Warning in calculateCPM(example_sceset): size factors requested but not
## specified, using library sizes instead
## Warning in newSCESet(exprsData = log2(calculateCPM(example_sceset) + 1)):
## 'logExprsOffset' should be set manually for non-count data
## Warning in newSCESet(exprsData = log2(calculateCPM(example_sceset) + 1)):
## 'lowerDetectionLimit' should be set manually for log-expression values
pData(example2)
## data frame with 0 columns and 40 rows
fData(example2)
## data frame with 0 columns and 2000 rows

We have accessor functions to access elements of the SCESet object:

counts(example2)[1:3, 1:6]
## NULL
exprs(example2)[1:3, 1:6]
##           Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001   0.0000 9.551765 2.918629  0.00000  0.00000 0.000000
## Gene_0002  10.3943 8.633332 3.438547 12.44072 11.46530 9.852595
## Gene_0003   0.0000 0.000000 0.000000  0.00000 10.53582 0.000000
get_exprs(example_sceset, "counts")[1:3, 1:6]
##           Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001        0      123        2        0        0        0
## Gene_0002      575       65        3     1561     2311      160
## Gene_0003        0        0        0        0     1213        0

Similarly we can assign a new (say, transformed) expression matrix to an SCESet object using set_exprs as follows:

set_exprs(example2, "counts") <- counts(example_sceset)

If you later find some count data that you want to add to the SCESet object, then you can add it easily with the counts assignment function directly (and the same goes for exprs, tpm, cpm, fpkm and versions of these with the prefix “norm_”):

counts(example2) <- sc_example_counts
example2
## SCESet (storageMode: lockedEnvironment)
## assayData: 2000 features, 40 samples 
##   element names: counts, exprs 
## protocolData: none
## phenoData: none
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
counts(example2)[1:3, 1:6]
##           Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001        0      123        2        0        0        0
## Gene_0002      575       65        3     1561     2311      160
## Gene_0003        0        0        0        0     1213        0

Handily, it is also easy to replace other data in slots of the SCESet object using generic accessor and replacement functions.

gene_df <- data.frame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
fd <- new("AnnotatedDataFrame", data = gene_df)
## replace featureData
fData(example_sceset) <- fd
## replace phenotype data
pData(example_sceset) <- pd
## replace expression data to be used
exprs(example_sceset) <- log2(calculateCPM(example_sceset) + 1)
## Warning in .local(object, ...): 'sizeFactors' have not been set
## Warning in calculateCPM(example_sceset): size factors requested but not
## specified, using library sizes instead

It is possible to get an overall view of the dataset by using the plot method available for SCESet objects. This method plots the cumulative proportion of each cell’s library that is accounted for by the top highest-expressed features (by default showing the cumulative proportion across the top 500 features).

This type of plot gives an overall idea of differences in expression distributions for different cells. It is used in the same way as per-sample boxplots are for microarray or bulk RNA-seq data. Due to the large numbers of zeroes in expression values for single-cell RNA-seq data, boxplots are not as useful, so instead we focus on the contributions from the most expressed features for each cell.

With this function, we can split up the cells based on phenoData variables to get a finer-grained look at differences between cells. By default, the plot method will try to use transcripts-per-million values for the plot, but if these are not present in the SCESet object, then the values in exprs(object) will be used. Other expression values can be used, specified with the exprs_values argument

plot(example_sceset, block1 = "Mutation_Status", block2 = "Treatment",
     colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")

This sort of approach can help to pick up large differences in expression distributions across different experimental blocks (e.g. processing batches or similar.)

4 Plots of expression values

In scater, the plotExpression function makes it easy to plot expression values for a subset of genes or features. This can be particularly useful when investigating the some features identified as being of interest from differential expression testing or other means.

plotExpression(example_sceset, rownames(example_sceset)[1:6],
               x = "Mutation_Status", exprs_values = "exprs", colour = "Treatment")