Contents

1 Overview

The scater package puts a focus on aiding with quality control (QC) and pre-processing of single-cell RNA-seq data before further downstream analysis. We see QC as consisting of three distinct steps:

  1. QC and filtering of cells
  2. QC and filtering of features (genes)
  3. QC of experimental variables

Following QC, we can proceed with data normalisation before downstream analysis and modelling. We will demonstrate on some example data generated below:

library(scater)
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts), 
    colData = sc_example_cell_info
)
example_sce
## class: SingleCellExperiment 
## dim: 2000 40 
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):

2 Calculating QC metrics

2.1 Introducing the calculateQCMetrics function

The calculateQCMetrics function computes a number of quality control metrics for each cell and feature, stored in the colData and rowData respectively. By default, the QC metrics are computed from the count data, but this can be changed through the exprs_values argument.

example_sce <- calculateQCMetrics(example_sce)
colnames(colData(example_sce))
##  [1] "Cell"                           "Mutation_Status"               
##  [3] "Cell_Cycle"                     "Treatment"                     
##  [5] "is_cell_control"                "total_features_by_counts"      
##  [7] "log10_total_features_by_counts" "total_counts"                  
##  [9] "log10_total_counts"             "pct_counts_in_top_50_features" 
## [11] "pct_counts_in_top_100_features" "pct_counts_in_top_200_features"
## [13] "pct_counts_in_top_500_features"
colnames(rowData(example_sce))
## [1] "is_feature_control"    "mean_counts"           "log10_mean_counts"    
## [4] "n_cells_by_counts"     "pct_dropout_by_counts" "total_counts"         
## [7] "log10_total_counts"

Control sets can be defined for features (e.g., spike-in transcripts, mitochondrial genes) or cells (e.g., empty wells, visually damaged cells). The function will subsequently compute metrics for each control set, e.g., the proportion of counts assigned to spike-in transcripts.

example_sce <- calculateQCMetrics(example_sce, 
    feature_controls = list(ERCC = 1:20, mito = 500:1000),
    cell_controls = list(empty = 1:5, damaged = 31:40))

all_col_qc <- colnames(colData(example_sce))
all_col_qc <- all_col_qc[grep("ERCC", all_col_qc)]

Metrics are also computed for the union of all control sets, and for the set of all features/cells that are not marked as controls. These sets are labelled as "feature_control" and "endogenous" for features, and "cell_control" and "non_control" for cells. Users should avoid using these names for their own sets.

2.2 Cell-level QC metrics

We refer users to the ?calculateQCMetrics for a full list of the computed cell-level metrics. However, we will describe some of the more popular ones here:

  • total_counts: total number of counts for the cell (i.e., the library size).
  • total_features_by_counts: the number of features for the cell that have counts above the detection limit (default of zero).
  • pct_counts_X: percentage of all counts that come from the feature control set named X.

If exprs_values is set to something other than "counts", the names of the metrics will be changed by swapping "counts" for whatever named assay was used.

2.3 Feature-level QC metrics

Feature-level metrics include:

  • mean_counts: the mean count of the gene/feature.
  • pct_dropout_by_counts: the percentage of cells with counts of zero for each gene.
  • pct_counts_Y: percentage of all counts that come from the cell control set named Y.

Again, if a different exprs_values was used, the names of the metrics will change correspondingly.

3 Producing diagnostic plots for QC

3.1 Examining the most expressed features

We look at a plot that shows the top 50 (by default) most-expressed features. Each row in the plot below corresponds to a gene, and each bar corresponds to the expression of a gene in a single cell. The circle indicates the median expression of each gene, with which genes are sorted. By default, “expression” is defined using the feature counts (if available), but other expression values can be used instead by changing exprs_values.

plotHighestExprs(example_sce, exprs_values = "counts")

We expect to see the “usual suspects”, i.e., mitochondrial genes, actin, ribosomal protein, MALAT1. A few spike-in transcripts may also be present here, though if all of the spike-ins are in the top 50, it suggests that too much spike-in RNA was added. A large number of pseudo-genes or predicted genes may indicate problems with alignment.

3.2 Frequency of expression as a function of the mean

Another useful plot is that of the frequency of expression (i.e., number of cells with non-zero expression) against the mean. These two metrics should be positively correlated with each other for most genes.

plotExprsFreqVsMean(example_sce)

Outliers from the trend may warrant further investigation. For example, alignment errors for pseudo-genes of highly-expressed genes will result in features with low means that are expressed in all cells. Conversely, PCR amplification biases (or the presence of rare populations) may result in genes with very high means that are expressed in very few cells.

3.3 Percentage of counts assigned to feature controls

A particularly useful plot for cell-level QC involves percentage of expression in feature controls against the total number of expressed features. These two metadata variables can be plotted against each other as shown below. We take advantage of the ggplot2 semantics to fine-tune the plot aesthetics and to add a smoothing curve:

plotColData(example_sce, x = "total_features_by_counts",
    y = "pct_counts_feature_control", colour = "Mutation_Status") +
    theme(legend.position = "top") +
    stat_smooth(method = "lm", se = FALSE, size = 2, fullrange = TRUE)

Well-behaved cells should have a large number of expressed features and and a low percentage of expression from feature controls. High percentage expression from feature controls and few expressed features are indicative of blank and failed cells.

3.4 Cumulative expression plot

The plotScater method plots the cumulative proportion of each cell’s library assigned to the top highest-expressed features (default 500). This type of plot visualizes differences in expression distributions for different cells, in the same manner as per-sample boxplots for microarray or bulk RNA-seq data. Cumulative expression plots are more effective for single-cell data where it is not easy to examine hundreds or thousands of boxplots at once.

With this function, we can split up the cells based on colData variables to examine differences in the expression distributions between cells. By default, the plot method will try to use count values for the plot, but other data available in the assays slot of the object can be used by specifying exprs_values.

plotScater(example_sce, block1 = "Mutation_Status", block2 = "Treatment",
     colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")

This approach can allow users to identify large differences in expression distributions across different experimental blocks (e.g., processing batches).

3.5 Plate position plot

For plate-based experiments, it is useful to see how expression or factors vary with the position of cell on the plate. This can be visualized using the plotPlatePosition function:

example_sce2 <- example_sce
example_sce2$plate_position <- paste0(
     rep(LETTERS[1:5], each = 8), 
     rep(formatC(1:8, width = 2, flag = "0"), 5)
)
plotPlatePosition(example_sce2, colour_by = "Gene_0001",
    by_exprs_values = "counts") 

Systematic trends in expression with the plate position may indicate that there were issues with processing. The same approach can be used with experimental factors to determine whether cells are appropriately randomized across the plate.

3.6 Other quality control plots

Two feature metadata variables can be easily plotted against each other using the plotFeatureData function:

plotRowData(example_sce, x = "n_cells_by_counts", y = "mean_counts")

The multiplot function also allows multiple plots to be generated on the same page, as demonstrated below.

p1 <- plotColData(example_sce, x = "total_counts", 
    y = "total_features_by_counts")
p2 <- plotColData(example_sce, x = "pct_counts_feature_control",
    y = "total_features_by_counts")
p3 <- plotColData(example_sce, x = "pct_counts_feature_control",
    y = "pct_counts_in_top_50_features")
multiplot(p1, p2, p3, cols = 3)

This is especially useful for side-by-side comparisons between control sets, as demonstrated below for the plot of highest-expressing features. A plot for non-control cells is shown on the left while the plot for the controls is shown on the right.

p1 <- plotHighestExprs(example_sce[, !example_sce$is_cell_control])
p2 <- plotHighestExprs(example_sce[, example_sce$is_cell_control])
multiplot(p1, p2, cols = 2)

4 Filtering the SingleCellExperiment

4.1 By cells

4.1.1 Column subsetting

Column subsetting of the SingeCellExperiment object will only retain the selected cells, thus removing low-quality or otherwise unwanted cells. We demonstrate below by retaining the first 40 cells. (This happens to be all the cells in this particular dataset, which are already known to be high-quality.)

example_sce <- example_sce[,1:40]

scater also provides a filter function, inspired by the function of the same name in the dplyr package and operating in exactly the same manner. This can be used to very conviently subset (i.e. filter) the cells of an SingleCellExperiment object based on its colData variables.

filter(example_sce, Treatment == "treat1")
## class: SingleCellExperiment 
## dim: 2000 27 
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(37): is_feature_control is_feature_control_ERCC ...
##   log10_total_counts_damaged pct_counts_damaged
## colnames(27): Cell_001 Cell_002 ... Cell_037 Cell_039
## colData names(51): Cell Mutation_Status ...
##   pct_counts_in_top_200_features_mito
##   pct_counts_in_top_500_features_mito
## reducedDimNames(0):
## spikeNames(0):

4.1.2 Identifying filtering thresholds

We can identify high-quality cells to retain by setting a fixed threshold on particular metrics. For example, we could retain only cells that have at least 100,000 total counts and at least 500 expressed features:

keep.total <- example_sce$total_counts > 1e5
keep.n <- example_sce$total_features_by_counts > 500
filtered <- example_sce[,keep.total & keep.n]
dim(filtered)
## [1] 2000   37

A more flexible way of choosing thresholds is through the isOutlier function. This defines the threshold at a certain number of median absolute deviations (MADs) away from the median. Values beyond this threshold are considered outliers and can be filtered out, assuming that they correspond to low-quality cells. Here, we define small outliers (using type="lower") for the log-total counts at 3 MADs from the median.

keep.total <- isOutlier(example_sce$total_counts, nmads=3, 
    type="lower", log=TRUE)
filtered <- example_sce[,keep.total]

The isOutlier approach adjusts to experiment-specific aspects of the data, e.g., sequencing depth, amount of spike-in RNA added, cell type. In contrast, a fixed threshold would require manual adjustment to account for changes to the experimental protocol or system. We refer readers to the simpleSingleCell workflow for more details.

4.1.3 Identifying outliers on all QC metrics

Outlier cells can also be identified by using the mvoutlier package on the QC metrics for all cells. This will identify cells that have substantially different QC metrics from the others, possibly corresponding to low-quality cells. We can visualize any outliers using a principal components plot as shown below:

example_sce <- runPCA(example_sce, use_coldata = TRUE,
    detect_outliers = TRUE)
plotReducedDim(example_sce, use_dimred="PCA_coldata")

Column subsetting can then be performed based on the $outlier slot, which indicates whether or not each cell has been designated as an outlier. Automatic outlier detection can be informative, but a close inspection of QC metrics and tailored filtering for the specifics of the dataset at hand is strongly recommended.

summary(example_sce$outlier)
##    Mode   FALSE 
## logical      40

4.2 By features

It is common to filter out low-abundance features prior to further analyses. This is easily achieved by row subsetting of the SingleCellExperiment object. In the example below, genes are only retained if they are expressed in four or more cells:

keep_feature <- nexprs(example_sce, byrow=TRUE) >= 4
example_sce <- example_sce[keep_feature,]
dim(example_sce)
## [1] 1753   40

Other filtering can be done using existing annotation. For example, ribosomal protein genes and predicted genes can be identified (and removed) using regular expressions or biotype information. Such genes are often (but not always) uninteresting for characterizing population heterogeneity.

5 Relationships between experimental factors and expression

We can investigate the relative importance of different explanatory factors with the plotExplanatoryVariables function. We compute the \(R^2\) for each factor in colData(example_sce) when fitting a linear model regressing expression values for each gene against that factor. This is best done on the log-expression values to reduce the effect of the mean on the variance - hence, we run normalize first.

example_sce <- normalize(example_sce)
plotExplanatoryVariables(example_sce)

Each line corresponds to one factor and represents the distribution of R2 values across all genes. Alternatively, we can choose a subset of factors to plot in this manner.

plotExplanatoryVariables(example_sce,
    variables = c("total_features_by_counts", "total_counts",
        "Mutation_Status", "Treatment", "Cell_Cycle"))

In this small dataset, total_counts and total_features_by_counts explain a very large proportion of the variance in feature expression. The proportion of variance that they explain for a real dataset should be much smaller (say 1-5%).

6 Removing technical biases

6.1 Scaling normalization

Scaling normalization accounts for cell-specific biases that scale expression up or down for all genes in a particular cell, e.g., coverage or capture efficiency. The simplest approach to scaling normalization defines the size factors from the scaled library sizes of all cells. is done so that the mean size factor is equal to unity, ensuring that the normalized values are on the same scale as the original counts.

sizeFactors(example_sce) <- librarySizeFactors(example_sce)
summary(sizeFactors(example_sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1463  0.6609  0.8112  1.0000  1.2533  2.7356

Log-transformed normalized expression values can then be computed with normalize, which stores the output in the "logcounts" slot.

example_sce <- normalize(example_sce)

While simple, library size normalization does not account for composition biases that are often present in high-throughput sequencing data. It also fails to account for differences in the biases affecting spike-in transcripts. We strongly suggest using the computeSumFactors and computeSpikeFactors functions from the scran package.

6.2 Batch correction

Batch correction accounts for systematic differences in expression between cells in different batches. Unlike scaling biases, these are usually constant across all cells in a given batch but different for each gene.

Batch effects can be regressed out by using the removeBatchEffect function from the limma package. This applies a linear model, usually on the log-expression values to avoid issues with the mean-variance relationship. To illustrate:

library(limma)
batch <- rep(1:2, each=20)
corrected <- removeBatchEffect(logcounts(example_sce), block=batch)
assay(example_sce, "corrected_logcounts") <- corrected

Factors of interest can be included in design to avoid regressing them out. This is necessary when they are not orthogonal to the block. However, this assumes that your model is fully specified, which may not be possible when the factors of interest are unknown. In such cases, an alternative method is to use the mnnCorrect approach from scran.

Session information

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] limma_3.40.2                scater_1.12.2              
##  [3] SingleCellExperiment_1.6.0  SummarizedExperiment_1.14.0
##  [5] DelayedArray_0.10.0         BiocParallel_1.18.0        
##  [7] matrixStats_0.54.0          Biobase_2.44.0             
##  [9] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0        
## [11] IRanges_2.18.0              S4Vectors_0.22.0           
## [13] BiocGenerics_0.30.0         ggplot2_3.1.1              
## [15] knitr_1.23                  BiocStyle_2.12.0           
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.15               ggbeeswarm_0.6.0        
##   [3] colorspace_1.4-1         mvoutlier_2.0.9         
##   [5] RcppEigen_0.3.3.5.0      modeltools_0.2-22       
##   [7] class_7.3-15             rio_0.5.16              
##   [9] mclust_5.4.3             XVector_0.24.0          
##  [11] pls_2.7-1                BiocNeighbors_1.2.0     
##  [13] proxy_0.4-23             cvTools_0.3.2           
##  [15] flexmix_2.3-15           mvtnorm_1.0-10          
##  [17] ranger_0.11.2            splines_3.6.0           
##  [19] sROC_0.1-2               robustbase_0.93-5       
##  [21] robCompositions_2.1.0    kernlab_0.9-27          
##  [23] cluster_2.0.9            BiocManager_1.30.4      
##  [25] rrcov_1.4-7              compiler_3.6.0          
##  [27] assertthat_0.2.1         Matrix_1.2-17           
##  [29] lazyeval_0.2.2           BiocSingular_1.0.0      
##  [31] htmltools_0.3.6          tools_3.6.0             
##  [33] rsvd_1.0.0               igraph_1.2.4.1          
##  [35] gtable_0.3.0             glue_1.3.1              
##  [37] GenomeInfoDbData_1.2.1   reshape2_1.4.3          
##  [39] dplyr_0.8.1              ggthemes_4.2.0          
##  [41] Rcpp_1.0.1               carData_3.0-2           
##  [43] cellranger_1.1.0         zCompositions_1.2.0     
##  [45] sgeostat_1.0-27          fpc_2.2-1               
##  [47] DelayedMatrixStats_1.6.0 lmtest_0.9-37           
##  [49] xfun_0.7                 laeken_0.5.0            
##  [51] stringr_1.4.0            openxlsx_4.1.0          
##  [53] irlba_2.3.3              DEoptimR_1.0-8          
##  [55] zlibbioc_1.30.0          MASS_7.3-51.4           
##  [57] zoo_1.8-5                scales_1.0.0            
##  [59] VIM_4.8.0                hms_0.4.2               
##  [61] RColorBrewer_1.1-2       yaml_2.2.0              
##  [63] curl_3.3                 NADA_1.6-1              
##  [65] gridExtra_2.3            reshape_0.8.8           
##  [67] stringi_1.4.3            pcaPP_1.9-73            
##  [69] e1071_1.7-1              destiny_2.14.0          
##  [71] TTR_0.23-4               boot_1.3-22             
##  [73] zip_2.0.2                truncnorm_1.0-8         
##  [75] prabclus_2.2-7           rlang_0.3.4             
##  [77] pkgconfig_2.0.2          bitops_1.0-6            
##  [79] evaluate_0.13            lattice_0.20-38         
##  [81] purrr_0.3.2              labeling_0.3            
##  [83] cowplot_0.9.4            tidyselect_0.2.5        
##  [85] GGally_1.4.0             plyr_1.8.4              
##  [87] magrittr_1.5             bookdown_0.10           
##  [89] R6_2.4.0                 pillar_1.4.0            
##  [91] haven_2.1.0              foreign_0.8-71          
##  [93] withr_2.1.2              xts_0.11-2              
##  [95] survival_2.44-1.1        scatterplot3d_0.3-41    
##  [97] abind_1.4-5              RCurl_1.95-4.12         
##  [99] sp_1.3-1                 nnet_7.3-12             
## [101] tibble_2.1.1             crayon_1.3.4            
## [103] car_3.0-2                rmarkdown_1.13          
## [105] viridis_0.5.1            grid_3.6.0              
## [107] readxl_1.3.1             data.table_1.12.2       
## [109] forcats_0.4.0            diptest_0.75-7          
## [111] vcd_1.4-4                digest_0.6.19           
## [113] tidyr_0.8.3              munsell_0.5.0           
## [115] beeswarm_0.2.3           viridisLite_0.3.0       
## [117] smoother_1.1             vipor_0.4.5