1 Introduction

This document provides some examples of the data visualisation functions available in scater.

plotExpression: plot cell expression levels for one or more genes;
plotReducedDim: plot (and/or calculate) reduced dimension coordinates;
various QC plots, described in the QC vignette.

We will demonstrate on some example data generated below:

library(scater)
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
) 
example_sce <- normalize(example_sce)
example_sce

## class: SingleCellExperiment 
## dim: 2000 40 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):

2 Plots of expression values

The plotExpression function makes it easy to plot expression values for a subset of genes or features. This can be particularly useful for further examination of features identified from differential expression testing, pseudotime analysis or other analyses. By default, it uses expression values in the "logcounts" assay, but this can be changed through the exprs_values argument.

plotExpression(example_sce, rownames(example_sce)[1:6],
    x = "Mutation_Status", exprs_values = "logcounts")

Setting x will determine the covariate to be shown on the x-axis. This can be a field in the column metadata or the name of a feature (to obtain the expression profile across cells). Categorical covariates will yield grouped violins as shown above, with one panel per feature. By comparison, continuous covariates will generate a scatter plot in each panel, as shown below.

plotExpression(example_sce, rownames(example_sce)[1:6],
    x = "Gene_0001")

The points can also be coloured, shaped or resized by the column metadata or expression values.

plotExpression(example_sce, rownames(example_sce)[1:6],
    colour_by = "Cell_Cycle", shape_by = "Mutation_Status", 
    size_by = "Gene_0002")

For categorical x, we can also show the median expression level per group on the plot to summarise the distribution of expression values:

plotExpression(example_sce, rownames(example_sce)[7:12],
    x = "Mutation_Status", exprs_values = "counts", 
    colour = "Cell_Cycle", show_median = TRUE, 
    xlab = "Mutation Status", log = TRUE)

Directly plotting the gene expression without any x or other visual parameters will generate a set of grouped violin plots, coloured in an aesthetically pleasing manner.

plotExpression(example_sce, rownames(example_sce)[1:6])

3 Dimensionality reduction plots

3.1 Using the `reducedDims` slot

The SingleCellExperiment object has a reducedDims slot, where coordinates for reduced dimension representations of the cells can be stored. These can be accessed using the reducedDim and reducedDims functions, which are described in more detail in the SingleCellExperiment documentation. In the code below, we perform a principal components analysis (PCA) and store the results in the "PCA" slot.

example_sce <- runPCA(example_sce)
reducedDimNames(example_sce)

## [1] "PCA"

Any reduced dimension results can be plotted using the plotReducedDim function:

plotReducedDim(example_sce, use_dimred = "PCA", 
    colour_by = "Treatment", shape_by = "Mutation_Status")

We can also colour and size points by the expression of particular features:

plotReducedDim(example_sce, use_dimred = "PCA", 
    colour_by = "Gene_1000", size_by = "Gene_0500")

3.2 Generating PCA plots

The plotPCA function makes it easy to produce a PCA plot directly from a SingleCellExperiment object, which is useful for visualising the relationships between cells. The default plot shows the first two principal components, if "PCA" is already in the reducedDims slot.

plotPCA(example_sce)

If pre-existing "PCA" results are not present, the function will automatically call runPCA to generate the results prior to plotting. However, it may be preferable for users to call runPCA manually if multiple plots are to be generated from the same results. This avoids re-calculation of the reduced dimension results, which can be time-consuming for very large data sets.

By default, runPCA performs PCA on the log-counts using the 500 features with the most variable expression across all cells. The number of most-variable features used can be changed with the ntop argument. Alternatively, a specific set of features to use for PCA can be defined with the feature_set argument. This is demonstrated with the feature controls below, to identify technical factors of variation:.

example_sce2 <- runPCA(example_sce, 
    feature_set = rowData(example_sce)$is_feature_control)
plotPCA(example_sce2)

Multiple components can be plotted in a series of pairwise plots. When more than two components are plotted, the diagonal boxes in the scatter plot matrix show the density for each component.

example_sce <- runPCA(example_sce, ncomponents=20)
plotPCA(example_sce, ncomponents = 4, colour_by = "Treatment",
        shape_by = "Mutation_Status")

As shown above, various metadata variables can be used to define the colour, shape and size of points in the scatter plot. We can also use the colour and size of point in the plot to reflect feature expression values.

plotPCA(example_sce, colour_by = "Gene_0001", size_by = "Gene_1000")

3.3 Generating t-SNE plots

t-distributed stochastic neighbour embedding (t-SNE) is widely used for visualizing complex single-cell data sets. The same procedure described for PCA plots can be applied to generate t-SNE plots using plotTSNE, with coordinates obtained using runTSNE via the Rtsne package. We strongly recommend generating plots with different random seeds and perplexity values, to ensure that any conclusions are robust to different visualizations.

# Perplexity of 10 just chosen here arbitrarily. 
set.seed(1000)
example_sce <- runTSNE(example_sce, perplexity=10)
plotTSNE(example_sce, colour_by = "Gene_0001", size_by = "Gene_1000")

It is also possible to use the pre-existing PCA results as input into the t-SNE algorithm. This is useful as it improves speed by using a low-rank approximation of the expression matrix; and reduces random noise, by focusing on the major factors of variation. The code below uses the first 10 principal components to perform the t-SNE.

set.seed(1000)
example_sce <- runTSNE(example_sce, perplexity=10, use_dimred="PCA", n_dimred=10)
plotTSNE(example_sce, colour_by="Treatment")

Users can force plotTSNE to call runTSNE by setting rerun=TRUE, even when "TSNE" already exists in the input SingleCellExperiment object. Users can also pass parameters for runTSNE directly to plotTSNE via the run.args argument. The same applies for the other plot* and run* arguments.

3.4 Generating diffusion maps

Again, the same can be done for diffusion maps using plotDiffusionMap, with coordinates obtained using runDiffusionMap via the destiny package.

example_sce <- runDiffusionMap(example_sce)
plotDiffusionMap(example_sce, colour_by = "Gene_0001", size_by = "Gene_1000")

Session information

sessionInfo()

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.10.1               SingleCellExperiment_1.4.1 
##  [3] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
##  [5] BiocParallel_1.16.5         matrixStats_0.54.0         
##  [7] Biobase_2.42.0              GenomicRanges_1.34.0       
##  [9] GenomeInfoDb_1.18.1         IRanges_2.16.0             
## [11] S4Vectors_0.20.1            BiocGenerics_0.28.0        
## [13] ggplot2_3.1.0               knitr_1.21                 
## [15] BiocStyle_2.10.0           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6             destiny_2.12.0          
##  [3] xts_0.11-2               tools_3.5.2             
##  [5] R6_2.3.0                 HDF5Array_1.10.1        
##  [7] vipor_0.4.5              lazyeval_0.2.1          
##  [9] colorspace_1.3-2         nnet_7.3-12             
## [11] withr_2.1.2              sp_1.3-1                
## [13] smoother_1.1             tidyselect_0.2.5        
## [15] gridExtra_2.3            curl_3.2                
## [17] compiler_3.5.2           labeling_0.3            
## [19] bookdown_0.9             scales_1.0.0            
## [21] lmtest_0.9-36            DEoptimR_1.0-8          
## [23] robustbase_0.93-3        proxy_0.4-22            
## [25] stringr_1.3.1            digest_0.6.18           
## [27] foreign_0.8-71           rmarkdown_1.11          
## [29] rio_0.5.16               XVector_0.22.0          
## [31] pkgconfig_2.0.2          htmltools_0.3.6         
## [33] TTR_0.23-4               ggthemes_4.0.1          
## [35] rlang_0.3.0.1            readxl_1.2.0            
## [37] DelayedMatrixStats_1.4.0 bindr_0.1.1             
## [39] zoo_1.8-4                dplyr_0.7.8             
## [41] zip_1.0.0                car_3.0-2               
## [43] RCurl_1.95-4.11          magrittr_1.5            
## [45] GenomeInfoDbData_1.2.0   Matrix_1.2-15           
## [47] Rcpp_1.0.0               ggbeeswarm_0.6.0        
## [49] munsell_0.5.0            Rhdf5lib_1.4.2          
## [51] abind_1.4-5              viridis_0.5.1           
## [53] scatterplot3d_0.3-41     stringi_1.2.4           
## [55] yaml_2.2.0               carData_3.0-2           
## [57] MASS_7.3-51.1            zlibbioc_1.28.0         
## [59] rhdf5_2.26.2             Rtsne_0.15              
## [61] plyr_1.8.4               grid_3.5.2              
## [63] forcats_0.3.0            crayon_1.3.4            
## [65] lattice_0.20-38          haven_2.0.0             
## [67] cowplot_0.9.3            hms_0.4.2               
## [69] pillar_1.3.1             igraph_1.2.2            
## [71] boot_1.3-20              reshape2_1.4.3          
## [73] glue_1.3.0               evaluate_0.12           
## [75] laeken_0.4.6             data.table_1.11.8       
## [77] BiocManager_1.30.4       vcd_1.4-4               
## [79] VIM_4.7.0                cellranger_1.1.0        
## [81] gtable_0.2.0             purrr_0.2.5             
## [83] assertthat_0.2.0         xfun_0.4                
## [85] openxlsx_4.1.0           RcppEigen_0.3.3.5.0     
## [87] e1071_1.7-0              class_7.3-15            
## [89] viridisLite_0.3.0        tibble_2.0.0            
## [91] beeswarm_0.2.3           bindrcpp_0.2.2

Data visualisation methods in scater

4 January 2019

Package

Contents

1 Introduction

2 Plots of expression values

3 Dimensionality reduction plots

3.1 Using the `reducedDims` slot

3.2 Generating PCA plots

3.3 Generating t-SNE plots

3.4 Generating diffusion maps

Session information

Data visualisation methods in scater

4 January 2019

Package

Contents

1 Introduction

2 Plots of expression values

3 Dimensionality reduction plots

3.1 Using the reducedDims slot

3.2 Generating PCA plots

3.3 Generating t-SNE plots

3.4 Generating diffusion maps

Session information

3.1 Using the `reducedDims` slot