1 Loading Processed Single-Cell Data

For the demonstration of escape, we will use the example “pbmc_small” data from Seurat and also generate a SingleCellExperiment object from it.

suppressPackageStartupMessages(library(escape))
suppressPackageStartupMessages(library(dittoSeq))
suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(SeuratObject))
pbmc_small <- get("pbmc_small")
data("pbmc_small")
pbmc_small <- suppressMessages(UpdateSeuratObject(pbmc_small))
sce <- as.SingleCellExperiment(pbmc_small)

2 Getting Gene Sets

The first step in the process of performing gene set enrichment analysis is identifying the gene sets we would like to use. The function getGeneSets() allows users to isolate a whole or multiple libraries from a list of GSEABase GeneSet objects. We can do this for gene set collections from the built-in Molecular Signature Database by setting the parameter library equal to the collection(s) of interest. For multiple libraries, just set library = c("C2", "C5"). To check which collections are available, use msigdbr::msigdbr_collections(). The gs_cat entry indicates the individal gene set collection names (e.g. “C2”). Individual pathways/gene sets can be isolated from the selected libraries selected, by setting gene.sets = the name(s) of the gene sets of interest. For an overview of the gene sets and collections, see https://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp. If you are using a species other than Homo sapiens, set the species parameter to the appropriate scientific name to ensure the gene nomenclature matches your expression data. To check which species are available, use msigdbr::msigdbr_species().

## We'll use the HALLMARK gene set collection ("H")
GS.hallmark <- getGeneSets(library = "H")

If a user would like to expand beyond the GSEA Molecular Signature Database for pathway analysis, the enrichment function enrichIt() can use any list of GeneSet objects from the GSEABase R package. An excellent vignette exists on how to create GeneSet objects to be used in the enrichment analysis. Otherwise, you may also provide a simple named list of gene sets of interest. See ?enrichIt() for details.

3 Enrichment

The next step is performing the enrichment on the RNA count data. The function enrichIt() can handle either a matrix of raw count data or will pull that data directly from a SingleCellExperiment or Seurat object. The gene.sets parameter in the function is the GeneSets, either generated from getGeneSets() or from the user. The enrichment scores will be calculated across all individual cells and groups is the n size to break the enrichment by while the cores is the number of cores to perform in parallel during the enrichment calculation too.

Important Note: This is computationally intensive and is highly dependent on the number of cells and the number of gene sets included.

ES.seurat <- enrichIt(obj = pbmc_small, gene.sets = GS.hallmark, groups = 1000, cores = 2)
## [1] "Using sets of 1000 cells. Running 1 times."
## Setting parallel calculations through a SnowParam back-end
## with workers=2 and tasks=100.
## Estimating ssGSEA scores for 43 gene sets.
ES.sce <- enrichIt(obj = sce, gene.sets = GS.hallmark, groups = 1000, cores = 2)
## [1] "Using sets of 1000 cells. Running 1 times."
## Setting parallel calculations through a SnowParam back-end
## with workers=2 and tasks=100.
## Estimating ssGSEA scores for 43 gene sets.

We can then easily add these results back to our Seurat or SCE object.

## if working with a Seurat object
pbmc_small <- Seurat::AddMetaData(pbmc_small, ES.seurat)

## if working with an SCE object
met.data <- merge(colData(sce), ES.sce, by = "row.names", all=TRUE)
row.names(met.data) <- met.data$Row.names
met.data$Row.names <- NULL
colData(sce) <- met.data

4 Visualizations

The easiest way to generate almost any visualization for single cell data is via dittoSeq, which is an extremely flexible visualization package for both bulk and single-cell RNA-seq data that works very well for both expression data and metadata. Better yet, it can handle both SingleCellExperiment and Seurat objects.

To keep things consistent, we’ll define a pleasing color scheme.

colors <- colorRampPalette(c("#0348A6", "#7AC5FF", "#C6FDEC", "#FFB433", "#FF4B20"))

4.1 Heatmaps

A simple way to approach visualizations for enrichment results is the heatmap, especially if you are using a number of gene sets or libraries.

dittoHeatmap(pbmc_small, genes = NULL, metas = names(ES.seurat), 
             annot.by = "groups", 
             fontsize = 7, 
             cluster_cols = TRUE,
             heatmap.colors = colors(50))

A user can also produce a heatmap with select gene sets by providing specific names to the metas parameter. For example, we can isolated gene sets involved in apoptosis and DNA damage.

dittoHeatmap(sce, genes = NULL, 
             metas = c("HALLMARK_APOPTOSIS", "HALLMARK_DNA_REPAIR", "HALLMARK_P53_PATHWAY"), 
             annot.by = "groups", 
             fontsize = 7,
             heatmap.colors = colors(50))

4.2 Violin Plots

Another way to visualize a subset of gene set enrichment would be to graph the distribution of enrichment using violin, jitter, boxplot, or ridgeplots. We can also compare between categorical variables using the group.by parameter.

multi_dittoPlot(sce, vars = c("HALLMARK_APOPTOSIS", "HALLMARK_DNA_REPAIR", "HALLMARK_P53_PATHWAY"), 
                group.by = "groups", plots = c("jitter", "vlnplot", "boxplot"), 
                ylab = "Enrichment Scores", 
                theme = theme_classic() + theme(plot.title = element_text(size = 10)))

4.3 Hex Density Enrichment Plots

We can also compare the distribution of enrichment scores of 2 distinct gene sets across all single cells using the dittoScatterHex() function. Here, we use our SingleCellExperiment object with results of enrichIt() and specify gene sets to the x.var and y.var parameters to produce a density plot. We can also add contours to the plot, by passing do.contour = TRUE.

dittoScatterHex(sce, x.var = "HALLMARK_DNA_REPAIR", 
                    y.var = "HALLMARK_MTORC1_SIGNALING", 
                    do.contour = TRUE) + 
        scale_fill_gradientn(colors = colors(11)) 

We can also separate the graph using the split.by parameter, allowing for the direct comparison of categorical variables.

dittoScatterHex(sce, x.var = "HALLMARK_DNA_REPAIR", 
                y.var = "HALLMARK_MTORC1_SIGNALING", 
                do.contour = TRUE,
                split.by = "groups")  + 
        scale_fill_gradientn(colors = colors(11)) 

4.4 Ridge Plots

Another distribution visualization is using a Ridge Plot from the ggridges R package. This allows the user to incorporate categorical variables to seperate the enrichment scores along the y-axis in addition to the faceting by categorical variables.

Like above, we can explore the distribution of the “HALLMARK_DNA_REPAIR” gene set between groups by calling ridgeEnrichment() with the ES2 object. We specify group = letters.idents, which will seperate groups on the y-axis. We can also add a rug plot (add.rug = TRUE) to look at the discrete sample placement along the enrichment ridge plot.

## Seurat object example
ES2 <- data.frame(pbmc_small[[]], Idents(pbmc_small))
colnames(ES2)[ncol(ES2)] <- "cluster"

## plot
ridgeEnrichment(ES2, gene.set = "HALLMARK_DNA_REPAIR", group = "cluster", add.rug = TRUE)

In addition to the separation of letter.idents, we can also use ridgeEnrichment() for better granualarity of multiple variables. For example, instead of looking at the difference just between “Type”, we can set group = "cluster" and then facet = "letter.idents". This gives a visualization of the enrichment of DNA Repair by cluster and Type.

ridgeEnrichment(ES2, gene.set = "HALLMARK_DNA_REPAIR", group = "cluster", 
                facet = "letter.idents", add.rug = TRUE)

4.5 Split Violin Plots

Another distribution visualization is a violin plot, which we seperate and directly compare using a binary classification. Like ridgeEnrichment(), this allows for greater use of categorical variables. For splitEnrichment(), the output will be two halves of a violin plot bassed on the split parameter with a central boxplot with the relative distribution accross all samples.

Like above, we can explore the distribution of the “HALLMARK_DNA_REPAIR” gene set between groups by calling splitEnrichment() with the ES2 object and split = "groups". We can also explore the cluster distribution by assigning the x-axis = "cluster".

splitEnrichment(ES2, split = "groups", gene.set = "HALLMARK_DNA_REPAIR")

splitEnrichment(ES2, x.axis = "cluster", split = "groups", gene.set = "HALLMARK_DNA_REPAIR")


5 Expanded Analysis

One frustration of Gene Set Enrichment is trying to make sense of the values. In order to move away from just selecting pathways that may be of interest, escape offers the ability to performPCA() on the enrichment scores. Like the other functions, we will need to provide the output of enrichIt() to the enriched parameter and the groups to include for later graphing.

PCA <- performPCA(enriched = ES2, groups = c("groups", "cluster"))
pcaEnrichment(PCA, PCx = "PC1", PCy = "PC2", contours = TRUE)

We can also look at the pcaEnrichment() output separated by categorical factors using the facet parameter, for example using the cluster assignment.

pcaEnrichment(PCA, PCx = "PC1", PCy = "PC2", contours = FALSE, facet = "cluster") 

We can more closely examine the construction of the PCA by looking at the contribution of each gene set to the respective principal component using masterPCAPlot() with the same input as above with the pcaEnrichment(). We can also control the number of gene sets plotted with top.contribution.

masterPCAPlot(ES2, PCx = "PC1", PCy = "PC2", top.contribution = 10)

5.1 Signficance

We can also look for significant differences using linear modeling from the limma package, Welch’s T test or one-way ANOVA using getSignificance(). For this, we need to assign a group parameter and the type of fit (either linear.model, T.test, or ANOVA).

output <- getSignificance(ES2, group = "cluster", fit = "linear.model")

6 Support

If you have any questions, comments or suggestions, feel free to visit the github repository or email me.

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SeuratObject_4.0.1          Seurat_4.0.1               
##  [3] SingleCellExperiment_1.14.0 SummarizedExperiment_1.22.0
##  [5] Biobase_2.52.0              GenomicRanges_1.44.0       
##  [7] GenomeInfoDb_1.28.0         IRanges_2.26.0             
##  [9] S4Vectors_0.30.0            BiocGenerics_0.38.0        
## [11] MatrixGenerics_1.4.0        matrixStats_0.58.0         
## [13] dittoSeq_1.4.0              ggplot2_3.3.3              
## [15] escape_1.2.0                BiocStyle_2.20.0           
## 
## loaded via a namespace (and not attached):
##   [1] snow_0.4-3                plyr_1.8.6               
##   [3] igraph_1.2.6              lazyeval_0.2.2           
##   [5] GSEABase_1.54.0           splines_4.1.0            
##   [7] BiocParallel_1.26.0       listenv_0.8.0            
##   [9] scattermore_0.7           digest_0.6.27            
##  [11] htmltools_0.5.1.1         magick_2.7.2             
##  [13] fansi_0.4.2               magrittr_2.0.1           
##  [15] memoise_2.0.0             ScaledMatrix_1.0.0       
##  [17] tensor_1.5                cluster_2.1.2            
##  [19] ROCR_1.0-11               limma_3.48.0             
##  [21] globals_0.14.0            Biostrings_2.60.0        
##  [23] annotate_1.70.0           spatstat.sparse_2.0-0    
##  [25] colorspace_2.0-1          blob_1.2.1               
##  [27] ggrepel_0.9.1             xfun_0.23                
##  [29] dplyr_1.0.6               hexbin_1.28.2            
##  [31] crayon_1.4.1              RCurl_1.98-1.3           
##  [33] jsonlite_1.7.2            graph_1.70.0             
##  [35] spatstat.data_2.1-0       survival_3.2-11          
##  [37] zoo_1.8-9                 glue_1.4.2               
##  [39] polyclip_1.10-0           gtable_0.3.0             
##  [41] zlibbioc_1.38.0           XVector_0.32.0           
##  [43] leiden_0.3.7              DelayedArray_0.18.0      
##  [45] BiocSingular_1.8.0        Rhdf5lib_1.14.0          
##  [47] future.apply_1.7.0        HDF5Array_1.20.0         
##  [49] abind_1.4-5               scales_1.1.1             
##  [51] msigdbr_7.4.1             pheatmap_1.0.12          
##  [53] DBI_1.1.1                 miniUI_0.1.1.1           
##  [55] Rcpp_1.0.6                isoband_0.2.4            
##  [57] viridisLite_0.4.0         xtable_1.8-4             
##  [59] spatstat.core_2.1-2       reticulate_1.20          
##  [61] bit_4.0.4                 rsvd_1.0.5               
##  [63] GSVA_1.40.0               htmlwidgets_1.5.3        
##  [65] httr_1.4.2                RColorBrewer_1.1-2       
##  [67] ellipsis_0.3.2            ica_1.0-2                
##  [69] farver_2.1.0              pkgconfig_2.0.3          
##  [71] XML_3.99-0.6              uwot_0.1.10              
##  [73] deldir_0.2-10             sass_0.4.0               
##  [75] utf8_1.2.1                labeling_0.4.2           
##  [77] reshape2_1.4.4            tidyselect_1.1.1         
##  [79] rlang_0.4.11              later_1.2.0              
##  [81] AnnotationDbi_1.54.0      munsell_0.5.0            
##  [83] tools_4.1.0               cachem_1.0.5             
##  [85] generics_0.1.0            RSQLite_2.2.7            
##  [87] ggridges_0.5.3            evaluate_0.14            
##  [89] stringr_1.4.0             fastmap_1.1.0            
##  [91] goftest_1.2-2             yaml_2.2.1               
##  [93] babelgene_21.4            knitr_1.33               
##  [95] bit64_4.0.5               fitdistrplus_1.1-3       
##  [97] purrr_0.3.4               RANN_2.6.1               
##  [99] KEGGREST_1.32.0           nlme_3.1-152             
## [101] pbapply_1.4-3             future_1.21.0            
## [103] sparseMatrixStats_1.4.0   mime_0.10                
## [105] compiler_4.1.0            plotly_4.9.3             
## [107] png_0.1-7                 spatstat.utils_2.1-0     
## [109] tibble_3.1.2              bslib_0.2.5.1            
## [111] stringi_1.6.2             highr_0.9                
## [113] lattice_0.20-44           Matrix_1.3-3             
## [115] vctrs_0.3.8               pillar_1.6.1             
## [117] lifecycle_1.0.0           rhdf5filters_1.4.0       
## [119] BiocManager_1.30.15       spatstat.geom_2.1-0      
## [121] lmtest_0.9-38             jquerylib_0.1.4          
## [123] RcppAnnoy_0.0.18          data.table_1.14.0        
## [125] cowplot_1.1.1             bitops_1.0-7             
## [127] irlba_2.3.3               httpuv_1.6.1             
## [129] patchwork_1.1.1           R6_2.5.0                 
## [131] bookdown_0.22             promises_1.2.0.1         
## [133] KernSmooth_2.23-20        gridExtra_2.3            
## [135] parallelly_1.25.0         codetools_0.2-18         
## [137] MASS_7.3-54               assertthat_0.2.1         
## [139] rhdf5_2.36.0              withr_2.4.2              
## [141] sctransform_0.3.2         GenomeInfoDbData_1.2.6   
## [143] mgcv_1.8-35               rpart_4.1-15             
## [145] grid_4.1.0                beachmat_2.8.0           
## [147] tidyr_1.1.3               rmarkdown_2.8            
## [149] DelayedMatrixStats_1.14.0 Rtsne_0.15               
## [151] shiny_1.6.0