Here we illustrate how to use GSVA with single-cell RNA sequencing (scRNA-seq) data.
GSVA 2.7.3
License: Artistic-2.0
GSVA provides now specific support for single-cell data in the algorithm
that runs through the gsvaParam() parameter constructor, and originally
described in the publication by Hänzelmann, Castelo, and Guinney (2013). At the moment, this
specific support consists of the following features:
dgCMatrix, SVT_SparseArray, HDF5Matrix and DelayedMatrix.
The currently available container for single-cell data that allows one
to input additional row and column metadata is a SingleCellExperiment
object.matrix or a
dense DelayedMatrix object using an HDF5Matrix backend. The latter will
be particularly used when the total number of values exceeds 2^31, which is
the largest 32-bit standard integer value in R.sparse=FALSE in the call to
gsvaParam(), the classical GSVA algorithm will be used, which for a
typical single-cell data set will result in longer running times and larger
memory consumption than running it in the default sparse regime for this
type of data.gsva()
with a parameter object or in three steps: (1) row normalization with
gsvaRowNorm(); (2) column rank transformation with gsvaColRanks();
and (3) column enrichment scores calculation with gsvaColScores().
Splitting the GSVA algorithm into these three steps allows one to distribute
and balance the computational load of the algorithm in a high-performance
computing (HPC) environment with multiple nodes, and to reuse the output of
the first two steps, which are independent of the gene sets, to calculate
enrichment scores for different collections of gene sets, without having to
repeat the first two steps.In what follows, we will illustrate the use of GSVA on a publicly available single-cell transcriptomics data set of peripheral blood mononuclear cells (PBMCs) published by Zheng et al. (2017).
We import the PBMC data using the TENxPBMCData package, as a SingleCellExperiment object.
library(SingleCellExperiment)
library(TENxPBMCData)
sce <- TENxPBMCData(dataset="pbmc4k")
sce
class: SingleCellExperiment
dim: 33694 4340
metadata(0):
assays(1): counts
rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
ENSG00000268674
rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
colnames: NULL
colData names(11): Sample Barcode ... Individual Date_published
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
Here, we perform a quality control (QC) and pre-processing steps using the package scrapper (Lun and Kancherla 2022). We start identifying mitochondrial genes.
library(scrapper)
is_mito <- grepl("^MT-", rowData(sce)$Symbol_TENx)
table(is_mito)
is_mito
FALSE TRUE
33681 13
Calculate QC metrics and filter out low-quality cells.
sce <- quickRnaQc.se(sce, subsets=list(mito=is_mito))
sce <- sce[, sce$keep]
dim(sce)
[1] 33694 4147
We filter out genes that are expressed in less than 1% of the cells.
cellsxgene <- rowSums(counts(sce) > 0)
sce <- sce[cellsxgene > floor(ncol(sce)*0.01), ]
dim(sce)
[1] 10799 4147
Calculate library size factors and normalized units of expression in logarithmic scale.
sce <- normalizeRnaCounts.se(sce, size.factors=sce$sum)
assayNames(sce)
[1] "counts" "logcounts"
Here, we illustrate how to annotate cell types in the PBMC data using GSVA and a collection of relevant gene sets.
First, we fetch a collection of 22 leukocyte gene set signatures, containing
a total 547 genes, which should help to distinguish among 22 mature human
hematopoietic cell type populations isolated from peripheral blood or
in vitro culture conditions, including seven T cell types: naïve and memory B
cells, plasma cells, NK cell, and myeloid subsets. These gene sets have been
used in the benchmarking publication by Diaz-Mejia et al. (2019), and were
originally compiled by the CIBERSORT
developers, where they called it the LM22 signature (Newman et al. 2015). The
LM22 signature is stored in the GSVAdata experiment data
package as a compressed text file in
GMT format, which can
be read into R using the readGMT() function from the GSVA
package, which will return the gene sets, by default, into a
GeneSetCollection object, defined in the GSEABase package.
This default argument can be changed to return the gene sets into a base
list object by setting valueType="list" in the call to readGMT().
library(GSEABase)
library(GSVA)
fname <- file.path(system.file("extdata", package="GSVAdata"),
"pbmc_cell_type_gene_set_signatures.gmt.gz")
gsets <- readGMT(fname)
gsets
GeneSetCollection
names: B_CELLS_MEMORY, B_CELLS_NAIVE, ..., T_CELLS_REGULATORY_TREGS (22 total)
unique identifiers: AIM2, BANK1, ..., SKAP1 (248 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: NullCollection (1 total)
Note that while gene identifers in the sce object correspond to
Ensembl stable identifiers
(ENSG...), the gene identifiers in the gene sets are
HGNC gene symbols. This, in principle, precludes
matching directly what gene in the single-cell data object sce corresponds to
what gene set in the GeneSetCollection object gsets. However, the
GSVA package can do that matching as long as the appropriate
metadata is present in both objects.
In the case of a GeneSetCollection object, its geneIdType metadata slot
stores the type of gene identifier. In the case of a SingleCellExperiment
object, such as the previous sce object, such metadata is not present.
However, using the function gsvaAnnotation() from the GSVA
package, and the helper function ENSEMBLIdentifier() from the
GSEABase package, we add such metadata to the sce object as
follows.
gsvaAnnotation(sce) <- ENSEMBLIdentifier("org.Hs.eg.db")
gsvaAnnotation(sce)
geneIdType: ENSEMBL (org.Hs.eg.db)
We first build a parameter object using the function gsvaParam(). By
default, the expression values in the logcounts assay will be selected for
downstream analysis.
gsvapar <- gsvaParam(sce, gsets)
gsvapar
class: GSVA::gsvaParam
expression data dim: 10799 4147
number of gene sets: 22
details: use 'details(object)'
While at this point, we could already run the entire GSVA algorithm with a call
to the gsva(gsvapar) function. We show here how to do it in three steps.
First we calculate row-normalized expression values using the function
gsvaRowNorm(), which if, as in this example, the given input is a
SingleCellExperiment object, then the output will be the same object with
an additional assay called gsvarownr containing the row-normalized expression
values.
gsvarownorm <- gsvaRowNorm(gsvapar)
gsvarownorm
class: SingleCellExperiment
dim: 10799 4147
metadata(3): qc annotation gsvaParam
assays(3): counts logcounts gsvarownr
rownames(10799): ENSG00000279457 ENSG00000228463 ... ENSG00000273748
ENSG00000278817
rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
colnames: NULL
colData names(16): Sample Barcode ... keep sizeFactor
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
assayNames(gsvarownorm)
[1] "counts" "logcounts" "gsvarownr"
Second, we calculate GSVA column rank values using the function
gsvaColRanks(), which takes as input the output of gsvaRowNorm(), and returns
the column rank values in a new assay called gsvaranks, if the input is a
SingleCellExperiment object.
gsvacolranks <- gsvaColRanks(gsvarownorm)
gsvacolranks
class: SingleCellExperiment
dim: 10799 4147
metadata(3): qc annotation gsvaParam
assays(4): counts logcounts gsvarownr gsvaranks
rownames(10799): ENSG00000279457 ENSG00000228463 ... ENSG00000273748
ENSG00000278817
rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
colnames: NULL
colData names(16): Sample Barcode ... keep sizeFactor
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
assayNames(gsvacolranks)
[1] "counts" "logcounts" "gsvarownr" "gsvaranks"
Third, we finally calculate the GSVA scores using the output of
gsvaColRanks() as input to the function gsvaColScores(). By default, this
function will calculate the scores for all gene sets specified in the input
parameter object given in the call to gsvaRowNorm().
es <- gsvaColScores(gsvacolranks)
es
class: SingleCellExperiment
dim: 22 4147
metadata(2): qc gsvaParam
assays(1): es
rownames(22): B_CELLS_MEMORY B_CELLS_NAIVE ... T_CELLS_GAMMA_DELTA
T_CELLS_REGULATORY_TREGS
rowData names(1): gs
colnames: NULL
colData names(16): Sample Barcode ... keep sizeFactor
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
However, we could calculate the scores for another collection of gene sets,
without having to calculate the column ranks again, by giving this other
collection of gene sets as second argument to the call to gsvaColScores().
es2 <- gsvaColScores(gsvacolranks, alternative_gsets)
Following Amezquita et al. (2020), and some of the steps described in
“Chapter 5 Clustering” of the first version of the
OSCA book,
we use GSVA scores to build a nearest-neighbor graph of the cells using the
function makeSNNGraph() from the bluster package. The
parameter k in the call to makeSNNGraph() specifies the number of nearest
neighbors to consider during graph construction, and here we set k=20 because
it leads to a number of clusters close to the expected number of cell types.
library(bluster)
g <- makeSNNGraph(t(assay(es)), k=20)
Second, we use the function cluster_walktrap() from the igraph
package (Csardi and Nepusz 2006), to cluster cells by finding densely connected
subgraphs. We store the resulting vector of cluster indicator values into the
sce object using the function colLabels().
library(igraph)
colLabels(es) <- factor(cluster_walktrap(g)$membership)
table(colLabels(es))
1 2 3 4 5 6
919 1081 1017 589 205 336
Similarly to Diaz-Mejia et al. (2019), we apply a simple cell type assignment
algorithm, which consists of selecting at each cell the gene set with highest
GSVA score, tallying the selected gene sets per cluster, and assigning to the
cluster the most frequent gene set, storing that assignment into the sce
object with the function colLabels().
whmax <- apply(assay(es), 2, which.max)
gsxlab <- split(rownames(es)[whmax], colLabels(es))
gsxlab <- names(sapply(sapply(gsxlab, table), which.max))
colLabels(es) <- factor(gsub("[0-9]\\.", "", gsxlab))[colLabels(es)]
table(colLabels(es))
B_CELLS_NAIVE MONOCYTES NK_CELLS_RESTING T_CELLS_CD4_NAIVE
589 1017 205 2000
T_CELLS_CD8
336
We can visualize the cell type assignments by projecting cells dissimilarity in two dimensions with a principal components analysis (PCA) on the GSVA scores, and coloring cells using the previously assigned clusters.
library(RColorBrewer)
res <- prcomp(assay(es))
varexp <- res$sdev^2 / sum(res$sdev^2)
nclusters <- nlevels(colLabels(es))
hmcol <- colorRampPalette(brewer.pal(nclusters, "Set1"))(nclusters)
par(mar=c(4, 5, 1, 1))
plot(res$rotation[, 1], res$rotation[, 2], col=hmcol[colLabels(es)], pch=19,
xlab=sprintf("PCA 1 (%.0f%%)", varexp[1]*100),
ylab=sprintf("PCA 2 (%.0f%%)", varexp[2]*100),
las=1, cex.axis=1.2, cex.lab=1.5)
mask <- colLabels(es) == "NK_CELLS_RESTING"
points(res$rotation[mask, 1], res$rotation[mask, 2], ## show the overlap better
col=hmcol[colLabels(es)[mask]], pch=19)
legend("bottomright", gsub("_", " ", levels(colLabels(es))), fill=hmcol, inset=0.01)
Figure 1: Cell type assignments of PBMC scRNA-seq data, based on GSVA scores
Finally, if we want to better understand why a specific cell type is annotated
to a given cell, we can use the gsvaEnrichment() function, which will show
a GSEA enrichment plot. This function takes as input the output of
gsvaRanks(), a given column (cell) in the input single-cell data, and a given
gene set. In Figure 2 below, we show such a plot for the
first cell annotated to the monocytes cell type.
firstmonocytecell <- which(colLabels(es) == "MONOCYTES")[1]
par(mar=c(4, 5, 1, 1))
gsvaEnrichment(gsvacolranks, column=firstmonocytecell, geneSet="MONOCYTES",
cex.axis=1.2, cex.lab=1.5, plot="ggplot")
Figure 2: GSVA enrichment plot of the EOSINOPHILS gene set in the expression profile of the first cell annotated to that cell type
In the previous call to gsvaEnrichment() we used the argument plot="ggplot"
to produce a plot with the ggplot2
package. By default, if we call gsvaEnrichment() interactively, it will
produce a plot using “base R”, but either when we do it non-interactively, or
when we set plot="no" it will return a data.frame object with the enrichment
data.
We are still benchmarking and testing this version of GSVA for single-cell data. If you encounter problems or have suggestions, do not hesitate to contact us by opening an issue in the GSVA GitHub repo.
Here is the output of sessionInfo() on the system on which this document was
compiled running pandoc 2.7.3:
sessionInfo()
R version 4.6.0 RC (2026-04-17 r89917)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] igraph_2.3.2 bluster_1.23.0
[3] scrapper_1.7.3 TENxPBMCData_1.31.0
[5] HDF5Array_1.41.0 h5mread_1.5.0
[7] rhdf5_2.57.1 DelayedArray_0.39.3
[9] SparseArray_1.13.2 S4Arrays_1.13.0
[11] abind_1.4-8 Matrix_1.7-5
[13] SingleCellExperiment_1.35.1 sva_3.61.0
[15] BiocParallel_1.47.0 genefilter_1.95.0
[17] mgcv_1.9-4 nlme_3.1-169
[19] RColorBrewer_1.1-3 edgeR_4.11.1
[21] limma_3.69.2 GSVAdata_1.49.0
[23] SummarizedExperiment_1.43.0 GenomicRanges_1.65.0
[25] Seqinfo_1.3.0 MatrixGenerics_1.25.0
[27] matrixStats_1.5.0 GSEABase_1.75.0
[29] graph_1.91.0 annotate_1.91.0
[31] XML_3.99-0.23 org.Hs.eg.db_3.23.1
[33] AnnotationDbi_1.75.0 IRanges_2.47.2
[35] S4Vectors_0.51.3 Biobase_2.73.1
[37] BiocGenerics_0.59.6 generics_0.1.4
[39] GSVA_2.7.3 BiocStyle_2.41.0
loaded via a namespace (and not attached):
[1] DBI_1.3.0 httr2_1.2.2
[3] rlang_1.2.0 magrittr_2.0.5
[5] otel_0.2.0 compiler_4.6.0
[7] RSQLite_3.53.1 DelayedMatrixStats_1.35.0
[9] png_0.1-9 vctrs_0.7.3
[11] pkgconfig_2.0.3 SpatialExperiment_1.23.0
[13] crayon_1.5.3 memuse_4.2-3
[15] fastmap_1.2.0 dbplyr_2.5.2
[17] magick_2.9.1 XVector_0.53.0
[19] labeling_0.4.3 rmarkdown_2.31
[21] purrr_1.2.2 tinytex_0.59
[23] bit_4.6.0 xfun_0.58
[25] cachem_1.1.0 beachmat_2.29.0
[27] jsonlite_2.0.0 blob_1.3.0
[29] rhdf5filters_1.25.0 Rhdf5lib_2.1.0
[31] cluster_2.1.8.2 irlba_2.3.7
[33] parallel_4.6.0 R6_2.6.1
[35] bslib_0.11.0 jquerylib_0.1.4
[37] Rcpp_1.1.1-1.1 bookdown_0.46
[39] knitr_1.51 tidyselect_1.2.1
[41] splines_4.6.0 dichromat_2.0-0.1
[43] yaml_2.3.12 codetools_0.2-20
[45] curl_7.1.0 tibble_3.3.1
[47] lattice_0.22-9 S7_0.2.2
[49] withr_3.0.2 KEGGREST_1.53.0
[51] evaluate_1.0.5 survival_3.8-6
[53] BiocFileCache_3.3.0 ExperimentHub_3.3.0
[55] Biostrings_2.81.2 filelock_1.0.3
[57] pillar_1.11.1 BiocManager_1.30.27
[59] ggplot2_4.0.3 BiocVersion_3.24.0
[61] scales_1.4.0 sparseMatrixStats_1.25.0
[63] xtable_1.8-8 glue_1.8.1
[65] tools_4.6.0 beachmat.hdf5_1.11.0
[67] AnnotationHub_4.3.0 BiocNeighbors_2.7.2
[69] ScaledMatrix_1.21.0 locfit_1.5-9.12
[71] grid_4.6.0 BiocSingular_1.29.0
[73] cli_3.6.6 rsvd_1.0.5
[75] rappdirs_0.3.4 dplyr_1.2.1
[77] gtable_0.3.6 sass_0.4.10
[79] digest_0.6.39 farver_2.1.2
[81] rjson_0.2.23 memoise_2.0.1
[83] htmltools_0.5.9 lifecycle_1.0.5
[85] httr_1.4.8 statmod_1.5.2
[87] bit64_4.8.2
Amezquita, Robert A, Aaron TL Lun, Etienne Becht, Vince J Carey, Lindsay N Carpp, Ludwig Geistlinger, Federico Marini, et al. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” Nature Methods 17 (2): 137–45.
Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.
Diaz-Mejia, J Javier, Elaine C Meng, Alexander R Pico, Sonya A MacParland, Troy Ketela, Trevor J Pugh, Gary D Bader, and John H Morris. 2019. “Evaluation of Methods to Assign Cell Type Labels to Cell Clusters from Single-Cell Rna-Sequencing Data.” F1000Research 8: ISCB–Comm.
Hänzelmann, Sonja, Robert Castelo, and Justin Guinney. 2013. “GSVA: Gene Set Variation Analysis for Microarray and RNA-Seq Data.” BMC Bioinformatics 14: 7. https://doi.org/10.1186/1471-2105-14-7.
Lun, Aaron, and Jayaram Kancherla. 2022. “Powering Single-Cell Analyses in the Browser with Webassembly.” bioRxiv, 2022–03.
Newman, Aaron M, Chih Long Liu, Michael R Green, Andrew J Gentles, Weiguo Feng, Yue Xu, Chuong D Hoang, Maximilian Diehn, and Ash A Alizadeh. 2015. “Robust Enumeration of Cell Subsets from Tissue Expression Profiles.” Nature Methods 12 (5): 453–57.
Zheng, Grace XY, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8 (1): 14049.