if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("roastgsa")
Several gene set databases are publicly available and can be used
for gene set analysis as part of the screening of bulk data for
hypothesis generation. The Hallmark gene set database [1] provides a
well-curated gene set list of
biological states which can be used to obtain an overall
characterization of the data. Other Human Molecular Signatures
Database (MSigDB) collections including
gene ontology pathways (Biological Process, Cellular Component,
Molecular Function), immunologic signature gene sets or regulatory
target gene sets can be employed for roastgsa
analysis to identify
the most relevant expression changes between experimental conditions
for highly specific biological functions. These broad collections and
sub-collections can be downloaded through
https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp.
Other public collections that can be employed for battery testing such
as the KEGG pathway database (https://www.genome.jp/kegg/pathway.html) [2]
or reactome (https://reactome.org/) [3] provide a
broad representation of gene sets for human diseases, metabolism or
cellular processes among others.
These gene set collections can be considered for roastgsa
either by (1)
R loading the gene sets in a list
object, each element containing
the gene identifiers for the testing set or (2) saving a .gmt
file
with the whole collection in a specific folder, i.e.,
# DO NOT RUN
# gspath = "path/to/folder/of/h.all.v7.2.symbols.gmt"
# gsetsel = "h.all.v7.2.symbols.gmt"
We download publicly available arrays from GEO, accession ‘GSE145603’, which contain LGR5 and POLI double tumor cell populations in colorectal cancer [4]
# DO NOT RUN
# library(GEOquery)
# data <- getGEO('GSE145603')
# normdata <- (data[[1]])
# pd <- pData(normdata)
# pd$group_LGR5 <- pd[["lgr5:ch1"]]
We are interested in screening the hallmarks with the largests changes
between LGR5 negative and LGR5 positive samples. The hallmark gene sets
are stored in gspath
with the file name specified in gsetsel
(see specifications above).
The formula, design matrix and the corresponding contrast can be obtained as follows
# DO NOT RUN
# form <- "~ -1 + group_LGR5"
# design <- model.matrix(as.formula(form),pd)
# cont.mat <- data.frame(Lgr5.high_Lgr5.neg = c(1,-1))
# rownames(cont.mat) <- colnames(design)
In microarrays, the expression values for each gene can be measured in several probesets. To perform gene set analysis, we select the probeset with maximum variability for every gene:
# DO NOT RUN
# mads <- apply(exprs(normdata), 1, mad)
# gu <- strsplit(fData(normdata)[["Gene Symbol"]], split=' \\/\\/\\/ ')
# names(gu) <- rownames(fData(normdata))
# gu <- gu[sapply(gu, length)==1]
# gu <- gu[gu!='' & !is.na(gu) & gu!='---']
# ps <- rep(names(gu), sapply(gu, length))
# gs <- unlist(gu)
# pss <- tapply(ps, gs, function(o) names(which.max(mads[o])))
# psgen.mvar <- pss
The roastgsa
function is used for competitive gene set analysis testing:
# DO NOT RUN
# roast1 <- roastgsa(exprs(normdata), form = form, covar = pd,
# psel = psgen.mvar, contrast = cont.mat[, 1],
# gspath = gspath, gsetsel = gsetsel, nrot = 1000,
# mccores = 7, set.statistic = "maxmean")
#
# print(roast1)
#
Summary tables can be presented following the roastgsa::htmlrgsa
documentation.
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.46 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.35 R6_2.5.1 bookdown_0.39
## [4] fastmap_1.1.1 xfun_0.43 cachem_1.0.8
## [7] htmltools_0.5.8.1 rmarkdown_2.26 lifecycle_1.0.4
## [10] cli_3.6.2 sass_0.4.9 jquerylib_0.1.4
## [13] compiler_4.4.0 tools_4.4.0 evaluate_0.23
## [16] bslib_0.7.0 yaml_2.3.8 BiocManager_1.30.22
## [19] jsonlite_1.8.8 rlang_1.1.3
[1] A. Liberzon, C. Birger, H. Thorvaldsdottir, M. Ghandi, J. P. Mesirov, and P. Tamayo. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems, 1(6):417-425, 2015.
[2] M. Kanehisa et al. KEGG as a reference resource for gene and protein annotation, Nucleic Acids Research, Volume 44, Issue D1, 4 January 2016, D457–D462, https://doi.org/10.1093/nar/gkv1070
[3] M. Gillespie et al. The reactome pathway knowledgebase 2022, Nucleic Acids Research, 2021;, gkab1028, https://doi.org/10.1093/nar/gkab1028
[4] Morral C, Stanisavljevic J, Hernando-Momblona X, et al. Zonation of Ribosomal DNA Transcription Defines a Stem Cell Hierarchy in Colorectal Cancer. Cell Stem Cell. 2020;26(6):845-861.e12. doi:10.1016/j.stem.2020.04.012