if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("roastgsa")
This vignette gives a broad overview of the main functions for applying
roastgsa to RNA-seq data. A more exhaustive example of the roastgsa
functionality is presented in the “roastgsa vignette (main)”. All the analyses
explained in the main vignette can be reproduced for RNA-seq data after
completing the steps covered here in the section
“Data normalization and filtering”.
library(roastgsa)
We consider the first dataset available in the tcga compendium from the
GSEABenchmarkeR package [1], which consists of an RNA-seq study
with 19 tumor Bladder Urothelial Carcinoma samples and 19 adjacent
healthy tissues.
#library(GSEABenchmarkeR)
#tcga <- loadEData("tcga", nr.datasets=1,cache = TRUE)
#ysel <- assays(tcga[[1]])$expr
#fd <- rowData(tcga[[1]])
#pd <- colData(tcga[[1]])
data(fd.tcga)
data(pd.tcga)
data(expr.tcga)
fd <- fd.tcga
ysel <- expr.tcga
pd <- pd.tcga
N <- ncol(ysel)
head(pd)
## DataFrame with 6 rows and 4 columns
## sample type GROUP
## <character> <factor> <numeric>
## TCGA-K4-A3WV-01A-11R-A22U-07 TCGA-K4-A3WV-01A-11R.. BLCA 1
## TCGA-BT-A20W-01A-21R-A14Y-07 TCGA-BT-A20W-01A-21R.. BLCA 1
## TCGA-K4-A5RI-01A-11R-A28M-07 TCGA-K4-A5RI-01A-11R.. BLCA 1
## TCGA-BT-A20N-01A-11R-A14Y-07 TCGA-BT-A20N-01A-11R.. BLCA 1
## TCGA-BL-A13J-01A-11R-A277-07 TCGA-BL-A13J-01A-11R.. BLCA 1
## TCGA-BT-A20U-01A-11R-A14Y-07 TCGA-BT-A20U-01A-11R.. BLCA 1
## BLOCK
## <character>
## TCGA-K4-A3WV-01A-11R-A22U-07 TCGA-K4-A3WV
## TCGA-BT-A20W-01A-21R-A14Y-07 TCGA-BT-A20W
## TCGA-K4-A5RI-01A-11R-A28M-07 TCGA-K4-A5RI
## TCGA-BT-A20N-01A-11R-A14Y-07 TCGA-BT-A20N
## TCGA-BL-A13J-01A-11R-A277-07 TCGA-BL-A13J
## TCGA-BT-A20U-01A-11R-A14Y-07 TCGA-BT-A20U
cnames <- c("BLOCK","GROUP")
covar <- data.frame(pd[,cnames,drop=FALSE])
covar$GROUP <- as.factor(covar$GROUP)
colnames(covar) <- cnames
print(table(covar$GROUP))
##
## 0 1
## 19 19
To apply roastgsa, the expression data should be approximately normally
distributed, at least in their univariate form. Depending on the user’s
preferred method for differential expression analysis, count transformation
methods such as rlog or vst (DESeq2) [2], zscoreDGE (edgeR) [3] or
voom (limma) [4] can be applied. In the paper we explored the type I
and type II errors when applying the rlog or vst transformation followed by
roastgsa, showing both good control of type I errors and decent true
discovery rates. In the example presented here we transform the expression
data with the vst function from the DESeq2 R package:
library(DESeq2)
dds1 <- DESeqDataSetFromMatrix(countData = ysel, colData = pd,
                               design = ~ BLOCK + GROUP)
dds1 <- estimateSizeFactors(dds1)
ynorm <- assays(vst(dds1))[[1]]
colnames(ynorm) <- rownames(covar) <- paste0("s",1:ncol(ynorm))
Another key step before using the roastgsa methods for enrichment analysis
is to filter out lowly expressed genes, whose limited coverage can hinder
the detection of true differential expression. For the
TCGA data considered here, the default filter employed by the authors when
loading the data was to exclude genes with cpm < 2 in more than half of
the samples. A short discussion of the relationship between gene coverage
and statistical power for the roastgsa approach is available in our article
presenting the roastgsa package.
threshLR <- 10
dim(ysel)
## [1] 3621 38
min(apply(ysel,1,mean))
## [1] 88.26316
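As a rough illustration of the kind of filter described above (not the code used to prepare the TCGA data), the following base R sketch computes counts per million on a small simulated matrix and drops genes with cpm < 2 in more than half of the samples. The objects `counts`, `cpm`, `keep` and `counts.filt` are hypothetical names, and the filler row `rest` is only there to bring every library size to one million reads.

```r
# Simulated counts for 5 genes x 4 samples (hypothetical data)
counts <- matrix(c(
     0,    1,    0,    1,   # gene1: low in all samples
   500,  450,  600,  520,   # gene2
     3,    0,    1,    0,   # gene3: low in three samples
  1200, 1100,  900, 1000,   # gene4
     0,    0,  300,  310    # gene5: low in exactly half of the samples
), nrow = 5, byrow = TRUE,
dimnames = list(paste0("gene", 1:5), paste0("s", 1:4)))

# Add a filler row so that every library size equals one million reads
counts <- rbind(counts, rest = 1e6 - colSums(counts))

# Counts per million: divide each column by its library size
lib.size <- colSums(counts)
cpm <- sweep(counts, 2, lib.size, "/") * 1e6

# Keep genes unless cpm < 2 in more than half of the samples
n.low <- rowSums(cpm < 2)
keep <- n.low <= ncol(counts) / 2
counts.filt <- counts[keep, , drop = FALSE]
rownames(counts.filt)
```

In this toy example gene5 is retained because it is lowly expressed in exactly half of the samples, not in more than half.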
We consider a classic repository of general biological functions for battery gene set analysis, the Broad hallmarks [5]. The human gene sets are stored in the roastgsa package and can be loaded with:
data(hallmarks.hs)
head(names(hallmarks.hs))
## [1] "HALLMARK_TNFA_SIGNALING_VIA_NFKB" "HALLMARK_HYPOXIA"
## [3] "HALLMARK_CHOLESTEROL_HOMEOSTASIS" "HALLMARK_MITOTIC_SPINDLE"
## [5] "HALLMARK_WNT_BETA_CATENIN_SIGNALING" "HALLMARK_TGF_BETA_SIGNALING"
In this case, hallmarks.hs contains gene symbols whereas the row
names of ynorm are Entrez identifiers. We can set the row names to
symbols, which in this case map one-to-one to the Entrez identifiers:
rownames(ynorm) <- fd[rownames(ynorm), 1]
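Before overwriting row names it can be worth confirming that the mapping really is one-to-one. The sketch below shows one way to do this in base R on a toy example; the annotation table `annot` and the matrix `mat` are hypothetical stand-ins for `fd` and `ynorm`, and the layout (Entrez identifiers as row names, symbols in the first column) is an assumption based on the indexing used above.

```r
# Hypothetical annotation table: Entrez IDs as row names, symbols in column 1
annot <- data.frame(SYMBOL = c("TP53", "MYC", "EGFR"),
                    row.names = c("7157", "4609", "1956"))

# Hypothetical expression matrix with Entrez IDs as row names
mat <- matrix(rnorm(6), nrow = 3,
              dimnames = list(c("4609", "7157", "1956"), c("s1", "s2")))

# Look up the symbol of each row by its Entrez ID
symbols <- as.character(annot[rownames(mat), 1])

# Replace row names only if every ID has a unique, non-missing symbol
stopifnot(!anyNA(symbols), !anyDuplicated(symbols))
rownames(mat) <- symbols
rownames(mat)
```

If the check fails, duplicated or missing symbols need to be resolved (for example by dropping or aggregating rows) before running the enrichment analysis.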
Other gene set databases that could be applied to these data for battery
testing are presented in the roastgsa vignette (gene set collections).
The comparison of interest is specified by a numeric vector whose length matches the number of columns in the design matrix.
form <- as.formula(paste0("~ ", paste0(cnames, collapse = "+")))
design <- model.matrix(form , data = covar)
terms <- colnames(design)
contrast <- rep(0, length(terms))
contrast[length(colnames(design))] <- 1
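To make the role of the contrast vector concrete, the following sketch builds a small design matrix with the same formula structure and prints the contrast alongside the column names. The data frame `covar.toy` and the objects derived from it are hypothetical; only `model.matrix` and base R functions are used.

```r
# Hypothetical covariates: 3 blocks (patients), each with one sample per group
covar.toy <- data.frame(
  BLOCK = factor(rep(c("p1", "p2", "p3"), times = 2)),
  GROUP = factor(rep(c(0, 1), each = 3))
)

# Same formula structure as in the vignette: ~ BLOCK + GROUP
design.toy <- model.matrix(~ BLOCK + GROUP, data = covar.toy)
colnames(design.toy)

# One entry per design column; the 1 in the last position selects the
# GROUP coefficient, because GROUP is the last term of the formula
contrast.toy <- rep(0, ncol(design.toy))
contrast.toy[ncol(design.toy)] <- 1
setNames(contrast.toy, colnames(design.toy))
```

Printing the contrast with the column names attached is a cheap way to verify that the intended coefficient is being tested, especially when the order of terms in the formula changes.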
Below is the standard roastgsa call (under competitive testing) for the
maxmean and mean statistics.
fit.maxmean <- roastgsa(ynorm, form = form, covar = covar,
                        contrast = contrast, index = hallmarks.hs, nrot = 500,
                        mccores = 1, set.statistic = "maxmean",
                        self.contained = FALSE, executation.info = FALSE)
f1 <- fit.maxmean$res
rownames(f1) <- gsub("HALLMARK_","",rownames(f1))
head(f1)
## total_genes measured_genes est nes pval
## G2M_CHECKPOINT 200 188 1.1752646 3.695294 0.001996008
## E2F_TARGETS 200 194 1.4694371 3.687416 0.001996008
## MYOGENESIS 200 155 -0.8304752 -2.882802 0.001996008
## MYC_TARGETS_V2 58 58 1.0304122 2.606054 0.021956088
## MYC_TARGETS_V1 200 192 0.7727287 2.594936 0.017964072
## UV_RESPONSE_DN 144 134 -0.6095027 -2.562008 0.009980040
## adj.pval
## G2M_CHECKPOINT 0.0332668
## E2F_TARGETS 0.0332668
## MYOGENESIS 0.0332668
## MYC_TARGETS_V2 0.1372255
## MYC_TARGETS_V1 0.1283148
## UV_RESPONSE_DN 0.1164338
fit.mean <- roastgsa(ynorm, form = form, covar = covar,
                     contrast = contrast, index = hallmarks.hs, nrot = 500,
                     mccores = 1, set.statistic = "mean",
                     self.contained = FALSE, executation.info = FALSE)
f2 <- fit.mean$res
rownames(f2) <- gsub("HALLMARK_","",rownames(f2))
head(f2)
## total_genes measured_genes est nes
## E2F_TARGETS 200 194 1.1896796 2.832982
## G2M_CHECKPOINT 200 188 0.9287256 2.651037
## UNFOLDED_PROTEIN_RESPONSE 113 104 0.4270303 2.548580
## MYOGENESIS 200 155 -0.7076989 -2.317204
## DNA_REPAIR 150 139 0.4878909 2.237875
## UV_RESPONSE_DN 144 134 -0.5941011 -2.232843
## pval adj.pval
## E2F_TARGETS 0.001996008 0.0249501
## G2M_CHECKPOINT 0.001996008 0.0249501
## UNFOLDED_PROTEIN_RESPONSE 0.001996008 0.0249501
## MYOGENESIS 0.001996008 0.0249501
## DNA_REPAIR 0.013972056 0.1164338
## UV_RESPONSE_DN 0.009980040 0.0998004
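The p-values printed above all fall on a grid of multiples of 1/501. This is what one would expect if each p-value were computed as (b + 1)/(nrot + 1), where b is the number of rotations with a statistic at least as extreme as the observed one. That formula is an assumption inferred from the printed output, not taken from the package source. The adjusted p-values are likewise consistent with a Benjamini-Hochberg correction over 50 gene sets, which is also an assumption. The arithmetic can be checked as follows:

```r
nrot <- 500

# Smallest attainable p-value under the assumed (b + 1) / (nrot + 1) rule
p.min <- (0 + 1) / (nrot + 1)
p.min

# Number of extreme rotations implied by a reported p-value of 0.021956088
b <- round(0.021956088 * (nrot + 1) - 1)
b

# Benjamini-Hochberg adjustment when 3 of 50 sets tie at the smallest p-value
# (the other 47 p-values are set to 1 here purely for illustration)
p <- c(rep(p.min, 3), rep(1, 47))
p.adjust(p, method = "BH")[1]
```

Under these assumptions, increasing nrot is the only way to obtain p-values below 1/(nrot + 1), so the resolution of the adjusted p-values is limited by the number of rotations.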
Several graphics can be obtained to complement the tabulated results in
f1 and f2. Here we only show the heatmaps that summarize the
expression patterns obtained for all tested hallmarks. A full description
of the graphical options available in the roastgsa package, and their usage,
is given in the roastgsa vignette for arrays data and in the
roastgsa manual.
hm1 <- heatmaprgsa_hm(fit.maxmean, ynorm, intvar = "GROUP", whplot = 1:50,
                      toplot = TRUE, pathwaylevel = TRUE,
                      mycol = c("orange", "green", "white"),
                      sample2zero = FALSE)
hm2 <- heatmaprgsa_hm(fit.mean, ynorm, intvar = "GROUP", whplot = 1:50,
                      toplot = TRUE, pathwaylevel = TRUE,
                      mycol = c("orange", "green", "white"),
                      sample2zero = FALSE)
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DESeq2_1.44.0 preprocessCore_1.66.0
## [3] hgu133plus2.db_3.13.0 org.Hs.eg.db_3.19.1
## [5] AnnotationDbi_1.66.0 GSEABenchmarkeR_1.24.0
## [7] SummarizedExperiment_1.34.0 GenomicRanges_1.56.0
## [9] GenomeInfoDb_1.40.0 IRanges_2.38.0
## [11] S4Vectors_0.42.0 MatrixGenerics_1.16.0
## [13] matrixStats_1.3.0 Biobase_2.64.0
## [15] BiocGenerics_0.50.0 roastgsa_1.2.0
## [17] knitr_1.46 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.2 bitops_1.0-7
## [3] GSEABase_1.66.0 rlang_1.1.3
## [5] magrittr_2.0.3 compiler_4.4.0
## [7] RSQLite_2.3.6 png_0.1-8
## [9] vctrs_0.6.5 pkgconfig_2.0.3
## [11] crayon_1.5.2 fastmap_1.1.1
## [13] magick_2.8.3 dbplyr_2.5.0
## [15] XVector_0.44.0 labeling_0.4.3
## [17] caTools_1.18.2 utf8_1.2.4
## [19] rmarkdown_2.26 graph_1.82.0
## [21] UCSC.utils_1.0.0 KEGGgraph_1.64.0
## [23] tinytex_0.50 purrr_1.0.2
## [25] bit_4.0.5 xfun_0.43
## [27] zlibbioc_1.50.0 cachem_1.0.8
## [29] jsonlite_1.8.8 blob_1.2.4
## [31] highr_0.10 DelayedArray_0.30.0
## [33] BiocParallel_1.38.0 parallel_4.4.0
## [35] R6_2.5.1 bslib_0.7.0
## [37] RColorBrewer_1.1-3 limma_3.60.0
## [39] jquerylib_0.1.4 Rcpp_1.0.12
## [41] bookdown_0.39 Matrix_1.7-0
## [43] tidyselect_1.2.1 abind_1.4-5
## [45] yaml_2.3.8 gplots_3.1.3.1
## [47] codetools_0.2-20 curl_5.2.1
## [49] lattice_0.22-6 tibble_3.2.1
## [51] withr_3.0.0 KEGGREST_1.44.0
## [53] evaluate_0.23 BiocFileCache_2.12.0
## [55] Biostrings_2.72.0 pillar_1.9.0
## [57] BiocManager_1.30.22 filelock_1.0.3
## [59] KernSmooth_2.23-22 generics_0.1.3
## [61] RCurl_1.98-1.14 ggplot2_3.5.1
## [63] munsell_0.5.1 scales_1.3.0
## [65] gtools_3.9.5 xtable_1.8-4
## [67] glue_1.7.0 tools_4.4.0
## [69] locfit_1.5-9.9 annotate_1.82.0
## [71] XML_3.99-0.16.1 grid_4.4.0
## [73] colorspace_2.1-0 GenomeInfoDbData_1.2.12
## [75] cli_3.6.2 KEGGandMetacoreDzPathwaysGEO_1.23.0
## [77] fansi_1.0.6 S4Arrays_1.4.0
## [79] dplyr_1.1.4 Rgraphviz_2.48.0
## [81] gtable_0.3.5 sass_0.4.9
## [83] digest_0.6.35 SparseArray_1.4.0
## [85] farver_2.1.1 memoise_2.0.1
## [87] htmltools_0.5.8.1 lifecycle_1.0.4
## [89] httr_1.4.7 EnrichmentBrowser_2.34.0
## [91] KEGGdzPathwaysGEO_1.41.0 statmod_1.5.0
## [93] bit64_4.0.5
[1] Geistlinger L, Csaba G, Santarelli M, Schiffer L, Ramos M, Zimmer R, Waldron L (2019). GSEABenchmarkeR: Reproducible GSEA Benchmarking. R package version 1.6.0, https://github.com/waldronlab/GSEABenchmarkeR.
[2] Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550. doi:10.1186/s13059-014-0550-8.
[3] Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140. doi:10.1093/bioinformatics/btp616.
[4] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47.
[5] Liberzon A, Birger C, Thorvaldsdottir H, Ghandi M, Mesirov JP, Tamayo P (2015). The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems, 1(6), 417-425.