if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("GenomicSuperSignature")
BiocManager::install("bcellViper")
library(GenomicSuperSignature)
library(bcellViper)
You can download GenomicSuperSignature from Google Cloud bucket using
GenomicSuperSignature::getModel
function. Currently available models are
built from top 20 PCs of 536 studies (containing 44,890 samples) containing
13,934 common genes from each of 536 study’s top 90% varying genes based on
their study-level standard deviation. There are two versions of this RAVmodel
annotated with different gene sets for GSEA - MSigDB C2 (C2
) and three
priors from PLIER package (PLIERpriors
).
The demo in this vignette is based on human B-cell expression data, so we are
using the PLIERpriors
model annotated with blood-associated gene sets.
Note that the first interactive run of this code, you will be asked to allow
R to create a cache directory. The model file will be stored there and
subsequent calls to getModel
will read from the cache.
RAVmodel <- getModel("PLIERpriors", load=TRUE)
#> [1] "downloading"
RAVmodel
#> class: PCAGenomicSignatures
#> dim: 13934 4764
#> metadata(8): cluster size ... updateNote version
#> assays(1): RAVindex
#> rownames(13934): CASKIN1 DDX3Y ... CTC-457E21.9 AC007966.1
#> rowData names(0):
#> colnames(4764): RAV1 RAV2 ... RAV4763 RAV4764
#> colData names(4): RAV studies silhouetteWidth gsea
#> trainingData(2): PCAsummary MeSH
#> trainingData names(536): DRP000987 SRP059172 ... SRP164913 SRP188526
The human B-cell dataset (Gene Expression Omnibus series GSE2350) consists of 211 normal and tumor human B-cell phenotypes. This dataset was generated on Affymatrix HG-U95Av2 arrays and stored in an ExpressionSet object with 6,249 features x 211 samples.
data(bcellViper)
dset
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 6249 features, 211 samples
#> element names: exprs
#> protocolData: none
#> phenoData
#> sampleNames: GSM44075 GSM44078 ... GSM44302 (211 total)
#> varLabels: sampleID description detailed_description
#> varMetadata: labelDescription
#> featureData: none
#> experimentData: use 'experimentData(object)'
#> Annotation:
You can provide your own expression dataset in any of these formats: simple matrix, ExpressionSet, or SummarizedExperiment. Just make sure that genes are in a ‘symbol’ format.
You can find more usecases here.
validate
function calculates validation score, which provides a quantitative
representation of the relevance between a new dataset and RAV. RAVs that give
the validation score is called validated RAV. The validation results can
be displayed in different ways for more intuitive interpretation.
val_all <- validate(dset, RAVmodel)
head(val_all)
#> score PC sw cl_size cl_num
#> RAV1 0.2249382 2 -0.05470163 6 1
#> RAV2 0.3899341 2 0.06426256 21 2
#> RAV3 0.1887409 2 -0.01800335 4 3
#> RAV4 0.2309703 6 -0.04005584 7 4
#> RAV5 0.2017431 2 0.05786189 3 5
#> RAV6 0.1903602 8 -0.02520973 3 6
heatmapTable
takes validation results as its input and displays them into
a two panel table: the top panel shows the average silhouette width (avg.sw)
and the bottom panel displays the validation score.
heatmapTable
can display different subsets of the validation output. For
example, if you specify scoreCutoff
, any validation result above that score
will be shown. If you specify the number (n) of top validation results through
num.out
, the output will be a n-columned heatmap table. You can also use the
average silhouette width (swCutoff
), the size of cluster (clsizecutoff
),
one of the top 8 PCs from the dataset (whichPC
).
Here, we print out top 5 validated RAVs with average silhouette width above 0.
heatmapTable(val_all, num.out = 5, swCutoff = 0)
Under the default condition, plotValidate
plots validation results of all non
single-element RAVs in one graph, where x-axis represents average silhouette
width of the RAVs (a quality control measure of RAVs) and y-axis represents
validation score. We recommend users to focus on RAVs with higher validation
score and use average silhouette width as a secondary criteria.
plotValidate(val_all, interactive = FALSE)
Note that interactive = TRUE
will result in a zoomable, interactive plot
that included tooltips.
You can hover each data point for more information:
If you double-click the PC legend on the right, you will enter an individual display mode where you can add an additional group of data point by single-click.
GenomicSuperSignature connects different public databases and prior information through RAVmodel, creating the knowledge graph illustrated below. Through RAVs, you can access and explore the knowledge graph from multiple entry points such as gene expression profiles, publications, study metadata, keywords in MeSH terms and gene sets.
You can draw a wordcloud with the enriched MeSH terms of RAVs. Based on the
heatmap table above, three RAVs (2538, 1139, and 884) show the high validation
scores with the positive average silhouette widths, so we draw wordclouds of
those RAVs using drawWordcloud
function. You need to provide RAVmodel and
the index of the RAV you are interested in.
Index of validated RAVs can be easily collected using validatedSingatures
function, which outputs the validated index based on num.out
, PC from dataset
(whichPC
) or any *Cutoff
arguments in the same way as heatmapTable
. Here,
we choose the top 3 RAVs with the average silhouette width above 0, which will
returns RAV2538, RAV1139, and RAV884 as we discussed above.
validated_ind <- validatedSignatures(val_all, num.out = 3,
swCutoff = 0, indexOnly = TRUE)
validated_ind
#> [1] 2538 1139 884
And we plot the wordcloud of those three RAVs.
set.seed(1) # only if you want to reproduce identical display of the same words
drawWordcloud(RAVmodel, validated_ind[1])
drawWordcloud(RAVmodel, validated_ind[2])
drawWordcloud(RAVmodel, validated_ind[3])
You can directly access the GSEA outputs for each RAV using gsea
. Based on
the wordclouds, RAV1139 seems to be associated with B-cell.
RAVnum <- validated_ind[2] # RAV1139
res <- gsea(RAVmodel)[[RAVnum]]
head(res)
#> Description
#> DMAP_ERY3 DMAP_ERY3
#> IRIS_CD4Tcell-Th1-restimulated48hour IRIS_CD4Tcell-Th1-restimulated48hour
#> IRIS_CD4Tcell-Th2-restimulated48hour IRIS_CD4Tcell-Th2-restimulated48hour
#> IRIS_MemoryTcell-RO-activated IRIS_MemoryTcell-RO-activated
#> KEGG_CELL_CYCLE KEGG_CELL_CYCLE
#> KEGG_HUNTINGTONS_DISEASE KEGG_HUNTINGTONS_DISEASE
#> NES pvalue qvalues
#> DMAP_ERY3 -2.184449 1e-10 6.221805e-10
#> IRIS_CD4Tcell-Th1-restimulated48hour -2.834540 1e-10 6.221805e-10
#> IRIS_CD4Tcell-Th2-restimulated48hour -2.790389 1e-10 6.221805e-10
#> IRIS_MemoryTcell-RO-activated -2.802690 1e-10 6.221805e-10
#> KEGG_CELL_CYCLE -2.910950 1e-10 6.221805e-10
#> KEGG_HUNTINGTONS_DISEASE -2.653498 1e-10 6.221805e-10
You can also find the RAVs annotated with the keyword-containing pathways using
findSignature
function. Without the k
argument, this function outputs a
data frame with two columns: the number of RAVs (Freq
column) with the
different numbers of keyword-containing, enriched pathways
(# of keyword-containing pathways
column).
Here, we used the keyword, “Bcell”.
findSignature(RAVmodel, "Bcell")
#> # of keyword-containing pathways Freq
#> 1 0 4685
#> 2 1 38
#> 3 2 17
#> 4 3 14
#> 5 4 8
#> 6 5 2
There are two RAVs with five keyword-containing pathways (row 6). We can check which RAVs they are.
findSignature(RAVmodel, "Bcell", k = 5)
#> [1] 695 1994
Enriched pathways are ordered by NES and you can check the rank of any keyword-
containing pathways using findKeywordInRAV
.
findKeywordInRAV(RAVmodel, "Bcell", ind = 695)
#> [1] "1|2|3|4|5|6|9 (out of 9)"
You can check all enriched pathways of RAV using subsetEnrichedPathways
function. If both=TRUE
, both the top and bottom enriched pathways will
be printed.
subsetEnrichedPathways(RAVmodel, ind = RAVnum, n = 3, both = TRUE)
#> DataFrame with 6 rows and 1 column
#> RAV1139
#> <character>
#> Up_1 DMAP_ERY3
#> Up_2 REACTOME_CYTOKINE_SI..
#> Up_3 REACTOME_APOPTOSIS
#> Down_1 REACTOME_DNA_REPLICA..
#> Down_2 REACTOME_CELL_CYCLE_..
#> Down_3 REACTOME_CELL_CYCLE
subsetEnrichedPathways(RAVmodel, ind = 695, n = 3, both = TRUE)
#> DataFrame with 6 rows and 1 column
#> RAV695
#> <character>
#> Up_1 IRIS_Bcell-Memory_Ig..
#> Up_2 DMAP_BCELLA3
#> Up_3 IRIS_Bcell-Memory_IgM
#> Down_1 DMAP_BCELLA1
#> Down_2 SVM B cells memory
#> Down_3 IRIS_PlasmaCell-From..
subsetEnrichedPathways(RAVmodel, ind = 1994, n = 3, both = TRUE)
#> DataFrame with 6 rows and 1 column
#> RAV1994
#> <character>
#> Up_1 DMAP_TCELLA6
#> Up_2 SVM T cells CD4 memo..
#> Up_3 DMAP_TCELLA7
#> Down_1 IRIS_Bcell-naive
#> Down_2 IRIS_Bcell-Memory_IgM
#> Down_3 IRIS_Bcell-Memory_Ig..
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] bcellViper_1.28.0 GenomicSuperSignature_1.0.1
#> [3] SummarizedExperiment_1.22.0 Biobase_2.52.0
#> [5] GenomicRanges_1.44.0 GenomeInfoDb_1.28.0
#> [7] IRanges_2.26.0 S4Vectors_0.30.0
#> [9] BiocGenerics_0.38.0 MatrixGenerics_1.4.0
#> [11] matrixStats_0.58.0 BiocStyle_2.20.0
#>
#> loaded via a namespace (and not attached):
#> [1] colorspace_2.0-1 ggsignif_0.6.1 rjson_0.2.20
#> [4] ellipsis_0.3.2 rio_0.5.26 circlize_0.4.12
#> [7] XVector_0.32.0 GlobalOptions_0.1.2 clue_0.3-59
#> [10] farver_2.1.0 ggpubr_0.4.0 bit64_4.0.5
#> [13] fansi_0.5.0 codetools_0.2-18 doParallel_1.0.16
#> [16] cachem_1.0.5 knitr_1.33 jsonlite_1.7.2
#> [19] Cairo_1.5-12.2 broom_0.7.6 cluster_2.1.2
#> [22] dbplyr_2.1.1 png_0.1-7 readr_1.4.0
#> [25] BiocManager_1.30.15 compiler_4.1.0 httr_1.4.2
#> [28] backports_1.2.1 assertthat_0.2.1 Matrix_1.3-3
#> [31] fastmap_1.1.0 htmltools_0.5.1.1 tools_4.1.0
#> [34] gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.6
#> [37] dplyr_1.0.6 rappdirs_0.3.3 Rcpp_1.0.6
#> [40] carData_3.0-4 cellranger_1.1.0 jquerylib_0.1.4
#> [43] vctrs_0.3.8 iterators_1.0.13 xfun_0.23
#> [46] stringr_1.4.0 openxlsx_4.2.3 lifecycle_1.0.0
#> [49] rstatix_0.7.0 zlibbioc_1.38.0 scales_1.1.1
#> [52] hms_1.1.0 RColorBrewer_1.1-2 ComplexHeatmap_2.8.0
#> [55] yaml_2.2.1 curl_4.3.1 memoise_2.0.0
#> [58] ggplot2_3.3.3 sass_0.4.0 stringi_1.6.2
#> [61] RSQLite_2.2.7 highr_0.9 foreach_1.5.1
#> [64] filelock_1.0.2 zip_2.1.1 shape_1.4.6
#> [67] rlang_0.4.11 pkgconfig_2.0.3 bitops_1.0-7
#> [70] evaluate_0.14 lattice_0.20-44 purrr_0.3.4
#> [73] labeling_0.4.2 bit_4.0.4 tidyselect_1.1.1
#> [76] magrittr_2.0.1 bookdown_0.22 R6_2.5.0
#> [79] magick_2.7.2 generics_0.1.0 DelayedArray_0.18.0
#> [82] DBI_1.1.1 pillar_1.6.1 haven_2.4.1
#> [85] foreign_0.8-81 withr_2.4.2 abind_1.4-5
#> [88] RCurl_1.98-1.3 tibble_3.1.2 crayon_1.4.1
#> [91] car_3.0-10 wordcloud_2.6 utf8_1.2.1
#> [94] BiocFileCache_2.0.0 rmarkdown_2.8 GetoptLong_1.0.5
#> [97] grid_4.1.0 readxl_1.3.1 data.table_1.14.0
#> [100] blob_1.2.1 forcats_0.5.1 digest_0.6.27
#> [103] tidyr_1.1.3 munsell_0.5.0 bslib_0.2.5.1