if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("GenomicSuperSignature")
library(GenomicSuperSignature)
You can download GenomicSuperSignature from Google Cloud bucket using
GenomicSuperSignature::getModel
function. Currently available models are
built from top 20 PCs of 536 studies (containing 44,890 samples) containing
13,934 common genes from each of 536 study’s top 90% varying genes based on
their study-level standard deviation. There are two versions of this RAVmodel
annotated with different gene sets for GSEA: MSigDB C2 (C2
) and three
priors from PLIER package (PLIERpriors
). In this vignette, we are showing
the C2
annotated model.
Note that the first interactive run of this code, you will be asked to allow
R to create a cache directory. The model file will be stored there and
subsequent calls to getModel
will read from the cache.
RAVmodel <- getModel("C2", load=TRUE)
#> [1] "downloading"
RAVindex
is a matrix containing genes in rows and RAVs in columns. colData
slot provides the information on each RAVs, such as GSEA annotation and
studies involved in each cluster. metadata
slot stores model construction
information. trainingData
slot contains the information on individual
studies in training dataset, such as MeSH terms assigned to each study.
RAVmodel
#> class: PCAGenomicSignatures
#> dim: 13934 4764
#> metadata(8): cluster size ... updateNote version
#> assays(1): RAVindex
#> rownames(13934): CASKIN1 DDX3Y ... CTC-457E21.9 AC007966.1
#> rowData names(0):
#> colnames(4764): RAV1 RAV2 ... RAV4763 RAV4764
#> colData names(4): RAV studies silhouetteWidth gsea
#> trainingData(2): PCAsummary MeSH
#> trainingData names(536): DRP000987 SRP059172 ... SRP164913 SRP188526
Replicable Axis of Variation (RAV) index is the main component of
GenomicSuperSignature. It serves as an index connecting new datasets and the
existing database. You can access it through GenomicSuperSignature::RAVindex
(equivalent of SummarizedExperiment::assay
). Rows are genes and columns are
RAVs.
Here, RAVmodel consists of 13,934 genes and 4,764 RAVs.
class(RAVindex(RAVmodel))
#> [1] "matrix" "array"
dim(RAVindex(RAVmodel))
#> [1] 13934 4764
RAVindex(RAVmodel)[1:4, 1:4]
#> RAV1 RAV2 RAV3 RAV4
#> CASKIN1 0.002449532 -2.845471e-03 0.001601715 -0.009366156
#> DDX3Y 0.006093438 1.870528e-03 -0.004346019 -0.006094526
#> MDGA1 0.009999975 2.610302e-03 -0.002831447 -0.006710757
#> EIF1AY 0.007923652 8.507021e-05 0.002322275 -0.007850784
Metadata slot of RAVmodel contains information related to the model building.
names(metadata(RAVmodel))
#> [1] "cluster" "size" "k" "n" "geneSets"
#> [6] "MeSH_freq" "updateNote" "version"
cluster
: cluster membership of each PCs from the training datasetsize
: an integer vector with the length of clusters, containing the number
of PCs in each clusterk
: the number of all clusters in the given RAVmodeln
: the number of top PCs kept from each study in the training datasetgeneSets
: the name of gene sets used for GSEA annotationMeSH_freq
: the frequency of MeSH terms associated with the training
dataset. MeSH terms like ‘Humans’ and ‘RNA-seq’ are top ranked (which is very
expected) because the training dataset of this model is Human RNA sequencing
data.updateNote
: a brief note on the given model’s specificationversion
: the version of the given modelhead(metadata(RAVmodel)$cluster)
#> DRP000987.PC1 DRP000987.PC2 DRP000987.PC3 DRP000987.PC4 DRP000987.PC5
#> 1 2 3 4 5
#> DRP000987.PC6
#> 6
head(metadata(RAVmodel)$size)
#> RAV1 RAV2 RAV3 RAV4 RAV5 RAV6
#> 6 21 4 7 3 3
metadata(RAVmodel)$k
#> [1] 4764
metadata(RAVmodel)$n
#> [1] 20
geneSets(RAVmodel)
#> [1] "MSigDB C2 v.7.1"
head(metadata(RAVmodel)$MeSH_freq)
#>
#> Humans RNA-Seq Transcriptome
#> 1163 591 576
#> RNA Whole Exome Sequencing Sequencing, RNA
#> 372 284 277
updateNote(RAVmodel)
#> [1] "536 refine.bio studies/ top 90% varying genes/ GSEA with MSigDB C2"
metadata(RAVmodel)$version
#> [1] ">= 0.0.7"
You can find which studies are in each cluster using studies
method. Output is
a list with the length of clusters, where each element is a character vector
containing the name of studies in each cluster.
length(studies(RAVmodel))
#> [1] 4764
studies(RAVmodel)[1:3]
#> $Cl4764_01
#> [1] "DRP000987" "SRP059172" "SRP059959" "SRP118741" "SRP144178" "SRP183186"
#>
#> $Cl4764_02
#> [1] "DRP000987" "ERP022839" "ERP104463" "ERP106996" "ERP109789" "ERP114435"
#> [7] "SRP059039" "SRP059172" "SRP065317" "SRP071547" "SRP073061" "SRP074274"
#> [13] "SRP075449" "SRP092158" "SRP096736" "SRP107337" "SRP111402" "SRP125008"
#> [19] "SRP133442" "SRP135684" "SRP155483"
#>
#> $Cl4764_03
#> [1] "DRP000987" "SRP001540" "SRP056840" "SRP125396"
You can check which PC from different studies are in RAVs using PCinRAV
.
PCinRAV(RAVmodel, 2)
#> [1] "DRP000987.PC2" "ERP022839.PC2" "ERP104463.PC1" "ERP106996.PC2"
#> [5] "ERP109789.PC2" "ERP114435.PC1" "SRP059039.PC3" "SRP059172.PC2"
#> [9] "SRP065317.PC1" "SRP071547.PC3" "SRP073061.PC2" "SRP074274.PC5"
#> [13] "SRP075449.PC2" "SRP092158.PC2" "SRP096736.PC2" "SRP107337.PC3"
#> [17] "SRP111402.PC1" "SRP125008.PC2" "SRP133442.PC1" "SRP135684.PC4"
#> [21] "SRP155483.PC2"
Silhouette width ranges from -1 to 1 for each cluster. Typically, it is
interpreted as follows:
- Values close to 1 suggest that the observation is well matched to the
assigned cluster
- Values close to 0 suggest that the observation is borderline matched
between two clusters
- Values close to -1 suggest that the observations may be assigned to the
wrong cluster
For RAVmodel, the average silhouette width of each cluster is a quality control measure and suggested as a secondary reference to choose proper RAVs, following validation score.
x <- silhouetteWidth(RAVmodel)
head(x) # average silhouette width of the first 6 RAVs
#> 1 2 3 4 5 6
#> -0.05470163 0.06426256 -0.01800335 -0.04005584 0.05786189 -0.02520973
Pre-processed GSEA results on each RAV are stored in RAVmodel and can be
accessed through gsea
function.
class(gsea(RAVmodel))
#> [1] "list"
class(gsea(RAVmodel)[[1]])
#> [1] "data.frame"
length(gsea(RAVmodel))
#> [1] 4764
gsea(RAVmodel)[1]
#> $RAV1
#> Description
#> HELLEBREKERS_SILENCED_DURING_TUMOR_ANGIOGENESIS HELLEBREKERS_SILENCED_DURING_TUMOR_ANGIOGENESIS
#> KIM_GLIS2_TARGETS_UP KIM_GLIS2_TARGETS_UP
#> NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP
#> PHONG_TNF_TARGETS_UP PHONG_TNF_TARGETS_UP
#> RASHI_RESPONSE_TO_IONIZING_RADIATION_2 RASHI_RESPONSE_TO_IONIZING_RADIATION_2
#> REACTOME_TRANSLATION REACTOME_TRANSLATION
#> ZHU_CMV_ALL_DN ZHU_CMV_ALL_DN
#> GALINDO_IMMUNE_RESPONSE_TO_ENTEROTOXIN GALINDO_IMMUNE_RESPONSE_TO_ENTEROTOXIN
#> NES pvalue
#> HELLEBREKERS_SILENCED_DURING_TUMOR_ANGIOGENESIS -3.851302 1.000000e-10
#> KIM_GLIS2_TARGETS_UP -3.472416 1.000000e-10
#> NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP -3.176200 1.000000e-10
#> PHONG_TNF_TARGETS_UP -3.276382 1.000000e-10
#> RASHI_RESPONSE_TO_IONIZING_RADIATION_2 -4.196536 1.000000e-10
#> REACTOME_TRANSLATION 2.300159 1.000000e-10
#> ZHU_CMV_ALL_DN -3.423609 1.000000e-10
#> GALINDO_IMMUNE_RESPONSE_TO_ENTEROTOXIN -3.032082 1.139605e-10
#> qvalues
#> HELLEBREKERS_SILENCED_DURING_TUMOR_ANGIOGENESIS 3.43231e-08
#> KIM_GLIS2_TARGETS_UP 3.43231e-08
#> NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP 3.43231e-08
#> PHONG_TNF_TARGETS_UP 3.43231e-08
#> RASHI_RESPONSE_TO_IONIZING_RADIATION_2 3.43231e-08
#> REACTOME_TRANSLATION 3.43231e-08
#> ZHU_CMV_ALL_DN 3.43231e-08
#> GALINDO_IMMUNE_RESPONSE_TO_ENTEROTOXIN 3.43231e-08
You can find MeSH terms associated with each study using mesh
method.
Output is a list with the length of studies used for training. Each element of
this output list is a data frame containing the assigned MeSH terms and the
detail of them. The last column bagOfWords
is the frequency of the MeSH term
in the whole training dataset.
length(mesh(RAVmodel))
#> [1] 536
mesh(RAVmodel)[1]
#> $DRP000987
#> score identifier name
#> 1468242 162422 DRP000987 Animals
#> 1596889 162422 DRP000987 Humans
#> 1116527 161422 DRP000987 Plasmodium falciparum
#> 832289 96999 DRP000987 Malaria, Falciparum
#> 216848 74145 DRP000987 Parasites
#> 1689323 57176 DRP000987 Transcriptome
#> 761253 55509 DRP000987 Malaria
#> 149067 37162 DRP000987 Indonesia
#> 1627781 36315 DRP000987 Transcriptome Analysis
#> 716101 22634 DRP000987 RNA-Seq
#> 1573542 19938 DRP000987 RNA
#> 58627 18575 DRP000987 Genotype
#> 1230230 7197 DRP000987 Innate Immune Response
#> 35822 1000 DRP000987 Genetic Variation
#> explanation
#> 1468242 Parasites, CT Treecode Lookup: B01.050.500.714 (Parasites)
#> 1596889 Forced Leaf Node Lookup:humans, CT Text Lookup: human, CT Text Lookup: humans, CT Text Lookup: patient, CT Text Lookup: patients, Forced Leaf Node Lookup:humans, CT Text Lookup: human, CT Text Lookup: humans, CT Text Lookup: patient, Parasites, CT Treecode Lookup: B01.050.500.714 (Parasites)
#> 1116527 RtM via: Plasmodium falciparum, Forced Leaf Node Lookup:plasmodium falciparums, checkForceMH Boosted
#> 832289 RtM via: Plasmodium falciparum infection
#> 216848 RtM via: Parasites, Forced Leaf Node Lookup:parasites, checkForceMH Boosted
#> 1689323 RtM via: Transcriptome, Forced Leaf Node Lookup:transcriptomes
#> 761253 RtM via: Malaria, Forced Non-Leaf Node Lookup:malaria
#> 149067 RtM via: Indonesia, RtM via: Indonesians, Forced Leaf Node Lookup:indonesians
#> 1627781 Entry Term Replacement for "Gene Expression Profiling", RtM via: Gene Expression Profiling, Forced Non-Leaf Node Lookup:transcriptome analysis
#> 716101 RtM via: RNA-Seq, Forced New Lookup:rna-seq
#> 1573542 RtM via: RNA
#> 58627 Forced Lookup:genotyping
#> 1230230 Entry Term Replacement for "Immunity, Innate", RtM via: Immunity, Innate, Forced Non-Leaf Node Lookup:innate immune responses
#> 35822 Forced Non-Leaf Node Lookup:genetic variations
#> bagOfWords
#> 1468242 209
#> 1596889 1163
#> 1116527 5
#> 832289 4
#> 216848 7
#> 1689323 576
#> 761253 6
#> 149067 2
#> 1627781 66
#> 716101 591
#> 1573542 372
#> 58627 20
#> 1230230 8
#> 35822 11
PCA summary of each study can be accessed through PCAsummary
method. Output
is a list with the length of studies, where each element is a matrix containing
PCA summary results: standard deviation (SD), variance explained by each PC
(Variance), and the cumulative variance explained (Cumulative).
length(PCAsummary(RAVmodel))
#> [1] 536
PCAsummary(RAVmodel)[1]
#> $DRP000987
#> DRP000987.PC1 DRP000987.PC2 DRP000987.PC3 DRP000987.PC4
#> SD 42.6382804 21.97962707 12.94848367 11.89265640
#> Variance 0.3580557 0.09514629 0.03302091 0.02785537
#> Cumulative 0.3580557 0.45320196 0.48622287 0.51407824
#> DRP000987.PC5 DRP000987.PC6 DRP000987.PC7 DRP000987.PC8
#> SD 10.89460616 10.53655583 9.47278967 8.97268098
#> Variance 0.02337622 0.02186495 0.01767287 0.01585607
#> Cumulative 0.53745446 0.55931942 0.57699228 0.59284836
#> DRP000987.PC9 DRP000987.PC10 DRP000987.PC11 DRP000987.PC12
#> SD 8.43652010 7.90810771 7.60865968 7.076401846
#> Variance 0.01401774 0.01231676 0.01140165 0.009862254
#> Cumulative 0.60686609 0.61918285 0.63058449 0.640446748
#> DRP000987.PC13 DRP000987.PC14 DRP000987.PC15 DRP000987.PC16
#> SD 6.89881271 6.766783788 6.703417100 6.6042884
#> Variance 0.00937346 0.009018116 0.008850009 0.0085902
#> Cumulative 0.64982021 0.658838324 0.667688332 0.6762785
#> DRP000987.PC17 DRP000987.PC18 DRP000987.PC19 DRP000987.PC20
#> SD 6.44257163 6.245257583 6.042236748 5.992461574
#> Variance 0.00817466 0.007681604 0.007190294 0.007072317
#> Cumulative 0.68445319 0.692134796 0.699325091 0.706397408
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] GenomicSuperSignature_1.0.1 SummarizedExperiment_1.22.0
#> [3] Biobase_2.52.0 GenomicRanges_1.44.0
#> [5] GenomeInfoDb_1.28.0 IRanges_2.26.0
#> [7] S4Vectors_0.30.0 BiocGenerics_0.38.0
#> [9] MatrixGenerics_1.4.0 matrixStats_0.58.0
#> [11] BiocStyle_2.20.0
#>
#> loaded via a namespace (and not attached):
#> [1] colorspace_2.0-1 ggsignif_0.6.1 rjson_0.2.20
#> [4] ellipsis_0.3.2 rio_0.5.26 circlize_0.4.12
#> [7] XVector_0.32.0 GlobalOptions_0.1.2 clue_0.3-59
#> [10] ggpubr_0.4.0 bit64_4.0.5 fansi_0.5.0
#> [13] codetools_0.2-18 doParallel_1.0.16 cachem_1.0.5
#> [16] knitr_1.33 jsonlite_1.7.2 Cairo_1.5-12.2
#> [19] broom_0.7.6 cluster_2.1.2 dbplyr_2.1.1
#> [22] png_0.1-7 BiocManager_1.30.15 compiler_4.1.0
#> [25] httr_1.4.2 backports_1.2.1 assertthat_0.2.1
#> [28] Matrix_1.3-3 fastmap_1.1.0 htmltools_0.5.1.1
#> [31] tools_4.1.0 gtable_0.3.0 glue_1.4.2
#> [34] GenomeInfoDbData_1.2.6 dplyr_1.0.6 rappdirs_0.3.3
#> [37] Rcpp_1.0.6 carData_3.0-4 cellranger_1.1.0
#> [40] jquerylib_0.1.4 vctrs_0.3.8 iterators_1.0.13
#> [43] xfun_0.23 stringr_1.4.0 openxlsx_4.2.3
#> [46] lifecycle_1.0.0 rstatix_0.7.0 zlibbioc_1.38.0
#> [49] scales_1.1.1 hms_1.1.0 RColorBrewer_1.1-2
#> [52] ComplexHeatmap_2.8.0 yaml_2.2.1 curl_4.3.1
#> [55] memoise_2.0.0 ggplot2_3.3.3 sass_0.4.0
#> [58] stringi_1.6.2 RSQLite_2.2.7 foreach_1.5.1
#> [61] filelock_1.0.2 zip_2.1.1 shape_1.4.6
#> [64] rlang_0.4.11 pkgconfig_2.0.3 bitops_1.0-7
#> [67] evaluate_0.14 lattice_0.20-44 purrr_0.3.4
#> [70] bit_4.0.4 tidyselect_1.1.1 magrittr_2.0.1
#> [73] bookdown_0.22 R6_2.5.0 generics_0.1.0
#> [76] DelayedArray_0.18.0 DBI_1.1.1 pillar_1.6.1
#> [79] haven_2.4.1 foreign_0.8-81 withr_2.4.2
#> [82] abind_1.4-5 RCurl_1.98-1.3 tibble_3.1.2
#> [85] crayon_1.4.1 car_3.0-10 utf8_1.2.1
#> [88] BiocFileCache_2.0.0 rmarkdown_2.8 GetoptLong_1.0.5
#> [91] grid_4.1.0 readxl_1.3.1 data.table_1.14.0
#> [94] blob_1.2.1 forcats_0.5.1 digest_0.6.27
#> [97] tidyr_1.1.3 munsell_0.5.0 bslib_0.2.5.1