Contents

1 Setup

1.1 Install and load package

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("GenomicSuperSignature")
library(GenomicSuperSignature)

1.2 Download RAVmodel

You can download GenomicSuperSignature from Google Cloud bucket using GenomicSuperSignature::getModel function. Currently available models are built from top 20 PCs of 536 studies (containing 44,890 samples) containing 13,934 common genes from each of 536 study’s top 90% varying genes based on their study-level standard deviation. There are two versions of this RAVmodel annotated with different gene sets for GSEA: MSigDB C2 (C2) and three priors from PLIER package (PLIERpriors). In this vignette, we are showing the C2 annotated model.

Note that the first interactive run of this code, you will be asked to allow R to create a cache directory. The model file will be stored there and subsequent calls to getModel will read from the cache.

RAVmodel <- getModel("C2", load=TRUE)
#> [1] "downloading"

2 Content of RAVmodel

RAVindex is a matrix containing genes in rows and RAVs in columns. colData slot provides the information on each RAVs, such as GSEA annotation and studies involved in each cluster. metadata slot stores model construction information. trainingData slot contains the information on individual studies in training dataset, such as MeSH terms assigned to each study.

RAVmodel
#> class: PCAGenomicSignatures 
#> dim: 13934 4764 
#> metadata(8): cluster size ... updateNote version
#> assays(1): RAVindex
#> rownames(13934): CASKIN1 DDX3Y ... CTC-457E21.9 AC007966.1
#> rowData names(0):
#> colnames(4764): RAV1 RAV2 ... RAV4763 RAV4764
#> colData names(4): RAV studies silhouetteWidth gsea
#> trainingData(2): PCAsummary MeSH
#> trainingData names(536): DRP000987 SRP059172 ... SRP164913 SRP188526

2.1 RAVindex

Replicable Axis of Variation (RAV) index is the main component of GenomicSuperSignature. It serves as an index connecting new datasets and the existing database. You can access it through GenomicSuperSignature::RAVindex (equivalent of SummarizedExperiment::assay). Rows are genes and columns are RAVs.

Here, RAVmodel consists of 13,934 genes and 4,764 RAVs.

class(RAVindex(RAVmodel))
#> [1] "matrix" "array"
dim(RAVindex(RAVmodel))
#> [1] 13934  4764
RAVindex(RAVmodel)[1:4, 1:4]
#>                RAV1          RAV2         RAV3         RAV4
#> CASKIN1 0.002449532 -2.845471e-03  0.001601715 -0.009366156
#> DDX3Y   0.006093438  1.870528e-03 -0.004346019 -0.006094526
#> MDGA1   0.009999975  2.610302e-03 -0.002831447 -0.006710757
#> EIF1AY  0.007923652  8.507021e-05  0.002322275 -0.007850784

2.2 Metadata for RAVmodel

Metadata slot of RAVmodel contains information related to the model building.

names(metadata(RAVmodel))
#> [1] "cluster"    "size"       "k"          "n"          "geneSets"  
#> [6] "MeSH_freq"  "updateNote" "version"
  • cluster : cluster membership of each PCs from the training dataset
  • size : an integer vector with the length of clusters, containing the number of PCs in each cluster
  • k : the number of all clusters in the given RAVmodel
  • n : the number of top PCs kept from each study in the training dataset
  • geneSets : the name of gene sets used for GSEA annotation
  • MeSH_freq : the frequency of MeSH terms associated with the training dataset. MeSH terms like ‘Humans’ and ‘RNA-seq’ are top ranked (which is very expected) because the training dataset of this model is Human RNA sequencing data.
  • updateNote : a brief note on the given model’s specification
  • version : the version of the given model
head(metadata(RAVmodel)$cluster)
#> DRP000987.PC1 DRP000987.PC2 DRP000987.PC3 DRP000987.PC4 DRP000987.PC5 
#>             1             2             3             4             5 
#> DRP000987.PC6 
#>             6
head(metadata(RAVmodel)$size)
#> RAV1 RAV2 RAV3 RAV4 RAV5 RAV6 
#>    6   21    4    7    3    3
metadata(RAVmodel)$k
#> [1] 4764
metadata(RAVmodel)$n
#> [1] 20
geneSets(RAVmodel)
#> [1] "MSigDB C2 v.7.1"
head(metadata(RAVmodel)$MeSH_freq)
#> 
#>                 Humans                RNA-Seq          Transcriptome 
#>                   1163                    591                    576 
#>                    RNA Whole Exome Sequencing        Sequencing, RNA 
#>                    372                    284                    277
updateNote(RAVmodel)  
#> [1] "536 refine.bio studies/ top 90% varying genes/ GSEA with MSigDB C2"
metadata(RAVmodel)$version
#> [1] ">= 0.0.7"

2.3 Studies in each RAV

You can find which studies are in each cluster using studies method. Output is a list with the length of clusters, where each element is a character vector containing the name of studies in each cluster.

length(studies(RAVmodel))
#> [1] 4764
studies(RAVmodel)[1:3]
#> $Cl4764_01
#> [1] "DRP000987" "SRP059172" "SRP059959" "SRP118741" "SRP144178" "SRP183186"
#> 
#> $Cl4764_02
#>  [1] "DRP000987" "ERP022839" "ERP104463" "ERP106996" "ERP109789" "ERP114435"
#>  [7] "SRP059039" "SRP059172" "SRP065317" "SRP071547" "SRP073061" "SRP074274"
#> [13] "SRP075449" "SRP092158" "SRP096736" "SRP107337" "SRP111402" "SRP125008"
#> [19] "SRP133442" "SRP135684" "SRP155483"
#> 
#> $Cl4764_03
#> [1] "DRP000987" "SRP001540" "SRP056840" "SRP125396"

You can check which PC from different studies are in RAVs using PCinRAV.

PCinRAV(RAVmodel, 2)
#>  [1] "DRP000987.PC2" "ERP022839.PC2" "ERP104463.PC1" "ERP106996.PC2"
#>  [5] "ERP109789.PC2" "ERP114435.PC1" "SRP059039.PC3" "SRP059172.PC2"
#>  [9] "SRP065317.PC1" "SRP071547.PC3" "SRP073061.PC2" "SRP074274.PC5"
#> [13] "SRP075449.PC2" "SRP092158.PC2" "SRP096736.PC2" "SRP107337.PC3"
#> [17] "SRP111402.PC1" "SRP125008.PC2" "SRP133442.PC1" "SRP135684.PC4"
#> [21] "SRP155483.PC2"

2.4 Silhouette width for each RAV

Silhouette width ranges from -1 to 1 for each cluster. Typically, it is interpreted as follows:
- Values close to 1 suggest that the observation is well matched to the assigned cluster
- Values close to 0 suggest that the observation is borderline matched between two clusters
- Values close to -1 suggest that the observations may be assigned to the wrong cluster

For RAVmodel, the average silhouette width of each cluster is a quality control measure and suggested as a secondary reference to choose proper RAVs, following validation score.

x <- silhouetteWidth(RAVmodel)
head(x)   # average silhouette width of the first 6 RAVs
#>           1           2           3           4           5           6 
#> -0.05470163  0.06426256 -0.01800335 -0.04005584  0.05786189 -0.02520973

2.5 GSEA on each RAV

Pre-processed GSEA results on each RAV are stored in RAVmodel and can be accessed through gsea function.

class(gsea(RAVmodel))
#> [1] "list"
class(gsea(RAVmodel)[[1]])
#> [1] "data.frame"
length(gsea(RAVmodel))
#> [1] 4764
gsea(RAVmodel)[1]
#> $RAV1
#>                                                                                     Description
#> HELLEBREKERS_SILENCED_DURING_TUMOR_ANGIOGENESIS HELLEBREKERS_SILENCED_DURING_TUMOR_ANGIOGENESIS
#> KIM_GLIS2_TARGETS_UP                                                       KIM_GLIS2_TARGETS_UP
#> NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP                         NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP
#> PHONG_TNF_TARGETS_UP                                                       PHONG_TNF_TARGETS_UP
#> RASHI_RESPONSE_TO_IONIZING_RADIATION_2                   RASHI_RESPONSE_TO_IONIZING_RADIATION_2
#> REACTOME_TRANSLATION                                                       REACTOME_TRANSLATION
#> ZHU_CMV_ALL_DN                                                                   ZHU_CMV_ALL_DN
#> GALINDO_IMMUNE_RESPONSE_TO_ENTEROTOXIN                   GALINDO_IMMUNE_RESPONSE_TO_ENTEROTOXIN
#>                                                       NES       pvalue
#> HELLEBREKERS_SILENCED_DURING_TUMOR_ANGIOGENESIS -3.851302 1.000000e-10
#> KIM_GLIS2_TARGETS_UP                            -3.472416 1.000000e-10
#> NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP             -3.176200 1.000000e-10
#> PHONG_TNF_TARGETS_UP                            -3.276382 1.000000e-10
#> RASHI_RESPONSE_TO_IONIZING_RADIATION_2          -4.196536 1.000000e-10
#> REACTOME_TRANSLATION                             2.300159 1.000000e-10
#> ZHU_CMV_ALL_DN                                  -3.423609 1.000000e-10
#> GALINDO_IMMUNE_RESPONSE_TO_ENTEROTOXIN          -3.032082 1.139605e-10
#>                                                     qvalues
#> HELLEBREKERS_SILENCED_DURING_TUMOR_ANGIOGENESIS 3.43231e-08
#> KIM_GLIS2_TARGETS_UP                            3.43231e-08
#> NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP             3.43231e-08
#> PHONG_TNF_TARGETS_UP                            3.43231e-08
#> RASHI_RESPONSE_TO_IONIZING_RADIATION_2          3.43231e-08
#> REACTOME_TRANSLATION                            3.43231e-08
#> ZHU_CMV_ALL_DN                                  3.43231e-08
#> GALINDO_IMMUNE_RESPONSE_TO_ENTEROTOXIN          3.43231e-08

2.6 MeSH terms for each study

You can find MeSH terms associated with each study using mesh method. Output is a list with the length of studies used for training. Each element of this output list is a data frame containing the assigned MeSH terms and the detail of them. The last column bagOfWords is the frequency of the MeSH term in the whole training dataset.

length(mesh(RAVmodel))
#> [1] 536
mesh(RAVmodel)[1]
#> $DRP000987
#>          score identifier                   name
#> 1468242 162422  DRP000987                Animals
#> 1596889 162422  DRP000987                 Humans
#> 1116527 161422  DRP000987  Plasmodium falciparum
#> 832289   96999  DRP000987    Malaria, Falciparum
#> 216848   74145  DRP000987              Parasites
#> 1689323  57176  DRP000987          Transcriptome
#> 761253   55509  DRP000987                Malaria
#> 149067   37162  DRP000987              Indonesia
#> 1627781  36315  DRP000987 Transcriptome Analysis
#> 716101   22634  DRP000987                RNA-Seq
#> 1573542  19938  DRP000987                    RNA
#> 58627    18575  DRP000987               Genotype
#> 1230230   7197  DRP000987 Innate Immune Response
#> 35822     1000  DRP000987      Genetic Variation
#>                                                                                                                                                                                                                                                                                                  explanation
#> 1468242                                                                                                                                                                                                                                           Parasites, CT Treecode Lookup: B01.050.500.714 (Parasites)
#> 1596889 Forced Leaf Node Lookup:humans, CT Text Lookup: human, CT Text Lookup: humans, CT Text Lookup: patient, CT Text Lookup: patients, Forced Leaf Node Lookup:humans, CT Text Lookup: human, CT Text Lookup: humans, CT Text Lookup: patient, Parasites, CT Treecode Lookup: B01.050.500.714 (Parasites)
#> 1116527                                                                                                                                                                                                 RtM via: Plasmodium falciparum, Forced Leaf Node Lookup:plasmodium falciparums, checkForceMH Boosted
#> 832289                                                                                                                                                                                                                                                              RtM via: Plasmodium falciparum infection
#> 216848                                                                                                                                                                                                                           RtM via: Parasites, Forced Leaf Node Lookup:parasites, checkForceMH Boosted
#> 1689323                                                                                                                                                                                                                                       RtM via: Transcriptome, Forced Leaf Node Lookup:transcriptomes
#> 761253                                                                                                                                                                                                                                                 RtM via: Malaria, Forced Non-Leaf Node Lookup:malaria
#> 149067                                                                                                                                                                                                                         RtM via: Indonesia, RtM via: Indonesians, Forced Leaf Node Lookup:indonesians
#> 1627781                                                                                                                                                       Entry Term Replacement for "Gene Expression Profiling", RtM via: Gene Expression Profiling, Forced Non-Leaf Node Lookup:transcriptome analysis
#> 716101                                                                                                                                                                                                                                                           RtM via: RNA-Seq, Forced New Lookup:rna-seq
#> 1573542                                                                                                                                                                                                                                                                                         RtM via: RNA
#> 58627                                                                                                                                                                                                                                                                               Forced Lookup:genotyping
#> 1230230                                                                                                                                                                        Entry Term Replacement for "Immunity, Innate", RtM via: Immunity, Innate, Forced Non-Leaf Node Lookup:innate immune responses
#> 35822                                                                                                                                                                                                                                                         Forced Non-Leaf Node Lookup:genetic variations
#>         bagOfWords
#> 1468242        209
#> 1596889       1163
#> 1116527          5
#> 832289           4
#> 216848           7
#> 1689323        576
#> 761253           6
#> 149067           2
#> 1627781         66
#> 716101         591
#> 1573542        372
#> 58627           20
#> 1230230          8
#> 35822           11

2.7 PCA summary for each study

PCA summary of each study can be accessed through PCAsummary method. Output is a list with the length of studies, where each element is a matrix containing PCA summary results: standard deviation (SD), variance explained by each PC (Variance), and the cumulative variance explained (Cumulative).

length(PCAsummary(RAVmodel))
#> [1] 536
PCAsummary(RAVmodel)[1]
#> $DRP000987
#>            DRP000987.PC1 DRP000987.PC2 DRP000987.PC3 DRP000987.PC4
#> SD            42.6382804   21.97962707   12.94848367   11.89265640
#> Variance       0.3580557    0.09514629    0.03302091    0.02785537
#> Cumulative     0.3580557    0.45320196    0.48622287    0.51407824
#>            DRP000987.PC5 DRP000987.PC6 DRP000987.PC7 DRP000987.PC8
#> SD           10.89460616   10.53655583    9.47278967    8.97268098
#> Variance      0.02337622    0.02186495    0.01767287    0.01585607
#> Cumulative    0.53745446    0.55931942    0.57699228    0.59284836
#>            DRP000987.PC9 DRP000987.PC10 DRP000987.PC11 DRP000987.PC12
#> SD            8.43652010     7.90810771     7.60865968    7.076401846
#> Variance      0.01401774     0.01231676     0.01140165    0.009862254
#> Cumulative    0.60686609     0.61918285     0.63058449    0.640446748
#>            DRP000987.PC13 DRP000987.PC14 DRP000987.PC15 DRP000987.PC16
#> SD             6.89881271    6.766783788    6.703417100      6.6042884
#> Variance       0.00937346    0.009018116    0.008850009      0.0085902
#> Cumulative     0.64982021    0.658838324    0.667688332      0.6762785
#>            DRP000987.PC17 DRP000987.PC18 DRP000987.PC19 DRP000987.PC20
#> SD             6.44257163    6.245257583    6.042236748    5.992461574
#> Variance       0.00817466    0.007681604    0.007190294    0.007072317
#> Cumulative     0.68445319    0.692134796    0.699325091    0.706397408

3 Session Info

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] GenomicSuperSignature_1.2.1 SummarizedExperiment_1.24.0
#>  [3] Biobase_2.54.0              GenomicRanges_1.46.1       
#>  [5] GenomeInfoDb_1.30.0         IRanges_2.28.0             
#>  [7] S4Vectors_0.32.3            BiocGenerics_0.40.0        
#>  [9] MatrixGenerics_1.6.0        matrixStats_0.61.0         
#> [11] BiocStyle_2.22.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] bitops_1.0-7           bit64_4.0.5            filelock_1.0.2        
#>  [4] httr_1.4.2             doParallel_1.0.16      RColorBrewer_1.1-2    
#>  [7] tools_4.1.2            backports_1.4.1        bslib_0.3.1           
#> [10] utf8_1.2.2             R6_2.5.1               DBI_1.1.1             
#> [13] colorspace_2.0-2       GetoptLong_1.0.5       withr_2.4.3           
#> [16] tidyselect_1.1.1       curl_4.3.2             bit_4.0.4             
#> [19] compiler_4.1.2         DelayedArray_0.20.0    bookdown_0.24         
#> [22] sass_0.4.0             scales_1.1.1           rappdirs_0.3.3        
#> [25] stringr_1.4.0          digest_0.6.29          rmarkdown_2.11        
#> [28] XVector_0.34.0         pkgconfig_2.0.3        htmltools_0.5.2       
#> [31] dbplyr_2.1.1           fastmap_1.1.0          rlang_0.4.12          
#> [34] GlobalOptions_0.1.2    RSQLite_2.2.9          shape_1.4.6           
#> [37] jquerylib_0.1.4        generics_0.1.1         jsonlite_1.7.2        
#> [40] dplyr_1.0.7            car_3.0-12             RCurl_1.98-1.5        
#> [43] magrittr_2.0.1         GenomeInfoDbData_1.2.7 Matrix_1.4-0          
#> [46] Rcpp_1.0.7             munsell_0.5.0          fansi_0.5.0           
#> [49] abind_1.4-5            lifecycle_1.0.1        stringi_1.7.6         
#> [52] yaml_2.2.1             carData_3.0-4          zlibbioc_1.40.0       
#> [55] BiocFileCache_2.2.0    grid_4.1.2             blob_1.2.2            
#> [58] parallel_4.1.2         crayon_1.4.2           lattice_0.20-45       
#> [61] circlize_0.4.13        knitr_1.36             ComplexHeatmap_2.10.0 
#> [64] pillar_1.6.4           ggpubr_0.4.0           rjson_0.2.20          
#> [67] ggsignif_0.6.3         codetools_0.2-18       glue_1.5.1            
#> [70] evaluate_0.14          BiocManager_1.30.16    png_0.1-7             
#> [73] vctrs_0.3.8            foreach_1.5.1          gtable_0.3.0          
#> [76] purrr_0.3.4            tidyr_1.1.4            clue_0.3-60           
#> [79] assertthat_0.2.1       cachem_1.0.6           ggplot2_3.3.5         
#> [82] xfun_0.29              broom_0.7.10           rstatix_0.7.0         
#> [85] tibble_3.1.6           iterators_1.0.13       memoise_2.0.1         
#> [88] cluster_2.1.2          ellipsis_0.3.2