Handling Expression Matrices with ImmuneSpaceR

Renan Sauteraud

2020-04-27

This vignette shows detailed examples for the getGEMatrix() method.

1 Create connections

As explained in the introductory vignette, datasets must be downloaded from ImmuneSpaceConnection objects. We must first instantiate a connection to the study or studies of interest. Throughout this vignette, we will use two connections: one to a single study, and one to all available data.

library(ImmuneSpaceR)
sdy269 <- CreateConnection("SDY269")
## Warning in matrix(data = unlist(curfld[resultCols]), nrow = 1, ncol =
## length(resultCols), : data length [4] is not a sub-multiple or multiple of the
## number of columns [7]
all <- CreateConnection("")
## Warning in matrix(data = unlist(curfld[resultCols]), nrow = 1, ncol =
## length(resultCols), : data length [4] is not a sub-multiple or multiple of the
## number of columns [7]

2 List the expression matrices

Now that the connections have been instantiated, we can start downloading from them. But we need to figure out which processed matrices are available within our chosen studies.

On the ImmuneSpace portal, in the study of interest or at the project level, the Gene expression matrices table will show the available runs.

Printing the connections will, among other information, list the available datasets. The listDatasets method displays only the downloadable data, that is, the datasets and the expression matrices.

sdy269$listDatasets()
## datasets
##  cohort_membership
##  demographics
##  elisa
##  elispot
##  fcs_analyzed_result
##  fcs_sample_files
##  gene_expression_files
##  hai
##  pcr
## Expression Matrices
##  SDY269_PBMC_TIV_Geo
##  SDY269_PBMC_LAIV_Geo

Using output = "expression", we can remove the datasets from the output.

all$listDatasets(output = "expression")
## Expression Matrices
##  SDY1086_PBMC_GroupB_Geo
##  SDY1086_PBMC_GroupA_Geo
##  SDY1267_PBMC_RRR_Geo
##  SDY1267_PBMC_ARR_Geo
##  SDY1412_WholeBlood_EPC002_geo
##  SDY1256_WholeBlood_EPIC001_geo
##  SDY80_PBMC_Cohort2_geo
##  SDY299_WholeBlood_HEPISLAV
##  SDY180_WholeBlood_Grp2Saline_Geo
##  SDY180_WholeBlood_Grp2Pneunomax23_Geo
##  SDY180_WholeBlood_Grp2Fluzone_Geo
##  SDY180_WholeBlood_Grp1Saline_Geo
##  SDY180_WholeBlood_Grp1Pneunomax23_Geo
##  SDY180_WholeBlood_Grp1Fluzone_Geo
##  SDY1325_WholeBlood_LowIntraMuscularPS_geo
##  SDY1294_PBMC_ChineseCohort_Geo
##  SDY1119_PBMC_oldHealthy_Geo
##  SDY1119_PBMC_oldT2D_Geo
##  SDY1119_PBMC_youngT2D_Geo
##  SDY1119_PBMC_youngHealthy_Geo
##  SDY1289_WholeBlood_MontrealCohort_Geo
##  SDY1289_WholeBlood_LausanneCohort_Geo
##  SDY1324_PBMC_nonBCGvacc
##  SDY1324_PBMC_LatentTB
##  SDY1324_PBMC_BCGvacc
##  SDY89_WholeBlood_EnergixB
##  SDY1370_Bcell_lc16m8_geo
##  SDY1370_Bcell_dryvax_geo
##  SDY1370_Tcell_lc16m8_geo
##  SDY1370_Tcell_dryvax_geo
##  SDY1370_PBMC_lc16m8_geo
##  SDY1370_PBMC_dryvax_geo
##  SDY1368_WholeBlood_Twin_Geo
##  SDY1368_WholeBlood_NonTwin_Geo
##  SDY67_PBMC_HealthyAdults
##  SDY1328_WholeBlood_HealthyAdults_geo
##  SDY224_PBMC_TIV2010_ImmPort
##  SDY888_PBMC_UninfectedEndemicArea_Geo
##  SDY888_PBMC_UninfectedNonEndemicArea_Geo
##  SDY888_PBMC_InfectedEndemicArea_Geo
##  SDY28_PBMC_Dryvax
##  SDY34_PBMC_TIV
##  SDY34_PBMC_Controls
##  SDY305_Other_IDTIV_Geo
##  SDY305_Other_TIV_Geo
##  SDY112_Other_GroupC
##  SDY112_Other_GroupB
##  SDY112_Other_GroupA
##  SDY315_Other_GroupC_Geo
##  SDY315_Other_GroupB_Geo
##  SDY315_Other_GroupA_Geo
##  SDY406_Other_ILI_Geo
##  SDY113_Other_IDTIV_Geo
##  SDY113_Other_LAIV_Geo
##  SDY113_Other_TIV_Geo
##  SDY144_Other_TIV_Geo
##  SDY690_PBMC_Energixb
##  SDY690_WholeBlood_Energixb
##  SDY597_Other_InVitro
##  SDY522_Other_LAIV
##  SDY387_WholeBlood_NCH2010
##  SDY372_WholeBlood_JDM2012
##  SDY368_WholeBlood_NCH2013
##  SDY364_WholeBlood_NCH2012
##  SDY312_Other_GroupC
##  SDY312_Other_GroupB
##  SDY312_Other_GroupA
##  SDY301_Other_AIRFV
##  SDY296_WholeBlood_AIRFV
##  SDY667_WholeBlood_PSORPPP
##  SDY212_WholeBlood_Older_Geo
##  SDY212_WholeBlood_Young_Geo
##  SDY212_PBMC_Older_Geo
##  SDY212_PBMC_Young_geo
##  SDY270_PBMC_TIVGroup_Geo
##  SDY1373_WholeBlood_highDose_Geo
##  SDY1373_WholeBlood_lowDose_Geo
##  SDY1364_PBMC_IntraDermal_Geo
##  SDY1364_PBMC_IntraMuscular_Geo
##  SDY1325_WholeBlood_IntramuscularCRM_Geo
##  SDY1325_WholeBlood_IntramuscularPS_Geo
##  SDY1325_WholeBlood_SubcutaneousPS_Geo
##  SDY1291_PBMC_HealthyHIVUninfected_Geo
##  SDY1293_PBMC_Vaccinated_geo
##  SDY1293_PBMC_Control_Geo
##  SDY1276_WholeBlood_Validation_Geo
##  SDY1276_WholeBlood_Discovery_Geo
##  SDY1264_PBMC_Trial2_Geo
##  SDY1264_PBMC_Trial1_Geo
##  SDY1260_PBMC_MCV4_Geo
##  SDY1260_PBMC_MPSV4_Geo
##  SDY984_PBMC_Elderly_Geo
##  SDY984_PBMC_Young_Geo
##  SDY61_PBMC_TIVGrp
##  SDY56_PBMC_Older
##  SDY56_PBMC_Young
##  SDY63_PBMC_Young_Geo
##  SDY63_PBMC_Older_Geo
##  SDY404_PBMC_Young_Geo
##  SDY404_PBMC_Older_Geo
##  SDY400_PBMC_Older_Geo
##  SDY400_PBMC_Young_Geo
##  SDY269_PBMC_TIV_Geo
##  SDY269_PBMC_LAIV_Geo
##  SDY300_dendriticCell_dcMonoFlu2011
##  SDY300_otherCell_dcMonoFlu2011
##  SDY162_Macrophage_VLplus
##  SDY162_PBMC_VLplus
##  SDY162_Macrophage_VLminus
##  SDY162_PBMC_VLminus

Naturally, all contains every processed matrix available on ImmuneSpace, since it combines all available studies.

3 Download

3.1 By run name

The getGEMatrix method will accept any of the run names listed in the connection.

## Downloading matrix..
## Constructing ExpressionSet
## Downloading matrix..
## Constructing ExpressionSet
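
Calls of the following form would produce messages like those above. This is a sketch that assumes the sdy269 and all connections from section 1 and network access to ImmuneSpace; the variable name TIV_2011 and the choice of SDY144_Other_TIV_Geo as the second run are illustrative.

```r
library(ImmuneSpaceR)

# Assumes `sdy269` and `all` were created with CreateConnection() as in section 1
TIV_2008 <- sdy269$getGEMatrix("SDY269_PBMC_TIV_Geo")

# Illustrative second run, picked from the list printed by all$listDatasets()
TIV_2011 <- all$getGEMatrix("SDY144_Other_TIV_Geo")
```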

The matrices are returned as ExpressionSet objects, where the phenoData slot contains basic demographic information and the featureData slot holds a mapping of probes to official gene symbols.

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 16442 features, 80 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: BS586128 BS586240 ... BS586267 (80 total)
##   varLabels: participant_id study_time_collected ...
##     exposure_process_preferred (8 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: DDR1 RFC2 ... NUS1P3 (16442 total)
##   fvarLabels: FeatureId gene_symbol
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
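
To show how the slots described above are typically read, here is a minimal sketch using the standard Biobase accessors (exprs, pData, fData) on a toy ExpressionSet built locally. The participant identifiers and expression values are made up for illustration; a matrix returned by getGEMatrix can be inspected with the same accessors.

```r
library(Biobase)

# Toy stand-in for a downloaded matrix: 3 genes x 2 samples (values are made up)
expr <- matrix(c(5.1, 6.3, 7.2, 5.4, 6.0, 7.9), nrow = 3,
               dimnames = list(c("DDR1", "RFC2", "HSPA6"), c("S1", "S2")))

# Per-sample annotation; row names must match the column names of `expr`
pheno <- data.frame(participant_id = c("SUB1.269", "SUB2.269"),
                    study_time_collected = c(0, 7),
                    row.names = colnames(expr))

# Per-feature annotation; row names must match the row names of `expr`
feat <- data.frame(FeatureId = rownames(expr),
                   gene_symbol = rownames(expr),
                   row.names = rownames(expr))

eset <- ExpressionSet(assayData = expr,
                      phenoData = AnnotatedDataFrame(pheno),
                      featureData = AnnotatedDataFrame(feat))

dim(exprs(eset))   # expression values, features x samples
pData(eset)        # phenoData as a data.frame
fData(eset)        # featureData as a data.frame
```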

3.2 By cohortType

The cohortType argument can be used in place of the run name (the first argument). It is a concatenation of “cohort” and “cell type” so that you can analyze matrices that have been normalized within a cell type. As with run names, the list of valid cohortTypes can be found in the Gene expression matrices table.

## Downloading matrix..
## Constructing ExpressionSet

Note that when cohortType is used, the run name is ignored.
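
A call by cohortType might look like the sketch below. It assumes the sdy269 connection from section 1; the value "LAIV group 2008_PBMC" and the variable name LAIV_2008 are assumptions made for illustration, so check the Gene expression matrices table for the exact cohortType strings available in your study.

```r
library(ImmuneSpaceR)

# Assumes `sdy269` was created with CreateConnection("SDY269") as in section 1.
# The cohortType string is an assumed example of "<cohort>_<cell type>".
LAIV_2008 <- sdy269$getGEMatrix(cohortType = "LAIV group 2008_PBMC")
```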

4 Summarized matrices

Expression matrices can have probe names as features (or rows). However, multiple probes often match the same gene, and merging experiments from different arrays is impossible at the probe level. When summarized matrices are available, outputType = "summary" returns the matrices with gene symbols instead of probes. Setting annotation = "latest" uses the latest official gene symbols mapped for each probe; the original mappings from when the matrix was created can also be retrieved through the annotation argument.

TIV_2008_sum <- sdy269$getGEMatrix("SDY269_PBMC_TIV_Geo", outputType = "summary", annotation = "latest")
## Returning SDY269_PBMC_TIV_Geo_summary_latest_eset from cache

Probes that do not map to a unique gene are removed and expression is averaged by gene.

TIV_2008_sum
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 16442 features, 80 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: BS586128 BS586240 ... BS586267 (80 total)
##   varLabels: participant_id study_time_collected ...
##     exposure_process_preferred (8 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: DDR1 RFC2 ... NUS1P3 (16442 total)
##   fvarLabels: FeatureId gene_symbol
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
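
The summarization rule stated above (drop probes that do not map to a unique gene, then average by gene) can be pictured with a small base R sketch. This is not ImmuneSpaceR's implementation: the probe names, the values, and the use of ";" to mark an ambiguous mapping are assumptions made for illustration.

```r
# Toy probe-level matrix: 5 probes x 2 samples (values are made up)
probe_expr <- matrix(c(2, 4, 6, 8, 10,
                       3, 5, 7, 9, 11),
                     nrow = 5,
                     dimnames = list(paste0("probe", 1:5), c("S1", "S2")))

# Hypothetical probe-to-gene annotation: probe4 is ambiguous, probe5 is unmapped
gene_symbol <- c(probe1 = "DDR1", probe2 = "DDR1", probe3 = "RFC2",
                 probe4 = "GENEA;GENEB", probe5 = NA)

# Keep only probes that map to exactly one gene
unique_map <- !is.na(gene_symbol) & !grepl(";", gene_symbol, fixed = TRUE)
kept_expr <- probe_expr[unique_map, , drop = FALSE]
kept_gene <- gene_symbol[unique_map]

# Average expression over the probes of each gene:
# sum per gene, then divide each row by its number of probes
gene_sum <- rowsum(kept_expr, group = kept_gene)
gene_n <- as.vector(table(kept_gene)[rownames(gene_sum)])
gene_expr <- gene_sum / gene_n
gene_expr
```

Here DDR1 is measured by two probes, so its row in gene_expr is their mean, while RFC2 keeps the values of its single probe.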

5 Combining matrices

To facilitate analysis across experiments and studies, when multiple runs or cohorts are specified, getGEMatrix will attempt to combine the selected expression matrices into a single ExpressionSet.

To avoid returning an empty object, it is usually recommended to use the summarized version of the matrices, thus combining by genes. This is almost always necessary when combining data from multiple studies.

# Within a study
em269 <- sdy269$getGEMatrix(c("SDY269_PBMC_TIV_Geo", "SDY269_PBMC_LAIV_Geo"))
## Returning SDY269_PBMC_TIV_Geo_summary_latest_eset from cache
## Returning SDY269_PBMC_LAIV_Geo_summary_latest_eset from cache
## Combining ExpressionSets
# Combining across studies
TIV_seasons <- all$getGEMatrix(c("SDY269_PBMC_TIV_Geo", "SDY144_Other_TIV_Geo"),
                               outputType = "summary",
                               annotation = "latest")
## Downloading matrix..
## Constructing ExpressionSet
## Returning SDY144_Other_TIV_Geo_summary_latest_eset from cache
## Combining ExpressionSets
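
To see why combining by gene matters, consider this base R sketch of merging two matrices on their shared features. It is only an illustration under assumed toy data, not the code getGEMatrix runs: the matrices, sample names, and probe names are made up.

```r
# Two toy gene-level matrices from different "studies" (values are made up)
em_a <- matrix(1:6, nrow = 3,
               dimnames = list(c("DDR1", "RFC2", "HSPA6"), c("A1", "A2")))
em_b <- matrix(7:12, nrow = 3,
               dimnames = list(c("RFC2", "DDR1", "PAX8"), c("B1", "B2")))

# Keep the features present in both, in a common order, then bind the samples
shared <- intersect(rownames(em_a), rownames(em_b))
combined <- cbind(em_a[shared, , drop = FALSE], em_b[shared, , drop = FALSE])
dim(combined)

# With no shared feature names (for example, probes from different arrays),
# the intersection is empty and nothing is left to combine
em_c <- em_b
rownames(em_c) <- c("probeX", "probeY", "probeZ")
length(intersect(rownames(em_a), rownames(em_c)))
```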

6 Caching

As explained in the introductory vignette, the ImmuneSpaceConnection class is an R6 class. This means its objects have fields accessed by reference. As a consequence, they can be modified without making a copy of the entire object. ImmuneSpaceR uses this feature to store downloaded datasets and expression matrices. Subsequent calls to getGEMatrix with the same input will be faster.

See ?R6::R6Class for more information about the R6 class system.
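
The caching idea can be sketched with a toy R6 class. CachedFetcher, its fetch method, and the n_downloads counter are hypothetical names invented for this illustration; the sketch only mirrors the behavior described here and is not how ImmuneSpaceConnection is implemented.

```r
library(R6)

# Hypothetical class, for illustration only (not ImmuneSpaceR's implementation)
CachedFetcher <- R6Class("CachedFetcher",
  public = list(
    cache = list(),
    n_downloads = 0,
    fetch = function(name, reload = FALSE) {
      # reload = TRUE drops the cached entry for this name only
      if (reload) self$cache[[name]] <- NULL
      if (is.null(self$cache[[name]])) {
        # Stand-in for a slow download
        self$n_downloads <- self$n_downloads + 1
        self$cache[[name]] <- paste("contents of", name)
      }
      self$cache[[name]]
    },
    clearCache = function() {
      self$cache <- list()
      invisible(self)
    }
  )
)

con <- CachedFetcher$new()
con$fetch("run_A")                  # first call: "downloads" and caches
con$fetch("run_A")                  # second call: served from the cache
con$n_downloads                     # still 1
con$fetch("run_A", reload = TRUE)   # forces a fresh "download"
con$n_downloads                     # now 2
```

Because the object is modified by reference, the cache persists between calls without reassigning con.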

We can see a list of already downloaded runs and feature sets in the cache field. This field is not intended to be used for data manipulation and is only displayed here to explain what gets cached.

names(sdy269$cache)
## [1] "GE_matrices"                             
## [2] "SDY269_PBMC_TIV_Geo_sum_latest"          
## [3] "featureset_18"                           
## [4] "SDY269_PBMC_TIV_Geo_summary_latest_eset" 
## [5] "SDY269_PBMC_LAIV_Geo_sum_latest"         
## [6] "SDY269_PBMC_LAIV_Geo_summary_latest_eset"

If, for any reason, a specific matrix needs to be redownloaded, the reload argument will clear the cache for that specific getGEMatrix call and download the file and metadata again.

TIV_2008 <- sdy269$getGEMatrix("SDY269_PBMC_TIV_Geo", reload = TRUE)
## Downloading matrix..
## Constructing ExpressionSet

Finally, it is possible to clear every cached expression matrix (and dataset).

sdy269$clearCache()

Again, the cache field should never be modified manually. When in doubt, simply reload the expression matrix.

7 Session info

sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Rlabkey_2.4.0       jsonlite_1.6.1      httr_1.4.1         
## [4] ImmuneSpaceR_1.16.0 rmarkdown_2.1       knitr_1.28         
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.48.0        viridis_0.5.1         tidyr_1.0.2          
##  [4] viridisLite_0.3.0     foreach_1.5.0         gtools_3.8.2         
##  [7] RcppParallel_5.0.0    assertthat_0.2.1      stats4_4.0.0         
## [10] latticeExtra_0.6-29   flowWorkspace_4.0.0   yaml_2.2.1           
## [13] pillar_1.4.3          lattice_0.20-41       glue_1.4.0           
## [16] digest_0.6.25         RColorBrewer_1.1-2    colorspace_1.4-1     
## [19] preprocessCore_1.50.0 htmltools_0.4.0       XML_3.99-0.3         
## [22] pkgconfig_2.0.3       pheatmap_1.0.12       zlibbioc_1.34.0      
## [25] purrr_0.3.4           flowCore_2.0.0        scales_1.1.0         
## [28] webshot_0.5.2         gdata_2.18.0          jpeg_0.1-8.1         
## [31] tibble_3.0.1          farver_2.0.3          ggplot2_3.3.0        
## [34] ellipsis_0.3.0        BiocGenerics_0.34.0   lazyeval_0.2.2       
## [37] magrittr_1.5          crayon_1.3.4          heatmaply_1.1.0      
## [40] evaluate_0.14         MASS_7.3-51.6         gplots_3.0.3         
## [43] graph_1.66.0          registry_0.5-1        tools_4.0.0          
## [46] data.table_1.12.8     ncdfFlow_2.34.0       lifecycle_0.2.0      
## [49] matrixStats_0.56.0    stringr_1.4.0         plotly_4.9.2.1       
## [52] munsell_0.5.0         cluster_2.1.0         compiler_4.0.0       
## [55] caTools_1.18.0        rlang_0.4.5           grid_4.0.0           
## [58] iterators_1.0.12      htmlwidgets_1.5.1     labeling_0.3         
## [61] bitops_1.0-6          codetools_0.2-16      cytolib_2.0.0        
## [64] gtable_0.3.0          curl_4.3              TSP_1.1-10           
## [67] R6_2.4.1              RProtoBufLib_2.0.0    seriation_1.2-8      
## [70] gridExtra_2.3         dplyr_0.8.5           KernSmooth_2.23-17   
## [73] dendextend_1.13.4     Rgraphviz_2.32.0      stringi_1.4.6        
## [76] parallel_4.0.0        Rcpp_1.0.4.6          vctrs_0.2.4          
## [79] png_0.1-7             gclus_1.3.2           tidyselect_1.0.0     
## [82] xfun_0.13