This vignette shows detailed examples for the getGEMatrix
function.
As explained in the user guide vignette, datasets must be downloaded from ImmuneSpaceConnection
objects. We must first instantiate a connection to the study or studies of interest. Throughout this vignette, we will use two connections: one to a single study, and one to all available data.
library(ImmuneSpaceR)
sdy269 <- CreateConnection("SDY269")
all <- CreateConnection("")
Now that the connections have been instantiated, we can start downloading from them. First, however, we need to determine which processed matrices are available within our chosen studies.
On www.immunespace.org, in the study of interest or at the project level, the Gene expression matrices table will show the available runs.
Printing a connection will, among other information, list the available datasets. The listDatasets
method displays only the downloadable data: the datasets and the expression matrices.
sdy269$listDatasets()
## datasets
## demographics
## elisa
## gene_expression_files
## hai
## cohort_membership
## pcr
## fcs_sample_files
## fcs_analyzed_result
## elispot
## Expression Matrices
## LAIV_2008
## TIV_2008
Using output = "expression"
, we can remove the datasets from the output.
all$listDatasets(output = "expression")
## Expression Matrices
## VLplus
## Fluzone_group1
## TIV_2007
## SDY400_youngerAdults
## VLminus
## SDY67_batch2
## LAIV_2008
## SDY522_LAIVvaccine
## Saline_group1
## Fluzone_group2
## SDY296_AIRFV_2011
## TIV_older
## SDY301_AIRFV_2012
## SDY404_older_PBMC_year2
## Cohort1_young
## Saline_group2
## SDY144_TIV2011
## pH1N1_2009
## SDY63_young_PBMC_year1
## Pneumovax23_group1
## Pneumovax23_group2
## TIV_2010
## SDY404_young_PBMC_year2
## Cohort2_older
## SDY63_older_PBMC_year1
## SDY400_oldAdults
## TIV_2008
## TIV_young
Naturally, all
contains every processed matrix available on ImmuneSpace, as it combines all available studies.
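With this many runs listed, it can help to select run names programmatically before requesting them. The following is a minimal base R sketch, assuming the run names have been copied by hand into a character vector (run_names and tiv_seasons are hypothetical names, and only a subset of the listing above is used).

```r
# Run names copied from the listing above (a subset, for illustration).
run_names <- c("TIV_2007", "LAIV_2008", "TIV_older", "SDY144_TIV2011",
               "pH1N1_2009", "TIV_2010", "TIV_2008", "TIV_young")

# Keep the runs whose name ends with "TIV", an optional underscore,
# and a four-digit year.
tiv_seasons <- grep("TIV_?[0-9]{4}$", run_names, value = TRUE)
print(tiv_seasons)
```

The resulting character vector holds run names only; it makes no request to ImmuneSpace by itself.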
The getGEMatrix
function will accept any of the run names listed in the connection.
TIV_2008 <- sdy269$getGEMatrix("TIV_2008")
## Downloading matrix..
## Constructing ExpressionSet
TIV_2011 <- all$getGEMatrix(matrixName = "SDY144_TIV2011")
## Downloading matrix..
## Constructing ExpressionSet
The matrices are returned as ExpressionSet
objects, where the phenoData slot contains basic demographic information and the featureData slot maps probes to official gene symbols.
TIV_2008
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 16729 features, 80 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: BS586131 BS586187 ... BS586267 (80 total)
## varLabels: study_time_collected study_time_collected_unit ...
## biosample_accession (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: DDR1 RFC2 ... NUS1P3 (16729 total)
## fvarLabels: FeatureId gene_symbol
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
The cohort
argument can be used in place of the run name (matrixName). As with run names, the list of valid cohorts can be found in the Gene expression matrices table.
LAIV_2008 <- sdy269$getGEMatrix(cohort = "LAIV group 2008")
## Downloading matrix..
## Constructing ExpressionSet
Note that when cohort is used, matrixName
is ignored.
At the probe level, the returned ExpressionSet
objects have probe names as features (or rows). However, multiple probes often match the same gene, and merging experiments from different arrays is impossible at the feature level. When summarized matrices are available, setting outputType
to "summary"
returns the matrices with gene symbols instead of probes. Setting annotation
to "latest"
uses the latest official gene symbols mapped to each probe; the annotation argument can also be changed to retrieve the original mappings from when the matrix was created.
TIV_2008_sum <- sdy269$getGEMatrix("TIV_2008", outputType = "summary", annotation = "latest")
## returning TIV_2008_sum_eset from cache
Probes that do not map to a unique gene are removed and expression is averaged by gene.
TIV_2008_sum
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 16729 features, 80 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: BS586131 BS586187 ... BS586267 (80 total)
## varLabels: study_time_collected study_time_collected_unit ...
## biosample_accession (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: DDR1 RFC2 ... NUS1P3 (16729 total)
## fvarLabels: FeatureId gene_symbol
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
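The summarization step described above (dropping probes that do not map to a unique gene, then averaging by gene) can be illustrated with a small base R sketch. This is not the ImmuneSpace processing pipeline: the matrix, the annotation table, and the "///" separator used to mark multi-gene probes are all assumptions made for the example.

```r
# Toy probe-level matrix: 5 probes (rows) x 3 samples (columns).
# All names and values are made up for illustration.
probe_expr <- matrix(
  c(2, 4, 6,
    4, 6, 8,
    1, 1, 1,
    5, 5, 5,
    9, 9, 9),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("probe", 1:5), c("S1", "S2", "S3"))
)

# Hypothetical probe-to-gene annotation: probe4 maps to two genes and
# probe5 has no symbol, so neither maps to a unique gene.
annotation <- data.frame(
  probe = paste0("probe", 1:5),
  gene_symbol = c("DDR1", "DDR1", "RFC2", "GENE_A///GENE_B", NA),
  stringsAsFactors = FALSE
)

# 1. Drop probes without a single, unambiguous gene symbol.
unique_map <- !is.na(annotation$gene_symbol) &
  !grepl("///", annotation$gene_symbol, fixed = TRUE)
kept <- annotation[unique_map, ]

# 2. Average the remaining probes within each gene:
#    sum the rows per gene, then divide by the number of probes per gene.
sums <- rowsum(probe_expr[kept$probe, , drop = FALSE], group = kept$gene_symbol)
counts <- as.vector(table(kept$gene_symbol)[rownames(sums)])
gene_expr <- sums / counts
print(gene_expr)
```

In this toy example, the two DDR1 probes collapse into a single row holding their per-sample mean, so the result has one row per gene rather than one row per probe.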
In order to facilitate analysis across experiments and studies, when multiple runs or cohorts are specified, getGEMatrix
will attempt to combine the selected expression matrices into a single ExpressionSet.
To avoid returning an empty object, it is usually recommended to use the summarized version of the matrices, thus combining by genes. This is almost always necessary when combining data from multiple studies.
# Within a study
em269 <- sdy269$getGEMatrix(c("TIV_2008", "LAIV_2008"))
## returning summary matrix from cache
## returning summary matrix from cache
## returning latest annotation from cache
## returning latest annotation from cache
## Constructing ExpressionSet
## Constructing ExpressionSet
## Combining ExpressionSets
# Combining across studies
TIV_seasons <- all$getGEMatrix(c("TIV_2008", "SDY144_TIV2011"),
                               outputType = "summary",
                               annotation = "latest")
## Downloading matrix..
## returning summary matrix from cache
## returning latest annotation from cache
## Constructing ExpressionSet
## Constructing ExpressionSet
## Combining ExpressionSets
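Why gene-level matrices combine more readily than probe-level ones can be seen in a small base R sketch. It assumes that combining means keeping the features shared by both matrices, which may differ from what getGEMatrix does internally. The matrices are toy data and the probe identifiers are only illustrative of two different array platforms.

```r
# Two toy gene-level matrices from different (hypothetical) runs.
run_a <- matrix(1:6, nrow = 3,
                dimnames = list(c("DDR1", "RFC2", "PAX8"), c("A1", "A2")))
run_b <- matrix(7:12, nrow = 3,
                dimnames = list(c("RFC2", "DDR1", "HSPA6"), c("B1", "B2")))

# Keep only the features present in both runs, in a common order,
# then bind the samples side by side.
shared <- intersect(rownames(run_a), rownames(run_b))
combined <- cbind(run_a[shared, , drop = FALSE],
                  run_b[shared, , drop = FALSE])
print(combined)

# Probe identifiers from two different platforms typically do not overlap,
# so the same operation at probe level can leave nothing to combine.
probes_x <- c("1007_s_at", "1053_at")
probes_y <- c("ILMN_1343291", "ILMN_1651209")
n_shared_probes <- length(intersect(probes_x, probes_y))
print(n_shared_probes)
```

Indexing both matrices by the same shared vector also aligns the row order, which matters here because the two toy runs list their genes in different orders.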
As explained in the user guide, the ImmuneSpaceConnection
class is a Reference class. This means that its objects have fields accessed by reference. As a consequence, they can be modified without making a copy of the entire object. ImmuneSpaceR uses this feature to store downloaded datasets and expression matrices. As a result, subsequent calls to getGEMatrix
with the same input will be faster, because the data are returned from the cache.
See ?setRefClass
for more information about reference classes.
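The caching behaviour described above can be sketched with a toy Reference class. This is not the ImmuneSpaceConnection implementation: the class CachedSource, its downloads counter, and the method get_matrix are hypothetical, and the "download" is simulated with a small matrix of zeros.

```r
library(methods)

# Hypothetical class for illustration only.
CachedSource <- setRefClass(
  "CachedSource",
  fields = list(data_cache = "list", downloads = "numeric"),
  methods = list(
    initialize = function(...) {
      data_cache <<- list()
      downloads <<- 0
      callSuper(...)
    },
    get_matrix = function(name, reload = FALSE) {
      # reload = TRUE discards the cached copy for this one name.
      if (reload) data_cache[[name]] <<- NULL
      if (is.null(data_cache[[name]])) {
        downloads <<- downloads + 1                            # count simulated downloads
        data_cache[[name]] <<- matrix(0, nrow = 2, ncol = 2)   # stand-in for real data
      }
      data_cache[[name]]
    },
    clear_cache = function() {
      data_cache <<- list()
      invisible(.self)
    }
  )
)

src <- CachedSource$new()
alias <- src                         # same object, not a copy

m1 <- src$get_matrix("TIV_2008")     # simulated download
m2 <- alias$get_matrix("TIV_2008")   # served from the shared cache
downloads_after_two <- src$downloads

m3 <- src$get_matrix("TIV_2008", reload = TRUE)  # forces a new download
cached_names <- names(alias$data_cache)
src$clear_cache()
```

Because src and alias refer to the same object, a matrix cached through one is visible through the other, and clearing the cache affects both.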
We can see a list of already downloaded runs and feature sets in the data_cache
field. This field is not intended to be used for data manipulation and is only displayed here to explain what gets cached.
names(sdy269$data_cache)
## [1] "GE_matrices" "TIV_2008_sum" "featureset_18"
## [4] "TIV_2008_sum_eset" "LAIV_2008_sum" "LAIV_2008_sum_eset"
If, for any reason, a specific matrix needs to be redownloaded, the reload
argument will clear the cache for that specific getGEMatrix
call and download the file and metadata again.
TIV_2008 <- sdy269$getGEMatrix("TIV_2008", reload = TRUE)
## returning summary matrix from cache
## returning latest annotation from cache
## Constructing ExpressionSet
Finally, it is possible to clear every cached expression matrix (and dataset).
sdy269$clear_cache()
Again, the data_cache
field should never be modified manually. When in doubt, simply reload the expression matrix.
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Rlabkey_2.2.0 rjson_0.2.15 httr_1.3.1
## [4] ImmuneSpaceR_1.6.2 rmarkdown_1.8 knitr_1.19
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.38.0 viridis_0.5.0 tidyr_0.8.0
## [4] jsonlite_1.5 viridisLite_0.3.0 foreach_1.4.4
## [7] gtools_3.5.0 assertthat_0.2.0 stats4_3.4.3
## [10] yaml_2.1.16 robustbase_0.92-8 pillar_1.1.0
## [13] backports_1.1.2 lattice_0.20-35 glue_1.2.0
## [16] digest_0.6.15 RColorBrewer_1.1-2 colorspace_1.3-2
## [19] preprocessCore_1.40.0 htmltools_0.3.6 plyr_1.8.4
## [22] pkgconfig_2.0.1 pheatmap_1.0.8 purrr_0.2.4
## [25] mvtnorm_1.0-7 scales_0.5.0 webshot_0.5.0
## [28] gdata_2.18.0 whisker_0.3-2 tibble_1.4.2
## [31] ggplot2_2.2.1 nnet_7.3-12 BiocGenerics_0.24.0
## [34] lazyeval_0.2.1 magrittr_1.5 mclust_5.4
## [37] heatmaply_0.14.1 evaluate_0.10.1 MASS_7.3-48
## [40] gplots_3.0.1 class_7.3-14 registry_0.5
## [43] tools_3.4.3 data.table_1.10.4-3 trimcluster_0.1-2
## [46] stringr_1.2.0 plotly_4.7.1 kernlab_0.9-25
## [49] munsell_0.4.3 cluster_2.0.6 fpc_2.1-11
## [52] bindrcpp_0.2 compiler_3.4.3 caTools_1.17.1
## [55] rlang_0.1.6 grid_3.4.3 iterators_1.0.9
## [58] htmlwidgets_1.0 labeling_0.3 bitops_1.0-6
## [61] codetools_0.2-15 gtable_0.2.0 curl_3.1
## [64] flexmix_2.3-14 reshape2_1.4.3 TSP_1.1-5
## [67] R6_2.2.2 seriation_1.2-3 gridExtra_2.3
## [70] prabclus_2.2-6 dplyr_0.7.4 bindr_0.1
## [73] rprojroot_1.3-2 KernSmooth_2.23-15 dendextend_1.7.0
## [76] modeltools_0.2-21 stringi_1.1.6 parallel_3.4.3
## [79] Rcpp_0.12.15 gclus_1.3.1 DEoptimR_1.0-8
## [82] diptest_0.75-7