Contents

1 Introduction

The phantasusLite package contains a set functions that facilitate working with public gene expression datasets originally developed for phantasus package. Unlike phantasus, phantasusLite aims to limit the amount of dependencies.

The current functionality includes:

2 Installation

It is recommended to install the release version of the package from Bioconductor using the following commands:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("phantasusLite")

Alternatively, the most recent version of the package can be installed from the GitHub repository:

library(devtools)
install_github("ctlab/phantasusLite")

3 Loading precomputed RNA-seq counts

library(GEOquery)
library(phantasusLite)

Let’s load dataset GSE53053 from GEO using GEOquery package:

ess <- getGEO("GSE53053")
es <- ess[[1]]

RNA-seq dataset from GEO do not contain the expression matrix, thus exprs(es) is empty:

head(exprs(es))
##      GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304 GSM1281305
##      GSM1281306 GSM1281307

However, a number of precomputed gene count tables are available at HSDS server ‘https://ctlab.itmo.ru/hsds/’. It features HDF5 files with counts from ARCHS4 and DEE2 projects:

url <- 'https://ctlab.itmo.ru/hsds/?domain=/counts'
getHSDSFileList(url)
##  [1] "/counts/archs4/Arabidopsis_thaliana_count_matrix.h5"    
##  [2] "/counts/archs4/Bos_taurus_count_matrix.h5"              
##  [3] "/counts/archs4/Caenorhabditis_elegans_count_matrix.h5"  
##  [4] "/counts/archs4/Danio_rerio_count_matrix.h5"             
##  [5] "/counts/archs4/Drosophila_melanogaster_count_matrix.h5" 
##  [6] "/counts/archs4/Gallus_gallus_count_matrix.h5"           
##  [7] "/counts/archs4/Rattus_norvegicus_count_matrix.h5"       
##  [8] "/counts/archs4/Saccharomyces_cerevisiae_count_matrix.h5"
##  [9] "/counts/archs4/human_matrix_v11.h5"                     
## [10] "/counts/archs4/meta.h5"                                 
## [11] "/counts/archs4/mouse_matrix_v11.h5"                     
## [12] "/counts/archs4.old/meta.h5"                             
## [13] "/counts/archs4_new/human_gene_v2.2.h5"                  
## [14] "/counts/archs4_new/meta.h5"                             
## [15] "/counts/archs4_new/mouse_gene_v2.2.h5"                  
## [16] "/counts/dee2/athaliana_star_matrix_20221107.h5"         
## [17] "/counts/dee2/celegans_star_matrix_20221107.h5"          
## [18] "/counts/dee2/dmelanogaster_star_matrix_20221107.h5"     
## [19] "/counts/dee2/drerio_star_matrix_20211102.h5"            
## [20] "/counts/dee2/ecoli_star_matrix_20221107.h5"             
## [21] "/counts/dee2/hsapiens_star_matrix_20221107.h5"          
## [22] "/counts/dee2/meta.h5"                                   
## [23] "/counts/dee2/mmusculus_star_matrix_20221107.h5"         
## [24] "/counts/dee2/rnorvegicus_star_matrix_20221107.h5"       
## [25] "/counts/dee2/scerevisiae_star_matrix_20221107.h5"

GSE53053 dataset was sequenced from Mus musculus and we can get an expression matrix from the corresponding HDF5-file with DEE2 data:

file <- "dee2/mmusculus_star_matrix_20221107.h5"
es <- loadCountsFromH5FileHSDS(es, url, file)
head(exprs(es))
##                    GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304
## ENSMUSG00000102693          0          0          0          0          0
## ENSMUSG00000064842          0          0          0          0          0
## ENSMUSG00000051951          0          0          0          0          0
## ENSMUSG00000102851          0          0          0          0          0
## ENSMUSG00000103377          0          0          0          0          0
## ENSMUSG00000104017          0          0          0          0          0
##                    GSM1281305 GSM1281306 GSM1281307
## ENSMUSG00000102693          0          0          0
## ENSMUSG00000064842          0          0          0
## ENSMUSG00000051951          0          0          0
## ENSMUSG00000102851          0          0          0
## ENSMUSG00000103377          0          0          0
## ENSMUSG00000104017          0          0          0

Normally loadCountsFromHSDS can be used to automatically select the HDF5-file with the largest number of quantified samples:

es <- ess[[1]]
es <- loadCountsFromHSDS(es, url)
head(exprs(es))
##               GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304 GSM1281305
## 0610007P14Rik         86         67         30         46         23         61
## 0610009B22Rik         29         22          3          0         33         13
## 0610009L18Rik          0          0          7          0          0         15
## 0610009O20Rik        103         38         17         20         31         54
## 0610010F05Rik        259         91        115         88        113        185
## 0610010K14Rik         17          6          0          0          1          0
##               GSM1281306 GSM1281307
## 0610007P14Rik        105         22
## 0610009B22Rik         15         26
## 0610009L18Rik          0          9
## 0610009O20Rik         24         29
## 0610010F05Rik        108        163
## 0610010K14Rik          0          7

The counts are different from the previous values as ARCHS4 counts were used – ARCHS4 is prioritized when there are several files with the same number of samples:

preproc(experimentData(es))$gene_counts_source
## [1] "archs4/mouse_matrix_v11.h5"

4 Inferring sample groups

For some of the GEO datasets, such as GSE53053, the sample annotation is not fully available. However, frequently sample titles are structured in a way that allows to infer the groups. For example, for GSE53053 we can see there are three groups: Ctrl, MandIL4, MandLPSandIFNg, with up to 3 replicates:

es$title
## [1] "Ctrl_1"           "Ctrl_2"           "MandIL4_1"        "MandIL4_2"       
## [5] "MandIL4_3"        "MandLPSandIFNg_1" "MandLPSandIFNg_2" "MandLPSandIFNg_3"

For such well-structured titles, inferCondition function can be used to automatically identify the sample conditions and replicates:

es <- inferCondition(es)
print(es$condition)
## [1] "Ctrl"           "Ctrl"           "MandIL4"        "MandIL4"       
## [5] "MandIL4"        "MandLPSandIFNg" "MandLPSandIFNg" "MandLPSandIFNg"
print(es$replicate)
## [1] "1" "2" "1" "2" "3" "1" "2" "3"

5 Working with GCT files

GCT text format can be used to store annotated gene expression matrices and load them in software such as Morpheus or Phantasus.

For example, we can save the ExpressionSet object that we defined previously:

f <- file.path(tempdir(), "GSE53053.gct")
writeGct(es, f)

And the load the file back:

es2 <- readGct(f)
## Warning in readGct(f): duplicated row IDs: missing missing missing missing
## missing missing; they were renamed
print(es2)
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 32544 features, 8 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: Ctrl_1 Ctrl_2 ... MandLPSandIFNg_3 (8 total)
##   varLabels: title geo_accession ... replicate (43 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: missing ENSMUSG00000007777 ... ENSMUSG00000064368
##     (32544 total)
##   fvarLabels: ENSEMBLID Gene Symbol
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

6 Session info

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phantasusLite_1.0.0 GEOquery_2.70.0     Biobase_2.62.0     
## [4] BiocGenerics_0.48.0 BiocStyle_2.30.0   
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.7            utf8_1.2.4            generics_0.1.3       
##  [4] tidyr_1.3.0           SparseArray_1.2.0     xml2_1.3.5           
##  [7] stringi_1.7.12        lattice_0.22-5        hms_1.1.3            
## [10] digest_0.6.33         magrittr_2.0.3        evaluate_0.22        
## [13] grid_4.3.1            bookdown_0.36         fastmap_1.1.1        
## [16] R.oo_1.25.0           jsonlite_1.8.7        Matrix_1.6-1.1       
## [19] R.utils_2.12.2        limma_3.58.0          BiocManager_1.30.22  
## [22] httr_1.4.7            purrr_1.0.2           fansi_1.0.5          
## [25] jquerylib_0.1.4       abind_1.4-5           cli_3.6.1            
## [28] crayon_1.5.2          rlang_1.1.1           XVector_0.42.0       
## [31] R.methodsS3_1.8.2     withr_2.5.1           cachem_1.0.8         
## [34] DelayedArray_0.28.0   yaml_2.3.7            S4Arrays_1.2.0       
## [37] tools_4.3.1           tzdb_0.4.0            dplyr_1.1.3          
## [40] curl_5.1.0            vctrs_0.6.4           R6_2.5.1             
## [43] matrixStats_1.0.0     stats4_4.3.1          lifecycle_1.0.3      
## [46] stringr_1.5.0         zlibbioc_1.48.0       S4Vectors_0.40.0     
## [49] IRanges_2.36.0        pkgconfig_2.0.3       pillar_1.9.0         
## [52] bslib_0.5.1           data.table_1.14.8     glue_1.6.2           
## [55] statmod_1.5.0         xfun_0.40             tibble_3.2.1         
## [58] tidyselect_1.2.0      MatrixGenerics_1.14.0 knitr_1.44           
## [61] rjson_0.2.21          htmltools_0.5.6.1     rmarkdown_2.25       
## [64] rhdf5client_1.24.0    readr_2.1.4           compiler_4.3.1

```