
1 Introduction

The phantasusLite package contains a set of functions that facilitate working with public gene expression datasets. These functions were originally developed for the phantasus package. Unlike phantasus, phantasusLite aims to keep the number of dependencies small.

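The current functionality includes loading precomputed RNA-seq counts for GEO datasets, inferring sample groups from sample titles, and reading and writing GCT files.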

2 Installation

It is recommended to install the release version of the package from Bioconductor using the following commands:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("phantasusLite")

Alternatively, the most recent version of the package can be installed from the GitHub repository:


3 Loading precomputed RNA-seq counts


Let’s load the dataset GSE53053 from GEO using the GEOquery package:

library(GEOquery)
library(phantasusLite)
ess <- getGEO("GSE53053")
es <- ess[[1]]

RNA-seq datasets from GEO do not contain an expression matrix, so exprs(es) is empty.

However, a number of precomputed gene count tables are available at the HSDS server ‘’. It features HDF5 files with counts from the ARCHS4 and DEE2 projects:

url <- ''

The GSE53053 dataset was sequenced from Mus musculus, so we can get an expression matrix from the corresponding HDF5 file with DEE2 data:

file <- "dee2/mmusculus_star_matrix_20221107.h5"
es <- loadCountsFromH5FileHSDS(es, url, file)

Normally, loadCountsFromHSDS can be used to automatically select the HDF5 file with the largest number of quantified samples:

es <- ess[[1]]
es <- loadCountsFromHSDS(es, url)

The counts differ from the previous values because ARCHS4 counts were used; ARCHS4 is prioritized when several files contain the same number of samples:

## [1] "archs4/mouse_matrix_v11.h5"
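
The selection rule described above (most quantified samples first, ARCHS4 preferred on a tie) can be illustrated with a small base R sketch. This is not the code used by loadCountsFromHSDS: the helper select_counts_file and the sample counts in the table are hypothetical, and only the two file names are taken from the text.

```r
# Hypothetical summary of candidate HDF5 files: the file names come from the
# text above, the sample counts are made up for illustration.
candidates <- data.frame(
  file = c("dee2/mmusculus_star_matrix_20221107.h5",
           "archs4/mouse_matrix_v11.h5"),
  n_samples = c(8L, 8L),
  stringsAsFactors = FALSE
)

# Pick the file that covers the most samples; on a tie, prefer ARCHS4 files.
select_counts_file <- function(candidates) {
  is_archs4 <- startsWith(candidates$file, "archs4/")
  # Sort by descending sample count, then put ARCHS4 files (FALSE) first.
  best <- order(-candidates$n_samples, !is_archs4)[1]
  candidates$file[best]
}

select_counts_file(candidates)
## [1] "archs4/mouse_matrix_v11.h5"
```

If the DEE2 file covered more samples than the ARCHS4 one, the same function would return the DEE2 file instead.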

4 Inferring sample groups

For some GEO datasets, such as GSE53053, the sample annotation is not fully available. However, sample titles are frequently structured in a way that allows the groups to be inferred. For example, for GSE53053 we can see that there are three groups: Ctrl, MandIL4, and MandLPSandIFNg, with up to 3 replicates each:

## [1] "Ctrl_1"           "Ctrl_2"           "MandIL4_1"        "MandIL4_2"       
## [5] "MandIL4_3"        "MandLPSandIFNg_1" "MandLPSandIFNg_2" "MandLPSandIFNg_3"

For such well-structured titles, the inferCondition function can be used to automatically identify the sample conditions and replicates:

es <- inferCondition(es)
## [1] "Ctrl"           "Ctrl"           "MandIL4"        "MandIL4"       
## [5] "MandIL4"        "MandLPSandIFNg" "MandLPSandIFNg" "MandLPSandIFNg"
## [1] "1" "2" "1" "2" "3" "1" "2" "3"
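
To make the idea concrete, here is a minimal base R sketch of how such titles could be split into a condition and a replicate. It is not the implementation of inferCondition: the helper split_title is hypothetical and assumes that every title has the form <condition>_<replicate number>.

```r
titles <- c("Ctrl_1", "Ctrl_2", "MandIL4_1", "MandIL4_2", "MandIL4_3",
            "MandLPSandIFNg_1", "MandLPSandIFNg_2", "MandLPSandIFNg_3")

# Hypothetical helper: split each title at the last underscore, assuming the
# part after it is the replicate number.
split_title <- function(titles) {
  pattern <- "^(.*)_([0-9]+)$"
  # Fail early if any title does not follow the assumed structure.
  stopifnot(all(grepl(pattern, titles)))
  data.frame(
    condition = sub(pattern, "\\1", titles),
    replicate = sub(pattern, "\\2", titles),
    stringsAsFactors = FALSE
  )
}

groups <- split_title(titles)
unique(groups$condition)
## [1] "Ctrl"           "MandIL4"        "MandLPSandIFNg"
```

Titles that do not follow this pattern would need a different rule, which is why the sketch stops with an error instead of guessing.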

5 Working with GCT files

The GCT text format can be used to store annotated gene expression matrices and load them in software such as Morpheus or Phantasus.

For example, we can save the ExpressionSet object that we defined previously:

f <- file.path(tempdir(), "GSE53053.gct")
writeGct(es, f)

And then load the file back:

es2 <- readGct(f)
## Warning in readGct(f): duplicated row IDs: missing missing missing missing
## missing missing; they were renamed
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 32544 features, 8 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: Ctrl_1 Ctrl_2 ... MandLPSandIFNg_3 (8 total)
##   varLabels: title geo_accession ... replicate (43 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: missing ENSMUSG00000007777 ... ENSMUSG00000064368
##     (32544 total)
##   fvarLabels: ENSEMBLID Gene Symbol
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
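
To give a feel for what a GCT file looks like, the sketch below writes and reads a toy matrix in the simpler GCT 1.2 layout using base R only. It illustrates the general file structure (a version line, a dimensions line, then a tab-separated table) and is not a description of what writeGct and readGct do internally. The helpers write_gct12 and read_gct12, the gene and sample names, and the "na" description placeholder are all hypothetical.

```r
# Toy matrix; gene and sample names are made up for illustration.
m <- matrix(c(10, 0, 5, 3, 7, 1), nrow = 3,
            dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Hypothetical writer for the GCT 1.2 layout.
write_gct12 <- function(m, path) {
  con <- file(path, "w")
  on.exit(close(con))
  writeLines("#1.2", con)                               # format version
  writeLines(paste(nrow(m), ncol(m), sep = "\t"), con)  # rows and columns
  # Data table: Name, Description, then one column per sample.
  tab <- data.frame(Name = rownames(m), Description = "na", m,
                    check.names = FALSE, stringsAsFactors = FALSE)
  write.table(tab, con, sep = "\t", quote = FALSE, row.names = FALSE)
}

# Hypothetical reader: skip the two header lines and rebuild the matrix.
read_gct12 <- function(path) {
  stopifnot(readLines(path, n = 1) == "#1.2")
  tab <- read.delim(path, skip = 2, check.names = FALSE,
                    stringsAsFactors = FALSE)
  m <- as.matrix(tab[, -(1:2), drop = FALSE])
  rownames(m) <- tab$Name
  m
}

toy_file <- file.path(tempdir(), "toy.gct")
write_gct12(m, toy_file)
m2 <- read_gct12(toy_file)
```

Here m2 holds the same values and row and column names as m, which is the kind of round trip that writeGct and readGct provide for full ExpressionSet objects, including their annotations.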

6 Session info

