1 Introduction

This vignette demonstrates how to use the functions in R package ‘singscore’ to score a gene expression dataset against a gene set at a single-sample level and provides visualisation functions to improve interpretation of the results.

Please cite the following papers when using this package:

citation('singscore')
## 
## To cite package 'singscore' in publications use:
## 
##   Foroutan M, Bhuva D, Lyu R, Horan K, Cursons J, Davis M (2018).
##   "Single sample scoring of molecular phenotypes." _BMC
##   bioinformatics_, *19*(1), 404. doi:10.1186/s12859-018-2435-4
##   <https://doi.org/10.1186/s12859-018-2435-4>.
## 
##   Bhuva D, Cursons J, Davis M (2020). "Stable gene expression for
##   normalisation and single-sample scoring." _Nucleic Acids Research_,
##   *48*(19), e113. doi:10.1093/nar/gkaa802
##   <https://doi.org/10.1093/nar/gkaa802>.
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.

singscore implements a simple single-sample gene-set (gene-signature) scoring method which scores individual samples independently without relying on other samples in gene expression datasets. It provides stable scores which are less likely to be affected by varying sample and gene sizes in datasets and unwanted variations across samples. The scoring method uses a rank-based statistics and is quick to compute. For details of the methods please refer to the paper (Foroutan et al. 2018). It also provides various visualisation functions to further explore results of the analysis.

Additional packages we have developed can enhance the singscore workflow:

  1. msigdb - A package that provides gene-sets from the molecular signatures database (MSigDB) as a GeneSetCollection object that is compatible with singscore.
  2. vissE - A package that can summarise and aid in the interpretation of a list of significant gene-sets identified by singscore (see tutorial).
  3. emtdata - The full EMT dataset used in this tutorial (with additional EMT related datasets).

We have also published and made openly available the extensive tutorials below that demonstrate the variety of ways in which singscore can be used to gain a better functional understanding of molecular data:

  1. Using singscore to predict mutation status in acute myeloid leukemia from transcriptomic signatures.
  2. Gene-set enrichment analysis workshop - available through the Orchestra platform (search “WEHI Masterclass Day 4: Functional Analysis, single sample gene set analysis”).

2 Install “singscore” R package

Install ‘singscore’ from Bioconductor

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("singscore")

The most updated version of ‘singscore’ is hosted on GitHub and can be easily installed using devtools::install_github() function provided by devtools, (https://cran.r-project.org/package=devtools)

# You would need to install 'devtools' package first.
 install.packages("devtools")

# And install the 'singscore' package from the GitHub repository
# 'singscore' requires these packages to be installed: methods, stats, graphics, ggplot2, ggsci, grDevices,
#  ggrepel, plotly, tidyr, plyr, magrittr, reshape, edgeR, RColorBrewer, Biobase, GSEABase, BiocParallel
 devtools::install_github('DavisLaboratory/singscore')
# Set build_vignette = TRUE if would like to browseVignette()

#Scoring samples against a gene-set ## Load datasets

To illustrate the usage of ‘simpleScore()’, we first need to load the example datasets. The datasets used in this vignette have been built within the package. You can use the following scripts to load them into your R environment. Detailed steps of obtaining the full datasets are included at the end of the vignette. The ‘tgfb_expr_10_se’ dataset was obtained from [@Foroutanmolcanres.0313.2016] and it is a ten-sample subset of the original dataset. We are going to score the integrated TGFb-treated gene expression dataset (4 cases and 6 controls) against a TGFb gene signature with an up-regulated and down-regulated gene-set pair [@Foroutanmolcanres.0313.2016].

Gene-sets from the molecular signatures database (MSigDB) can be accessed via our msigdb R/Bioconductor package (see vignette).

library(singscore)
library(GSEABase)
# The example expression dataset and gene signatures are included in the package
# distribution, one can directly access them using the variable names

# To see the description of 'tgfb_expr_10_se','tgfb_gs_up','tgfb_gs_dn', look at 
# their help pages using:

# ?tgfb_expr_10_se
# ?tgfb_gs_up
# ?tgfb_gs_dn

# Have a look at the object tgfb_expr_10_se containing gene expression data
# for 10 samples 
tgfb_expr_10_se
## class: SummarizedExperiment 
## dim: 11900 10 
## metadata(0):
## assays(1): counts
## rownames(11900): 2 9 ... 729164 752014
## rowData names(0):
## colnames(10): D_Ctrl_R1 D_TGFb_R1 ... Hil_Ctrl_R1 Hil_Ctrl_R2
## colData names(1): Treatment
# Get the sample names by
colnames(tgfb_expr_10_se)
##  [1] "D_Ctrl_R1"   "D_TGFb_R1"   "D_Ctrl_R2"   "D_TGFb_R2"   "Hes_Ctrl_R1"
##  [6] "Hes_TGFb_R1" "Hes_Ctrl_R2" "Hes_TGFb_R2" "Hil_Ctrl_R1" "Hil_Ctrl_R2"
# View what tgfb_gs_up/dn contains
tgfb_gs_up
## setName: NA 
## geneIds: 19, 87, ..., 402055 (total: 193)
## geneIdType: Null
## collectionType: Null 
## details: use 'details(object)'
tgfb_gs_dn
## setName: NA 
## geneIds: 136, 220, ..., 161291 (total: 108)
## geneIdType: Null
## collectionType: Null 
## details: use 'details(object)'
# Get the size of the gene sets
length(GSEABase::geneIds(tgfb_gs_up))
## [1] 193
length(GSEABase::geneIds(tgfb_gs_dn))
## [1] 108

2.1 Sample scoring

To score samples, the gene expression dataset first needs to be ranked using the rankGenes() function which returns a rank matrix. This matrix along with the signatures are then passed to the simpleScore() function which returns a data.frame containing the scores for each sample. When only a single gene-set is available (i.e. not an up- and down- regulated pair), the same function can be called by setting the upSet argument to the gene-set.

# The recommended method for dealing with ties in ranking is 'min', you can
# change by specifying 'tiesMethod' parameter for rankGenes function.
rankData <- rankGenes(tgfb_expr_10_se)

# Given the ranked data and gene signature, simpleScore returns the scores and 
# dispersions for each sample
scoredf <- simpleScore(rankData, upSet = tgfb_gs_up, downSet = tgfb_gs_dn)
scoredf
##               TotalScore TotalDispersion    UpScore UpDispersion   DownScore
## D_Ctrl_R1   -0.088097993        2867.348 0.06096415     3119.390 -0.14906214
## D_TGFb_R1    0.286994210        2217.970 0.24931565     2352.886  0.03767856
## D_Ctrl_R2   -0.098964086        2861.418 0.06841242     3129.769 -0.16737650
## D_TGFb_R2    0.270721958        2378.832 0.25035661     2470.012  0.02036534
## Hes_Ctrl_R1 -0.002084788        2746.146 0.08046490     3134.216 -0.08254969
## Hes_TGFb_R1  0.176122839        2597.515 0.22894035     2416.638 -0.05281751
## Hes_Ctrl_R2  0.016883867        2700.556 0.08817828     3138.664 -0.07129441
## Hes_TGFb_R2  0.188466953        2455.186 0.23895473     2324.717 -0.05048778
## Hil_Ctrl_R1 -0.061991164        3039.330 0.08314254     3553.792 -0.14513371
## Hil_Ctrl_R2 -0.064937366        2959.270 0.07433863     3396.637 -0.13927600
##             DownDispersion
## D_Ctrl_R1         2615.306
## D_TGFb_R1         2083.053
## D_Ctrl_R2         2593.067
## D_TGFb_R2         2287.652
## Hes_Ctrl_R1       2358.075
## Hes_TGFb_R1       2778.392
## Hes_Ctrl_R2       2262.448
## Hes_TGFb_R2       2585.654
## Hil_Ctrl_R1       2524.868
## Hil_Ctrl_R2       2521.903
# To view more details of the simpleScore, use ?simpleScore
# Note that, when only one gene set is available in a gene signature, one can 
# only input values for the upSet argument. In addition, a knownDirection 
# argument can be set to FALSE if the direction of the gene set is unknown.

# simpleScore(rankData, upSet = tgfb_gs_up, knownDirection = FALSE)

The returned data.frame consists of the scores for the up- and down- regulated gene-sets along with the combined score (TotalScore). Dispersion is calculated using the mad function by default and can be substituted by passing another function to the dispersionFun argument in simpleScore() such as IQR to calculate the inter-quartile range.

2.2 Sample scoring with a reduced number of measurements

Singscore requires transcriptome-wide measurements whereby all or most transcripts are measured. This is required to assess the expression of a gene in relation to the transcriptome. Panel-based transcriptomic assays measure a much smaller selection of transcripts therefore assessing the relative expression of each gene becomes a challenge. In such a setting, stably expressed genes can be used to assess relative expression. Expression of such genes is invariable across samples therefore they can be used to calibrate sample-wise gene expression measurements. This property allows for estimation of transcriptome-wide ranks with a small set of measurements inclusing a small set of stable genes. Detailed explanation of the rank estimation procedure using this logic is available in Bhuva et al. (2020) (manuscript in preparation).

Stably expressed genes in carcinoma transcriptomes and blood can be obtained using the getStableGenes function.

#get the top 5 stable genes in carcinomas
getStableGenes(5, type = 'carcinoma')
## [1] "RBM45"  "BRAP"   "CIAO1"  "TARDBP" "HNRNPK"
#get the top 5 stable genes in blood
getStableGenes(5, type = 'blood')
## [1] "RBM45"  "BRAP"   "GOSR1"  "IWS1"   "HNRNPK"
#get ensembl IDs instead of gene symbold
getStableGenes(5, type = 'carcinoma', id = 'ensembl')
## [1] "ENSG00000155636" "ENSG00000089234" "ENSG00000144021" "ENSG00000120948"
## [5] "ENSG00000165119"

This list can be used to score samples using a variant of singscore called stingscore.

#here we specify a custom set of genes (Entrez IDs)
stable_genes <-  c('8315', '9391', '23435', '3190')

#create a dataset with a reduced set of genes (signature genes and stable genes)
measured <-  unique(c(stable_genes, geneIds(tgfb_gs_up), geneIds(tgfb_gs_dn)))
small_tgfb_expr_10 <-  tgfb_expr_10_se[measured, ]
dim(small_tgfb_expr_10)
## [1] 305  10
#rank genes using stable genes
rankData_st <-  rankGenes(small_tgfb_expr_10, stableGenes = stable_genes)
head(rankData_st)
##       D_Ctrl_R1 D_TGFb_R1 D_Ctrl_R2 D_TGFb_R2 Hes_Ctrl_R1 Hes_TGFb_R1
## 8315        0.2       0.2       0.2       0.2         0.2         0.2
## 9391        0.4       0.4       0.4       0.4         0.4         0.4
## 23435       0.6       0.6       0.6       0.6         0.6         0.6
## 3190        0.8       0.8       0.8       0.8         0.8         0.8
## 19          0.4       0.4       0.4       0.6         0.4         0.6
## 87          0.8       1.0       0.8       0.8         0.8         0.8
##       Hes_Ctrl_R2 Hes_TGFb_R2 Hil_Ctrl_R1 Hil_Ctrl_R2
## 8315          0.2         0.2         0.2         0.2
## 9391          0.4         0.4         0.4         0.4
## 23435         0.6         0.6         0.6         0.6
## 3190          0.8         0.8         0.8         0.8
## 19            0.4         0.8         0.4         0.4
## 87            0.8         0.8         0.8         0.8
#score samples using stingscore
#simpleScore invoked with the new rank matrix will execute the stingscore
#   algorithm
scoredf_st <- simpleScore(rankData_st, upSet = tgfb_gs_up, downSet = tgfb_gs_dn)
scoredf_st
##              TotalScore TotalDispersion     UpScore UpDispersion   DownScore
## D_Ctrl_R1   -0.04940510         0.14826 -0.10310881      0.29652 0.053703704
## D_TGFb_R1    0.27645366         0.14826  0.08756477      0.29652 0.188888889
## D_Ctrl_R2   -0.07259643         0.14826 -0.09481865      0.29652 0.022222222
## D_TGFb_R2    0.26430628         0.14826  0.08652850      0.29652 0.177777778
## Hes_Ctrl_R1  0.01732873         0.14826 -0.09378238      0.29652 0.111111111
## Hes_TGFb_R1  0.18121282         0.29652  0.06269430      0.29652 0.118518519
## Hes_Ctrl_R2  0.03176933         0.14826 -0.08860104      0.29652 0.120370370
## Hes_TGFb_R2  0.20667818         0.29652  0.08445596      0.29652 0.122222222
## Hil_Ctrl_R1 -0.05136250         0.14826 -0.06062176      0.29652 0.009259259
## Hil_Ctrl_R2 -0.03202840         0.14826 -0.06165803      0.29652 0.029629630
##             DownDispersion
## D_Ctrl_R1          0.00000
## D_TGFb_R1          0.00000
## D_Ctrl_R2          0.00000
## D_TGFb_R2          0.00000
## Hes_Ctrl_R1        0.00000
## Hes_TGFb_R1        0.29652
## Hes_Ctrl_R2        0.00000
## Hes_TGFb_R2        0.29652
## Hil_Ctrl_R1        0.00000
## Hil_Ctrl_R2        0.00000
#plot the two scores against each other
plot(scoredf$TotalScore, scoredf_st$TotalScore)
abline(coef = c(0, 1), lty = 2)

Scores computed using the classic singscore and the modified stingscore will be similar (correlated if not equal) if the selection of stable genes is good. Ideal stable genes will have stable gene expression across the samples being investigated and should cover a wide range of the expression spectrum. The genes produced by getStableGenes fulfill these conditions. Our approach using the product of ranks can be used to determine a set of stable genes for a new context (manuscript in preparation).

3 Visualisation and diagnostic functions

In this section, we show example usages of the visualisation functions included in this package.

3.1 Plot rank densities

Scores of each sample are stored in scoredf. We can use the plotRankDensity function to plot the ranks of genes in the gene-sets for a specific sample. We plot the rank distribution for the second sample in rankData which combines a density plot (densities calculated using KDE) with a barcode plot. Please note that since we are subsetting the data.frame rankData by one column, we set drop = FALSE to maintain the structure of the data.frame/matrix.

#  You can provide the upSet alone when working with unpaired gene-sets 
# We plot the second sample in rankData, view it by 
head(rankData[,2,drop = FALSE])
##    D_TGFb_R1
## 2       1255
## 9       7611
## 10      1599
## 12      3682
## 13      3599
## 14     10013
plotRankDensity(rankData[,2,drop = FALSE], upSet = tgfb_gs_up, 
                downSet = tgfb_gs_dn, isInteractive = FALSE)
## Warning: Ignoring unknown aesthetics: text

Setting isInteractive = TRUE generates an interactive plot using the plotly package. Hovering over the bars in the interactive plot allows you to get information such as the normalised rank (between 0 and 1) and ID of the gene represented by the bar. For the rest of the plotting functions, the isInteractive = TRUE argument has the same behavior.

3.2 Plot dispersions of scores

Function plotDispersion generates the scatter plots of the ‘score VS. dispersions’ for the total scores, the up scores and the down score of samples. It requires the scored data.frame from simpleScore function and annotations (via annot parameter) can be used for coloring the points.

#  Get the annotations of samples by their sample names
tgfbAnnot <- data.frame(SampleID = colnames(tgfb_expr_10_se),
                       Type = NA)
tgfbAnnot$Type[grepl("Ctrl", tgfbAnnot$SampleID)] = "Control"
tgfbAnnot$Type[grepl("TGFb", tgfbAnnot$SampleID)] = "TGFb"

# Sample annotations
tgfbAnnot$Type
##  [1] "Control" "TGFb"    "Control" "TGFb"    "Control" "TGFb"    "Control"
##  [8] "TGFb"    "Control" "Control"
plotDispersion(scoredf,annot = tgfbAnnot$Type,isInteractive = FALSE)