$ singularity pull --name scShapes.sif shub://Malindrie/scShapes
$ singularity shell scShapes.sif
Singularity scShapes.sif:~ >
$ docker pull maldharm/scshapes
$ docker run -it maldharm/scshapes
root@516e456b0762:/#
To use the containerized scShapes package, launch R or execute an R script through Rscript.
Install scShapes from GitHub with install_github:
devtools::install_github('Malindrie/scShapes')
library(scShapes)
library(BiocParallel)
set.seed(0xBEEF)
This is a basic example showing how to use scShapes to identify
differential distributions in single-cell RNA-seq data. For this
example we use the toy dataset scData included in the package.
# Loading and preparing data for input
data(scData)
We first filter the genes to keep only genes expressed in at least 10% of cells:
scData_filt <- filter_counts(scData$counts, perc.zero = 0.1)
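To make the filtering criterion concrete, here is a minimal base-R sketch of keeping genes expressed in at least 10% of cells. It is not the implementation of filter_counts, and how that function interprets perc.zero should be checked in its help page. The matrix and gene names below are made up for illustration.

```r
# Toy count matrix: 4 hypothetical genes (rows) x 20 cells (columns)
counts <- rbind(
  geneA = rep(0, 20),                   # expressed in 0% of cells
  geneB = c(3, rep(0, 19)),             # expressed in 5% of cells
  geneC = c(1, 2, 5, 1, rep(0, 16)),    # expressed in 20% of cells
  geneD = rep(c(4, 1, 2, 6), times = 5) # expressed in 100% of cells
)

# Fraction of cells with a non-zero count, per gene
frac_expressed <- rowMeans(counts > 0)

# Keep genes expressed in at least 10% of cells
min_frac <- 0.1
counts_filt <- counts[frac_expressed >= min_frac, , drop = FALSE]

rownames(counts_filt)
```

Only geneC and geneD pass the threshold in this toy matrix.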
To normalize for differences in sequencing depth, the log of the total
UMI counts assigned per cell is used as an offset in the GLM. The
algorithm applies this offset internally; however, the user must supply
the library sizes. In our example data, the library sizes and the
covariates are given under the lists lib_size and covariates,
respectively.
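As an illustration of what a log library size offset does, here is a small base-R sketch with simulated data and a plain Poisson GLM. It is not code from scShapes. The variable names (lib_size, cell_type) and all simulated values are assumptions made for the example.

```r
set.seed(1)

# Simulated toy data (all values are made up)
n_cells   <- 200
lib_size  <- round(runif(n_cells, min = 1000, max = 5000))  # total UMIs per cell
cell_type <- factor(rep(c("A", "B"), each = n_cells / 2))   # a single covariate

# Expected count = library size * rate, where the rate depends on cell type
rate <- ifelse(cell_type == "A", 0.001, 0.003)
y    <- rpois(n_cells, lambda = lib_size * rate)

# Poisson GLM with log(library size) as an offset, so the coefficients
# describe rates per UMI rather than raw counts
fit <- glm(y ~ cell_type + offset(log(lib_size)), family = poisson)

# exp(coefficient) estimates the rate ratio of cell type B vs A (simulated truth: 3)
rate_ratio <- exp(coef(fit)[["cell_typeB"]])
rate_ratio
```

Because the offset enters the linear predictor with a fixed coefficient of 1, cells with deeper sequencing are expected to have proportionally higher counts without this affecting the estimated cell-type effect.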
Perform a Kolmogorov-Smirnov test to select genes belonging to the family of ZINB distributions.
scData_KS <- ks_test(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=SnowParam(workers=8,type="SOCK"))
# Select genes significant from the KS test.
# By default the 'ks_sig' function performs Benjamini-Hochberg correction for multiple hypothesis testing
# and selects genes significant at p-value of 0.01
scData_KS_sig <- ks_sig(scData_KS)
# Subset UMI counts corresponding to the genes significant from the KS test
scData.sig.genes <- scData$counts[rownames(scData$counts) %in% names(scData_KS_sig$genes),]
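For readers unfamiliar with the Benjamini-Hochberg step mentioned in the comments above, the following base-R sketch applies it to a handful of made-up p-values using p.adjust. It only illustrates the correction and the 0.01 cutoff and is not the body of ks_sig.

```r
# Made-up p-values for five hypothetical genes
pvals <- c(gene1 = 0.0001, gene2 = 0.002, gene3 = 0.02, gene4 = 0.3, gene5 = 0.8)

# Benjamini-Hochberg adjustment for multiple testing
padj <- p.adjust(pvals, method = "BH")

# Genes significant at an adjusted p-value of 0.01
sig_genes <- names(padj)[padj < 0.01]
sig_genes
```

In this toy example gene1 and gene2 remain significant after adjustment, while gene3 (raw p-value 0.02) does not.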
Fit the four distributions (Poisson, NB, ZIP and ZINB) to the genes that belong to the ZINB family of distributions, using a GLM with the log of the library sizes as an offset and cell types as a covariate.
scData_models <- fit_models(counts=scData.sig.genes, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=SnowParam(workers=8,type="SOCK"))
Once the four distributions are fitted, we calculate the BIC value for each model and select the model with the lowest BIC value.
scData_bicvals <- model_bic(scData_models)
# Select the model with the lowest BIC value
scData_least.bic <- lbic_model(scData_bicvals, scData$counts)
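The idea of choosing a model by lowest BIC can be sketched outside the package with two of the four candidate distributions. The example below simulates one overdispersed gene and compares a Poisson fit against a negative binomial fit from MASS::glm.nb. The data, the variable names and the restriction to two models are assumptions made for brevity, and this is not how model_bic or lbic_model are implemented.

```r
library(MASS)  # provides glm.nb; MASS ships with standard R installations

set.seed(2)

# Simulated toy data for one hypothetical gene (all values are made up)
n_cells   <- 300
lib_size  <- round(runif(n_cells, min = 1000, max = 5000))
cell_type <- factor(rep(c("A", "B"), each = n_cells / 2))

# Overdispersed counts: negative binomial with dispersion parameter size = 1
mu <- lib_size * ifelse(cell_type == "A", 0.002, 0.004)
y  <- rnbinom(n_cells, mu = mu, size = 1)

# Fit two candidate models with the same offset and covariate
fit_p  <- glm(y ~ cell_type + offset(log(lib_size)), family = poisson)
fit_nb <- glm.nb(y ~ cell_type + offset(log(lib_size)))

# BIC = -2 * logLik + k * log(n); the smallest value indicates the preferred model
bic_vals   <- c(P = BIC(fit_p), NB = BIC(fit_nb))
best_model <- names(which.min(bic_vals))

bic_vals
best_model
```

Since the counts were simulated with strong overdispersion, the negative binomial model should have the lower BIC despite its extra parameter.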
To check the fit of the models selected by lowest BIC value, we additionally perform a likelihood ratio test (LRT) for model adequacy and for the presence of zero-inflation.
scData_gof <- gof_model(scData_least.bic, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=SnowParam(workers=8,type="SOCK"))
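As a generic illustration of a likelihood ratio test between nested count models, the sketch below tests a Poisson fit against a negative binomial fit on simulated data. This is an assumption-laden example and not the test performed inside gof_model, whose exact hypotheses and reference distribution are not described here. Halving the chi-square tail probability is a common adjustment when the extra parameter lies on the boundary of its range under the null.

```r
library(MASS)  # provides glm.nb

set.seed(3)

# Simulated toy data for one hypothetical gene (all values are made up)
n_cells   <- 300
lib_size  <- round(runif(n_cells, min = 1000, max = 5000))
cell_type <- factor(rep(c("A", "B"), each = n_cells / 2))
mu        <- lib_size * ifelse(cell_type == "A", 0.002, 0.004)
y         <- rnbinom(n_cells, mu = mu, size = 1)  # overdispersed counts

# Null model (Poisson) is nested within the alternative (negative binomial)
fit_p  <- glm(y ~ cell_type + offset(log(lib_size)), family = poisson)
fit_nb <- glm.nb(y ~ cell_type + offset(log(lib_size)))

# LRT statistic: twice the difference in maximized log-likelihoods
lrt_stat <- as.numeric(2 * (logLik(fit_nb) - logLik(fit_p)))

# One extra parameter (the dispersion), tested on the boundary of its range,
# so the chi-square(1) tail probability is halved
p_value <- 0.5 * pchisq(lrt_stat, df = 1, lower.tail = FALSE)

lrt_stat
p_value
```

A small p-value here indicates that the Poisson model is inadequate for these counts and that the extra dispersion parameter is warranted.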
Finally, based on the results of the model adequacy tests, we can identify the distribution of best fit for each gene.
scData_fit <- select_model(scData_gof)
Once the distribution of best fit is identified for the genes of interest, it is also possible to extract the parameters of interest for the models.
scData_params <- model_param(scData_models, scData_fit, model=NULL)
If our dataset consists of multiple conditions, we can follow the above approach to identify the best-fit distribution shape for each gene under each treatment condition (selecting the subset of genes common between conditions). Then, using the dataframe of genes and the distribution followed under each condition, we can identify genes changing distribution between conditions.
For example, suppose we follow the above pipeline for scRNA-seq data on
two treatment conditions ‘CTRL’ and ‘STIM’ and have identified the best
distribution fit for each gene under each condition independently.
Suppose the dataframe ifnb.distr has genes as rows and columns ‘CTRL’
and ‘STIM’ holding the name of the distribution each gene follows. We
can then identify genes changing distribution shape between ‘CTRL’ and
‘STIM’ as:
ifnb.DD.genes <- change_shape(ifnb.distr)
This returns a list of two lists: genes changing distribution between conditions, and genes changing distribution from unimodal in one condition to zero-inflated in the other condition.
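To show the shape of the input and the kind of comparison involved, here is a base-R sketch on a made-up table of best-fit distributions. It is not the implementation of change_shape. The gene names, the distribution labels, the grouping of P and NB as unimodal, and the names of the output list elements are all assumptions for illustration.

```r
# Hypothetical table of best-fit distributions per gene and condition
distr <- data.frame(
  CTRL = c("NB", "P",  "ZINB", "NB"),
  STIM = c("NB", "NB", "ZINB", "ZINB"),
  row.names = c("gene1", "gene2", "gene3", "gene4"),
  stringsAsFactors = FALSE
)

# Genes whose best-fit distribution differs between the two conditions
changed <- rownames(distr)[distr$CTRL != distr$STIM]

# Of those, genes that are unimodal in one condition and zero-inflated in the other
unimodal      <- c("P", "NB")
zero_inflated <- c("ZIP", "ZINB")
switch_zi <- rownames(distr)[
  (distr$CTRL %in% unimodal      & distr$STIM %in% zero_inflated) |
  (distr$CTRL %in% zero_inflated & distr$STIM %in% unimodal)
]

result <- list(changed = changed, unimodal_to_zero_inflated = switch_zi)
result
```

In this toy table gene2 and gene4 change distribution between conditions, and only gene4 moves from a unimodal to a zero-inflated shape.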
Here is the output of sessionInfo() on the system on which this document was compiled:
sessionInfo()
#> R version 4.2.1 Patched (2022-07-09 r82577)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_GB/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BiocParallel_1.32.0 scShapes_1.4.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.9 bslib_0.4.0 compiler_4.2.1
#> [4] jquerylib_0.1.4 plyr_1.8.7 tools_4.2.1
#> [7] digest_0.6.30 jsonlite_1.8.3 evaluate_0.17
#> [10] lattice_0.20-45 rlang_1.0.6 Matrix_1.5-1
#> [13] cli_3.4.1 yaml_2.3.6 parallel_4.2.1
#> [16] mvtnorm_1.1-3 xfun_0.34 fastmap_1.1.0
#> [19] coda_0.19-4 stringr_1.4.1 knitr_1.40
#> [22] sass_0.4.2 stats4_4.2.1 grid_4.2.1
#> [25] R6_2.5.1 snow_0.4-4 bdsmatrix_1.3-6
#> [28] emdbook_1.3.12 VGAM_1.1-7 rmarkdown_2.17
#> [31] magrittr_2.0.3 codetools_0.2-18 htmltools_0.5.3
#> [34] MASS_7.3-58.1 splines_4.2.1 bbmle_1.0.25
#> [37] numDeriv_2016.8-1.1 dgof_1.4 stringi_1.7.8
#> [40] pscl_1.5.5 cachem_1.0.6