The waddR
package provides an adaptation of the semi-parametric testing procedure based on the 2-Wasserstein distance which is specifically tailored to identify differential distributions in single-cell RNA-seqencing (scRNA-seq) data.
In particular, a two-stage (TS) approach has been implemented that takes account of the specific nature of scRNA-seq data by separately testing for differential proportions of zero gene expression (using a logistic regression model) and differences in non-zero gene expression (using the semi-parametric 2-Wasserstein distance-based test) between two conditions.
This an example on how to analyse the expression of single cell data in two conditions. As example data we will look at single cell data from decidua and blood samples analysed by Vento-Tormo et al. 2018 (https://doi.org/10.1038/s41586-018-0698-6). For this demonstration we use only one replicate of the two conditions that has been normalized, scaled and is available for download on github.
First, we will first load all required packages:
suppressPackageStartupMessages({
library("waddR")
library("SingleCellExperiment")
library("BiocFileCache")
})
Download the example data set
url.base <- "https://github.com/goncalves-lab/waddR-data/blob/master/data/"
sce.blood.url <- paste0(url.base, "sce_blood.RDa?raw=true")
sce.decidua.url <- paste0(url.base, "sce_decidua.RDa?raw=true")
getCachedPath <- function(url, rname){
bfc <- BiocFileCache() # fire up cache
res <- bfcquery(bfc, url, field="fpath", exact=TRUE)
if (bfccount(res) == 0L)
cachedFilePath <- bfcadd(bfc, rname=rname, fpath=url)
else
cachedFilePath <- bfcpath(bfc, res[["rid"]])
cachedFilePath
}
load(getCachedPath(sce.blood.url, "sce.blood"))
load(getCachedPath(sce.decidua.url, "sce.decidua"))
We have downloaded the two SingleCellExperiment
objects sce.blood
and sce.decidua
. Next we will randomly select a subset of genes with which we will proceed, a whole analysis would be out of scope for this vignette.
set.seed(28)
randgenes <- sample(rownames(sce.blood), 2500, replace=FALSE)
sce.blood <- sce.blood[randgenes, ]
sce.decidua <- sce.decidua[randgenes, ]
The input data to wasserstein.sc
always discriminates two conditions. It can be given either as a data matrix and a vector explaining the condition of each column, or a SingleCellExperiment
for each condition.
We will proceed to use the two SingleCellExperiment
objects we loaded. The SingleCellExperiment
objects passed to wasserstein.sc
must contain a counts
assay each.
assays(sce.blood)
#> List of length 1
#> names(1): counts
assays(sce.decidua)
#> List of length 1
#> names(1): counts
Before testing for differential expression in the two conditions, the counts should be normalized for cell or gene-specific biases. We can skip this here since the example data has already been normalized and an entire analysis would be out of scope for this vignette. The detailed processing steps performed on this data can be seen here.
wsres <- wasserstein.sc(sce.blood, sce.decidua, "OS")
head(wsres, n=10L)
#> d.wass d.wass^2 d.comp^2 d.comp
#> GNPTG-ENSG00000090581 0.20573251 0.042325867 0.042627776 0.20646495
#> HIVEP3-ENSG00000127124 0.20588614 0.042389102 0.043288184 0.20805813
#> C6orf47-AS1-ENSG00000227198 0.00000000 0.000000000 0.000000000 0.00000000
#> RP11-377D9.3-ENSG00000255621 0.00000000 0.000000000 0.000000000 0.00000000
#> BBOX1-ENSG00000129151 0.00000000 0.000000000 0.000000000 0.00000000
#> CEP290-ENSG00000198707 0.16215300 0.026293594 0.025579340 0.15993542
#> FOXRED2-ENSG00000100350 0.07682033 0.005901363 0.005922216 0.07695594
#> RP11-166D18.1-ENSG00000251471 0.00000000 0.000000000 0.000000000 0.00000000
#> BBX-ENSG00000114439 0.39587040 0.156713370 0.157161283 0.39643572
#> MESTIT1-ENSG00000272701 0.00000000 0.000000000 0.000000000 0.00000000
#> location size shape rho
#> GNPTG-ENSG00000090581 4.142331e-04 0.019841603 0.02237194 0.9344195
#> HIVEP3-ENSG00000127124 3.160493e-03 0.017597022 0.02253067 0.8807532
#> C6orf47-AS1-ENSG00000227198 0.000000e+00 0.000000000 0.00000000 0.0000000
#> RP11-377D9.3-ENSG00000255621 0.000000e+00 0.000000000 0.00000000 0.0000000
#> BBOX1-ENSG00000129151 0.000000e+00 0.000000000 0.00000000 0.0000000
#> CEP290-ENSG00000198707 9.538226e-04 0.004271303 0.02035421 0.7916142
#> FOXRED2-ENSG00000100350 4.165072e-05 0.005880565 0.00000000 0.0000000
#> RP11-166D18.1-ENSG00000251471 0.000000e+00 0.000000000 0.00000000 0.0000000
#> BBX-ENSG00000114439 2.463632e-02 0.002027704 0.13049726 0.8288606
#> MESTIT1-ENSG00000272701 0.000000e+00 0.000000000 0.00000000 0.0000000
#> pval p.ad.gdp N.exc perc.loc perc.size
#> GNPTG-ENSG00000090581 0.20570000 NA NA 0.97 46.55
#> HIVEP3-ENSG00000127124 0.08620000 NA NA 7.30 40.65
#> C6orf47-AS1-ENSG00000227198 1.00000000 NA NA NaN NaN
#> RP11-377D9.3-ENSG00000255621 1.00000000 NA NA NaN NaN
#> BBOX1-ENSG00000129151 1.00000000 NA NA NaN NaN
#> CEP290-ENSG00000198707 0.22640000 NA NA 3.73 16.70
#> FOXRED2-ENSG00000100350 0.60040000 NA NA 0.70 99.30
#> RP11-166D18.1-ENSG00000251471 1.00000000 NA NA NaN NaN
#> BBX-ENSG00000114439 0.00079992 NA NA 15.68 1.29
#> MESTIT1-ENSG00000272701 1.00000000 NA NA NaN NaN
#> perc.shape decomp.error pval.adj
#> GNPTG-ENSG00000090581 52.48 0.007132959 0.75513950
#> HIVEP3-ENSG00000127124 52.05 0.021210199 0.39541284
#> C6orf47-AS1-ENSG00000227198 NaN 0.000000000 1.00000000
#> RP11-377D9.3-ENSG00000255621 NaN 0.000000000 1.00000000
#> BBOX1-ENSG00000129151 NaN 0.000000000 1.00000000
#> CEP290-ENSG00000198707 79.57 0.027164565 0.79830748
#> FOXRED2-ENSG00000100350 0.00 0.003533602 1.00000000
#> RP11-166D18.1-ENSG00000251471 NaN 0.000000000 1.00000000
#> BBX-ENSG00000114439 83.03 0.002858166 0.01242112
#> MESTIT1-ENSG00000272701 NaN 0.000000000 1.00000000
The 2-Wasserstein single-cell testing procedure is performed on all genes in the data set – one line of the output describes the result of testing one gene.
Note: The two-stage (TS) procedure includes a test for differential proportions of zero gene expression, considering the proportion of zeroes over all conditions and genes in a model. It may yield incorrect results, if it is applied on only a subset of genes is given.
Note: For reproducible p-values, see the “seed” argument.
Output Column | Description |
---|---|
d.wass | 2-Wasserstein distance between the two samples |
d.wass^2 | squared 2-Wasserstein distance between the two samples |
d.comp^2 | squared 2-Wasserstein distance (decomposition approx.) |
d.comp | 2-Wasserstein distance (decomposition approx.) |
location | location term in the squared 2-Wasserstein decomposition |
size | size term in the squared 2-Wasserstein decomposition |
shape | shape term in the squared 2-Wasserstein decomposition |
rho | correlation coefficient in the quantile-quantile plot |
pval | p-value of the semi-parametric test |
p.ad.gpd | p-value of the Anderson-Darling test to check whether the GPD actually fits the data well * |
N.exc | number of exceedances (starting with 250 and iteratively decreased by 10 if necessary) that are required to obtain a good GPD fit (i.e. p-value of Anderson-Darling test greater or eqaul to 0.05) * |
perc.loc | fraction (in %) of the location part with respect to the overall squared 2-Wasserstein distance obtained by the decomposition approximation |
perc.size | fraction (in %) of the size part with respect to the overall squared 2-Wasserstein distance obtained by the decomposition approximation |
perc.shape | fraction (in %) of the shape part with respect to the overall squared 2-Wasserstein distance obtained by the decomposition approximation |
decomp.error | relative error between the squared 2-Wasserstein distance computed by the quantile approximation and the squared 2-Wasserstein distance computed by the decomposition approximation |
p.zero | p-value of the test for differential proportions of zero expression (logistic regression model) ** |
p.combined | combined p-value of p.nonzero and p.zero obtained by Fisher’s method ** |
pval.adj | adjusted p-value of the semi-parametric 2-Wasserstein distance-based test according to the method of Benjamini-Hochberg |
p.adj.zero | adjusted p-value of the test for differential proportions of zero expression (logistic regression model) according to the method of Benjamini-Hochberg ** |
p.adj.combined | adjusted combined p-value of p.nonzero and p.zero obtained by Fisher’s method according to the method of Benjamini-Hochberg ** |
* only if GDP fitting is performed (NA otherwise); ** only if the two-stage test is run
The waddR
package
Fast and accurate calculation of the Wasserstein distance
Two-sample test to check for differences between two distributions
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] BiocFileCache_1.14.0 dbplyr_1.4.4
#> [3] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
#> [5] Biobase_2.50.0 GenomicRanges_1.42.0
#> [7] GenomeInfoDb_1.26.0 IRanges_2.24.0
#> [9] S4Vectors_0.28.0 BiocGenerics_0.36.0
#> [11] MatrixGenerics_1.2.0 matrixStats_0.57.0
#> [13] waddR_1.4.0
#>
#> loaded via a namespace (and not attached):
#> [1] httr_1.4.2 bit64_4.0.5 splines_4.0.3
#> [4] Formula_1.2-4 assertthat_0.2.1 statmod_1.4.35
#> [7] latticeExtra_0.6-29 blob_1.2.1 GenomeInfoDbData_1.2.4
#> [10] yaml_2.2.1 backports_1.1.10 pillar_1.4.6
#> [13] RSQLite_2.2.1 lattice_0.20-41 glue_1.4.2
#> [16] digest_0.6.27 RColorBrewer_1.1-2 XVector_0.30.0
#> [19] checkmate_2.0.0 minqa_1.2.4 colorspace_1.4-1
#> [22] htmltools_0.5.0 Matrix_1.2-18 pkgconfig_2.0.3
#> [25] zlibbioc_1.36.0 purrr_0.3.4 scales_1.1.1
#> [28] jpeg_0.1-8.1 lme4_1.1-25 BiocParallel_1.24.0
#> [31] arm_1.11-2 tibble_3.0.4 htmlTable_2.1.0
#> [34] generics_0.0.2 ggplot2_3.3.2 ellipsis_0.3.1
#> [37] nnet_7.3-14 survival_3.2-7 magrittr_1.5
#> [40] crayon_1.3.4 memoise_1.1.0 evaluate_0.14
#> [43] nlme_3.1-150 MASS_7.3-53 foreign_0.8-80
#> [46] data.table_1.13.2 tools_4.0.3 lifecycle_0.2.0
#> [49] stringr_1.4.0 munsell_0.5.0 cluster_2.1.0
#> [52] DelayedArray_0.16.0 compiler_4.0.3 rlang_0.4.8
#> [55] nloptr_1.2.2.2 grid_4.0.3 RCurl_1.98-1.2
#> [58] rstudioapi_0.11 htmlwidgets_1.5.2 rappdirs_0.3.1
#> [61] bitops_1.0-6 base64enc_0.1-3 rmarkdown_2.5
#> [64] boot_1.3-25 gtable_0.3.0 abind_1.4-5
#> [67] DBI_1.1.0 curl_4.3 R6_2.4.1
#> [70] gridExtra_2.3 knitr_1.30 dplyr_1.0.2
#> [73] bit_4.0.4 Hmisc_4.4-1 stringi_1.5.3
#> [76] Rcpp_1.0.5 vctrs_0.3.4 rpart_4.1-15
#> [79] png_0.1-7 coda_0.19-4 tidyselect_1.1.0
#> [82] xfun_0.18