Two Sample Testing Based on The 2-Wasserstein Distance

Testing Procedures

The package waddR provides two testing procedures using the 2-Wasserstein distance to test whether two distributions \(F_A\) and \(F_B\), given in the form of samples, are different, specifically by testing the null hypothesis \(\mathcal{H}_0: F_A = F_B\) against the alternative \(\mathcal{H}_1: F_A \neq F_B\).

The first, semi-parametric (SP), procedure uses a test based on permutations combined with a generalized Pareto distribution approximation to estimate small p-values accurately.
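To illustrate the idea (this is not the actual waddR implementation), a plain permutation test with the squared 2-Wasserstein distance as test statistic could be sketched in base R as follows. Unlike the SP procedure, this sketch omits the generalized Pareto tail approximation, and the helper names sq_wass and perm_test_sketch are made up for illustration.

# Illustration only, not the waddR implementation: plain permutation test
# using the squared 2-Wasserstein distance as test statistic.
sq_wass <- function(x, y, n_q = 1000) {
  # squared 2-Wasserstein distance approximated from empirical quantiles
  probs <- (seq_len(n_q) - 0.5) / n_q
  mean((quantile(x, probs, names = FALSE) - quantile(y, probs, names = FALSE))^2)
}
perm_test_sketch <- function(x, y, n_perm = 1000) {
  obs <- sq_wass(x, y)
  pooled <- c(x, y)
  stats <- replicate(n_perm, {
    idx <- sample(length(pooled), length(x))  # random relabelling under H0
    sq_wass(pooled[idx], pooled[-idx])
  })
  (sum(stats >= obs) + 1) / (n_perm + 1)      # permutation p-value
}
perm_test_sketch(rnorm(100, 0, 1), rnorm(100, 1, 1))  # example call with toy data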

The second procedure (ASY) uses a test based on asymptotic theory, which is valid only if the samples can be assumed to come from continuous distributions.

Examples

To demonstrate the capabilities of the testing procedures, we consider models based on normal distributions here. We construct three example cases in which two distributions (samples) differ with respect to location, size, and shape, respectively, and one case without a difference. For convenience, we focus on the p-value, the value of the (squared) 2-Wasserstein distance, and the fractions (in %) of the location, size, and shape terms with respect to the (squared) 2-Wasserstein distance; the functions also provide additional output.

library(waddR)

spec.output <- c("pval", "d.wass^2", "perc.loc", "perc.size", "perc.shape")

We start with an example in which the two distributions (samples) differ only with respect to location, and show the results for the two testing procedures, the semi-parametric (SP) and the asymptotic theory-based (ASY) one.

Please note that the method “SP” with permnum = 10000 is used by default in calls to wasserstein.test if nothing else is specified.

set.seed(24)
ctrl <- rnorm(300, 0, 1)
set.seed(24)
dd1 <- rnorm(300, 1, 1)
wasserstein.test(ctrl, dd1, method="SP", permnum=10000)[spec.output]
#>       pval   d.wass^2   perc.loc  perc.size perc.shape 
#>  9.999e-05  1.000e+00  1.000e+02  0.000e+00  0.000e+00
wasserstein.test(ctrl, dd1, method="ASY")[spec.output]
#>       pval   d.wass^2   perc.loc  perc.size perc.shape 
#>          0          1        100          0          0

We obtain a very low p-value, pointing at a difference between the distributions, and see that the location term accounts for by far the largest part of the 2-Wasserstein distance.
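As a quick sanity check, note that with the same seed dd1 is exactly ctrl shifted by 1, so the squared mean difference, i.e. the location term, equals 1 while the size and shape terms vanish:

# dd1 is exactly ctrl + 1 (same seed), so the location term equals 1
all.equal(dd1, ctrl + 1)
(mean(dd1) - mean(ctrl))^2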

Analogously, we look at a case in which the two distributions (samples) differ only with respect to size. Here, we display the full output of wasserstein.test to illustrate the additional quantities it returns.

set.seed(24)
ctrl <- rnorm(300, 0, 1)
set.seed(24)
dd2 <- rnorm(300, 0, 2)
wasserstein.test(ctrl, dd2)
#>       d.wass     d.wass^2     d.comp^2       d.comp     location         size 
#> 9.980314e-01 9.960666e-01 9.993665e-01 9.996832e-01 9.405157e-03 9.899613e-01 
#>        shape          rho         pval     p.ad.gdp        N.exc     perc.loc 
#> 8.792623e-16 1.000000e+00 9.999000e-05           NA           NA 9.400000e-01 
#>    perc.size   perc.shape decomp.error 
#> 9.906000e+01 0.000000e+00 3.312902e-03
wasserstein.test(ctrl, dd2, method="ASY")
#>       d.wass     d.wass^2     d.comp^2       d.comp     location         size 
#> 9.980314e-01 9.960666e-01 9.993665e-01 9.996832e-01 9.405157e-03 9.899613e-01 
#>        shape          rho         pval     perc.loc    perc.size   perc.shape 
#> 8.792623e-16 1.000000e+00 2.400000e-05 9.400000e-01 9.906000e+01 0.000000e+00 
#> decomp.error 
#> 3.301963e-03
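As a quick sanity check, note that with the same seed dd2 is exactly 2 * ctrl, so the distance is dominated by the size term (roughly the squared difference of the standard deviations), while the small location fraction reflects the doubled, slightly nonzero sample mean:

# dd2 is exactly 2 * ctrl (same seed); the size term roughly equals
# (sd(dd2) - sd(ctrl))^2, and the small location term stems from the
# slightly nonzero sample mean being doubled
all.equal(dd2, 2 * ctrl)
(sd(dd2) - sd(ctrl))^2
(mean(dd2) - mean(ctrl))^2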

Similarly, we consider an example in which the two distributions (samples) differ only with respect to shape. To this end, dd3 is drawn from an equal mixture of N(3, 1) and N(10, 1), which has the same mean (6.5) and variance (13.25) as the normal distribution underlying ctrl.

set.seed(24)
ctrl <- rnorm(300, 6.5, sqrt(13.25))
set.seed(24)
sam1 <- rnorm(300, 3, 1)
set.seed(24)
sam2 <- rnorm(300, 10, 1)
dd3 <- sapply(seq_len(300), 
              function(n) {
                sample(c(sam1[n], sam2[n]), 1, prob=c(0.5, 0.5))})
wasserstein.test(ctrl, dd3)[spec.output]
#>        pval    d.wass^2    perc.loc   perc.size  perc.shape 
#>  0.00009999  2.24119267  9.71000000  0.00000000 90.29000000
wasserstein.test(ctrl, dd3, method="ASY")[spec.output]
#>       pval   d.wass^2   perc.loc  perc.size perc.shape 
#>   0.000007   2.241193   9.710000   0.000000  90.290000
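As a quick sanity check, ctrl and dd3 were constructed to share mean 6.5 and variance 13.25, so most of the detected difference must be due to shape:

# ctrl and dd3 are built to have (approximately) equal means and variances
c(mean(ctrl), mean(dd3))
c(var(ctrl), var(dd3))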

Finally, we show an example in which there is no difference between the distributions. We obtain a high p-value, indicating that the null hypothesis cannot be rejected.

set.seed(24)
ctrl <- rnorm(300, 0, 1)
set.seed(24)
nodd <- rnorm(300, 0, 1)
wasserstein.test(ctrl, nodd)[spec.output]
#>       pval   d.wass^2   perc.loc  perc.size perc.shape 
#>          1          0          0          0        100
wasserstein.test(ctrl, nodd, method="ASY")[spec.output]
#>       pval   d.wass^2   perc.loc  perc.size perc.shape 
#>          1          0          0          0        100

Reproducibility of Results

Using the method “SP” (default), the semi-parametric test is performed to obtain p-values. Since the permutation procedure is based on random resampling of the input vectors, a seed should be set in the user environment for the results to be reproducible.

# set seed for reproducible p-values in re-runs
set.seed(123)
wasserstein.test(sam1, nodd)[spec.output]
#>       pval   d.wass^2   perc.loc  perc.size perc.shape 
#>  9.999e-05  9.000e+00  1.000e+02  0.000e+00  0.000e+00
# reproduce the result
set.seed(123)
wasserstein.test(sam1, nodd)[spec.output]
#>       pval   d.wass^2   perc.loc  perc.size perc.shape 
#>  9.999e-05  9.000e+00  1.000e+02  0.000e+00  0.000e+00

Session Info

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] BiocFileCache_1.10.2        dbplyr_1.4.2               
#>  [3] SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.1
#>  [5] DelayedArray_0.12.2         BiocParallel_1.20.1        
#>  [7] matrixStats_0.56.0          Biobase_2.46.0             
#>  [9] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
#> [11] IRanges_2.20.2              S4Vectors_0.24.3           
#> [13] BiocGenerics_0.32.0         waddR_1.0.1                
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.0.0       xfun_0.12              purrr_0.3.3           
#>  [4] splines_3.6.3          lattice_0.20-40        vctrs_0.2.4           
#>  [7] htmltools_0.4.0        yaml_2.2.1             blob_1.2.1            
#> [10] rlang_0.4.5            nloptr_1.2.2.1         pillar_1.4.3          
#> [13] glue_1.3.2             DBI_1.1.0              rappdirs_0.3.1        
#> [16] bit64_0.9-7            GenomeInfoDbData_1.2.2 stringr_1.4.0         
#> [19] zlibbioc_1.32.0        coda_0.19-3            memoise_1.1.0         
#> [22] evaluate_0.14          knitr_1.28             curl_4.3              
#> [25] Rcpp_1.0.4             arm_1.10-1             XVector_0.26.0        
#> [28] abind_1.4-5            bit_1.1-15.2           lme4_1.1-21           
#> [31] digest_0.6.25          stringi_1.4.6          dplyr_0.8.5           
#> [34] grid_3.6.3             tools_3.6.3            bitops_1.0-6          
#> [37] magrittr_1.5           RCurl_1.98-1.1         tibble_2.1.3          
#> [40] RSQLite_2.2.0          crayon_1.3.4           pkgconfig_2.0.3       
#> [43] MASS_7.3-51.5          Matrix_1.2-18          minqa_1.2.4           
#> [46] assertthat_0.2.1       rmarkdown_2.1          httr_1.4.1            
#> [49] boot_1.3-24            R6_2.4.1               nlme_3.1-145          
#> [52] compiler_3.6.3