Earth Mover’s Distance

EMDomics analyzes differences in genomics data between two groups. Typically the data will be gene expression levels from array- or sequence-based experiments, but other scenarios are possible. In a real experiment, the two groups might be test vs. control, sensitive vs. resistant, etc., but here we’ll just call them “group A” and “group B”. Typically you’ll be analyzing differences across multiple genes, but we’ll start with a single gene to get a feel for how the Earth Mover’s Distance (EMD) algorithm works.

We’ll create a vector of expression data for 100 samples, and we’ll call the first 50 samples “group A” and the second 50 “group B”:

exp_data <- rnorm(100)
names(exp_data) <- paste("sample", 1:100)

groupA <- names(exp_data)[1:50]
groupB <- names(exp_data)[51:100]

We’ll take a quick look at the two distributions using ggplot:

library(ggplot2)
df <- as.data.frame(exp_data)
df$group[1:50] <- "A"
df$group[51:100] <- "B"
ggplot(df, aes(exp_data, fill=group)) + geom_density(alpha=0.5)

We shouldn’t expect the two groups to look too different, since we’re just sampling from the normal distribution. Intuitively, the “work” required to transform one distribution into the other should be low. We can calculate the EMD score for this single gene using the function calculate_emd_gene:

library(EMDomics)
calculate_emd_gene(exp_data, groupA, groupB)

## [1] 1.2

Now we’ll modify the expression data for group A and see how the EMD score changes. We’ll randomly add or subtract 2 from each data point in group A:

exp_data2 <- exp_data
mod_vec <- sample(c(2,-2), 50, replace=TRUE)
exp_data2[1:50] <- exp_data2[1:50] + mod_vec

Let’s again visualize the two distributions and calculate the EMD score:

df <- as.data.frame(exp_data2)
df$group[1:50] <- "A"
df$group[51:100] <- "B"
ggplot(df, aes(exp_data2, fill=group)) + geom_density(alpha=0.5)

calculate_emd_gene(exp_data2, groupA, groupB)

## [1] 5.96

The EMD score is larger, reflecting the increased work needed to transform one distribution into the other.

Analyzing Significance

The EMD score increases as the two distributions become increasingly dissimilar, but we have no framework for estimating the significance of a particular EMD score. EMDomics uses a permutation-based method to calculate a q-value that is interpreted analogously to a p-value. To access the full functionality of the package, we’ll use the function calculate_emd.

We’ll first create a matrix of gene expression data for 100 samples (tumors, patients, etc.) and 100 genes. We’ll just sample from the normal distribution for now. The first 50 samples will be our “group A”, and the second 50 will be “group B”.

data <- matrix(rnorm(10000), nrow=100, ncol=100)
rownames(data) <- paste("gene", 1:100, sep="")
colnames(data) <- paste("sample", 1:100, sep="")

groupA <- colnames(data)[1:50]
groupB <- colnames(data)[51:100]

Now we can call calculate_emd. We’ll only use 10 permutations for the purposes of this vignette, but in actual experiments using at least 100 permutations is advised. For this example we will turn of parallel processing, but in general it should be enabled.

results <- calculate_emd(data, groupA, groupB, nperm=10, parallel=FALSE)

Most of the time, you’ll be interested in the emd matrix returned as a member of the return object:

emd <- results$emd
head(emd)

##        emd          fc   q-value
## gene1 1.28  0.13351359 0.7368421
## gene2 1.30 -0.24102144 0.7368421
## gene3 2.02 -0.36741781 0.0000000
## gene4 0.64 -0.02864195 1.0000000
## gene5 1.74 -0.16910111 0.0000000
## gene6 1.22  0.09344968 0.9069767

This matrix lists the emd score, the fold change (defined as log2(2^meanA/2^meanB)), and the q-value for each gene in the data set. Because we’re not analyzing many genes and the data is randomly generated, there mqy be some significant q-values in the results simply by chance. We can order the emd matrix by q-value:

emd2 <- emd[(order(emd[,"q-value"])),]
head(emd2)

##         emd         fc q-value
## gene3  2.02 -0.3674178       0
## gene5  1.74 -0.1691011       0
## gene7  1.92  0.1331544       0
## gene13 1.86  0.3882018       0
## gene15 1.84 -0.3497934       0
## gene17 1.80  0.3669995       0

Note the correlation of significant q-values with relatively large EMD scores and with relatively high fold change.

Visualization

EMDomics includes a few visualization functions. The function plot_density will display the density distributions of “group A” and “group B” for a given gene, along with the EMD score. We can compare the gene with the largest EMD score and the gene with the smallest EMD score, for example:

emd3 <- emd[(order(emd[,"emd"])),]
smallest_gene <- rownames(emd3)[1]
biggest_gene <- rownames(emd3)[nrow(emd3)]

plot_density(results, smallest_gene)

plot_density(results, biggest_gene)

The smallest EMD score reflects two distributions that are nearly the same overall shape, while the largest EMD score corresponds to distributions that differ significantly.

We can plot a histogram of all the calculated EMD scores with the function plot_perms:

plot_perms(results)

This plot can help intuitively understand the relative significance of an EMD score. For example, almost all the randomly permuted EMD scores are smaller than the largest calculated EMD score plotted above.

In a similar vein, the function plot_emdnull plots the null distribution (the median of the permuted EMD scores) for each gene vs. the calculated EMD score (the line x=y is superimposed in red):

plot_emdnull(results)

Wrapping Up

This concludes the EMDomics vignette. For additional information, please consult the reference manual.

Session Info

## R version 3.2.0 (2015-04-16)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] EMDomics_1.0.0 ggplot2_1.0.1 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.5          knitr_1.9            MASS_7.3-40         
##  [4] munsell_0.4.2        BiocParallel_1.2.0   colorspace_1.2-6    
##  [7] stringr_0.6.2        plyr_1.8.1           tools_3.2.0         
## [10] parallel_3.2.0       grid_3.2.0           gtable_0.1.2        
## [13] lambda.r_1.1.7       futile.logger_1.4    matrixStats_0.14.0  
## [16] htmltools_0.2.6      yaml_2.1.13          digest_0.6.8        
## [19] reshape2_1.4.1       formatR_1.1          futile.options_1.0.0
## [22] evaluate_0.6         rmarkdown_0.5.1      labeling_0.3        
## [25] scales_0.2.4         emdist_0.3-1         proto_0.3-10

EMDomics Vignette

Daniel Schmolze

2015-04-18

Welcome

Earth Mover’s Distance

Analyzing Significance

Visualization

Wrapping Up

Session Info