ccmap finds drugs and drug combinations that are predicted to reverse or mimic gene expression signatures. These drugs might reverse diseases or mimic healthy lifestyles.


Query Signatures

To obtain a query gene expression signature, it is reccommended that you perform a meta-analysis of all gene expression studies that have compared similar groups. This can be accomplished with the crossmeta package.

This meta-analysis approach was validated by querying the cmap drug signatures using independant drug expression data. Query signatures from meta-analyses have improved rankings of the selfsame cmap drugs (figure 1).

Figure 1. Receiver operating curves comparing query results using signatures from individual contrasts (auc = 0.720) to meta-analyses (auc = 0.913). Queries from signatures generated by meta-analyses for 10 drugs were compared to queries from the 260 contrasts used in the meta-analyses.


To use ccmap, the query signature needs to be a named vector of effect size values where the names correspond to uppercase HGNC symbols. If you used crossmeta, proceeed as follows:

library(crossmeta)
library(ccmap)

# microarray data from studies using drug LY-294002
library(lydata)
data_dir <- system.file("extdata", package = "lydata")

# gather all GSEs
gse_names  <- c("GSE9601", "GSE15069", "GSE50841", "GSE34817", "GSE29689")

# load previous crossmeta differential expression analysis
anals <- load_diff(gse_names, data_dir)

# run meta-analysis
es <- es_meta(anals)

# contribute your signature to our public meta-analysis database
# contribute(anals, subject = "LY-294002")

# extract moderated adjusted standardized effect sizes
dprimes <- get_dprimes(es)

# query signature
query_sig <- dprimes$meta

Drug Signatures

Drug signatures were generated using the raw data from the Connectivity Map build 2. The raw data from experiments with a shared platform were norm-exp background corrected, quantile normalized, and log2 transformed (RMA algorithm). After preprocessing, contrasts were specified such that all signatures for each drug were compared to all vehicle treated signatures. Non-treatment related variables (cell-line, drug dose, batch effects, etc.) were discovered using sva and accounted for during differential expression analysis by limma. Finally, moderated t-statistics calculated by limma were used by GeneMeta to calculate moderated unbiased standardised effect sizes.

The final drug signatures are available in the ccdata package.

library(ccdata)

# load drug signatures
data(cmap_es)

Querying Drug Signatures

Genes from query and drug signatures are ranked based on their absolute effect sizes and then classified as up or down regulated based on the sign of their effect sizes. Net overlap (z), defined as the difference between the number of genes regulated in the same and opposite directions, is then calculated as a function of drug and query gene ranks (x and y). The volume under the surface (vos) formed by plotting z as a function of x and y is used for the final drug rankings. This vos metric has the advantage of weighting both query and drug genes according to their extent of differential expression.

top_drugs <- query_drugs(query_sig)

# correctly identifies LY-294002 as best match among drug signatures
# other PI3K inhibitors are also identified among top matching drugs
head(top_drugs)
##     LY-294002     sirolimus 0297417-0002B    wortmannin       dilazep 
##     0.3670635     0.2340646     0.2299864     0.2181248     0.2158558 
##   resveratrol 
##     0.2024713

Drug Combinations

To more closely mimic or reverse a gene expression signature, drug combinations may be promising. For the 1309 drugs in the Connectivity Map build 2, there are 856086 unique two-drug combinations. It is currently unfeasable to assay all these combinations, but their expression profiles can be predicted.

In order to do so, I collected microarray data from GEO where single treatments and their combinations were assayed. In total, 148 studies with 257 treatment combinations were obtained.

Remarkably, simply averaging the expression profiles from the single treatments predicted the direction of differential expression of the combined treatment with 78.96% accuracy. The average expression profiles for all 856086 unique two-drug combinations can be generated and queried as follows:

# query all 856086 combinations (takes ~10 minutes on Intel Core i7-6700)
# top_combos <- query_combos(query_sig)

# query only combinations with LY-294002
top_combos <- query_combos(query_sig, include='LY-294002', ncores=1)

A small improvement to 80.18% acurracy was obtained using machine learning models. To use these models requires 8-10GB of RAM and about 2 hours (Intel Core i7-6700 with the MRO+MKL distribution of R) to predict and query all 856086 unique two-drug combinations. In practice, the drug combinations that most closely mimic or reverse a query signature usually include the top few single drugs. By only predicting drug combinations that include the top few single drugs, prediction times are greatly reduced:

# Times on Intel Core i7-6700 with MRO+MKL
# requires ~8-10GB of RAM

method  <- 'ml'
include <- names(head(top_drugs))

# query all 856086 combinations (~2 hours)
# top_combos <- query_combos(query_sig, method)

# query combinations with top single drugs (~1 minute)
# top_combos <- query_combos(query_sig, method, include)

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ccdata_0.99.4   limma_3.30.0    lydata_0.99.1   ccmap_1.0.0    
## [5] crossmeta_1.0.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7         SMVar_1.3.3         knitr_1.14         
##  [4] magrittr_1.5        BiocGenerics_0.20.0 doParallel_1.0.10  
##  [7] fdrtool_1.2.15      xtable_1.8-2        R6_2.2.0           
## [10] foreach_1.4.3       stringr_1.1.0       tools_3.3.1        
## [13] parallel_3.3.1      Biobase_2.34.0      data.table_1.9.6   
## [16] miniUI_0.1.1        iterators_1.0.8     htmltools_0.3.5    
## [19] yaml_2.1.13         assertthat_0.1      digest_0.6.10      
## [22] tibble_1.2          metaMA_3.1.2        shiny_0.14.1       
## [25] formatR_1.4         codetools_0.2-15    evaluate_0.10      
## [28] mime_0.5            rmarkdown_1.1       stringi_1.1.2      
## [31] compiler_3.3.1      httpuv_1.3.3        chron_2.3-47