1 Introduction

1.1 Overview

The drugTargetInteractions package provides utilities for identifying drug-target interactions for sets of small molecule and/or gene/protein identifiers (Wu et al. 2006). The required drug-target interaction information is obained from a downloaded SQLite instance of the ChEMBL database (Gaulton et al. 2012; Bento et al. 2014). ChEMBL has been chosen for this purpose, because it provides one of the most comprehensive and best annotatated knowledge resources for drug-target information available in the public domain.

1.2 Install Package

As Bioconductor package drugTargetInteraction can be installed with the BiocManager::install() function.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("drugTargetInteractions")

Alternatively, the package can be installed from GitHub as follows.

devtools::install_github("girke-lab/drugTargetInteractions", build_vignettes=TRUE) # Installs from github

1.3 Load Package and Access Help

To use drugTargetInteractions, the package needs to be loaded in a user’s R session.

library("drugTargetInteractions") # Loads the package

The following commands are useful to list the available help files and open the the vignette of the package.

library(help="drugTargetInteractions") # Lists package info
vignette(topic="drugTargetInteractions", package="drugTargetInteractions") # Opens vignette

2 Working Environment

2.1 Required Files and Directories

The drugTargetInteractions package uses a downloaded SQLite instance of the ChEMBL database. The following code in this vignette uses a slimmed down toy version of this database that is small enough to be included in this package for demonstrating the usage of its functions. For real drug-target analysis work, it is important that users download and uncompress a recent version of the ChEMBL SQLite database from here, and then replace the path assigned to chembldb below with the path to the full version of the ChEMBL database they have downloaded to their system. Since the SQLite database from ChEMBL can be used by this package as is, creating a copy of the ChEMBL SQLite database on Bioconductor’s AnnotationHub is not necessary at this point. This way users can always use the latest or historical versions of ChEMBL without the need of maintaining a mirror instance.

The following genConfig function call creates a list containing the paths to input and output directories used by the sample code introduced in this vignette. For real analyses, users want to customize these paths to match the environment on their system. Usually, this means all paths generated by the system.file function need to be changed, as those are specific to work with the test data of a package (e.g. toy ChEMBL SQLite instance).

chembldb <- system.file("extdata", "chembl_sample.db", package="drugTargetInteractions")
resultsPath <- system.file("extdata", "results", package="drugTargetInteractions")
config <- genConfig(chemblDbPath=chembldb, resultsPath=resultsPath)

In addition, a lookup table is downloaded on the fly from UniChem (see web page and corresponding ftp site. This lookup table is used by drugTargetInteractions to translate compound identifiers across different drug databases. Currently, this includes three types of compound identifiers: DrugBank, PubChem, and ChEBI.

downloadUniChem(config=config)
cmpIdMapping(config=config)

3 Produce Results Quickly

Users mainly interested in generating analysis results can skip the technical details in the following sections and continue with the section entitled Workflow to Run Everything.

4 Retrieve UniProt IDs

The following returns for a set of query IDs (e.g. ENSEMBL gene IDs) the corresponding UniProt IDs based on a stict ID matching as well as a more relaxed sequence similarity-based approach. The latter sequence similarity associations are obtained with the getUniprotIDs or the getParalogs functions using UniProt’s UNIREF cluster or BioMart’s paralog annotations, respectively.

4.1 UniProt’s UNIREF Clusters

The UniProt.ws package is used to to return for a set of query IDs (here Ensembl gene IDs) the corresponding UniProt IDs based on two independent approaches: ID mappings (IDMs) and sequence similarity nearest neighbors (SSNNs) using UNIREF clusters. The latter have been generated by UniProt with the MMSeqs2 and Linclust algorithms (Steinegger and Söding 2017; Steinegger and Soeding 2018). Additional details on the UNIREF clusters are available here. The organism, query ID type and sequence similarity level can be selected under the taxId, kt and seq_cluster arguments, respectively. The seq_cluster argument can be assigned one of: UNIREF100, UNIREF90 or UNIREF50. The result is a list with two data.frames. The first one is based on IDMs and the second one on SSNNs.

keys <- c("ENSG00000145700", "ENSG00000135441", "ENSG00000120071")
res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=keys, seq_cluster="UNIREF90") 

The following shows the first data.frame containing the ID mapping results.

library(DT)
datatable(res_list90[[1]], options = list(scrollX=TRUE, scrollY="600px", autoWidth = TRUE))

The following shows how to return the dimensions of the two data.frames and how to obtain the UniProt IDs as character vectors required for the downstream analysis steps.

sapply(res_list90, dim, simplify=FALSE)
## $IDM
## [1] 18  8
## 
## $SSNN
## [1] 22  8
sapply(names(res_list90), function(x) unique(na.omit(res_list90[[x]]$ID)))
## $IDM
##  [1] "A0A669KAY2" "D6RJB7"     "Q8N7Z5"     "F8VP73"     "F8W606"     "G8JLQ3"     "H0YHA4"    
##  [8] "P78537"     "A0A0G2JQP8" "A0A1W2PPV8" "A0A1W2PQT4" "A0A1W2PRA9" "A0A1W2PRB5" "A0A1W2PRR3"
## [15] "A0A1W2PS83" "A0A3B3IT55" "I3L233"     "Q7Z3B3"    
## 
## $SSNN
##  [1] "A0A669KAY2" "D6RJB7"     "Q8N7Z5"     "A0A087WSV2" "F8VP73"     "F8W606"     "G8JLQ3"    
##  [8] "F8VNQ1"     "H0YHA4"     "P78537"     "A0A1W2PPV8" "A0A1W2PQT4" "A0A1W2PRA9" "A0A1W2PRB5"
## [15] "A0A1W2PRR3" "A0A1W2PS83" "A0A3B3IT55" "A0A024R9Y2" "A0A0G2JNT7" "A0A0G2JQF4" "I3L233"    
## [22] "Q7Z3B3"

4.2 BioMart’s Paralogs

The following obtains via biomaRt for a set of query genes the corresponding UniProt IDs as well as their paralogs. The query genes can be Gene Names or ENSEMBL Gene IDs from H. sapiens. The result is similar to IDMs and SSNNs from the getUniprotIDs function, but instead of UNIREF clusters, biomaRt’s paralogs are used to obtain SSNNs.

queryBy <- list(molType="gene", idType="external_gene_name", ids=c("ANKRD31", "BLOC1S1", "KANSL1"))
queryBy <- list(molType="gene", idType="ensembl_gene_id", ids=c("ENSG00000145700", "ENSG00000135441", "ENSG00000120071"))
res_list <- getParalogs(queryBy)

The following shows the first data.frame containing the ID mapping results.

library(DT)
datatable(res_list[[1]], options = list(scrollX = TRUE, scrollY="400px", autoWidth = TRUE))

The following shows how to return the dimensions of the two data.frames and how to obtain the UniProt IDs as character vectors required for the downstream analysis steps.

sapply(res_list, dim, simplify=FALSE)
## $IDM
## [1] 30  5
## 
## $SSNN
## [1] 33  7
sapply(names(res_list), function(x) unique(na.omit(res_list[[x]]$ID_up_sp)))
## $IDM
## [1] "Q7Z3B3" "P78537" "Q8N7Z5"
## 
## $SSNN
## [1] "Q7Z3B3" "A0AUZ9" "P78537" "Q8N7Z5" "Q5JPF3" "A6QL64" "Q8N2N9"

5 Query Drug-Target Annotations

The drugTargetAnnot function returns for a set of compound or gene/protein IDs the corresponding known drug-target annotation data available in ChEMBL. A related function called getDrugTarget is described in the subsequent subsection. This method generates very similar results, but uses internally pre-computed annotation summary tables which is less flexible than the usage of pure SQL statements.

5.1 Using drugTargetAnnot

The drugTargetAnnot function queries the ChEMBL database with SQL statements without depending on pre-computed annotation tables.

5.1.1 Query with Compound IDs

queryBy <- list(molType="cmp", idType="chembl_id", ids=c("CHEMBL17", "CHEMBL19", "CHEMBL1201117", "CHEMBL25", "nomatch", "CHEMBL1742471"))
qresult1 <- drugTargetAnnot(queryBy, config=config)
library(DT)
datatable(qresult1, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))

5.1.2 Query with Protein IDs

queryBy <- list(molType="protein", idType="UniProt_ID", ids=c("P43166", "P00915"))
qresult2 <- drugTargetAnnot(queryBy, config=config)
library(DT)
datatable(qresult2, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))

5.1.3 Query with Gene IDs

The following returns drug-target annotations for a set of query Ensembl gene IDs. For this the Ensembl gene IDs are translated into UniProt IDs using both the IDM and SSNN approaches.

keys <- c("ENSG00000120088", "ENSG00000135441", "ENSG00000120071")
res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=keys, seq_cluster="UNIREF90") 
id_list <- sapply(names(res_list90), function(x) unique(na.omit(res_list90[[x]]$ID)))

Next, drug-target annotations are returned for the Uniprot IDs with the IDM or SSNN methods. The following example uses the UniProt IDs of the SSNN method. Note, to include the upstream Ensembl gene IDs in the final result table, the below Ensembl ID collapse step via a tapply is necessary since occasionally UniProt IDs are assigned to several Ensembl gene IDs (e.g. recent gene duplications).

queryBy <- list(molType="protein", idType="UniProt_ID", ids=id_list[[2]])
qresultSSNN <- drugTargetAnnot( queryBy, config=config)
ensidsSSNN <- tapply(res_list90[[2]]$ENSEMBL, res_list90[[2]]$ID, paste, collapse=", ") 
qresultSSNN <- data.frame(Ensembl_IDs=ensidsSSNN[as.character(qresultSSNN$QueryIDs)], qresultSSNN)
library(DT)
datatable(qresultSSNN, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))

5.2 Using getDrugTarget

The getDrugTarget function generates similar results as drugTargetAnnot, but depends on a pre-computed query table (here drugTargetAnnot.xls).

5.2.1 Query with Compound IDs

id_mapping <- c(chembl="chembl_id", pubchem="PubChem_ID", uniprot="UniProt_ID", drugbank="DrugBank_ID")
queryBy <- list(molType="cmp", idType="pubchem", ids=c("2244", "65869", "2244"))
queryBy <- list(molType="protein", idType="uniprot", ids=c("P43166", "P00915", "P43166"))
queryBy <- list(molType="cmp", idType="drugbank", ids=c("DB00945", "DB01202"))
#qresult3 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17,39,46:53),config=config)
qresult3 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17),config=config)
library(DT)
datatable(qresult3, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))

5.2.2 Query with Protein IDs

queryBy <- list(molType="protein", idType="chembl", ids=c("CHEMBL25", "nomatch", "CHEMBL1742471"))
#qresult4 <- getDrugTarget(queryBy=queryBy, id_mapping, columns=c(1,5,8,16,17,39,46:52),config=config) 
qresult4 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17),config=config) 
library(DT)
datatable(qresult4, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))

6 Query Bioassay Data

The drugTargetBioactivity function returns for a set of compound or gene/protein IDs the corresponding bioassay data available in ChEMBL.

6.1 Query with Compound IDs

Example query for compounds IDs.

queryBy <- list(molType="cmp", idType="DrugBank_ID", ids=c("DB00945", "DB00316", "DB01050"))
qresultBAcmp <- drugTargetBioactivity(queryBy, config=config)
library(DT)
datatable(qresultBAcmp, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))

6.2 Query with Protein IDs

Example query for protein IDs. Note, the Ensembl gene to UniProt ID mappings are derived from above and stored in the named character vector ensidsSSNN.

queryBy <- list(molType="protein", idType="uniprot", ids=id_list[[1]])                                                                                                             
qresultBApep <- drugTargetBioactivity(queryBy, config=config)                                                                                       
qresultBApep <- data.frame(Ensembl_IDs=ensidsSSNN[as.character(qresultBApep$UniProt_ID)], qresultBApep)
library(DT)
datatable(qresultBApep, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))

7 Workflow to Run Everything

This section explains how to run all of the above drug-target interaction analysis steps with a few convenience meta functions. Users mainly interested in generating analysis results quickly can focus on this section only.

7.1 ID mapping

The getSymEnsUp function returns for a query of gene or protein IDs a mapping table containing: ENSEMBL Gene IDs, Gene Names/Symbols, UniProt IDs and ENSEMBL Protein IDs. Internally, the function uses the ensembldb package. Its results are returned in a list where the first slot contains the ID mapping table, while the subsequent slots include the corresponding named character vectors: ens_gene_id, up_ens_id, and up_gene_id. Currently, the following query IDs are supported by getSymEnsUp: GENE_NAME, ENSEMBL_GENE_ID and UNIPROT_ID.

7.1.1 Query with Gene Names

gene_name <- c("CA7", "CFTR")
idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=gene_name, idtype="GENE_NAME")
ens_gene_id <- idMap$ens_gene_id
ens_gene_id
## ENSG00000168748 ENSG00000001626 
##           "CA7"          "CFTR"

7.1.2 Query with ENSEBML Gene IDs

ensembl_gene_id <- c("ENSG00000001626", "ENSG00000168748")
idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=ensembl_gene_id, idtype="ENSEMBL_GENE_ID")
ens_gene_id <- idMap$ens_gene_id

7.1.3 Query with UniProt IDs

uniprot_id <- c("P43166", "P13569") 
idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=uniprot_id, idtype="UNIPROT_ID")
ens_gene_id <- idMap$ens_gene_id

7.2 Retrieve UniProt IDs

The perfect match and nearest neighbor UniProt IDs can be obtained from UniProt’s UNIREF cluster or BioMart’s paralog annotations.

7.2.1 UNIREF Cluster

The corresponding IDM and SSNN UniProt IDs for the above ENSEMBL gene IDs are obtained with the getUniprotIDs function. This step is slow since the queries have to be performed with chunksize=1 in order to reliably track the query ENSEMBL gene ID information in the results.

res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=names(ens_gene_id), seq_cluster="UNIREF90", chunksize=1)
sapply(res_list90, dim)

7.2.2 BioMart Parlogs

Here the corresponding perfect match (IDM) and paralog (SSNN) UniProt IDs for the above ENSEMBL gene IDs are obtained with the getParalogs function. The latter is much faster than getUniprotIDs, and also covers wider evolutionary distances. Thus, it may be the preferred methods for many use cases.

queryBy <- list(molType="gene", idType="ensembl_gene_id", ids=names(ens_gene_id))
res_list <- getParalogs(queryBy)
sapply(res_list, dim)
##      IDM SSNN
## [1,]  13  127
## [2,]   5    7

7.3 Drug-Target Data

Both drug-target annotation and bioassay data are obtained with a meta function called runDrugTarget_Annot_Bioassay that internally uses the main processing functions drugTargetAnnot and drugTargetBioactivity. It organizes the result in a list with the annotation and bioassay data (data.frames) in the first and second slot, respectively. Importantly, the results from the IDM and SSNN UniProt IDs are combined in a single table, where duplicated rows have been removed. To track in the result table, which method was used for obtaining UniProt IDs, an IDM_Mapping_Type column has been added. Note, the gene IDs in the SSNN rows have the string Query_ prepended to indicate that they are not necessarily the genes encoding the SSNN UniProt proteins listed in the corresponding rows. Instead they are the genes encoding the query proteins used for searching for SSNNs.

drug_target_list <- runDrugTarget_Annot_Bioassay(res_list=res_list, up_col_id="ID_up_sp", ens_gene_id, config=config) 
sapply(drug_target_list, dim)
##      Annotation Bioassay
## [1,]         55       35
## [2,]         18       17

View content of annotation result slot:

datatable(drug_target_list$Annotation, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))

View content of bioassay result slot (restricted to first 500 rows):

datatable(drug_target_list$Bioassay[1:500,], options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))

7.4 Drug-Target Frequency

The following generates a summary table containing drug-target frequency information.

df <- drug_target_list$Annotation
df[,"GeneName"] <- gsub("Query_", "", as.character(df$GeneName))
stats <- tapply(df$CHEMBL_CMP_ID, as.factor(df$GeneName), function(x) unique(x))
stats <- sapply(names(stats), function(x) stats[[x]][nchar(stats[[x]]) > 0])
stats <- sapply(names(stats), function(x) stats[[x]][!is.na(stats[[x]])])
statsDF <- data.frame(GeneNames=names(stats), Drugs=sapply(stats, paste, collapse=", "), N_Drugs=sapply(stats, length))

Print drug-target frequency table.

datatable(statsDF, options = list(scrollX = TRUE, scrollY="150px", autoWidth = TRUE))

7.5 Write Results to Tabular Files

Both the annotation and bioassay data of the drug_target_list object can be exported to separate tabular files as follows.

write.table(drug_target_list$Annotation, "DrugTargetAnnotation.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")
write.table(drug_target_list$Bioassay, "DrugTargetBioassay.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")
write.table(statDF, "statDF.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")

8 Session Info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB             
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] EnsDb.Hsapiens.v86_2.99.0    ensembldb_2.16.4             AnnotationFilter_1.16.0     
##  [4] GenomicFeatures_1.44.2       AnnotationDbi_1.54.1         Biobase_2.52.0              
##  [7] GenomicRanges_1.44.0         GenomeInfoDb_1.28.2          IRanges_2.26.0              
## [10] S4Vectors_0.30.0             BiocGenerics_0.38.0          DT_0.18                     
## [13] drugTargetInteractions_1.0.2 BiocStyle_2.20.2            
## 
## loaded via a namespace (and not attached):
##  [1] ProtGenerics_1.24.0         bitops_1.0-7                matrixStats_0.60.1         
##  [4] bit64_4.0.5                 filelock_1.0.2              progress_1.2.2             
##  [7] httr_1.4.2                  tools_4.1.1                 bslib_0.2.5.1              
## [10] utf8_1.2.2                  R6_2.5.1                    DBI_1.1.1                  
## [13] lazyeval_0.2.2              withr_2.4.2                 tidyselect_1.1.1           
## [16] prettyunits_1.1.1           bit_4.0.4                   curl_4.3.2                 
## [19] compiler_4.1.1              xml2_1.3.2                  DelayedArray_0.18.0        
## [22] rtracklayer_1.52.1          bookdown_0.23               sass_0.4.0                 
## [25] rappdirs_0.3.3              stringr_1.4.0               digest_0.6.27              
## [28] Rsamtools_2.8.0             rmarkdown_2.10              XVector_0.32.0             
## [31] pkgconfig_2.0.3             htmltools_0.5.2             MatrixGenerics_1.4.3       
## [34] dbplyr_2.1.1                fastmap_1.1.0               htmlwidgets_1.5.3          
## [37] rlang_0.4.11                RSQLite_2.2.8               jquerylib_0.1.4            
## [40] BiocIO_1.2.0                generics_0.1.0              jsonlite_1.7.2             
## [43] crosstalk_1.1.1             BiocParallel_1.26.2         dplyr_1.0.7                
## [46] RCurl_1.98-1.4              magrittr_2.0.1              GenomeInfoDbData_1.2.6     
## [49] Matrix_1.3-4                Rcpp_1.0.7                  fansi_0.5.0                
## [52] lifecycle_1.0.0             stringi_1.7.4               yaml_2.2.1                 
## [55] UniProt.ws_2.32.0           SummarizedExperiment_1.22.0 zlibbioc_1.38.0            
## [58] BiocFileCache_2.0.0         grid_4.1.1                  blob_1.2.2                 
## [61] crayon_1.4.1                lattice_0.20-44             Biostrings_2.60.2          
## [64] hms_1.1.0                   KEGGREST_1.32.0             knitr_1.33                 
## [67] pillar_1.6.2                rjson_0.2.20                codetools_0.2-18           
## [70] biomaRt_2.48.3              XML_3.99-0.7                glue_1.4.2                 
## [73] evaluate_0.14               BiocManager_1.30.16         png_0.1-7                  
## [76] vctrs_0.3.8                 purrr_0.3.4                 assertthat_0.2.1           
## [79] cachem_1.0.6                xfun_0.25                   restfulr_0.0.13            
## [82] tibble_3.1.4                GenomicAlignments_1.28.0    memoise_2.0.0              
## [85] ellipsis_0.3.2

References

Bento, A Patrı́cia, Anna Gaulton, Anne Hersey, Louisa J Bellis, Jon Chambers, Mark Davies, Felix A Krüger, et al. 2014. “The ChEMBL bioactivity database: an update.” Nucleic Acids Res. 42 (Database issue): D1083–90. https://doi.org/10.1093/nar/gkt1031.

Gaulton, Anna, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, et al. 2012. “ChEMBL: a large-scale bioactivity database for drug discovery.” Nucleic Acids Res. 40 (Database issue): D1100–7. https://doi.org/10.1093/nar/gkr777.

Steinegger, Martin, and Johannes Söding. 2017. “MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.” Nat. Biotechnol. 35 (11): 1026–8. https://doi.org/10.1038/nbt.3988.

Steinegger, Martin, and Johannes Soeding. 2018. “Clustering huge protein sequence sets in linear time.” Nat. Commun. 9 (1): 2542. https://doi.org/10.1038/s41467-018-04964-5.

Wu, C H, R Apweiler, A Bairoch, D A Natale, W C Barker, B Boeckmann, S Ferro, et al. 2006. “The Universal Protein Resource (UniProt): an expanding universe of protein information.” Nucleic Acids Res. 34 (Database issue): D187–D191. https://doi.org/10.1093/nar/gkj161.