1 About the template

This section provides general description and how to use this cheminformatics workflow. In the actual analysis report, this section is usually removed.

This cheminformatics workflow template is based on the ChemmineR package and should be downloaded from Bioconductor before running the workflow. This template is a workflow that does:

Loads molecules from SDF format.
Help users visualize specified molecules.
Convert molecules to AP and FP format.
Compute the similarity and distance between molecules
Plot the distance matrix.

There is no other command-line software required in this workflow. All are written in R (Linewise) steps.

2 Introduction

Users want to provide here background information about the design of their cheminformatics project.

This report describes the analysis of a cheminformatics project studying drug …

2.1 Experimental design

Typically, users want to specify here all information relevant for the analysis of their Cheminformatics study. This includes detailed descriptions of files, experimental design, reference genome, gene annotations, etc.

3 Workflow environment

systemPipeR workflows can be designed and built from start to finish with a single command, importing from an R Markdown file or stepwise in interactive mode from the R console.

This tutorial will demonstrate how to build the workflow in an interactive mode, appending each step. The workflow is constructed by connecting each step via appendStep method. Each SYSargsList instance contains instructions needed for processing a set of input files with a specific command-line or R software and the paths to the corresponding outfiles generated by a particular tool/step.

To create a Workflow within systemPipeR, we can start by defining an empty container and checking the directory structure:

library(systemPipeR)
sal <- SPRproject()
sal

3.1 Load packages

This is an empty template that contains only one demo step. Refer to our website for how to add more steps. If you prefer a more enriched template, read this page for other pre-configured templates.

cat(crayon::blue$bold("To use this workflow, following R packages are expected:\n"))
cat(c("'ChemmineR", "ggplot2", "tibble", "readr", "ggpubr", "gplots'\n"),
    sep = "', '")
### pre-end
appendStep(sal) <- LineWise(code = {
    library(systemPipeR)
    library(ChemmineR)
}, step_name = "load_packages")

3.2 Load SDF dataset

Molecules can be loaded or downloaded. This example dataset has 100 molecules.

# Here, the dataset is downloaded. If you already have the
# data locally, change URL to local path.
appendStep(sal) <- LineWise(code = {
    sdfset <- read.SDFset("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sdfsample.sdf")
    # rename molecule IDs by IDs in the header. If your
    # molecules' header does not have ID or not unique,
    # remove following code and use the default IDs
    cid(sdfset) <- makeUnique(sdfid(sdfset))
}, step_name = "load_data", dependency = "load_packages")

3.3 Visualize molecules

appendStep(sal) <- LineWise(code = {
    png("results/mols_plot.png", 700, 600)
    # Here only first 4 are plotted. Please choose the ones
    # you want to plot.
    ChemmineR::plot(sdfset[1:4])
    dev.off()
}, step_name = "vis_mol", dependency = "load_data", run_step = "optional")

3.4 Basic molecule inforamtion

Compute some basic molecule information and store to file, such as atom frequency matrix, molecular weight and formula.

appendStep(sal) <- LineWise(code = {
    propma <- data.frame(MF = MF(sdfset), MW = MW(sdfset), atomcountMA(sdfset))
    readr::write_csv(propma, "results/basic_mol_info.csv")
}, step_name = "basic_mol_info", dependency = "load_data", run_step = "optional")

The information can be visualized, for example, a boxplot of atom frequency.

appendStep(sal) <- LineWise(code = {
    png("results/atom_req.png", 700, 700)
    boxplot(propma[, 3:ncol(propma)], col = "#6cabfa", main = "Atom Frequency")
    dev.off()
}, step_name = "mol_info_plot", dependency = "basic_mol_info",
    run_step = "optional")

3.5 Convert SDF to AP/FP

Tanimoto similarity requires molecules to be converted into searchable format. In this workflow, either atom pair (AP) or fingerprint (FP) should be used as input. This step convert SDF molecules to these formats.

appendStep(sal) <- LineWise(code = {
    apset <- sdf2ap(sdfset)
    fpset <- desc2fp(apset, descnames = 1024, type = "FPset")
    # save them to a file so they can be loaded directly
    # next time
    readr::write_rds(apset, "results/apset.rds")
    readr::write_rds(fpset, "results/fpset.rds")
}, step_name = "apfp_convert", dependency = "load_data")

3.6 Remove duplicates

When molecules are converted to AP/FP, there is chance some very similar molecules end up with idenitical AP/FP, for example some stereoisomers. This is the shortness of these standardization methods. For the downstream clustering part, we need to remove the AP/FP duplicated ones. In this workflow, FPset is used for similarity and clustering computation, FP deduplication is demonstrated here. It will be very similar for AP.

appendStep(sal) <- LineWise(code = {
    fpset <- fpset[which(!cmp.duplicated(apset))]
}, step_name = "fp_dedup", dependency = "apfp_convert")

3.7 Similarity compute

In this step, the pairwise similarity is computed.

appendStep(sal) <- LineWise(code = {
    # If your set is very large, compute a small portion is
    # a better idea, e.g.  sapply(cid(fpset)[1:10], ...
    simMAfp <- sapply(cid(fpset), function(x) fpSim(x = fpset[x],
        fpset, sorted = FALSE))
}, step_name = "fp_similarity", dependency = "fp_dedup")

3.8 Hierarchical clustering

After the similarity computation, we can do some clustering. In this step, hierarchical clustering (hclust) is performed.

appendStep(sal) <- LineWise(code = {
    hc <- hclust(as.dist(1 - simMAfp))
    png("results/hclust.png", 1000, 700)
    plot(as.dendrogram(hc), edgePar = list(col = 4, lwd = 2),
        horiz = TRUE)
    dev.off()
    # to see the tree groupings, one need to cut the tree,
    # for example, by height of 0.9
    tree_cut <- cutree(hc, h = 0.9)
    grouping <- tibble::tibble(cid = names(tree_cut), group_id = tree_cut)
    readr::write_csv(grouping, "results/hclust_grouping.csv")
}, step_name = "hclust", dependency = "fp_similarity", run_step = "optional")

With the group information, directly plotting the hclust does not look nice. We can use other methods to visulize the grouping. Here, PCA is used.

appendStep(sal) <- LineWise(code = {
    library(magrittr)
    library(ggplot2)
    # simMAfp is already 0-1 value, no need to normalize
    # again
    mol_pca <- princomp(simMAfp)
    # find the variance
    mol_pca_var <- mol_pca$sdev[1:2]^2/sum(mol_pca$sdev^2)
    # plot
    png("results/mol_pca.png", 650, 700)
    tibble::tibble(PC1 = mol_pca$loadings[, 1], PC2 = mol_pca$loadings[,
        2], group_id = as.factor(grouping$group_id)) %>%
        # here only group labels are used to color, if you
        # have experimental conditions or drug types, the
        # coloring or shaping will be more meaningful.
    ggplot(aes(x = PC1, y = PC2, color = group_id)) + geom_point(size = 2) +
        stat_ellipse() + labs(x = paste0("PC1 ", round(mol_pca_var[1],
        3) * 100, "%"), y = paste0("PC1 ", round(mol_pca_var[2],
        3) * 100, "%")) + ggpubr::theme_pubr(base_size = 16) +
        scale_color_brewer(palette = "Set2")
    dev.off()
}, step_name = "PCA", dependency = "hclust", run_step = "optional")

3.9 Heatmap

Alternatively, we can plot a heatmap to see the similarities between molecules.

appendStep(sal) <- LineWise(code = {
    library(gplots)
    png("results/mol_heatmap.png", 700, 700)
    heatmap.2(simMAfp, Rowv = as.dendrogram(hc), Colv = as.dendrogram(hc),
        col = colorpanel(40, "darkblue", "yellow", "white"),
        density.info = "none", trace = "none")
    dev.off()
}, step_name = "heatmap", dependency = "fp_similarity", run_step = "optional")

3.10 Workflow session

appendStep(sal) <- LineWise(code = {
    sessionInfo()
}, step_name = "wf_session", dependency = "heatmap")

4 Manage the workflow

For running the workflow, runWF function will execute all the steps store in the workflow container. The execution will be on a single machine without submitting to a queuing system of a computer cluster.

sal <- runWF(sal)

To use complex workflow control options, such as parallelization, subsetting samples, selecting steps, read the documents on our website.
Explore other details of the workflow object.
Create logs and reports.
Visualize the workflow.

5 About the workflow

5.1 Tools used

To check command-line tools used in this workflow, use listCmdTools, and use listCmdModules to check if you have a modular system.

The following code will print out tools required in your custom SPR project in the report. In case you are running the workflow for the first and do not have a project yet, or you just want to browser this workflow, following code displays the tools required by default.

if (file.exists(file.path(".SPRproject", "SYSargsList.yml"))) {
    local({
        sal <- systemPipeR::SPRproject(resume = TRUE)
        systemPipeR::listCmdTools(sal)
        systemPipeR::listCmdModules(sal)
    })
} else {
    cat(crayon::blue$bold("Tools and modules required by this workflow are:\n"))
    cat(c("No other tool is required"), sep = "\n")
}

## Tools and modules required by this workflow are:
## No other tool is required

5.2 Session Info

This is the session information for rendering this report. To access the session information of workflow running, check HTML report of renderLogs.

sessionInfo()

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets 
## [6] methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.30.0
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.33       R6_2.5.1           
##  [3] codetools_0.2-19    bookdown_0.36      
##  [5] fastmap_1.1.1       xfun_0.40          
##  [7] cachem_1.0.8        knitr_1.44         
##  [9] htmltools_0.5.6.1   rmarkdown_2.25     
## [11] cli_3.6.1           sass_0.4.7         
## [13] jquerylib_0.1.4     compiler_4.3.1     
## [15] tools_4.3.1         evaluate_0.22      
## [17] bslib_0.5.1         yaml_2.3.7         
## [19] formatR_1.14        BiocManager_1.30.22
## [21] crayon_1.5.2        jsonlite_1.8.7     
## [23] rlang_1.1.1

Cheminformatics Drug Similarity

Last update: 26 October, 2023

Package