Note: the most recent version of this tutorial can be found here and a short overview slide show here.
ChemmineR
is a cheminformatics package for analyzing drug-like small molecule data in R. Its latest version contains functions for efficient processing of large numbers of small molecules, physicochemical/structural property predictions, structural similarity searching, classification and clustering of compound libraries with a wide spectrum of algorithms.
In addition, ChemmineR
offers visualization functions for compound clustering results and chemical structures. The integration of chemoinformatic tools with the R programming environment has many advantages, such as easy access to a wide spectrum of statistical methods, machine learning algorithms and graphic utilities. The first version of this package was published in Cao et al. (2008). Since then many additional utilities and add-on packages have been added to the environment (Figure 2) and many more are under development for future releases (Backman, Cao, and Girke 2011; Wang et al. 2013).
Recently Added Features
Improved SMILES support via new SMIset
object class and SMILES import/export functions
Integration of a subset of OpenBabel functionalities via new ChemmineOB
add-on package (Cao et al. 2008)
Streaming functionality for processing millions of molecules on a laptop
Mismatch tolerant maximum common substructure (MCS) search algorithm
Fast and memory efficient fingerprint search support using atom pair or PubChem fingerprints
The R software for running ChemmineR can be downloaded from CRAN (http://cran.at.r-project.org/). The ChemmineR package can be installed from R with:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("ChemmineR")
library("ChemmineR") # Loads the package
library(help="ChemmineR") # Lists all functions and classes
vignette("ChemmineR") # Opens this PDF manual from R
The following code gives an overview of the most important functionalities provided by ChemmineR
. Copy and paste of the commands into the R console will demonstrate their utilities.
Create Instances of SDFset
class:
data(sdfsample)
sdfset <- sdfsample
sdfset # Returns summary of SDFset
## An instance of "SDFset" with 100 molecules
sdfset[1:4] # Subsetting of object
## An instance of "SDFset" with 4 molecules
sdfset[[1]] # Returns summarized content of one SDF
## An instance of "SDF"
##
## <<header>>
## Molecule_Name Source
## "650001" " -OEChem-07071010512D"
## Comment Counts_Line
## "" " 61 64 0 0 0 0 0 0 0999 V2000"
##
## <<atomblock>>
## C1 C2 C3 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16
## O_1 7.0468 0.0839 0 0 0 0 0 0 0 0 0 0 0 0 0
## O_2 12.2708 1.0492 0 0 0 0 0 0 0 0 0 0 0 0 0
## ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
## H_60 1.8411 -1.5985 0 0 0 0 0 0 0 0 0 0 0 0 0
## H_61 2.6597 -1.2843 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## <<bondblock>>
## C1 C2 C3 C4 C5 C6 C7
## 1 1 16 2 0 0 0 0
## 2 2 23 1 0 0 0 0
## ... ... ... ... ... ... ... ...
## 63 33 60 1 0 0 0 0
## 64 33 61 1 0 0 0 0
##
## <<datablock>> (33 data items)
## PUBCHEM_COMPOUND_CID PUBCHEM_COMPOUND_CANONICALIZED PUBCHEM_CACTVS_COMPLEXITY
## "650001" "1" "700"
## PUBCHEM_CACTVS_HBOND_ACCEPTOR
## "7" "..."
view(sdfset[1:4]) # Returns summarized content of many SDFs, not printed here
as(sdfset[1:4], "list") # Returns complete content of many SDFs, not printed here
An SDFset
is created during the import of an SD file:
sdfset <- read.SDFset("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sdfsample.sdf")
Miscellaneous accessor methods for SDFset
container:
header(sdfset[1:4]) # Not printed here
header(sdfset[[1]])
## Molecule_Name Source
## "650001" " -OEChem-07071010512D"
## Comment Counts_Line
## "" " 61 64 0 0 0 0 0 0 0999 V2000"
atomblock(sdfset[1:4]) # Not printed here
atomblock(sdfset[[1]])[1:4,]
## C1 C2 C3 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16
## O_1 7.0468 0.0839 0 0 0 0 0 0 0 0 0 0 0 0 0
## O_2 12.2708 1.0492 0 0 0 0 0 0 0 0 0 0 0 0 0
## O_3 12.2708 3.1186 0 0 0 0 0 0 0 0 0 0 0 0 0
## O_4 7.9128 2.5839 0 0 0 0 0 0 0 0 0 0 0 0 0
bondblock(sdfset[1:4]) # Not printed here
bondblock(sdfset[[1]])[1:4,]
## C1 C2 C3 C4 C5 C6 C7
## 1 1 16 2 0 0 0 0
## 2 2 23 1 0 0 0 0
## 3 2 27 1 0 0 0 0
## 4 3 25 1 0 0 0 0
datablock(sdfset[1:4]) # Not printed here
datablock(sdfset[[1]])[1:4]
## PUBCHEM_COMPOUND_CID PUBCHEM_COMPOUND_CANONICALIZED PUBCHEM_CACTVS_COMPLEXITY
## "650001" "1" "700"
## PUBCHEM_CACTVS_HBOND_ACCEPTOR
## "7"
Assigning compound IDs and keeping them unique:
cid(sdfset)[1:4] # Returns IDs from SDFset object
## [1] "CMP1" "CMP2" "CMP3" "CMP4"
sdfid(sdfset)[1:4] # Returns IDs from SD file header block
## [1] "650001" "650002" "650003" "650004"
unique_ids <- makeUnique(sdfid(sdfset))
## [1] "No duplicates detected!"
cid(sdfset) <- unique_ids
Converting the data blocks in an SDFset
to a matrix:
blockmatrix <- datablock2ma(datablocklist=datablock(sdfset)) # Converts data block to matrix
numchar <- splitNumChar(blockmatrix=blockmatrix) # Splits to numeric and character matrix
numchar[[1]][1:2,1:2] # Slice of numeric matrix
## PUBCHEM_COMPOUND_CID PUBCHEM_COMPOUND_CANONICALIZED
## 650001 650001 1
## 650002 650002 1
numchar[[2]][1:2,10:11] # Slice of character matrix
## PUBCHEM_MOLECULAR_FORMULA PUBCHEM_OPENEYE_CAN_SMILES
## 650001 "C23H28N4O6" "CC1=CC(=NO1)NC(=O)CCC(=O)N(CC(=O)NC2CCCC2)C3=CC4=C(C=C3)OCCO4"
## 650002 "C18H23N5O3" "CN1C2=C(C(=O)NC1=O)N(C(=N2)NCCCO)CCCC3=CC=CC=C3"
Compute atom frequency matrix, molecular weight and formula:
propma <- data.frame(MF=MF(sdfset), MW=MW(sdfset), atomcountMA(sdfset))
propma[1:4, ]
## MF MW C H N O S F Cl
## 650001 C23H28N4O6 456.4916 23 28 4 6 0 0 0
## 650002 C18H23N5O3 357.4069 18 23 5 3 0 0 0
## 650003 C18H18N4O3S 370.4255 18 18 4 3 1 0 0
## 650004 C21H27N5O5S 461.5346 21 27 5 5 1 0 0
Assign matrix data to data block:
datablock(sdfset) <- propma
datablock(sdfset[1])
## $`650001`
## MF MW C H N O S
## "C23H28N4O6" "456.4916" "23" "28" "4" "6" "0"
## F Cl
## "0" "0"
String searching in SDFset
:
grepSDFset("650001", sdfset, field="datablock", mode="subset") # Returns summary view of matches. Not printed here.
grepSDFset("650001", sdfset, field="datablock", mode="index")
## 1 1 1 1 1 1 1 1 1
## 1 2 3 4 5 6 7 8 9
Export SDFset to SD file:
write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE)
Plot molecule structure of one or many SDFs:
plot(sdfset[1:4], print=FALSE) # Plots structures to R graphics device
sdf.visualize(sdfset[1:4]) # Compound viewing in web browser
Structure similarity searching and clustering:
apset <- sdf2ap(sdfset) # Generate atom pair descriptor database for searching
data(apset) # Load sample apset data provided by library.
cmp.search(apset, apset[1], type=3, cutoff = 0.3, quiet=TRUE) # Search apset database with single compound.
## index cid scores
## 1 1 650001 1.0000000
## 2 96 650102 0.3516643
## 3 67 650072 0.3117569
## 4 88 650094 0.3094629
## 5 15 650015 0.3010753
cmp.cluster(db=apset, cutoff = c(0.65, 0.5), quiet=TRUE)[1:4,] # Binning clustering using variable similarity cutoffs.
##
## sorting result...
## ids CLSZ_0.65 CLID_0.65 CLSZ_0.5 CLID_0.5
## 48 650049 2 48 2 48
## 49 650050 2 48 2 48
## 54 650059 2 54 2 54
## 55 650060 2 54 2 54
ChemmineR
integrates now a subset of cheminformatics functionalities implemented in the OpenBabel C++ library (O’Boyle, Morley, and Hutchison 2008; Cao et al. 2008). These utilities can be accessed by installing the ChemmineOB
package and the OpenBabel software itself. ChemmineR
will automatically detect the availability of ChemmineOB
and make use of the additional utilities. The following lists the functions and methods that make use of OpenBabel. References are included to locate the sections in the manual where the utility and usage of these functions is described.
Structure format interconversions (see Section Format Inter-Conversions)
smiles2sdf
: converts from SMILES to SDF object
sdf2smiles
: converts from SDF to SMILES object
convertFormat
: converts strings between two formats
convertFormatFile
: converts files between two formats. This function can be used to enable ChemmineR to read in any format supported by Open Babel. For example, if you had an SML file you could do:
convertFormatFile("SML","SDF","mycompound.sml","mycompound.sdf")
sdfset=read.SDFset("mycompound.sdf")
propOB
: generates several compound properties. See the man page for a current list of properties computed.
propOB(sdfset[1])
fingerprintOB
: generates fingerprints for compounds. The fingerprint name can be anything supported by OpenBabel. See the man page for a current list.
fingerprintOB(sdfset,"FP2")
smartsSearchOB
: find matches of SMARTS patterns in compounds
#count rotable bonds
smartsSearchOB(sdfset[1:5],"[!$(*#*)&!D1]-!@[!$(*#*)&!D1]",uniqueMatches=FALSE)
exactMassOB
: Compute the monoisotopic (exact) mass of a set of compounds
exactMassOB(sdfset[1:5])
regenerateCoords
: Re-compute the 2D coordinates of a compound using Open Babel. This can sometimes improve the quality of the compounds plot. See also the regenCoords
option of the plot function.
sdfset2 = regenerateCoords(sdfset[1:5])
plot(sdfset[1], regenCoords=TRUE,print=FALSE)
OpenBabel can also be used to plot compounds directly:
openBabelPlot(sdfset[4],regenCoords=TRUE)
generate3DCoords
: Generate 3D coordinates for compounds with only 2D coordinates.
sdf3D = generate3DCoords(sdfset[1])
canonicalize
: Compute a canonicalized atom numbering. This allows compounds with the same molecular structure but different atom numberings to be compared properly.
canonicalSdf= canonicalize(sdfset[1])
canonicalNumbering
: Return a mapping from the original atom numbering to the canonical atom number.
mapping = canonicalNumbering(sdfset[1])
The following list gives an overview of the most important S4 classes, methods and functions available in the ChemmineR package. The help documents of the package provide much more detailed information on each utility. The standard R help documents for these utilities can be accessed with this syntax: ?function\_name
(e.g. ?cid
) and ?class\_name-class
(e.g. ?"SDFset-class"
).
Classes
SDFstr
: intermediate string class to facilitate SD file import; not important for end user
SDF
: container for single molecule imported from an SD file
SDFset
: container for many SDF objects; most important structure container for end user
SMI
: container for a single SMILES string
SMIset
: container for many SMILES strings
Functions/Methods (mainly for SDFset
container, SMIset
should be coerced with smiles2sd
to SDFset
)
Accessor methods for SDF/SDFset
Object slots: cid
, header
, atomblock
, bondblock
, datablock
(sdfid
, datablocktag
)
Summary of SDFset
: view
Matrix conversion of data block: datablock2ma
, splitNumChar
String search in SDFset: grepSDFset
Coerce one class to another
as(..., "...")
works in most cases. For details see R help with ?"SDFset-class"
.Utilities
Atom frequencies: atomcountMA
, atomcount
Molecular weight: MW
Molecular formula: MF
…
Compound structure depictions
R graphics device: plot
, plotStruc
Online: cmp.visualize
Classes
AP
: container for atom pair descriptors of a single molecule
APset
: container for many AP objects; most important structure descriptor container for end user
FP
: container for fingerprint of a single molecule
FPset
: container for fingerprints of many molecules, most important structure descriptor container for end user
Functions/Methods
Create AP/APset
instances
From SDFset
: sdf2ap
From SD file: cmp.parse
Summary of AP/APset
: view
, db.explain
Accessor methods for AP/APset
ap
, cid
Coerce one class to another
as(..., "...")
works in most cases. For details see R help with ?"APset-class"
.Structure Similarity comparisons and Searching
Compute pairwise similarities : cmp.similarity
, fpSim
Search APset database: cmp.search
, fpSim
AP-based Structure Similarity Clustering
Single-linkage binning clustering: cmp.cluster
Visualize clustering result with MDS: cluster.visualize
Size distribution of clusters: cluster.sizestat
Folding
fold
foldCount
numBits
The following gives an overview of the most important import/export functionalities for small molecules provided by ChemmineR
. The given example creates an instance of the SDFset
class using as sample data set the first 100 compounds from this PubChem SD file (SDF): Compound_00650001_00675000.sdf.gz (ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/).
SDFs can be imported with the read.SDFset
function:
sdfset <- read.SDFset("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sdfsample.sdf")
data(sdfsample) # Loads the same SDFset provided by the library
sdfset <- sdfsample
valid <- validSDF(sdfset) # Identifies invalid SDFs in SDFset objects
sdfset <- sdfset[valid] # Removes invalid SDFs, if there are any
Import SD file into SDFstr
container:
sdfstr <- read.SDFstr("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sdfsample.sdf")
Create SDFset
from SDFstr
class:
sdfstr <- as(sdfset, "SDFstr")
sdfstr
## An instance of "SDFstr" with 100 molecules
as(sdfstr, "SDFset")
## An instance of "SDFset" with 100 molecules
The read.SMIset
function imports one or many molecules from a SMILES file and stores them in a SMIset
container. The input file is expected to contain one SMILES string per row with tab-separated compound identifiers at the end of each line. The compound identifiers are optional.
Create sample SMILES file and then import it:
data(smisample); smiset <- smisample
write.SMI(smiset[1:4], file="sub.smi")
smiset <- read.SMIset("sub.smi")
Inspect content of SMIset
:
data(smisample) # Loads the same SMIset provided by the library
smiset <- smisample
smiset
## An instance of "SMIset" with 100 molecules
view(smiset[1:2])
## $`650001`
## An instance of "SMI"
## [1] "O=C(NC1CCCC1)CN(c1cc2OCCOc2cc1)C(=O)CCC(=O)Nc1noc(c1)C"
##
## $`650002`
## An instance of "SMI"
## [1] "O=c1[nH]c(=O)n(c2nc(n(CCCc3ccccc3)c12)NCCCO)C"
Accessor functions:
cid(smiset[1:4])
## [1] "650001" "650002" "650003" "650004"
smi <- as.character(smiset[1:2])
Create SMIset
from named character vector:
as(smi, "SMIset")
## An instance of "SMIset" with 2 molecules
Write objects of classes SDFset/SDFstr/SDF
to SD file:
write.SDF(sdfset[1:4], file="sub.sdf")
Writing customized SDFset
to file containing ChemmineR
signature, IDs from SDFset
and no data block:
write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE, cid=TRUE, db=NULL)
Example for injecting a custom matrix/data frame into the data block of an SDFset
and then writing it to an SD file:
props <- data.frame(MF=MF(sdfset), MW=MW(sdfset), atomcountMA(sdfset))
datablock(sdfset) <- props
write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE, cid=TRUE)
Indirect export via SDFstr
object:
sdf2str(sdf=sdfset[[1]], sig=TRUE, cid=TRUE) # Uses default components
sdf2str(sdf=sdfset[[1]], head=letters[1:4], db=NULL) # Uses custom components for header and data block
Write SDF
, SDFset
or SDFstr
classes to file:
write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE, cid=TRUE, db=NULL)
write.SDF(sdfstr[1:4], file="sub.sdf")
cat(unlist(as(sdfstr[1:4], "list")), file="sub.sdf", sep="")
Write objects of class SMIset
to SMILES file with and without compound identifiers:
data(smisample); smiset <- smisample # Sample data set
write.SMI(smiset[1:4], file="sub.smi", cid=TRUE) write.SMI(smiset[1:4], file="sub.smi", cid=FALSE)
The sdf2smiles
and smiles2sdf
functions provide format interconversion between SMILES strings (Simplified Molecular Input Line Entry Specification) and SDFset
containers.
Convert an SDFset
container to a SMILES character
string:
data(sdfsample);
sdfset <- sdfsample[1]
smiles <- sdf2smiles(sdfset)
smiles
Convert a SMILES character
string to an SDFset
container:
sdf <- smiles2sdf("CC(=O)OC1=CC=CC=C1C(=O)O")
view(sdf)
When the ChemineOB
package is installed these conversions are performed with the OpenBabel Open Source Chemistry Toolbox. Otherwise the functions will fall back to using the ChemMine Tools web service for this operation. The latter will require internet connectivity and is limited to only the first compound given. ChemmineOB
provides access to the compound format conversion functions of OpenBabel. Currently, over 160 formats are supported by OpenBabel. The functions convertFormat
and convertFormatFile
can be used to convert files or strings between any two formats supported by OpenBabel. For example, to convert a SMILES string to an SDF string, one can use the convertFormat
function.
sdfStr <- convertFormat("SMI","SDF","CC(=O)OC1=CC=CC=C1C(=O)O_name")
This will return the given compound as an SDF formatted string. 2D coordinates are also computed and included in the resulting SDF string.
To convert a file with compounds encoded in one format to another format, the convertFormatFile
function can be used instead.
convertFormatFile("SMI","SDF","test.smiles","test.sdf")
To see the whole list of file formats supported by OpenBabel, one can run from the command-line “obabel -L formats”.
The following write.SDFsplit
function allows to split SD Files into any number of smaller SD Files. This can become important when working with very big SD Files. Users should note that this function can output many files, thus one should run it in a dedicated directory!
Create sample SD File with 100 molecules:
write.SDF(sdfset, "test.sdf")
Read in sample SD File. Note: reading file into SDFstr is much faster than into SDFset:
sdfstr <- read.SDFstr("test.sdf")
Run export on SDFstr
object:
write.SDFsplit(x=sdfstr, filetag="myfile", nmol=10) # 'nmol' defines the number of molecules to write to each file
Run export on SDFset
object:
write.SDFsplit(x=sdfset, filetag="myfile", nmol=10)
The sdfStream
function allows to stream through SD Files with millions of molecules without consuming much memory. During this process any set of descriptors, supported by ChemmineR
, can be computed (e.g. atom pairs, molecular properties, etc.), as long as they can be returned in tabular format. In addition to descriptor values, the function returns a line index that gives the start and end positions of each molecule in the source SD File. This line index can be used by the downstream read.SDFindex
function to retrieve specific molecules of interest from the source SD File without reading the entire file into R. The following outlines the typical workflow of this streaming functionality in ChemmineR
.
Create sample SD File with 100 molecules:
write.SDF(sdfset, "test.sdf")
Define descriptor set in a simple function:
desc <- function(sdfset)
cbind(SDFID=sdfid(sdfset),
# datablock2ma(datablocklist=datablock(sdfset)),
MW=MW(sdfset),
groups(sdfset), APFP=desc2fp(x=sdf2ap(sdfset), descnames=1024,
type="character"), AP=sdf2ap(sdfset, type="character"), rings(sdfset,
type="count", upper=6, arom=TRUE) )
Run sdfStream
with desc
function and write results to a file called matrix.xls
:
sdfStream(input="test.sdf", output="matrix.xls", fct=desc, Nlines=1000) # 'Nlines': number of lines to read from input SD File at a time
One can also start reading from a specific line number in the SD file. The following example starts at line number 950. This is useful for restarting and debugging the process. With append=TRUE
the result can be appended to an existing file.
sdfStream(input="test.sdf", output="matrix2.xls", append=FALSE, fct=desc, Nlines=1000, startline=950)
Select molecules meeting certain property criteria from SD File using line index generated by previous sdfStream
step:
indexDF <- read.delim("matrix.xls", row.names=1)[,1:4]
indexDFsub <- indexDF[indexDF$MW < 400, ] # Selects molecules with MW < 400
sdfset <- read.SDFindex(file="test.sdf", index=indexDFsub, type="SDFset") # Collects results in 'SDFset' container
Write results directly to SD file without storing larger numbers of molecules in memory:
read.SDFindex(file="test.sdf", index=indexDFsub, type="file",
outfile="sub.sdf")
Read AP/APFP strings from file into APset
or FP
object:
apset <- read.AP(x="matrix.xls", type="ap", colid="AP")
apfp <- read.AP(x="matrix.xls", type="fp", colid="APFP")
Alternatively, one can provide the AP/APFP strings in a named character vector:
apset <- read.AP(x=sdf2ap(sdfset[1:20], type="character"), type="ap")
fpchar <- desc2fp(sdf2ap(sdfset[1:20]), descnames=1024, type="character")
fpset <- as(fpchar, "FPset")
As an alternative to sdfStream, there is now also an option to store data in an SQL database, which then allows for fast queries and compound retrieval. The default database is SQLite, but any other SQL database should work with some minor modifications to the table definitions, which are stored in schema/compounds.SQLite under the ChemmineR package directory. Compounds are stored in their entirety in the databases so there is no need to keep any original data files.
Users can define their own set of compound features to compute and store when loading new compounds. Each of these features will be stored in its own, indexed table. Searches can then be performed using these features to quickly find specific compounds. Compounds can always be retrieved quickly because of the database index, no need to scan a large compound file. In addition to user defined features, descriptors can also be computed and stored for each compound.
A new database can be created with the initDb
function. This takes either an existing database connection, or a filename. If a filename is given then an SQLite database connection is created. It then ensures that the required tables exist and creates them if not. The connection object is then returned. This function can be called safely on the same connection or database many times and will not delete any data.
The functions loadSdf
and loadSmiles
can be used to load compound data from either a file (both) or an SDFset
(loadSdf
only). The fct
parameter should be a function to extract features from the data. It will be handed an SDFset
generated from the data being loaded. This may be done in batches, so there is no guarantee that the given SDFSset will contain the whole dataset. This function should return a data frame with a column for each feature and a row for each compound given. The order of the final data frame should be the same as that of the SDFset
. The column names will become the feature names. Each of these features will become a new, indexed, table in the database which can be used later to search for compounds.
The descriptors
parameter can be a function which computes descriptors. This function will also be given an SDFset
object, which may be done in batches. It should return a data frame with the following two columns: “descriptor” and “descriptor_type”. The “descriptor” column should contain a string representation of the descriptor, and “descriptor_type” is the type of the descriptor. Our convention for atom pair is “ap” and “fp” for finger print. The order should also be maintained.
When the data has been loaded, loadSdf
will return the compound id numbers of each compound loaded. These compound id numbers are computed by the database and are not extracted from the compound data itself. They can be used to quickly retrieve compounds later.
New features can also be added using this function. However, all compounds must have all features so if new features are added to a new set of compounds, all existing features must be computable by the fct
function given. If new features are detected, all existing compounds will be run through fct
in order to compute the new features for them as well.
For example, if dataset X is loaded with features F1 and F2, and then at a later time we load dataset Y with new feature F3, the fct
function used to load dataset Y must compute and return features F1, F2, and F3. loadSdf
will call fct
with both datasets X and Y so that all features are available for all compounds. If any features are missing an error will be raised. If just new features are being added, but no new compounds, use the addNewFeatures
function.
In this example, we create a new database called “test.db” and load it with data from an SDFset
. We also define fct
to compute the molecular weight, “MW”, and the number of rings and aromatic rings. The rings function actually returns a data frame with columns “RINGS” and “AROMATIC”, which will be merged into the data frame being created which will also contain the “MW” column. These will be the names used for these features and must be used when searching with them. Finally, the new compound ids are returned and stored in the “ids” variable.
data(sdfsample)
#create and initialize a new SQLite database
conn <- initDb("test.db")
## Loading required package: RSQLite
## [1] "createing db"
# load data and compute 3 features: molecular weight, with the MW function,
# and counts for RINGS and AROMATIC, as computed by rings, which
# returns a data frame itself.
ids<-loadSdf(conn,sdfsample, function(sdfset)
data.frame(rings(sdfset,type="count",upper=6, arom=TRUE)) )
## adding new features to existing compounds. This could take a while
## Warning: RSQLite::dbGetException() is deprecated, please switch to using standard error handling via
## tryCatch().
#list features in the database:
print(listFeatures(conn))
## [1] "aromatic" "rings"
By default the loadSdf
/ loadSmiles
functions will detect duplicate compound entries and only insert one of them. This means it is safe to run these functions on the same data set several times and you won’t end up with duplicates. This allows the functions to be re-run in the event that a previous run on a dataset does not complete. Duplicate compounds are detected by compouting the MD5 checksum on the textual representation of it.
It can also update existing compounds with new versions of the same compound. To enable this, set updateByName
to true. It will then consider two compounds with the same name to be the same, even if the definition is different. Then, if the name of a compound exists in the database and it is trying to insert another compound with the same name, it will overwrite the existing compound. It will also drop and re-compute all associated descriptors and features for the new compound (assuming the required functions for descriptor and feature computation are available at the time the update is performed).
It is often the case when loading a large set of compounds that several compounds will produce the same descriptor. ChemmineR
detects this case and only stores one copy of the descriptor for every compound it is for. This feature saves some space and some time for processes that need to be applied to every descriptor. It also highlights a new problem. If you have a descriptor in hand and you want to find a single compound to represent it, which compound should be used if the descriptor was produced from multiple compounds? To address this problem, ChemmineR
allows you to set priority values for each compound-descriptor mapping. Then, in contexts where a single compound is required, the highest priority compound will be chosen. Highest priority corresponds to the lowest numerical value. So mapping with priority 0 would be used first.
To set these priorities there is the function setPriorities
. It takes a function, priorityFn
, for computing these priority values. The setPriorities
function should be run after loading a complete set of data. It will find each group of compounds which share the same descriptor and call the given function, priorityFn
, with the compound_id numbers of the group. This function should then assign priorities to each compound-descriptor pair, however it wishes.
One built in priority function is forestSizePriorities
. This simply prefers compounds with fewer disconnected components over compounds with more dissconnected components.
setPriorities(conn,forestSizePriorities)
Compounds can be searched for using the findCompounds
function. This function takes a connection object, a vector of feature names used in the tests, and finally, a vector of tests that must all pass for a compound to be included in the result set. Each test should be a boolean expression. For example: c("MW <= 400","RINGS \> 3")
would return all compounds with a molecular weight of 400 or less and more than 3 rings, assuming these features exist in the database. The syntax for each test is '\<feature name\> \<SQL operator\> \<value\>'
. If you know SQL you can go beyond this basic syntax. These tests will simply be concatenated together with “AND” in-between them and tacked on the end of a WHERE clause of an SQL statement. So any SQL that will work in that context is fine. The function will return a list of compound ids, the actual compounds can be fetched with getCompounds
. If just the names are needed, the getCompoundNames
function can be used. Compounds can also be fetched by name using the findCompoundsByName
function.
In this example we search for compounds with 0 or 1 rings:
results = findCompounds(conn,"rings",c("rings <= 1"))
message("found ",length(results))
## found 3
If more than one test is given, only compounds which satisfy all tests are found. So if we wanted to further restrict our search to compounds with 2 or more aromatic rings we could do:
results = findCompounds(conn,c("rings","aromatic"),c("rings<=2","aromatic >= 2"))
message("found ",length(results))
## found 10
Remember that any feature used in some test must be listed in the second argument.
String patterns can also be used. So if we wanted to match a substring of the molecular formula, say to find compounds with 21 carbon atoms, we could do:
results = findCompounds(conn,"formula",c("formula like '%C21%'"))
message("found ",length(results))
The “like” operator does a pattern match. There are two wildcard operators that can be used with this operator. The “%” will match any stretch of characters while the “?” will match any single character. So the above expression would match a formula like “C21H28N4O6”.
Valid comparison operators are:
The boolean operators “AND” and “OR” can also be used to create more complex expressions within a single test.
If you just want to fetch every compound in the database you can use the getAllCompoundIds
function:
allIds = getAllCompoundIds(conn)
message("found ",length(allIds))
## found 100
Once you have a list of compound ids from the findCompounds
function, you can either fetch the compound names, or the whole set of compounds as an SDFset.
#get the names of the compounds:
names = getCompoundNames(conn,results)
#if the name order is important set keepOrder=TRUE
#It will take a little longer though
names = getCompoundNames(conn,results,keepOrder=TRUE)
# get the whole set of compounds
compounds = getCompounds(conn,results)
#in order:
compounds = getCompounds(conn,results,keepOrder=TRUE)
#write results directly to a file:
compounds = getCompounds(conn,results,filename=file.path(tempdir(),"results.sdf"))
Using the getCompoundFeatures
function, you can get a set of feature values as a data frame:
getCompoundFeatures(conn,results[1:5],c("rings","aromatic"))
## compound_id rings aromatic
## 1 209 2 2
## 2 216 2 2
## 3 224 2 2
## 4 236 2 2
## 5 240 2 2
#write results directly to a CSV file (reduces memory usage):
getCompoundFeatures(conn,results[1:5],c("rings","aromatic"),filename="features.csv")
#maintain input order in output:
print(results[1:5])
## [1] 209 216 224 236 240
getCompoundFeatures(conn,results[1:5],c("rings","aromatic"),keepOrder=TRUE)
## compound_id rings aromatic
## 209 209 2 2
## 216 216 2 2
## 224 224 2 2
## 236 236 2 2
## 240 240 2 2
We have pre-built SQLite databases for the Drug Bank and DUD datasets. They can be found in the ChemmineDrugs annotation package. Connections to these databases can be fetched from the functions DrugBank
and DUD
to get the corresponding database. Any of the above functions can then be used to query the database.
The DUD dataset was downloaded from here. A description can be found here.
The Drug Bank data set is version 4.1. It can be downloaded here
The following features are included:
The DUD database additionally includes:
Several methods are available to return the different data components of SDF/SDFset
containers in batches. The following examples list the most important ones. To save space their content is not printed in the manual.
view(sdfset[1:4]) # Summary view of several molecules
length(sdfset) # Returns number of molecules
sdfset[[1]] # Returns single molecule from SDFset as SDF object
sdfset[[1]][[2]] # Returns atom block from first compound as matrix
sdfset[[1]][[2]][1:4,]
c(sdfset[1:4], sdfset[5:8]) # Concatenation of several SDFsets
The grepSDFset
function allows string matching/searching on the different data components in SDFset
. By default the function returns a SDF summary of the matching entries. Alternatively, an index of the matches can be returned with the setting mode="index"
.
grepSDFset("650001", sdfset, field="datablock", mode="subset") # To return index, set mode="index")
Utilities to maintain unique compound IDs:
sdfid(sdfset[1:4]) # Retrieves CMP IDs from Molecule Name field in header block.
cid(sdfset[1:4]) # Retrieves CMP IDs from ID slot in SDFset.
unique_ids <- makeUnique(sdfid(sdfset)) # Creates unique IDs by appending a counter to duplicates.
cid(sdfset) <- unique_ids # Assigns uniquified IDs to ID slot
Subsetting by character, index and logical vectors:
view(sdfset[c("650001", "650012")])
view(sdfset[4:1])
mylog <- cid(sdfset)
view(sdfset[mylog])
Accessing SDF/SDFset
components: header, atom, bond and data blocks:
atomblock(sdf); sdf[[2]];
sdf[["atomblock"]] # All three methods return the same component
header(sdfset[1:4])
atomblock(sdfset[1:4])
bondblock(sdfset[1:4])
datablock(sdfset[1:4])
header(sdfset[[1]])
atomblock(sdfset[[1]])
bondblock(sdfset[[1]])
datablock(sdfset[[1]])
Replacement Methods:
sdfset[[1]][[2]][1,1] <- 999
atomblock(sdfset)[1] <- atomblock(sdfset)[2]
datablock(sdfset)[1] <- datablock(sdfset)[2]
Assign matrix data to data block:
datablock(sdfset) <- as.matrix(iris[1:100,])
view(sdfset[1:4])
Class coercions from SDFstr
to list
, SDF
and SDFset
:
as(sdfstr[1:2], "list") as(sdfstr[[1]], "SDF")
as(sdfstr[1:2], "SDFset")
Class coercions from SDF
to SDFstr
, SDFset
, list with SDF sub-components:
sdfcomplist <- as(sdf, "list") sdfcomplist <-
as(sdfset[1:4], "list"); as(sdfcomplist[[1]], "SDF") sdflist <-
as(sdfset[1:4], "SDF"); as(sdflist, "SDFset") as(sdfset[[1]], "SDFstr")
as(sdfset[[1]], "SDFset")
Class coercions from SDFset
to lists with components consisting of SDF or sub-components:
as(sdfset[1:4], "SDF") as(sdfset[1:4], "list") as(sdfset[1:4], "SDFstr")
Several methods and functions are available to compute basic compound descriptors, such as molecular formula (MF), molecular weight (MW), and frequencies of atoms and functional groups. In many of these functions, it is important to set addH=TRUE
in order to include/add hydrogens that are often not specified in an SD file.
propma <- atomcountMA(sdfset, addH=FALSE)
boxplot(propma, col="blue", main="Atom Frequency")
boxplot(rowSums(propma), main="All Atom Frequency")
Data frame provided by library containing atom names, atom symbols, standard atomic weights, group and period numbers:
data(atomprop)
atomprop[1:4,]
## Number Name Symbol Atomic_weight Group Period
## 1 1 hydrogen H 1.007940 1 1
## 2 2 helium He 4.002602 18 1
## 3 3 lithium Li 6.941000 1 2
## 4 4 beryllium Be 9.012182 2 2
Compute MW and formula:
MW(sdfset[1:4], addH=FALSE)
## CMP1 CMP2 CMP3 CMP4
## 456.4916 357.4069 370.4255 461.5346
MF(sdfset[1:4], addH=FALSE)
## CMP1 CMP2 CMP3 CMP4
## "C23H28N4O6" "C18H23N5O3" "C18H18N4O3S" "C21H27N5O5S"
Enumerate functional groups:
groups(sdfset[1:4], groups="fctgroup", type="countMA")
## RNH2 R2NH R3N ROPO3 ROH RCHO RCOR RCOOH RCOOR ROR RCCH RCN
## CMP1 0 2 1 0 0 0 0 0 0 2 0 0
## CMP2 0 2 2 0 1 0 0 0 0 0 0 0
## CMP3 0 1 1 0 1 0 1 0 0 0 0 0
## CMP4 0 1 3 0 0 0 0 0 0 2 0 0
Combine MW, MF, charges, atom counts, functional group counts and ring counts in one data frame:
propma <- data.frame(MF=MF(sdfset, addH=FALSE), MW=MW(sdfset, addH=FALSE),
Ncharges=sapply(bonds(sdfset, type="charge"), length),
atomcountMA(sdfset, addH=FALSE),
groups(sdfset, type="countMA"),
rings(sdfset, upper=6, type="count", arom=TRUE))
propma[1:4,]
## MF MW Ncharges C H N O S F Cl RNH2 R2NH R3N ROPO3 ROH RCHO RCOR RCOOH RCOOR
## CMP1 C23H28N4O6 456.4916 0 23 28 4 6 0 0 0 0 2 1 0 0 0 0 0 0
## CMP2 C18H23N5O3 357.4069 0 18 23 5 3 0 0 0 0 2 2 0 1 0 0 0 0
## CMP3 C18H18N4O3S 370.4255 0 18 18 4 3 1 0 0 0 1 1 0 1 0 1 0 0
## CMP4 C21H27N5O5S 461.5346 0 21 27 5 5 1 0 0 0 1 3 0 0 0 0 0 0
## ROR RCCH RCN RINGS AROMATIC
## CMP1 2 0 0 4 2
## CMP2 0 0 0 3 3
## CMP3 0 0 0 4 3
## CMP4 2 0 0 3 3
The following shows an example for assigning the values stored in a matrix (e.g. property descriptors) to the data block components in an SDFset
. Each matrix row will be assigned to the corresponding slot position in the SDFset
.
datablock(sdfset) <- propma # Works with all SDF components
datablock(sdfset)[1:4]
test <- apply(propma[1:4,], 1, function(x)
data.frame(col=colnames(propma), value=x))
The data blocks in SDFs contain often important annotation information about compounds. The datablock2ma
function returns this information as matrix for all compounds stored in an SDFset
container. The splitNumChar
function can then be used to organize all numeric columns in a numeric matrix
and the character columns in a character matrix
as components of a list
object.
datablocktag(sdfset, tag="PUBCHEM_NIST_INCHI")
datablocktag(sdfset,
tag="PUBCHEM_OPENEYE_CAN_SMILES")
Convert entire data block to matrix:
blockmatrix <- datablock2ma(datablocklist=datablock(sdfset)) # Converts data block to matrix
numchar <- splitNumChar(blockmatrix=blockmatrix) # Splits matrix to numeric matrix and character matrix
numchar[[1]][1:4,]; numchar[[2]][1:4,]
# Splits matrix to numeric matrix and character matrix
Bond matrices provide an efficient data structure for many basic computations on small molecules. The function conMA
creates this data structure from SDF
and SDFset
objects. The resulting bond matrix contains the atom labels in the row/column titles and the bond types in the data part. The labels are defined as follows: 0 is no connection, 1 is a single bond, 2 is a double bond and 3 is a triple bond.
conMA(sdfset[1:2],
exclude=c("H")) # Create bond matrix for first two molecules in sdfset
conMA(sdfset[[1]], exclude=c("H")) # Return bond matrix for first molecule
plot(sdfset[1], atomnum = TRUE, noHbonds=FALSE , no_print_atoms = "", atomcex=0.8) # Plot its structure with atom numbering
rowSums(conMA(sdfset[[1]], exclude=c("H"))) # Return number of non-H bonds for each atom
The function bonds
returns information about the number of bonds, charges and missing hydrogens in SDF
and SDFset
objects. It is used by many other functions (e.g. MW
, MF
, atomcount
, atomcuntMA
and plot
) to correct for missing hydrogens that are often not specified in SD files.
bonds(sdfset[[1]], type="bonds")[1:4,]
## atom Nbondcount Nbondrule charge
## 1 O 2 2 0
## 2 O 2 2 0
## 3 O 2 2 0
## 4 O 2 2 0
bonds(sdfset[1:2], type="charge")
## $CMP1
## NULL
##
## $CMP2
## NULL
bonds(sdfset[1:2], type="addNH")
## CMP1 CMP2
## 0 0
The function rings
identifies all possible rings in one or many molecules (here sdfset[1]
) using the exhaustive ring perception algorithm from Hanser et al. (1996). In addition, the function can return all smallest possible rings as well as aromaticity information.
The following example returns all possible rings in a list
. The argument upper
allows to specify an upper length limit for rings. Choosing smaller length limits will reduce the search space resulting in shortened compute times. Note: each ring is represented by a character vector of atom symbols that are numbered by their position in the atom block of the corresponding SDF/SDFset
object.
ringatoms <- rings(sdfset[1], upper=Inf, type="all", arom=FALSE, inner=FALSE)
For visual inspection, the corresponding compound structure can be plotted with the ring bonds highlighted in color:
atomindex <- as.numeric(gsub(".*_", "", unique(unlist(ringatoms))))
plot(sdfset[1], print=FALSE, colbonds=atomindex)
Alternatively, one can include the atom numbers in the plot:
plot(sdfset[1], print=FALSE, atomnum=TRUE, no_print_atoms="H")