sdfStream {ChemmineR} | R Documentation |
Streaming function to compute descriptors for large SD Files without consuming much memory. In addition to descriptor values, it returns a line index that defines the positions of each molecule in the source SD File. This line index can be used by the read.SDFindex function to retrieve specific compounds of interest from large SD Files without reading the entire file into memory.
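To make the idea of a line index concrete, the following base R sketch derives one from the text of a tiny SD file by locating the "$$$$" record terminators. It is an illustration only and not ChemmineR's implementation; the sample records and the column names start and end are assumptions made for this example.

```r
## Hypothetical two-record SD file content (placeholder records, not real molecules)
sdf_lines <- c(
  "mol_A", "  header", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END", "$$$$",
  "mol_B", "  header", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
  "> <NOTE>", "x", "", "$$$$"
)

## Every record ends with a line that consists of "$$$$"
ends <- which(sdf_lines == "$$$$")
## A record starts on line 1 or on the line after the previous terminator
starts <- c(1L, head(ends, -1L) + 1L)

## Assumed column names; the first line of each record holds the molecule name
line_index <- data.frame(
  SDFID = sdf_lines[starts],
  start = starts,
  end = ends
)
print(line_index)
```

With such an index, a single molecule can be located by its line range alone, which is what makes selective retrieval from a large file possible.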
sdfStream(input, output, append=FALSE, fct, Nlines = 10000, startline=1, restartNlines=10000, silent = FALSE, ...)
input |
file name of input SD file |
output |
file name of tabular descriptor file |
append |
if |
fct |
Function to select descriptor sets; any combination of descriptors, supported by |
Nlines |
Number of lines to read from input SD File at a time; the memory consumption will be proportional to this value. |
startline |
For restarting sdfStream at specific line assigned to |
restartNlines |
Number of lines to parse when |
silent |
if |
... |
Arguments to be passed to/from other methods. |
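The role of Nlines can be illustrated with a base R sketch that reads a file in fixed-size chunks through a connection and carries incomplete records over to the next chunk. This is one possible way to stream an SD file under the assumptions stated in the comments, not the code sdfStream actually runs; the helper record(), the variables carry and offset, and the column names are hypothetical.

```r
## Write a small hypothetical SD file with three placeholder records
record <- function(id) c(id, "", "", "M  END", "$$$$")
tmp <- tempfile(fileext = ".sdf")
writeLines(unlist(lapply(c("m1", "m2", "m3"), record)), tmp)

Nlines <- 4           # chunk size; deliberately small so records span chunks
carry  <- character() # lines of a record not yet closed by "$$$$"
offset <- 0L          # file line number just before the first buffered line
ids <- character(); starts <- integer(); ends <- integer()

con <- file(tmp, "r")
repeat {
  chunk <- readLines(con, n = Nlines)
  if (length(chunk) == 0) break
  buf  <- c(carry, chunk)
  term <- which(buf == "$$$$")            # terminators of complete records
  if (length(term) > 0) {
    from   <- c(1L, head(term, -1L) + 1L) # first line of each complete record
    ids    <- c(ids, buf[from])
    starts <- c(starts, offset + from)
    ends   <- c(ends, offset + term)
    done   <- max(term)
    carry  <- buf[seq_len(length(buf) - done) + done] # keep the unfinished tail
    offset <- offset + done
  } else {
    carry <- buf                          # no complete record yet; keep buffering
  }
}
close(con)
unlink(tmp)

index <- data.frame(SDFID = ids, start = starts, end = ends)
print(index)
```

In this sketch the buffer never holds more than one chunk plus the lines of a single unfinished record, which is why memory use scales with the chunk size rather than with the file size.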
...
Writes a descriptor matrix to a tabular file. The first and last line number (position index) of each molecule is specified in the first two columns of the tabular output file, respectively.
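As a rough illustration of why the first and last line numbers are useful, the base R sketch below pulls one record out of a file using only its line range. The helper read_record() and the sample file are hypothetical; in practice read.SDFindex is the function intended for this task.

```r
## Small hypothetical SD file with two placeholder records
tmp <- tempfile(fileext = ".sdf")
writeLines(c("m1", "", "", "M  END", "$$$$",
             "m2", "", "", "M  END", "$$$$"), tmp)

## Hypothetical helper: return the lines of one record given its line range
read_record <- function(path, start, end) {
  con <- file(path, "r")
  on.exit(close(con))
  if (start > 1) readLines(con, n = start - 1) # skip the preceding lines
  readLines(con, n = end - start + 1)          # read only the requested record
}

rec <- read_record(tmp, start = 6, end = 10)
unlink(tmp)
print(rec)
```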
Thomas Girke
SDF format definition: http://www.symyx.com/downloads/public/ctfile/ctfile.jsp
Import/export functions: read.AP, read.SDFset, read.SDFstr, write.SDFsplit
## Load sample data
library(ChemmineR)
data(sdfsample); sdfset <- sdfsample

## Not run: 
write.SDF(sdfset, "test.sdf")

## Define descriptor set in a simple function
desc <- function(sdfset) {
        cbind(SDFID=sdfid(sdfset),
              # datablock2ma(datablocklist=datablock(sdfset)),
              MW=MW(sdfset),
              groups(sdfset),
              # AP=sdf2ap(sdfset, type="character"),
              rings(sdfset, type="count", upper=6, arom=TRUE)
        )
}

## Run sdfStream with desc function and write results to a file called 'matrix.xls'
sdfStream(input="test.sdf", output="matrix.xls", append=FALSE, fct=desc, Nlines=1000)

## Same as before but starting in SD file at line number 950
sdfStream(input="test.sdf", output="matrix.xls", append=FALSE, fct=desc, Nlines=1000, startline=950)

## Select molecules from SD File using line index from sdfStream
indexDF <- read.delim("matrix.xls", row.names=1)[,1:4]
indexDFsub <- indexDF[indexDF$MW < 400, ] # Selects molecules with MW < 400
sdfset <- read.SDFindex(file="test.sdf", index=indexDFsub, type="SDFset")

## Write result directly to SD file without storing larger numbers of molecules in memory
read.SDFindex(file="test.sdf", index=indexDFsub, type="file", outfile="sub.sdf")

## Read atom pair string representation from file into APset
## (requires the AP column in desc above to be uncommented)
apset <- read.AP(file="matrix.xls", colid="AP")
cid(apset) <- as.character(indexDF$SDFID)

## End(Not run)