MSnbase 2.6.4
In this vignette, we document various timings and benchmarks of the recent MSnbase development (aka MSnbase2), which focuses on on-disk data access (as opposed to in-memory). More details about the new implementation will be documented elsewhere.
As a benchmarking dataset, we are going to use a subset of a TMT 6-plex experiment acquired on an LTQ Orbitrap Velos, which is distributed with the msdata package:
library("msdata")
f <- msdata::proteomics(full.names = TRUE,
pattern = "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz")
basename(f)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"
We need to load the MSnbase package and set the session-wide verbosity flag to FALSE.
library("MSnbase")
setMSnbaseVerbose(FALSE)
We first read the data using the original behaviour of the readMSData function, setting the mode argument to "inMemory" to generate an in-memory representation of the MS2-level raw data, and measure the time needed for this operation.
system.time(inmem <- readMSData(f, msLevel = 2,
mode = "inMemory",
centroided = TRUE))
## user system elapsed
## 6.720 0.064 6.826
Next, we use the readMSData function to generate an on-disk representation of the same data by setting mode = "onDisk".
system.time(ondisk <- readMSData(f, msLevel = 2,
mode = "onDisk",
centroided = TRUE))
## user system elapsed
## 2.452 0.056 2.510
Creating the on-disk experiment is considerably faster and scales to much bigger, multi-file data, in terms of both object creation time and object size (see next section). We must of course make sure that these two datasets are equivalent:
all.equal(inmem, ondisk)
## [1] TRUE
To compare the size occupied in memory by these two objects, we are going to use the object_size function from the pryr package, which accounts for the data (the spectra) in the assayData environment (as opposed to the object.size function from the utils package).
library("pryr")
object_size(inmem)
## 2.77 MB
object_size(ondisk)
## 217 kB
The difference is explained by the fact that for ondisk, the spectra are not created and stored in memory; they are accessed on disk when needed, for example for plotting:
plot(inmem[[200]], full = TRUE)
plot(ondisk[[200]], full = TRUE)
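The remark above that object.size does not account for data stored in an environment can be checked with base R alone. The following is a minimal sketch on a toy environment (the names e and x are illustrative and not part of MSnbase); it only shows why a function that follows environments, such as pryr's object_size, is needed to measure the in-memory object fairly.

```r
## Toy environment holding one million doubles (about 8 MB of data).
## 'e' and 'x' are illustrative names, not MSnbase objects.
e <- new.env()
e$x <- numeric(1e6)

## object.size() measures the environment itself, not what it contains ...
env_size <- as.numeric(object.size(e))
## ... whereas the vector on its own is measured in full.
vec_size <- as.numeric(object.size(e$x))

env_size
vec_size
```

The first value is a few dozen bytes while the second is about 8 MB, so a size estimate that stops at the environment would miss the spectra of an in-memory experiment almost entirely.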
The drawback of the on-disk representation appears when the spectrum data actually has to be accessed. To compare access times, we are going to use the microbenchmark package and repeat each access 10 times, comparing access to all 451 spectra and to a single spectrum in-memory (i.e. pre-loaded and constructed) and on-disk (i.e. on-the-fly access).
library("microbenchmark")
mb <- microbenchmark(spectra(inmem),
inmem[[200]],
spectra(ondisk),
ondisk[[200]],
times = 10)
mb
## Unit: microseconds
## expr min lq mean median uq
## spectra(inmem) 95.089 203.919 361.2514 262.6300 347.019
## inmem[[200]] 32.433 44.231 68.3990 61.7695 80.955
## spectra(ondisk) 497843.648 500686.619 549026.3521 520287.0335 579555.929
## ondisk[[200]] 230648.373 234640.918 242419.3885 238302.3400 243459.989
## max neval cld
## 1453.956 10 a
## 128.530 10 a
## 677225.291 10 c
## 276029.393 10 b
While it takes orders of magnitude more time to access the data on-the-fly rather than a pre-generated spectrum, accessing all 451 spectra on disk is only about twice as slow as accessing a single one (median of about 520 versus 238 milliseconds), as most of the time is spent preparing the file for access, which is done only once.
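The comparison can be made concrete from the medians in the table above. The sketch below copies those four median values by hand (in microseconds) rather than extracting them from the mb object, and the variable names are illustrative.

```r
## Median timings in microseconds, copied by hand from the
## microbenchmark output above.
med <- c(spectra_inmem  = 262.6300,
         single_inmem   = 61.7695,
         spectra_ondisk = 520287.0335,
         single_ondisk  = 238302.3400)

## On-disk versus in-memory slowdown, in orders of magnitude (log10).
slowdown <- log10(c(all    = med[["spectra_ondisk"]] / med[["spectra_inmem"]],
                    single = med[["single_ondisk"]] / med[["single_inmem"]]))
round(slowdown, 1)

## All on-disk spectra versus a single on-disk spectrum.
all_vs_single <- med[["spectra_ondisk"]] / med[["single_ondisk"]]
round(all_vs_single, 1)
```

With these numbers, on-disk access is between three and four orders of magnitude slower than in-memory access, while reading all spectra costs only a little over twice as much as reading one.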
On-disk access performance will depend on the read throughput of the disk. A comparison of the data import of the above file from an internal solid state drive and from a USB3-connected hard disk showed only small differences for the onDisk mode (1.07 vs 1.36 seconds), while no differences were observed for accessing individual or all spectra. Thus, for this particular setup, performance was about the same for SSD and HDD. This might however not apply to settings in which data import is performed in parallel from multiple files.
Data access does not prohibit interactive usage, such as plotting, as it takes about half a second. It is also a relatively rare operation compared to subsetting and filtering, which are faster for on-disk data:
i <- sample(length(inmem), 100)
system.time(inmem[i])
## user system elapsed
## 0.124 0.000 0.122
system.time(ondisk[i])
## user system elapsed
## 0.032 0.000 0.032
Operations on the spectra data, such as peak picking, smoothing or cleaning, are cached and only applied when the data is accessed, to minimise file access overhead. Finally, specific operations such as quantitation (see next section) are optimised for speed.
Below, we perform TMT 6-plex reporter ions quantitation on the first 100 spectra and verify that the results are identical (ignoring feature names).
system.time(eim <- quantify(inmem[1:100], reporters = TMT6,
method = "max"))
## user system elapsed
## 3.808 1.096 1.529
system.time(eod <- quantify(ondisk[1:100], reporters = TMT6,
method = "max"))
## user system elapsed
## 0.336 0.016 0.352
all.equal(eim, eod, check.attributes = FALSE)
## [1] TRUE
The MSnExp and OnDiskMSnExp documentation files and the MSnbase development vignette provide more information about implementation details.
The on-disk back-end supports multiple MS levels in one object, while the in-memory back-end only supports a single level. While support for multiple MS levels could be added to the in-memory back-end, memory constraints would make this of little practical use, and it will most likely never happen.
In-memory objects can be save()ed and load()ed, while on-disk ones can’t. As a workaround, the latter can be coerced to in-memory instances with as(, "MSnExp"). We would need mzML write support in mzR to be able to implement serialisation for on-disk data.
Whenever possible, accessing and processing on-disk data is delayed (lazy processing). These operations are stored in a processing queue until the spectra are effectively instantiated.
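The idea of a processing queue can be illustrated with a small base R sketch. This is not the MSnbase implementation: the lazy_new, lazy_add and lazy_get helpers are hypothetical names introduced here only to show how steps can be recorded first and applied on access.

```r
## Hypothetical lazy container: raw values plus a queue of pending steps.
## These helpers are illustrative only and do not exist in MSnbase.
lazy_new <- function(x) list(data = x, queue = list())

## Record a processing step without running it.
lazy_add <- function(obj, fun) {
    obj$queue <- c(obj$queue, list(fun))
    obj
}

## Apply the queued steps, in order, only when the data are requested.
lazy_get <- function(obj) {
    Reduce(function(d, f) f(d), obj$queue, obj$data)
}

x <- lazy_new(c(0, 5, 0, 12, 3))            # toy intensities
x <- lazy_add(x, function(d) d[d > 0])      # "clean": drop zero intensities
x <- lazy_add(x, function(d) d / max(d))    # normalise to the largest value

length(x$queue)   # two steps recorded, stored data still untouched
res <- lazy_get(x)
res               # steps are applied here, on access
```

Because the stored data are never modified, adding a step is cheap, and the cost of processing is only paid for the spectra that are eventually read.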
The on-disk validObject method doesn’t verify the validity of the spectra (as there aren’t any to check). The validateOnDiskMSnExp function, on the other hand, instantiates all spectra and checks their validity (in addition to calling validObject).
This document focuses on speed and size improvements of the new on-disk MSnExp representation. The extent of these improvements will substantially increase for larger data.
For general information about the on-disk MSnExp data class and MSnbase in general, see the other vignettes available with
vignette(package = "MSnbase")