1 Introduction

In this vignette, we document timings and benchmarks of the recent MSnbase development (aka MSnbase2), which focuses on on-disk data access (as opposed to in-memory). More details about the new implementation will be documented elsewhere.

As a benchmarking dataset, we are going to use a subset of a TMT 6-plex experiment acquired on an LTQ Orbitrap Velos, which is distributed with the msdata package:

library("msdata")
f <- msdata::proteomics(full.names = TRUE, pattern = "TMT_Erwinia")
basename(f)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"

We need to load the MSnbase package and set the session-wide verbosity flag to FALSE.

library("MSnbase")
setMSnbaseVerbose(FALSE)

2 Benchmarking

2.1 Reading data

We first read the data using the original readMSData function, which generates an in-memory representation of the MS2-level raw data, and measure the time needed for this operation.

system.time(inmem <- readMSData(f, msLevel = 2,
                                centroided = TRUE))
##    user  system elapsed 
##   5.216   0.036   5.333

Next, we use the readMSData2 function to generate an on-disk representation of the same data.

system.time(ondisk <- readMSData2(f, msLevel = 2,
                                  centroided = TRUE))
##    user  system elapsed 
##   1.644   0.056   1.698

Creating the on-disk experiment is considerably faster (about 1.7 versus 5.3 seconds elapsed here) and scales to much bigger, multi-file data, in terms of both object creation time and object size (see next section). We must of course make sure that these two datasets are equivalent:

all.equal(inmem, ondisk)
## [1] TRUE
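
A single system.time call can be noisy, so timings such as those above are more robust when repeated. The following is a minimal sketch, assuming a hypothetical helper median_elapsed (not part of MSnbase) that evaluates a function several times and reports the median elapsed time. The toy workload only stands in for the import calls shown above.

```r
## Hypothetical helper (not part of MSnbase): call `fun` `n` times and
## return the median elapsed time in seconds.
median_elapsed <- function(fun, n = 5) {
    stopifnot(is.function(fun), n >= 1)
    elapsed <- vapply(seq_len(n), function(i) {
        system.time(fun())[["elapsed"]]
    }, numeric(1))
    median(elapsed)
}

## Toy workload standing in for a data import call.
toy_import <- function() sum(sqrt(seq_len(1e5)))

res <- median_elapsed(toy_import, n = 5)
res
```

With the objects from this vignette, the same helper could be called as median_elapsed(function() readMSData2(f, msLevel = 2, centroided = TRUE)).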

2.2 Data size

To compare the memory footprint of these two objects, we use the object_size function from the pryr package, which accounts for the data (the spectra) in the assayData environment (as opposed to the object.size function from the utils package, which does not).

library("pryr")
object_size(inmem)
## 2.68 MB
object_size(ondisk)
## 115 kB
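
The reason for preferring object_size can be illustrated with a small sketch that is independent of MSnbase. The environment e and its content x are made up for this example: object.size reports only the environment itself, not the data it contains.

```r
## A toy environment holding a large numeric vector (about 800 kB).
e <- new.env()
e$x <- numeric(1e5)

## utils::object.size() does not descend into the environment and
## reports only a small size for it.
env_size <- as.numeric(utils::object.size(e))
## The vector on its own is reported in full.
vec_size <- as.numeric(utils::object.size(e$x))
env_size < vec_size

## pryr::object_size() also counts the contents of the environment
## (only run if pryr is installed).
if (requireNamespace("pryr", quietly = TRUE))
    print(pryr::object_size(e))
```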

The difference is explained by the fact that, for ondisk, the spectra are not created and stored in memory; they are accessed on disk when needed, for example for plotting:

plot(inmem[[200]], full = TRUE)
plot(ondisk[[200]], full = TRUE)
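
The general idea of accessing data on disk only when needed can be sketched with base R alone. This is a toy illustration, not how MSnbase is implemented: the file layout, the simulated spectra and the accessor get_spectrum are all assumptions made for this example.

```r
## Toy data: 100 "spectra", each a two-column matrix of m/z and intensity.
set.seed(1)
spectra <- lapply(seq_len(100), function(i)
    cbind(mz = sort(runif(500, 100, 2000)), intensity = rexp(500)))

## In-memory representation: keep everything in a list.
in_memory <- spectra

## On-disk representation: write each spectrum to its own file and keep
## only the file paths in memory.
tmp_dir <- tempfile("toy_ondisk_")
dir.create(tmp_dir)
paths <- file.path(tmp_dir,
                   sprintf("spectrum_%03d.rds", seq_along(spectra)))
for (i in seq_along(spectra))
    saveRDS(spectra[[i]], paths[i])

## Hypothetical accessor: read a single spectrum only when requested.
get_spectrum <- function(paths, i) readRDS(paths[i])

## Compare the size of the path vector with that of the full data.
utils::object.size(paths) < utils::object.size(in_memory)

## Check that a spectrum retrieved on demand matches the in-memory one.
same <- identical(get_spectrum(paths, 20), in_memory[[20]])
same

## Clean up the temporary files.
unlink(tmp_dir, recursive = TRUE)
```

The trade-off is that every access pays the cost of reading from disk, in exchange for a much smaller object in memory.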