1 Introduction

In this vignette, we document various timings and benchmarks of MSnbase version 2, which focuses on on-disk data access (as opposed to in-memory). More details about the new implementation are documented in the manual pages of the respective classes and in

MSnbase, efficient and elegant R-based processing and visualisation of raw mass spectrometry data. Laurent Gatto, Sebastian Gibb, Johannes Rainer. bioRxiv 2020.04.29.067868; doi: https://doi.org/10.1101/2020.04.29.067868

As a benchmarking dataset, we are going to use a subset of a TMT 6-plex experiment acquired on an LTQ Orbitrap Velos, which is distributed with the msdata package:

library("msdata")
f <- msdata::proteomics(full.names = TRUE,
                        pattern = "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz")
basename(f)

## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"

We need to load the MSnbase package and set the session-wide verbosity flag to FALSE.

library("MSnbase")
setMSnbaseVerbose(FALSE)

2 Benchmarking

2.1 Reading data

We first read the data using the original behaviour of the readMSData function, setting the mode argument to "inMemory" to generate an in-memory representation of the MS2-level raw data, and measure the time needed for this operation.

system.time(inmem <- readMSData(f, msLevel = 2,
                                mode = "inMemory",
                                centroided = TRUE))

##    user  system elapsed 
##   7.128   0.080   8.555

Next, we use the readMSData function to generate an on-disk representation of the same data by setting mode = "onDisk".

system.time(ondisk <- readMSData(f, msLevel = 2,
                                  mode = "onDisk",
                                  centroided = TRUE))

##    user  system elapsed 
##   1.688   0.044   1.733

Creating the on-disk experiment is considerably faster and scales to much bigger, multi-file data, both in terms of object creation time and in terms of object size (see next section). We must of course make sure that these two datasets are equivalent:

all.equal(inmem, ondisk)

## [1] TRUE

2.2 Data size

To compare the memory footprint of these two objects, we are going to use the object_size function from the pryr package, which accounts for the data (the spectra) stored in the assayData environment (as opposed to the object.size function from the utils package, which does not).

library("pryr")

## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp

object_size(inmem)

## 2.77 MB

object_size(ondisk)

## 238 kB
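The reason for preferring object_size over object.size can be illustrated with a minimal base R sketch that does not involve MSnbase at all. The variable names v and e are placeholders chosen for this illustration; the point is only that object.size does not follow the contents of an environment, which is where the in-memory spectra are stored.

```r
# A plain numeric vector: object.size() counts its contents.
v <- numeric(1e6)
object.size(v)

# The same vector stored inside an environment: object.size() reports only
# the environment itself, not the data it holds.
e <- new.env()
assign("v", v, envir = e)
object.size(e)
```

The second call reports a size that is far smaller than the vector stored in the environment, which is why a size measured with object.size would be misleading for an object that keeps its spectra in assayData.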

The difference is explained by the fact that, for ondisk, the spectra are not created and stored in memory; they are accessed on disk when needed, for example for plotting:

plot(inmem[[200]], full = TRUE)
plot(ondisk[[200]], full = TRUE)

Figure 1: Plotting in-memory and on-disk spectra

2.3 Accessing spectra

The drawback of the on-disk representation appears when the spectrum data actually has to be accessed. To compare access times, we use the microbenchmark package and repeat each access 10 times, comparing access to all 451 spectra and to a single spectrum, both in-memory (i.e. pre-loaded and constructed) and on-disk (i.e. on-the-fly access).

library("microbenchmark")
mb <- microbenchmark(spectra(inmem),
                     inmem[[200]],
                     spectra(ondisk),
                     ondisk[[200]],
                     times = 10)
mb

## Unit: microseconds
##             expr        min         lq        mean     median         uq
##   spectra(inmem)     96.186    171.903    423.3112    283.944    405.671
##     inmem[[200]]     44.338     60.497     84.2655     86.386     94.818
##  spectra(ondisk) 442353.592 449972.668 497157.4715 488027.546 521516.955
##    ondisk[[200]] 219240.328 223599.031 235833.7103 224310.559 226732.156
##         max neval cld
##    1800.458    10 a  
##     126.881    10 a  
##  603571.208    10   c
##  287341.389    10  b

While it takes orders of magnitude more time to access the data on-the-fly than to access a pre-generated spectrum, accessing all on-disk spectra is not much slower than accessing a single one (roughly twice as long here), as most of the time is spent preparing the file for access, which is done only once.
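The comparison can be made explicit with a few lines of arithmetic on the median timings. This is only a sketch: the values are copied by hand from the microbenchmark output above, and the names in the med vector are labels invented for this illustration.

```r
# Median timings in microseconds, copied from the microbenchmark output above.
med <- c(inmem_all  = 283.944,
         inmem_one  = 86.386,
         ondisk_all = 488027.546,
         ondisk_one = 224310.559)

# On-disk versus in-memory access to a single spectrum.
unname(med["ondisk_one"] / med["inmem_one"])

# All 451 on-disk spectra versus a single on-disk spectrum.
unname(med["ondisk_all"] / med["ondisk_one"])

# Median on-disk access times converted to seconds.
med[c("ondisk_all", "ondisk_one")] / 1e6
```

The first ratio is in the thousands, while the second is close to 2, which matches the explanation that the file preparation cost dominates on-disk access.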

On-disk access performance will depend on the read throughput of the disk. A comparison of the data import of the above file from an internal solid state drive and from a USB3-connected hard disk showed only small differences for the onDisk mode (1.07 vs 1.36 seconds), while no differences were observed for accessing individual or all spectra. Thus, for this particular setup, performance was about the same for SSD and HDD. This might however not apply to settings in which data import is performed in parallel from multiple files.

On-disk data access does not prohibit interactive usage such as plotting: it takes about half a second, and it is a relatively rare operation compared to subsetting and filtering, which are faster for on-disk data:

i <- sample(length(inmem), 100)
system.time(inmem[i])

##    user  system elapsed 
##   0.132   0.000   0.131

system.time(ondisk[i])

##    user  system elapsed 
##    0.02    0.00    0.02

Operations on the spectra data, such as peak picking, smoothing or cleaning, are cached and only applied when the data is accessed, to minimise file access overhead. Finally, specific operations such as quantitation (see next section) are optimised for speed.

2.4 MS2 quantitation

Below, we perform TMT 6-plex reporter ion quantitation on the first 100 spectra and verify that the results are identical (ignoring feature names).

system.time(eim <- quantify(inmem[1:100], reporters = TMT6,
                            method = "max"))

##    user  system elapsed 
##   6.512   0.396   2.082

system.time(eod <- quantify(ondisk[1:100], reporters = TMT6,
                            method = "max"))

##    user  system elapsed 
##   0.320   0.028   0.346

all.equal(eim, eod, check.attributes = FALSE)

## [1] TRUE

3 Notable differences between on-disk and in-memory implementations

The MSnExp and OnDiskMSnExp documentation files and the MSnbase development vignette provide more information about implementation details.

3.1 MS levels

The on-disk implementation supports multiple MS levels in one object, while the in-memory implementation supports only a single level. Support for multiple MS levels could be added to the in-memory back-end, but memory constraints would make it of little practical use, so this will most likely never happen.

3.2 Serialisation

In-memory objects can be save()ed and load()ed, while on-disk objects can’t. As a workaround, the latter can be coerced to in-memory instances with as(x, "MSnExp"), where x is the on-disk object. We would need mzML write support in mzR to be able to implement serialisation for on-disk data.

3.3 Data processing

Whenever possible, accessing and processing on-disk data is delayed (lazy processing). The requested operations are stored in a processing queue until the spectra are actually instantiated.
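The idea of a processing queue can be sketched in a few lines of base R. This is a toy illustration under stated assumptions, not the MSnbase implementation: the helpers new_lazy, add_step and realise are hypothetical names, and a short numeric vector stands in for the intensities that would be read from disk.

```r
# Hypothetical helpers for illustration only; they are not part of MSnbase.

# A lazy object keeps a loader function and a queue of pending steps.
new_lazy <- function(loader) {
  list(loader = loader, queue = list())
}

# Adding a step only records it; nothing is computed at this point.
add_step <- function(x, fun) {
  x$queue <- c(x$queue, list(fun))
  x
}

# The data is loaded once and all queued steps are applied, in order.
realise <- function(x) {
  value <- x$loader()
  for (step in x$queue) value <- step(value)
  value
}

# Stand-in for reading the intensities of one spectrum from disk.
lz <- new_lazy(function() c(0, 5, 0, 12, 3))
lz <- add_step(lz, function(v) v[v > 0])    # drop zero intensities
lz <- add_step(lz, function(v) v / max(v))  # scale to the largest peak

length(lz$queue)  # two steps recorded, no data loaded yet
realise(lz)       # data loaded and both steps applied
```

Queuing several steps and applying them in one pass means the underlying data is read once per access rather than once per operation.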

3.4 Validity

The on-disk validObject method doesn’t verify the validity of the spectra (as there aren’t any to check). The validateOnDiskMSnExp function, on the other hand, instantiates all spectra and checks their validity (in addition to calling validObject).

4 Conclusions

This document focuses on the speed and size improvements of the new on-disk MSnExp representation. The extent of these improvements will increase substantially for larger data.

For general information about the on-disk MSnExp data class and MSnbase in general, see the other vignettes available with

vignette(package = "MSnbase")

MSnbase benchmarking

21 January 2021
