Foreword

MSnbase is under active development; current functionality is evolving and new features will be added. This software is free and open-source software. If you use it, please support the project by citing it in publications:

Gatto L, Lilley KS. MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics. 2012 Jan 15;28(2):288-9. doi: 10.1093/bioinformatics/btr645. PMID: 22113085.

Questions and bugs

For bugs, typos, suggestions or other questions, please file an issue in our tracking system (https://github.com/lgatto/MSnbase/issues) providing as much information as possible, a reproducible example and the output of sessionInfo().

If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.

1 Introduction

MSnbase (Gatto and Lilley 2012) aims are providing a reproducible research framework to proteomics data analysis. It should allow researcher to easily mine mass spectrometry data, explore the data and its statistical properties and visually display these.

MSnbase also aims at being compatible with the infrastructure implemented in Bioconductor, in particular Biobase. As such, classes developed specifically for proteomics mass spectrometry data are based on the eSet and ExpressionSet classes. The main goal is to assure seamless compatibility with existing meta data structure, accessor methods and normalisation techniques.

This vignette illustrates MSnbase utility using a dummy data sets provided with the package without describing the underlying data structures. More details can be found in the package, classes, method and function documentations. A description of the classes is provided in the MSnbase-development vignette1 in R, open it with vignette("MSnbase-development") or read it online here.

1.1 Speed and memory requirements

Raw mass spectrometry file are generally several hundreds of MB large and most of this is used for binary raw spectrum data. As such, data containers can easily grow very large and thus require large amounts of RAM. This requirement is being tackled by avoiding to load the raw data into memory and using on-disk random access to the content of mzXML/mzML data files on demand. When focusing on reporter ion quantitation, a direct solution for this is to trim the spectra using the trimMz method to select the area of interest and thus substantially reduce the size of the Spectrum objects. This is illustrated in section 6.2.

Parallel processing The independent handling of spectra is ideally suited for parallel processing. The quantify method for example performs reporter peaks quantitation in parallel.

Parallel support is provided by the BiocParallel and various backends including multicore (forking, default on Linux), simple networf network of workstations (SNOW, default on Windows) using sockets, forking or MPI among others. We refer readers to the documentation in BiocParallel. Automatic parallel processing of spectra is only established for a certain number of spectra (per file). This value (default is 1000) can be set with the setMSnbaseParallelThresh function.

In sock-based parallel processing, the main worker process has to start new R instances and connect to them via sock. Sometimes these connections can not be established and the processes get stuck. To test this, users can disable parallel processing by disabling parallel processing with register(SerialParam()). To avoid these deadlocks, it is possible to initiate the parallel processing setup explicitly at the beginning of the script using, for example

library("doParallel")
registerDoParallel(3) ## using 3 slave nodes
register(DoparParam(), default = TRUE)

## rest of script comes below

On-disk access Developmenets in version 2 of the package have solved the memory issue by implementing and on-disk version the of data class storing raw data (MSnExp, see section 2.3), where the spectra a accessed on-disk only when required. The benchmarking vignette compares the on-disk and in-memory implemenatations2 in R, open it with vignette("benchmarking") or read it online here. See details below.

2 Data structure and content

2.1 Importing experiments

MSnbase is able to import raw MS data stored in one of the XML-based formats as well as peak lists in the mfg format3 Mascot Generic Format, see http://www.matrixscience.com/help/data_file_help.html#GEN.

Raw data The XML-based formats, mzXML (Pedrioli et al. 2004), mzData (Orchard et al. 2007) and mzML (Martens et al. 2010) can be imported with the readMSData function, as illustrated below (see ?readMSData for more details). To make use of the new on-disk implementation, set mode = "onDisk" in readMSData rather than using the default mode = "inMemory".

file <- dir(system.file(package = "MSnbase", dir = "extdata"),
            full.names = TRUE, pattern = "mzXML$")
rawdata <- readMSData(file, msLevel = 2, verbose = FALSE)

Only spectra of a given MS level can be loaded at a time by setting the msLevel parameter accordingly in readMSData and in-memory data. In this document, we will use the itraqdata data set, provided with MSnbase. It includes feature metadata, accessible with the fData accessor. The metadata includes identification data for the 55 MS2 spectra.

Version 2.0 and later of MSnbase provide a new on-disk data storage model (see the benchmarking vignette for more details). The new data backend is compatible with the orignal in-memory model. To make use of the new infrastructure, read your raw data by setting the mode argument to "onDisk" (the default is still "inMemory" but is likely to change in the future). The new on-disk implementation supports several MS levels in a single raw data object. All existing operations work irrespective of the backend.

Peak lists can often be exported after spectrum processing from vendor-specific software and are also used as input to search engines. Peak lists in mgf format can be imported with the function readMgfData (see ?readMgfData for details) to create experiment objects. Experiments or individual spectra can be exported to an mgf file with the writeMgfData methods (see ?writeMgfData for details and examples).

Experiments with multiple runs Although it is possible to load and process multiple files serially and later merge the resulting quantitation data as show in section 13, it is also feasible to load several raw data files at once. Here, we report the analysis of an LC-MSMS experiment were 14 liquid chromatography (LC) fractions were loaded in memory using readMSData on a 32-cores servers with 128 Gb of RAM. It took about 90 minutes to read the 14 uncentroided mzXML raw files (4.9 Gb on disk in total) and create a 3.3 Gb raw data object (an MSnExp instance, see next section). Quantitation of 9 reporter ions (iTRAQ9 object, see 2.5) for 88690 features was performed in parallel on 16 processors and took 76 minutes. The resulting quantitation data was only 22.1 Mb and could easily be further processed. These number are based on the older in-memory implementation. As shown in the benchmarking vignette, using on-disk data greatly reduces memory requirement and computation time.

See also section 7.2 to import quantitative data stored in spreadsheets into R for further processing using MSnbase. The MSnbase-iovignette[in R, open it with vignette("MSnbase-io") or read it online here] gives a general overview of MSnbase’s input/ouput capabilites.

See section 7.3 for importing chromatographic data of SRM/MRM experiments.

2.2 Exporting experiments/MS data

MSnbase supports also to write MSnExp or OnDiskMSnExp objects to mzML or mzXML files using the writeMSData function. This is specifically useful in workflows in which the MS data was heavily manipulated. Presently, each sample/file is exported into one file.

Below we write the data in mzML format to a temporary file. By setting the optional parameter copy = TRUE general metadata (such as instrument info or all data processing descriptions) are copied over from the originating file.

writeMSData(rawdata, file = paste0(tempfile(), ".mzML"), copy = TRUE)

2.3 MS experiments

Raw data is contained in MSnExp objects, that stores all the spectra of an experiment, as defined by one or multiple raw data files.

library("MSnbase")
itraqdata
## MSn experiment data ("MSnExp")
## Object size in memory: 1.9 Mb
## - - - Spectra data - - -
##  MS level(s): 2 
##  Number of spectra: 55 
##  MSn retention times: 19:9 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
## Updated from version 0.3.0 to 0.3.1 [Fri Jul  8 20:23:25 2016] 
##  MSnbase version: 1.1.22 
## - - - Meta data  - - -
## phenoData
##   rowNames: 1
##   varLabels: sampleNames sampleNumbers
##   varMetadata: labelDescription
## Loaded from:
##   dummyiTRAQ.mzXML 
## protocolData: none
## featureData
##   featureNames: X1 X10 ... X9 (55 total)
##   fvarLabels: spectrum ProteinAccession ProteinDescription
##     PeptideSequence
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
head(fData(itraqdata))
##     spectrum ProteinAccession                       ProteinDescription
## X1         1              BSA                     bovine serum albumin
## X10       10          ECA1422 glucose-1-phosphate cytidylyltransferase
## X11       11          ECA4030         50S ribosomal subunit protein L4
## X12       12          ECA3882                   chaperone protein DnaK
## X13       13          ECA1364      succinyl-CoA synthetase alpha chain
## X14       14          ECA0871              NADP-dependent malic enzyme
##     PeptideSequence
## X1           NYQEAK
## X10 VTLVDTGEHSMTGGR
## X11           SPIWR
## X12        TAIDDALK
## X13          SILINK
## X14    DFEVVNNESDPR

As illustrated above, showing the experiment textually displays it’s content:

  • Information about the raw data, i.e. the spectra.

  • Specific information about the experiment processing4 This part will be automatically updated when the object is modified with it’s ad hoc methods, as illustrated later. and package version. This slot can be accessed with the processingData method.

  • Other meta data, including experimental phenotype, file name(s) used to import the data, protocol data, information about features (individual spectra here) and experiment data. Most of these are implemented as in the eSet class and are described in more details in their respective manual pages. See ?MSnExp and references therein for additional background information.

    The experiment meta data associated with an MSnExp experiment is of class MIAPE. It stores general information about the experiment as well as MIAPE (Minimum Information About a Proteomics Experiment) information (Taylor et al. 2007, Taylor et al. (2008)). This meta-data can be accessed with the experimentData method. When available, a summary of MIAPE-MS data can be printed with the msInfo method. See ?MIAPE for more details.

2.4 Spectra objects

The raw data is composed of the 55 MS spectra. The spectra are named individually (X1, X10, X11, X12, X13, X14, …) and stored in a environment. They can be accessed individually with itraqdata[["X1"]] or itraqdata[[1]], or as a list with spectra(itraqdata). As we have loaded our experiment specifying msLevel=2, the spectra will all be of level 2 (or higher, if available).

sp <- itraqdata[["X1"]]
sp
## Object of class "Spectrum2"
##  Precursor: 520.7833 
##  Retention time: 19:9 
##  Charge: 2 
##  MSn level: 2 
##  Peaks count: 1922 
##  Total ion count: 26413754

Attributes of individual spectra or of all spectra of an experiment can be accessed with their respective methods: precursorCharge for the precursor charge, rtime for the retention time, mz for the MZ values, intensity for the intensities, … see the Spectrum, Spectrum1 and Spectrum2 manuals for more details.

peaksCount(sp)
## [1] 1922
head(peaksCount(itraqdata))
##   X1  X10  X11  X12  X13  X14 
## 1922 1376 1571 2397 2574 1829
rtime(sp)
## [1] 1149.31
head(rtime(itraqdata))
##      X1     X10     X11     X12     X13     X14 
## 1149.31 1503.03 1663.61 1663.86 1664.08 1664.32

2.5 Reporter ions

Reporter ions are defined with the ReporterIons class. Specific peaks of interest are defined by a MZ value, a with around the expected MZ and a name (and optionally a colour for plotting, see section 3). ReporterIons instances are required to quantify reporter peaks in MSnExp experiments. Instances for the most commonly used isobaric tags like iTRAQ 4-plex and 8-plex and TMT 6- and 10-plex tags are already defined in MSnbase. See ?ReporterIons for details about how to generate new ReporterIons objects.

iTRAQ4
## Object of class "ReporterIons"
## iTRAQ4: '4-plex iTRAQ' with 4 reporter ions
##  - [iTRAQ4.114] 114.1112 +/- 0.05 (red)
##  - [iTRAQ4.115] 115.1083 +/- 0.05 (green)
##  - [iTRAQ4.116] 116.1116 +/- 0.05 (blue)
##  - [iTRAQ4.117] 117.115 +/- 0.05 (yellow)
TMT10
## Object of class "ReporterIons"
## TMT10HCD: '10-plex TMT HCD' with 10 reporter ions
##  - [126] 126.1277 +/- 0.002 (#8DD3C7)
##  - [127N] 127.1248 +/- 0.002 (#FFFFB3)
##  - [127C] 127.1311 +/- 0.002 (#BEBADA)
##  - [128N] 128.1281 +/- 0.002 (#FB8072)
##  - [128C] 128.1344 +/- 0.002 (#80B1D3)
##  - [129N] 129.1315 +/- 0.002 (#FDB462)
##  - [129C] 129.1378 +/- 0.002 (#B3DE69)
##  - [130N] 130.1348 +/- 0.002 (#FCCDE5)
##  - [130C] 130.1411 +/- 0.002 (#D9D9D9)
##  - [131] 131.1382 +/- 0.002 (#BC80BD)

2.6 Chromatogram objects

Chromatographic data, i.e. intensity values along the retention time dimension for a given \(m/z\) range/slice, can be extracted with the chromatogram method. Below we read a file from the msdata package and extract the (MS level 1) chromatogram. Without providing an \(m/z\) and a retention time range the function returns the total ion chromatogram (TIC) for each file within the MSnExp or OnDiskMSnExp object. See also section 7.3 for importing chromatographic data from SRM/MRM experiments.

f <- c(system.file("microtofq/MM14.mzML", package = "msdata"))
mtof <- readMSData(f, mode = "onDisk")
mtof_tic <- chromatogram(mtof)
mtof_tic
## Chromatograms with 1 row and 1 column
##           MM14.mzML
##      <Chromatogram>
## [1,]    length: 112
## phenoData with 1 variables
## featureData with 1 variables

Chromatographic data, represented by the intensity-retention time duplets, is stored in the Chromatogram object. The chromatogram method returns a Chromatograms object (note the s) which holds multiple Chromatogram objects and arranges them in a two-dimensional grid with columns representing files/samples of the MSnExp or OnDiskMSnExp object and rows \(m/z\)-retention time ranges. In the example above the Chromatograms object contains only a single Chromatogram object. Below we access this chromatogram object. Similar to the Spectrum objects, Chromatogram objects provide the accessor functions intensity and rtime to access the data, as well as the mz function, that returns the \(m/z\) range of the chromatogram.

mtof_tic[1, 1]
## Object of class: Chromatogram
## Intensity values aggregated using: sum 
## length of object: 112
## from file: 1
## mz range: [94.80679, 1004.962]
## rt range: [270.334, 307.678]
## MS level: 1
head(intensity(mtof_tic[1, 1]))
## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006 
##   64989   67445   77843  105097  155609  212760
head(rtime(mtof_tic[1, 1]))
## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006 
## 270.334 270.671 271.007 271.343 271.680 272.016
mz(mtof_tic[1, 1])
## [1]   94.80679 1004.96155

To extract the base peak chromatogram (the largest peak along the \(m/z\) dimension for each retention time/spectrum) we set the aggregationFun argument to "max".

mtof_bpc <- chromatogram(mtof, aggregationFun = "max")

See the Chromatogram help page and the vignettes from the xcms package for more details and use cases, also on how to extract chromatograms for specific ions.

3 Plotting raw data

3.1 MS data space

The MSmap class can be used to isolate specific slices of interest from a complete MS acquisition by specifying \(m/z\) and retention time ranges. One needs a raw data file in a format supported by mzR’s openMSfile (mzML, mzXML, …). Below we first download a raw data file from the PRIDE repository and create an MSmap containing all the MS1 spectra between acquired between 30 and 35 minutes and peaks between 521 and 523 \(m/z\). See ?MSmap for details.

## downloads the data
library("rpx")
px1 <- PXDataset("PXD000001")
mzf <- pxget(px1, 7)

## reads the data
ms <- openMSfile(mzf)
hd <- header(ms)

## a set of spectra of interest: MS1 spectra eluted
## between 30 and 35 minutes retention time
ms1 <- which(hd$msLevel == 1)
rtsel <- hd$retentionTime[ms1] / 60 > 30 &
    hd$retentionTime[ms1] / 60 < 35

## the map
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd, zeroIsNA = TRUE)
M
## Object of class "MSmap"
##  Map [75, 401]
##   [1]  Retention time: 30:1 - 34:58 
##   [2]  M/Z: 521 - 523 (res 0.005)

The M map object can be rendered as a heatmap with plot, as shown on figure 1.

plot(M, aspect = 1, allTicks = FALSE)
Heat map of a chunk of the MS data.

Figure 1: Heat map of a chunk of the MS data

One can also render the data in 3 dimension with the plot3D function, as show on figure 2.

plot3D(M)