This vignette describes the functionality implemented in the MSnbase package. MSnbase aims at (1) facilitating the import, processing, visualisation and quantification of mass spectrometry data into the R environment (R Development Core Team 2011) by providing specific data classes and methods and (2) enabling the utilisation of throughput-high data analysis pipelines provided by the Bioconductor (Gentleman et al. 2004) project.
MSnbase is under active development; current functionality is evolving and new features will be added. This software is free and open-source software. If you use it, please support the project by citing it in publications:
Gatto L, Lilley KS. MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics. 2012 Jan 15;28(2):288-9. doi: 10.1093/bioinformatics/btr645. PMID: 22113085.
For bugs, typos, suggestions or other questions, please file an issue
in our tracking system (https://github.com/lgatto/MSnbase/issues)
providing as much information as possible, a reproducible example and
the output of
If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.
MSnbase (Gatto and Lilley 2012) aims are providing a reproducible research framework to proteomics data analysis. It should allow researcher to easily mine mass spectrometry data, explore the data and its statistical properties and visually display these.
MSnbase also aims at being compatible with the infrastructure implemented in Bioconductor, in particular Biobase. As such, classes developed specifically for proteomics mass spectrometry data are based on the eSet and ExpressionSet classes. The main goal is to assure seamless compatibility with existing meta data structure, accessor methods and normalisation techniques.
This vignette illustrates MSnbase utility using a dummy
data sets provided with the package without describing the underlying
data structures. More details can be found in the package, classes,
method and function documentations. A description of the classes is
provided in the MSnbase-development
vignette1 in R, open it with
vignette("MSnbase-development") or read it online here.
Raw mass spectrometry file are generally several hundreds of MB large
and most of this is used for binary raw spectrum data. As such, data
containers can easily grow very large and thus require large amounts
of RAM. This requirement is being tackled by avoiding to load the raw
data into memory and using on-disk random access to the content of
mzML data files on demand. When focusing on reporter ion
quantitation, a direct solution for this is to trim the spectra using
trimMz method to select the area of interest and thus
substantially reduce the size of the
Spectrum objects. This is
illustrated in section 6.2.
Parallel processing The independent handling of spectra is ideally
suited for parallel processing. The
quantify method for example
performs reporter peaks quantitation in parallel.
Parallel support is provided by the BiocParallel and
various backends including multicore (forking, default on Linux),
simple networf network of workstations (SNOW, default on Windows)
using sockets, forking or MPI among others. We refer readers to the
documentation in BiocParallel. Automatic parallel
processing of spectra is only established for a certain number of
spectra (per file). This value (default is 1000) can be set with the
In sock-based parallel processing, the main worker process has to
start new R instances and connect to them via sock. Sometimes these
connections can not be established and the processes get stuck. To
test this, users can disable parallel processing by disabling parallel
register(SerialParam()). To avoid these deadlocks,
it is possible to initiate the parallel processing setup explicitly at
the beginning of the script using, for example
library("doParallel") registerDoParallel(3) ## using 3 slave nodes register(DoparParam(), default = TRUE) ## rest of script comes below
On-disk access Developmenets in version 2 of the package have
solved the memory issue by implementing and on-disk version the of
data class storing raw data (MSnExp, see section 2.3),
where the spectra a accessed on-disk only when required. The
benchmarking vignette compares the on-disk and in-memory
implemenatations2 in R, open it with
read it online
MSnbase is able to import raw MS data stored in one of
XML-based formats as well as peak lists in the
format3 Mascot Generic Format, see http://www.matrixscience.com/help/data_file_help.html#GEN.
Raw data The
mzXML (Pedrioli et al. 2004),
mzData (Orchard et al. 2007) and
mzML (Martens et al. 2010) can be imported with
readMSData function, as illustrated below (see
more details). To make use of the new on-disk implementation, set
mode = "onDisk" in
readMSData rather than using the default
mode = "inMemory".
file <- dir(system.file(package = "MSnbase", dir = "extdata"), full.names = TRUE, pattern = "mzXML$") rawdata <- readMSData(file, msLevel = 2, verbose = FALSE)
Only spectra of a given MS level can be loaded at a time by setting
msLevel parameter accordingly in
readMSData and in-memory
data. In this document, we will use the
itraqdata data set, provided
with MSnbase. It includes feature metadata, accessible
fData accessor. The metadata includes identification data
for the 55 MS2 spectra.
Version 2.0 and later of MSnbase provide a new
on-disk data storage model (see the benchmarking vignette for more
details). The new data backend is compatible with the orignal
in-memory model. To make use of the new infrastructure, read your
raw data by setting the
mode argument to
"onDisk" (the default is
"inMemory" but is likely to change in the future). The new
on-disk implementation supports several MS levels in a single raw
data object. All existing operations work irrespective of the backend.
Peak lists can often be exported after spectrum processing from
vendor-specific software and are also used as input to search engines.
Peak lists in
mgf format can be imported with the function
?readMgfData for details) to create experiment
objects. Experiments or individual spectra can be exported to an
mgf file with the
writeMgfData methods (see
details and examples).
Experiments with multiple runs Although it is possible to load and
process multiple files serially and later merge the resulting
quantitation data as show in section 13, it is also
feasible to load several raw data files at once. Here, we report the
analysis of an LC-MSMS experiment were 14 liquid chromatography (LC)
fractions were loaded in memory using
readMSData on a 32-cores
servers with 128 Gb of RAM. It took about 90 minutes to read the 14
mzXML raw files (4.9 Gb on disk in total) and create a
3.3 Gb raw data object (an MSnExp instance, see next section).
Quantitation of 9 reporter ions (iTRAQ9 object, see
2.5) for 88690 features was performed in parallel
on 16 processors and took 76 minutes. The resulting quantitation data
was only 22.1 Mb and could easily be further processed. These number
are based on the older in-memory implementation. As shown in the
benchmarking vignette, using on-disk data greatly reduces memory
requirement and computation time.
See also section 7.2 to import quantitative data stored in
spreadsheets into R for further processing using MSnbase.
The MSnbase-iovignette[in R, open it with
or read it online
gives a general overview of MSnbase’s input/ouput
See section 7.3 for importing chromatographic data of SRM/MRM experiments.
MSnbase supports also to write
OnDiskMSnExp objects to
mzXML files using the
writeMSData function. This is specifically useful in
workflows in which the MS data was heavily manipulated. Presently, each
sample/file is exported into one file.
Below we write the data in
mzML format to a temporary file. By setting the
copy = TRUE general metadata (such as instrument info or
all data processing descriptions) are copied over from the originating file.
writeMSData(rawdata, file = paste0(tempfile(), ".mzML"), copy = TRUE)
Raw data is contained in MSnExp objects, that stores all the spectra of an experiment, as defined by one or multiple raw data files.
## MSn experiment data ("MSnExp") ## Object size in memory: 1.9 Mb ## - - - Spectra data - - - ## MS level(s): 2 ## Number of spectra: 55 ## MSn retention times: 19:9 - 50:18 minutes ## - - - Processing information - - - ## Data loaded: Wed May 11 18:54:39 2011 ## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016] ## MSnbase version: 1.1.22 ## - - - Meta data - - - ## phenoData ## rowNames: 1 ## varLabels: sampleNames sampleNumbers ## varMetadata: labelDescription ## Loaded from: ## dummyiTRAQ.mzXML ## protocolData: none ## featureData ## featureNames: X1 X10 ... X9 (55 total) ## fvarLabels: spectrum ProteinAccession ProteinDescription ## PeptideSequence ## fvarMetadata: labelDescription ## experimentData: use 'experimentData(object)'
## spectrum ProteinAccession ProteinDescription ## X1 1 BSA bovine serum albumin ## X10 10 ECA1422 glucose-1-phosphate cytidylyltransferase ## X11 11 ECA4030 50S ribosomal subunit protein L4 ## X12 12 ECA3882 chaperone protein DnaK ## X13 13 ECA1364 succinyl-CoA synthetase alpha chain ## X14 14 ECA0871 NADP-dependent malic enzyme ## PeptideSequence ## X1 NYQEAK ## X10 VTLVDTGEHSMTGGR ## X11 SPIWR ## X12 TAIDDALK ## X13 SILINK ## X14 DFEVVNNESDPR
As illustrated above, showing the experiment textually displays it’s content:
Information about the raw data, i.e. the spectra.
Specific information about the experiment
processing4 This part will be automatically updated when the object is modified with it’s ad hoc methods, as illustrated later.
and package version. This slot can be accessed with the
Other meta data, including experimental phenotype, file name(s) used
to import the data, protocol data, information about features
(individual spectra here) and experiment data. Most of these are
implemented as in the eSet class and are described in more details
in their respective manual pages. See
?MSnExp and references
therein for additional background information.
The experiment meta data associated with an MSnExp experiment is
of class MIAPE. It stores general information about the experiment
as well as MIAPE (Minimum Information About a Proteomics Experiment)
information (Taylor et al. 2007, @Taylor2008). This meta-data can be
accessed with the
experimentData method. When available, a
summary of MIAPE-MS data can be printed with the
?MIAPE for more details.
The raw data is composed of the 55 MS spectra. The
spectra are named individually
(X1, X10, X11, X12, X13, X14, …)
and stored in a
environment. They can be accessed individually with
itraqdata[], or as a list with
spectra(itraqdata). As we have loaded our experiment specifying
msLevel=2, the spectra will all be of level 2 (or higher, if
sp <- itraqdata[["X1"]] sp
## Object of class "Spectrum2" ## Precursor: 520.7833 ## Retention time: 19:9 ## Charge: 2 ## MSn level: 2 ## Peaks count: 1922 ## Total ion count: 26413754
Attributes of individual spectra or of all spectra of an experiment
can be accessed with their respective methods:
precursorCharge for the precursor charge,
rtime for the retention time,
mz for the MZ
intensity for the intensities, … see the
Spectrum, Spectrum1 and Spectrum2
manuals for more details.
##  1922
## X1 X10 X11 X12 X13 X14 ## 1922 1376 1571 2397 2574 1829
##  1149.31
## X1 X10 X11 X12 X13 X14 ## 1149.31 1503.03 1663.61 1663.86 1664.08 1664.32
Reporter ions are defined with the ReporterIons class.
Specific peaks of interest are defined by a MZ value, a with around
the expected MZ and a name (and optionally a colour for plotting, see
section 3). ReporterIons instances are
required to quantify reporter peaks in MSnExp
experiments. Instances for the most commonly used isobaric tags like
iTRAQ 4-plex and 8-plex and TMT 6- and 10-plex tags are already
defined in MSnbase. See
details about how to generate new ReporterIons objects.
## Object of class "ReporterIons" ## iTRAQ4: '4-plex iTRAQ' with 4 reporter ions ## - [iTRAQ4.114] 114.1112 +/- 0.05 (red) ## - [iTRAQ4.115] 115.1083 +/- 0.05 (green) ## - [iTRAQ4.116] 116.1116 +/- 0.05 (blue) ## - [iTRAQ4.117] 117.115 +/- 0.05 (yellow)
## Object of class "ReporterIons" ## TMT10HCD: '10-plex TMT HCD' with 10 reporter ions ## -  126.1277 +/- 0.002 (#8DD3C7) ## - [127N] 127.1248 +/- 0.002 (#FFFFB3) ## - [127C] 127.1311 +/- 0.002 (#BEBADA) ## - [128N] 128.1281 +/- 0.002 (#FB8072) ## - [128C] 128.1344 +/- 0.002 (#80B1D3) ## - [129N] 129.1315 +/- 0.002 (#FDB462) ## - [129C] 129.1378 +/- 0.002 (#B3DE69) ## - [130N] 130.1348 +/- 0.002 (#FCCDE5) ## - [130C] 130.1411 +/- 0.002 (#D9D9D9) ## -  131.1382 +/- 0.002 (#BC80BD)
Chromatographic data, i.e. intensity values along the retention time dimension
for a given \(m/z\) range/slice, can be extracted with the
method. Below we read a file from the
msdata package and extract the (MS level
1) chromatogram. Without providing an \(m/z\) and a retention time range the
function returns the total ion chromatogram (TIC) for each file within the
OnDiskMSnExp object. See also section 7.3 for importing
chromatographic data from SRM/MRM experiments.
f <- c(system.file("microtofq/MM14.mzML", package = "msdata")) mtof <- readMSData(f, mode = "onDisk") mtof_tic <- chromatogram(mtof) mtof_tic
## MChromatograms with 1 row and 1 column ## MM14.mzML ## <Chromatogram> ## [1,] length: 112 ## phenoData with 1 variables ## featureData with 1 variables
Chromatographic data, represented by the intensity-retention time duplets, is
stored in the
Chromatogram object. The
chromatogram method returns a
Chromatograms object (note the s) which holds multiple
objects and arranges them in a two-dimensional grid with columns representing
files/samples of the
OnDiskMSnExp object and rows \(m/z\)-retention
time ranges. In the example above the
Chromatograms object contains only a
Chromatogram object. Below we access this chromatogram object. Similar
Chromatogram objects provide the accessor functions
rtime to access the data, as well as the
mz function, that
returns the \(m/z\) range of the chromatogram.
## Object of class: Chromatogram ## Intensity values aggregated using: sum ## length of object: 112 ## from file: 1 ## mz range: [94.80679, 1004.962] ## rt range: [270.334, 307.678] ## MS level: 1
## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006 ## 64989 67445 77843 105097 155609 212760
## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006 ## 270.334 270.671 271.007 271.343 271.680 272.016
##  94.80679 1004.96155
To extract the base peak chromatogram (the largest peak
along the \(m/z\) dimension for each retention time/spectrum) we set the
aggregationFun argument to
mtof_bpc <- chromatogram(mtof, aggregationFun = "max")
Chromatogram help page and the vignettes from the xcms
package for more details and use cases, also on how to extract
chromatograms for specific ions.
The MSmap class can be used to isolate specific slices of
interest from a complete MS acquisition by specifying \(m/z\) and
retention time ranges. One needs a raw data file in a format supported
mzXML, …). Below we first download a raw data file from the
PRIDE repository and create an MSmap containing all the MS1 spectra
between acquired between 30 and 35 minutes and peaks between 521 and
523 \(m/z\). See
?MSmap for details.
## downloads the data library("rpx") px1 <- PXDataset("PXD000001") mzf <- pxget(px1, 7) ## reads the data ms <- openMSfile(mzf) hd <- header(ms) ## a set of spectra of interest: MS1 spectra eluted ## between 30 and 35 minutes retention time ms1 <- which(hd$msLevel == 1) rtsel <- hd$retentionTime[ms1] / 60 > 30 & hd$retentionTime[ms1] / 60 < 35 ## the map M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd, zeroIsNA = TRUE)
## Object of class "MSmap" ## Map [75, 401] ##  Retention time: 30:1 - 34:58 ##  M/Z: 521 - 523 (res 0.005)
M map object can be rendered as a heatmap with
plot, as shown
on figure 1.
plot(M, aspect = 1, allTicks = FALSE)
One can also render the data in 3 dimension with the
function, as show on figure 2.