This vignette describes the classes implemented in the MSnbase package. It is intended as a starting point for developers or users who would like to learn more about, or further develop and extend, mass spectrometry and proteomics data structures.
MSnbase 2.2.0
MSnbase is under active development; current functionality is evolving and new features will be added. This software is free and open-source. If you use it, please support the project by citing it in publications:
Laurent Gatto and Kathryn S. Lilley. MSnbase - an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics 28, 288-289 (2011).
For bugs, typos, suggestions or other questions, please file an issue in our tracking system (https://github.com/lgatto/MSnbase/issues), providing as much information as possible, a reproducible example and the output of sessionInfo().
If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.
This document is not a replacement for the individual manual pages, which document the slots of the MSnbase classes. It is a centralised, high-level description of the package design.
MSnbase aims to be compatible with the Biobase infrastructure (Gentleman et al. 2004). Many meta data structures that are used in eSet and associated classes are also used here. As such, familiarity with the Biobase development and the new eSet vignette is beneficial; the vignette can be accessed directly with vignette("BiobaseDevelopment", package="Biobase").
The initial goal is to use the MSnbase infrastructure for MS2 labelled (iTRAQ (Ross et al. 2004) and TMT (Thompson et al. 2003)) and label-free (spectral counting, index and abundance) quantitation; see the documentation for the quantify function for details. The infrastructure is currently being extended to support a wider range of technologies, including metabolomics.
All classes have a .__classVersion__ slot of class Versions, inherited from the Versioned class of the Biobase package. This slot documents the class version of any instance and is used for debugging and object update purposes. Any change in a class implementation should trigger a version change.
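As a minimal sketch of how this version information can be inspected, the following assumes that MSnbase is installed and uses itraqdata, an example MSnExp object shipped with the package. isVersioned and classVersion are the Biobase accessors for versioned objects.

```r
library(MSnbase)   # also attaches Biobase

## itraqdata is an example MSnExp object shipped with MSnbase
data(itraqdata)

## Does the object carry version information?
isVersioned(itraqdata)

## Versions recorded for this instance
classVersion(itraqdata)

## Version of one specific class in the hierarchy
classVersion(itraqdata)["MSnExp"]
```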
pSet: a virtual class for raw mass spectrometry data and meta data

This virtual class is the main container for mass spectrometry data, i.e. spectra, and meta data. It is based on the eSet implementation for genomic data. The main difference from eSet is that the assayData slot is an environment containing any number of Spectrum instances (see the Spectrum section).
One new slot is introduced, namely processingData, which contains one MSnProcess instance (see the MSnProcess section), and the experimentData slot is now expected to contain MIAPE data. The annotation slot has not been implemented, as no prior feature annotation is known in shotgun proteomics.
getClass("pSet")
## Virtual Class "pSet" [package "MSnbase"]
##
## Slots:
##
## Name: assayData phenoData featureData
## Class: environment NAnnotatedDataFrame AnnotatedDataFrame
##
## Name: experimentData protocolData processingData
## Class: MIAxE AnnotatedDataFrame MSnProcess
##
## Name: .cache .__classVersion__
## Class: environment Versions
##
## Extends: "Versioned"
##
## Known Subclasses:
## Class "MSnExp", directly
## Class "OnDiskMSnExp", by class "MSnExp", distance 2, with explicit coerce
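The slots listed above can be explored through any concrete subclass. The sketch below assumes that MSnbase is installed and again uses the bundled itraqdata example object (an in-memory MSnExp); it is meant as an illustration of the accessors rather than an exhaustive tour.

```r
library(MSnbase)
data(itraqdata)

## pSet is virtual: it cannot be instantiated, only extended
isVirtualClass("pSet")
is(itraqdata, "pSet")

## assayData is an environment holding one Spectrum object per spectrum
ad <- assayData(itraqdata)
is.environment(ad)
head(ls(ad))

## Individual spectra can be extracted by position
sp <- itraqdata[[1]]
class(sp)

## Meta data slots are accessed as for eSet-like objects
phenoData(itraqdata)
head(fData(itraqdata))
experimentData(itraqdata)   # MIAPE data
processingData(itraqdata)   # MSnProcess instance
```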
MSnExp: a class for MS experiments

MSnExp extends pSet to store MS experiments. It does not add any new slots to pSet. Accessors and setters are all inherited from pSet, and new ones should be implemented for pSet. Methods that manipulate actual data in experiments are implemented for MSnExp objects.
getClass("MSnExp")
## Class "MSnExp" [package "MSnbase"]
##
## Slots:
##
## Name: assayData phenoData featureData
## Class: environment NAnnotatedDataFrame AnnotatedDataFrame
##
## Name: experimentData protocolData processingData
## Class: MIAxE AnnotatedDataFrame MSnProcess
##
## Name: .cache .__classVersion__
## Class: environment Versions
##
## Extends:
## Class "pSet", directly
## Class "Versioned", by class "pSet", distance 2
##
## Known Subclasses:
## Class "OnDiskMSnExp", directly, with explicit coerce
OnDiskMSnExp: an on-disk implementation of the MSnExp class

The OnDiskMSnExp class extends MSnExp and inherits all of its functionality, but aims to use as little memory as possible while balancing memory demand against performance. Most of the spectrum-specific data, such as retention time, polarity and total ion current, are stored within the object’s featureData slot. The actual M/Z and intensity values of the individual spectra are, in contrast to MSnExp objects, not kept in memory (in the assayData slot), but are fetched from the original files on demand. Because mzML files are indexed, using the mzR package to read the relevant spectrum data is fast and only moderately slower than for in-memory MSnExp objects (the benchmarking vignette compares data size and operation speed of the two implementations).
To keep track of data manipulation steps that are applied to spectrum data (such as those performed by the removePeaks or clean methods), a lazy execution framework was implemented. Methods that manipulate or subset a spectrum’s M/Z or intensity values cannot be applied directly to an OnDiskMSnExp object, since the relevant data is not kept in memory. Thus, any call to a processing method that changes or subsets M/Z or intensity values is added as a ProcessingStep item to the object’s spectraProcessingQueue. When the spectrum data is then queried from an OnDiskMSnExp, the spectra are read in from the file and all these processing steps are applied on the fly to the spectrum data before being returned to the user.
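The idea of queueing processing steps and replaying them on access can be illustrated with a toy sketch in base R. None of the names below (read_spectrum, processing_step, add_step, get_spectrum) belong to MSnbase; they are hypothetical stand-ins that only mimic the mechanism described above.

```r
## Toy illustration only: none of these names belong to the MSnbase API.

## Stand-in for reading one spectrum from a raw file on demand.
read_spectrum <- function(i) {
    data.frame(mz = c(100.1, 100.2, 100.3, 100.4),
               intensity = c(5, 0, 250, 40) * i)
}

## A processing step stores a function and its arguments, not a result.
processing_step <- function(fun, ...) list(fun = fun, args = list(...))

## Two example manipulations of intensity values.
zero_below <- function(sp, t) {
    sp$intensity[sp$intensity < t] <- 0
    sp
}
drop_zeros <- function(sp) sp[sp$intensity > 0, , drop = FALSE]

## The "on-disk" object only holds an index and an initially empty queue.
od <- list(index = 1:3, queue = list())

## Calling a processing method merely appends a step to the queue.
add_step <- function(x, fun, ...) {
    x$queue <- c(x$queue, list(processing_step(fun, ...)))
    x
}
od <- add_step(od, zero_below, t = 50)
od <- add_step(od, drop_zeros)

## Data access reads the spectrum and applies the queued steps in order.
get_spectrum <- function(x, i) {
    sp <- read_spectrum(x$index[i])
    for (step in x$queue)
        sp <- do.call(step$fun, c(list(sp), step$args))
    sp
}

res <- get_spectrum(od, 1)
res
```

Because the queue stores functions and arguments rather than results, the object stays small however many steps are recorded; the cost is paid each time the data are accessed.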
The operations involving extracting or manipulating spectrum data are applied on a per-file basis, which enables parallel processing. Thus, all corresponding method implementations for OnDiskMSnExp objects have a BPPARAM argument, and users can set a PARALLEL_THRESH option flag (see ?MSnbaseOptions for details) to define how and when parallel processing should be performed (using the BiocParallel package).
Note that all data manipulations that are not applied to the M/Z or intensity values of a spectrum (e.g. sub-setting by retention time) are very fast, as they operate directly on the object’s featureData slot.
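The following sketch shows how these two kinds of operations might look in practice. It rests on two assumptions: that the msdata package is installed and provides the mzML file used here, and that the installed MSnbase version accepts mode = "onDisk" in readMSData (in some versions the on-disk reader is a separate function, readMSData2).

```r
library(MSnbase)

## Assumption: the msdata package is installed and provides this file
f <- system.file("microtofq/MM14.mzML", package = "msdata")

## Assumption: this MSnbase version supports mode = "onDisk"
od <- readMSData(f, mode = "onDisk")

## Per-spectrum information comes from featureData; no raw data is read
head(rtime(od))
table(msLevel(od))

## Sub-setting by retention time also relies on featureData only
rt_range <- unname(quantile(rtime(od), c(0.25, 0.75)))
od_sub <- filterRt(od, rt = rt_range)
length(od_sub)

## Methods touching M/Z or intensity values are queued, not executed
od_sub <- removePeaks(od_sub, t = 500)
od_sub <- clean(od_sub)
length(od_sub@spectraProcessingQueue)

## Raw data are read, and the queued steps applied, only on access
sp <- od_sub[[1]]
```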
getClass("OnDiskMSnExp")
## Class "OnDiskMSnExp" [package "MSnbase"]
##
## Slots:
##
## Name: spectraProcessingQueue backend assayData
## Class: list character environment
##
## Name: phenoData featureData experimentData
## Class: NAnnotatedDataFrame AnnotatedDataFrame MIAxE
##
## Name: protocolData processingData .cache
## Class: AnnotatedDataFrame MSnProcess environment
##
## Name: .__classVersion__
## Class: Versions
##
## Extends:
## Class "MSnExp", directly
## Class "pSet", by class "MSnExp", distance 2
## Class "Versioned", by class "MSnExp", distance 3
The distinction between MSnExp and OnDiskMSnExp is often not stated explicitly, as it should not matter, from a user’s perspective, which data structure they are working with: both behave in equivalent ways. They are often referred to as the in-memory and on-disk MSnExp implementations.
MSnSet: a class for quantitative proteomics data

This class stores quantitation data and meta data after running quantify on an MSnExp object or by creating an MSnSet instance from an external file, as described in the MSnbase-io vignette and in ?readMSnSet, readMzTabData, etc. The quantitative data is in the form of an n by p matrix, where n is the number of features/spectra originally in the MSnExp passed to quantify and p is the number of reporter ions. If read from an external file, n corresponds to the number of features (protein groups, proteins, peptides, spectra) in the file and p is the number of columns with quantitative data (samples) in the file.
This prompted us to keep an implementation similar to the ExpressionSet class, while adding the proteomics-specific annotation slot introduced in the pSet class, namely processingData, for objects of class MSnProcess.
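A short sketch of the resulting n by p structure follows, assuming MSnbase is installed. It quantifies the bundled itraqdata example with the iTRAQ4 reporter ions; the choice of method = "max" is arbitrary and made here only for illustration.

```r
library(MSnbase)
data(itraqdata)

## Quantify the iTRAQ reporter ions in each MS2 spectrum
qnt <- quantify(itraqdata, method = "max", reporters = iTRAQ4)

## n features (spectra) by p reporter ions
dim(exprs(qnt))
length(itraqdata)   # n
length(iTRAQ4)      # p

## Feature and sample meta data, as in an ExpressionSet
head(fData(qnt))
pData(qnt)

## The proteomics-specific processing log
processingData(qnt)
```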
getClass(&qu