A short introduction to MSnbase development

Laurent Gatto1, Johannes Rainer2 and Sebastian Gibb3

1Computational Proteomics Unit, Cambridge, UK.
2Center for Biomedicine, EURAC, Bolzano, Italy.
3Department of Anesthesiology and Intensive Care, University Medicine Greifswald, Germany.

24 April 2017

Abstract

This vignette describes the classes implemented in the MSnbase package. It is intended as a starting point for developers or users who would like to learn more about, or further develop and extend, mass spectrometry and proteomics data structures.

Package

MSnbase 2.2.0

Foreword

MSnbase is under active development; current functionality is evolving and new features will be added. It is free and open-source software. If you use it, please support the project by citing it in publications:

Laurent Gatto and Kathryn S. Lilley. MSnbase - an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics 28, 288-289 (2011).

Questions and bugs

For bugs, typos, suggestions or other questions, please file an issue in our tracking system (https://github.com/lgatto/MSnbase/issues) providing as much information as possible, a reproducible example and the output of sessionInfo().

If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.

1 Introduction

This document is not a replacement for the individual manual pages, which document the slots of the MSnbase classes. It is a centralised high-level description of the package design.

MSnbase aims at being compatible with the Biobase infrastructure (Gentleman et al. 2004). Many meta data structures that are used in eSet and associated classes are also used here. As such, knowledge of the Biobase development and the new eSet vignette would be beneficial; the vignette can directly be accessed with vignette("BiobaseDevelopment", package="Biobase").

The initial goal is to use the MSnbase infrastructure for MS2 labelled (iTRAQ (Ross et al. 2004) and TMT (Thompson et al. 2003)) and label-free (spectral counting, index and abundance) quantitation; see the documentation for the quantify function for details. The infrastructure is currently being extended to support a wider range of technologies, including metabolomics.

2 MSnbase classes

All classes extend the Versioned class from the Biobase package and therefore have a .__classVersion__ slot, of class Versions. This slot documents the class version of any instance, for debugging and object update purposes. Any change in a class implementation should trigger a version change.
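As a brief illustration, the class version information of an object can be inspected with the classVersion and isVersioned functions from Biobase. The sketch below assumes the itraqdata example data set shipped with MSnbase; the version numbers that are printed depend on the installed release.

```r
library(MSnbase)

# Example MSnExp object shipped with MSnbase
data(itraqdata)

# MSnbase classes extend Versioned, so the object carries version information
isVersioned(itraqdata)

# Content of the .__classVersion__ slot
classVersion(itraqdata)

# Version recorded for the MSnExp class itself
classVersion(itraqdata)["MSnExp"]
```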

2.1 `pSet`: a virtual class for raw mass spectrometry data and meta data

This virtual class is the main container for mass spectrometry data, i.e. spectra, and meta data. It is based on the eSet implementation for genomic data. The main difference from eSet is that the assayData slot is an environment containing any number of Spectrum instances (see the Spectrum section).

One new slot is introduced, namely processingData, that contains one MSnProcess instance (see the MSnProcess section), and the experimentData slot is now expected to contain MIAPE data. The annotation slot has not been implemented, as no prior feature annotation is known in shotgun proteomics.

getClass("pSet")

## Virtual Class "pSet" [package "MSnbase"]
## 
## Slots:
##                                                                   
## Name:            assayData           phenoData         featureData
## Class:         environment NAnnotatedDataFrame  AnnotatedDataFrame
##                                                                   
## Name:       experimentData        protocolData      processingData
## Class:               MIAxE  AnnotatedDataFrame          MSnProcess
##                                               
## Name:               .cache   .__classVersion__
## Class:         environment            Versions
## 
## Extends: "Versioned"
## 
## Known Subclasses: 
## Class "MSnExp", directly
## Class "OnDiskMSnExp", by class "MSnExp", distance 2, with explicit coerce
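The layout described above can be explored on a concrete object. The following is a minimal sketch, assuming the itraqdata example MSnExp (a pSet subclass) shipped with MSnbase; it only uses accessors and does not modify the object.

```r
library(MSnbase)
data(itraqdata)

# assayData is an environment holding one Spectrum object per spectrum
is.environment(assayData(itraqdata))
head(ls(assayData(itraqdata)))

# Individual spectra can be extracted with [[
sp <- itraqdata[[1]]
is(sp, "Spectrum")

# Meta data slots introduced or redefined by pSet
class(processingData(itraqdata))
class(experimentData(itraqdata))
```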

2.2 `MSnExp`: a class for MS experiments

MSnExp extends pSet to store MS experiments. It does not add any new slots to pSet. Accessors and setters are all inherited from pSet and new ones should be implemented for pSet. Methods that manipulate actual data in experiments are implemented for MSnExp objects.

getClass("MSnExp")

## Class "MSnExp" [package "MSnbase"]
## 
## Slots:
##                                                                   
## Name:            assayData           phenoData         featureData
## Class:         environment NAnnotatedDataFrame  AnnotatedDataFrame
##                                                                   
## Name:       experimentData        protocolData      processingData
## Class:               MIAxE  AnnotatedDataFrame          MSnProcess
##                                               
## Name:               .cache   .__classVersion__
## Class:         environment            Versions
## 
## Extends: 
## Class "pSet", directly
## Class "Versioned", by class "pSet", distance 2
## 
## Known Subclasses: 
## Class "OnDiskMSnExp", directly, with explicit coerce

2.3 `OnDiskMSnExp`: an on-disk implementation of the `MSnExp` class

The OnDiskMSnExp class extends MSnExp and inherits all of its functionality, but aims to use as little memory as possible while balancing memory demand against performance. Most of the spectrum-specific data, such as retention time, polarity and total ion current, are stored within the object’s featureData slot. The actual M/Z and intensity values from the individual spectra are, in contrast to MSnExp objects, not kept in memory (in the assayData slot), but are fetched from the original files on demand. Because mzML files are indexed, using the mzR package to read the relevant spectrum data is fast and only moderately slower than for in-memory MSnExp objects (the benchmarking vignette compares data size and operation speed of the two implementations).

To keep track of data manipulation steps that are applied to spectrum data (such as those performed by the removePeaks or clean methods), a lazy execution framework was implemented. Methods that manipulate or subset a spectrum’s M/Z or intensity values cannot be applied directly to an OnDiskMSnExp object, since the relevant data is not kept in memory. Thus, any call to a processing method that changes or subsets M/Z or intensity values is added as a ProcessingStep item to the object’s spectraProcessingQueue. When the spectrum data is then queried from an OnDiskMSnExp, the spectra are read from the file and all queued processing steps are applied on the fly before the data is returned to the user.
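The idea of a lazy processing queue can be sketched in a few lines of base R. This is a toy model, not the MSnbase implementation: the functions new_lazy_exp, add_step, get_spectrum and remove_peaks are hypothetical names, and a plain list stands in for the raw data files.

```r
# Toy stand-in for an on-disk experiment: peak data stay in `store`
# (playing the role of the raw files) and processing steps are only queued.
new_lazy_exp <- function(store) {
  list(store = store, queue = list())
}

# Queue a processing step: a function plus its extra arguments.
add_step <- function(x, fun, ...) {
  x$queue <- c(x$queue, list(list(fun = fun, args = list(...))))
  x
}

# Fetch one spectrum and apply all queued steps on the fly.
get_spectrum <- function(x, i) {
  sp <- x$store[[i]]
  for (step in x$queue)
    sp <- do.call(step$fun, c(list(sp), step$args))
  sp
}

# Hypothetical processing function: set intensities below `t` to zero.
remove_peaks <- function(sp, t) {
  sp$intensity[sp$intensity < t] <- 0
  sp
}

store <- list(list(mz = c(100, 101, 102), intensity = c(5, 50, 500)))
x <- new_lazy_exp(store)
x <- add_step(x, remove_peaks, t = 10)

length(x$queue)                # one pending step
x$store[[1]]$intensity         # stored data are untouched: 5 50 500
get_spectrum(x, 1)$intensity   # step applied on access: 0 50 500
```

Queuing the function together with its arguments, rather than the result, is what allows the object to stay small: nothing is computed until a spectrum is actually requested.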

The operations involving extracting or manipulating spectrum data are applied on a per-file basis, which enables parallel processing. Thus, all corresponding method implementations for OnDiskMSnExp objects have a BPPARAM argument, and users can set a PARALLEL_THRESH option flag (see ?MSnbaseOptions for details) to define how and when parallel processing should be performed (using the BiocParallel package).

Note that all data manipulations that are not applied to M/Z or intensity values of a spectrum (e.g. sub-setting by retention time) are very fast, as they operate directly on the object’s featureData slot.
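The behaviour described in this section can be tried on a small file. The sketch below rests on two assumptions that are not part of this vignette: that the msdata package (which provides the MM14.mzML example file) is installed, and that a recent MSnbase release is used in which readMSData accepts mode = "onDisk" (the 2.2 series exposed the on-disk reader as readMSData2 instead).

```r
library(MSnbase)

# Example mzML file from the msdata package (assumed to be installed)
f <- system.file("microtofq/MM14.mzML", package = "msdata")
od <- readMSData(f, mode = "onDisk")

# Spectrum-level meta data are kept in featureData; no peak data are read here
head(fData(od)[, c("retentionTime", "totIonCurrent")])

# Sub-setting by retention time only touches featureData
rt_range <- range(rtime(od))
od_sub <- filterRt(od, rt = c(rt_range[1], mean(rt_range)))
length(od_sub) <= length(od)

# removePeaks changes intensity values, so it is queued rather than executed
od_rp <- removePeaks(od, t = 100)
length(od_rp@spectraProcessingQueue)
```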

getClass("OnDiskMSnExp")

## Class "OnDiskMSnExp" [package "MSnbase"]
## 
## Slots:
##                                                                            
## Name:  spectraProcessingQueue                backend              assayData
## Class:                   list              character            environment
##                                                                            
## Name:               phenoData            featureData         experimentData
## Class:    NAnnotatedDataFrame     AnnotatedDataFrame                  MIAxE
##                                                                            
## Name:            protocolData         processingData                 .cache
## Class:     AnnotatedDataFrame             MSnProcess            environment
##                              
## Name:       .__classVersion__
## Class:               Versions
## 
## Extends: 
## Class "MSnExp", directly
## Class "pSet", by class "MSnExp", distance 2
## Class "Versioned", by class "MSnExp", distance 3

The distinction between MSnExp and OnDiskMSnExp is often not explicitly stated as it should not matter, from a user’s perspective, which data structure they are working with, as both behave in equivalent ways. Often, they are referred to as in-memory and on-disk MSnExp implementations.

2.4 `MSnSet`: a class for quantitative proteomics data

This class stores quantitation data and meta data after running quantify on an MSnExp object, or by creating an MSnSet instance from an external file, as described in the MSnbase-io vignette and in ?readMSnSet, ?readMzTabData, etc. The quantitative data is in the form of an n by p matrix, where n is the number of features/spectra originally in the MSnExp passed to quantify and p is the number of reporter ions. If read from an external file, n corresponds to the number of features (protein groups, proteins, peptides, spectra) in the file and p is the number of columns with quantitative data (samples) in the file.

This structure prompted us to keep an implementation similar to the ExpressionSet class, while adding the proteomics-specific annotation slot introduced in the pSet class, namely processingData for objects of class MSnProcess.

getClass("MSnSet")

## Class "MSnSet" [package "MSnbase"]
## 
## Slots:
##                                                                
## Name:      experimentData     processingData               qual
## Class:              MIAPE         MSnProcess         data.frame
##                                                                
## Name:           assayData          phenoData        featureData
## Class:          AssayData AnnotatedDataFrame AnnotatedDataFrame
##                                                                
## Name:          annotation       protocolData  .__classVersion__
## Class:          character AnnotatedDataFrame           Versions
## 
## Extends: 
## Class "eSet", directly
## Class "VersionedBiobase", by class "eSet", distance 2
## Class "Versioned", by class "eSet", distance 3
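The n by p layout can be checked on a small example. This is a minimal sketch, assuming the itraqdata example data and the iTRAQ4 reporter ions shipped with MSnbase; only the first five spectra are quantified to keep it quick.

```r
library(MSnbase)
data(itraqdata)

# Quantify a small subset: 5 MS2 spectra, 4 iTRAQ reporter ions
sub <- itraqdata[1:5]
qnt <- quantify(sub, method = "trap", reporters = iTRAQ4)

class(qnt)
dim(exprs(qnt))      # n spectra by p reporter ions
sampleNames(qnt)     # one column per reporter ion
processingData(qnt)  # processing log carried over and extended
```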

The MSnSet class extends the virtual eSet class to provide ExpressionSet-like behaviour. The experiment meta-data in experimentData is also of class MIAPE. The annotation slot, inherited from eSet, is not used. As a result, it is easy to convert between ExpressionSet and MSnSet objects with the coercion method as.

data(msnset)
class(msnset)

## [1] "MSnSet"
## attr(,"package")
## [1] "MSnbase"

class(as(msnset, "ExpressionSet"))

## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"

data(sample.ExpressionSet)
class(sample.ExpressionSet)

## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"

class(as(sample.ExpressionSet, "MSnSet"))

## [1] "MSnSet"
## attr(,"package")
## [1] "MSnbase"

2.5 `MSnProcess`: a class for logging processing meta data

This class aims at recording specific manipulations applied to MSnExp or MSnSet instances. The processing slot is a character vector that describes major processing steps. Most other slots are of class logical and indicate whether the data has been, for example, centroided or smoothed, although much of this functionality is not implemented yet. Any new processing that is implemented should be documented and logged here.

It also documents the raw data file from which the data originates (files slot) and the MSnbase version that was in use when the MSnProcess instance, and hence the MSnExp/MSnSet objects, were originally created.

getClass("MSnProcess")

## Class "MSnProcess" [package "MSnbase"]
## 
## Slots:
##                                                             
## Name:              files        processing            merged
## Class:         character         character           logical
##                                                             
## Name:            cleaned      removedPeaks          smoothed
## Class:           logical         character           logical
##                                                             
## Name:            trimmed        normalised    MSnbaseVersion
## Class:           numeric           logical         character
##                         
## Name:  .__classVersion__
## Class:          Versions
## 
## Extends: "Versioned"
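To see the logging in action, the processing slot can be compared before and after a processing step. The sketch below assumes the itraqdata example data shipped with MSnbase and reads the slots directly with @, which is convenient for inspection but not something user code should rely on.

```r
library(MSnbase)
data(itraqdata)

x <- itraqdata[1:3]
before <- processingData(x)@processing

# Set low-intensity peaks to zero; the step is recorded in processingData
x <- removePeaks(x, t = 10)
after <- processingData(x)@processing

# Log entries added by removePeaks
setdiff(after, before)

# Other slots documenting the object's history
processingData(x)@removedPeaks
processingData(x)@files
processingData(x)@MSnbaseVersion
```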

2.6 `MIAPE`: Minimum Information About a Proteomics Experiment

The MIAPE class implements the Minimum Information About a Proteomics Experiment guidelines (Taylor et al. 2007; Taylor et al. 2008). It describes the experiment, including contact details, information about the mass spectrometer, and the control and analysis software.

getClass("MIAPE")

## Class "MIAPE" [package "MSnbase"]
## 
## Slots:
##                                                         
## Name:                     title                      url
## Class:                character                character
##                                                         
## Name:                  abstract                pubMedIds
## Class:                character                character
##                                                         
## Name:                   samples            preprocessing
## Class:                     list                     list
##                                                         
## Name:                     other                dateStamp
## Class:                     list                character
##                                                         
## Name:                      name                      lab
## Class:                character                character
##                                                         
## Name:                   contact                    email
## Class:                character                character
##                                                         
## Name:           instrumentModel   instrumentManufacturer
## Class:                character                character
##                                                         
## Name:  instrumentCustomisations             softwareName
## Class:                character                character
##                                                         
## Name:           softwareVersion        switchingCriteria
## Class:                character                character
##                                                         
## Name:            isolationWidth            parameterFile
## Class:                  numeric                character
##                                                         
## Name:                 ionSource         ionSourceDetails
## Class:                character                character
##                                                         
## Name:                  analyser          analyserDetails
## Class:                character                character
##                                                         
## Name:              collisionGas        collisionPressure
## Class:                character                  numeric
##                                                         
## Name:           collisionEnergy             detectorType
## Class:                character                character
##                                                         
## Name:       detectorSensitivity        .__classVersion__
## Class:                character                 Versions
## 
## Extends: 
## Class "MIAxE", directly
## Class "Versioned", by class "MIAxE", distance 2

2.7 `Spectrum` et al.: classes for MS spectra

Spectrum is a virtual class that defines the attributes common to all types of spectra. MS1- and MS2-specific attributes are defined in the Spectrum1 and Spectrum2 classes, which directly extend Spectrum.

getClass("Spectrum")

## Virtual Class "Spectrum" [package "MSnbase"]
## 
## Slots:
##                                                             
## Name:            msLevel        peaksCount                rt
## Class:           integer           integer           numeric
##                                                             
## Name:     acquisitionNum         scanIndex               tic
## Class:           integer           integer           numeric
##                                                             
## Name:                 mz         intensity          fromFile
## Class:           numeric           numeric           integer
##                                                             
## Name:         centroided          smoothed          polarity
## Class:           logical           logical           integer
##                         
## Name:  .__classVersion__
## Class:          Versions
## 
## Extends: "Versioned"
## 
## Known Subclasses: "Spectrum2", "Spectrum1"

getClass("Spectrum1")

## Class "Spectrum1" [package "MSnbase"]
## 
## Slots:
##                                                             
## Name:            msLevel        peaksCount                rt
## Class:           integer           integer           numeric
##                                                             
## Name:     acquisitionNum         scanIndex               tic
## Class:           integer           integer           numeric
##                                                             
## Name:                 mz         intensity          fromFile
## Class:           numeric           numeric           integer
##                                                             
## Name:         centroided          smoothed          polarity
## Class:           logical           logical           integer
##                         
## Name:  .__classVersion__
## Class:          Versions
## 
## Extends: 
## Class "Spectrum", directly
## Class "Versioned", by class "Spectrum", distance 2

getClass("Spectrum2")

## Class "Spectrum2" [package "MSnbase"]
## 
## Slots:
##                                                                
## Name:              merged        precScanNum        precursorMz
## Class:            numeric            integer            numeric
##                                                                
## Name:  precursorIntensity    precursorCharge    collisionEnergy
## Class:            numeric            integer            numeric
##                                                                
## Name:             msLevel         peaksCount                 rt
## Class:            integer            integer            numeric
##                                                                
## Name:      acquisitionNum          scanIndex                tic
## Class:            integer            integer            numeric
##                                                                
## Name:                  mz          intensity           fromFile
## Class:            numeric            numeric            integer
##                                                                
## Name:          centroided           smoothed           polarity
## Class:            logical            logical            integer
##                          
## Name:   .__classVersion__
## Class:           Versions
## 
## Extends: 
## Class "Spectrum", directly
## Class "Versioned", by class "Spectrum", distance 2
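The split between common and MS2-specific attributes is visible through the accessors. This is a minimal sketch, assuming the itraqdata example data shipped with MSnbase, whose spectra are all of class Spectrum2.

```r
library(MSnbase)
data(itraqdata)

sp <- itraqdata[[1]]
class(sp)

# Attributes common to all spectra, defined in Spectrum
msLevel(sp)
rtime(sp)
peaksCount(sp)
head(mz(sp))
head(intensity(sp))

# MS2-specific attributes, defined in Spectrum2
precursorMz(sp)
precursorCharge(sp)
collisionEnergy(sp)
```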

2.8 `ReporterIons`: a class for isobaric tags

iTRAQ and TMT reporter ions (or any other peaks of interest) are implemented as ReporterIons instances, which essentially define an expected M/Z position for each peak and a width around this value, as well as names for the reporters.

getClass("ReporterIons")

## Class "ReporterIons" [package "MSnbase"]
## 
## Slots:
##                                                             
## Name:               name     reporterNames       description
## Class:         character         character         character
##                                                             
## Name:                 mz               col             width
## Class:           numeric         character           numeric
##                         
## Name:  .__classVersion__
## Class:          Versions
## 
## Extends: "Versioned"
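A ReporterIons instance can be created with new by filling the slots listed above. The sketch below uses nominal 4-plex iTRAQ masses purely as illustrative values; for real analyses the predefined objects shipped with MSnbase (such as iTRAQ4) are the safer choice.

```r
library(MSnbase)

# Illustrative reporter definition: nominal m/z values, 0.05 half-window
ri <- new("ReporterIons",
          name = "iTRAQ4",
          description = "4-plex iTRAQ",
          reporterNames = c("iTRAQ4.114", "iTRAQ4.115",
                            "iTRAQ4.116", "iTRAQ4.117"),
          mz = c(114.1, 115.1, 116.1, 117.1),
          col = c("red", "green", "blue", "yellow"),
          width = 0.05)

length(ri)          # number of reporter ions
reporterNames(ri)
mz(ri)              # expected peak positions
width(ri)           # tolerance around each position
```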

2.9 `NAnnotatedDataFrame`: multiplexed `AnnotatedDataFrame`s

This simple extension of the AnnotatedDataFrame class adds the multiplex and multiLabels slots to document the number and names of multiplexed samples.

getClass("NAnnotatedDataFrame")

## Class "NAnnotatedDataFrame" [package "MSnbase"]
## 
## Slots:
##                                                             
## Name:          multiplex       multiLabels       varMetadata
## Class:           numeric         character        data.frame
##                                                             
## Name:               data         dimLabels .__classVersion__
## Class:        data.frame         character          Versions
## 
## Extends: 
## Class "AnnotatedDataFrame", directly
## Class "Versioned", by class "AnnotatedDataFrame", distance 2
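As an illustration, an NAnnotatedDataFrame is created like an AnnotatedDataFrame, with the two additional slots supplied. The sample annotation and the labels M1 and M2 below are made-up values used only for the sketch.

```r
library(MSnbase)

# Hypothetical sample annotation
df <- data.frame(x = 1:6,
                 y = rep(c("Low", "High"), each = 3),
                 row.names = paste("Sample", 1:6, sep = "_"))

# Hypothetical labels of the multiplexed samples
mplx <- c("M1", "M2")

nadf <- new("NAnnotatedDataFrame",
            data = df,
            multiplex = length(mplx),
            multiLabels = mplx)

multiplex(nadf)    # number of multiplexed samples
multiLabels(nadf)  # their names
pData(nadf)        # regular AnnotatedDataFrame accessors still apply
```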

2.10 Other classes

Lists of `MSnSet` instances

When several MSnSet instances are related to each other and should be stored together as distinct objects, they can be grouped as a list into an MSnSetList object. In addition to the actual list slot, this class also has basic logging functionality and enables iteration over the MSnSet instances using a dedicated lapply method.

getClass("MSnSetList")

## Class "MSnSetList" [package "MSnbase"]
## 
## Slots:
##                                                             
## Name:                  x               log .__classVersion__
## Class:              list              list          Versions
## 
## Extends: "Versioned"
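A short sketch of grouping and iterating, assuming the msnset example data shipped with MSnbase; the element names first and second are arbitrary.

```r
library(MSnbase)
data(msnset)

# Group two related MSnSet objects (the full set and a 10-feature subset)
msl <- MSnSetList(list(first = msnset, second = msnset[1:10, ]))

length(msl)
names(msl)

# Iterate over the elements with the dedicated lapply method
n_features <- lapply(msl, function(x) nrow(exprs(x)))
unlist(n_features)

# Retrieve the underlying list of MSnSet objects
class(msnsets(msl)[["second"]])
```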

3 Miscellaneous

Unit tests

MSnbase implements unit tests with the testthat package.

Processing methods

Methods that process raw data, i.e. spectra, should be implemented for Spectrum objects first and then applied with eapply (or similar) to the assayData slot of an MSnExp instance in the specific method.
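The pattern can be sketched with a small, hypothetical generic. topPeakMz is not part of MSnbase; it is defined here only to show a Spectrum-level method being reused by an MSnExp-level method through eapply. The example assumes the itraqdata data set shipped with MSnbase and an in-memory MSnExp (for on-disk objects the assayData environment is empty).

```r
library(MSnbase)
data(itraqdata)

# Hypothetical generic: M/Z of the most intense peak
setGeneric("topPeakMz", function(object, ...) standardGeneric("topPeakMz"))

# 1. Implement the logic once, at the Spectrum level
setMethod("topPeakMz", "Spectrum", function(object, ...) {
  if (length(object@mz) == 0L)
    return(NA_real_)
  object@mz[which.max(object@intensity)]
})

# 2. Apply it to every spectrum stored in the assayData environment
setMethod("topPeakMz", "MSnExp", function(object, ...) {
  unlist(eapply(assayData(object), topPeakMz))
})

res <- topPeakMz(itraqdata)
head(res)
```

Note that eapply iterates over an environment, so the order of the result is not guaranteed to match the order of the spectra in the experiment; the names of the result should be used to match values back to spectra.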

4 Session information

sessionInfo()

## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] MSnbase_2.2.0       ProtGenerics_1.8.0  BiocParallel_1.10.0
## [4] mzR_2.10.0          Rcpp_0.12.10        Biobase_2.36.0     
## [7] BiocGenerics_0.22.0 BiocStyle_2.4.0    
## 
## loaded via a namespace (and not attached):
##  [1] BiocInstaller_1.26.0  compiler_3.4.0        plyr_1.8.4           
##  [4] iterators_1.0.8       zlibbioc_1.22.0       tools_3.4.0          
##  [7] MALDIquant_1.16.2     digest_0.6.12         evaluate_0.10        
## [10] tibble_1.3.0          preprocessCore_1.38.0 gtable_0.2.0         
## [13] lattice_0.20-35       foreach_1.4.3         yaml_2.1.14          
## [16] stringr_1.2.0         knitr_1.15.1          IRanges_2.10.0       
## [19] S4Vectors_0.14.0      stats4_3.4.0          rprojroot_1.2        
## [22] grid_3.4.0            impute_1.50.0         XML_3.98-1.6         
## [25] rmarkdown_1.4         bookdown_0.3          limma_3.32.0         
## [28] ggplot2_2.2.1         magrittr_1.5          backports_1.0.5      
## [31] scales_0.4.1          pcaMethods_1.68.0     codetools_0.2-15     
## [34] htmltools_0.3.5       mzID_1.14.0           colorspace_1.3-2     
## [37] affy_1.54.0           stringi_1.1.5         doParallel_1.0.10    
## [40] lazyeval_0.2.0        munsell_0.4.3         vsn_3.44.0           
## [43] affyio_1.46.0

References

Gentleman, Robert C., Vincent J. Carey, Douglas M. Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, et al. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biol 5 (10): R80. doi:10.1186/gb-2004-5-10-r80.

Ross, Philip L., Yulin N. Huang, Jason N. Marchese, Brian Williamson, Kenneth Parker, Stephen Hattan, Nikita Khainovski, et al. 2004. “Multiplexed Protein Quantitation in Saccharomyces Cerevisiae Using Amine-Reactive Isobaric Tagging Reagents.” Mol Cell Proteomics 3 (12): 1154–69. doi:10.1074/mcp.M400129-MCP200.

Taylor, Chris F., Pierre-Alain Binz, Ruedi Aebersold, Michel Affolter, Robert Barkovich, Eric W. Deutsch, David M. Horn, et al. 2008. “Guidelines for Reporting the Use of Mass Spectrometry in Proteomics.” Nat Biotechnol 26 (8): 860–1. doi:10.1038/nbt0808-860.

Taylor, Chris F., Norman W. Paton, Kathryn S. Lilley, Pierre-Alain Binz, Randall K. Julian, Andrew R. Jones, Weimin Zhu, et al. 2007. “The Minimum Information About a Proteomics Experiment (MIAPE).” Nat Biotechnol 25 (8): 887–93. doi:10.1038/nbt1329.

Thompson, Andrew, Jürgen Schäfer, Karsten Kuhn, Stefan Kienle, Josef Schwarz, Günter Schmidt, Thomas Neumann, R Johnstone, A Karim A Mohammed, and Christian Hamon. 2003. “Tandem Mass Tags: A Novel Quantification Strategy for Comparative Analysis of Complex Protein Mixtures by MS/MS.” Anal Chem 75 (8): 1895–1904.