Contents

1 Version Info

R version: R version 3.6.0 (2019-04-26)
Bioconductor version: 3.9
Package version: 1.8.0

2 Setup

The follow packages will be used throughout this documents. R version 3.3.1 or higher is required to install all the packages using BiocManager::install.

library("mzR")
library("mzID")
library("MSnID")
library("MSnbase")
library("rpx")
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
library("MSGFplus")
library("rols")
library("hpar")

The most convenient way to install all the tutorials requirement (and more related content), is to install RforProteomics with all its dependencies.

library("BiocManager")
BiocManager::install("RforProteomics", dependencies = TRUE)

Other packages of interest, such as rTANDEM or MSGFgui will be described later in the document but are not required to execute the code in this workflow.

3 Introduction

This workflow illustrates R / Bioconductor infrastructure for proteomics. Topics covered focus on support for open community-driven formats for raw data and identification results, packages for peptide-spectrum matching, data processing and analysis:

Links to other packages and references are also documented. In particular, the vignettes included in the RforProteomics package also contains relevant material.

4 Exploring available infrastructure

In Bioconductor version 3.9, there are respectively 128 proteomics, 85 mass spectrometry software packages and 22 mass spectrometry experiment packages. These respective packages can be extracted with the proteomicsPackages(), massSpectrometryPackages() and massSpectrometryDataPackages() and explored interactively.

library("RforProteomics")
pp <- proteomicsPackages()
display(pp)

5 Mass spectrometry data

Most community-driven formats are supported in R, as detailed in the table below.

Type Format Package
raw mzML, mzXML, netCDF, mzData mzR (read)
identification mzIdentML mzR (read) and mzID (read)
quantitation mzQuantML
peak lists mgf MSnbase (read/write)
other mzTab MSnbase (read)

6 Getting data from proteomics repositories

MS-based proteomics data is disseminated through the ProteomeXchange infrastructure, which centrally coordinates submission, storage and dissemination through multiple data repositories, such as the PRIDE data base at the EBI for MS/MS experiments, PASSEL at the ISB for SRM data and the MassIVE resource. The rpx is an interface to ProteomeXchange and provides a basic access to PX data.

library("rpx")
pxannounced()
## 15 new ProteomeXchange annoucements
##     Data.Set    Publication.Data Message
## 1  PXD010862 2019-05-06 14:47:47     New
## 2  PXD013015 2019-05-06 14:34:16     New
## 3  PXD013546 2019-05-06 14:27:06     New
## 4  PXD010307 2019-05-06 14:20:15     New
## 5  PXD013397 2019-05-06 14:15:20     New
## 6  PXD010168 2019-05-06 14:13:43     New
## 7  PXD012877 2019-05-06 14:03:00     New
## 8  PXD011588 2019-05-06 13:42:46     New
## 9  PXD012379 2019-05-06 13:31:53     New
## 10 PXD013425 2019-05-06 13:26:35     New
## 11 PXD013702 2019-05-06 13:24:59     New
## 12 PXD010791 2019-05-05 02:13:41     New
## 13 PXD010790 2019-05-05 00:29:31     New
## 14 PXD010959 2019-05-04 10:35:53     New
## 15 PXD013097 2019-05-04 10:24:28     New

Using the unique PXD000001 identifier, we can retrieve the relevant metadata that will be stored in a PXDataset object. The names of the files available in this data can be retrieved with the pxfiles accessor function.

px <- PXDataset("PXD000001")
px
## Object of class "PXDataset"
##  Id: PXD000001 with 12 files
##  [1] 'F063721.dat' ... [12] 'generated'
##  Use 'pxfiles(.)' to see all files.
pxfiles(px)
##  [1] "F063721.dat"                                                         
##  [2] "F063721.dat-mztab.txt"                                               
##  [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"                                  
##  [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"                                    
##  [5] "PXD000001_mztab.txt"                                                 
##  [6] "README.txt"                                                          
##  [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML" 
##  [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
##  [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"         
## [10] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"           
## [11] "erwinia_carotovora.fasta"                                            
## [12] "generated"

Other metadata for the px data set:

pxtax(px)
## [1] "Erwinia carotovora"
pxurl(px)
## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"
pxref(px)
## [1] "Gatto L, Christoforou A. Using R and Bioconductor for proteomics data analysis. Biochim Biophys Acta. 2014 1844(1 pt a):42-51"

Data files can then be downloaded with the pxget function. Below, we retrieve the raw data file. The file is downloaded in the working directory and the name of the file is return by the function and stored in the mzf variable for later use.

fn <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
mzf <- pxget(px, fn)
## Downloading 1 file
mzf
## [1] "/tmp/RtmpM7a8cs/Rbuild68e538add17d/proteomics/vignettes/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"

7 Handling raw MS data

The mzR package provides an interface to the proteowizard C/C++ code base to access various raw data files, such as mzML, mzXML, netCDF, and mzData. The data is accessed on-disk, i.e it is not loaded entirely in memory by default but only when explicitly requested. The three main functions are openMSfile to create a file handle to a raw data file, header to extract metadata about the spectra contained in the file and peaks to extract one or multiple spectra of interest. Other functions such as instrumentInfo, or runInfo can be used to gather general information about a run.

Below, we access the raw data file downloaded in the previous section and open a file handle that will allow us to extract data and metadata of interest.

library("mzR")
ms <- openMSfile(mzf)
ms
## Mass Spectrometry file handle.
## Filename:  TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML 
## Number of scans:  7534

The header function returns the metadata of all available peaks:

hd <- header(ms)
dim(hd)
## [1] 7534   29
names(hd)
##  [1] "seqNum"                     "acquisitionNum"            
##  [3] "msLevel"                    "polarity"                  
##  [5] "peaksCount"                 "totIonCurrent"             
##  [7] "retentionTime"              "basePeakMZ"                
##  [9] "basePeakIntensity"          "collisionEnergy"           
## [11] "ionisationEnergy"           "lowMZ"                     
## [13] "highMZ"                     "precursorScanNum"          
## [15] "precursorMZ"                "precursorCharge"           
## [17] "precursorIntensity"         "mergedScan"                
## [19] "mergedResultScanNum"        "mergedResultStartScanNum"  
## [21] "mergedResultEndScanNum"     "injectionTime"             
## [23] "filterString"               "spectrumId"                
## [25] "centroided"                 "ionMobilityDriftTime"      
## [27] "isolationWindowTargetMZ"    "isolationWindowLowerOffset"
## [29] "isolationWindowUpperOffset"

We can extract metadata and scan data for scan 1000 as follows:

hd[1000, ]
##      seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent
## 1000   1000           1000       2        1        274       1048554
##      retentionTime basePeakMZ basePeakIntensity collisionEnergy
## 1000      1106.916    136.061            164464              45
##      ionisationEnergy    lowMZ   highMZ precursorScanNum precursorMZ
## 1000                0 104.5467 1370.758              992    683.0817
##      precursorCharge precursorIntensity mergedScan mergedResultScanNum
## 1000               2           689443.7          0                   0
##      mergedResultStartScanNum mergedResultEndScanNum injectionTime
## 1000                        0                      0      55.21463
##                                                  filterString
## 1000 FTMS + p NSI d Full ms2 683.08@hcd45.00 [100.00-1380.00]
##                                         spectrumId centroided
## 1000 controllerType=0 controllerNumber=1 scan=1000       TRUE
##      ionMobilityDriftTime isolationWindowTargetMZ isolationWindowLowerOffset
## 1000                   NA                  683.08                          1
##      isolationWindowUpperOffset
## 1000                          1
head(peaks(ms, 1000))
##          [,1]     [,2]
## [1,] 104.5467 308.9326
## [2,] 104.5684 308.6961
## [3,] 108.8340 346.7183
## [4,] 109.3928 365.1236
## [5,] 110.0345 616.7905
## [6,] 110.0703 429.1975
plot(peaks(ms, 1000), type = "h")

Below we reproduce the example from the MSmap function from the MSnbase package to plot a specific slice of the raw data using the mzR functions we have just described.

## a set of spectra of interest: MS1 spectra eluted
## between 30 and 35 minutes retention time
ms1 <- which(hd$msLevel == 1)
rtsel <- hd$retentionTime[ms1] / 60 > 30 &
    hd$retentionTime[ms1] / 60 < 35

## the map
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd)

plot(M, aspect = 1, allTicks = FALSE)

plot3D(M)