R version: R version 4.1.0 (2021-05-18)
Bioconductor version: 3.13
Package version: 1.16.0
The follow packages will be used throughout this documents. R version
3.3.1
or higher is required to install all the packages using
BiocManager::install
.
library("mzR")
library("mzID")
library("MSnID")
library("MSnbase")
library("rpx")
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
library("MSGFplus")
library("rols")
library("hpar")
The most convenient way to install all the tutorials requirement (and more related content), is to install RforProteomics with all its dependencies.
library("BiocManager")
BiocManager::install("RforProteomics", dependencies = TRUE)
Other packages of interest, such as rTANDEM or MSGFgui will be described later in the document but are not required to execute the code in this workflow.
This workflow illustrates R / Bioconductor infrastructure for proteomics. Topics covered focus on support for open community-driven formats for raw data and identification results, packages for peptide-spectrum matching, data processing and analysis:
Links to other packages and references are also documented. In particular, the vignettes included in the RforProteomics package also contains relevant material.
In Bioconductor version 3.13, there are respectively 157
proteomics,
113
mass spectrometry software packages
and 25
mass spectrometry experiment packages. These
respective packages can be extracted with the proteomicsPackages()
,
massSpectrometryPackages()
and massSpectrometryDataPackages()
and
explored interactively.
library("RforProteomics")
pp <- proteomicsPackages()
display(pp)
Most community-driven formats are supported in R
, as detailed in the
table below.
MS-based proteomics data is disseminated through the ProteomeXchange infrastructure, which centrally coordinates submission, storage and dissemination through multiple data repositories, such as the PRIDE data base at the EBI for MS/MS experiments, PASSEL at the ISB for SRM data and the MassIVE resource. The rpx is an interface to ProteomeXchange and provides a basic access to PX data.
library("rpx")
pxannounced()
## 15 new ProteomeXchange annoucements
## Data.Set Publication.Data Message
## 1 PXD022998 2021-06-04 09:33:06 New
## 2 PXD026477 2021-06-04 08:38:22 New
## 3 PXD026479 2021-06-04 06:30:46 New
## 4 PXD026478 2021-06-04 06:30:09 New
## 5 PXD023207 2021-06-04 06:28:53 New
## 6 PXD026459 2021-06-04 05:28:46 New
## 7 PXD026476 2021-06-04 05:24:27 New
## 8 PXD025853 2021-06-04 03:26:15 New
## 9 PXD020420 2021-06-03 15:49:01 New
## 10 PXD024298 2021-06-03 15:43:11 New
## 11 PXD025064 2021-06-03 15:30:00 New
## 12 PXD017986 2021-06-03 15:22:29 New
## 13 PXD025866 2021-06-03 15:21:36 New
## 14 PXD019978 2021-06-03 15:12:47 New
## 15 PXD015712 2021-06-03 15:00:06 New
Using the unique PXD000001
identifier, we can retrieve the relevant
metadata that will be stored in a PXDataset
object. The names of the
files available in this data can be retrieved with the pxfiles
accessor function.
px <- PXDataset("PXD000001")
px
## Object of class "PXDataset"
## Id: PXD000001 with 11 files
## [1] 'F063721.dat' ... [11] 'erwinia_carotovora.fasta'
## Use 'pxfiles(.)' to see all files.
pxfiles(px)
## [1] "F063721.dat"
## [2] "F063721.dat-mztab.txt"
## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"
## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"
## [5] "PXD000001_mztab.txt"
## [6] "README.txt"
## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
## [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
## [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## [10] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"
## [11] "erwinia_carotovora.fasta"
Other metadata for the px
data set:
pxtax(px)
## [1] "Erwinia carotovora"
pxurl(px)
## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"
pxref(px)
## [1] "Gatto L, Christoforou A. Using R and Bioconductor for proteomics data analysis. Biochim Biophys Acta. 2013 May 18. doi:pii: S1570-9639(13)00186-6. 10.1016/j.bbapap.2013.04.032"
Data files can then be downloaded with the pxget
function. Below, we
retrieve the raw data file. The file is downloaded in the working
directory and the name of the file is return by the function and
stored in the mzf
variable for later use.
fn <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
mzf <- pxget(px, fn)
## Loading TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML from cache.
mzf
## [1] "/home/biocbuild/.cache/R/rpx/182f5e5be80f69_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
The mzR
package provides an interface to the
proteowizard C/C++ code base
to access various raw data files, such as mzML
, mzXML
, netCDF
,
and mzData
. The data is accessed on-disk, i.e it is not loaded
entirely in memory by default but only when explicitly requested. The
three main functions are openMSfile
to create a file handle to a raw
data file, header
to extract metadata about the spectra contained in
the file and peaks
to extract one or multiple spectra of
interest. Other functions such as instrumentInfo
, or runInfo
can
be used to gather general information about a run.
Below, we access the raw data file downloaded in the previous section and open a file handle that will allow us to extract data and metadata of interest.
library("mzR")
ms <- openMSfile(mzf)
ms
## Mass Spectrometry file handle.
## Filename: 182f5e5be80f69_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## Number of scans: 7534
The header
function returns the metadata of all available peaks:
hd <- header(ms)
dim(hd)
## [1] 7534 31
names(hd)
## [1] "seqNum" "acquisitionNum"
## [3] "msLevel" "polarity"
## [5] "peaksCount" "totIonCurrent"
## [7] "retentionTime" "basePeakMZ"
## [9] "basePeakIntensity" "collisionEnergy"
## [11] "ionisationEnergy" "lowMZ"
## [13] "highMZ" "precursorScanNum"
## [15] "precursorMZ" "precursorCharge"
## [17] "precursorIntensity" "mergedScan"
## [19] "mergedResultScanNum" "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum" "injectionTime"
## [23] "filterString" "spectrumId"
## [25] "centroided" "ionMobilityDriftTime"
## [27] "isolationWindowTargetMZ" "isolationWindowLowerOffset"
## [29] "isolationWindowUpperOffset" "scanWindowLowerLimit"
## [31] "scanWindowUpperLimit"
We can extract metadata and scan data for scan 1000 as follows:
hd[1000, ]
## seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent
## 1000 1000 1000 2 1 274 1048554
## retentionTime basePeakMZ basePeakIntensity collisionEnergy
## 1000 1106.916 136.061 164464 45
## ionisationEnergy lowMZ highMZ precursorScanNum precursorMZ
## 1000 0 104.5467 1370.758 992 683.0817
## precursorCharge precursorIntensity mergedScan mergedResultScanNum
## 1000 2 689443.7 NA NA
## mergedResultStartScanNum mergedResultEndScanNum injectionTime
## 1000 NA NA 55.21463
## filterString
## 1000 FTMS + p NSI d Full ms2 683.08@hcd45.00 [100.00-1380.00]
## spectrumId centroided
## 1000 controllerType=0 controllerNumber=1 scan=1000 TRUE
## ionMobilityDriftTime isolationWindowTargetMZ isolationWindowLowerOffset
## 1000 NA 683.08 1
## isolationWindowUpperOffset scanWindowLowerLimit scanWindowUpperLimit
## 1000 1 100 1380
head(peaks(ms, 1000))
## [,1] [,2]
## [1,] 104.5467 308.9326
## [2,] 104.5684 308.6961
## [3,] 108.8340 346.7183
## [4,] 109.3928 365.1236
## [5,] 110.0345 616.7905
## [6,] 110.0703 429.1975
plot(peaks(ms, 1000), type = "h")
Below we reproduce the example from the MSmap
function from the
MSnbase
package to plot a specific slice of the raw data using the
mzR
functions we have just described.
## a set of spectra of interest: MS1 spectra eluted
## between 30 and 35 minutes retention time
ms1 <- which(hd$msLevel == 1)
rtsel <- hd$retentionTime[ms1] / 60 > 30 &
hd$retentionTime[ms1] / 60 < 35
## the map
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd)
plot(M, aspect = 1, allTicks = FALSE)
plot3D(M)