R version: R version 4.1.1 (2021-08-10)
Bioconductor version: 3.14
Package version: 1.18.0
The follow packages will be used throughout this documents. R version
3.3.1
or higher is required to install all the packages using
BiocManager::install
.
library("mzR")
library("mzID")
library("MSnID")
library("MSnbase")
library("rpx")
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
library("rols")
library("hpar")
The most convenient way to install all the tutorials requirement (and more related content), is to install RforProteomics with all its dependencies.
library("BiocManager")
BiocManager::install("RforProteomics", dependencies = TRUE)
This workflow illustrates R / Bioconductor infrastructure for proteomics. Topics covered focus on support for open community-driven formats for raw data and identification results, packages for peptide-spectrum matching, data processing and analysis:
Links to other packages and references are also documented. In particular, the vignettes included in the RforProteomics package also contains relevant material.
In Bioconductor version 3.14, there are respectively 163
proteomics,
118
mass spectrometry software packages
and 25
mass spectrometry experiment packages. These
respective packages can be extracted with the proteomicsPackages()
,
massSpectrometryPackages()
and massSpectrometryDataPackages()
and
explored interactively.
library("RforProteomics")
pp <- proteomicsPackages()
display(pp)
Most community-driven formats are supported in R
, as detailed in the
table below.
MS-based proteomics data is disseminated through the ProteomeXchange infrastructure, which centrally coordinates submission, storage and dissemination through multiple data repositories, such as the PRIDE data base at the EBI for MS/MS experiments, PASSEL at the ISB for SRM data and the MassIVE resource. The rpx is an interface to ProteomeXchange and provides a basic access to PX data.
library("rpx")
pxannounced()
## The google group rss feeds are currenlty down (see
## https://github.com/lgatto/rpx/issues/7 for details). Please visit
## https://groups.google.com/g/proteomexchange for the latest annoucements
Using the unique PXD000001
identifier, we can retrieve the relevant
metadata that will be stored in a PXDataset
object. The names of the
files available in this data can be retrieved with the pxfiles
accessor function.
px <- PXDataset("PXD000001")
## Loading PXD000001 from cache.
px
## Project PXD000001 with 10 files
##
## Resource ID BFC6 in cache in /home/biocbuild/.cache/R/rpx.
## [1] 'erwinia_carotovora.fasta' ... [10] 'PRIDE_Exp_Complete_Ac_22134.pride.mgf.gz'
## Use 'pxfiles(.)' to see all files.
pxfiles(px)
## Project PXD000001 files (10):
## [local] erwinia_carotovora.fasta
## [remote] F063721.dat
## [local] F063721.dat-mztab.txt
## [remote] PRIDE_Exp_Complete_Ac_22134.xml.gz
## [remote] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw
## [local] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## [remote] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML
## [local] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## [remote] PRIDE_Exp_Complete_Ac_22134.pride.mztab.gz
## [remote] PRIDE_Exp_Complete_Ac_22134.pride.mgf.gz
Other metadata for the px
data set:
pxtax(px)
## [1] "Erwinia carotovora"
pxurl(px)
## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"
pxref(px)
## [1] "Gatto L, Christoforou A; Using R and Bioconductor for proteomics data analysis., Biochim Biophys Acta, 2013 May 18, doi:10.1016/j.bbapap.2013.04.032 PMID:23692960"
Data files can then be downloaded with the pxget
function. Below, we
retrieve the raw data file. The file is downloaded in the working
directory and the name of the file is return by the function and
stored in the mzf
variable for later use.
fn <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
mzf <- pxget(px, fn)
## Loading TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML from cache.
mzf
## [1] "/home/biocbuild/.cache/R/rpx/24bb4e50562623_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
The mzR
package provides an interface to the
proteowizard C/C++ code base
to access various raw data files, such as mzML
, mzXML
, netCDF
,
and mzData
. The data is accessed on-disk, i.e it is not loaded
entirely in memory by default but only when explicitly requested. The
three main functions are openMSfile
to create a file handle to a raw
data file, header
to extract metadata about the spectra contained in
the file and peaks
to extract one or multiple spectra of
interest. Other functions such as instrumentInfo
, or runInfo
can
be used to gather general information about a run.
Below, we access the raw data file downloaded in the previous section and open a file handle that will allow us to extract data and metadata of interest.
library("mzR")
ms <- openMSfile(mzf)
ms
## Mass Spectrometry file handle.
## Filename: 24bb4e50562623_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## Number of scans: 7534
The header
function returns the metadata of all available peaks:
hd <- header(ms)
dim(hd)
## [1] 7534 31
names(hd)
## [1] "seqNum" "acquisitionNum"
## [3] "msLevel" "polarity"
## [5] "peaksCount" "totIonCurrent"
## [7] "retentionTime" "basePeakMZ"
## [9] "basePeakIntensity" "collisionEnergy"
## [11] "ionisationEnergy" "lowMZ"
## [13] "highMZ" "precursorScanNum"
## [15] "precursorMZ" "precursorCharge"
## [17] "precursorIntensity" "mergedScan"
## [19] "mergedResultScanNum" "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum" "injectionTime"
## [23] "filterString" "spectrumId"
## [25] "centroided" "ionMobilityDriftTime"
## [27] "isolationWindowTargetMZ" "isolationWindowLowerOffset"
## [29] "isolationWindowUpperOffset" "scanWindowLowerLimit"
## [31] "scanWindowUpperLimit"
We can extract metadata and scan data for scan 1000 as follows:
hd[1000, ]
## seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent
## 1000 1000 1000 2 1 274 1048554
## retentionTime basePeakMZ basePeakIntensity collisionEnergy
## 1000 1106.916 136.061 164464 45
## ionisationEnergy lowMZ highMZ precursorScanNum precursorMZ
## 1000 0 104.5467 1370.758 992 683.0817
## precursorCharge precursorIntensity mergedScan mergedResultScanNum
## 1000 2 689443.7 NA NA
## mergedResultStartScanNum mergedResultEndScanNum injectionTime
## 1000 NA NA 55.21463
## filterString
## 1000 FTMS + p NSI d Full ms2 683.08@hcd45.00 [100.00-1380.00]
## spectrumId centroided
## 1000 controllerType=0 controllerNumber=1 scan=1000 TRUE
## ionMobilityDriftTime isolationWindowTargetMZ isolationWindowLowerOffset
## 1000 NA 683.08 1
## isolationWindowUpperOffset scanWindowLowerLimit scanWindowUpperLimit
## 1000 1 100 1380
head(peaks(ms, 1000))
## [,1] [,2]
## [1,] 104.5467 308.9326
## [2,] 104.5684 308.6961
## [3,] 108.8340 346.7183
## [4,] 109.3928 365.1236
## [5,] 110.0345 616.7905
## [6,] 110.0703 429.1975
plot(peaks(ms, 1000), type = "h")
Below we reproduce the example from the MSmap
function from the
MSnbase
package to plot a specific slice of the raw data using the
mzR
functions we have just described.
## a set of spectra of interest: MS1 spectra eluted
## between 30 and 35 minutes retention time
ms1 <- which(hd$msLevel == 1)
rtsel <- hd$retentionTime[ms1] / 60 > 30 &
hd$retentionTime[ms1] / 60 < 35
## the map
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd)
plot(M, aspect = 1, allTicks = FALSE)
plot3D(M)