R version: R version 3.6.0 (2019-04-26)
Bioconductor version: 3.9
Package version: 1.8.0
The follow packages will be used throughout this documents. R version
3.3.1
or higher is required to install all the packages using
BiocManager::install
.
library("mzR")
library("mzID")
library("MSnID")
library("MSnbase")
library("rpx")
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
library("MSGFplus")
library("rols")
library("hpar")
The most convenient way to install all the tutorials requirement (and more related content), is to install RforProteomics with all its dependencies.
library("BiocManager")
BiocManager::install("RforProteomics", dependencies = TRUE)
Other packages of interest, such as rTANDEM or MSGFgui will be described later in the document but are not required to execute the code in this workflow.
This workflow illustrates R / Bioconductor infrastructure for proteomics. Topics covered focus on support for open community-driven formats for raw data and identification results, packages for peptide-spectrum matching, data processing and analysis:
Links to other packages and references are also documented. In particular, the vignettes included in the RforProteomics package also contains relevant material.
In Bioconductor version 3.9, there are respectively 128
proteomics,
85
mass spectrometry software packages
and 22
mass spectrometry experiment packages. These
respective packages can be extracted with the proteomicsPackages()
,
massSpectrometryPackages()
and massSpectrometryDataPackages()
and
explored interactively.
library("RforProteomics")
pp <- proteomicsPackages()
display(pp)
Most community-driven formats are supported in R
, as detailed in the
table below.
Type | Format | Package |
---|---|---|
raw | mzML, mzXML, netCDF, mzData | mzR (read) |
identification | mzIdentML | mzR (read) and mzID (read) |
quantitation | mzQuantML | |
peak lists | mgf | MSnbase (read/write) |
other | mzTab | MSnbase (read) |
MS-based proteomics data is disseminated through the ProteomeXchange infrastructure, which centrally coordinates submission, storage and dissemination through multiple data repositories, such as the PRIDE data base at the EBI for MS/MS experiments, PASSEL at the ISB for SRM data and the MassIVE resource. The rpx is an interface to ProteomeXchange and provides a basic access to PX data.
library("rpx")
pxannounced()
## 15 new ProteomeXchange annoucements
## Data.Set Publication.Data Message
## 1 PXD010862 2019-05-06 14:47:47 New
## 2 PXD013015 2019-05-06 14:34:16 New
## 3 PXD013546 2019-05-06 14:27:06 New
## 4 PXD010307 2019-05-06 14:20:15 New
## 5 PXD013397 2019-05-06 14:15:20 New
## 6 PXD010168 2019-05-06 14:13:43 New
## 7 PXD012877 2019-05-06 14:03:00 New
## 8 PXD011588 2019-05-06 13:42:46 New
## 9 PXD012379 2019-05-06 13:31:53 New
## 10 PXD013425 2019-05-06 13:26:35 New
## 11 PXD013702 2019-05-06 13:24:59 New
## 12 PXD010791 2019-05-05 02:13:41 New
## 13 PXD010790 2019-05-05 00:29:31 New
## 14 PXD010959 2019-05-04 10:35:53 New
## 15 PXD013097 2019-05-04 10:24:28 New
Using the unique PXD000001
identifier, we can retrieve the relevant
metadata that will be stored in a PXDataset
object. The names of the
files available in this data can be retrieved with the pxfiles
accessor function.
px <- PXDataset("PXD000001")
px
## Object of class "PXDataset"
## Id: PXD000001 with 12 files
## [1] 'F063721.dat' ... [12] 'generated'
## Use 'pxfiles(.)' to see all files.
pxfiles(px)
## [1] "F063721.dat"
## [2] "F063721.dat-mztab.txt"
## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"
## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"
## [5] "PXD000001_mztab.txt"
## [6] "README.txt"
## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
## [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
## [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## [10] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"
## [11] "erwinia_carotovora.fasta"
## [12] "generated"
Other metadata for the px
data set:
pxtax(px)
## [1] "Erwinia carotovora"
pxurl(px)
## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"
pxref(px)
## [1] "Gatto L, Christoforou A. Using R and Bioconductor for proteomics data analysis. Biochim Biophys Acta. 2014 1844(1 pt a):42-51"
Data files can then be downloaded with the pxget
function. Below, we
retrieve the raw data file. The file is downloaded in the working
directory and the name of the file is return by the function and
stored in the mzf
variable for later use.
fn <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
mzf <- pxget(px, fn)
## Downloading 1 file
mzf
## [1] "/tmp/RtmpM7a8cs/Rbuild68e538add17d/proteomics/vignettes/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
The mzR
package provides an interface to the
proteowizard C/C++ code base
to access various raw data files, such as mzML
, mzXML
, netCDF
,
and mzData
. The data is accessed on-disk, i.e it is not loaded
entirely in memory by default but only when explicitly requested. The
three main functions are openMSfile
to create a file handle to a raw
data file, header
to extract metadata about the spectra contained in
the file and peaks
to extract one or multiple spectra of
interest. Other functions such as instrumentInfo
, or runInfo
can
be used to gather general information about a run.
Below, we access the raw data file downloaded in the previous section and open a file handle that will allow us to extract data and metadata of interest.
library("mzR")
ms <- openMSfile(mzf)
ms
## Mass Spectrometry file handle.
## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## Number of scans: 7534
The header
function returns the metadata of all available peaks:
hd <- header(ms)
dim(hd)
## [1] 7534 29
names(hd)
## [1] "seqNum" "acquisitionNum"
## [3] "msLevel" "polarity"
## [5] "peaksCount" "totIonCurrent"
## [7] "retentionTime" "basePeakMZ"
## [9] "basePeakIntensity" "collisionEnergy"
## [11] "ionisationEnergy" "lowMZ"
## [13] "highMZ" "precursorScanNum"
## [15] "precursorMZ" "precursorCharge"
## [17] "precursorIntensity" "mergedScan"
## [19] "mergedResultScanNum" "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum" "injectionTime"
## [23] "filterString" "spectrumId"
## [25] "centroided" "ionMobilityDriftTime"
## [27] "isolationWindowTargetMZ" "isolationWindowLowerOffset"
## [29] "isolationWindowUpperOffset"
We can extract metadata and scan data for scan 1000 as follows:
hd[1000, ]
## seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent
## 1000 1000 1000 2 1 274 1048554
## retentionTime basePeakMZ basePeakIntensity collisionEnergy
## 1000 1106.916 136.061 164464 45
## ionisationEnergy lowMZ highMZ precursorScanNum precursorMZ
## 1000 0 104.5467 1370.758 992 683.0817
## precursorCharge precursorIntensity mergedScan mergedResultScanNum
## 1000 2 689443.7 0 0
## mergedResultStartScanNum mergedResultEndScanNum injectionTime
## 1000 0 0 55.21463
## filterString
## 1000 FTMS + p NSI d Full ms2 683.08@hcd45.00 [100.00-1380.00]
## spectrumId centroided
## 1000 controllerType=0 controllerNumber=1 scan=1000 TRUE
## ionMobilityDriftTime isolationWindowTargetMZ isolationWindowLowerOffset
## 1000 NA 683.08 1
## isolationWindowUpperOffset
## 1000 1
head(peaks(ms, 1000))
## [,1] [,2]
## [1,] 104.5467 308.9326
## [2,] 104.5684 308.6961
## [3,] 108.8340 346.7183
## [4,] 109.3928 365.1236
## [5,] 110.0345 616.7905
## [6,] 110.0703 429.1975
plot(peaks(ms, 1000), type = "h")
Below we reproduce the example from the MSmap
function from the
MSnbase
package to plot a specific slice of the raw data using the
mzR
functions we have just described.
## a set of spectra of interest: MS1 spectra eluted
## between 30 and 35 minutes retention time
ms1 <- which(hd$msLevel == 1)
rtsel <- hd$retentionTime[ms1] / 60 > 30 &
hd$retentionTime[ms1] / 60 < 35
## the map
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd)
plot(M, aspect = 1, allTicks = FALSE)
plot3D(M)