The goal of the rpx package is to provide programmatic access to proteomics data from R, in particular to the ProteomeXchange (PX) central repository (see http://www.proteomexchange.org/ and http://central.proteomexchange.org/).
Vizcaino J.A. et al. ProteomeXchange: globally co-ordinated proteomics data submission and dissemination, Nature Biotechnology 2014, 32, 223-226, doi:10.1038/nbt.2839.
Additional repositories are likely to be added in the future.
PXDataset objects
The central object that handles data access is the PXDataset class. An instance can be generated by passing a valid PX experiment identifier to the PXDataset constructor.
library("rpx")
id <- "PXD000001"
px <- PXDataset(id)
px
## Object of class "PXDataset"
## Id: PXD000001 with 12 files
## [1] 'F063721.dat' ... [12] 'generated'
## Use 'pxfiles(.)' to see all files.
Several attributes can be extracted from a PXDataset instance, as described below.
The experiment identifier that was originally used to create the PXDataset instance can be extracted with the pxid method:
pxid(px)
## [1] "PXD000001"
The file transfer URL where the data files can be accessed can be queried with the pxurl method:
pxurl(px)
## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"
The species from which the data were generated can be obtained by calling the pxtax function:
pxtax(px)
## [1] "Erwinia carotovora"
Relevant bibliographic references can be queried with the pxref method:
strwrap(pxref(px))
## [1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\"> <html><head>"
## [2] "<title>301 Moved Permanently</title> </head><body> <h1>Moved"
## [3] "Permanently</h1> <p>The document has moved <a"
## [4] "href=\"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi/efetch.fcgi?db=pubmed&id=23692960&rettype=docsum&retmode=text\">here</a>.</p>"
## [5] "</body></html>"
All files available for the PX experiment can be obtained with the pxfiles method:
pxfiles(px)
## [1] "F063721.dat"
## [2] "F063721.dat-mztab.txt"
## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"
## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"
## [5] "PXD000001_mztab.txt"
## [6] "README.txt"
## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
## [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
## [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## [10] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"
## [11] "erwinia_carotovora.fasta"
## [12] "generated"
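To illustrate how the directory returned by pxurl relates to the file names returned by pxfiles, the sketch below builds per-file addresses by hand in base R. It assumes that each listed file sits directly under that directory, which may not hold for every data set; the names base_url, files and file_urls are ours and are not part of rpx.

```r
# Values copied from the pxurl(px) and pxfiles(px) output shown above.
base_url <- "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"
files <- c("README.txt", "erwinia_carotovora.fasta")

# Assumption: files live directly under the directory given by pxurl().
file_urls <- paste(base_url, files, sep = "/")
file_urls
```

In practice there is no need to assemble these addresses manually, since the package handles the download itself.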
The complete or partial data set can be downloaded with the pxget function. The function takes an instance of class PXDataset as its first, mandatory argument.
The next argument, list, specifies which files to download. If missing, a menu is printed and the user can select a file. If set to "all", all files of the experiment are downloaded to the working directory. Alternatively, numerics or logicals can also be used to subset the files to be downloaded, based on the pxfiles(.) output.
The last argument, force, can be set to TRUE to force the download of files that already exist in the working directory.
pxget(px, "erwinia_carotovora.fasta")
## Downloading 1 file
dir(pattern = "fasta")
## [1] "erwinia_carotovora.fasta"
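The different ways of designating files (by name, by position, or with a logical vector) can be mimicked on a plain character vector. The following is only a base R sketch of ordinary vector subsetting, using a shortened, hypothetical copy of the file list; it does not reproduce the internal code of pxget.

```r
# Three file names copied from the pxfiles(px) output (shortened list).
files <- c("F063721.dat", "README.txt", "erwinia_carotovora.fasta")

by_name    <- files[files %in% "erwinia_carotovora.fasta"]  # character selection
by_index   <- files[3]                                      # numeric selection
by_logical <- files[grepl("fasta", files)]                  # logical selection

# All three approaches designate the same file.
identical(by_name, by_index) && identical(by_index, by_logical)
```

Note that positions refer to the vector being subset: the fasta file is element 3 of this shortened list, whereas it is element 11 of the full pxfiles(px) output.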
By default, pxget will not download and overwrite a file that is already present in the working directory; set force to TRUE to download it again.
(i <- grep("fasta", pxfiles(px)))
## [1] 11
pxget(px, i) ## same as above
## Downloading 1 file
## /tmp/RtmpZTTXHZ/Rbuild2f437e61f66b/rpx/vignettes/erwinia_carotovora.fasta already present.
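The skip-unless-forced behaviour seen above can be sketched with base R alone. The helper get_file and the stand-in fake_fetch below are hypothetical and are not part of rpx; they only illustrate the general pattern of checking for an existing file before fetching it.

```r
# Hypothetical helper, not part of rpx: 'fetch' stands in for a real download.
get_file <- function(path, fetch, force = FALSE) {
  if (file.exists(path) && !force) {
    message(path, " already present.")
    return(invisible(FALSE))   # nothing was fetched
  }
  fetch(path)
  invisible(TRUE)              # file was (re)created
}

# Stand-in for a download: writes a placeholder file.
fake_fetch <- function(path) writeLines("placeholder", path)

tmp <- file.path(tempdir(), "example.fasta")
unlink(tmp)                                          # start from a clean state
first  <- get_file(tmp, fake_fetch)                  # file absent: fetched
second <- get_file(tmp, fake_fetch)                  # file present: skipped
third  <- get_file(tmp, fake_fetch, force = TRUE)    # forced: fetched again
c(first, second, third)
```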
Finally, a list of recent PX additions and updates can be obtained using the pxannounced() function:
pxannounced()
## 15 new ProteomeXchange annoucements
## Data.Set Publication.Data Message
## 1 PXD005834 2017-10-05 10:54:47 New
## 2 PXD007741 2017-10-05 07:14:37 New
## 3 PXD006857 2017-10-05 07:13:24 New
## 4 PXD003421 2017-10-04 16:44:47 New
## 5 PXD006303 2017-10-04 15:49:21 New
## 6 PXD006484 2017-10-04 13:05:39 New
## 7 PXD005215 2017-10-04 09:53:07 New
## 8 PXD006654 2017-10-04 09:39:25 New
## 9 PXD006467 2017-10-04 09:17:19 New
## 10 PXD007227 2017-10-04 09:07:37 New
## 11 PXD003177 2017-10-04 09:06:34 New
## 12 PXD006871 2017-10-04 08:19:10 New
## 13 PXD007207 2017-10-04 07:59:52 Updated information
## 14 PXD007864 2017-10-04 07:35:54 New
## 15 PXD007727 2017-10-04 07:34:08 New
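The printed table suggests a data.frame-like result, which would make it easy to keep only the newly added data sets. The sketch below assumes that structure and those column names (Data.Set, Publication.Data, Message), inferred from the output above rather than from the package documentation, and rebuilds three of the printed rows by hand so that it runs without network access.

```r
# Three rows copied from the output above; assumed to mirror the structure
# of the object returned by pxannounced().
ann <- data.frame(
  Data.Set = c("PXD006871", "PXD007207", "PXD007864"),
  Publication.Data = c("2017-10-04 08:19:10",
                       "2017-10-04 07:59:52",
                       "2017-10-04 07:35:54"),
  Message = c("New", "Updated information", "New"),
  stringsAsFactors = FALSE
)

# Keep identifiers of new submissions, dropping updated entries.
new_ids <- ann$Data.Set[ann$Message == "New"]
new_ids
```

If the assumption holds, the same subsetting could be applied to the live result by replacing the hand-built data.frame with ann <- pxannounced().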
Below, we show how to automate the extraction of files of interest (fasta and mzTab files), download them, and read them using appropriate Bioconductor infrastructure. (Note that we read version 0.9 of the mzTab format below. For recent data, the version argument would be omitted.)
(mzt <- grep("F0.+mztab", pxfiles(px), value = TRUE))
## [1] "F063721.dat-mztab.txt"
(fas <- grep("fasta", pxfiles(px), value = TRUE))
## [1] "erwinia_carotovora.fasta"
pxget(px, c(mzt, fas))
## Downloading 2 files
## /tmp/RtmpZTTXHZ/Rbuild2f437e61f66b/rpx/vignettes/erwinia_carotovora.fasta already present.
library("Biostrings")
readAAStringSet(fas)
## A AAStringSet instance of length 4499
## width seq names
## [1] 147 MADITLISGSTLGSAEYVA...QIPEDPAEEWLGSWVNLLK ECA0001 putative ...
## [2] 153 VAEIYQIDNLDRGILSALM...QSTETLISLQNPIMRTIAP ECA0002 AsnC-fami...
## [3] 330 MKKQYIEKQQQISFVKSFF...QVQCGVWPQPLRESVSGLL ECA0003 putative ...
## [4] 492 MITLESLEMLLSIDENELL...FDTGLKSRLMRRWQHGKAY ECA0004 conserved...
## [5] 499 MRQTAALAERISRLSHALE...IEASLQQVAEQIQQSEQQD ECA0005 conserved...
## ... ... ...
## [4495] 634 MSDKIIHLTDDSFDTDVLK...KVDPLRVFASDMARRLELL trx-rv3790 trx-rv...
## [4496] 93 MTKMNNKARRTARELKHLG...LRDEFPMGYLGDYKDDDDK TimBlower TimBlower
## [4497] 309 MFSNLSKRWAQRTLSKSFY...KWAGIKTRKFVFNPPKPRK sp|P07143|CY1_YEA...
## [4498] 231 FPTDDDDKIVGGYTCAANS...VYTKVCNYVNWIQQTIAAN sp|P00761|TRYP_PI...
## [4499] 269 GVSGSCNIDVVCPEGNGHR...AGTGAQFIDGLDSTGTPPV sp|Q7M135|LYSC_LY...
library("MSnbase")
(x <- readMzTabData(mzt, "PEP", version = "0.9"))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1528 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: sub[1] sub[2] ... sub[6] (6 total)
## varLabels: abundance
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 1528 (1528 total)
## fvarLabels: sequence accession ... uri (14 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## mzTab read: Fri Oct 6 01:58:45 2017
## MSnbase version: 2.2.0
head(exprs(x))
## sub[1] sub[2] sub[3] sub[4] sub[5] sub[6]
## 1 10630132 11238708 12424917 10997763 9928972 10398534
## 2 11105690 12403253 13160903 12229367 11061660 10131218
## 3 1183431 1322371 1599088 1243715 1306602 1159064
## 4 5384958 5508454 6883086 6136023 5626680 5213771
## 5 18033537 17926487 21052620 19810368 17381162 17268329
## 6 9873585 10299931 11142071 10258214 9664315 9518271
head(fData(x)[, 1:2])
## sequence accession
## 1 DGVSVAR ECA0625
## 2 NVVLDK ECA0625
## 3 VEDALHATR ECA0625
## 4 LAGGVAVIK ECA0625
## 5 LIAEAMEK ECA0625
## 6 SFGAPTITK ECA0625
Either post questions on the Bioconductor support forum or open a GitHub issue.
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] rpx_1.12.3 MSnbase_2.2.0 ProtGenerics_1.8.0
## [4] BiocParallel_1.10.1 mzR_2.10.0 Rcpp_0.12.13
## [7] Biobase_2.36.2 Biostrings_2.44.2 XVector_0.16.0
## [10] IRanges_2.10.4 S4Vectors_0.14.6 BiocGenerics_0.22.1
## [13] BiocStyle_2.4.1
##
## loaded via a namespace (and not attached):
## [1] BiocInstaller_1.26.1 compiler_3.4.2 plyr_1.8.4
## [4] bitops_1.0-6 iterators_1.0.8 tools_3.4.2
## [7] zlibbioc_1.22.0 MALDIquant_1.16.4 digest_0.6.12
## [10] evaluate_0.10.1 tibble_1.3.4 preprocessCore_1.38.1
## [13] gtable_0.2.0 lattice_0.20-35 rlang_0.1.2
## [16] foreach_1.4.3 curl_2.8.1 yaml_2.1.14
## [19] xml2_1.1.1 stringr_1.2.0 knitr_1.17
## [22] rprojroot_1.2 grid_3.4.2 impute_1.50.1
## [25] XML_3.98-1.9 rmarkdown_1.6 limma_3.32.8
## [28] ggplot2_2.2.1 magrittr_1.5 backports_1.1.1
## [31] scales_0.5.0 pcaMethods_1.68.0 codetools_0.2-15
## [34] htmltools_0.3.6 mzID_1.14.0 colorspace_1.3-2
## [37] affy_1.54.0 stringi_1.1.5 RCurl_1.95-4.8
## [40] doParallel_1.0.11 lazyeval_0.2.0 munsell_0.4.3
## [43] vsn_3.44.0 affyio_1.46.0