Using MSGFplus

Thomas Lin Pedersen

August 1st 2014


This document describe how to use the MSGFplus package to perform MS-GF+ analyses on liquid-chromatography tandem mass-spectroscopy (LC-MS/MS) data. It will walk you through creating a parameter set and initialising the MS-GF+ analysis in different ways.

The different parameters needed to run MS-GF+ will not be discussed here as they are well documented on the MS-GF+ webpage.

Introduction

One of the most ubiquitous steps of modern proteomics is the detection of peptides from MS/MS fragmentation data. Usually the results of such an analysis will be used to infer the presence and possibly the quantity of proteins in your starting material. Automatic identification of peptides from LC-MS/MS experiments was made possible with the SEQUEST algorithm from 1994, but the process has constantly been refined and improved. Currently there exists a range of different algorithms for performing the identification task, all with strengths and weaknesses, and MS-GF+ is an increasingly popular one of them. This package makes it possible to run MS-GF+ directly from within R, but is not a reimplementation of the algorithm. Underneath it all it is still the same java code performing the analysis.

Related in functionality to this package is rTANDEM, which provides an R interface for the X! Tandem algorithm, in much the same way as this package does for MS-GF+. As stated above each algorithm can have an upper hand on certain kind of data and the consensus is increasingly to use multiple algorithms on your data and combine the results. Having multiple algorithms available from R is only making this easier. Other packages in the Bioconductor project are concerned with other aspects of the proteomic workflow and interested readers can get an overview in the proteomics package directory.

Data in the package

As LC-MS/MS data tend to be very big, example files are not included in the package. If you do not have access to any LC-MS/MS data but wish to experiment with the package, raw data can be obtained from different data repositories (e.g. PeptideAtlas). The only included data is a very short fasta file (milk-proteins.fasta), that contains the 3 caseins and 2 whey proteins that make up the bulk of proteins in milk.

Creating a parameter set

The cornerstone of the MSGFplus package is the msgfPar class. Instances of this class can be created in a number of ways, and after creation they can be modified in many ways.

Building a parameter set iteratively

To create an empty msgfPar object use the eponymous creator function:

par <- msgfPar()
show(par)
## An empty msgfPar object

In general parameters that are not set will use the default value. As a minimum it is needed to specify the fasta file used as a database during peptide search

databaseFile <- system.file('extdata', 'milk-proteins.fasta', package='MSGFplus')
db(par) <- databaseFile

Usually the default parameters aren’t a good match (at least not all of them) and blindly running an analysis with the default values are not recommended. All the different parameters can be accessed and modified using relevant setter and getter methods. In the following they are all illustrated:

tolerance(par) <- '20 ppm'      # Set parent ion tolerance
chargeRange(par) <- c(2, 6)     # Set the range of charge states to look after
lengthRange(par) <- c(6, 25)    # Set the range of peptide length to look after
instrument(par) <- 'QExactive'  # Set the instrument used for acquisition
enzyme(par) <- 'Trypsin'        # Set the enzyme used for digestion
fragmentation(par) <- 0         # Set the fragmentation method
protocol(par) <- 0              # Set the protocol type
isotopeError(par) <- c(0,2)     # Set the isotope error
matches(par) <- 2               # Set the number of matches to report per scan
ntt(par) <- 1                   # Set number of tolerable termini
tda(par) <- TRUE                # Use target decoy approach

par
## An msgfPar object
## 
## Database:                   /tmp/RtmpZ0tSVo/Rinst6edf501adcfb/MSGFplus/extdata/milk-proteins.fasta
## Tolerance:                  20 ppm 
## Isotope error range:        0 - 2 
## TDA:                        TRUE 
## Fragmentation:              0:   As written in the spectrum or CID if no info
## Instrument:                 3:   QExactive
## Enzyme:                     1:   Trypsin
## Protocol:                   0:   No protocol
## No. tolerable termini:      1 
## Peptide length:             6 - 25 
## Charges:                    2 - 6 
## Number of reported matches: 2

As can be seen the show method of the object gives a clear overview over the parameters that have been set.

The last parameter that can be added are modifications to expect in your data. The syntax for this is a bit different as the nature of the parameter is more complex:

mods(par)[[1]] <- msgfParModification(name = 'Carbamidomethyl', 
                                      composition = 'C2H3N1O1', 
                                      residues = 'C', 
                                      type = 'opt', 
                                      position = 'any')
mods(par)[[2]] <- msgfParModification(name = 'Oxidation', 
                                      mass = 15.994915, 
                                      residues = 'M', 
                                      type = 'opt', 
                                      position = 'any')
nMod(par) <- 2                  # Set max number of modifications per peptide

par
## An msgfPar object
## 
## Database:                   /tmp/RtmpZ0tSVo/Rinst6edf501adcfb/MSGFplus/extdata/milk-proteins.fasta
## Tolerance:                  20 ppm 
## Isotope error range:        0 - 2 
## TDA:                        TRUE 
## Fragmentation:              0:   As written in the spectrum or CID if no info
## Instrument:                 3:   QExactive
## Enzyme:                     1:   Trypsin
## Protocol:                   0:   No protocol
## No. tolerable termini:      1 
## Peptide length:             6 - 25 
## Charges:                    2 - 6 
## Number of reported matches: 2 
## 
## Modifications:
## 
## Number of modifications per peptide: 2
## 
## Carbamidomethyl: C2H3N1O1, C, opt, any
## Oxidation:   15.99492, M, opt, any

Other ways to define parameters

Apart from building up the parameters iteratively, they can also be specified at creation time.

par <- msgfPar(database = databaseFile, 
               tolerance = '20 ppm', 
               tda=TRUE,
               instrument='QExactive')

par
## An msgfPar object
## 
## Database:                   /tmp/RtmpZ0tSVo/Rinst6edf501adcfb/MSGFplus/extdata/milk-proteins.fasta
## Tolerance:                  20 ppm 
## TDA:                        TRUE 
## Instrument:                 3:   QExactive

A third method is to read parameter data from a result file generated by MS-GF+. This makes it easy to quickly replicate the parameter used for a certain search in order to compare results.

par <- msgfParFromID('/path/to/results/file.mzid')

par

The last method of setting up a parameter set is with a simple gui that is pretty much self-explanatory:

require(gWidgets)

par <- msgfParGUI()

Running MS-GF+

When a parameter set has been defined to your likening you can start an MS-GF+ analysis in two ways, depending on your aim.

Running MS-GF+ in batch mode

To start an MS-GF+ run for one or several raw data files the method runMSGF() is used:

res <- runMSGF(par, 'your_rawfile.mzML')

If a vector of file paths are provided, these will be run in succession. By default result files are written besides the original rawfiles with an mzid extension instead of their original extension (silently overriding exiting files). Alternatively a vector of the same length as the number of raw files can be provided to use as save names.

The results are reimported into R as either an mzID or mzIDCollection object, depending on the number of raw files. If importing is not desired it can be avoided with import=FALSE.

Running MS-GF+ asynchronously

If async=TRUE is set in runMSGF() an asynchronous run of MS-GF+ is started. In contrast to running in batch mode, only one file at a time can be run asynchronously and as such it is mostly useful for embedding in code, where you want to use the time when MS-GF+ is running to perform some other tasks in R. When running asynchronously runMSGF() returns an object of class msgfAsync. This object is used to check whether MS-GF+ has finished analysing your code, and in the case that it has finished, import the results.

msgf <- runMSGF(par, 'your_rawfile.mzML', async=TRUE)

while(running(msgf)) {
    Sys.sleep(1)             # You could arguably do more meaningfull stuff here
}
if(finished(msgf)) {
    res <- import(msgf)
}

Running MS-GF+ from a GUI

GUI’s are so nice that it got it’s own package. Check out MSGFgui (also part of Bioconductor) if that is more to your taste. Besides running analysis it got a whole range of ways to investigate your results.

Session

sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MSGFplus_1.6.2
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5        codetools_0.2-14   XML_3.98-1.4      
##  [4] digest_0.6.9       foreach_1.4.3      plyr_1.8.3        
##  [7] mzID_1.10.2        formatR_1.4        magrittr_1.5      
## [10] evaluate_0.9       stringi_1.0-1      doParallel_1.0.10 
## [13] rmarkdown_0.9.6    iterators_1.0.8    tools_3.3.0       
## [16] stringr_1.0.0      ProtGenerics_1.4.0 parallel_3.3.0    
## [19] yaml_2.1.13        htmltools_0.3.5    knitr_1.13