This document is intended to guide the user through the different aspects of using MSGFgui to perform peptide identification from raw LC-MS/MS data and investigate the results. Note that this document does not describe the inner workings of MS-GF+, the algorithm that performs the identification. Users interested in the nitty-gritty of the algorithm should see the MS-GF+ webpage.
This package comes with a sister package, MSGFplus, that provides the interface to the original MS-GF+ Java code. If you wish to implement MS-GF+ analysis within your own functions or packages, MSGFplus provides a command-line interface to MS-GF+. This package, conversely, only provides a GUI overlay and a symphony of visualisations coded in JavaScript (using D3.js) that cannot be accessed through R code.
MSGFgui is being maintained at its GitHub repository, where bug reports and feature requests are happily accepted.
Well, obviously you need R, but let's assume you have that covered. The main point is that working with proteomics data puts some strain on your system. I don’t want to throw you a list of specs you should compare your system to, but do know that the kind of analysis facilitated by the GUI is best suited for a workstation-class system. That means lots of memory and a multitude of cores. That said, there is nothing about the GUI itself that requires a mighty machine, so if you have some small (~200 MB) raw files you’ll have no problem playing around with the GUI on a decent laptop.
WARNING: Due to an unfortunate incompatibility between shiny, mzR/Rcpp and RStudio, running MSGFgui through RStudio will cause the R session to crash as soon as raw data is accessed (when retrieving raw MS/MS scans). This is not a problem when running R in the standard way. The problem will hopefully be addressed soon, at which point this warning will disappear.
As this package is all about a graphical user interface, it exposes very few functions to the user (only two). The one you will probably use most often is MSGFgui() (the other one will be discussed in a bit). In all its simplicity, the GUI is started from the R terminal as such:
# Standard fashion
MSGFgui()
# You can pass parameters along to shiny's runApp(),
# e.g. the host address to listen on (port expects a number, not an address)
MSGFgui(host='0.0.0.0')
This will open MSGFgui in your default browser. While the GUI is running, the R session is blocked. To regain control of R (and shut down MSGFgui), press the ‘esc’ key.
Once the GUI has opened you will be greeted by an interface split in two: the left side aids you in selecting data files and setting the parameters for an MS-GF+ analysis. The right side lets you explore the results of the analysis as well as export them to different formats.
It should be noted that it is not necessary to run MS-GF+ through the GUI in order to use its exploratory features. Result files generated using MSGFplus or bare-bones MS-GF+ can be imported, provided the raw data file is still present alongside the result file. This makes it possible to reimport older analyses, for instance for comparison. Do note that the GUI only supports results from MS-GF+ and not from other peptide identification software, no matter which output format they support. Trying to import other result files will not crash the GUI but will be met with an alert during import.
MSGFgui makes it easy for people uncomfortable with command line interfaces to run MS-GF+ analyses. The benefits for others include easy batch-job creation and instant parameter documentation lookup. Everything related to running MS-GF+ is located on the left-hand side, and is split into file selection and parameter setup.
File selection
In order to run an analysis, two types of files are needed. The obvious one is raw LC-MS/MS data files. MS-GF+ supports most open MS data file formats but encourages the use of mzML. mzML files can be created from proprietary vendor formats using msconvert.
To add a raw data file, simply click on the topmost ‘Upload’ button and navigate to the file. Batch jobs are created simply by selecting multiple files. In addition, a fasta file containing the proteins expected to be in your samples is needed. This is selected by clicking the bottommost ‘Upload’ button. Note that the fasta file should only contain the expected proteins. MS-GF+ creates its own decoy database from this file.
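To illustrate why the fasta file should contain only the expected (target) proteins, here is a minimal sketch of how a concatenated target-decoy database can be built by reversing each sequence. This is not MS-GF+'s actual code: the helper make_decoys(), the "XXX_" prefix and the two short sequences are assumptions made up for the example.

```r
# Illustrative only: build reversed-sequence decoys for a named vector of proteins.
# `make_decoys` and the "XXX_" prefix are hypothetical, not part of MSGFgui or MSGFplus.
make_decoys <- function(proteins, prefix = "XXX_") {
  reversed <- vapply(
    proteins,
    # split into single residues, reverse them, and paste back together
    function(s) paste(rev(strsplit(s, "")[[1]]), collapse = ""),
    character(1),
    USE.NAMES = FALSE
  )
  names(reversed) <- paste0(prefix, names(proteins))
  reversed
}

# Made-up target sequences, for demonstration only
targets <- c(P1 = "MKWVTFISLL", P2 = "MALWMRLLPL")

# The searched database is the targets followed by their decoys
database <- c(targets, make_decoys(targets))
names(database)
```

If the uploaded file already contained decoy entries, they would be treated as ordinary targets in a scheme like this, which would likely distort the resulting error estimates.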
Parameters
Below the file selection area is a list of all parameters that can be set in order to fine-tune the analysis. A full description of all parameters is available in the MS-GF+ documentation, but more or less the same information is also available as tooltips when the user hovers over the name of the parameter.
The parameters are checked automatically, meaning that it should be virtually impossible to supply erroneous parameters to the analysis. The Analyse button is simply not active if the parameters don’t conform, and the violating values will be marked with a red border.
The parameters have a huge influence on the quality of the analysis, so don’t fill them in blindly. Unless you are continuously analysing samples from the same setup, try to experiment with different sensible values until you have reached the optimal setup.
Running
Once everything is filled out to your liking, it is time to start the analysis. This is done by clicking the aptly named ‘Analyse’ button in the lower right corner of the pane. If the button is not clickable, it means you have not supplied the parameters in the right format (see above), so revisit those and fix the errors. Once the analysis is running, a progress bar will inform you about how things are proceeding. Results are imported as they become available, but you will experience lag in some operations if you try to interact with the results before the full batch is done.
Once raw data have been analysed or old results imported it is time to investigate them. MSGFgui tries to distance itself from the usual “provide a table with results and some plots upon double clicking rows”, and tries to implement a more fluid and natural interaction with the identification data.
The first pane, and the one visible at startup, concerns itself with communicating the overall results in a concise way. It contains information such as: number of samples, number of scans, number of identified peptides and proteins etc. as well as overall quality statistics.
The main part of the pane is allocated to two plots. One plot shows the distribution of score values for peptide-spectrum matches (PSMs), divided between real (green) and decoy (red) hits. The other plot shows the placement of parent ions from accepted PSMs in a scatter plot. Both of these plots respond to sample selection in the list below them. If multiple samples are selected, the plots will show the union of the selected samples. Beside the sample selection is a list of numeric statistics pertaining to the current selection. The different statistics are rather self-explanatory, but it should be noted that, with the exception of PSMs, all numbers are for the filtered data (see the discussion of the filter pane below). For PSMs both filtered and unfiltered numbers are given.
The second pane is all about investigating the nature of the identification. It is organised in a protein-centric way, from the belief that most users will primarily be interested in the data in a top-down manner (from protein to peptide to scan).
Proteins
Selecting a protein from the leftmost list will result in a visualisation of the protein swooshing in from the left. The visualisation shows the full length of the protein with identified peptides shown above it according to their position on the protein. Peptide identifications that pass the current filter are shown in green, while those that don’t are shown in grey. In the centre of the visualisation, general information about the protein is shown. This information relies on a properly annotated database file having been used during analysis.
Peptides
The centre list shows all peptides for the selected protein that pass the filter. Selecting one will highlight the selected peptide in the visualisation and dim the rest. Furthermore, the protein information in the middle of the visualisation is substituted for a representation of the residues making up the peptide sequence. The flanking residues are shown in grey, and modifications, if any are present in the peptide, are shown at their corresponding residues.
Scans
The rightmost list shows all scans that resulted in a match to the selected peptide and that pass the filter. Clicking on one of them will move the protein arc out of the way to give room for an annotated spectrum plot. Fragment ions will be shown in black with the type on top and the parent ion, if observed, in red. The rest of the ions will be grey. The peptide sequence will still be visible but is now overlaid with the observed fragmentation points, as well as which types of ions support the specific fragmentation. Lastly, a small plot in the upper right corner shows the peak that the parent ion is part of, along with the position of the parent ion and other sampling points if present.
The last tab pane contains a wide variety of options for filtering the identifications. The filtering options are divided horizontally into options related to quality, scan information and peptide information.
At the top, the only quality-related filtering option is the FDR cutoff. By default this is set to 0.01, as that is the de facto standard in proteomics. It should be noted that the FDR cutoff is dumb in the sense that it relies on the q-values calculated by MS-GF+. This means that it does not take into account any filtering that is happening in the GUI. Following this is a range of options for trimming down the scans, either by only looking at specific samples, retention times, m/z values or charges. At the bottom are the options for filtering on peptide information. It is possible to choose only to look at peptides related to a subset of proteins or of a certain length, as well as peptides having a specific modification.
Once a filter is set, it is applied the instant another tab is selected. If the filter is too tight and no PSMs are left, the GUI will ask you to loosen the filter.
It should be noted that the filtering is provided with the purpose of making it easier to find the information of interest. For instance, if one is mostly interested in looking at proteins with detected phosphorylation sites, selecting phosphorylation in the modification list will mean that only those proteins where phosphorylated peptides have been identified are visible. On the other hand, it is not meant as a way to improve the quality of the results. If that is the interest, the MSnID package would be a good starting point. The latter point also means that the filtering is not applied when exporting the results.
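To make the FDR cutoff discussed above more concrete, here is a minimal sketch of a target-decoy q-value calculation followed by a 0.01 cutoff. It is not MS-GF+'s implementation: the helper compute_qvalues() is hypothetical, the scores are simulated, and the sketch assumes a score where higher is better.

```r
# Illustrative only: target-decoy q-values from PSM scores (higher score = better).
# `compute_qvalues` is a hypothetical helper, not MS-GF+'s implementation.
compute_qvalues <- function(score, is_decoy) {
  ord <- order(score, decreasing = TRUE)
  decoys  <- cumsum(is_decoy[ord])    # decoy hits at or above each score
  targets <- cumsum(!is_decoy[ord])   # target hits at or above each score
  fdr <- decoys / pmax(targets, 1)    # estimated FDR at each score threshold
  # q-value: the smallest FDR at which a PSM would still be accepted
  q_sorted <- rev(cummin(rev(fdr)))
  q <- numeric(length(score))
  q[ord] <- q_sorted                  # restore the original PSM order
  q
}

# Simulated scores, for demonstration only
psm <- data.frame(
  score    = c(95, 90, 85, 80, 70, 60, 50, 40),
  is_decoy = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)
psm$qvalue <- compute_qvalues(psm$score, psm$is_decoy)

# Keep target PSMs passing the default cutoff of 0.01
accepted <- psm[!psm$is_decoy & psm$qvalue <= 0.01, ]
accepted
```

Because q-values of this kind are computed once over the full result set, removing PSMs afterwards (for example by retention time or charge) does not update them, which is the sense in which the cutoff is unaware of the other filters.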
In the beginning I promised another R function, and the time has now come for that. While the export function makes it possible to reimport the results into R for further processing, this operation is so common that it has warranted a shortcut. While the GUI is running, it is possible to start another R session and access the results currently present in the GUI by using currentData(). The function returns an mzIDCollection object populated with the results. If no GUI is open, it will return the results of the last session.
results <- currentData()
show(results) # Empty as no analysis has been run
## An empty mzIDCollection object
Do note that there is no direct connection between the GUI and the output from currentData(). This means that if changes happen in the GUI, they are not automatically reflected in the mzIDCollection (in the above example, results will still be empty after an analysis has completed in the GUI). A new call to currentData() is necessary to bring the two sessions in sync again.
The GUI was built out of a need for an easy interface to running MS-GF+. During development it gradually became more about pushing the envelope on how protein identification is displayed, as well as on the richness and polish of R/shiny GUIs. There are a bunch of things I would like to add, both in terms of visualisation and under-the-hood functionality. If you have any requests yourself, please let me know either through the GitHub page or by email. If you’re just a happy user who would like to see the software evolve, you’re also welcome to drop me a line. I’ll probably pay more attention to it if many people are using it.
Following is a roadmap in no particular order:
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
## [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MSGFgui_1.8.0 xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-8
## [5] mzR_2.8.0 Rcpp_0.12.7
##
## loaded via a namespace (and not attached):
## [1] knitr_1.14 magrittr_1.5 BiocGenerics_0.20.0
## [4] doParallel_1.0.10 xtable_1.8-2 R6_2.2.0
## [7] foreach_1.4.3 stringr_1.1.0 plyr_1.8.4
## [10] MSGFplus_1.8.0 tools_3.3.1 parallel_3.3.1
## [13] mzID_1.12.0 Biobase_2.34.0 shinyFiles_0.6.2
## [16] htmltools_0.3.5 ProtGenerics_1.6.0 iterators_1.0.8
## [19] yaml_2.1.13 assertthat_0.1 digest_0.6.10
## [22] tibble_1.2 RJSONIO_1.3-0 shiny_0.14.1
## [25] formatR_1.4 codetools_0.2-15 mime_0.5
## [28] evaluate_0.10 rmarkdown_1.1 stringi_1.1.2
## [31] XML_3.98-1.4 httpuv_1.3.3