Combining HDMSe/MSe data using synapter to optimise identification and quantitation

Laurent Gatto1, Sebastian Gibb2 and Pavel V. Shliaha3

1Computational Proteomics Unit, Cambridge, UK.
2Department of Anesthesiology and Intensive Care, University Medicine Greifswald, Germany.
3Department of Biochemistry and Molecular Biology, University of Southern Denmark, Denmark.

24 April 2017

Abstract

This vignette describes the functionality implemented in the ‘synapter’ package. It allows to re-analyse label-free quantitative proteomics data obtained on a Synapt G2 instrument to optimise quantitation and identification. Several combination strategies are possible and described. Typically, a user can combine identification-optimised data (HDMS$^E$ data using ion mobility separation) and quantitation-optimised data (MS$^E$ data). Additionally, a method to combine several data files into a master set while controlling the false discovery rate, is presented.

Package

synapter 2.0.0

Foreword

synapter is free and open-source software. If you use it, please support the project by citing it in publications:

Nicholas James Bond, Pavel Vyacheslavovich Shliaha, Kathryn S. Lilley, and Laurent Gatto. Improving qualitative and quantitative performance for MS$^E$-based label free proteomics. J. Proteome Res., 2013, 12 (6), pp 2340–2353

Questions and bugs

For bugs, typos, suggestions or other questions, please file an issue in our tracking system (https://github.com/lgatto/synapter/issues) providing as much information as possible, a reproducible example and the output of sessionInfo().

If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.

1 Introduction

1.1 Background

The main functionality of synapter is to combine proteomics data acquired under different mass spectrometry settings or with different samples to (i) optimise the respective qualities of the two data sets or (ii) increase the number of identifications, thereby decreasing missing values. Besides synapter offers other functionality inaccessible in the default pipeline, like peptide FDR estimation and filtering on peptide match type and peptide uniqueness.

The example that motivated the development of this package was to combine data obtained on a Synapt G2 instrument:

HDMS$^E$ data, acquired with additional peptide separation using an ion mobility cell, thus leading to better (both in number and in quality) identification and
standard MS$^E$ data (acquired without ion mobility separation), providing better data quantitation.

The former is data is called identification peptides and the latter quantitation peptides, irrespective of the acquisition mode (HDMS$^E$ or MS$^E$). This HDMS$^E$/MS$^E$ design is used in this document to illustrate the synapter package.

However, although HDMS$^E$ mode possesses superior identification and MS$^E$ mode superior quantitation capabilities and transferring identifications from HDMS$^E$ to MS$^E$ is a priori the most efficient setup, identifications can be transferred between any runs, independently of the acquisition mode. This allows to reduce the number of missing values, one of the primary limitation of label-free proteomics. Thus users will benefit from synapter’s functionality even if they run their instruments in a single mode (HDMS$^E$ or MS$^E$ only).

However, as will be shown in section Data analysis, transferring identifications from multiple runs to each other increases analysis time and peptide FDR within the analysis. synapter allows to minimise these effects to acceptable degree by choosing runs to transfer identifications from and merging them in the master HDMS$^E$ file.

This data processing methodology is described in section HDMS$^E$/MS$^E$ data analysis and the analysis pipeline is described in section Different pipelines.

To maximise the benefit of combining better identification and quantitation data, it is also possible to combine several, previously merged identification data files into one master set. This functionality is described in section Using master peptide files.

Finally, section Analysis of complex experiments illustrates a complete pipeline including synapter and MSnbase (Gatto and Lilley 2012) packages to perform protein label-free quantitation: how to combine multiple synapter results to represent the complete experimental design under study and further explore the data, normalise it and perform robust statistical data analysis inside the R environment.

The rationale underlying synapter’s functionality are described in (Shliaha et al. 2013) and (Bond et al. 2013). The first reference describes the benefits of ion mobility separation on identification and the effects on quantitation, that led to the development of synapter, which in described and demonstrated in (Bond et al. 2013).

synapter is written for R~(R Core Team 2012), an open source, cross platform, freely available statistical computing environment and programming language11 https://www.r-project.org/. Functionality available in the R environment can be extended though the usage of packages. Thousands of developers have contributed packages that are distributed via the Comprehensive R Archive Network (CRAN) or through specific initiatives like the Bioconductor22 https://www.bioconductor.org/ project (Gentleman et al. 2004), focusing on the analysis and comprehension of high-throughput biological data.

synapter is such an R package dedicated to the analysis of label-free proteomics data. To obtain detailed information about any function in the package, it is possible to access it’s documentation by preceding it’s name with a question mark at the command line prompt. For example, to obtain information about the synapter package, one would type ?synapter.

1.2 Installation

synapter is available through the Bioconductor project. Details about the package and the installation procedure can be found on its page33 https://bioconductor.org/packages/synapter/. Briefly, installation of the package and all its dependencies should be done using the dedicated Bioconductor infrastructure as shown below:

source("https://bioconductor.org/biocLite.R")
## or, if you have already used the above before
library("BiocInstaller")
## and to install the package
biocLite("synapter")

After installation, synapter will have to be explicitly loaded with

library("synapter")

so that all the package’s functionality is available to the user.

2 Data analysis using synapter

2.1 Preparing the input data

Preparation of the data for synapter requires the .raw data first to be processed with Waters’ ProteinLynx Global Serve (PLGS) software. The PLGS result is then exported as csv spreadsheet files in user specified folders. These csv files can then be used as input for synapter.

We also highly recommend users to acquaint themselves with the PLGS search algorithm for data independent acquisitions (G. Z. Li et al. 2009).

First the user has to specify the output folders for files to be used in synapter analysis as demonstrated in the figures. After the folders are specified ignore the message that appears requiring restarting PLGS.

Specifying PLGS output folders 1. Select Options > Automation Setup.

At the first stage PLGS performs noise reduction and centroiding, based on user specified preferences called processing parameters. These preferences determine thresholds in intensity for discriminating between noise peaks and peptide and fragment ion peaks in high and low energy functions of an acquisition. The optimal value of these parameters is sample dependant and different between MS$^E$ and HDMS$^E$ modes. For synapter to function properly all acquisitions in the analysis have to be processed with the same thresholds, optimal for the mode identifications are transferred from (typically HDMS$^E$ mode). The user is expected to identify optimal parameters himself for every new sample type by repeatedly analysing a representative acquisition with different thresholds.

After the ions peaks have been determined and centroided, the ions representing charge states and isotopes of a peptide are collapsed into a single entity called EMRT (exact mass retention time pair). The EMRTs in low energy function represent unidentified peptides and are assigned peptides sequences during database search. The total list of EMRTs can be found in the pep3DAMRT.csv file and it is one of the synapter input files for the runs used for quantitation (typically MS$^E$ mode)

Prior to the database search, randomised entries are added to the database to allow PLGS to compute protein false positive rate. The randomised entries can either be added automatically or manually, using the Randomise Databank function in the Databank admin tool. To properly prepare the files for synapter, the user has to add randomised entries manually via Databank admin tool, since only then randomised entries identified in the database search will be displayed in the csv output. The following figures demonstrate how to create a randomised databank manually using one randomised entry per regular entry.

Databank creation in PLGS. Databank search options in PLGS.

The user is also expected to use a minimum of 1 fragment per peptide, 3 fragments per protein and 1 peptide per protein identification thresholds and 100% False Positive Rate44 This is erroneously termed false positive rate in the software and manuscript and should be considered a false discovery rate. for protein identification during database search for all of the acquisitions in the analysis as demonstrated in figure. This allows to maximise the number of identified peptides from the randomised part of the database, needed to estimate peptide identifications statistics. The total list of identified peptides is given in final_peptide.csv files. A single final_peptide.csv file has to be supplied to synapter for every run in the analysis (for both identification and quantitation runs).

More details and screenshots are available in a separate document available at https://lgatto.github.com/synapter/.

2.2 HDMS$^E$/MS$^E$ data analysis

The analysis of pairs of HDMS$^E$ and MS$^E$ data files is based on the following rationale – combine strengths or each approach by matching high quality HDMS$^E$ identifications to quantified MS$^E$ EMRTs applying the following algorithm:

Apply various peptide filters to HDMS$^E$ and MS$^E$ peptides to obtain two sets of reliably identified unique proteotypic peptides.
Use shared HDMS$^E$ and MS$^E$ peptides to model the deviations in retention time between the two mass spectrometer runs.
Optimise the parameters that will be used to optimally match all HDMS$^E$ peptides and quantified MS$^E$ EMRTs using a grid search.
Using the best parameters, match identified HDMS$^E$ peptides to quantified MS$^E$ EMRTs.

2.3 Different pipelines

Two different pipeline are available to the user:

2.3.1 Wrapper function

The synergise function is a high level wrapper that implements a suggested analysis to combine two files (see next paragraph for details). A set of parameters can be passed, although sensible defaults are provided. While the analysis is executed, a html report is created, including all result files in text spreadsheet (csv format) and binary R output. This level allows easy scripting for automated batch analysis. Using data from the synapterdata package, the following code chunk illustrates the synergise usage. An example report can be found online at https://lgatto.github.com/synapter/.

library("synapterdata")
hdmsefile <- getHDMSeFinalPeptide()[2]
basename(hdmsefile)

## [1] "HDMSe_101111_25fmol_UPS1_in_Ecoli_04_IA_final_peptide.csv.gz"

msefile <- getMSeFinalPeptide()[2]
basename(msefile)

## [1] "MSe_101111_25fmol_UPS1_in_Ecoli_03_IA_final_peptide.csv.gz"

msepep3dfile <- getMSePep3D()[2]
basename(msepep3dfile)

## [1] "MSe_101111_25fmol_UPS1_in_Ecoli_03_Pep3DAMRT.csv.gz"

fas <- getFasta()
basename(fas)

## [1] "EcoliK12_enolase_UPSsimga_NB.fasta"

## the synergise input is a (named) list of filenames
input <- list(identpeptide = hdmsefile,
              quantpeptide = msefile,
              quantpep3d = msepep3dfile,
              fasta = fas)
## a report and result files will be stored
## in the 'output' directory
output <- tempdir()
output

## [1] "/tmp/Rtmpej8nXB"

res <- synergise(filenames = input, outputdir = output)

performance(res)

See ?synergise for details.

2.3.2 Detailed step-by-step analysis

The user can have detailed control on each step of the analysis by executing each low-level function manually. This pipeline, including generation of data containers (class instances) and all available operations are available in ?Synapter. This strategy allows the maximum flexibility to develop new unexplored approaches.

2.4 Using master peptide files

While analysing one MS$^E$ file against one single HDMS$^E$ file increased the total number of reliably identified and quantified features compared to each single MS$^E$ analysis, a better procedure can be applied when replicates are available. Consider the following design with two pairs of files: HDMS$^E_1$, MS$^E_1$, HDMS$^E_2$ and MS$^E_2$. The classical approach would lead to combining for example, HDMS$^E_1$ and MS$^E_1$ and HDMS$^E_2$ and MS$^E_2$. However, HDMS$^E_1$ – MS$^E_2$ and HDMS$^E_2$ – MS$^E_1$ would also be suitable, possibly leading to new identified and quantified features. Instead of repeating all possible combinations, which could hardly be applied for more replicates, we allow to merge HDMS$^E_1$ and HDMS$^E_2$ into a new master HDMS$^E_{12}$ and then using it to transfer identification to both MS$^E$ runs. In addition to leading to a simpler set of analyses, this approach also allows to control the false positive rate during the HDMS$^E$ merging (see section Choosing HDMS$^E$ files). Such master HDMS$^E$ files can be readily created with the makeMaster function, as described in section Generating a master file.

We will use data from the synapterdata to illustrate how to create master files.

2.4.1 Choosing which HDMS$^E$ files to combine

In a more complex design, a greater number of HDMS$^E$ files might need to be combined. When combining files, one also accumulates false peptides assignments. The extent to which combining files increases new reliable identification at the cost of accumulating false assignments can be estimated with the estimateMasterFdr function.

To illustrate how FDR is estimated for master HDMS$^E$ files, let’s consider two extreme cases.

In the first one, the two files (each with $1000$ peptides filtered at an FDR of $0.01$) to be combined are nearly identical, sharing $900$ peptides. The combined data will have $900 (shared) + 2 \times 100 (unique)$ peptides and each file, taken separately is estimated to have $1000 \times 0.01 = 10$ false positive identifications. We thus estimate the upper FDR bound after merging the two files to be $\frac{20}{1100} = 0.0182$.
In the second hypothetical case, the two files (again each with $1000$ peptides filtered at a FDR of $0.01$) to be combined are very different and share only $100$ peptides. The combined data will have $100 (shared) + 2 \times 900 (unique)$ peptides and, as above, each file is estimated to have $10$ false discoveries. In this case, we obtain an upper FDR bound of $\frac{20}{1900} = 0.0105$.

In general, the final false discovery for two files will be \[FDR_{master} = \frac{nfd_{1} + nfd_{2}}{union(peptides~HDMS$^E_{1}, peptides~HDMS$E_{2})}\]

where $nfd_{i}$ is the number of false discoveries in HDMS$^E$ file $i$. Note that we do not make any assumptions about repeated identification in multiple files here.

estimateMasterFdr generalised this for any number of HDMS$^E$ files and indicates the best combination at a fixed user-specified masterFdr level. Mandatory input is a list of HDMS$^E$ file names and a fasta database file name to filter non-unique proteotypic peptides.

The result of estimateMasterFdr stores the number of unique proteotypic peptides and FDR for all possible 57 combinations of 6 files. A summary can be printed on the console or plotted with plot(cmb) (see figure 1).

## using the full set of 6 HDMSe files and a
## fasta database from the synapterdata package
inputfiles <- getHDMSeFinalPeptide()
fasta <- getFasta()
cmb <- estimateMasterFdr(inputfiles, fasta, masterFdr = 0.02)
cmb

## 6 files - 57 combinations
## Best combination: 4 5 
##  - 5730 proteotypic peptides
##  - 6642 unique peptides
##  - 0.017 FDR

plot(cmb)

Figure illustrating the relation between the number of unique peptides in the combined HDMS$^E$ file and the resulting false discovery rate. The symbols on the figure represent the number of files for that particular combination. The dotted line is the user defined threshold for the combined FDR (`masterFdr` parameter). The best combination, i.e the one that maximises the number of unique peptides while keeping the FDR below `masterFdr` is highlighted in red.

Figure 1: Figure illustrating the relation between the number of unique peptides in the combined HDMS$^E$ file and the resulting false discovery rate
The symbols on the figure represent the number of files for that particular combination. The dotted line is the user defined threshold for the combined FDR (masterFdr parameter). The best combination, i.e the one that maximises the number of unique peptides while keeping the FDR below masterFdr is highlighted in red.

The best combination can be extracted with the bestComb function.

bestComb(cmb)

## [1] 4 5

See ?estimateMasterFdr and references therein for more details about the function and the returned object.

2.4.2 Generating a master file

Now that we have identified which files should be used to create the master file, we can directly pass the relevant identification files to the makeMaster function to generate the master file. The function has one mandatory input parameter, pepfiles, a list oh identification file names to be merged. The output is an object of class MasterPeptides that stores the relevant peptides from the original input files. The result can be saved to disk using saveRDS for further analysis, as described in section HDMS$^E$/MS$^E$ data analysis.

master <- makeMaster(inputfiles[bestComb(cmb)])
master

## Object of class "MasterPeptides"
##  1st Master [ 1 2 ] has 6642 peptides 
##  2nd Master [ 2 1 ] has 6642 peptides 
##  [1] HDMSe_111111_50fmol_UPS1_in_Ecoli_04_IA_final_peptide.csv.gz
##  [2] HDMSe_111111_50fmol_UPS1_in_Ecoli_02_IA_final_peptide.csv.gz 
## 
##  No fragment library.

More details can be found in the function documentation accessible with ?makeMaster.

2.5 Summary

Two functions are needed to choose a set of identification run files and create the master identification run. One function enables to perform a complete identification transfer. The following table summarises all there is to know to utilise synapter’s functionality.

Function	Description
`synergise`	Runs the complete identification transfer
`estimateMasterFdr`	Chooses which files to be used to create the master IR
`makeMaster`	Creates the master IR

3 Analysing complete experiments

The functionality described in this section relies on the MSnbase package (Gatto and Lilley 2012), which is installed by default with synapter. Please refer to the MSnbase Bioconductor web page. the associated vignettes and the respective manual pages for more details.

The synapterdata already provides preprocessed PLGS data. Six Synapter instances are available: 3 replicates (labelled a, b and c) of the Universal Proteomics Standard (UPS1) 48 protein mix at 25 fmol and 3 replicates at 50 fmol, in a constant E. coli background. The 6 files can be loaded in your working space with

data(ups25a, ups25b, ups25c, ups50a, ups50b, ups50c)

3.1 Applying the Top 3 approach

We will start by describing the analysis of ups25a in details, and then show how to analyse all the runs using more compact code. The first step of our analysis is to convert the synapter object output (an Synapter instance), into a MSnbase-compatible object, called an MSnSet, that we will name ms25a. We can obtain a description of the MSnSet object by typing its name.

ms25a <- as(ups25a, "MSnSet")
class(ups25a)[1]

## [1] "Synapter"

class(ms25a)[1]

## [1] "MSnSet"

ms25a

## MSnSet (storageMode: lockedEnvironment)
## assayData: 5642 features, 1 samples 
##   element names: exprs 
## protocolData: none
## phenoData: none
## featureData
##   featureNames: AALESTLAAITESLK IAAANVPAFVSGK ... NDSALGLFNGDIGIALDR
##     (5642 total)
##   fvarLabels: peptide.seq.MSe_101111_25fmol_UPS1_in_Ecoli_01
##     protein.Accession.MSe_101111_25fmol_UPS1_in_Ecoli_01 ...
##     qval.1.MSe_101111_25fmol_UPS1_in_Ecoli_01 (21 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: No annotation 
## - - - Processing information - - -
##  MSnbase version: 2.2.0

It contains quantitation information about 5642 peptides for 1 sample. In the code chunk below, we update the default sample name Synapter1 with a more meaningful one.

sampleNames(ms25a)

## [1] "MSe_101111_25fmol_UPS1_in_Ecoli_01"

sampleNames(ms25a) <- "ups25a"
sampleNames(ms25a)

## [1] "ups25a"

Quantitative data and meta-data, which has been acquired by synapter, can be extracted with the exprs and fData methods.

tail(exprs(ms25a))

##                    ups25a
## AFLNDK                 NA
## IEAQLNDVIADLDAVR     1671
## AVFNGLINVAQHAIK      1499
## LEEVK                 664
## VALQGNMDPSMLYAPPAR   2969
## NDSALGLFNGDIGIALDR     NA

tail(fData(ms25a)[, c(2,9)])

##                    protein.Accession.MSe_101111_25fmol_UPS1_in_Ecoli_01
## AFLNDK                                                       MNME_ECODH
## IEAQLNDVIADLDAVR                                             MNME_ECODH
## AVFNGLINVAQHAIK                                            B1XFY9_ECODH
## LEEVK                                                      B1X7F0_ECODH
## VALQGNMDPSMLYAPPAR                                           DCUP_ECODH
## NDSALGLFNGDIGIALDR                                         B1XDM5_ECODH
##                    precursor.retT.MSe_101111_25fmol_UPS1_in_Ecoli_01
## AFLNDK                                                      33.50725
## IEAQLNDVIADLDAVR                                            84.15609
## AVFNGLINVAQHAIK                                             76.22252
## LEEVK                                                       41.40945
## VALQGNMDPSMLYAPPAR                                          67.60099
## NDSALGLFNGDIGIALDR                                          97.65882

## all fetaure metadata
fvarLabels(ms25a)

##  [1] "peptide.seq.MSe_101111_25fmol_UPS1_in_Ecoli_01"              
##  [2] "protein.Accession.MSe_101111_25fmol_UPS1_in_Ecoli_01"        
##  [3] "protein.Description.MSe_101111_25fmol_UPS1_in_Ecoli_01"      
##  [4] "protein.falsePositiveRate.MSe_101111_25fmol_UPS1_in_Ecoli_01"
##  [5] "peptide.matchType.MSe_101111_25fmol_UPS1_in_Ecoli_01"        
##  [6] "peptide.mhp.MSe_101111_25fmol_UPS1_in_Ecoli_01"              
##  [7] "peptide.score.MSe_101111_25fmol_UPS1_in_Ecoli_01"            
##  [8] "precursor.mhp.MSe_101111_25fmol_UPS1_in_Ecoli_01"            
##  [9] "precursor.retT.MSe_101111_25fmol_UPS1_in_Ecoli_01"           
## [10] "precursor.inten.MSe_101111_25fmol_UPS1_in_Ecoli_01"          
## [11] "precursor.Mobility.MSe_101111_25fmol_UPS1_in_Ecoli_01"       
## [12] "spectrumID.MSe_101111_25fmol_UPS1_in_Ecoli_01"               
## [13] "Intensity.MSe_101111_25fmol_UPS1_in_Ecoli_01"                
## [14] "ion_ID.MSe_101111_25fmol_UPS1_in_Ecoli_01"                   
## [15] "ion_area.MSe_101111_25fmol_UPS1_in_Ecoli_01"                 
## [16] "ion_counts.MSe_101111_25fmol_UPS1_in_Ecoli_01"               
## [17] "pval.MSe_101111_25fmol_UPS1_in_Ecoli_01"                     
## [18] "Bonferroni.MSe_101111_25fmol_UPS1_in_Ecoli_01"               
## [19] "BH.MSe_101111_25fmol_UPS1_in_Ecoli_01"                       
## [20] "qval.MSe_101111_25fmol_UPS1_in_Ecoli_01"                     
## [21] "qval.1.MSe_101111_25fmol_UPS1_in_Ecoli_01"

We will describe a how to process the data using a Top 3 approach, where the 3 most intense peptides of each protein are used to compute the protein intensity, using the topN and combineFeatures methods. The former allows to extract the top most intense peptides (default n is 3) and remove all other peptides from the MSnSet object. The latter than aggregates the n most intense peptides per protein using a user-defined function (sum, below). Finally, we also scale protein intensity values depending on the actual number of peptides that have summed. This number of quantified peptides can be calculated (after topN, but before combineFeatures) with nQuants.

ms25a <- topN(ms25a,
              groupBy = fData(ms25a)$protein.Accession,
              n = 3)
nPeps <- nQuants(ms25a,
                 groupBy = fData(ms25a)$protein.Accession)
ms25a <- combineFeatures(ms25a,
                         fData(ms25a)$protein.Accession,
                         "sum", na.rm = TRUE,
                         verbose = FALSE)
head(exprs(ms25a))

##            ups25a
## 6PGL_ECODH  71555
## ABDH_ECODH  47542
## ACCA_ECODH  38249
## ACCD_ECODH  25615
## ACP_ECODH   16388
## APT_ECODH       0

head(nPeps)

##            ups25a
## 6PGL_ECODH      3
## ABDH_ECODH      3
## ACCA_ECODH      3
## ACCD_ECODH      3
## ACP_ECODH       1
## APT_ECODH       0

## scale intensities
exprs(ms25a) <- exprs(ms25a) * (3/nPeps)
## NaN result from the division by 0, when
## no peptide was found for that protein
head(exprs(ms25a))

##            ups25a
## 6PGL_ECODH  71555
## ABDH_ECODH  47542
## ACCA_ECODH  38249
## ACCD_ECODH  25615
## ACP_ECODH   49164
## APT_ECODH     NaN

3.2 Batch processing

The code chunk below repeats the above processing for the other 5 UPS1/E. coli runs.

nms <- c(paste0("ups", 25, c("b", "c")),
         paste0("ups", 50, c("a", "b", "c")))
tmp <- sapply(nms, function(.ups) {
  cat("Processing", .ups, "... ")
  ## get the object from workspace and convert to MSnSet
  x <- get(.ups, envir = .GlobalEnv)
  x <- as(x, "MSnSet")
  sampleNames(x) <- .ups
  ## extract top 3 peptides
  x <- topN(x, groupBy = fData(x)$protein.Accession, n = 3)
  ## calculate the number of peptides that are available
  nPeps <- nQuants(x, fData(x)$protein.Accession)
  ## sum top3 peptides into protein quantitation
  x <- combineFeatures(x, fData(x)$protein.Accession,
                       "sum", na.rm = TRUE, verbose = FALSE)
  ## adjust protein intensity based on actual number of top peptides
  exprs(x) <- exprs(x) * (3/nPeps)
  ## adjust feature variable names for combine
  x <- updateFvarLabels(x, .ups)
  ## save the new MSnExp instance in the workspace
  varnm <- sub("ups", "ms", .ups)
  assign(varnm, x, envir = .GlobalEnv)
  cat("done\n")
})

## Processing ups25b ... done
## Processing ups25c ... done
## Processing ups50a ... done
## Processing ups50b ... done
## Processing ups50c ... done

We now have 6 MSnSet instances, containing protein quantitation for the 6 UPS/E. coli runs.

3.3 Combining data and filtering

We now want to filter data out based on missing quantitation data, retaining proteins that have been quantified in at a least two out of three replicates. Filtering based on missing data can be done using the filterNA method and a maximum missing data content as defined by pNA. Multiple MSnSet instances can be combined with the combine method, which is described in details in the MSnbase-demo vignette55 The vignette is accessible from within R with vignette("MSnbase-demo" package = "MSnbase").. The 6 objects have appropriate distinct sample names and common feature (protein) names, which will be used to properly combine the quantitation data.

ms25 <- combine(ms25a, ms25b)
ms25 <- combine(ms25, ms25c)
dim(ms25)

## [1] 729   3

ms25 <- filterNA(ms25, pNA = 1/3)
dim(ms25)

## [1] 709   3

Once combined and filtered, the 25 fmol group retains 709 entries with at least 2 out of 3 quantitation values, out of the 729 total number of proteins.

ms50 <- combine(ms50a, ms50b)
ms50 <- combine(ms50, ms50c)
dim(ms50)

## [1] 729   3

ms50 <- filterNA(ms50, pNA = 1/3)
dim(ms50)

## [1] 709   3

Similarly, the 50 fmol group retains 709 entries with at least 2 out of 3 quantitation values, out of the 729 initial proteins.

We now combine the two subgroups into one MSnSet object that contains all 6 samples and filter for proteins that are observed in both groups, i.e retaining proteins with a maximum of 2/6 missing values. We also compute a summary table with the number of protein that have 4, 5, or 6 quantitation values across the 6 samples.

msUps <- combine(ms25, ms50)
msUps <- filterNA(msUps, pNA = 2/6)
head(exprs(msUps))

##            ups25a ups25b ups25c ups50a ups50b ups50c
## 6PGL_ECODH  71555  62114  60655  59920  56185  53874
## ABDH_ECODH  47542  37805  36746  45570  43163  39506
## ACCA_ECODH  38249  31543  29570  30697  29656  27851
## ACCD_ECODH  25615  22247  20295  22206  19698  19819
## ACP_ECODH   49164 738365 706538 734425 712076 655842
## AROB_ECODH   5442   4050   3684   4095   4500   3879

table(apply(exprs(msUps), 1,
            function(.x)sum(!is.na(.x))))

## 
##   4   5   6 
##   6  25 674

We obtain a final data set containing 705 proteins. Finally, we normalise protein intensities in each sample to correct for experimental loading biases and pipetting errors. To do so, we compute 6 sample medians using all constant E. coli background proteins and divide each protein by its respective sample mean.

ecoli <- -grep("ups$", featureNames(msUps))
meds <- apply(exprs(msUps)[ecoli, ], 2,
              median, na.rm=TRUE)
exprs(msUps) <- t(apply(exprs(msUps), 1,
                        "/", meds))

This same procedure could be applied with a set of constant spikes to estimate absolute protein quantities.

3.4 Statistical analysis of differentially expressed proteins

The UPS1 spiked-in protein mix is composed of 48 proteins, 47 of which have been observed and quantified in our final data object. In this section, we will illustrate how to analyse the 705 proteins to extract those that show differences between the two groups and show that these candidates represent the UPS1 spikes.

The R environment and many of the available packages allow extremely powerful statistical analysis. In this document, we will apply a standard t-test on $log_2$ transformed data for convenience, to calculate p-value for individual proteins (pv variable below). For best performance with small number of samples and more complex designs, we recommend the Bioconductor limma package (G. K. Smyth 2005). We then perform multiple comparison adjustment using the qvalue from the qvalue package, that implements the method from (Storey and Tibshirani 2003) (qv variable below). The multtest package provides several other p-value adjustment methods. We will also compute $log_2$ fold-changes and illustrate the results on a volcano plot (figure 2). Figure 3 illustrates the UPS1 proteins and samples on a classical heatmap.

## use log2 data for t-test
exprs(msUps) <- log2(exprs(msUps))
## apply a t-test and extract the p-value
pv <- apply(exprs(msUps), 1 ,
            function(x)t.test(x[1:3], x[4:6])$p.value)
## calculate q-values
library("qvalue")
qv <- qvalue(pv)$qvalues
## calculate log2 fold-changes
lfc <- apply(exprs(msUps), 1 ,
             function(x) mean(x[1:3], na.rm=TRUE)-mean(x[4:6], na.rm=TRUE))
## create a summary table
res <- data.frame(cbind(exprs(msUps), pv, qv, lfc))
## reorder based on q-values
res <- res[order(res$qv), ]
knitr::kable(head(round(res, 3)))

	ups25a	ups25b	ups25c	ups50a	ups50b	ups50c	qv	lfc
P01112ups	-0.053	-0.015	-0.001	1.072	1.118	1.080	0.000	-1.113
P00918ups	-0.247	-0.201	-0.214	0.616	0.667	0.674	0.001	-0.873
P01008ups	0.112	0.106	0.178	1.075	1.132	1.090	0.001	-0.967
Q06830ups	0.174	0.118	0.156	1.073	1.095	1.100	0.002	-0.940
P10145ups	-0.323	-0.332	-0.274	0.689	0.764	0.788	0.003	-1.057
P02788ups	0.564	0.647	0.601	1.494	1.532	1.532	0.003	-0.915

In the above example, quantitation values and statistics data are saved in a summary dataframe (res), that can be exported to a comma-separated spreadsheet with

write.csv(res, file = "upsResults.csv")

plot(res$lfc, -log10(res$qv),
     col = ifelse(grepl("ups$", rownames(res)),
       "#4582B3AA",
       "#A1A1A180"),
     pch = 19,
     xlab = expression(log[2]~fold-change),
     ylab = expression(-log[10]~q-value))
grid()
abline(v = -1, lty = "dotted")
abline(h = -log10(0.1), lty = "dotted")
legend("topright", c("UPS", "E. coli"),
       col = c("#4582B3AA", "#A1A1A1AA"),
       pch = 19, bty = "n")

$Volcano plot. On the volcano plot, each protein is represented by a dot and positioned according to its $log_2$ fold-change along the x axis and $-log_{10}$ of its q-value along the y axis. Proteins with large fold-changes are positioned on the sides of the plot, while proteins with low q-values are at the top of the figure. The most promising candidates are this located on the top corners. In our case, the UPS proteins (in blue) have $log2$ fold-changes around -1 (vertical dotted line), as expected. The horizontal dotted line represents a q-value threshold of 0.10.$

Figure 2: Volcano plot
On the volcano plot, each protein is represented by a dot and positioned according to its $log_2$ fold-change along the x axis and $-log_{10}$ of its q-value along the y axis. Proteins with large fold-changes are positioned on the sides of the plot, while proteins with low q-values are at the top of the figure. The most promising candidates are this located on the top corners. In our case, the UPS proteins (in blue) have $log2$ fold-changes around -1 (vertical dotted line), as expected. The horizontal dotted line represents a q-value threshold of 0.10.

heatmap(exprs(msUps)[grep("ups",featureNames(msUps)), ])

Figure 3: A heatmap of all UPS1 proteins present in the final data set

Table 1: UPS1 proteins
	ups25a	ups25b	ups25c	ups50a	ups50b	ups50c	pv	qv	lfc
P01112ups	-0.053	-0.015	-0.001	1.072	1.118	1.080	0.000	0.000	-1.113
P00918ups	-0.247	-0.201	-0.214	0.616	0.667	0.674	0.000	0.001	-0.873
P01008ups	0.112	0.106	0.178	1.075	1.132	1.090	0.000	0.001	-0.967
Q06830ups	0.174	0.118	0.156	1.073	1.095	1.100	0.000	0.002	-0.940
P10145ups	-0.323	-0.332	-0.274	0.689	0.764	0.788	0.000	0.003	-1.057
P02788ups	0.564	0.647	0.601	1.494	1.532	1.532	0.000	0.003	-0.915
P02753ups	-1.901	-1.822	-1.906	-1.006	-0.919	-0.876	0.000	0.005	-0.943
P01375ups	0.812	0.930	0.963	1.825	1.756	1.692	0.000	0.008	-0.856
P69905ups	-1.440	-1.563	-1.513	-0.637	-0.582	-0.579	0.000	0.008	-0.906
P00167ups	0.869	0.890	0.975	1.923	1.963	1.938	0.000	0.011	-1.030
P12081ups	-0.094	0.097	-0.022	0.936	1.038	1.047	0.000	0.011	-1.013
P00709ups	-0.187	-0.308	-0.324	0.424	0.505	0.508	0.000	0.012	-0.752
O00762ups	0.432	0.267	0.264	1.086	1.212	1.214	0.000	0.012	-0.850
P05413ups	-0.173	-0.395	-0.284	0.576	0.680	0.689	0.001	0.024	-0.932
P00441ups	-0.227	-0.414	-0.391	0.289	0.413	0.406	0.001	0.027	-0.714
P04040ups	-0.065	0.142	0.211	1.094	1.245	1.237	0.001	0.027	-1.096
P02787ups	0.003	0.148	0.066	1.504	1.533	1.238	0.001	0.033	-1.353
P10636-8ups	0.727	0.534	0.558	1.493	1.502	1.576	0.001	0.033	-0.917
P06396ups	0.228	0.299	0.037	1.179	1.268	1.298	0.002	0.036	-1.061
P16083ups	-0.179	-0.407	-0.282	1.169	1.149	1.168	0.002	0.040	-1.452
P02768ups	0.554	0.340	0.396	1.355	1.426	1.408	0.002	0.041	-0.966
P01127ups	0.330	0.145	0.216	1.183	1.183	1.213	0.002	0.045	-0.963
P08758ups	0.273	0.094	0.081	1.136	1.183	1.158	0.003	0.048	-1.010
P00915ups	0.055	-0.182	0.063	1.100	1.137	1.157	0.004	0.058	-1.153
P15559ups	0.118	-0.088	-0.079	0.876	0.934	0.883	0.003	0.058	-0.914
P55957ups	-1.083	-1.461	-1.329	-0.348	-0.393	-0.181	0.004	0.058	-0.984
P62988ups	0.513	0.270	0.367	1.289	1.293	1.236	0.004	0.064	-0.889
P01031ups	-0.407	-0.648	-0.639	0.628	0.636	0.633	0.004	0.064	-1.197
P61626ups	-0.096	-0.363	-0.320	0.619	0.681	0.669	0.006	0.086	-0.917
P51965ups	-0.893	-1.182	-1.301	0.017	-0.039	-0.009	0.011	0.143	-1.115
P01344ups	-0.044	-0.397	-0.060	0.570	0.721	0.642	0.011	0.149	-0.811
P01579ups	-0.948	-0.716	-0.660	-0.257	-0.254	-0.165	0.016	0.204	-0.549
P41159ups	0.285	-0.206	-0.245	0.776	0.856	1.030	0.018	0.215	-0.943
P62937ups	-1.375	-0.691	-1.117	0.312	0.380	0.262	0.018	0.215	-1.379
P68871ups	-0.206	-0.440	-0.589	0.374	0.359	0.358	0.020	0.218	-0.775
P08263ups	-1.113	-0.642	-1.517	0.188	0.254	0.279	0.033	0.302	-1.331
P99999ups	-1.202	-1.992	-1.953	-0.899	-0.192	-0.838	0.036	0.316	-1.073
P10599ups	-0.875	-1.133	0.166	1.387	0.897	0.758	0.037	0.317	-1.628
P02144ups	-0.963	-0.121	-0.075	0.992	1.059	1.000	0.039	0.324	-1.403
P01133ups	-0.796	0.011	-0.382	0.529	0.564	0.499	0.058	0.434	-0.920
P02741ups	-0.265	-1.165	-1.088	0.309	0.323	0.302	0.057	0.434	-1.151
P61769ups	-0.656	-0.151	-0.160	0.725	0.375	-0.029	0.073	0.465	-0.679
P63165ups	-0.984	0.080	0.220	1.074	1.124	1.112	0.073	0.465	-1.332
O76070ups	-0.457	0.253	0.328	0.881	0.802	0.935	0.077	0.479	-0.831
P06732ups	1.505	0.541	0.556	1.318	1.358	1.382	0.267	0.498	-0.485
P63279ups	-1.864	-3.085	-1.577	-0.643	-0.680	-0.687	0.083	0.498	-1.505
P09211ups	1.728	-1.862	-1.840	-0.762	-0.328	-0.654	0.955	0.586	-0.077

A total 29 proteins show a statistically different pattern between the two groups, at a false discovery rate of 10%. Table upstab summarises the results for all UPS1 proteins.

4 Session information

All software and respective versions used to produce this document are listed below.

sessionInfo()

## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] qvalue_2.8.0        synapterdata_1.13.3 synapter_2.0.0     
##  [4] MSnbase_2.2.0       ProtGenerics_1.8.0  BiocParallel_1.10.0
##  [7] mzR_2.10.0          Rcpp_0.12.10        Biobase_2.36.0     
## [10] BiocGenerics_0.22.0 BiocStyle_2.4.0    
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.2        splines_3.4.0         lattice_0.20-35      
##  [4] colorspace_1.3-2      htmltools_0.3.5       stats4_3.4.0         
##  [7] yaml_2.1.14           vsn_3.44.0            XML_3.98-1.6         
## [10] survival_2.41-3       affy_1.54.0           RColorBrewer_1.1-2   
## [13] cleaver_1.14.0        affyio_1.46.0         foreach_1.4.3        
## [16] plyr_1.8.4            mzID_1.14.0           stringr_1.2.0        
## [19] zlibbioc_1.22.0       Biostrings_2.44.0     munsell_0.4.3        
## [22] pcaMethods_1.68.0     gtable_0.2.0          codetools_0.2-15     
## [25] evaluate_0.10         knitr_1.15.1          IRanges_2.10.0       
## [28] doParallel_1.0.10     BiocInstaller_1.26.0  highr_0.6            
## [31] preprocessCore_1.38.0 readr_1.1.0           scales_0.4.1         
## [34] backports_1.0.5       limma_3.32.0          S4Vectors_0.14.0     
## [37] XVector_0.16.0        impute_1.50.0         hms_0.3              
## [40] ggplot2_2.2.1         digest_0.6.12         stringi_1.1.5        
## [43] bookdown_0.3          grid_3.4.0            rprojroot_1.2        
## [46] tools_3.4.0           magrittr_1.5          lazyeval_0.2.0       
## [49] tibble_1.3.0          MASS_7.3-47           Matrix_1.2-9         
## [52] rmarkdown_1.4         iterators_1.0.8       R6_2.2.0             
## [55] MALDIquant_1.16.2     multtest_2.32.0       compiler_3.4.0

References

Bond, Nicholas James, Pavel Vyacheslavovich Shliaha, Kathryn S. Lilley, and Laurent Gatto. 2013. “Improving Qualitative and Quantitative Performance for Ms$^E$-Based Label Free Proteomics.” Journal of Proteome Research 12 (6). doi:10.1021/pr300776t.

Gatto, Laurent, and Kathryn S Lilley. 2012. “MSnbase – an R/Bioconductor Package for Isobaric Tagged Mass Spectrometry Data Visualization, Processing and Quantitation.” Bioinformatics 28 (2): 288–9. doi:10.1093/bioinformatics/btr645.

Gentleman, Robert C, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, et al. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biol 5 (10): R80. doi:10.1186/gb-2004-5-10-r80.

Li, G Z, J P Vissers, J C Silva, D Golick, M V Gorenstein, and S J Geromanos. 2009. “Database Searching and Accounting of Multiplexed Precursor and Product Ion Spectra from the Data Independent Analysis of Simple and Complex Peptide Mixtures.” Proteomics 9 (6): 1696–1719. doi:10.1002/pmic.200800564.

R Core Team. 2012. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.

Shliaha, Pavel Vyacheslavovich, Nicholas James Bond, Laurent Gatto, and Kathryn S. Lilley. 2013. “The Effects of Travelling Wave Ion Mobility Separation on Data Independent Acquisition in Proteomics Studies.” Journal of Proteome Research 2 (6). doi:10.1021/pr300775k.

Smyth, Gordon K. 2005. “Limma: Linear Models for Microarray Data.” In Bioinformatics and Computational Biology Solutions Using R and Bioconductor, edited by R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, and W. Huber, 397–420. New York: Springer.

Storey, John D., and Robert Tibshirani. 2003. “Statistical Significance for Genomewide Studies.” Proceedings of the National Academy of Sciences of the United States of America 100 (16): 9440–5. doi:10.1073/pnas.1530509100.