1 Introduction

This R package assesses the contribution of the targeted precursor to a fragmentation isolation window using a metric called “precursor purity”.

“Precursor purity” measures the contribution of a selected precursor peak to the isolation window used for fragmentation. The calculation divides the intensity of the selected precursor peak by the total intensity in the isolation window. When assessing MS/MS spectra, this calculation is done before and after the MS/MS scan of interest, and the purity is interpolated to the time of the MS/MS acquisition. The calculation is very similar to the “Precursor Ion Fraction” (PIF) metric described by Michalski, Cox, and Mann (2011) for proteomics, except that purity here is interpolated to the recorded point of MS2 acquisition using the bordering full-scan spectra. Additionally, low-abundance ions that are thought to contribute little to the resulting MS2 spectra can be removed, and the calculation can optionally take into account the isolation efficiency of the mass spectrometer.
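The calculation described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper `scan_purity`, the peak values, the retention times, and the choice of linear interpolation are all assumptions made for the example, and it omits the low-abundance filtering and isolation-efficiency options.

```r
# Purity of one full scan: intensity of the target peak divided by the
# total intensity inside the isolation window (hypothetical helper).
scan_purity <- function(mz, intensity, target_mz, offsets = c(0.5, 0.5)) {
  in_window <- mz >= target_mz - offsets[1] & mz <= target_mz + offsets[2]
  if (!any(in_window)) return(NA_real_)
  win_mz  <- mz[in_window]
  win_int <- intensity[in_window]
  # Assumption: the peak closest to the target m/z is the precursor.
  target_idx <- which.min(abs(win_mz - target_mz))
  win_int[target_idx] / sum(win_int)
}

# Two made-up full scans bordering an MS/MS scan.
scan_before <- list(rt = 10.0, mz = c(149.02, 149.30, 150.10),
                    intensity = c(800, 200, 500))
scan_after  <- list(rt = 12.0, mz = c(149.02, 149.35, 150.10),
                    intensity = c(600, 400, 500))

p_before <- scan_purity(scan_before$mz, scan_before$intensity, target_mz = 149.02)  # 0.8
p_after  <- scan_purity(scan_after$mz,  scan_after$intensity,  target_mz = 149.02)  # 0.6

# Assumption: linear interpolation to the MS/MS acquisition time.
msms_rt   <- 10.5
in_purity <- approx(x = c(scan_before$rt, scan_after$rt),
                    y = c(p_before, p_after),
                    xout = msms_rt)$y                                               # 0.75
```

The peak at m/z 150.10 falls outside the +/- 0.5 window and so does not affect either score.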

There are two main use cases for the package:

Assessing precursor purity of previously acquired MS2 spectra (purityA): A user has acquired either LC-MS2 or DIMS2 spectra and an assessment is made of the precursor purity for each MS2 scan.
Assessing precursor purity of anticipated isolation windows for MS2 spectra: A user has acquired either LC-MS (purityX) or DIMS (purityD) full scan (MS1) data and an assessment is to be made of the precursor purity of detected features using anticipated or theoretical isolation windows. This information can then be used to guide further targeted MS2 experiments.

The package has been developed to be used with DI-MS or LC-MS data and has been checked to work with the following vendor files after conversion to mzML: Thermo, Agilent and AB Sciex.

2 Assessing precursor purity of previously acquired MS2 spectra

2.1 purityA

Given a vector of LC-MS/MS or DI-MS/MS mzML file paths, the function purityA calculates the precursor purity of each MS/MS scan. The output is an S4 class object in which a dataframe of the purity results can be accessed through the appropriate slot (@puritydf).

The isolation widths will be determined automatically from the mzML file. For some mzML files this is not recorded and in these cases the offsets can be given as a parameter.

In the case of Agilent, only the “narrow” isolation is supported. This roughly equates to +/- 0.65 Da (depending on the instrument). If the file is detected as originating from an Agilent instrument, the isolation widths are automatically set to +/- 0.65 Da (this can be overridden with the offsets argument).

The purity dataframe (pa@puritydf) consists of the following columns:

pid: unique id for MS/MS scan
fileid: unique id for file
seqNum: scan number
precursorIntensity: precursor intensity value as defined from mzML file
precursorMZ: precursor m/z value as defined from mzML file
precursorRT: precursor RT value as defined from mzML file
precursorScanNum: precursor scan number value as defined from mzML file
id: unique id (redundant)
filename: mzML filename
precursorNearest: MS1 scan nearest to this MS/MS scan
aMz: The m/z value in the precursorNearest scan which most closely matches the precursorMZ value provided from the mzML file
aPurity: The purity score for aMz
apkNm: The number of peaks in the isolation window for aMz
iMz: The m/z value in the precursorNearest scan that is the most intense within the isolation window.
iPurity: The purity score for iMz
ipkNm: The number of peaks in the isolation window for iMz
inPurity: The interpolated purity score
inPkNm: The interpolated number of peaks in the isolation window
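The difference between the aMz and iMz columns can be illustrated with a small base R sketch. The peak list below is made up, and for simplicity a single window centred on the reported precursor m/z is used for both scores; how the package itself centres each window is not shown here, so treat this as an assumption.

```r
# Hypothetical MS1 scan (m/z and intensity) and precursor m/z from the mzML file.
scan_mz      <- c(135.60, 136.0215, 136.30, 137.00)
scan_int     <- c(300, 900, 1500, 4000)
precursor_mz <- 136.0200
offsets      <- c(0.5, 0.5)   # assumed isolation window offsets in Da

# Peaks falling inside the isolation window.
in_window <- scan_mz >= precursor_mz - offsets[1] &
             scan_mz <= precursor_mz + offsets[2]
win_mz  <- scan_mz[in_window]
win_int <- scan_int[in_window]

# aMz: the peak that most closely matches the reported precursor m/z.
a_idx <- which.min(abs(win_mz - precursor_mz))
# iMz: the most intense peak in the window.
i_idx <- which.max(win_int)

a_mz <- win_mz[a_idx]
i_mz <- win_mz[i_idx]
a_purity <- win_int[a_idx] / sum(win_int)
i_purity <- win_int[i_idx] / sum(win_int)
pk_num   <- sum(in_window)   # number of peaks in the window
```

Here the closest peak (136.0215) is not the most intense one in the window (136.30), so the two purity scores differ. When the targeted precursor is also the dominant peak, the two sets of columns agree.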

library(msPurity)

msmsPths <- list.files(system.file("extdata", "lcms", "mzML", package="msPurityData"), full.names = TRUE, pattern = "MSMS")
msPths <- list.files(system.file("extdata", "lcms", "mzML", package="msPurityData"), full.names = TRUE, pattern = "LCMS_")

pa <- purityA(msmsPths)

print(head(pa@puritydf))

##   pid fileid seqNum precursorIntensity precursorMZ precursorRT
## 1   1      1      7          2338044.2    391.2838    2.707016
## 2   2      1      8          1415939.8    149.0232    2.707016
## 3   3      1      9          1319700.2    135.1015    2.707016
## 4   4      1     10          1179373.9    219.1742    2.707016
## 5   5      1     11          1065425.9    136.0200    2.707016
## 6   6      1     13           817673.7    235.1690    3.583746
##   precursorScanNum id      filename precursorNearest      aMz   aPurity
## 1                6  7 LCMSMS_1.mzML                6 391.2838 1.0000000
## 2                6  8 LCMSMS_1.mzML                6 149.0233 0.8535700
## 3                6  9 LCMSMS_1.mzML                6 135.1015 0.7616688
## 4                6 10 LCMSMS_1.mzML               12 219.1742 0.7173636
## 5                6 11 LCMSMS_1.mzML               12 136.0215 0.8163521
## 6               12 13 LCMSMS_1.mzML               12 235.1691 0.8312278
##   apkNm      iMz   iPurity ipkNm inPkNm  inPurity
## 1     1 391.2838 1.0000000     1      1 1.0000000
## 2     2 149.0233 0.8535700     2      2 0.8475240
## 3     4 135.1015 0.7616688     4      4 0.7558731
## 4     3 219.1742 0.7173636     3      3 0.7248489
## 5     4 136.0215 0.8163521     4      3 0.8247355
## 6     2 235.1691 0.8312278     2      2 0.8299369
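Because the purity results are an ordinary dataframe, they can be filtered with standard subsetting. The sketch below rebuilds a few columns of the printed output by hand so that it runs without the package; the 0.8 cut-off is an arbitrary value chosen for illustration, not a recommendation.

```r
# A hand-copied subset of the printed pa@puritydf columns.
puritydf <- data.frame(
  pid         = 1:6,
  precursorMZ = c(391.2838, 149.0232, 135.1015, 219.1742, 136.0200, 235.1690),
  inPurity    = c(1.0000000, 0.8475240, 0.7558731, 0.7248489, 0.8247355, 0.8299369)
)

# Keep only MS/MS scans whose interpolated purity meets the (assumed) cut-off.
purity_threshold <- 0.8
high_purity <- puritydf[puritydf$inPurity >= purity_threshold, ]
print(high_purity$pid)
```

With the real object, the same subsetting would be applied to pa@puritydf.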

2.2 Mapping XCMS features to fragmentation spectra

The MS/MS spectra can be assigned to an XCMS grouped feature using the frag4feature function.

First, an xcmsSet object of the same files is required:

library(xcms)

xset <- xcms::xcmsSet(msmsPths)
xset <- xcms::group(xset)

## Processing 3163 mz slices ...

## OK

xset <- xcms::retcor(xset)

## Performing retention time correction using 351 peak groups.

xset <- xcms::group(xset)

## Processing 3163 mz slices ... OK

pa <- frag4feature(pa, xset)

The slot grped_df is a dataframe of the grouped XCMS features, each linked to any associated MS/MS scans that fall within the full width of the XCMS feature in each file. The dataframe contains the following columns:

grpid: XCMS grouped feature id
mz: derived from XCMS peaklist
mzmin: derived from XCMS peaklist
mzmax: derived from XCMS peaklist
rt: derived from XCMS peaklist
rtmin: derived from XCMS peaklist
rtmax: derived from XCMS peaklist
into: derived from XCMS peaklist
intb: derived from XCMS peaklist
maxo: derived from XCMS peaklist
sn: derived from XCMS peaklist
sample: derived from XCMS peaklist
id: unique id of MS/MS scan
precurMtchID: Associated nearest precursor scan id (file specific)
precurMtchRT: Associated precursor scan RT
precurMtchMZ: Associated precursor m/z
precurMtchPPM: Difference in parts per million (ppm) between the associated precursor m/z and the XCMS feature m/z
inPurity: The interpolated purity score

print(head(pa@grped_df))

##     grpid       mz    mzmin    mzmax       rt    rtmin    rtmax      into
## 108     8 112.0508 112.0507 112.0872 67.60929 55.27690 80.36167  36223791
## 109     8 112.0509 112.0506 112.1205 67.51574 55.41402 80.55541  36139266
## 16     12 116.0706 116.0109 116.0709 47.68880 35.74403 60.00379 130337063
## 17     12 116.0706 116.0109 116.0709 47.78054 35.59266 59.86578 124086404
## 46     12 116.0706 116.0109 116.0709 47.68880 35.74403 60.00379 130337063
## 47     12 116.0706 116.0109 116.0709 47.78054 35.59266 59.86578 124086404
##          intf     maxo     maxf i       sn sample  id      filename
## 108 133522504  7158012  8555976 1 21.05495      2 398 LCMSMS_2.mzML
## 109 133395721  7426336  8522973 1 20.38040      1   9 LCMSMS_1.mzML
## 16  491415400 27555850 32138223 1 24.64411      1  13 LCMSMS_1.mzML
## 17  465322433 26501960 30360236 1 24.07917      2 402 LCMSMS_2.mzML
## 46  491415400 27555850 32138223 1 24.64411      1  13 LCMSMS_1.mzML
## 47  465322433 26501960 30360236 1 24.07917      2 402 LCMSMS_2.mzML
##     rtminCorrected rtmaxCorrected precurMtchID precurMtchRT precurMtchMZ
## 108       55.37223       80.35088          466     64.42730     112.0507
## 109       55.39383       80.40870          472     65.37616     112.0507
## 16        35.55508       60.13504          277     41.09841     116.0708
## 17        35.81265       59.93620          277     40.95480     116.0708
## 46        35.55508       60.13504          343     49.31567     116.0708
## 47        35.81265       59.93620          343     49.18240     116.0708
##     precurMtchPPM  inPurity  pid
## 108     0.5516288 1.0000000 1213
## 109     1.0047960 1.0000000  389
## 16      2.2246382 0.9893762  226
## 17      1.7359354 0.9506322 1055
## 46      2.2903689 1.0000000  281
## 47      1.6702047 1.0000000 1110
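A feature can be linked to several MS/MS scans, so a common follow-up step is to pick one scan per feature. The sketch below does this in base R on a hand-copied subset of the printed columns. Ranking by highest inPurity and breaking ties by lowest precurMtchPPM is one possible rule chosen for illustration, not something the package prescribes.

```r
# A hand-copied subset of the printed pa@grped_df columns.
grped_df <- data.frame(
  grpid         = c(8, 8, 12, 12, 12, 12),
  sample        = c(2, 1, 1, 2, 1, 2),
  pid           = c(1213, 389, 226, 1055, 281, 1110),
  precurMtchPPM = c(0.5516288, 1.0047960, 2.2246382, 1.7359354, 2.2903689, 1.6702047),
  inPurity      = c(1.0000000, 1.0000000, 0.9893762, 0.9506322, 1.0000000, 1.0000000)
)

# Sort within each feature: highest purity first, then smallest ppm difference.
ranked <- grped_df[order(grped_df$grpid, -grped_df$inPurity, grped_df$precurMtchPPM), ]

# Keep the first (best-ranked) row for each grouped feature.
best_scan <- ranked[!duplicated(ranked$grpid), ]
print(best_scan[, c("grpid", "pid", "inPurity", "precurMtchPPM")])
```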

The slot grped_ms2 is a list of the associated fragmentation spectra for the grouped features.

print(pa@grped_ms2[2:3])

## $`12`
## $`12`[[1]]
##          [,1]       [,2]
## [1,] 107.2701   1726.613
## [2,] 116.0164   2890.495
## [3,] 116.0709 100876.133
## [4,] 116.1072   2424.613
## 
## $`12`[[2]]
##          [,1]      [,2]
## [1,] 116.0168  3725.937
## [2,] 116.0709 97631.586
## [3,] 116.1071  3945.327
## 
## $`12`[[3]]
##          [,1]    [,2]
## [1,] 116.0709 1847703
## 
## $`12`[[4]]
##          [,1]        [,2]
## [1,] 103.1290    4419.712
## [2,] 116.0164    5682.144
## [3,] 116.0709 1782171.000
## [4,] 130.0276    4081.138
## 
## $`12`[[5]]
##          [,1]       [,2]
## [1,] 116.0166   4434.369
## [2,] 116.0709 165623.641
## [3,] 116.1073  11372.488
## 
## $`12`[[6]]
##          [,1]       [,2]
## [1,] 116.0168  14364.784
## [2,] 116.0709 149471.266
## [3,] 116.1074   8359.903
## 
## 
## $`27`
## $`27`[[1]]
##          [,1]       [,2]
## [1,] 117.8772   5004.664
## [2,] 132.1019 273406.250
## 
## $`27`[[2]]
##          [,1]      [,2]
## [1,] 132.1020 402822.69
## [2,] 144.2789   7715.69
## [3,] 146.6829   7014.51
## 
## $`27`[[3]]
##         [,1]      [,2]
## [1,] 130.187  121726.4
## [2,] 132.102 3973065.5
## 
## $`27`[[4]]
##          [,1]      [,2]
## [1,] 104.9648  111113.7
## [2,] 132.1021 3328366.8
## 
## $`27`[[5]]
##          [,1]     [,2]
## [1,] 132.1021 77799.47
## 
## $`27`[[6]]
##          [,1]     [,2]
## [1,] 115.4372  2424.58
## [2,] 132.1020 67118.03

3 Assessing precursor purity of anticipated isolation windows for MS2 spectra

3.1 purityX: Assessing anticipated purity of XCMS features from an LC-MS run

NOTE ON TERMINOLOGY: The terms ‘anticipated purity’ and ‘predicted purity’ are used interchangeably.

A processed xcmsSet object is required to determine the anticipated (predicted) precursor purity score from an LC-MS dataset. The offsets chosen in the parameters should reflect what settings would be used in a hypothetical fragmentation experiment.

The slot predictions provides the anticipated (predicted) purity scores for each feature. The dataframe contains the following columns:

grpid: XCMS grouped feature id
mean: Mean predicted purity of the feature
median: Median predicted purity of the feature
sd: Standard deviation of the predicted purity of the feature
stde: Standard error of the predicted purity of the feature
pknm: Median peak number in isolation window
RSD: Relative standard deviation of the predicted purity of the feature
i: Median intensity of the grouped feature. Uses XCMS “into” intensity value.
mz: m/z of the XCMS grouped feature

XCMS run on an LC-MS dataset

xset <- xcms::xcmsSet(msPths)
xset <- xcms::group(xset)

## Processing 3179 mz slices ...

## OK

xset <- xcms::retcor(xset)

## Performing retention time correction using 763 peak groups.

xset <- xcms::group(xset)

## Processing 3179 mz slices ... OK

Perform purity calculations

ppLCMS <- purityX(xset, offsets=c(0.5, 0.5), xgroups = c(1, 2))

print(head(ppLCMS@predictions))

##   grpid      mean    median          sd         stde       RSD pknm
## 1     1 0.9901505 0.9901505 0.001354984 0.0009581183 0.1368463 2.75
## 2     2 1.0000000 1.0000000 0.000000000 0.0000000000 0.0000000 1.00
##          i       mz
## 1 61925043 102.0916
## 2 25719001 103.0544
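The summary columns can be reproduced from a vector of per-scan purity scores using standard definitions. The helper `summarise_purity` and the input values below are hypothetical. The definitions used (stde as sd divided by the square root of n, RSD as sd divided by the mean, times 100) are assumptions, although the printed row for feature 1 appears consistent with them.

```r
# Summarise per-scan purity scores for one feature (hypothetical helper).
summarise_purity <- function(purity) {
  n <- length(purity)
  data.frame(
    mean   = mean(purity),
    median = median(purity),
    sd     = sd(purity),
    stde   = sd(purity) / sqrt(n),            # standard error of the mean
    RSD    = sd(purity) / mean(purity) * 100  # relative standard deviation, in percent
  )
}

# Made-up purity scores from four scans across a chromatographic peak.
scan_purities   <- c(0.98, 1.00, 0.99, 0.97)
feature_summary <- summarise_purity(scan_purities)
print(feature_summary)
```

A feature with a high mean purity but a large RSD has a purity that changes across its elution profile, which may matter when choosing the retention time for a targeted MS2 experiment.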

3.2 purityD: Assessing anticipated purity from a DI-MS run

The anticipated (predicted) purity can be calculated for any DI-MS dataset consisting of multiple MS1 scans of the same mass range; it has not been developed for use with SIM stitching approaches.

A number of simple data processing steps are performed on the mzML files to provide a DI-MS peak list (features) to perform the purity predictions on.

These data processing steps consist of:

Averaging peaks across multiple scans
Removing peaks below a signal-to-noise threshold [optional]
Removing peaks below an intensity threshold [optional]
Removing peaks above an RSD threshold for intensity [optional]
Subtracting blank peaks, where a blank is available [optional]
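The optional filtering steps listed above can be sketched on a plain dataframe in base R. The peak table, the blank peak list, and the 5 ppm matching tolerance are made up for illustration, and the blank subtraction shown (dropping any sample peak that matches a blank peak by m/z) is a deliberate simplification rather than the package's method.

```r
# Hypothetical averaged peak list for a sample.
peaks <- data.frame(
  mz  = c(173.0806, 179.1177, 217.1067, 250.0000, 300.0000),
  i   = c(11272447, 606983, 17770220, 3000, 900000),
  snr = c(216.5, 11.4, 343.3, 2.1, 40.0),
  rsd = c(9.0, 6.0, 8.6, 4.0, 35.0)
)
# Hypothetical m/z values of peaks detected in the blank.
blank_mz <- c(179.1180)

keep_snr <- peaks$snr >= 5     # signal-to-noise threshold
keep_int <- peaks$i   >= 5000  # intensity threshold
keep_rsd <- peaks$rsd <= 10    # RSD threshold for intensity

# Simplified blank subtraction: flag sample peaks within 5 ppm of a blank peak.
ppm_tol  <- 5
in_blank <- vapply(peaks$mz, function(m) {
  any(abs(m - blank_mz) / m * 1e6 <= ppm_tol)
}, logical(1))

processed <- peaks[keep_snr & keep_int & keep_rsd & !in_blank, ]
print(processed)
```

In this toy example the peak at m/z 250 fails the signal-to-noise and intensity filters, the peak at m/z 300 fails the RSD filter, and the peak at m/z 179.1177 is removed because it matches the blank.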

The averaged peaks before and after filtering are stored in the avPeaks slot of the purityPD S4 object.

Get file dataframe: The purityD constructor requires a dataframe consisting of the following columns:

filepth
name
sampleType [either sample or blank]
class [for grouping samples together]
polarity [optional]

datapth <- system.file("extdata", "dims", "mzML", package="msPurityData")
inDF <- Getfiles(datapth, pattern=".mzML", check = FALSE)
ppDIMS <- purityD(inDF, mzML=TRUE)

Average spectra: The default averaging uses a hierarchical clustering approach. Noise filtering is also performed here.

ppDIMS <- averageSpectra(ppDIMS, snMeth = "median", snthr = 5)
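To give an intuition for averaging peaks across scans by hierarchical clustering, here is a base R sketch using `hclust` and `cutree` from the stats package. It is not the code run by averageSpectra: the scan data are made up, and complete linkage with a fixed 0.001 Da cut height is an assumption chosen to keep the example short.

```r
# Made-up peaks from three scans of the same sample.
scans <- data.frame(
  scan = c(1, 1, 2, 2, 3, 3),
  mz   = c(173.0805, 217.1066, 173.0807, 217.1068, 173.0806, 217.1067),
  i    = c(100, 200, 110, 190, 90, 210)
)

# Cluster peaks on m/z so that the same ion seen in different scans is grouped.
tree <- hclust(dist(scans$mz), method = "complete")

# Cut the tree at an assumed tolerance of 0.001 Da.
scans$cluster <- cutree(tree, h = 0.001)

# Average m/z and intensity within each cluster.
averaged <- aggregate(cbind(mz, i) ~ cluster, data = scans, FUN = mean)
print(averaged)
```

A ppm-based tolerance would usually be more appropriate than a fixed Da value across a wide mass range; the fixed value is used here only for simplicity.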

Filter by RSD and Intensity

ppDIMS <- filterp(ppDIMS, thr=5000, rsd = 10)

Subtract blank

ppDIMS <- subtract(ppDIMS)

Predict purity

ppDIMS <- dimsPredictPurity(ppDIMS)

print(head(ppDIMS@avPeaks$processed$B02_Daph_TEST_pos))

##    peakID       mz          i        snr      rsd        inorm
## 5       5 173.0806 11272447.0 216.506319 9.006126 0.0108585920
## 7       7 179.1177   606983.2  11.425825 6.019861 0.0005729283
## 10     10 217.1067 17770220.0 343.292914 8.602331 0.0171178067
## 15     15 235.1173  4950841.5  95.991762 6.302825 0.0047694791
## 16     16 236.1206   486912.0   9.270517 8.811437 0.0004638254
## 17     17 239.1485  2533134.5  48.892062 5.781277 0.0024401334
##    medianPurity meanPurity   sdPurity cvPurity   sdePurity medianPeakNum
## 5     1.0000000  1.0000000 0.00000000 0.000000 0.000000000             1
## 7     1.0000000  1.0000000 0.00000000 0.000000 0.000000000             1
## 10    0.7797864  0.7808917 0.01261501 1.615462 0.005641605             2
## 15    1.0000000  1.0000000 0.00000000 0.000000 0.000000000             1
## 16    0.8818313  0.8755873 0.01056807 1.206969 0.004726184             2
## 17    0.8123950  0.8229505 0.04384595 5.327896 0.019608505             2

3.3 Calculating the anticipated (predicted) purity from a known m/z target list for DI-MS

The data processing steps carried out through purityPD can be bypassed if the peaks (m/z values) of interest are already known. The function dimsPredictPuritySingle() can be used to predict the purity of a list of m/z values in a chosen mzML file.

mzpth <- system.file("extdata", "dims", "mzML", "B02_Daph_TEST_pos.mzML", package="msPurityData")
predicted <- dimsPredictPuritySingle(filepth = mzpth, mztargets = c(111.0436, 113.1069))
print(predicted)

##   medianPurity meanPurity  sdPurity  cvPurity  sdePurity medianPeakNum
## 1    0.6390276  0.6251787 0.0356821  5.707505 0.01595752             5
## 2    0.7453778  0.7619277 0.1008513 13.236338 0.04510209             5

References

Michalski, Annette, Juergen Cox, and Matthias Mann. 2011. “More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC-MS/MS.” Journal of Proteome Research 10 (4): 1785–93. doi:10.1021/pr101060v.

Using msPurity for Automated Evaluation of Precursor Ion Purity for Mass Spectrometry Based Fragmentation in Metabolomics

Thomas N. Lawson

2017-04-24
