1 Introduction

The RMassBank workflow itself is described in the other manual; this document mainly explains how to use the XCMS, MassBank, and peaklist readMethods for Step 1 of the workflow.

2 Input files

2.1 LC/MS data

RMassBank handles high-resolution LC/MS spectra in mzML or mzdata format, in centroid or in profile mode. (The term “centroid” here refers to any kind of data that are not in profile mode, i.e. that do not have continuous m/z data. It does not refer to the mathematical centroid of a peak, i.e. the area-weighted mass peak.)

Data in the examples was acquired using a QTOF instrument.

In the standard workflow, the file names are used to identify a compound: file names must be in the format xxxxxxxx_1234_xxx.mzXML, where the xxx parts denote anything and the 1234 part denotes the compound ID in the compound list (see below). Advanced and alternative uses can be implemented; consult the implementation of msmsRead, msms_workflow and findMsMsHRperX.direct for more information.
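
The naming convention can be illustrated with a small base-R sketch. The helper extract_cpd_id and the file names are hypothetical and not part of RMassBank; the sketch assumes the ID is the second-to-last underscore-separated field of the file name, which fits the pattern above but may differ from how RMassBank parses names internally.

```r
# Hypothetical helper (not an RMassBank function): pull the compound ID out of
# file names shaped like xxxxxxxx_1234_xxx.<ext>.
extract_cpd_id <- function(paths) {
  # Drop the directory and the file extension.
  stems <- sub("\\.[^.]*$", "", basename(paths))
  vapply(strsplit(stems, "_", fixed = TRUE), function(parts) {
    # At least three fields are needed: prefix, ID, suffix.
    if (length(parts) < 3) return(NA_integer_)
    id <- parts[length(parts) - 1]
    # Accept the field only if it consists of digits.
    if (grepl("^[0-9]+$", id)) as.integer(id) else NA_integer_
  }, integer(1))
}

# Made-up file names for illustration.
example_files <- c("/data/sample_1234_pos.mzML", "noid.mzML")
ids <- extract_cpd_id(example_files)
```

Files for which the sketch returns NA do not follow the pattern and would need to be renamed before they can be matched against the compound list.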

3 Additional Workflow-Methods

The data used in the following example is available in the package RMassBankData, so both RMassBank and RMassBankData have to be installed to run this vignette.

library(RMassBank)
library(RMassBankData)

3.1 Options

In the first part of the workflow, spectra are extracted from the files and processed. In the following example, we will process the Glucolesquerellin spectra from the provided files.

For the workflow to work correctly, we start from the default settings and modify them to match the data acquisition method. The settings have to contain the same parameters as they would for the mzR method of the workflow.

RmbDefaultSettings()
rmbo <- getOption("RMassBank")
rmbo$spectraList <- list(
  list(mode="CID", ces="10eV", ce="10eV", res=12000),
  list(mode="CID", ces="20eV", ce="20eV", res=12000)
)

rmbo$multiplicityFilter <- 1
rmbo$annotations$instrument <- "Bruker micrOTOFq"
rmbo$annotations$instrument_type <- "LC-ESI-QTOF"
rmbo$recalibrator$MS1 <- "recalibrate.identity"
rmbo$recalibrator$MS2 <- "recalibrate.identity"
options("RMassBank" = rmbo)
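
Before storing the options, it can help to check that every spectraList entry is complete. The following base-R sketch assumes that the four fields shown in the example (mode, ces, ce, res) are the ones each entry needs; check_spectra_list is a hypothetical helper, not an RMassBank function.

```r
# Hypothetical helper (not part of RMassBank): verify that each spectraList
# entry carries the fields used in the example settings.
check_spectra_list <- function(spectraList,
                               required = c("mode", "ces", "ce", "res")) {
  # For every entry, collect the required field names that are absent.
  missing <- lapply(spectraList, function(entry) setdiff(required, names(entry)))
  bad <- which(lengths(missing) > 0)
  if (length(bad) > 0) {
    details <- vapply(missing[bad], paste, character(1), collapse = ", ")
    stop("spectraList entries with missing fields: ",
         paste(sprintf("#%d (%s)", bad, details), collapse = "; "))
  }
  invisible(TRUE)
}

# Same structure as the settings example, built as a plain list.
spectraList <- list(
  list(mode = "CID", ces = "10eV", ce = "10eV", res = 12000),
  list(mode = "CID", ces = "20eV", ce = "20eV", res = 12000)
)
ok <- check_spectra_list(spectraList)
```

The check stops with an error that names the incomplete entries, so a typo in a field name is caught before the workflow starts.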

3.2 XCMS-workflow

First, a workspace for the msmsWorkflow must be created:

msmsList <- newMsmsWorkspace()

The full paths of the files must be loaded into the container in the array files:

msmsList@files <- list.files(system.file("spectra.Glucolesquerellin",
                                         package = "RMassBankData"),
                             "Glucolesquerellin.*mzData", full.names=TRUE)

Note the position of the compound IDs in the filenames. Historically, the “pos” at the end was used to denote the polarity; it is obsolete now, but the ID must still be terminated with an underscore. If you have multiple files for one compound, you have to give them the same ID; because the polarity suffix is obsolete, you can simply number the files in its place.

Additionally, the compound list must be loaded using loadList:

loadList(system.file("list/PlantDataset.csv",package="RMassBankData"))

## Loaded compoundlist successfully

Basically, the changes to the workflow using XCMS can be described as follows:

The MS2 spectra (and optionally the MS1 spectrum) are extracted and peak-picked using XCMS. You can pass parameters to the findPeaks function of XCMS through the findPeaksArgs argument to control how peaks are detected. Then, CAMERA processes the peak lists and creates pseudospectra (or compound spectra). The resulting pseudospectra are stored in the array specs.

Please note that “findPeaksArgs” has to be a list with the list elements named after the arguments that the method you want to use contains, as findPeaks is called by do.call. For example, if you want to use centWave with a peakwidth from 5 to 12 and 25 ppm, findPeaksArgs would look like this:

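# Arguments for findPeaks; each list name must match an argument of the method
Args <- list(method = "centWave",
             peakwidth = c(5, 12),
             prefilter = c(0, 0),
             ppm = 25,
             snthresh = 2)  # full argument name instead of the abbreviation "snthr"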
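
Why the list names matter can be seen in a minimal base-R sketch of do.call. The function pick_peaks is a toy stand-in invented for this illustration; it is not the XCMS findPeaks function and only echoes the arguments it receives.

```r
# Toy stand-in for a peak picker (hypothetical, not an XCMS function).
pick_peaks <- function(method, ppm = 10, peakwidth = c(20, 50)) {
  list(method = method, ppm = ppm, peakwidth = peakwidth)
}

# Named list of arguments; the order of the elements does not matter.
toy_args <- list(peakwidth = c(5, 12), method = "centWave", ppm = 25)

# do.call matches each list name to the formal argument of the same name.
result <- do.call(pick_peaks, toy_args)

# A name that matches no formal argument raises an error, captured here
# as a message string instead of stopping the script.
bad <- tryCatch(do.call(pick_peaks, c(toy_args, list(ppmm = 1))),
                error = function(e) conditionMessage(e))
```

Because arguments are matched by name, a misspelled list element is either rejected, as in the toy function, or silently absorbed if the called function accepts additional arguments through `...`.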

To use XCMS for Step 1 of the workflow, set the readMethod parameter to “xcms” and, if you do not want the default findPeaks values, pass findPeaksArgs on to the workflow (as Args in the call below).

msmsList <- msmsRead(msmsList, files= msmsList@files, 
                     readMethod = "xcms", mode = "mH", Args = Args, plots = TRUE)

## Provided scanrange was adjusted to 1 - 0

## Provided scanrange was adjusted to 1 - 0

## MS2 spectra without precursorScan references, using estimation

## Detecting mass traces at 25 ppm ...

## OK

## Detecting chromatographic peaks in 40 regions of interest ... OK: 29 found.
## Provided scanrange was adjusted to 1 - 0
## Provided scanrange was adjusted to 1 - 0

## MS2 spectra without precursorScan references, using estimation

## Detecting mass traces at 25 ppm ... OK
## Detecting chromatographic peaks in 32 regions of interest ... OK: 29 found.

msmsList <- msmsWorkflow(msmsList, steps=2:8,
                         mode="mH", readMethod="xcms")

## msmsWorkflow: Step 2. First analysis pre recalibration
## msmsWorkflow: Step 3. Aggregate all spectra
## msmsWorkflow: Step 4. Recalibrate m/z values in raw spectra
## msmsWorkflow: Step 5. Reanalyze recalibrated spectra
## msmsWorkflow: Step 6. Aggregate recalibrated results
## msmsWorkflow: Step 7. Reanalyze fail peaks for N2 + O
## msmsWorkflow: Step 8. Peak multiplicity filtering
## msmsWorkflow: Done.

You can of course run the rest of the workflow as usual by setting steps to 2:8, as done here; Step 1 has already been carried out by msmsRead.

3.3 Export the records

To export the records from the XCMS workflow, follow the same procedure as the standard RMassBank workflow, i.e.:

mb <- newMbWorkspace(msmsList)
mb <- resetInfolists(mb)
mb <- loadInfolist(mb,system.file("infolists/PlantDataset.csv",
                                  package = "RMassBankData"))
# Run all steps of the mbWorkflow
mb <- mbWorkflow(mb, steps=1:8)

## mbWorkflow: Step 1. Gather info from several databases

## mbWorkflow: Step 2. Export infolist (if required)

## No new data added.

## mbWorkflow: Step 3. Data reformatting

## mbWorkflow: Step 4. Spectra compilation

## Compiling: Glucolesquerellin

## mbWorkflow: [Legacy Step 5. Flattening records] ignored

## mbWorkflow: Step 6. Generate molfiles

## mbWorkflow: Step 7. Generate subdirs and export

## mbWorkflow: Step 8. Create list.tsv

3.4 peaklist-workflow

The peaklist workflow works like the normal mzR workflow; the only difference is that the supplied data must be in .csv format and contain two columns, “mz” and “int”. An example file is available in the RMassBankData package under spectra.Glucolesquerellin. Note that the csv files must be named like the mzdata files, differing only in the filename extension. The readMethod name for this is “peaklist”:

msmsPeaklist <- newMsmsWorkspace()
msmsPeaklist@files <- list.files(system.file("spectra.Glucolesquerellin",
                                             package = "RMassBankData"),
                                 "Glucolesquerellin.*csv", full.names=TRUE)
msmsPeaklist <- msmsWorkflow(msmsPeaklist, steps=1:8,
                             mode="mH", readMethod="peaklist")

## msmsWorkflow: Step 1. Acquire all MSMS spectra from files

## Peaks read

## msmsWorkflow: Step 2. First analysis pre recalibration

## msmsWorkflow: Step 3. Aggregate all spectra

## msmsWorkflow: Step 4. Recalibrate m/z values in raw spectra

## msmsWorkflow: Step 5. Reanalyze recalibrated spectra

## msmsWorkflow: Step 6. Aggregate recalibrated results

## msmsWorkflow: Step 7. Reanalyze fail peaks for N2 + O

## msmsWorkflow: Step 8. Peak multiplicity filtering

## msmsWorkflow: Done.
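
The two-column layout expected for peaklist input can be sketched in base R as follows. The m/z and intensity values and the file name are made up for illustration, and the sketch assumes a comma-separated file with a header row; compare with the example file shipped in RMassBankData before relying on it.

```r
# Made-up peak list; the values are placeholders, not measured data.
peaks <- data.frame(mz  = c(100.0757, 145.0496, 259.0129),
                    int = c(1520, 8300, 42000))

# Hypothetical file name following the <name>_<ID>_<suffix> pattern.
csv_path <- file.path(tempdir(), "Example_1234_pos.csv")
write.csv(peaks, csv_path, row.names = FALSE)

# Read the file back and confirm the expected layout.
check <- read.csv(csv_path)
stopifnot(identical(names(check), c("mz", "int")),
          is.numeric(check$mz), is.numeric(check$int))
```

Setting row.names = FALSE matters here: otherwise write.csv adds an unnamed first column and the file no longer has exactly two columns.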

The records can then be generated and exported with mbWorkflow.

4 Session information

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
##  [3] LC_TIME=en_GB                 LC_COLLATE=C                 
##  [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
##  [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
##  [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
## [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] xcms_3.14.0          MSnbase_2.18.0       ProtGenerics_1.24.0 
##  [4] S4Vectors_0.30.0     mzR_2.26.0           Biobase_2.52.0      
##  [7] BiocGenerics_0.38.0  BiocParallel_1.26.0  gplots_3.1.1        
## [10] RMassBankData_1.29.0 RMassBank_3.2.0      Rcpp_1.0.6          
## [13] BiocStyle_2.20.0    
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-1            rjson_0.2.20               
##   [3] ellipsis_0.3.2              htmlTable_2.2.1            
##   [5] XVector_0.32.0              GenomicRanges_1.44.0       
##   [7] base64enc_0.1-3             rstudioapi_0.13            
##   [9] clue_0.3-59                 affyio_1.62.0              
##  [11] fansi_0.4.2                 codetools_0.2-18           
##  [13] splines_4.1.0               ncdf4_1.17                 
##  [15] doParallel_1.0.16           impute_1.66.0              
##  [17] robustbase_0.93-7           knitr_1.33                 
##  [19] itertools_0.1-3             Formula_1.2-4              
##  [21] jsonlite_1.7.2              rJava_1.0-4                
##  [23] cluster_2.1.2               vsn_3.60.0                 
##  [25] png_0.1-7                   graph_1.70.0               
##  [27] BiocManager_1.30.15         compiler_4.1.0             
##  [29] httr_1.4.2                  backports_1.2.1            
##  [31] assertthat_0.2.1            Matrix_1.3-3               
##  [33] limma_3.48.0                htmltools_0.5.1.1          
##  [35] tools_4.1.0                 igraph_1.2.6               
##  [37] gtable_0.3.0                glue_1.4.2                 
##  [39] GenomeInfoDbData_1.2.6      affy_1.70.0                
##  [41] RANN_2.6.1                  dplyr_1.0.6                
##  [43] MALDIquant_1.19.3           jquerylib_0.1.4            
##  [45] vctrs_0.3.8                 preprocessCore_1.54.0      
##  [47] iterators_1.0.13            xfun_0.23                  
##  [49] stringr_1.4.0               lifecycle_1.0.0            
##  [51] gtools_3.8.2                XML_3.99-0.6               
##  [53] DEoptimR_1.0-8              zlibbioc_1.38.0            
##  [55] MASS_7.3-54                 scales_1.1.1               
##  [57] pcaMethods_1.84.0           MatrixGenerics_1.4.0       
##  [59] fingerprint_3.5.7           SummarizedExperiment_1.22.0
##  [61] RBGL_1.68.0                 MassSpecWavelet_1.58.0     
##  [63] RColorBrewer_1.1-2          rcdk_3.5.0                 
##  [65] yaml_2.2.1                  curl_4.3.1                 
##  [67] gridExtra_2.3               ggplot2_3.3.3              
##  [69] sass_0.4.0                  rpart_4.1-15               
##  [71] latticeExtra_0.6-29         stringi_1.6.2              
##  [73] highr_0.9                   foreach_1.5.1              
##  [75] checkmate_2.0.0             caTools_1.18.2             
##  [77] GenomeInfoDb_1.28.0         rlang_0.4.11               
##  [79] pkgconfig_2.0.3             bitops_1.0-7               
##  [81] matrixStats_0.58.0          mzID_1.30.0                
##  [83] evaluate_0.14               lattice_0.20-44            
##  [85] purrr_0.3.4                 htmlwidgets_1.5.3          
##  [87] tidyselect_1.1.1            plyr_1.8.6                 
##  [89] magrittr_2.0.1              bookdown_0.22              
##  [91] R6_2.5.0                    IRanges_2.26.0             
##  [93] magick_2.7.2                generics_0.1.0             
##  [95] Hmisc_4.5-0                 DelayedArray_0.18.0        
##  [97] DBI_1.1.1                   pillar_1.6.1               
##  [99] foreign_0.8-81              MsCoreUtils_1.4.0          
## [101] nnet_7.3-16                 survival_3.2-11            
## [103] RCurl_1.98-1.3              tibble_3.1.2               
## [105] CAMERA_1.48.0               crayon_1.4.1               
## [107] rcdklibs_2.3                KernSmooth_2.23-20         
## [109] utf8_1.2.1                  rmarkdown_2.8              
## [111] jpeg_0.1-8.1                grid_4.1.0                 
## [113] data.table_1.14.0           digest_0.6.27              
## [115] munsell_0.5.0               bslib_0.2.5.1

RMassBank for XCMS

19 May 2021