MTBLS2 Processing and Analysis with xcms and CAMERA and export to MetaboLights

Introduction
A few global settings
Raw data conversion
R and ISAtab
ISAtab, Risa and xcms
Grouping and Retention time correction
QC on peaks picked
Data imputation
QC with some diagnostic plots
QC with PCA
Annotated diffreport
Diffreport
Combine diffreport and CAMERA spectra
R, ISAtab, xcms and CAMERA revisited

library(Risa)
library(xcms)
library(CAMERA)

## Warning: replacing previous import 'xcms::plot' by 'graphics::plot' when
## loading 'CAMERA'

library(pcaMethods)

Introduction

Indole-3-acetaldoxime (IAOx) represents an early intermediate of the biosynthesis of a variety of indolic secondary metabolites including the phytoanticipin indol-3-ylmethyl glucosinolate and the phytoalexin camalexin (3-thiazol-2’-yl-indole). Arabidopsis thaliana cyp79B2 cyp79B3 double knockout plants are completely impaired in the conversion of tryptophan to indole-3-acetaldoxime and do not accumulate IAOx-derived metabolites any longer. Consequently, comparative analysis of wild-type and cyp79B2 cyp79B3 plant lines has the potential to explore the complete range of IAOx-derived indolic secondary metabolites.

Since 2006, the Bioconductor package xcms (Smith et al, 2006) provides a rich set of algorithms for mass spectrometry data processing. Typically, xcms will create an xcmsSet object from several raw data files in an assay, which are obtained from the samples in the study.
Allowed raw data formats are netCDF, mzData, mzXML and mzML.

In this vignette, we demonstrate the processing of the MTBLS2 dataset, which was described in Neumann 2012.

A few global settings

A few things might be worth to define at the beginning of an analysis

# prefilter <- c(3,200)  ## standard
prefilter=c(6,750)      ## quick-run for debugging
nSlaves=4

Raw data conversion

This can be done with the vendor tools, or the open source proteowizard converter. The preferred format should be mzML or mzData/mzXML. An overview of formats (and problems) is available at the xcms online help pages.

R and ISAtab

An ISAtab archive will contain the metadata description in several tab-separated files. (One of) the assay files contains the column Raw Spectral Data File with the paths to the mass spectral raw data files in one of the above formats.

ISAmtbls2 <- readISAtab(find.package("mtbls2"))
a.filename <- ISAmtbls2["assay.filenames"][[1]]

msfiles <- getAssayRawDataFilenames(ISAmtbls2@assay.tabs[[1]], full.path = TRUE)[,1]
adf <- getAnnotatedDataFrameAssay(ISAmtbls2, assay.filename = a.filename)

ISAtab, Risa and xcms

With the combination of Risa and xcms, we can convert the MS raw data in an ISAtab archive into an xcmsSet:

cwp <- CentWaveParam(ppm = 25, peakwidth = c(20, 50), snthresh = 10,
  prefilter = c(3, 100), mzCenterFun = "wMean", integrate = 1L,
  mzdiff = -0.001, fitgauss = FALSE, noise = 0, verboseColumns = FALSE,
  roiList = list(), firstBaselineCheck = TRUE, roiScales = numeric())

raw_data <- readMSData(msfiles, mode = "onDisk")

## Perform the peak detection using the settings defined above.
mtbls2 <- findChromPeaks(raw_data, param = cwp, BPPARAM = bpparam())

pData(mtbls2) <- pData(adf)
          

head(chromPeaks(mtbls2))

##               mz    mzmin    mzmax       rt    rtmin    rtmax     into
## CP00001 112.8962 112.8942 112.8979 24.80802 20.43102 47.37300 22816.15
## CP00002 399.0922 399.0856 399.0976 27.50400 21.10398 45.35298 31005.04
## CP00003 240.8674 240.8649 240.8718 26.15700 22.11402 35.58600 16004.71
## CP00004 146.1652 146.1631 146.1671 20.43102 14.70498 41.64702 54426.96
## CP00005 223.9742 223.9701 223.9791 26.83002 21.44100 45.35298 11775.13
## CP00006 482.0321 482.0214 482.0399 28.17702 22.11402 46.69998 21127.42
##             intb maxo   sn sample
## CP00001 22654.98 6051   86      1
## CP00002 29206.56 3797   14      1
## CP00003 15993.93 5504 5503      1
## CP00004 53544.79 9077   65      1
## CP00005 11749.60 1881   56      1
## CP00006 20629.23 2789   22      1

The result is the same type of xcmsSet object:

show(mtbls2)

## MSn experiment data ("XCMSnExp")
## Object size in memory: 15.72 Mb
## - - - Spectra data - - -
##  MS level(s): 1 
##  Number of spectra: 57066 
##  MSn retention times: 0:0 - 20:2 minutes
## - - - Processing information - - -
## Data loaded [Tue Nov  5 11:58:43 2019] 
##  MSnbase version: 2.12.0 
## - - - Meta data  - - -
## phenoData
##   rowNames: Ex1-Col0-48h-Ag-1 Ex1-Col0-48h-Ag-2 ...
##     Ex2-cyp79-48h-Ag-4 (16 total)
##   varLabels: Factor Value[genotype] Factor Value[replicate]
##   varMetadata: labelDescription
## Loaded from:
##   [1] MSpos-Ex1-Col0-48h-Ag-1_1-A,1_01_9818.mzData...  [16] MSpos-Ex2-cyp79-48h-Ag-4_1-B,4_01_9834.mzData
##   Use 'fileNames(.)' to see all files.
## protocolData: none
## featureData
##   featureNames: F01.S0001 F01.S0002 ... F16.S3568 (57066 total)
##   fvarLabels: fileIdx spIdx ... spectrum (35 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
##  method: centWave 
##  24786 peaks identified in 16 samples.
##  On average 1549 chromatographic peaks per sample.

Several options exist to quantify the individual intensities. For each feature, additional attributes are available, such as the minimum/maximum and average retention time and m/z values.

Grouping and Retention time correction

In the following steps, we perform a grouping: because the UPLC system used here has very stable retention times, we just use the retention time correction step as quality control of the raw data. After that, ‘fillPeaks()’ will integrate the raw data for those features, which were not detected in some of the samples.

## Perform the peak grouping with default settings:
pdp <- PeakDensityParam(sampleGroups = as.integer(interaction(pData(mtbls2), drop=TRUE)))

mtbls2grouped <- groupChromPeaks(mtbls2, pdp)

## Processing 7195 mz slices ... OK

pgp <- PeakGroupsParam(minFraction = 1, extraPeaks = 1, smooth = "loess",
  span = 0.2, family = "gaussian")

mtbls2groupedretcor <- adjustRtime(mtbls2grouped, param = pgp)

## Performing retention time correction using 586 peak groups.

## Warning: Adjusted retention times had to be re-adjusted for some files
## to ensure them being in the same order than the raw retention times. A
## call to 'dropAdjustedRtime' might thus fail to restore retention times
## of chromatographic peaks to their original values. Eventually consider to
## increase the value of the 'span' parameter.

## Applying retention time adjustment to the identified chromatographic peaks ... OK

## Visualize the impact of the alignment. We show both versions of the plot,
## with the raw retention times on the x-axis (top) and with the adjusted
## retention times (bottom).
par(mfrow = c(2, 1))
plotAdjustedRtime(mtbls2groupedretcor, adjusted = FALSE)
grid()
plotAdjustedRtime(mtbls2groupedretcor)
grid()

QC on peaks picked

A first QC step is the visual inspection of intensities across the samples. Alternatively to a boxplot, one could also create histograms/density plots.

boxplot(featureValues(mtbls2grouped, value="into") +1, 
        #col=as.numeric(sampclass(mtbls2Set))+1, 
        log="y", las=2)

Data imputation

After grouping, peaks might be missing/not found in some samples. fillPekas() will impute them, using the consensus mz and RT from the other samples.

mtbls2groupedFilled <- fillChromPeaks(mtbls2grouped)

## Defining peak areas for filling-in .... OK
## Start integrating peak areas from original files

The final xcmsSet represents a rectangular matrix of mass spectral features, which were detected (or imputed) across the samples. The dimensionality is M * N, where M denotes the number of samples in the assay, and N the number of features grouped across the samples.

QC with some diagnostic plots

QC of mass accuracy and retention time consistency

#plotQC(mtbls2Set)

QC with PCA

In addition to the boxplot for QC, we can also check a hierarchical clustering and the PCA of the samples.

sdThresh <- 4.0 ## Filter low-standard deviation rows for plot
data <- log(featureValues(mtbls2groupedFilled))+1

pca.result <- pca(data, nPcs=3)
plotPcs(pca.result, type="loadings", 
        #col=as.numeric(sampclass(mtbls2Set))+1
        )

Annotated diffreport

## Since CAMERA has not yet been ported to XCMSnExp,
## we need to convert to xcmsSet. Note that 
## the conversion only makes sense for somple XCMSnSets, 
## without e.g. MS level filtering (where CAMERA would then 
## extract the wrong )
mtbls2Set <- as(mtbls2groupedFilled, "xcmsSet")

## Note: you might want to set/adjust the 'sampclass' of the returned xcmSet object before proceeding with the analysis.

mtbls2Set <- fillPeaks(mtbls2Set)

## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex1-Col0-48h-Ag-1_1-A,1_01_9818.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex1-Col0-48h-Ag-2_1-A,1_01_9820.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex1-Col0-48h-Ag-3_1-A,1_01_9822.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex1-Col0-48h-Ag-4_1-A,1_01_9824.mzData 
## method:  bin 
## step:  0.1

## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex2-Col0-48h-Ag-1_1-A,2_01_9827.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex2-Col0-48h-Ag-2_1-A,3_01_9829.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex2-Col0-48h-Ag-3_1-A,4_01_9831.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex2-Col0-48h-Ag-4_1-A,2_01_9833.mzData 
## method:  bin 
## step:  0.1

## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex2-cyp79-48h-Ag-1_1-B,3_01_9828.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex2-cyp79-48h-Ag-2_1-B,4_01_9830.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex2-cyp79-48h-Ag-3_1-B,3_01_9832.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex2-cyp79-48h-Ag-4_1-B,4_01_9834.mzData 
## method:  bin 
## step:  0.1

## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex1-cyp79-48h-Ag-1_1-B,1_01_9819.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex1-cyp79-48h-Ag-2_1-B,2_01_9821.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex1-cyp79-48h-Ag-3_1-B,1_01_9823.mzData 
## method:  bin 
## step:  0.1 
## /tmp/RtmpHaKMMs/Rinst2c3c4c910e53/mtbls2/mzData/MSpos-Ex1-cyp79-48h-Ag-4_1-B,2_01_9825.mzData 
## method:  bin 
## step:  0.1

##
## Now do the normal CAMERA workflow:
##
an <- xsAnnotate(mtbls2Set,
                 sample=seq(1,length(sampnames(mtbls2Set))),
                 nSlaves=nSlaves)

## Loading required package: Rmpi

an <- groupFWHM(an)
an <- findIsotopes(an)  # optional but recommended.
an <- groupCorr(an,
                graphMethod="lpc",
                calcIso = TRUE,
                calcCiS = TRUE,
                calcCaS = TRUE,
                cor_eic_th=0.5)

## Setup ruleSet
rs <- new("ruleSet")

rs@ionlistfile <- file.path(find.package("mtbls2"), "lists","ions.csv")
rs@neutraladditionfile <- file.path(find.package("mtbls2"), "lists","neutraladdition.csv")
rs@neutrallossfile <- file.path(find.package("mtbls2"), "lists","neutralloss.csv")

rs <- readLists(rs)
rs <- setDefaultParams(rs)
rs <- generateRules(rs)

an <- findAdducts(an,
                  rules=rs@rules,
                  polarity="positive")

Diffreport

dr <- diffreport(mtbls2Set, sortpval=FALSE, filebase="mtbls2diffreport", eicmax=20 )

## Provided scanrange was adjusted to 1 - 3562

## Create profile matrix with method 'bin' and step 0.1 ... OK
## No need to perform retention time correction, raw and corrected rt are identical for MSpos-Ex1-Col0-48h-Ag-1_1-A,1_01_9818.mzData.
## Create profile matrix with method 'bin' and step 0.1 ... OK
## No need to perform retention time correction, raw and corrected rt are identical for MSpos-Ex1-Col0-48h-Ag-2_1-A,1_01_9820.mzData.
## Provided scanrange was adjusted to 1 - 3564
## Create profile matrix with method 'bin' and step 0.1 ... OK
## No need to perform retention time correction, raw and corrected rt are identical for MSpos-Ex1-Col0-48h-Ag-3_1-A,1_01_9822.mzData.
## Provided scanrange was adjusted to 1 - 3566
## Create profile matrix with method 'bin' and step 0.1 ... OK
## No need to perform retention time correction, raw and corrected rt are identical for MSpos-Ex1-Col0-48h-Ag-4_1-A,1_01_9824.mzData.
## Provided scanrange was adjusted to 1 - 3565
## Create profile matrix with method 'bin' and step 0.1 ... OK
## No need to perform retention time correction, raw and corrected rt are identical for MSpos-Ex1-cyp79-48h-Ag-1_1-B,1_01_9819.mzData.
## Provided scanrange was adjusted to 1 - 3566
## Create profile matrix with method 'bin' and step 0.1 ... OK
## No need to perform retention time correction, raw and corrected rt are identical for MSpos-Ex1-cyp79-48h-Ag-2_1-B,2_01_9821.mzData.
## Provided scanrange was adjusted to 1 - 3567
## Create profile matrix with method 'bin' and step 0.1 ... OK
## No need to perform retention time correction, raw and corrected rt are identical for MSpos-Ex1-cyp79-48h-Ag-3_1-B,1_01_9823.mzData.
## Provided scanrange was adjusted to 1 - 3568
## Create profile matrix with method 'bin' and step 0.1 ... OK
## No need to perform retention time correction, raw and corrected rt are identical for MSpos-Ex1-cyp79-48h-Ag-4_1-B,2_01_9825.mzData.

## Warning in dir.create(eicdir): 'mtbls2diffreport_eic' already exists

## Warning in dir.create(boxdir): 'mtbls2diffreport_box' already exists

cspl <- getPeaklist(an)

annotatedDiffreport <- cbind(dr, cspl)

Combine diffreport and CAMERA spectra

interestingPspec <- tapply(seq(1, nrow(annotatedDiffreport)),
                               INDEX=annotatedDiffreport[,"pcgroup"],
                               FUN=function(x, a) {m <- median(annotatedDiffreport[x, "pvalue"]);
                                                   p <- max(annotatedDiffreport[x, "pcgroup"]);
                                                   as.numeric(c(pvalue=m,pcgroup=p))},
                               annotatedDiffreport)

interestingPspec <- do.call(rbind, interestingPspec)
colnames(interestingPspec) <- c("pvalue", "pcgroup") 

o <- order(interestingPspec[,"pvalue"])

pdf("interestingPspec.pdf")
dummy <- lapply(interestingPspec[o[1:40], "pcgroup"],
                function(x) {suppressWarnings(plotPsSpectrum(an, pspec=x, maxlabel=5))})
dev.off()

## png 
##   2

R, ISAtab, xcms and CAMERA revisited

These attributes and the intensity matrix could already be exported to conform to the specification for the ``metabolite assignment file’’ in the mzTab format used in MetaboLights. Currently, this functionality is not included in xcms. A prototype snippet is the following:

pl <- annotatedDiffreport 

charge <- sapply(an@isotopes, function(x) {
  ifelse( length(x) > 0, x$charge, NA) 
})
abundance <- groupval(an@xcmsSet, value="into")


##
## load ISA assay files
## 

a.samples <- ISAmtbls2["samples.per.assay.filename"][[ a.filename ]]

##
## These columns are defined by mzTab
##

maf.std.colnames <- c("identifier", "chemical_formula", "description",
"mass_to_charge", "fragmentation", "charge", "retention_time",
"taxid", "species", "database", "database_version", "reliability",
"uri", "search_engine", "search_engine_score", "modifications",
"smallmolecule_abundance_sub", "smallmolecule_abundance_stdev_sub",
"smallmolecule_abundance_std_error_sub")

##
## Plus the columns for the sample intensities
##
all.colnames <- c(maf.std.colnames, a.samples)

##
## Now assemble new maf
##

l <- nrow(pl)

maf <- data.frame(identifier = character(l), 
                  chemical_formula = character(l), 
                  description = character(l), 
                  mass_to_charge = pl$mz, 
                  fragmentation = character(l), 
                  charge = charge, 
                  retention_time = pl$rt, 
                  taxid = character(l), 
                  species = character(l), 
                  database = character(l), 
                  database_version = character(l), 
                  reliability = character(l), 
                  uri = character(l), 
                  search_engine = character(l), 
                  search_engine_score = character(l),
                  modifications = character(l), 
                  smallmolecule_abundance_sub = character(l),
                  smallmolecule_abundance_stdev_sub = character(l), 
                  smallmolecule_abundance_std_error_sub = character(l),
                  abundance, stringsAsFactors=FALSE)

##
## Make sure maf table is quoted properly, 
## and add to the ISAmtbls2 assay file.
## 
maf_character <- apply(maf, 2, as.character)

write.table(maf_character, 
            file=paste(tempdir(), "/a_mtbl2_metabolite profiling_mass spectrometry_maf.csv", sep=""),
            row.names=FALSE, col.names=all.colnames, 
            quote=TRUE, sep="\t", na="\"\"")

ISAmtbls2 <- updateAssayMetadata(ISAmtbls2, a.filename,
             "Metabolite Assignment File",
         paste(tempdir(), "/a_mtbl2_metabolite profiling_mass spectrometry_maf.csv", sep=""))

write.assay.file(ISAmtbls2, a.filename)

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] Rmpi_0.6-9          pcaMethods_1.78.0   CAMERA_1.42.0      
##  [4] xcms_3.8.0          MSnbase_2.12.0      ProtGenerics_1.18.0
##  [7] S4Vectors_0.24.0    mzR_2.20.0          BiocParallel_1.20.0
## [10] Risa_1.28.0         affy_1.64.0         biocViews_1.54.0   
## [13] Rcpp_1.0.2          Biobase_2.46.0      BiocGenerics_0.32.0
## 
## loaded via a namespace (and not attached):
##  [1] vsn_3.54.0             splines_3.6.1          foreach_1.4.7         
##  [4] Formula_1.2-3          assertthat_0.2.1       BiocManager_1.30.9    
##  [7] latticeExtra_0.6-28    RBGL_1.62.1            yaml_2.2.0            
## [10] robustbase_0.93-5      impute_1.60.0          backports_1.1.5       
## [13] pillar_1.4.2           lattice_0.20-38        glue_1.3.1            
## [16] limma_3.42.0           RUnit_0.4.32           digest_0.6.22         
## [19] checkmate_1.9.4        RColorBrewer_1.1-2     colorspace_1.4-1      
## [22] htmltools_0.4.0        preprocessCore_1.48.0  Matrix_1.2-17         
## [25] plyr_1.8.4             MALDIquant_1.19.3      XML_3.98-1.20         
## [28] pkgconfig_2.0.3        zlibbioc_1.32.0        purrr_0.3.3           
## [31] scales_1.0.0           RANN_2.6.1             affyio_1.56.0         
## [34] htmlTable_1.13.2       tibble_2.1.3           IRanges_2.20.0        
## [37] ggplot2_3.2.1          nnet_7.3-12            lazyeval_0.2.2        
## [40] MassSpecWavelet_1.52.0 survival_2.44-1.1      magrittr_1.5          
## [43] crayon_1.3.4           evaluate_0.14          ncdf4_1.17            
## [46] doParallel_1.0.15      MASS_7.3-51.4          foreign_0.8-72        
## [49] graph_1.64.0           data.table_1.12.6      tools_3.6.1           
## [52] stringr_1.4.0          munsell_0.5.0          cluster_2.1.0         
## [55] compiler_3.6.1         mzID_1.24.0            rlang_0.4.1           
## [58] grid_3.6.1             RCurl_1.95-4.12        rstudioapi_0.10       
## [61] iterators_1.0.12       htmlwidgets_1.5.1      igraph_1.2.4.1        
## [64] bitops_1.0-6           base64enc_0.1-3        rmarkdown_1.16        
## [67] gtable_0.3.0           codetools_0.2-16       multtest_2.42.0       
## [70] R6_2.4.0               gridExtra_2.3          knitr_1.25            
## [73] dplyr_0.8.3            Hmisc_4.2-0            stringi_1.4.3         
## [76] rpart_4.1-15           acepack_1.4.1          DEoptimR_1.0-8        
## [79] tidyselect_0.2.5       xfun_0.10