1 Introduction

The yamss (yet another mass spectrometry software) package provides tools to preprocess raw mass spectral data files arising from metabolomics experiments with the primary goal of providing high-quality differential analysis. Currently, yamss implements a preprocessing method “bakedpi”, which stands for bivariate approximate kernel density estimation for peak identification.

Alternatives to this package include xcms (Smith et al. 2006) (available on Bioconductor) and MZMine 2 (Pluskal et al. 2010). These packages also provide preprocessing functionality but focus more on feature detection and alignment in individual samples. The input data to yamss are standard metabolomics mass spectral files which can be read by mzR.

1.1 bakedpi

The “bakedpi” algorithm is a new preprocessing algorithm for untargeted metabolomics data (Myint et al. 2017). The output of “bakedpi” is essentially a table with dimension peaks (adducts) by samples, containing quantified intensity measurements representing the abundance of metabolites. This table, which is very similar to output in gene expression analysis, can directly be used in standard statistical packages, such as limma, to identify differentially abundant metabolites between conditions. It is critical that all samples which are analyzed together, are processed together through bakedpi.

1.2 Citing yamss

Please cite our paper when using yamss (Myint et al. 2017).

1.3 Dependencies

This document has the following dependencies

library(yamss)
library(mtbls2)

2 Processing a metabolomics dataset

We will be looking at data provided in the mtbls2 data package. In this experiment, investigators exposed wild-type and mutant Arabidopsis thaliana leaves to silver nitrate and performed global LC/MS profiling. The experiment was repeated twice, but we will only be looking at the first replicate. There are 4 wild-type and 4 mutant plants in this experiment.

filepath <- file.path(find.package("mtbls2"), "mzData")
files <- list.files(filepath, pattern = "MSpos-Ex1", recursive = TRUE, full.names = TRUE)
classes <- rep(c("wild-type", "mutant"), each = 4)

The first step is to read the raw mass spec data into an R representation using readMSdata():

colData <- DataFrame(sampClasses = classes, ionmode = "pos")
cmsRaw <- readMSdata(files = files, colData = colData, mzsubset = c(500,520), verbose = TRUE)
## [readRaw]: Reading 8 files
cmsRaw
## class: CMSraw
## Representing 8 data files
## Number of scans: 3389
## M/Z: 500.000120 - 519.998100

The output of readMSdata() is an object of class CMSraw representating raw (unprocessed) data. We use the colData argument to store phenotype information about the different samples.

The next step is to use bakedpi() to preprocess these samples. This function takes a while to run, so we only run it on a small slice of M/Z values, using the mzsubset argument. This is only done for speed.

cmsProc <- bakedpi(cmsRaw, dbandwidth = c(0.01, 10), dgridstep = c(0.01, 1),
                   outfileDens = NULL, dortalign = TRUE, mzsubset = c(500, 520), verbose = TRUE)
## [bakedpi] Background correction
## [bakedpi] Retention time alignment
## [bakedpi] Density estimation
cmsProc
## class: CMSproc
## Representing 8 data files
## Number of scans: 3389
## M/Z: 500.000120 - 519.998100

The dbandwidth and dgridstep arguments influence the bivariate kernel density estimation which forms the core of bakedpi. dgridstep is a vector of length 2 that specifies the spacing of the grid upon which the density is estimated. The first component specifies the spacing in the M/Z direction, and the second component specifies the spacing in the scan (retention time) direction. To showcase a fast example, we have specified the M/Z and scan spacings to be 0.01 and 1 respectively, but we recommend keeping the defaults of 0.005 and 1 because this will more accurately define the M/Z and scan bounds of the detected peaks. dbandwidth is a vector of length 2 that specifies the kernel density bandwidth in the M/Z and scan directions in the first and second components respectively. Note that dbandwidth[1] should be greater than or equal to dgridstep[1] and dbandwidth[2] should be greater than or equal to dgridstep[2]. Because a binning strategy is used to speed up computation of the density estimate, this is to ensure that data points falling into the same grid location all have the same distances associated with them.

For experiments with a wide range of M/Z values or several thousands of scans, the computation of the density estimate can be time-intensive. For this reason, it can be useful to save the density estimate in an external file specified by the outfileDens argument. If outfileDens is set to NULL, then the density estimate is not saved and must be recomputed if we wish to process the data again. Specifying the filename of the saved density estimate in outfileDens when rerunning bakedpi() skips the density estimation phase which can save a lot of time.

The resulting object is an instance of class CMSproc which contains the bivariate kernel density estimate as well as some useful preprocessing metadata. In order to obtain peak bounds and quantifications, the last step is to run slicepi(), which computes a global threshold for the density, uses this threshold to call peak bounds, and quantifies the peaks. If the cutoff argument is supplied as a number, then slicepi() will use this as the density threshold. Otherwise if cutoff is left as the default NULL, a data-driven threshold will be identified.

cmsSlice <- slicepi(cmsProc, cutoff = NULL, verbose = TRUE)
## [slicepi] Computing cutoff
## [slicepi] Computing peak bounds
## [slicepi] Quantifying peaks
cmsSlice
## An object of class 'CMSslice'
## Representing 8 data files
## Number of scans: 3389
## M/Z: 500.000120 - 519.998100
## Number of peaks: 72

The output of slicepi() is an instance of class CMSslice and contains the peak bounds and quantifications as well as sample and preprocessing metadata.

3 Differential Analysis

We can access the differential analysis report with diffrep(). This is a convenience function, similar to diffreport() from the xcms package. In our case it uses limma to do differential analysis; the output of diffrep() is basically topTable() from limma.

dr <- diffrep(cmsSlice, classes = classes)
head(dr)
##        logFC  AveExpr         t      P.Value    adj.P.Val        B
## 25  3.474403 15.32845  70.51488 9.110811e-12 3.650248e-10 17.95487
## 54 -2.156827 17.52813 -69.50470 1.013958e-11 3.650248e-10 17.84146
## 21  1.209102 17.87641  44.72779 2.655826e-10 4.527970e-09 14.29968
## 55 -2.004350 15.66532 -43.91037 3.044188e-10 4.527970e-09 14.14912
## 4  -1.137775 16.23114 -43.71853 3.144424e-10 4.527970e-09 14.11336
## 47  1.252761 17.88478  35.34145 1.515174e-09 1.818209e-08 12.36901

Let’s look at the peak bound information for the peaks with the strongest differential signal. We can store the IDs for the top 10 peaks with

topPeaks <- as.numeric(rownames(dr)[1:10])
topPeaks
##  [1] 25 54 21 55  4 47 46 16 59 17

We can access the peak bound information with peakBounds() and select the rows corresponding to topPeaks.

bounds <- peakBounds(cmsSlice)
idx <- match(topPeaks, bounds[,"peaknum"])
bounds[idx,]
## DataFrame with 10 rows and 5 columns
##        mzmin     mzmax   scanmin   scanmax   peaknum
##    <numeric> <numeric> <numeric> <numeric> <numeric>
## 1     501.04    501.07       795       822        25
## 2     501.25    501.29      2095      2138        54
## 3     516.08    516.12       703       748        21
## 4     502.26    502.29      2104      2126        55
## 5     508.19    508.21       207       233         4
## 6     516.05    516.09      1586      1636        47
## 7     500.09    500.12      1583      1628        46
## 8      501.6    501.63       666       710        16
## 9      513.3    513.31      2813      2831        59
## 10    515.61    515.63       668       686        17

4 Information contained in a CMSproc object

CMSproc objects contain information useful in exploring your data.

4.1 Density estimate

The bivariate kernel density estimate matrix can be accessed with densityEstimate().

dmat <- densityEstimate(cmsProc)

We can make a plot of the density estimate in a particular M/Z and scan region with plotDensityRegion().

mzrange <- c(bounds[idx[1], "mzmin"], bounds[idx[1], "mzmax"])
scanrange <- c(bounds[idx[1], "scanmin"], bounds[idx[1], "scanmax"])
plotDensityRegion(cmsProc, mzrange = mzrange + c(-0.5,0.5), scanrange = scanrange + c(-30,30))

Peaks are called by thresholding the density estimate. If we wish to investigate the impact of varying this cutoff, we can use densityCutoff to obtain the original cutoff and updatePeaks() to re-call peaks and quantify them. Quantiles of the non-zero density values are also available via densityQuantiles().

cmsSlice2 <- slicepi(cmsProc, densityCutoff(cmsSlice)*0.99)
## [slicepi] Computing peak bounds
## [slicepi] Quantifying peaks
dqs <- densityQuantiles(cmsProc)
head(dqs)
##         0.0%         0.1%         0.2%         0.3%         0.4% 
## 2.270341e-22 8.066129e-20 1.616783e-19 2.473895e-19 3.376696e-19 
##         0.5% 
## 4.398691e-19
cmsSlice3 <- slicepi(cmsProc, dqs["98.5%"])
## [slicepi] Computing peak bounds
## [slicepi] Quantifying peaks
nrow(peakBounds(cmsSlice)) # Number of peaks detected - original
## [1] 72
nrow(peakBounds(cmsSlice2)) # Number of peaks detected - updated
## [1] 72
nrow(peakBounds(cmsSlice3)) # Number of peaks detected - updated
## [1] 206

5 Sessioninfo

R version 3.5.0 (2018-04-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.4 LTS

Matrix products: default BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] mtbls2_1.9.1 yamss_1.6.0
[3] SummarizedExperiment_1.10.0 DelayedArray_0.6.0
[5] BiocParallel_1.14.0 matrixStats_0.53.1
[7] Biobase_2.40.0 GenomicRanges_1.32.0
[9] GenomeInfoDb_1.16.0 IRanges_2.14.0
[11] S4Vectors_0.18.0 BiocGenerics_0.26.0
[13] BiocStyle_2.8.0

loaded via a namespace (and not attached): [1] Rcpp_0.12.16 compiler_3.5.0 XVector_0.20.0
[4] bitops_1.0-6 ProtGenerics_1.12.0 tools_3.5.0
[7] zlibbioc_1.26.0 digest_0.6.15 evaluate_0.10.1
[10] lattice_0.20-35 png_0.1-7 Matrix_1.2-14
[13] yaml_2.1.18 xfun_0.1 GenomeInfoDbData_1.1.0 [16] stringr_1.3.0 knitr_1.20 htmlwidgets_1.2
[19] fftwtools_0.9-8 locfit_1.5-9.1 rprojroot_1.3-2
[22] grid_3.5.0 data.table_1.10.4-3 jpeg_0.1-8
[25] rmarkdown_1.9 bookdown_0.7 limma_3.36.0
[28] mzR_2.14.0 magrittr_1.5 backports_1.1.2
[31] codetools_0.2-15 htmltools_0.3.6 abind_1.4-5
[34] EBImage_4.22.0 tiff_0.1-5 stringi_1.1.7
[37] RCurl_1.95-4.10

References

Myint, Leslie, Andre Kleensang, Liang Zhao, Thomas Hartung, and Kasper D Hansen. 2017. “Joint Bounding of Peaks Across Samples Improves Differential Analysis in Mass Spectrometry-Based Metabolomics.” Analytical Chemistry 89 (6):3517–23. https://doi.org/10.1021/acs.analchem.6b04719.

Pluskal, Tomás, Sandra Castillo, Alejandro Villar-Briones, and Matej Oresic. 2010. “MZmine 2: Modular Framework for Processing, Visualizing, and Analyzing Mass Spectrometry-Based Molecular Profile Data.” BMC Bioinformatics 11:395. https://doi.org/10.1186/1471-2105-11-395.

Smith, Colin A, Elizabeth J Want, Grace O’Maille, Ruben Abagyan, and Gary Siuzdak. 2006. “XCMS: Processing Mass Spectrometry Data for Metabolite Profiling Using Nonlinear Peak Alignment, Matching, and Identification.” Analytical Chemistry 78:779–87. https://doi.org/10.1021/ac051437y.