1 New functionality in xcms

This document describes new functionality and changes to existing functionality in the xcms package introduced during the update to version 3.


1.1 Modernized user interface

The modernization of the user interface comprises new classes for data representation and new data analysis methods. In addition, the core logic for the data processing has been extracted from the old methods and put into a set of R functions, the so called core API functions (or do_ functions). These functions take standard R data structures as input and return standard R data types as result and can hence be easily included in other R packages.

The new user interface aims at simplifying and streamlining the xcms workflow while guaranteeing data integrity and performance also for large scale metabolomics experiments. Importantly, a simplified access to the original raw data should be provided throughout the whole metabolomics data analysis workflow.

The new interface re-uses objects from the MSnbase Bioconductor package, such as the OnDiskMSnExp object. This object is specifically designed for large scale MS experiments as it initially reads just the scan header information from the mzML while the mz-intensity value pairs from all or from selected spectra of a file are read on demand hence minimizing the memory demand. Also, in contrast to the old xcmsRaw object, the OnDiskMSnExp contains information from all files of an experiment. In addition, all data normalization and adjustment methods implemented in the MSnbase package can be directly applied to the MS data without the need to re-implement such methods in xcms. Results from xcms preprocessings, such as chromatographic peak detection or correspondence are stored into the new XCMSnExp object. This object extends the OnDiskMSnExp object and inherits thus all of its methods including raw data access.

Class and method/function names follow also a new naming convention trying tp avoid the partially confusing nomenclature of the original xcms methods (such as the group method to perform the correspondence of peaks across samples). To distinguish them from mass peaks, the peaks identified by the peak detection in an LS/GC-MS experiment are referred to as chromatographic peaks. The respective method to identify such peaks is hence called findChromPeaks and the identified peaks can be accessed using the XCMSnExp chromPeaks method. The results from an correspondence analysis which aims to match and group chromatographic peaks within and between samples are called features. A feature corresponds to individual ions with a unique mass-to-charge ratio (mz) and a unique retention time (rt). The definition of such mz-rt features (i.e. the result from the groupChromPeaks method) can be accessed via the featureDefinitions method of the XCMSnExp class. Finally, alignment (retention time correction) can be performed using the adjustRtime method.

The settings for any of the new analysis methods are bundled in parameter classes, one class for each method. This encapsulation of the parameters to a function into a parameter class (such as CentWaveParam) avoids busy function calls (with many single parameters) and enables saving, reloading and reusing the settings. In addition, the parameter classes are added, along with other information to the process history of an XCMSnExp object thus providing a detailed documentation of each processing step of an analysis, with the possibility to recall all settings of the performed analyses at any stage. In addition, validation of the parameters can be performed within the parameter object and hence is no longer required in the analysis function.

1.2 New naming convention

Peaks identified in LC/GC-MS metabolomics are referred to as chromatographic peaks where possible to avoid any misconceptions with mass peaks identified in mz dimension.

Methods for data analysis from the original xcms code have been renamed to avoid potential confusions:

  • Chromatographic peak detection: findChromPeaks instead of findPeaks: for new functions and methods the term peak is avoided as much as possible, as it is usually used to describe a mass peak in mz dimension. To clearly distinguish between these peaks and peaks in retention time space, the latter are referred to as chromatographic peak, or chromPeak.

  • Correspondence: groupChromPeaks instead of group to clearly indicate what is being grouped. Group might be a sample group or a peak group, the latter being referred to also by (mz-rt) feature.

  • Alignment: adjustRtime instead of retcor for retention time correction. The word cor in retcor might be easily misinterpreted as correlation instead of correction.

1.3 New data classes

1.3.1 OnDiskMSnExp

This object is defined and documented in the MSnbase package. In brief, it is a container for the full raw data from an MS-based experiment. To keep the memory footprint low the mz and intensity values are only loaded from the raw data files when required. The OnDiskMSnExp object replaces the xcmsRaw object.

1.3.2 XCMSnExp

The XCMSnExp class extends the OnDiskMSnExp object from the MSnbase package and represents a container for the xcms-based preprocessing results while (since it inherits all functionality from its parent class) keeping a direct relation to the (raw) data on which the processing was performed. An additional slot .processHistory in the object allows to keep track of all performed processing steps. Each analysis method, such as findChromPeaks adds an XProcessHistory object which includes also the parameter class passed to the analysis method. Hence not only the time and type of the analysis, but its exact settings are reported within the XCMSnExp object. The XCMSnExp is thus equivalent to the xcmsSet from the original xcms implementation, but keeps in addition a link to the raw data on which the preprocessing was performed.

1.3.3 Chromatogram

The Chromatogram class (available in the MSnbase package since version 2.3.8) allows a data representation that is orthogonal to the Spectrum class (also defined in MSnbase). The Chromatogram class stores retention time and intensity duplets and is designed to accommodate most use cases, from total ion chromatogram, base peak chromatogram to extracted ion chromatogram and SRM/MRM ion traces.

Chromatogram objects can be extracted from XCMSnExp (and MSnExp and OnDiskMSnExp) objects using the chromatogram method.

Note that this class is still considered developmental and might thus undergo some changes in the future.

1.4 Binning and missing value imputation functions

The binning/profile matrix generation functions have been completely rewritten. The new binYonX function replaces the binning of intensity values into bins defined by their m/z values implemented in the profBin, profBinLin and profBinLinBase methods. The binYonX function provides also additional functionality:

  • Breaks for the bins can be defined based on either the number of desired bins (nBins) or the size of a bin (binSize). In addition it is possible to provide a vector with pre-defined breaks. This allows to bin data from multiple files or scans on the same bin-definition.

  • The function returns a list with element y containing the binned values and element x the bin mid-points.

  • Values in input vector y can be aggregated within each bin with different methods: max, min, sum and mean.

  • The index of the largest (or smallest for method being “min”) within each bin can be returned by setting argument returnIndex to TRUE.

  • Binning can be performed on single or multiple sub-sets of the input vectors using the fromIdx and toIdx arguments. This replaces the M methods (such as profBinM). These sub-sets can be overlapping.

The missing value imputation logic inherently build into the profBinLin and profBinLinBase methods has been implemented in the imputeLinInterpol function.

The example below illustrates the binning and imputation with the binYtoX and imputeLinInterpol functions. After binning of the test vectors below some of the bins have missing values, for which we impute a value using imputeLinInterpol. By default, binYonX selects the largest value within each bin, but other aggregation methods are also available (i.e. min, max, mean, sum).

## Defining the variables:
X <- sort(abs(rnorm(30, mean = 20, sd = 25))) ## 10
Y <- abs(rnorm(30, mean = 50, sd = 30))

## Bin the values in Y into 20 bins defined on X
res <- binYonX(X, Y, nBins = 22)

## $x
##  [1]  3.207154  6.066022  8.924891 11.783759 14.642628 17.501497 20.360365
##  [8] 23.219234 26.078102 28.936971 31.795840 34.654708 37.513577 40.372445
## [15] 43.231314 46.090183 48.949051 51.807920 54.666788 57.525657 60.384526
## [22] 63.243394
## $y
##  [1]  76.85377  76.34400  48.14265  29.15879  43.76248        NA 115.06868
##  [8]  86.23886        NA  73.39895  49.14360        NA  91.05807  43.22687
## [15]        NA        NA        NA  95.49412        NA        NA  67.53841
## [22]  56.47825

As a result we get a list with the bin mid-points ($x) and the binned y values ($y).

Next we use two different imputation approaches, a simple linear interpolation and the linear imputation approach that was defined in the profBinLinBase method. The latter performs linear interpolation only considering a certain neighborhood of missing values otherwise replacing the NA with a base value.

## Plot the actual data values.
plot(X, Y, pch = 16, ylim = c(0, max(Y)))
## Visualizing the bins
abline(v = breaks_on_nBins(min(X), max(X), nBins = 22), col = "grey")

## Define colors:
point_colors <- paste0(brewer.pal(4, "Set1"), 80)
## Plot the binned values.
points(x = res$x, y = res$y, col = point_colors[1], pch = 15)

## Perform the linear imputation.
res_lin <- imputeLinInterpol(res$y)

points(x = res$x, y = res_lin, col = point_colors[2], type = "b")

## Perform the linear imputation "linbase"
res_linbase <- imputeLinInterpol(res$y, method = "linbase")
points(x = res$x, y = res_linbase, col = point_colors[3], type = "b", lty = 2) 
Binning and missing value imputation results. Black points represent the input values, red the results from the binning and blue and green the results from the imputation (with method lin and linbase, respectively).

Figure 1: Binning and missing value imputation results
Black points represent the input values, red the results from the binning and blue and green the results from the imputation (with method lin and linbase, respectively).

The difference between the linear interpolation method lin and linbase is that the latter only performs the linear interpolation in a pre-defined neighborhood of the bin with the missing value (1 by default). The other missing values are set to a base value corresponding to half of the smallest bin value. Both methods thus yield same results, except for bins 15-17 (see Figure above).

1.5 Core functionality exposed via simple functions

The core logic from the chromatographic peak detection methods findPeaks.centWave, findPeaks.massifquant, findPeaks.matchedFilter and findPeaks.MSW and from all alignment (group.*) and correspondence (retcor.*) methods has been extracted and put into functions with the common prefix do_findChromPeaks, do_adjustRtime and do_groupChromPeaks, respectively, with the aim, as detailed in issue #30, to separate the core logic from the analysis methods invoked by the users to enable also the use these methods using base R parameters (i.e. without specific classes containing the data such as the xcmsRaw class). This simplifies also the re-use of these functions in other packages and simplifies the future implementation of the peak detection algorithms for e.g. the MSnExp or OnDiskMSnExp objects from the MSnbase Bioconductor package. The implemented functions are:

  • peak detection methods:
    • do_findChromPeaks_centWave: peak density and wavelet based peak detection for high resolution LC/MS data in centroid mode [1].
    • do_findChromPeaks_matchedFilter: identification of peak in the chromatographic domain based on matched filtration [2].
    • do_findChromPeaks_massifquant: identification of peaks using Kalman filters.
    • do_findChromPeaks_MSW: single spectrum, non-chromatographic peak detection.
  • alignment methods:
    • do_adjustRtime_peakGroups: perform sample alignment (retention time correction) using alignment of well behaved chromatographic peaks that are present in most samples (and are expected to have the same retention time).
  • correspondence methods:
    • do_groupChromPeaks_density: perform chromatographic peak grouping (within and across samples) based on the density distribution of peaks along the retention time axis.
    • do_groupChromPeaks_nearest: groups peaks across samples similar to the method implemented in mzMine.
    • do_groupChromPeaks_mzClust: performs high resolution correspondence on single spectra samples.

One possible drawback from the introduction of this new layer is, that more objects get copied by R which could eventually result in a larger memory demand or performance decrease (while no such was decrease was observed up to now).

1.6 Usability improvements in the old user interface

  • [ subsetting method for xcmsRaw objects that enables to subset an xcmsRaw object to specific scans/spectra.
  • profMat method to extract the profile matrix from the xcmsRaw object. This method should be used instead of directly accessing the @env$profile slot, as it will create the profile matrix on the fly if it was not pre-calculated (or if profile matrix generation settings have been changed).

2 Changes due to bug fixes and modified functionality

2.1 Differences in linear interpolation of missing values (profBinLin).

From xcms version 1.51.1 on the new binning functions are used, thus, the bug described here are fixed.

Two bugs are present in the profBinLin method (reported as issues #46 and #49 on github) which are fixed in the new binYonX and imputeLinInterpol functions:

  • The first bin value calculated by profBinLin can be wrong (i.e. not being the max value within that bin, but the first).
  • If the last bin contains also missing values, the method fails to determine a correct value for that bin.

The profBinLin method is used in findPeaks.matchedFilter if the profile method is set to “binlin”.

The example below illustrates both differences.

## Define a vector with empty values at the end.
X <- 1:11
Y <- sort(rnorm(11, mean = 20, sd = 10))
Y[9:11] <- NA
nas <- is.na(Y)
## Do interpolation with profBinLin:
resX <- xcms:::profBinLin(X[!nas], Y[!nas], 5, xstart = min(X),
                          xend = max(X))
## Warning: Use of 'profBinLin' is deprecated! Use 'binYonX' with
## 'imputeLinInterpol' instead.
## [1]  7.349388 15.543380 21.292877  0.000000  0.000000
res <- binYonX(X, Y, nBins = 5L, shiftByHalfBinSize = TRUE)
resM <- imputeLinInterpol(res$y, method = "lin",
                          noInterpolAtEnds = TRUE)
## [1] 13.13147 15.54338 21.29288 24.60916  0.00000

Plotting the results helps to better compare the differences. The black points in the figure below represent the actual values of Y and the grey vertical lines the breaks defining the bins. The blue lines and points represent the result from the profBinLin method. The bin values for the first and 4th bin are clearly wrong. The green colored points and lines represent the results from the binYonX and imputeLinInterpol functions (showing the correct binning and interpolation).

plot(x = X, y = Y, pch = 16, ylim = c(0, max(Y, na.rm = TRUE)),
     xlim = c(0, 12))
## Plot the breaks
abline(v = breaks_on_nBins(min(X), max(X), 5L, TRUE), col = "grey")
## Result from profBinLin:
points(x = res$x, y = resX, col = "blue", type = "b")
## Results from imputeLinInterpol
points(x = res$x, y = resM, col = "green", type = "b",
       pch = 4, lty = 2)