xcms 3.0.2
xcms
This document describes new functionality and changes to existing functionality
in the xcms
package introduced during the update to version 3.
library(xcms)
library(RColorBrewer)
register(SerialParam())
The modernization of the user interface comprises new classes for data
representation and new data analysis methods. In addition, the core logic for
the data processing has been extracted from the old methods and put into a set
of R functions, the so called core API functions (or do_
functions). These
functions take standard R data structures as input and return standard R data
types as result and can hence be easily included in other R packages.
The new user interface aims at simplifying and streamlining the xcms
workflow
while guaranteeing data integrity and performance also for large scale
metabolomics experiments. Importantly, a simplified access to the original raw
data should be provided throughout the whole metabolomics data analysis workflow.
The new interface re-uses objects from the MSnbase
Bioconductor package, such as
the OnDiskMSnExp
object. This object is specifically designed for large scale MS
experiments as it initially reads just the scan header information from the mzML
while the mz-intensity value pairs from all or from selected spectra of a file
are read on demand hence minimizing the memory demand. Also, in contrast to
the old xcmsRaw
object, the OnDiskMSnExp
contains information from all files of
an experiment. In addition, all data normalization and adjustment methods
implemented in the MSnbase
package can be directly applied to the MS data
without the need to re-implement such methods in xcms
. Results from xcms
preprocessings, such as chromatographic peak detection or correspondence are
stored into the new XCMSnExp
object. This object extends the OnDiskMSnExp
object
and inherits thus all of its methods including raw data access.
Class and method/function names follow also a new naming convention trying tp
avoid the partially confusing nomenclature of the original xcms
methods (such as
the group
method to perform the correspondence of peaks across samples). To
distinguish them from mass peaks, the peaks identified by the peak detection in
an LS/GC-MS experiment are referred to as chromatographic peaks. The respective
method to identify such peaks is hence called findChromPeaks
and the identified
peaks can be accessed using the XCMSnExp
chromPeaks
method. The results from an
correspondence analysis which aims to match and group chromatographic peaks
within and between samples are called features. A feature corresponds to
individual ions with a unique mass-to-charge ratio (mz) and a unique retention
time (rt). The definition of such mz-rt features (i.e. the result from the
groupChromPeaks
method) can be accessed via the featureDefinitions
method of
the XCMSnExp
class. Finally, alignment (retention time correction) can be
performed using the adjustRtime
method.
The settings for any of the new analysis methods are bundled in parameter
classes, one class for each method. This encapsulation of the parameters to a
function into a parameter class (such as CentWaveParam
) avoids busy function
calls (with many single parameters) and enables saving, reloading and reusing
the settings. In addition, the parameter classes are added, along with other
information to the process history of an XCMSnExp
object thus providing a
detailed documentation of each processing step of an analysis, with the
possibility to recall all settings of the performed analyses at any stage. In
addition, validation of the parameters can be performed within the parameter
object and hence is no longer required in the analysis function.
Peaks identified in LC/GC-MS metabolomics are referred to as chromatographic peaks where possible to avoid any misconceptions with mass peaks identified in mz dimension.
Methods for data analysis from the original xcms
code have been renamed to avoid
potential confusions:
Chromatographic peak detection: findChromPeaks
instead of findPeaks
: for new
functions and methods the term peak is avoided as much as possible, as it is
usually used to describe a mass peak in mz dimension. To clearly distinguish
between these peaks and peaks in retention time space, the latter are referred
to as chromatographic peak, or chromPeak
.
Correspondence: groupChromPeaks
instead of group
to clearly indicate what is
being grouped. Group might be a sample group or a peak group, the latter being
referred to also by (mz-rt) feature.
Alignment: adjustRtime
instead of retcor
for retention time correction. The
word cor in retcor might be easily misinterpreted as correlation instead of
correction.
OnDiskMSnExp
This object is defined and documented in the MSnbase
package. In brief, it is a
container for the full raw data from an MS-based experiment. To keep the memory
footprint low the mz and intensity values are only loaded from the raw data
files when required. The OnDiskMSnExp
object replaces the xcmsRaw
object.
XCMSnExp
The XCMSnExp
class extends the OnDiskMSnExp
object from the MSnbase
package and
represents a container for the xcms-based preprocessing results while (since it
inherits all functionality from its parent class) keeping a direct relation to
the (raw) data on which the processing was performed. An additional slot
.processHistory
in the object allows to keep track of all performed processing
steps. Each analysis method, such as findChromPeaks
adds an XProcessHistory
object which includes also the parameter class passed to the analysis
method. Hence not only the time and type of the analysis, but its exact settings
are reported within the XCMSnExp
object. The XCMSnExp
is thus equivalent to the
xcmsSet
from the original xcms
implementation, but keeps in addition a link to
the raw data on which the preprocessing was performed.
Chromatogram
The Chromatogram
class (available in the MSnbase
package since version 2.3.8)
allows a data representation that is orthogonal to the Spectrum
class (also
defined in MSnbase
). The Chromatogram
class stores retention time and intensity
duplets and is designed to accommodate most use cases, from total ion
chromatogram, base peak chromatogram to extracted ion chromatogram and SRM/MRM
ion traces.
Chromatogram
objects can be extracted from XCMSnExp
(and MSnExp
and
OnDiskMSnExp
) objects using the chromatogram
method.
Note that this class is still considered developmental and might thus undergo some changes in the future.
The binning/profile matrix generation functions have been completely
rewritten. The new binYonX
function replaces the binning of intensity values
into bins defined by their m/z values implemented in the profBin
, profBinLin
and
profBinLinBase
methods. The binYonX
function provides also additional functionality:
Breaks for the bins can be defined based on either the number of desired bins
(nBins
) or the size of a bin (binSize
). In addition it is possible to provide
a vector with pre-defined breaks. This allows to bin data from multiple files
or scans on the same bin-definition.
The function returns a list with element y
containing the binned values and
element x
the bin mid-points.
Values in input vector y
can be aggregated within each bin with different
methods: max
, min
, sum
and mean
.
The index of the largest (or smallest for method
being “min”) within each bin
can be returned by setting argument returnIndex
to TRUE
.
Binning can be performed on single or multiple sub-sets of the input vectors
using the fromIdx
and toIdx
arguments. This replaces the M methods (such as
profBinM
). These sub-sets can be overlapping.
The missing value imputation logic inherently build into the profBinLin
and
profBinLinBase
methods has been implemented in the imputeLinInterpol
function.
The example below illustrates the binning and imputation with the binYtoX
and
imputeLinInterpol
functions. After binning of the test vectors below some of the
bins have missing values, for which we impute a value using
imputeLinInterpol
. By default, binYonX
selects the largest value within each
bin, but other aggregation methods are also available (i.e. min, max, mean,
sum).
## Defining the variables:
set.seed(123)
X <- sort(abs(rnorm(30, mean = 20, sd = 25))) ## 10
Y <- abs(rnorm(30, mean = 50, sd = 30))
## Bin the values in Y into 20 bins defined on X
res <- binYonX(X, Y, nBins = 22)
res
## $x
## [1] 3.207154 6.066022 8.924891 11.783759 14.642628 17.501497 20.360365
## [8] 23.219234 26.078102 28.936971 31.795840 34.654708 37.513577 40.372445
## [15] 43.231314 46.090183 48.949051 51.807920 54.666788 57.525657 60.384526
## [22] 63.243394
##
## $y
## [1] 76.85377 76.34400 48.14265 29.15879 43.76248 NA 115.06868
## [8] 86.23886 NA 73.39895 49.14360 NA 91.05807 43.22687
## [15] NA NA NA 95.49412 NA NA 67.53841
## [22] 56.47825
As a result we get a list
with the bin mid-points ($x
) and the binned y
values
($y
).
Next we use two different imputation approaches, a simple linear interpolation
and the linear imputation approach that was defined in the profBinLinBase
method. The latter performs linear interpolation only considering a certain
neighborhood of missing values otherwise replacing the NA
with a base value.
## Plot the actual data values.
plot(X, Y, pch = 16, ylim = c(0, max(Y)))
## Visualizing the bins
abline(v = breaks_on_nBins(min(X), max(X), nBins = 22), col = "grey")
## Define colors:
point_colors <- paste0(brewer.pal(4, "Set1"), 80)
## Plot the binned values.
points(x = res$x, y = res$y, col = point_colors[1], pch = 15)
## Perform the linear imputation.
res_lin <- imputeLinInterpol(res$y)
points(x = res$x, y = res_lin, col = point_colors[2], type = "b")
## Perform the linear imputation "linbase"
res_linbase <- imputeLinInterpol(res$y, method = "linbase")
points(x = res$x, y = res_linbase, col = point_colors[3], type = "b", lty = 2)
The difference between the linear interpolation method lin
and linbase
is that
the latter only performs the linear interpolation in a pre-defined neighborhood
of the bin with the missing value (1
by default). The other missing values are
set to a base value corresponding to half of the smallest bin value. Both
methods thus yield same results, except for bins 15-17 (see Figure above).
The core logic from the chromatographic peak detection methods
findPeaks.centWave
, findPeaks.massifquant
, findPeaks.matchedFilter
and
findPeaks.MSW
and from all alignment (group.*
) and correspondence (retcor.*
)
methods has been extracted and put into functions with the common prefix
do_findChromPeaks
, do_adjustRtime
and do_groupChromPeaks
, respectively, with the
aim, as detailed in issue #30, to separate the core logic from the analysis
methods invoked by the users to enable also the use these methods using base R
parameters (i.e. without specific classes containing the data such as the
xcmsRaw
class). This simplifies also the re-use of these functions in other
packages and simplifies the future implementation of the peak detection
algorithms for e.g. the MSnExp
or OnDiskMSnExp
objects from the MSnbase
Bioconductor package. The implemented functions are:
do_findChromPeaks_centWave
: peak density and wavelet based peak detection
for high resolution LC/MS data in centroid mode [1].do_findChromPeaks_matchedFilter
: identification of peak in the
chromatographic domain based on matched filtration [2].do_findChromPeaks_massifquant
: identification of peaks using Kalman
filters.do_findChromPeaks_MSW
: single spectrum, non-chromatographic peak detection.do_adjustRtime_peakGroups
: perform sample alignment (retention time
correction) using alignment of well behaved chromatographic peaks that are
present in most samples (and are expected to have the same retention time).do_groupChromPeaks_density
: perform chromatographic peak grouping (within
and across samples) based on the density distribution of peaks along the
retention time axis.do_groupChromPeaks_nearest
: groups peaks across samples similar to the
method implemented in mzMine.do_groupChromPeaks_mzClust
: performs high resolution correspondence on
single spectra samples.One possible drawback from the introduction of this new layer is, that more objects get copied by R which could eventually result in a larger memory demand or performance decrease (while no such was decrease was observed up to now).
[
subsetting method for xcmsRaw
objects that enables to subset an xcmsRaw
object to specific scans/spectra.profMat
method to extract the profile matrix from the xcmsRaw
object. This
method should be used instead of directly accessing the @env$profile
slot, as
it will create the profile matrix on the fly if it was not pre-calculated (or
if profile matrix generation settings have been changed).profBinLin
).From xcms
version 1.51.1 on the new binning functions are used, thus, the bug
described here are fixed.
Two bugs are present in the profBinLin
method (reported as issues #46 and #49 on
github) which are fixed in the new binYonX
and imputeLinInterpol
functions:
profBinLin
can be wrong (i.e. not being the
max value within that bin, but the first).The profBinLin
method is used in findPeaks.matchedFilter
if the profile
method is set to “binlin”.
The example below illustrates both differences.
## Define a vector with empty values at the end.
X <- 1:11
set.seed(123)
Y <- sort(rnorm(11, mean = 20, sd = 10))
Y[9:11] <- NA
nas <- is.na(Y)
## Do interpolation with profBinLin:
resX <- xcms:::profBinLin(X[!nas], Y[!nas], 5, xstart = min(X),
xend = max(X))
## Warning: Use of 'profBinLin' is deprecated! Use 'binYonX' with
## 'imputeLinInterpol' instead.
resX
## [1] 7.349388 15.543380 21.292877 0.000000 0.000000
res <- binYonX(X, Y, nBins = 5L, shiftByHalfBinSize = TRUE)
resM <- imputeLinInterpol(res$y, method = "lin",
noInterpolAtEnds = TRUE)
resM
## [1] 13.13147 15.54338 21.29288 24.60916 0.00000
Plotting the results helps to better compare the differences. The black points
in the figure below represent the actual values of Y
and the grey vertical lines
the breaks defining the bins. The blue lines and points represent the result
from the profBinLin
method. The bin values for the first and 4th bin are clearly
wrong. The green colored points and lines represent the results from the binYonX
and imputeLinInterpol
functions (showing the correct binning and interpolation).
plot(x = X, y = Y, pch = 16, ylim = c(0, max(Y, na.rm = TRUE)),
xlim = c(0, 12))
## Plot the breaks
abline(v = breaks_on_nBins(min(X), max(X), 5L, TRUE), col = "grey")
## Result from profBinLin:
points(x = res$x, y = resX, col = "blue", type = "b")
## Results from imputeLinInterpol
points(x = res$x, y = resM, col = "green", type = "b",
pch = 4, lty = 2)