1 New functionality in xcms

This document describes new functionality and changes to existing functionality in the xcms package introduced during the update to version 3.


1.1 Modernized user interface

The modernized user interface comprises new classes for data representation and new data analysis methods. In addition, the core logic for the data processing has been extracted from the old methods and put into a set of R functions, the so-called core API functions (or do_ functions). These functions take standard R data structures as input and return standard R data types as results, so they can easily be used in other R packages.
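As a minimal sketch of how a core API function might be called directly, the code below runs centWave peak detection on plain numeric vectors. It assumes that the faahKO example data is installed and, purely for convenience, uses the old xcmsRaw object to obtain the m/z, intensity and scan time vectors. The noise value is an arbitrary choice to keep the run short, not a recommended setting.

```r
library(xcms)

## Assumption: the faahKO data package is installed; any single file works.
cdf_file <- dir(system.file("cdf", package = "faahKO"), recursive = TRUE,
                full.names = TRUE)[1]

## Read one file with the old interface only to get basic R vectors.
xr <- xcmsRaw(cdf_file, profstep = 0)
mz_vals <- xr@env$mz            # all m/z values, concatenated over spectra
int_vals <- xr@env$intensity    # matching intensity values
## Number of m/z-intensity pairs in each spectrum.
vals_per_spect <- diff(c(xr@scanindex, length(mz_vals)))

## Call the core API function on the plain vectors.
res <- do_findChromPeaks_centWave(mz = mz_vals, int = int_vals,
                                  scantime = xr@scantime,
                                  valsPerSpect = vals_per_spect,
                                  noise = 10000)
head(res)
```

The result is an ordinary numeric matrix with one row per detected peak, so no xcms-specific class is needed to work with it.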

The new user interface aims to simplify and streamline the xcms workflow while guaranteeing data integrity and performance, also for large scale metabolomics experiments. Importantly, it should provide simplified access to the original raw data throughout the whole metabolomics data analysis workflow.

The new interface re-uses objects from the MSnbase Bioconductor package, such as the OnDiskMSnExp object. This object is specifically designed for large scale MS experiments: it initially reads only the scan header information from the mzML files, while the mz-intensity value pairs from all or from selected spectra of a file are read on demand, which minimizes memory use. Also, in contrast to the old xcmsRaw object, the OnDiskMSnExp contains information from all files of an experiment. In addition, all data normalization and adjustment methods implemented in the MSnbase package can be applied directly to the MS data without the need to re-implement such methods in xcms. Results from xcms preprocessing steps, such as chromatographic peak detection or correspondence, are stored in the new XCMSnExp object. This object extends the OnDiskMSnExp object and thus inherits all of its methods, including raw data access.

Class and method/function names also follow a new naming convention that tries to avoid the partially confusing nomenclature of the original xcms methods (such as the group method to perform the correspondence of peaks across samples). To distinguish them from mass peaks, the peaks identified by the peak detection in an LC/GC-MS experiment are referred to as chromatographic peaks. The respective method to identify such peaks is hence called findChromPeaks and the identified peaks can be accessed using the XCMSnExp chromPeaks method. The results from a correspondence analysis, which aims to match and group chromatographic peaks within and between samples, are called features. The definition of such mz-rt features (i.e. the result from the groupChromPeaks method) can be accessed via the featureDefinitions method of the XCMSnExp class. Finally, alignment (retention time correction) can be performed using the adjustRtime method.
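The naming convention can be summarized in a short sketch that chains the methods named above. The helper run_preprocessing_sketch is hypothetical and only serves as an illustration; it assumes that raw_data is an OnDiskMSnExp object and that sample_groups is a vector with one group label per file. All parameter objects use default settings, which are not tuned for any particular data set, and the order of the steps is one possible choice.

```r
library(xcms)

## Hypothetical helper, for illustration only.
## raw_data: an OnDiskMSnExp object; sample_groups: one label per file.
run_preprocessing_sketch <- function(raw_data, sample_groups) {
    ## Chromatographic peak detection (centWave, default settings).
    xdata <- findChromPeaks(raw_data, param = CentWaveParam())
    ## Alignment, i.e. retention time correction (obiwarp, default settings).
    xdata <- adjustRtime(xdata, param = ObiwarpParam())
    ## Correspondence: group chromatographic peaks across samples.
    xdata <- groupChromPeaks(
        xdata, param = PeakDensityParam(sampleGroups = sample_groups))
    ## Return the result object together with the two main result tables.
    list(object = xdata,
         chrom_peaks = chromPeaks(xdata),
         features = featureDefinitions(xdata))
}
```

Each step returns an XCMSnExp object, so the intermediate results can also be inspected with chromPeaks or featureDefinitions at any point.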

The settings for each of the new analysis methods are bundled in parameter classes, one class per method. Encapsulating the parameters of a function in a parameter class (such as CentWaveParam) avoids busy function calls (with many single parameters) and enables saving, reloading and reusing the settings. The parameter classes are also added, along with other information, to the process history of an XCMSnExp object, thus providing a detailed documentation of each processing step of an analysis, with the possibility to recall all settings of the performed analyses at any stage. Finally, validation of the parameters can be performed within the parameter object and hence is no longer required in the analysis function.
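A minimal sketch of working with such a parameter object is shown below. The parameter values are illustrative only, and saving with saveRDS to a temporary file is just one possible way to store the settings. The last call assumes that a peakwidth of length one is rejected by the object's validation; if it is not, the call simply returns a parameter object.

```r
library(xcms)

## Bundle the centWave settings in one object (values are illustrative).
cwp <- CentWaveParam(ppm = 25, peakwidth = c(20, 50), snthresh = 10)

## Individual settings can be read back with accessor methods.
ppm(cwp)
peakwidth(cwp)

## Save the settings and reload them, e.g. for reuse in another session.
param_file <- tempfile(fileext = ".rds")
saveRDS(cwp, file = param_file)
cwp_restored <- readRDS(param_file)

## Validation happens in the parameter object: tryCatch returns the error
## message if the constructor rejects the value.
invalid <- tryCatch(CentWaveParam(peakwidth = 1),
                    error = function(e) conditionMessage(e))
invalid
```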

The example below illustrates the new user interface. First we load the raw data files from the faahKO package using the readMSData2 function from the MSnbase package.

## Reading the raw data using the MSnbase package
library(xcms)
library(MSnbase)
## Load 6 of the CDF files from the faahKO package
cdf_files <- dir(system.file("cdf", package = "faahKO"), recursive = TRUE,
                 full.names = TRUE)[c(1:3, 7:9)]

## Define the sample grouping.
s_groups <- rep("KO", length(cdf_files))
s_groups[grep(cdf_files, pattern = "WT")] <- "WT"
## Define a data.frame that will be used as phenodata
pheno <- data.frame(sample_name = sub(basename(cdf_files), pattern = ".CDF",
                                      replacement = "", fixed = TRUE),
                    sample_group = s_groups, stringsAsFactors = FALSE)

## Read the data.
raw_data <- readMSData2(cdf_files, pdata = new("NAnnotatedDataFrame", pheno))

We next plot the total ion chromatogram (TIC) for all files within the experiment. Note that we are iteratively sub-setting the full data per file using the filterFile method, which, for OnDiskMSnExp objects, is an efficient way to subset the data while ensuring that all data, including metadata, stays consistent.

## brewer.pal is provided by the RColorBrewer package
library(RColorBrewer)
sample_colors <- brewer.pal(3, "Set1")[1:2]
names(sample_colors) <- c("KO", "WT")
## Subset the full raw data by file and plot the data.
tmp <- filterFile(raw_data, file = 1)
plot(x = rtime(tmp), y = tic(tmp), xlab = "retention time", ylab = "TIC",
     col = paste0(sample_colors[pData(tmp)$sample_group], 80), type = "l")
for (i in 2:length(fileNames(raw_data))) {
    tmp <- filterFile(raw_data, file = i)
    points(rtime(tmp), tic(tmp), type = "l",
           col = paste0(sample_colors[pData(tmp)$sample_group], 80))
}
legend("topleft", col = sample_colors, legend = names(sample_colors), lty = 1)