Package: xcms
Authors: Johannes Rainer
Modified: 2023-04-25 14:01:32.005268
Compiled: Tue Apr 25 19:15:42 2023

1 Introduction

This document describes data import, exploration, preprocessing and analysis of LC/MS experiments with xcms version >= 3. The examples and basic workflow were adapted from the original LC/MS Preprocessing and Analysis with xcms vignette by Colin A. Smith.

The new user interface and methods use the XCMSnExp object (instead of the old xcmsSet object) as a container for the pre-processing results. To support packages and pipelines that rely on the xcmsSet object, it is however possible to convert an XCMSnExp into an xcmsSet object using the as method (i.e. xset <- as(x, "xcmsSet"), with x being an XCMSnExp object).

2 Data import

xcms supports analysis of LC/MS data from files in (AIA/ANDI) NetCDF, mzXML and mzML format. For the actual data import Bioconductor’s mzR is used. For demonstration purposes we will analyze a subset of the data from [1], in which the metabolic consequences of knocking out the fatty acid amide hydrolase (FAAH) gene in mice were investigated. The raw data files (in NetCDF format) are provided with the faahKO data package. The data set consists of samples from the spinal cords of 6 knock-out and 6 wild-type mice. Each file contains data in centroid mode acquired in positive ion mode from 200-600 m/z and 2500-4500 seconds. To speed up processing of this vignette we will restrict the analysis to only 8 files and to the retention time range from 2500 to 3500 seconds.

Below we load all required packages, locate the raw CDF files within the faahKO package and build a phenodata data frame describing the experimental setup. Note that for real experiments it is suggested to define a file (table) that contains the file names of the raw data files along with descriptions of the samples as additional columns. Such a file could then be imported with e.g. read.table as variable pd (instead of being defined within R as in the example below), and the file names could be passed along to the readMSData function below with e.g. files = paste0(MZML_PATH, "/", pd$mzML_file), where MZML_PATH would be the path to the directory in which the files are located and "mzML_file" the name of the column in the phenodata file that contains the file names.


## Load the packages used in this section
library(xcms)
library(faahKO)
library(RColorBrewer)

## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
            recursive = TRUE)[c(1, 2, 5, 6, 7, 8, 11, 12)]
## Create a phenodata data.frame
pd <- data.frame(sample_name = sub(basename(cdfs), pattern = ".CDF",
                                   replacement = "", fixed = TRUE),
                 sample_group = c(rep("KO", 4), rep("WT", 4)),
                 stringsAsFactors = FALSE)
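For a real experiment, the file-based phenodata approach described above could look roughly like the sketch below. It is only an illustration: the table content, the file names, the use of a temporary directory as MZML_PATH and the variable names pd_example and files_example are all made up for this example and are not part of the faahKO data.

```r
## Minimal sketch with made-up file names (assumption: a tab-delimited
## table with a column "mzML_file" holding the raw data file names).
MZML_PATH <- tempdir()                      # hypothetical data directory
pd_file <- file.path(MZML_PATH, "phenodata.txt")

## Write a small example table (stand-in for a manually curated file)
writeLines(c("mzML_file\tsample_name\tsample_group",
             "ko15.mzML\tko15\tKO",
             "ko16.mzML\tko16\tKO",
             "wt15.mzML\twt15\tWT",
             "wt16.mzML\twt16\tWT"),
           pd_file)

## Import the phenodata table
pd_example <- read.table(pd_file, header = TRUE, sep = "\t",
                         stringsAsFactors = FALSE)

## Build the full file names that would be passed to readMSData
files_example <- paste0(MZML_PATH, "/", pd_example$mzML_file)
```

Keeping the sample annotation in a separate table means the same file can document the experiment and drive the import, rather than hard-coding sample groups in the analysis script.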

Subsequently we load the raw data as an OnDiskMSnExp object using the readMSData method from the MSnbase package. The MSnbase package provides base structures and infrastructure for the processing of mass spectrometry data. MSnbase can also be used to centroid profile-mode MS data (see the corresponding vignette in the MSnbase package).

raw_data <- readMSData(files = cdfs, pdata = new("NAnnotatedDataFrame", pd),
                       mode = "onDisk")

We next restrict the data set to the retention time range from 2500 to 3500 seconds. This is merely to reduce the processing time of this vignette.

raw_data <- filterRt(raw_data, c(2500, 3500))

The resulting OnDiskMSnExp object contains general information about the number of spectra, retention times, the measured total ion current etc., but does not contain the full raw data (i.e. the m/z and intensity values from each measured spectrum). Its memory footprint is thus rather small, making it an ideal object to represent large metabolomics experiments while still allowing simple quality controls, data inspection and exploration as well as data sub-setting operations. The m/z and intensity values are imported from the raw data files on demand, hence the location of the raw data files should not be changed after the initial data import.

3 Initial data inspection

The OnDiskMSnExp organizes the MS data by spectrum and provides the methods intensity, mz and rtime to access the raw data from the files (the measured intensity values, the corresponding m/z and retention time values). In addition, the spectra method can be used to return all data encapsulated in Spectrum objects. Below we extract the retention time values from the object.

head(rtime(raw_data))

## F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
## 2501.378 2502.943 2504.508 2506.073 2507.638 2509.203

All data is returned as one-dimensional vectors (a numeric vector for rtime and a list of numeric vectors for mz and intensity, each containing the values from one spectrum), even if the experiment consists of multiple files/samples. The fromFile function returns an integer vector providing the mapping of the values to the originating file. Below we use the fromFile indices to organize the mz values by file.

mzs <- mz(raw_data)

## Split the list by file
mzs_by_file <- split(mzs, f = fromFile(raw_data))

length(mzs_by_file)

## [1] 8
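The grouping by file can be illustrated without reading any raw data. The sketch below uses small made-up stand-ins (mzs_toy and from_file_toy are hypothetical names, not objects from this vignette) for the list returned by mz and the integer vector returned by fromFile.

```r
## Toy stand-ins (not real data): m/z values of five spectra and the
## index of the file each spectrum originates from.
mzs_toy <- list(c(200.1, 300.2), c(200.1, 250.3, 410.0), c(305.5),
                c(220.0, 330.1), c(500.4, 510.2))
from_file_toy <- c(1L, 1L, 1L, 2L, 2L)

## Split the list of spectra by file index
by_file_toy <- split(mzs_toy, f = from_file_toy)

length(by_file_toy)     # number of files
lengths(by_file_toy)    # number of spectra per file
```

Each element of the result is itself a list holding the m/z vectors of one file, so per-file summaries can be computed with e.g. lapply.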

As a first evaluation of the data we plot below the base peak chromatogram (BPC) for each file in our experiment. We use the chromatogram method and set aggregationFun to "max" to return the maximal intensity for each spectrum and hence create the BPC from the raw data. To create a total ion chromatogram we could set aggregationFun to "sum".

## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(raw_data, aggregationFun = "max")
## Define colors for the two groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:2], "60")
names(group_colors) <- c("KO", "WT")

## Plot all chromatograms.
plot(bpis, col = group_colors[raw_data$sample_group])
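The difference between aggregating with "max" and with "sum" can be seen on a few made-up spectra. The following is a conceptual sketch in base R only; ints_toy, bpc_toy and tic_toy are hypothetical names, and the code does not reproduce how chromatogram is implemented internally.

```r
## Toy intensities (not real data) for three consecutive spectra
ints_toy <- list(c(10, 250, 40), c(5, 80), c(300, 20, 20, 60))

## Base peak chromatogram: maximal intensity per spectrum
bpc_toy <- vapply(ints_toy, max, numeric(1))

## Total ion chromatogram: summed intensity per spectrum
tic_toy <- vapply(ints_toy, sum, numeric(1))
```

The BPC follows only the most intense signal in each spectrum, whereas the total ion chromatogram also accumulates all lower-intensity signals, so it is always at least as high as the BPC.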