xcms 3.2.0
This document describes how to use xcms for the analysis of direct injection mass spec data, including peak detection, calibration and correspondence (grouping of peaks across samples).
Prior to any other analysis step, peaks have to be identified in the mass spec
data. In contrast to the typical metabolomics workflow, in which peaks are
identified in the chromatographic (time) dimension, in direct injection mass
spec data sets peaks are identified in the m/z dimension. xcms
uses functionality from the MassSpecWavelet
package to identify such peaks.
Below we load the required packages. We disable parallel processing. To enable
and customize parallel processing please see the BiocParallel
vignette.
library(xcms)
library(MassSpecWavelet)
register(SerialParam())
In this documentation we use an example data set from the msdata
package. Assuming that msdata is installed, we locate the path of
the package and load the data set. We create also a data.frame
describing the
experimental setup based on the file names.
mzdata_path <- system.file("fticr", package = "msdata")
mzdata_files <- list.files(mzdata_path, recursive = TRUE, full.names = TRUE)
## Create a data.frame assigning samples to sample groups, i.e. ham4 and ham5.
grp <- rep("ham4", length(mzdata_files))
grp[grep(basename(mzdata_files), pattern = "^HAM005")] <- "ham5"
pd <- data.frame(filename = basename(mzdata_files), sample_group = grp)
## Load the data.
ham_raw <- readMSData(files = mzdata_files,
pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk")
The data files are from direct injection mass spectrometry experiments, i.e. we have only a single spectrum available for each sample and no retention times.
## Only a single spectrum with an *artificial* retention time is available
## for each sample
rtime(ham_raw)
## F01.S1 F02.S1 F03.S1 F04.S1 F05.S1 F06.S1 F07.S1 F08.S1 F09.S1 F10.S1
## 1 1 1 1 1 1 1 1 1 1
Peaks are identified within each spectrum using the mass spec wavelet method.
## Define the parameters for the peak detection
msw <- MSWParam(scales = c(1, 4, 9), nearbyPeak = TRUE, winSize.noise = 500,
SNR.method = "data.mean", snthresh = 10)
ham_prep <- findChromPeaks(ham_raw, param = msw)
head(chromPeaks(ham_prep))
## mz mzmin mzmax rt rtmin rtmax into maxo sn
## [1,] 403.2367 403.2279 403.2447 -1 -1 -1 4735258 372259.4 22.97534
## [2,] 409.1845 409.1747 409.1936 -1 -1 -1 4158404 310572.1 20.61382
## [3,] 413.2677 413.2585 413.2769 -1 -1 -1 6099006 435462.6 27.21723
## [4,] 423.2363 423.2266 423.2459 -1 -1 -1 2708391 174252.7 14.74527
## [5,] 427.2681 427.2574 427.2779 -1 -1 -1 6302089 461385.6 32.50050
## [6,] 437.2375 437.2254 437.2488 -1 -1 -1 7523070 517917.6 34.37645
## intf maxf sample is_filled
## [1,] NA 814693.1 1 0
## [2,] NA 732119.9 1 0
## [3,] NA 1018994.8 1 0
## [4,] NA 435858.5 1 0
## [5,] NA 1125644.3 1 0
## [6,] NA 1282906.5 1 0
The calibrate
method can be used to correct the m/z values of identified
peaks. The currently implemented method requires identified peaks and a list of
m/z values for known calibrants. The identified peaks m/z values are then
adjusted based on the differences between the calibrants’ m/z values and the m/z
values of the closest peaks (within a user defined permitted maximal
distance). Note that this method does presently only calibrate identified peaks,
but not the original m/z values in the spectra.
Below we demonstrate the calibrate
method on one of the data files with
artificially defined calibration m/z values. We first subset the data set to the
first data file, extract the m/z values of 3 peaks and modify the values
slightly.
## Subset to the first file.
first_file <- filterFile(ham_prep, file = 1)
## Extract 3 m/z values
calib_mz <- chromPeaks(first_file)[c(1, 4, 7), "mz"]
calib_mz <- calib_mz + 0.00001 * runif(1, 0, 0.4) * calib_mz + 0.0001
Next we calibrate the data set using the previously defined artificial
calibrants. We are using the "edgeshift"
method for calibration that adjusts
all peaks within the range of the m/z values of the calibrants using a linear
interpolation and shifts all chromatographic peaks outside of that range by a
constant factor (the difference between the lowest respectively largest
calibrant m/z with the closest peak’s m/z). Note that in a real use case, the
m/z values would obviously represent known m/z of calibrants and would not be
defined on the actual data.
## Set-up the parameter class for the calibration
prm <- CalibrantMassParam(mz = calib_mz, method = "edgeshift",
mzabs = 0.0001, mzppm = 5)
first_file_calibrated <- calibrate(first_file, param = prm)
To evaluate the calibration we plot below the difference between the adjusted and raw m/z values (y-axis) against the raw m/z values.
diffs <- chromPeaks(first_file_calibrated)[, "mz"] -
chromPeaks(first_file)[, "mz"]
plot(x = chromPeaks(first_file)[, "mz"], xlab = expression(m/z[raw]),
y = diffs, ylab = expression(m/z[calibrated] - m/z[raw]))
Correspondence aims to group peaks across samples to define the features (ions
with the same m/z values). Peaks from single spectrum, direct injection MS
experiments can be grouped with the MZclust method. Below we perform the
correspondence analysis with the groupChromPeaks
method using default
settings.
## Using default settings but define sample group assignment
mzc_prm <- MzClustParam(sampleGroups = ham_prep$sample_group)
ham_prep <- groupChromPeaks(ham_prep, param = mzc_prm)
Getting an overview of the performed processings:
ham_prep
## MSn experiment data ("XCMSnExp")
## Object size in memory: 0.05 Mb
## - - - Spectra data - - -
## MS level(s): 1
## Number of spectra: 10
## MSn retention times: 0:1 - 0:1 minutes
## - - - Processing information - - -
## Data loaded [Mon Apr 30 18:47:36 2018]
## MSnbase version: 2.6.0
## - - - Meta data - - -
## phenoData
## rowNames: 1 2 ... 10 (10 total)
## varLabels: filename sample_group
## varMetadata: labelDescription
## Loaded from:
## [1] HAM004_641fE_14-11-07--Exp1.extracted.mzdata... [10] HAM005_641fE_14-11-07--Exp5.extracted.mzdata
## Use 'fileNames(.)' to see all files.
## protocolData: none
## featureData
## featureNames: F01.S1 F02.S1 ... F10.S1 (10 total)
## fvarLabels: fileIdx spIdx ... spectrum (29 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
## method: MSW
## 88 peaks identified in 10 samples.
## On average 8.8 chromatographic peaks per sample.
## Correspondence:
## method: mzClust
## 18 features identified.
## Median mz range of features: 0.00088501
## Median rt range of features: 0
The peak group information, i.e. the feature definitions can be accessed with
the featureDefinitions
method.
featureDefinitions(ham_prep)
## DataFrame with 18 rows and 10 columns
## mzmed mzmin mzmax rtmed rtmin
## <numeric> <numeric> <numeric> <numeric> <numeric>
## FT01 402.285382080078 402.285064697266 402.285858154297 -1 -1
## FT02 403.236521402995 403.235687255859 403.236694335938 -1 -1
## FT03 405.108932495117 405.108734130859 405.109527587891 -1 -1
## FT04 409.184381103516 409.183746337891 409.184539794922 -1 -1
## FT05 410.144424438477 410.144012451172 410.144836425781 -1 -1
## ... ... ... ... ... ...
## FT14 438.240702311198 438.240386962891 438.241333007812 -1 -1
## FT15 439.150836181641 439.150268554688 439.151214599609 -1 -1
## FT16 441.129954020182 441.129638671875 441.130584716797 -1 -1
## FT17 444.129648844401 444.128997802734 444.129974365234 -1 -1
## FT18 445.29288482666 445.292175292969 445.293121337891 -1 -1
## rtmax npeaks ham4 ham5
## <numeric> <numeric> <numeric> <numeric>
## FT01 -1 5 0 5
## FT02 -1 9 5 4
## FT03 -1 4 0 4
## FT04 -1 5 5 0
## FT05 -1 4 0 4
## ... ... ... ... ...
## FT14 -1 3 3 0
## FT15 -1 5 0 5
## FT16 -1 3 0 3
## FT17 -1 3 0 3
## FT18 -1 4 4 0
## peakidx
## <list>
## FT01 c(37, 60, 80, 49, 70)
## FT02 c(61, 38, 50, 71, 1, 9, 16, 23, 31)
## FT03 c(39, 51, 62, 81)
## FT04 c(10, 2, 17, 24, 32)
## FT05 c(40, 72, 52, 63)
## ... ...
## FT14 c(7, 14, 22)
## FT15 c(67, 86, 47, 58, 78)
## FT16 c(59, 87, 48)
## FT17 c(68, 79, 88)
## FT18 c(15, 8, 30, 36)
Plotting the raw data for direct injection samples involves a little more
processing than for LC/GC-MS data in which we can simply use the chromatogram
method to extract the data. Below we extract the m/z-intensity pairs for the
peaks associated with the first feature. We thus first identify the peaks for
that feature and define their m/z values range. Using this range we can
subsequently use the filterMz
function to sub-set the full data set to the
signal associated with the feature’s peaks. On that object we can then call the
mz
and intensity
functions to extract the data.
## Get the peaks belonging to the first feature
pks <- chromPeaks(ham_prep)[featureDefinitions(ham_prep)$peakidx[[1]], ]
## Define the m/z range
mzr <- c(min(pks[, "mzmin"]) - 0.001, max(pks[, "mzmax"]) + 0.001)
## Subset the object to the m/z range
ham_prep_sub <- filterMz(ham_prep, mz = mzr)
## Extract the mz and intensity values
mzs <- mz(ham_prep_sub, bySample = TRUE)
ints <- intensity(ham_prep_sub, bySample = TRUE)
## Plot the data
plot(3, 3, pch = NA, xlim = range(mzs), ylim = range(ints), main = "FT01",
xlab = "m/z", ylab = "intensity")
## Define colors
cols <- rep("#ff000080", length(mzs))
cols[ham_prep_sub$sample_group == "ham5"] <- "#0000ff80"
tmp <- mapply(mzs, ints, cols, FUN = function(x, y, col) {
points(x, y, col = col, type = "l")
})
To access the actual intensity values of each feature in each sample the
featureValue
method can be used. The setting value = "into"
tells the
function to return the integrated signal for each peak (one representative peak)
per sample.
feat_vals <- featureValues(ham_prep, value = "into")
head(feat_vals)
## HAM004_641fE_14-11-07--Exp1.extracted.mzdata
## FT01 NA
## FT02 4735258
## FT03 NA
## FT04 4158404
## FT05 NA
## FT06 6099006
## HAM004_641fE_14-11-07--Exp2.extracted.mzdata
## FT01 NA
## FT02 6202418
## FT03 NA
## FT04 5004546
## FT05 NA
## FT06 4950642
## HAM004_641fE_14-11-07--Exp3.extracted.mzdata
## FT01 NA
## FT02 6117414
## FT03 NA
## FT04 4403588
## FT05 NA
## FT06 5517710
## HAM004_641fE_14-11-07--Exp4.extracted.mzdata
## FT01 NA
## FT02 5328574
## FT03 NA
## FT04 4336554
## FT05 NA
## FT06 5008542
## HAM004_641fE_14-11-07--Exp5.extracted.mzdata
## FT01 NA
## FT02 6429029
## FT03 NA
## FT04 4580893
## FT05 NA
## FT06 4856606
## HAM005_641fE_14-11-07--Exp1.extracted.mzdata
## FT01 4095293
## FT02 4811391
## FT03 2982453
## FT04 NA
## FT05 2872023
## FT06 NA
## HAM005_641fE_14-11-07--Exp2.extracted.mzdata
## FT01 4804763
## FT02 2581183
## FT03 2268984
## FT04 NA
## FT05 2133219
## FT06 NA
## HAM005_641fE_14-11-07--Exp3.extracted.mzdata
## FT01 4657727
## FT02 2727237
## FT03 2971705
## FT04 NA
## FT05 2466626
## FT06 NA
## HAM005_641fE_14-11-07--Exp4.extracted.mzdata
## FT01 3755890
## FT02 2496859
## FT03 NA
## FT04 NA
## FT05 2980997
## FT06 NA
## HAM005_641fE_14-11-07--Exp5.extracted.mzdata
## FT01 5265972
## FT02 NA
## FT03 3009065
## FT04 NA
## FT05 NA
## FT06 NA
NA
is reported for features in samples for which no peak was identified at the
feature’s m/z value. In some instances there might still be a signal at the
feature’s position in the raw data files, but the peak detection failed to
identify a peak. For these cases signal can be recovered using the
fillChromPeaks
method that integrates all raw signal at the feature’s
location. If there is no signal at that location an NA
is reported.
ham_prep <- fillChromPeaks(ham_prep, param = FillChromPeaksParam())
head(featureValues(ham_prep, value = "into"))
## HAM004_641fE_14-11-07--Exp1.extracted.mzdata
## FT01 751860.7
## FT02 4735257.5
## FT03 749756.9
## FT04 4158404.5
## FT05 707975.5
## FT06 6099006.3
## HAM004_641fE_14-11-07--Exp2.extracted.mzdata
## FT01 1202735.5
## FT02 6202417.6
## FT03 450339.0
## FT04 5004546.3
## FT05 481158.4
## FT06 4950641.7
## HAM004_641fE_14-11-07--Exp3.extracted.mzdata
## FT01 779492.4
## FT02 6117414.1
## FT03 588609.5
## FT04 4403588.2
## FT05 1149542.8
## FT06 5517709.5
## HAM004_641fE_14-11-07--Exp4.extracted.mzdata
## FT01 541406.3
## FT02 5328574.1
## FT03 877991.1
## FT04 4336554.2
## FT05 798341.5
## FT06 5008541.7
## HAM004_641fE_14-11-07--Exp5.extracted.mzdata
## FT01 543921.1
## FT02 6429028.9
## FT03 1363759.7
## FT04 4580892.8
## FT05 1107571.1
## FT06 4856606.0
## HAM005_641fE_14-11-07--Exp1.extracted.mzdata
## FT01 4095293
## FT02 4811391
## FT03 2982453
## FT04 1174152
## FT05 2872023
## FT06 1708432
## HAM005_641fE_14-11-07--Exp2.extracted.mzdata
## FT01 4804763
## FT02 2581183
## FT03 2268984
## FT04 1183772
## FT05 2133219
## FT06 1037441
## HAM005_641fE_14-11-07--Exp3.extracted.mzdata
## FT01 4657726.8
## FT02 2727237.5
## FT03 2971705.2
## FT04 517915.7
## FT05 2466625.6
## FT06 867137.8
## HAM005_641fE_14-11-07--Exp4.extracted.mzdata
## FT01 3755889.7
## FT02 2496858.9
## FT03 2291624.1
## FT04 1321138.8
## FT05 2980996.6
## FT06 947289.2
## HAM005_641fE_14-11-07--Exp5.extracted.mzdata
## FT01 5265972
## FT02 2128723
## FT03 3009065
## FT04 1163582
## FT05 2227016
## FT06 1006506
Further analysis, i.e. detection of features/metabolites with significantly different abundances, or PCA analyses can be performed on the feature matrix using functionality from other R packages, such as limma.