xcms 3.2.0
Package: xcms
Authors: Johannes Rainer
Modified: 2018-04-30 13:21:36
Compiled: Mon Apr 30 18:47:52 2018
This documents describes data import, exploration, preprocessing and analysis of
LCMS experiments with xcms
version >= 3. The examples and basic workflow was
adapted from the original LC/MS Preprocessing and Analysis with xcms vignette
from Colin A. Smith.
The new user interface and methods use the XCMSnExp
object (instead of the old
xcmsSet
object) as a container for the pre-processing results. To support
packages and pipelines relying on the xcmsSet
object, it is however possible to
convert an XCMSnExp
into a xcmsSet
object using the as
method (i.e. xset <- as(x, "xcmsSet")
, with x
being an XCMSnExp
object.
xcms
supports analysis of LC/MS data from files in (AIA/ANDI) NetCDF, mzML/mzXML
and mzData format. For the actual data import Bioconductor’s SRC_R[:exports
both]{Biocpkg(“mzR”)} is used. For demonstration purpose we will analyze a
subset of the data from [1] in which the metabolic consequences
of knocking out the fatty acid amide hydrolase (FAAH) gene in mice was
investigated. The raw data files (in NetCDF format) are provided with the faahKO
data package. The data set consists of samples from the spinal cords of 6
knock-out and 6 wild-type mice. Each file contains data in centroid mode
acquired in positive ion mode form 200-600 m/z and 2500-4500 seconds.
Below we load all required packages, locate the raw CDF files within the faahKO
package and build a phenodata data frame describing the experimental setup.
library(xcms)
library(faahKO)
library(RColorBrewer)
library(pander)
library(magrittr)
## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
recursive = TRUE)
## Create a phenodata data.frame
pd <- data.frame(sample_name = sub(basename(cdfs), pattern = ".CDF",
replacement = "", fixed = TRUE),
sample_group = c(rep("KO", 6), rep("WT", 6)),
stringsAsFactors = FALSE)
Subsequently we load the raw data as an OnDiskMSnExp
object using the
readMSData
method from the MSnbase
package. While the MSnbase
package was
originally developed for proteomics data processing, many of its functionality,
including raw data import and data representation, can be shared and reused in
metabolomics data analysis. Also, MSnbase
can be used to centroid profile-mode
MS data (see the corresponding vignette in the MSnbase
package).
raw_data <- readMSData(files = cdfs, pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk")
The OnDiskMSnExp
object contains general information about the number of
spectra, retention times, the measured total ion current etc, but does not
contain the full raw data (i.e. the m/z and intensity values from each measured
spectrum). Its memory footprint is thus rather small making it an ideal object
to represent large metabolomics experiments while still allowing to perform
simple quality controls, data inspection and exploration as well as data
sub-setting operations. The m/z and intensity values are imported from the raw
data files on demand, hence the location of the raw data files should not be
changed after initial data import.
The OnDiskMSnExp
organizes the MS data by spectrum and provides the methods
intensity
, mz
and rtime
to access the raw data from the files (the measured
intensity values, the corresponding m/z and retention time values). In addition,
the spectra
method could be used to return all data encapsulated in Spectrum
classes. Below we extract the retention time values from the object.
head(rtime(raw_data))
## F01.S0001 F01.S0002 F01.S0003 F01.S0004 F01.S0005 F01.S0006
## 2501.378 2502.943 2504.508 2506.073 2507.638 2509.203
All data is returned as one-dimensional vectors (a numeric vector for rtime
and
a list
of numeric vectors for mz
and intensity
, each containing the values from
one spectrum), even if the experiment consists of multiple files/samples. The
fromFile
function returns a numeric vector that provides the mapping of the
values to the originating file. Below we use the fromFile
indices to organize
the mz
values by file.
mzs <- mz(raw_data)
## Split the list by file
mzs_by_file <- split(mzs, f = fromFile(raw_data))
length(mzs_by_file)
## [1] 12
As a first evaluation of the data we plot below the base peak chromatogram (BPC)
for each file in our experiment. We use the chromatogram
method and set the
aggregationFun
to "max"
to return for each spectrum the maximal intensity and
hence create the BPC from the raw data. To create a total ion chromatogram we
could set aggregationFun
to sum
.
## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(raw_data, aggregationFun = "max")
## Define colors for the two groups
group_colors <- brewer.pal(3, "Set1")[1:2]
names(group_colors) <- c("KO", "WT")
## Plot all chromatograms.
plot(bpis, col = group_colors[raw_data$sample_group])
The chromatogram
method returned a Chromatograms
object that organizes
individual Chromatogram
objects (which in fact contain the chromatographic data)
in a two-dimensional array: columns represent samples and rows (optionally) m/z
and/or retention time ranges. Below we extract the chromatogram of the first
sample and access its retention time and intensity values.
bpi_1 <- bpis[1, 1]
head(rtime(bpi_1))
## F01.S0001 F01.S0002 F01.S0003 F01.S0004 F01.S0005 F01.S0006
## 2501.378 2502.943 2504.508 2506.073 2507.638 2509.203
head(intensity(bpi_1))
## F01.S0001 F01.S0002 F01.S0003 F01.S0004 F01.S0005 F01.S0006
## 43888 43960 43392 42632 42200 42288
The chromatogram
method supports also extraction of chromatographic data from a
m/z-rt slice of the MS data. In the next section we will use this method to
create an extracted ion chromatogram (EIC) for a selected peak.
Note that chromatogram
reads the raw data from each file to calculate the
chromatogram. The bpi
and tic
methods on the other hand do not read any data
from the raw files but use the respective information that was provided in the
header definition of the input files (which might be different from the actual
data).
Below we create boxplots representing the distribution of total ion currents per file. Such plots can be very useful to spot problematic or failing MS runs.
## Get the total ion current by file
tc <- split(tic(raw_data), f = fromFile(raw_data))
boxplot(tc, col = group_colors[raw_data$sample_group],
ylab = "intensity", main = "Total ion current")
Next we perform the chromatographic peak detection using the centWave algorithm
[2]. Before running the peak detection it is however
strongly suggested to visually inspect e.g. the extracted ion chromatogram of
internal standards or known compounds to evaluate and adapt the peak detection
settings since the default settings will not be appropriate for most LCMS
experiments. The two most critical parameters for centWave are the peakwidth
(expected range of chromatographic peak widths) and ppm
(maximum expected
deviation of m/z values of centroids corresponding to one chromatographic peak;
this is usually much larger than the ppm specified by the manufacturer)
parameters.
To evaluate the typical chromatographic peak width we plot the EIC for one peak.
## Define the rt and m/z range of the peak area
rtr <- c(2700, 2900)
mzr <- c(334.9, 335.1)
## extract the chromatogram
chr_raw <- chromatogram(raw_data, mz = mzr, rt = rtr)
plot(chr_raw, col = group_colors[chr_raw$sample_group])
Note that Chromatogram
objects extracted by the chromatogram
method contain an
NA
value if in a certain scan (i.e. for a specific retention time) no signal was
measured in the respective mz range. This is reflected by the lines not being
drawn as continuous lines in the plot above.
The peak above has a width of about 50 seconds. The peakwidth
parameter should
be set to accommodate the expected widths of peak in the data set. We set it to
20,80
for the present example data set.
For the ppm
parameter we extract the full MS data (intensity, retention time and
m/z values) corresponding to the above peak. To this end we first filter the raw
object by retention time, then by m/z and finally plot the object with type = "XIC"
to produce the plot below. We use the pipe (%>%
) command better
illustrate the corresponding workflow.
raw_data %>%
filterRt(rt = rtr) %>%
filterMz(mz = mzr) %>%
plot(type = "XIC")
In the present data there is actually no variation in the m/z values. Usually
one would see the m/z values (lower panel) scatter around the real m/z value of
the compound. It is suggested to inspect the ranges of m/z values for many
compounds (either internal standards or compounds known to be present in the
sample) and define the ppm
parameter for centWave according to these.
Below we perform the chromatographic peak detection using the findChromPeaks
method. The submitted parameter object defines which algorithm will be used and
allows to define the settings for this algorithm. Note that we set the argument
noise
to 1000
to slightly speed up the analysis by considering only signals with
a value larger than 1000 in the peak detection step.
cwp <- CentWaveParam(peakwidth = c(30, 80), noise = 1000)
xdata <- findChromPeaks(raw_data, param = cwp)
The results are returned as an XCMSnExp
object which extends the OnDiskMSnExp
object by storing also LC/GC-MS preprocessing results. This means also that all
methods to sub-set and filter the data or to access the (raw) data are inherited
from the OnDiskMSnExp
object. The results from the chromatographic peak
detection can be accessed with the chromPeaks
method.
head(chromPeaks(xdata))
## mz mzmin mzmax rt rtmin rtmax into intb
## [1,] 236.1 236.1 236.1 2520.158 2501.378 2553.022 301289.53 268211.58
## [2,] 362.9 362.9 362.9 2596.840 2587.451 2604.665 19916.60 19674.79
## [3,] 302.0 302.0 302.0 2617.185 2587.451 2648.484 699458.52 687162.27
## [4,] 316.2 316.2 316.2 2668.828 2661.003 2676.653 51310.09 50854.60
## [5,] 370.1 370.1 370.1 2673.523 2643.789 2706.387 458247.09 423667.40
## [6,] 360.0 360.0 360.0 2682.913 2635.964 2718.907 9034381.61 8518838.59
## maxo sn sample is_filled
## [1,] 12957 13 1 0
## [2,] 1453 12 1 0
## [3,] 30552 73 1 0
## [4,] 4267 17 1 0
## [5,] 25672 16 1 0
## [6,] 317568 20 1 0
The returned matrix
provides the m/z and retention time range for each
identified chromatographic peak as well as the integrated signal intensity
(“into”) and the maximal peak intensitity (“maxo”). Columns “sample” contains
the index of the sample in the object/experiment in which the peak was
identified.
Below we use the data from this table to calculate some per-file summaries.
summary_fun <- function(z) {
c(peak_count = nrow(z), rt = quantile(z[, "rtmax"] - z[, "rtmin"]))
}
T <- lapply(split.data.frame(chromPeaks(xdata),
f = chromPeaks(xdata)[, "sample"]),
FUN = summary_fun)
T <- do.call(rbind, T)
rownames(T) <- basename(fileNames(xdata))
pandoc.table(T,
caption = paste0("Summary statistics on identified chromatographic",
" peaks. Shown are number of identified peaks per",
" sample and widths/duration of chromatographic ",
"peaks."))
peak_count | rt.0% | rt.25% | rt.50% | rt.75% | rt.100% | |
---|---|---|---|---|---|---|
ko15.CDF | 374 | 14.08 | 50.47 | 57.9 | 67.29 | 169 |
ko16.CDF | 489 | 14.08 | 48.51 | 59.47 | 67.29 | 169 |
ko18.CDF | 342 | 14.09 | 51.64 | 61.03 | 68.86 | 223.8 |
ko19.CDF | 259 | 14.08 | 53.21 | 64.16 | 71.99 | 173.7 |
ko21.CDF | 222 | 14.09 | 56.34 | 65.73 | 73.16 | 164.3 |
ko22.CDF | 255 | 14.08 | 51.64 | 62.6 | 71.99 | 134.6 |
wt15.CDF | 362 | 14.08 | 50.08 | 57.9 | 67.29 | 139.3 |
wt16.CDF | 352 | 14.09 | 53.21 | 61.03 | 68.86 | 133 |
wt18.CDF | 302 | 14.08 | 53.21 | 61.03 | 70.42 | 192.5 |
wt19.CDF | 274 | 14.08 | 56.34 | 64.16 | 75.12 | 184.7 |
wt21.CDF | 229 | 15.65 | 54.77 | 64.16 | 73.55 | 136.2 |
wt22.CDF | 337 | 14.09 | 48.51 | 61.03 | 70.42 | 134.6 |
We can also plot the location of the identified chromatographic peaks in the
m/z - retention time space for one file using the plotChromPeaks
function. Below
we plot the chromatographic peaks for the 3rd sample.
plotChromPeaks(xdata, file = 3)
To get a global overview of the peak detection we can plot the frequency of identified peaks per file along the retention time axis. This allows to identify time periods along the MS run in which a higher number of peaks was identified and evaluate whether this is consistent across files.
plotChromPeakImage(xdata)
Next we highlight the identified chromatographic peaks for the example peak from before. Evaluating such plots on a list of peaks corresponding to known peaks or internal standards helps to ensure that peak detection settings were appropriate and correctly identified the expected peaks.
plot(chr_raw, col = group_colors[chr_raw$sample_group], lwd = 2)
highlightChromPeaks(xdata, border = group_colors[chr_raw$sample_group],
lty = 3, rt = rtr, mz = mzr)
Note that we can also specifically extract identified chromatographic peaks for
a selected region by providing the respective m/z and retention time ranges with
the mz
and rt
arguments in the chromPeaks
method.
pander(chromPeaks(xdata, mz = mzr, rt = rtr),
caption = paste("Identified chromatographic peaks in a selected ",
"m/z and retention time range."))
mz | mzmin | mzmax | rt | rtmin | rtmax | into | intb | maxo | sn |
---|---|---|---|---|---|---|---|---|---|
335 | 335 | 335 | 2783 | 2756 | 2817 | 421286 | 392692 | 16856 | 26 |
335 | 335 | 335 | 2788 | 2756 | 2821 | 1529680 | 1490861 | 58736 | 77 |
335 | 335 | 335 | 2788 | 2758 | 2822 | 221355 | 204694 | 8158 | 19 |
335 | 335 | 335 | 2786 | 2756 | 2821 | 299084 | 281522 | 9154 | 22 |
335 | 335 | 335 | 2799 | 2758 | 2838 | 174587 | 160226 | 6295 | 13 |
335 | 335 | 335 | 2788 | 2756 | 2822 | 954335 | 933983 | 35856 | 76 |
335 | 335 | 335 | 2791 | 2758 | 2827 | 1668431 | 1635820 | 54640 | 94 |
335 | 335 | 335 | 2786 | 2756 | 2810 | 644912 | 632013 | 20672 | 54 |
335 | 335 | 335 | 2794 | 2769 | 2832 | 931316 | 904196 | 27200 | 49 |
sample | is_filled |
---|---|
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 0 |
8 | 0 |
9 | 0 |
10 | 0 |
11 | 0 |
Finally we plot also the distribution of peak intensity per sample. This allows to investigate whether systematic differences in peak signals between samples are present.
## Extract a list of per-sample peak intensities (in log2 scale)
ints <- split(log2(chromPeaks(xdata)[, "into"]),
f = chromPeaks(xdata)[, "sample"])
boxplot(ints, varwidth = TRUE, col = group_colors[xdata$sample_group],
ylab = expression(log[2]~intensity), main = "Peak intensities")
grid(nx = NA, ny = NULL)
The time at which analytes elute in the chromatography can vary between samples (and even compounds). Such a difference was already observable in the extracted ion chromatogram plot shown as an example in the previous section. The alignment step, also referred to as retention time correction, aims at adjusting this by shifting signals along the retention time axis to align the signals between different samples within an experiment.
A plethora of alignment algorithms exist (see [3]), with some of
them being implemented also in xcms
. The method to perform the
alignment/retention time correction in xcms
is adjustRtime
which uses different
alignment algorithms depending on the provided parameter class. In the example
below we use the obiwarp method [4] to align the samples. We
use a binSize = 0.6
which creates warping functions in mz bins of 0.6. Also here
it is advisable to modify the settings for each experiment and evaluate if
retention time correction did align internal controls or known compounds
properly.
xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 0.6))
adjustRtime
, besides calculating adjusted retention times for each spectrum,
does also adjust the reported retention times of the identified chromatographic
peaks.
To extract the adjusted retention times we can use the adjustedRtime
method, or
simply the rtime
method that, if present, returns by default adjusted retention
times from an XCMSnExp
object.
## Extract adjusted retention times
head(adjustedRtime(xdata))
## F01.S0001 F01.S0002 F01.S0003 F01.S0004 F01.S0005 F01.S0006
## 2501.378 2502.958 2504.538 2506.118 2507.699 2509.280
## Or simply use the rtime method
head(rtime(xdata))
## F01.S0001 F01.S0002 F01.S0003 F01.S0004 F01.S0005 F01.S0006
## 2501.378 2502.958 2504.538 2506.118 2507.699 2509.280
Raw retention times can be extracted from an XCMSnExp
containing
aligned data with rtime(xdata, adjusted = FALSE)
.
To evaluate the impact of the alignment we plot the BPC on the adjusted data. In
addition we plot the differences of the adjusted- to the raw retention times per
sample using the plotAdjustedRtime
function.
## Get the base peak chromatograms.
bpis_adj <- chromatogram(xdata, aggregationFun = "max")
par(mfrow = c(2, 1), mar = c(4.5, 4.2, 1, 0.5))
plot(bpis_adj, col = group_colors[bpis_adj$sample_group])
## Plot also the difference of adjusted to raw retention time.
plotAdjustedRtime(xdata, col = group_colors[xdata$sample_group])
Too large differences between adjusted and raw retention times could indicate poorly performing samples or alignment.
Alternatively we could use the peak groups alignment method that adjusts the
retention time by aligning previously identified hook peaks (chromatographic
peaks present in most/all samples). Ideally, these hook peaks should span most
part of the retention time range. Below we first restore the raw retention times
(also of the identified peaks) using the dropAdjustedRtime
methods. Note that a
drop*
method is available for each preprocessing step allowing to remove the
respective results from the XCMSnExp
object.
## Does the object have adjusted retention times?
hasAdjustedRtime(xdata)
## [1] TRUE
## Drop the alignment results.
xdata <- dropAdjustedRtime(xdata)
## Does the object have adjusted retention times?
hasAdjustedRtime(xdata)
## [1] FALSE
Note: XCMSnExp
objects hold the raw along with the adjusted retention times and
subsetting will in most cases drop the adjusted retention times. Sometimes it
might thus be useful to replace the raw retention times with the adjusted
retention times. This can be done with the applyAdjustedRtime
.
As noted above the peak groups method requires peak groups (features) present in
most samples to perform the alignment. We thus have to perform a first
correspondence run to identify such peaks (details about the algorithm used are
presented in the next section). We use here again default settings, but it is
strongly advised to adapt the parameters for each data set. The definition of
the sample groups (i.e. assignment of individual samples to the sample groups in
the experiment) is mandatory for the PeakDensityParam
. If there are no sample
groups in the experiment sampleGroups
should be set to a single value for each
file (e.g. rep(1, length(fileNames(xdata))
).
## Correspondence: group peaks across samples.
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.8)
xdata <- groupChromPeaks(xdata, param = pdp)
## Now the retention time correction.
pgp <- PeakGroupsParam(minFraction = 0.85)
## Get the peak groups that would be used for alignment.
xdata <- adjustRtime(xdata, param = pgp)
Note also that we could use the adjustedRtimePeakGroups
method on the object
before alignment to evaluate on which features (peak groups) the alignment would
be performed. This can be useful to test different settings for the peak groups
algorithm. Also, it is possible to manually select or define certain peak groups
(i.e. their retention times per sample) and provide this matrix to the
PeakGroupsParam
class with the peakGroupsMatrix
argument.
Below plot the difference between raw and adjusted retention times
using the plotAdjustedRtime
function, which, if the peak groups method is used
for alignment, also highlights the peak groups used in the adjustment.
## Plot the difference of adjusted to raw retention time.
plotAdjustedRtime(xdata, col = group_colors[xdata$sample_group],
peakGroupsCol = "grey", peakGroupsPch = 1)
At last we evaluate the impact of the alignment on the test peak.
par(mfrow = c(2, 1))
## Plot the raw data
plot(chr_raw, col = group_colors[chr_raw$sample_group])
## Extract the chromatogram from the adjusted object
chr_adj <- chromatogram(xdata, rt = rtr, mz = mzr)
plot(chr_adj, col = group_colors[chr_raw$sample_group])
The final step in the metabolomics preprocessing is the correspondence that
matches detected chromatographic peaks between samples (and depending on the
settings, also within samples if they are adjacent). The method to perform the
correspondence in xcms
is groupChromPeaks
. We will use the peak density method
[5] to group chromatographic peaks. The algorithm combines
chromatographic peaks depending on the density of peaks along the retention time
axis within small slices along the mz dimension. To illustrate this we plot
below the chromatogram for an mz slice with multiple chromatographic peaks
within each sample. We use below a value of 0.4 for the minFraction
parameter
hence only chromatographic peaks present in at least 40% of the samples per
sample group are grouped into a feature. The sample group assignment is
specified with the sampleGroups
argument.
## Define the mz slice.
mzr <- c(305.05, 305.15)
## Extract and plot the chromatograms
chr_mzr <- chromatogram(xdata, mz = mzr, rt = c(2500, 4000))
par(mfrow = c(3, 1), mar = c(1, 4, 1, 0.5))
cols <- group_colors[chr_mzr$sample_group]
plot(chr_mzr, col = cols, xaxt = "n", xlab = "")
## Highlight the detected peaks in that region.
highlightChromPeaks(xdata, mz = mzr, col = cols, type = "point", pch = 16)
## Define the parameters for the peak density method
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.4, bw = 30)
par(mar = c(4, 4, 1, 0.5))
plotChromPeakDensity(xdata, mz = mzr, col = cols, param = pdp,
pch = 16, xlim = c(2500, 4000))
## Use a different bw
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.4, bw = 20)
plotChromPeakDensity(xdata, mz = mzr, col = cols, param = pdp,
pch = 16, xlim = c(2500, 4000))
The upper panel in the plot above shows the extracted ion chromatogram for each
sample with the detected peaks highlighted. The middle and lower plot shows the
retention time for each detected peak within the different samples. The black
solid line represents the density distribution of detected peaks along the
retention times. Peaks combined into features (peak groups) are indicated with
grey rectangles. Different values for the bw
parameter of the PeakDensityParam
were used: bw = 30
in the middle and bw = 20
in the lower panel. With the
default value for the parameter bw
the two neighboring chromatographic peaks
would be grouped into the same feature, while with a bw
of 20 they would be
grouped into separate features. This grouping depends on the parameters for the
density function and other parameters passed to the algorithm with the
PeakDensityParam
.
## Perform the correspondence
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.4, bw = 20)
xdata <- groupChromPeaks(xdata, param = pdp)
The results from the correspondence can be extracted using the
featureDefinitions
method, that returns a DataFrame
with the definition of the
features (i.e. the mz and rt ranges and, in column peakidx
, the index of the
chromatographic peaks in the chromPeaks
matrix for each feature).
## Extract the feature definitions
featureDefinitions(xdata)
## DataFrame with 312 rows and 10 columns
## mzmed mzmin mzmax rtmed
## <numeric> <numeric> <numeric> <numeric>
## FT001 205 205 205 2790.91667807044
## FT002 206 206 206 2790.76510706472
## FT003 207.100006103516 207.100006103516 207.100006103516 2720.76064413194
## FT004 236.100006103516 236.100006103516 236.100006103516 2526.73236656529
## FT005 241.100006103516 241.100006103516 241.199996948242 3678.87576330383
## ... ... ... ... ...
## FT308 594.200012207031 594.200012207031 594.200012207031 3405.34666745362
## FT309 594.299987792969 594.299987792969 594.299987792969 3613.40135845759
## FT310 595.200012207031 595.200012207031 595.299987792969 3004.51480319748
## FT311 596.299987792969 596.299987792969 596.400024414062 3818.52945007536
## FT312 597.400024414062 597.299987792969 597.400024414062 3814.92705405312
## rtmin rtmax npeaks KO WT
## <numeric> <numeric> <numeric> <numeric> <numeric>
## FT001 2790.25914950764 2791.33708124005 12 6 6
## FT002 2787.95417597022 2794.07524052365 11 4 6
## FT003 2718.2665593745 2723.95355653805 8 4 4
## FT004 2523.80327395574 2527.55945827299 5 2 3
## FT005 3664.1493202833 3683.06286789441 9 4 2
## ... ... ... ... ... ...
## FT308 3395.29146342443 3410.28858323004 3 0 3
## FT309 3608.97879037478 3619.97980830287 6 3 2
## FT310 3000.07972689414 3019.16896250021 7 2 4
## FT311 3813.57966000225 3826.51868541398 7 4 2
## FT312 3809.07197917863 3818.5741287851 5 1 3
## peakidx
## <list>
## FT001 c(43, 416, 912, 1242, 1500, 1727, 1978, 2350, 2705, 3000, 3270, 3508)
## FT002 c(25, 1232, 1244, 1495, 1724, 1968, 2341, 2693, 2996, 3267, 3501)
## FT003 c(399, 881, 1482, 1707, 1962, 2327, 2673, 3482)
## FT004 c(1, 376, 1943, 2304, 3235)
## FT005 c(253, 1122, 1397, 1406, 1880, 1887, 2540, 3171, 3184)
## ... ...
## FT308 c(2797, 3095, 3638)
## FT309 c(217, 1605, 1865, 3390, 3396, 3698)
## FT310 c(63, 495, 2026, 2735, 3293, 3538, 3546)
## FT311 c(306, 1145, 1147, 1656, 1910, 2578, 3768)
## FT312 c(1150, 2917, 2919, 3440, 3765)
The featureValues
method returns a matrix
with rows being features and columns
samples. The content of this matrix can be defined using the value
argument. Setting value = "into"
returns a matrix with the integrated signal of
the peaks corresponding to a feature in a sample. Any column name of the
chromPeaks
matrix can be passed to the argument value
. Below we extract the
integrated peak intensity per feature/sample.
## Extract the into column for each feature.
head(featureValues(xdata, value = "into"))
## ko15.CDF ko16.CDF ko18.CDF ko19.CDF ko21.CDF ko22.CDF wt15.CDF
## FT001 2024095.6 1832137.0 1756373.2 1275015.4 1558397.4 1274955.4 2220917.7
## FT002 229446.1 NA NA 148895.1 169782.6 181300.6 271403.0
## FT003 NA 470042.7 349545.6 NA 356775.9 240606.3 370510.3
## FT004 301289.5 250053.3 NA NA NA NA 293360.4
## FT005 784970.1 NA 1377891.5 659602.9 NA 528679.9 NA
## FT006 681973.4 266391.1 103640.3 NA NA 132666.7 682289.2
## wt16.CDF wt18.CDF wt19.CDF wt21.CDF wt22.CDF
## FT001 1717218.1 1871280.3 1551566.88 1713396.7 1438088.9
## FT002 261254.3 237429.1 225250.02 244356.5 188864.7
## FT003 388938.3 312991.1 NA NA 251502.7
## FT004 190338.9 NA NA 305942.5 NA
## FT005 1698777.3 NA 704491.40 NA NA
## FT006 NA 186512.1 25193.37 NA 1462255.2
This feature matrix contains NA
for samples in which no chromatographic peak was
detected in the feature’s m/z-rt region. While in many cases there might indeed
be no peak signal in the respective region, it might also be that there is
signal, but the peak detection algorithm failed to detect a chromatographic
peak. xcms
provides the fillChromPeaks
method to fill in intensity data for such
missing values from the original files. The filled in peaks are added to the
chromPeaks
matrix and are flagged with an 1
in the "is_filled"
column. Below we
perform such a filling-in of missing peaks.
## Filling missing peaks using default settings. Alternatively we could
## pass a FillChromPeaksParam object to the method.
xdata <- fillChromPeaks(xdata)
head(featureValues(xdata))
## ko15.CDF ko16.CDF ko18.CDF ko19.CDF ko21.CDF ko22.CDF wt15.CDF
## FT001 43 416 912 1242 1500 1727 1978
## FT002 25 3892 3986 1232 1495 1724 1968
## FT003 3798 399 881 4069 1482 1707 1962
## FT004 1 376 3987 4070 4185 4310 1943
## FT005 253 3893 1122 1397 4186 1880 4415
## FT006 46 424 916 4071 4187 1728 1984
## wt16.CDF wt18.CDF wt19.CDF wt21.CDF wt22.CDF
## FT001 2350 2705 3000 3270 3508
## FT002 2341 2693 2996 3267 3501
## FT003 2327 2673 4742 4858 3482
## FT004 2304 4627 4743 3235 4990
## FT005 2540 4628 3184 4859 4991
## FT006 4527 2706 3004 4860 3519
For features without detected peaks in a sample, the method extracts all
intensities in the mz-rt region of the feature, integrates the signal and adds a
filled-in peak to the chromPeaks
matrix. No peak is added if no signal is
measured/available for the mz-rt region of the feature. For these, even after
filling in missing peak data, a NA
is reported in the featureValues
matrix.
Below we compare the number of missing values before and after filling in
missing values. We can use the parameter filled
of the featureValues
method to
define whether or not filled-in peak values should be returned too.
## Missing values before filling in peaks
apply(featureValues(xdata, filled = FALSE), MARGIN = 2,
FUN = function(z) sum(is.na(z)))
## ko15.CDF ko16.CDF ko18.CDF ko19.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF
## 98 97 84 118 133 111 117 105
## wt18.CDF wt19.CDF wt21.CDF wt22.CDF
## 121 120 139 113
## Missing values after filling in peaks
apply(featureValues(xdata), MARGIN = 2,
FUN = function(z) sum(is.na(z)))
## ko15.CDF ko16.CDF ko18.CDF ko19.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF
## 4 3 1 2 8 6 5 5
## wt18.CDF wt19.CDF wt21.CDF wt22.CDF
## 6 4 7 8
At last we perform a principal component analysis to evaluate the grouping of the samples in this experiment. Note that we did not perform any data normalization hence the grouping might (and will) also be influenced by technical biases.
## Extract the features and log2 transform them
ft_ints <- log2(featureValues(xdata, value = "into"))
## Perform the PCA omitting all features with an NA in any of the
## samples. Also, the intensities are mean centered.
pc <- prcomp(t(na.omit(ft_ints)), center = TRUE)
## Plot the PCA
cols <- group_colors[xdata$sample_group]
pcSummary <- summary(pc)
plot(pc$x[, 1], pc$x[,2], pch = 21, main = "",
xlab = paste0("PC1: ", format(pcSummary$importance[2, 1] * 100,
digits = 3), " % variance"),
ylab = paste0("PC2: ", format(pcSummary$importance[2, 2] * 100,
digits = 3), " % variance"),
col = "darkgrey", bg = cols, cex = 2)
grid()
text(pc$x[, 1], pc$x[,2], labels = xdata$sample_name, col = "darkgrey",
pos = 3, cex = 2)
We can see the expected separation between the KO and WT samples on PC2. On PC1 samples separate based on their ID, samples with an ID <= 18 from samples with an ID > 18. This separation might be caused by a technical bias (e.g. measurements performed on different days/weeks) or due to biological properties of the mice analyzed (sex, age, litter mates etc).
Normalizing features’ signal intensities is required, but at present not (yet)
supported in xcms
(some methods might be added in near future). Also, for the
identification of e.g. features with significant different
intensities/abundances it is suggested to use functionality provided in other R
packages, such as Bioconductor’s excellent limma
package. To enable support also
for other packages that rely on the old xcmsSet
result object, it is possible to
coerce the new XCMSnExp
object to an xcmsSet
object using xset <- as(x, "xcmsSet")
, with x
being an XCMSnExp
object.
For a detailed description of the new data objects and changes/improvements compared to the original user interface see the new_functionality vignette.
XCMSnExp
objects allow to capture all performed pre-processing steps along with
the used parameter class within the @processHistory
slot. Storing also the
parameter class ensures the highest possible degree of analysis documentation
and in future might enable to replay analyses or parts of it. The list of all
performed preprocessings can be extracted using the processHistory
method.
processHistory(xdata)
## [[1]]
## Object of class "XProcessHistory"
## type: Peak detection
## date: Mon Apr 30 18:48:18 2018
## info:
## fileIndex: 1,2,3,4,5,6,7,8,9,10,11,12
## Parameter class: CentWaveParam
## MS level(s) 1
##
## [[2]]
## Object of class "XProcessHistory"
## type: Peak grouping
## date: Mon Apr 30 18:49:15 2018
## info:
## fileIndex: 1,2,3,4,5,6,7,8,9,10,11,12
## Parameter class: PeakDensityParam
##
## [[3]]
## Object of class "XProcessHistory"
## type: Retention time correction
## date: Mon Apr 30 18:49:17 2018
## info:
## fileIndex: 1,2,3,4,5,6,7,8,9,10,11,12
## Parameter class: PeakGroupsParam
## MS level(s) 1
##
## [[4]]
## Object of class "XProcessHistory"
## type: Peak grouping
## date: Mon Apr 30 18:49:29 2018
## info:
## fileIndex: 1,2,3,4,5,6,7,8,9,10,11,12
## Parameter class: PeakDensityParam
##
## [[5]]
## Object of class "XProcessHistory"
## type: Missing peak filling
## date: Mon Apr 30 18:49:31 2018
## info:
## fileIndex: 1,2,3,4,5,6,7,8,9,10,11,12
## Parameter class: FillChromPeaksParam
It is also possible to extract specific processing steps by specifying its
type. Available types can be listed with the processHistoryTypes
function. Below
we extract the parameter class for the alignment/retention time adjustment step.
ph <- processHistory(xdata, type = "Retention time correction")
ph
## [[1]]
## Object of class "XProcessHistory"
## type: Retention time correction
## date: Mon Apr 30 18:49:17 2018
## info:
## fileIndex: 1,2,3,4,5,6,7,8,9,10,11,12
## Parameter class: PeakGroupsParam
## MS level(s) 1
And we can also extract the parameter class used in this preprocessing step.
## Access the parameter
processParam(ph[[1]])
## Object of class: PeakGroupsParam
## Parameters:
## minFraction: 0.85
## extraPeaks: 1
## smooth: loess
## span: 0.2
## family: gaussian
## number of peak groups: 50
XCMSnEx
objects can be subsetted/filtered using the [
method, or one of the many
filter*
methods. All these methods aim to ensure that the data in the
returned object is consistent. This means for example that if the object is
subsetted by selecting specific spectra (by using the [
method) all identified
chromatographic peaks are removed. Correspondence results (i.e. identified
features) are removed if the object is subsetted to contain only data from
selected files (using the filterFile
method). This is because the correspondence
results depend on the files on which the analysis was performed - running a
correspondence on a subset of the files would lead to different results.
As an exception, it is possible to force keeping adjusted retention times in the
subsetted object setting the keepAdjustedRtime
argument to TRUE
in any of the
subsetting methods.
Below we subset our results object the data for the files 2 and 4.
subs <- filterFile(xdata, file = c(2, 4))
## Do we have identified chromatographic peaks?
hasChromPeaks(subs)
## [1] TRUE
Peak detection is performed separately on each file, thus the subsetted object contains all identified chromatographic peaks from the two files. However, we used a retention time adjustment (alignment) that was based on available features. All features have however been removed and also the adjusted retention times (since the alignment based on features that were identified on chromatographic peaks on all files).
## Do we still have features?
hasFeatures(subs)
## [1] FALSE
## Do we still have adjusted retention times?
hasAdjustedRtime(subs)
## [1] FALSE
We can however use the keepAdjustedRtime
argument to force keeping the adjusted
retention times.
subs <- filterFile(xdata, keepAdjustedRtime = TRUE)
hasAdjustedRtime(subs)
## [1] TRUE
The filterRt
method can be used to subset the object to spectra within a certain
retention time range.
subs <- filterRt(xdata, rt = c(3000, 3500))
range(rtime(subs))
## [1] 3000.011 3499.909
Filtering by retention time does not change/affect adjusted retention times (also, if adjusted retention times are present, the filtering is performed on the adjusted retention times).
hasAdjustedRtime(subs)
## [1] TRUE
Also, we have all identified chromatographic peaks within the specified retention time range:
hasChromPeaks(subs)
## [1] TRUE
range(chromPeaks(subs)[, "rt"])
## [1] 3000.011 3499.909
The most natural way to subset any object in R is with [
. Using [
on an XCMSnExp
object subsets it keeping only the selected spectra. The index i
used in [
has
thus to be an integer between 1 and the total number of spectra (across all
files). Below we subset xdata
using both [
and filterFile
to keep all spectra
from one file.
## Extract all data from the 3rd file.
one_file <- filterFile(xdata, file = 3)
one_file_2 <- xdata[fromFile(xdata) == 3]
## Is the content the same?
all.equal(spectra(one_file), spectra(one_file_2))
## [1] TRUE
While the spectra-content is the same in both objects, one_file
contains also
the identified chromatographic peaks while one_file_2
does not. Thus, in most
situations subsetting using one of the filter functions is preferred over the
use of [
.
## Subsetting with filterFile preserves chromatographic peaks
head(chromPeaks(one_file))
## mz mzmin mzmax rt rtmin rtmax into intb maxo sn
## [1,] 402.1 402.1 402.1 2515.464 2501.379 2548.328 213455.2 185881.2 8038 11
## [2,] 339.0 339.0 339.0 2515.464 2501.379 2540.503 162264.4 158324.9 5638 14
## [3,] 354.0 354.0 354.0 2515.464 2501.379 2532.678 173130.4 165213.5 6334 11
## [4,] 352.9 352.9 352.9 2517.029 2501.379 2534.243 525993.0 522760.1 20488 55
## [5,] 316.0 316.0 316.0 2517.029 2501.379 2535.808 741708.6 726925.4 27816 47
## [6,] 333.0 333.0 333.0 2515.464 2501.379 2545.198 968861.6 951705.4 33992 49
## sample is_filled
## [1,] 1 0
## [2,] 1 0
## [3,] 1 0
## [4,] 1 0
## [5,] 1 0
## [6,] 1 0
## Subsetting with [ not
head(chromPeaks(one_file_2))
## Warning in .local(object, ...): No chromatographic peaks available.
## NULL
Note however that also [
does support the keepAdjustedRtime
argument. Below we
subset the object to spectra 20:30.
subs <- xdata[20:30, keepAdjustedRtime = TRUE]
## Warning in xdata[20:30, keepAdjustedRtime = TRUE]: Removed preprocessing
## results
hasAdjustedRtime(subs)
## [1] TRUE
## Access adjusted retention times:
rtime(subs)
## F01.S0020 F01.S0021 F01.S0022 F01.S0023 F01.S0024 F01.S0025 F01.S0026
## 2538.347 2539.912 2541.477 2543.042 2544.607 2546.172 2547.737
## F01.S0027 F01.S0028 F01.S0029 F01.S0030
## 2549.302 2550.867 2552.432 2553.997
## Access raw retention times:
rtime(subs, adjusted = FALSE)
## F01.S0020 F01.S0021 F01.S0022 F01.S0023 F01.S0024 F01.S0025 F01.S0026
## 2531.112 2532.677 2534.242 2535.807 2537.372 2538.937 2540.502
## F01.S0027 F01.S0028 F01.S0029 F01.S0030
## 2542.067 2543.632 2545.197 2546.762
As with MSnExp
and OnDiskMSnExp
objects, [[
can be used to extract a single
spectrum object from an XCMSnExp
object. The retention time of the spectrum
corresponds to the adjusted retention time if present.
## Extract a single spectrum
xdata[[14]]
## Object of class "Spectrum1"
## Retention time: 42:9
## MSn level: 1
## Total ion count: 445
## Polarity: NA
At last we can also use the split
method that allows to split an XCMSnExp
based
on a provided factor f
. Below we split xdata
per file. Using keepAdjustedRtime = TRUE
ensures that adjusted retention times are not removed.
x_list <- split(xdata, f = fromFile(xdata), keepAdjustedRtime = TRUE)
lengths(x_list)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1278 1278 1278 1278 1278 1278 1278 1278 1278 1278 1278 1278
lapply(x_list, hasAdjustedRtime)
## $`1`
## [1] TRUE
##
## $`2`
## [1] TRUE
##
## $`3`
## [1] TRUE
##
## $`4`
## [1] TRUE
##
## $`5`
## [1] TRUE
##
## $`6`
## [1] TRUE
##
## $`7`
## [1] TRUE
##
## $`8`
## [1] TRUE
##
## $`9`
## [1] TRUE
##
## $`10`
## [1] TRUE
##
## $`11`
## [1] TRUE
##
## $`12`
## [1] TRUE
Note however that there is also a dedicated splitByFile
method instead for that
operation, that internally uses filterFile
and hence does e.g. not remove
identified chromatographic peaks. The method does not yet support the
keepAdjustedRtime
parameter and thus removes by default adjusted retention
times.
xdata_by_file <- splitByFile(xdata, f = factor(1:length(fileNames(xdata))))
lapply(xdata_by_file, hasChromPeaks)
## $`1`
## [1] TRUE
##
## $`2`
## [1] TRUE
##
## $`3`
## [1] TRUE
##
## $`4`
## [1] TRUE
##
## $`5`
## [1] TRUE
##
## $`6`
## [1] TRUE
##
## $`7`
## [1] TRUE
##
## $`8`
## [1] TRUE
##
## $`9`
## [1] TRUE
##
## $`10`
## [1] TRUE
##
## $`11`
## [1] TRUE
##
## $`12`
## [1] TRUE
Most methods in xcms
support parallel processing. Parallel processing is handled
and configured by the BiocParallel
Bioconductor package and can be globally
defined for an R session.
Unix-based systems (Linux, macOS) support multicore
-based parallel
processing. To configure it globally we register
the parameter class. Note also
that bpstart
is used below to initialize the parallel processes.
register(bpstart(MulticoreParam(2)))
Windows supports only socket-based parallel processing:
register(bpstart(SnowParam(2)))
Note that multicore
-based parallel processing might be buggy or failing on
macOS. If so, the DoparParam
could be used instead (requiring the foreach
package).
For other options and details see the vignettes from the BiocParallel
package.
1. Saghatelian A, Trauger SA, Want EJ, Hawkins EG, Siuzdak G, Cravatt BF: Assignment of endogenous substrates to enzymes by global metabolite profiling. Biochemistry 2004, 43:14332–9.
2. Tautenhahn R, Böttcher C, Neumann S: Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics 2008, 9:504.
3. Smith R, Ventura D, Prince JT: LC-MS alignment in theory and practice: a comprehensive algorithmic review. Briefings in bioinformatics 2013, 16:bbt080–117.
4. Prince JT, Marcotte EM: Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Analytical chemistry 2006, 78:6140–6152.
5. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G: XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical chemistry 2006, 78:779–787.