For experiments in which analyzed samples come from different classes or conditions, a common goal of statistical analysis is class comparison via hypothesis testing.
Statistical testing is performed to find peaks that are differentially abundant among different classes or conditions.
Valid statistical testing requires biological replicates in order to compare between different conditions. It should never be performed with less than 3 samples per condition.
In this vignette, we present an example class comparison workflow using Cardinal.
We begin by loading the package:
library(Cardinal)
This example uses DESI spectra collected from a renal cell carcinoma (RCC) cancer dataset consisting of 8 matched pairs of human kidney tissue. Each tissue pair consists of a normal tissue sample and a cancerous tissue sample.
MH0204_33 | UH0505_12 | UH0710_33 | UH9610_15 |
---|---|---|---|
UH9812_03 | UH9905_18 | UH9911_05 | UH9912_01 |
---|---|---|---|
For the RCC cancer dataset, the goal is to find m/z features differentially abundant between normal and cancer tissue.
First, we load the dataset from the CardinalWorkflows package. The data is stored in an older format, so we need to coerce it to an MSImagingExperiment
.
data(rcc, package="CardinalWorkflows")
rcc <- as(rcc, "MSImagingExperiment")
The dataset contains 16,000 spectra with 10,200 m/z-values.
rcc
## An object of class 'MSContinuousImagingExperiment'
## <10200 feature, 16000 pixel> imaging dataset
## imageData(1): intensity
## featureData(0):
## pixelData(1): diagnosis
## run(8): MH0204_33 UH0505_12 ... UH9911_05 UH9912_01
## raster dimensions: 99 x 38
## coord(2): x = 1..99, y = 1..38
## mass range: 150.08 to 1000.00
## centroided: FALSE
Before fitting any statistical model, pre-processing is necessary to remove noise and the number of m/z values.
To process the dataset, we will first perform peak picking on the mean spectrum to create a set of reference peaks. We will then bin the peaks in the entire dataset to this reference.
rcc_mean <- summarize(rcc, .stat="mean")
rcc_ref <- rcc_mean %>%
peakPick(SNR=3) %>%
peakAlign(ref="mean",
tolerance=0.5,
units="mz") %>%
peakFilter() %>%
process()
## nrows changed from 10200 to 752
Now we normalize and bin the rest of the dataset to the reference peaks.
rcc_peaks <- rcc %>%
normalize(method="tic") %>%
peakBin(ref=mz(rcc_ref),
tolerance=0.5,
units="mz") %>%
process()
rcc_peaks
## An object of class 'MSContinuousImagingExperiment'
## <82 feature, 16000 pixel> imaging dataset
## imageData(1): intensity
## featureData(0):
## pixelData(1): diagnosis
## processing complete(2): normalize peakBin
## processing pending(0):
## run(8): MH0204_33 UH0505_12 ... UH9911_05 UH9912_01
## raster dimensions: 99 x 38
## coord(2): x = 1..99, y = 1..38
## mass range: 157.2765 to 888.4058
## centroided: TRUE
This produces a centroided dataset with 82 peaks.
Rather than rely on the manual region-of-interest selection, we will rely on the fact that cancer tissue is on the left and the normal tissue is on the right on each slide.
xcutoff<-c(35, 23, 28, 39, 29, 28, 44, 32)
rcc_peaks$rough_diagnosis <- factor("normal", level=c("cancer", "normal"))
for ( i in 1:nlevels(run(rcc_peaks)) ) {
cur_run <- run(rcc_peaks) == runNames(rcc_peaks)[i]
pData(rcc_peaks)$rough_diagnosis[cur_run & coord(rcc_peaks)$x < xcutoff[i]] <- "cancer"
}
rcc_peaks$groups <- interaction(run(rcc_peaks), rcc_peaks$rough_diagnosis)
image(rcc_peaks, mz=810, groups=rough_diagnosis,
contrast.enhance="histogram", layout=c(4,2))
In order to reduce the size of the dataset further (because the computation we are working toward can be time consuming), we will perform non-specific filtering.
This means filtering our peaks based on a summary statistic unrelated to the condition. We will use the variance.
rcc_var <- summarize(rcc_peaks, .stat="var", .as="DataFrame")
plot(rcc_var, var ~ mz, main="variance")
Now we keep only the peaks above the top 80% quantile of variance among peaks.
rcc_peaks2 <- rcc_peaks[rcc_var$var >= quantile(rcc_var$var, 0.8),]
Spatial-DGMM performs peak-specific segmentation. It detects peak-specific tissue segments with homogeneous spatial composition. It can estimate the number of segments and the mean and variance of each segment.
This gives us a useful summary of the spatial distribution of each peak.
set.seed(1)
rcc_dgmm1 <- spatialDGMM(rcc_peaks2[16,], r=1, k=4, groups=1)
summary(rcc_dgmm1)
## Spatially-aware Dirichlet Gaussian mixture models:
##
## Segmentation on 1 group: 1
## Method = gaussian
## Distance = chebyshev
##
## Radius (r) Init (k) Feature Classes/Group
## 1 1 4 1 3
image(rcc_dgmm1, layout=c(4,2))
This is useful because we can use it to automatically detect segments to compare for statistical testing (e.g., “cancer” vs “normal” tissue). However, to do this without bias, we must make sure the segmentation is performed independently for each sample.
set.seed(1)
rcc_dgmm <- spatialDGMM(rcc_peaks2, r=1, k=4, groups=rcc_peaks2$groups)
summary(rcc_dgmm)
## Spatially-aware Dirichlet Gaussian mixture models:
##
## Segmentation on 16 groups: MH0204_33.cancer UH0505_12.cancer ... UH9911_05.normal Segmentation on UH9912_01.normal
## Method = gaussian
## Distance = chebyshev
##
## Radius (r) Init (k) Feature Classes/Group
## 1 1 4 1 3.50
## 2 1 4 2 3.31
## 3 1 4 3 3.12
## 4 1 4 4 3.38
## 5 1 4 5 3.31
## 6 1 4 6 3.12
## 7 1 4 7 3.38
## 8 1 4 8 3.44
## 9 1 4 9 3.25
## 10 1 4 10 3.81
## 11 1 4 11 3.50
## 12 1 4 12 3.69
## 13 1 4 13 3.19
## 14 1 4 14 3.25
## 15 1 4 15 3.12
## 16 1 4 16 3.31
## 17 1 4 17 2.69
As introduced earlier, statistical testing is performed to find peaks differentially abundant among different groups. Since MS imaging produces many hundreds of measurements on the same sample, we can’t treat each mass spectrum as a separate observation. Rather, we need to compare entire samples rather than individual pixels.
One way to do this is to summarize each sample by calculating its mean intensity. We can then fit linear models to the means-summarized data.
In Cardinal, we can simply use meansTest()
to do means-based testing in a MS imaging experiment. We use a one-sided formula to specify the fixed effects (the diagnosis in this particular dataset). The groups indicating the observational units must also be provided. Each group is summarized by its mean, and then a linear model is fit to the summaries.
mtest <- meansTest(rcc_peaks2, ~ rough_diagnosis, groups=rcc_peaks2$groups)
summary(mtest)
## Means-summarized linear model testing:
##
## Fixed effects: ~rough_diagnosis
##
## Summarized 16 groups: MH0204_33.cancer UH0505_12.cancer ... UH9911_05.normal
## Summarized UH9912_01.normal
##
## Likelihood ratio test for fixed effects:
##
## Feature LR PValue FDR
## 1 1 3.2377 0.07196111 0.4077796
## 2 2 0.2942 0.58755653 0.9365532
## 3 3 0.0538 0.81655196 0.9365532
## 4 4 0.0131 0.90871076 0.9365532
## 5 5 0.0063 0.93655316 0.9365532
## 6 6 0.2120 0.64521524 0.9365532
## 7 7 0.1771 0.67387698 0.9365532
## 8 8 0.0995 0.75237507 0.9365532
## 9 9 0.0818 0.77480632 0.9365532
## 10 10 0.6520 0.41938613 0.9365532
## 11 11 0.0505 0.82213078 0.9365532
## 12 12 0.0614 0.80428139 0.9365532
## 13 13 0.4109 0.52149586 0.9365532
## 14 14 0.1843 0.66769754 0.9365532
## 15 15 0.0420 0.83763741 0.9365532
## 16 16 3.6261 0.05687959 0.4077796
## 17 17 4.1306 0.04211538 0.4077796
The summarized results are automatically adjusted for multiple comparisons using FDR.
We can use the topFeatures()
method to find differentially abundant peaks.
topFeatures(mtest, p.adjust="fdr", AdjP < .1)
## Top-ranked tests: ~rough_diagnosis vs ~1
## mz feature LR PValue AdjP
But we don’t find any.
Means-based testing is fast and simple and can work well for homogeneous samples. However, doesn’t use the spatial structure of each peak, so it doesn’t take the advantage of MS imaging, and may result in missing differences that actually exist.
Rather than simply average the intensities, we can summarize each sample by segmenting it with spatial-DGMM, and comparing the resulting segments. This gives us a bias-free way to keep the spatial heterogeneous information.
First, we must segment the data with spatialDGMM()
, while making sure that each observational unit is segmented within a different group (as specified by groups
). We’ve already done this. Now we use segmentationTest()
to fit the models.
In order to fit the models, a representative spatial-DGMM segment must be selected for each group. There are two automated ways to do this via classControl: “Ymax” (default) means use the segments with the highest means, and “Mscore” means use the segments with the highest match scores with the fixed effects.
stest <- segmentationTest(rcc_dgmm, ~ rough_diagnosis)
summary(stest)
## Segmentation-based linear model testing:
##
## Fixed effects: ~rough_diagnosis
##
## Summarized 16 groups: MH0204_33.cancer UH0505_12.cancer ... UH9911_05.normal
## Summarized UH9912_01.normal
##
## Likelihood ratio test for fixed effects:
##
## Radius (r) Init (k) Feature LR PValue FDR
## 1 1 4 1 1.3788 0.240311536 0.97921487
## 2 1 4 2 0.0022 0.962186433 0.97921487
## 3 1 4 3 0.0753 0.783722493 0.97921487
## 4 1 4 4 0.0044 0.946819205 0.97921487
## 5 1 4 5 0.2305 0.631128419 0.97921487
## 6 1 4 6 0.1330 0.715309712 0.97921487
## 7 1 4 7 0.2520 0.615671585 0.97921487
## 8 1 4 8 0.0007 0.979214867 0.97921487
## 9 1 4 9 0.6114 0.434255077 0.97921487
## 10 1 4 10 0.5911 0.442003579 0.97921487
## 11 1 4 11 0.0257 0.872673836 0.97921487
## 12 1 4 12 0.0159 0.899624095 0.97921487
## 13 1 4 13 0.0198 0.888138034 0.97921487
## 14 1 4 14 0.0793 0.778204860 0.97921487
## 15 1 4 15 0.0102 0.919508803 0.97921487
## 16 1 4 16 7.3232 0.006807167 0.05786092
## 17 1 4 17 7.9774 0.004736522 0.05786092
Again, we can use the topFeatures()
method to find differentially abundant peaks.
topFeatures(stest, p.adjust="fdr", AdjP < .1)
## Top-ranked tests: ~rough_diagnosis vs ~1
## mz r k feature LR PValue AdjP
## 1 885.3767 1 4 16 7.3232 0.006807167 0.05786092
## 2 886.3707 1 4 17 7.9774 0.004736522 0.05786092
This time we find 2 differentially abundant peaks (though one is likely an isotope of the other).
plot(stest, model=list(feature=16))
image(rcc_peaks2, mz=885, layout=c(4,2),
contrast.enhance="suppress",
normalize.image="linear")
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## Random number generation:
## RNG: L'Ecuyer-CMRG
## Normal: Inversion
## Sample: Rejection
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] CardinalWorkflows_1.18.0 Cardinal_2.4.0
## [3] ProtGenerics_1.18.0 S4Vectors_0.24.0
## [5] EBImage_4.28.0 BiocParallel_1.20.0
## [7] BiocGenerics_0.32.0 BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] biglm_0.9-1 locfit_1.5-9.1 tidyselect_0.2.5
## [4] xfun_0.10 purrr_0.3.3 lattice_0.20-38
## [7] htmltools_0.4.0 viridisLite_0.3.0 yaml_2.2.0
## [10] rlang_0.4.1 pillar_1.4.2 glue_1.3.1
## [13] DBI_1.0.0 sp_1.3-1 jpeg_0.1-8.1
## [16] stringr_1.4.0 htmlwidgets_1.5.1 evaluate_0.14
## [19] Biobase_2.46.0 knitr_1.25 irlba_2.3.3
## [22] Rcpp_1.0.2 BiocManager_1.30.9 abind_1.4-5
## [25] png_0.1-7 digest_0.6.22 stringi_1.4.3
## [28] tiff_0.1-5 bookdown_0.14 dplyr_0.8.3
## [31] grid_3.6.1 tools_3.6.1 bitops_1.0-6
## [34] magrittr_1.5 RCurl_1.95-4.12 tibble_2.1.3
## [37] crayon_1.3.4 pkgconfig_2.0.3 MASS_7.3-51.4
## [40] Matrix_1.2-17 matter_1.12.0 assertthat_0.2.1
## [43] rmarkdown_1.16 R6_2.4.0 fftwtools_0.9-8
## [46] mclust_5.4.5 signal_0.7-6 nlme_3.1-141
## [49] compiler_3.6.1