Contents

1 Introduction

For experiments in which analyzed samples come from different classes or conditions, a common goal of statistical analysis is class comparison via hypothesis testing.

Statistical testing is performed to find peaks that are differentially abundant among different classes or conditions.

Valid statistical testing requires biological replicates in order to compare between different conditions. It should never be performed with less than 3 samples per condition.

In this vignette, we present an example class comparison workflow using Cardinal.

We begin by loading the package:

library(Cardinal)

2 Statistical testing for a renal cell carcinoma (RCC) cancer dataset

This example uses DESI spectra collected from a renal cell carcinoma (RCC) cancer dataset consisting of 8 matched pairs of human kidney tissue. Each tissue pair consists of a normal tissue sample and a cancerous tissue sample.

MH0204_33 UH0505_12 UH0710_33 UH9610_15
UH9812_03 UH9905_18 UH9911_05 UH9912_01

For the RCC cancer dataset, the goal is to find m/z features differentially abundant between normal and cancer tissue.

First, we load the dataset from the CardinalWorkflows package. The data is stored in an older format, so we need to coerce it to an MSImagingExperiment.

data(rcc, package="CardinalWorkflows")
rcc <- as(rcc, "MSImagingExperiment")

The dataset contains 16,000 spectra with 10,200 m/z-values.

rcc
## An object of class 'MSContinuousImagingExperiment'
##   <10200 feature, 16000 pixel> imaging dataset
##     imageData(1): intensity
##     featureData(0):
##     pixelData(1): diagnosis
##     run(8): MH0204_33 UH0505_12 ... UH9911_05 UH9912_01
##     raster dimensions: 99 x 38
##     coord(2): x = 1..99, y = 1..38
##     mass range:  150.08 to 1000.00 
##     centroided: FALSE

2.1 Pre-processing

Before fitting any statistical model, pre-processing is necessary to remove noise and the number of m/z values.

To process the dataset, we will first perform peak picking on the mean spectrum to create a set of reference peaks. We will then bin the peaks in the entire dataset to this reference.

rcc_mean <- summarize(rcc, .stat="mean")
rcc_ref <- rcc_mean %>%
  peakPick(SNR=3) %>%
  peakAlign(ref="mean",
            tolerance=0.5,
            units="mz") %>%
  peakFilter() %>%
  process()
## nrows changed from 10200 to 752

Now we normalize and bin the rest of the dataset to the reference peaks.

rcc_peaks <- rcc %>%
  normalize(method="tic") %>%
  peakBin(ref=mz(rcc_ref),
          tolerance=0.5,
          units="mz") %>%
  process()

rcc_peaks
## An object of class 'MSContinuousImagingExperiment'
##   <82 feature, 16000 pixel> imaging dataset
##     imageData(1): intensity
##     featureData(0):
##     pixelData(1): diagnosis
##     processing complete(2): normalize peakBin
##     processing pending(0):
##     run(8): MH0204_33 UH0505_12 ... UH9911_05 UH9912_01
##     raster dimensions: 99 x 38
##     coord(2): x = 1..99, y = 1..38
##     mass range: 157.2765 to 888.4058 
##     centroided: TRUE

This produces a centroided dataset with 82 peaks.

Rather than rely on the manual region-of-interest selection, we will rely on the fact that cancer tissue is on the left and the normal tissue is on the right on each slide.

xcutoff<-c(35, 23, 28, 39, 29, 28, 44, 32)

rcc_peaks$rough_diagnosis <- factor("normal", level=c("cancer", "normal"))

for ( i in 1:nlevels(run(rcc_peaks)) ) {
  cur_run <- run(rcc_peaks) == runNames(rcc_peaks)[i]
  pData(rcc_peaks)$rough_diagnosis[cur_run & coord(rcc_peaks)$x < xcutoff[i]] <- "cancer"
}

rcc_peaks$groups <- interaction(run(rcc_peaks), rcc_peaks$rough_diagnosis)
image(rcc_peaks, mz=810, groups=rough_diagnosis,
      contrast.enhance="histogram", layout=c(4,2))

2.1.1 Non-specific filtering to reduce data size

In order to reduce the size of the dataset further (because the computation we are working toward can be time consuming), we will perform non-specific filtering.

This means filtering our peaks based on a summary statistic unrelated to the condition. We will use the variance.

rcc_var <- summarize(rcc_peaks, .stat="var", .as="DataFrame")

plot(rcc_var, var ~ mz, main="variance")

Now we keep only the peaks above the top 80% quantile of variance among peaks.

rcc_peaks2 <- rcc_peaks[rcc_var$var >= quantile(rcc_var$var, 0.8),]

2.1.2 Segmentation with spatial Dirichlet Gaussian mixture model (DGMM)

Spatial-DGMM performs peak-specific segmentation. It detects peak-specific tissue segments with homogeneous spatial composition. It can estimate the number of segments and the mean and variance of each segment.

This gives us a useful summary of the spatial distribution of each peak.

set.seed(1)
rcc_dgmm1 <- spatialDGMM(rcc_peaks2[16,], r=1, k=4, groups=1)

summary(rcc_dgmm1)
## Spatially-aware Dirichlet Gaussian mixture models:
##  
##  Segmentation on 1 group: 1 
##  Method = gaussian 
##  Distance = chebyshev
##  
##   Radius (r) Init (k) Feature Classes/Group
## 1          1        4       1             3
image(rcc_dgmm1, layout=c(4,2))

This is useful because we can use it to automatically detect segments to compare for statistical testing (e.g., “cancer” vs “normal” tissue). However, to do this without bias, we must make sure the segmentation is performed independently for each sample.

set.seed(1)
rcc_dgmm <- spatialDGMM(rcc_peaks2, r=1, k=4, groups=rcc_peaks2$groups)

summary(rcc_dgmm)
## Spatially-aware Dirichlet Gaussian mixture models:
##  
##  Segmentation on 16 groups: MH0204_33.cancer UH0505_12.cancer ... UH9911_05.normal  Segmentation on     UH9912_01.normal 
##  Method = gaussian 
##  Distance = chebyshev
##  
##    Radius (r) Init (k) Feature Classes/Group
## 1           1        4       1          3.50
## 2           1        4       2          3.31
## 3           1        4       3          3.12
## 4           1        4       4          3.38
## 5           1        4       5          3.31
## 6           1        4       6          3.12
## 7           1        4       7          3.38
## 8           1        4       8          3.44
## 9           1        4       9          3.25
## 10          1        4      10          3.81
## 11          1        4      11          3.50
## 12          1        4      12          3.69
## 13          1        4      13          3.19
## 14          1        4      14          3.25
## 15          1        4      15          3.12
## 16          1        4      16          3.31
## 17          1        4      17          2.69

2.2 Visualization

2.3 Class comparison with means-based testing

As introduced earlier, statistical testing is performed to find peaks differentially abundant among different groups. Since MS imaging produces many hundreds of measurements on the same sample, we can’t treat each mass spectrum as a separate observation. Rather, we need to compare entire samples rather than individual pixels.

One way to do this is to summarize each sample by calculating its mean intensity. We can then fit linear models to the means-summarized data.

2.3.1 Fitting models with means-summarized groups

In Cardinal, we can simply use meansTest() to do means-based testing in a MS imaging experiment. We use a one-sided formula to specify the fixed effects (the diagnosis in this particular dataset). The groups indicating the observational units must also be provided. Each group is summarized by its mean, and then a linear model is fit to the summaries.

mtest <- meansTest(rcc_peaks2, ~ rough_diagnosis, groups=rcc_peaks2$groups)

summary(mtest)
## Means-summarized linear model testing:
##  
##  Fixed effects: ~rough_diagnosis 
## 
##  Summarized 16 groups: MH0204_33.cancer UH0505_12.cancer ... UH9911_05.normal 
##  Summarized     UH9912_01.normal 
## 
##  Likelihood ratio test for fixed effects:
##  
##    Feature     LR     PValue       FDR
## 1        1 3.2377 0.07196111 0.4077796
## 2        2 0.2942 0.58755653 0.9365532
## 3        3 0.0538 0.81655196 0.9365532
## 4        4 0.0131 0.90871076 0.9365532
## 5        5 0.0063 0.93655316 0.9365532
## 6        6 0.2120 0.64521524 0.9365532
## 7        7 0.1771 0.67387698 0.9365532
## 8        8 0.0995 0.75237507 0.9365532
## 9        9 0.0818 0.77480632 0.9365532
## 10      10 0.6520 0.41938613 0.9365532
## 11      11 0.0505 0.82213078 0.9365532
## 12      12 0.0614 0.80428139 0.9365532
## 13      13 0.4109 0.52149586 0.9365532
## 14      14 0.1843 0.66769754 0.9365532
## 15      15 0.0420 0.83763741 0.9365532
## 16      16 3.6261 0.05687959 0.4077796
## 17      17 4.1306 0.04211538 0.4077796

The summarized results are automatically adjusted for multiple comparisons using FDR.

2.3.2 Interpreting the results

We can use the topFeatures() method to find differentially abundant peaks.

topFeatures(mtest, p.adjust="fdr", AdjP < .1)
## Top-ranked tests: ~rough_diagnosis vs ~1 
##      mz feature LR PValue AdjP

But we don’t find any.

2.4 Class comparison with segmentation-based testing

Means-based testing is fast and simple and can work well for homogeneous samples. However, doesn’t use the spatial structure of each peak, so it doesn’t take the advantage of MS imaging, and may result in missing differences that actually exist.

Rather than simply average the intensities, we can summarize each sample by segmenting it with spatial-DGMM, and comparing the resulting segments. This gives us a bias-free way to keep the spatial heterogeneous information.

2.4.1 Fitting models with spatial-DGMM-summarized groups

First, we must segment the data with spatialDGMM(), while making sure that each observational unit is segmented within a different group (as specified by groups). We’ve already done this. Now we use segmentationTest() to fit the models.

In order to fit the models, a representative spatial-DGMM segment must be selected for each group. There are two automated ways to do this via classControl: “Ymax” (default) means use the segments with the highest means, and “Mscore” means use the segments with the highest match scores with the fixed effects.

stest <- segmentationTest(rcc_dgmm, ~ rough_diagnosis)

summary(stest)
## Segmentation-based linear model testing:
##  
##  Fixed effects: ~rough_diagnosis 
## 
##  Summarized 16 groups: MH0204_33.cancer UH0505_12.cancer ... UH9911_05.normal 
##  Summarized     UH9912_01.normal 
## 
##  Likelihood ratio test for fixed effects:
##  
##    Radius (r) Init (k) Feature     LR      PValue        FDR
## 1           1        4       1 1.3788 0.240311536 0.97921487
## 2           1        4       2 0.0022 0.962186433 0.97921487
## 3           1        4       3 0.0753 0.783722493 0.97921487
## 4           1        4       4 0.0044 0.946819205 0.97921487
## 5           1        4       5 0.2305 0.631128419 0.97921487
## 6           1        4       6 0.1330 0.715309712 0.97921487
## 7           1        4       7 0.2520 0.615671585 0.97921487
## 8           1        4       8 0.0007 0.979214867 0.97921487
## 9           1        4       9 0.6114 0.434255077 0.97921487
## 10          1        4      10 0.5911 0.442003579 0.97921487
## 11          1        4      11 0.0257 0.872673836 0.97921487
## 12          1        4      12 0.0159 0.899624095 0.97921487
## 13          1        4      13 0.0198 0.888138034 0.97921487
## 14          1        4      14 0.0793 0.778204860 0.97921487
## 15          1        4      15 0.0102 0.919508803 0.97921487
## 16          1        4      16 7.3232 0.006807167 0.05786092
## 17          1        4      17 7.9774 0.004736522 0.05786092

2.4.2 Interpreting the results

Again, we can use the topFeatures() method to find differentially abundant peaks.

topFeatures(stest, p.adjust="fdr", AdjP < .1)
## Top-ranked tests: ~rough_diagnosis vs ~1 
##         mz r k feature     LR      PValue       AdjP
## 1 885.3767 1 4      16 7.3232 0.006807167 0.05786092
## 2 886.3707 1 4      17 7.9774 0.004736522 0.05786092

This time we find 2 differentially abundant peaks (though one is likely an isotope of the other).

plot(stest, model=list(feature=16))

image(rcc_peaks2, mz=885, layout=c(4,2),
      contrast.enhance="suppress",
      normalize.image="linear")

3 Session information

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## Random number generation:
##  RNG:     L'Ecuyer-CMRG 
##  Normal:  Inversion 
##  Sample:  Rejection 
##  
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] CardinalWorkflows_1.18.0 Cardinal_2.4.0          
## [3] ProtGenerics_1.18.0      S4Vectors_0.24.0        
## [5] EBImage_4.28.0           BiocParallel_1.20.0     
## [7] BiocGenerics_0.32.0      BiocStyle_2.14.0        
## 
## loaded via a namespace (and not attached):
##  [1] biglm_0.9-1        locfit_1.5-9.1     tidyselect_0.2.5  
##  [4] xfun_0.10          purrr_0.3.3        lattice_0.20-38   
##  [7] htmltools_0.4.0    viridisLite_0.3.0  yaml_2.2.0        
## [10] rlang_0.4.1        pillar_1.4.2       glue_1.3.1        
## [13] DBI_1.0.0          sp_1.3-1           jpeg_0.1-8.1      
## [16] stringr_1.4.0      htmlwidgets_1.5.1  evaluate_0.14     
## [19] Biobase_2.46.0     knitr_1.25         irlba_2.3.3       
## [22] Rcpp_1.0.2         BiocManager_1.30.9 abind_1.4-5       
## [25] png_0.1-7          digest_0.6.22      stringi_1.4.3     
## [28] tiff_0.1-5         bookdown_0.14      dplyr_0.8.3       
## [31] grid_3.6.1         tools_3.6.1        bitops_1.0-6      
## [34] magrittr_1.5       RCurl_1.95-4.12    tibble_2.1.3      
## [37] crayon_1.3.4       pkgconfig_2.0.3    MASS_7.3-51.4     
## [40] Matrix_1.2-17      matter_1.12.0      assertthat_0.2.1  
## [43] rmarkdown_1.16     R6_2.4.0           fftwtools_0.9-8   
## [46] mclust_5.4.5       signal_0.7-6       nlme_3.1-141      
## [49] compiler_3.6.1