Contents

1 Introduction

For experiments in which analyzed samples come from different classes or conditions, a common goal of supervised analysis is classification: given a labeled training set for which classes are already known, we want to predict the class of a new sample.

Unlike unsupervised analysis such as segmentation, classification requires biological replicates for testing and validation, to avoid biased reporting of accuracy. Cardinal provides cross-validation for this purpose.

In this vignette, we present an example classification workflow using Cardinal.

We begin by loading the package:

library(Cardinal)

2 Classification of a renal cell carcinoma (RCC) cancer dataset

This example uses DESI spectra collected from a renal cell carcinoma (RCC) cancer dataset consisting of 8 matched pairs of human kidney tissue. Each tissue pair consists of a normal tissue sample and a cancerous tissue sample. The goal of the workflow is to develop classifiers for predicting whether a new tissue sample is normal or cancer.

MH0204_33 UH0505_12 UH0710_33 UH9610_15
UH9812_03 UH9905_18 UH9911_05 UH9912_01

In this RCC dataset, we expect that normal tissue and cancerous tissue will have unique chemical profiles, which we can use to classify new tissue based on the mass spectra.

First, we load the dataset from the CardinalWorkflows package. The data is stored in an older format, so we need to coerce it to an MSImagingExperiment.

data(rcc, package="CardinalWorkflows")
rcc <- as(rcc, "MSImagingExperiment")

The dataset contains 16,000 spectra with 10,200 m/z-values.

rcc
## An object of class 'MSContinuousImagingExperiment'
##   <10200 feature, 16000 pixel> imaging dataset
##     imageData(1): intensity
##     featureData(0):
##     pixelData(1): diagnosis
##     run(8): MH0204_33 UH0505_12 ... UH9911_05 UH9912_01
##     raster dimensions: 99 x 38
##     coord(2): x = 1..99, y = 1..38
##     mass range:  150.08 to 1000.00 
##     centroided: FALSE

2.1 Pre-processing

First, let’s subset the dataset to include only the pixels where we have manually selected the tissue regions and labeled the diagnosis.

Then, we normalize the mass spectra to a common TIC.

rcc <- rcc %>%
  select(!is.na(diagnosis)) %>%
  normalize(method="tic") %>%
  process()
rcc
## An object of class 'MSContinuousImagingExperiment'
##   <10200 feature, 6077 pixel> imaging dataset
##     imageData(1): intensity
##     featureData(0):
##     pixelData(1): diagnosis
##     processing complete(1): normalize
##     processing pending(0):
##     run(8): MH0204_33 UH0505_12 ... UH9911_05 UH9912_01
##     raster dimensions: 90 x 36
##     coord(2): x = 2..91, y = 2..37
##     mass range:  150.08 to 1000.00 
##     centroided: FALSE

To process the dataset, we will first perform peak picking on the mean spectrum to create a set of reference peaks. We will then bin the peaks in the entire dataset to this reference.

rcc_mean <- summarize(rcc, .stat="mean")
rcc_ref <- rcc_mean %>%
  peakPick(SNR=3) %>%
  peakAlign(ref="mean",
            tolerance=0.5,
            units="mz") %>%
  peakFilter() %>%
  process()
## nrows changed from 10200 to 842

Now we bin the rest of the dataset to the reference peaks.

rcc_peaks <- rcc %>%
  peakBin(ref=mz(rcc_ref),
          tolerance=0.5,
          units="mz") %>%
  process()

rcc_peaks
## An object of class 'MSContinuousImagingExperiment'
##   <103 feature, 6077 pixel> imaging dataset
##     imageData(1): intensity
##     featureData(0):
##     pixelData(1): diagnosis
##     processing complete(2): normalize peakBin
##     processing pending(0):
##     run(8): MH0204_33 UH0505_12 ... UH9911_05 UH9912_01
##     raster dimensions: 90 x 36
##     coord(2): x = 2..91, y = 2..37
##     mass range: 157.2287 to 890.4316 
##     centroided: TRUE

This produces a centroided dataset with 103 peaks.

Note that this centroided dataset will be useful for our exploratory analysis and for fitting the final classification model, but it cannot be used for cross-validation. This is because the mean spectrum was calculated from the whole dataset, so if we then used it for cross-validation, each CV fold would include some pre-processing steps that depended on the test set.

Pre-processing for cross-validation will require a different strategy.

2.2 Visualization

Before proceeding with the statistical analysis, we’ll first perform some and exploratory visual analysis of the dataset.

2.2.1 Ion images

Let’s plot the images for m/z 810, which appears abundant in both normal and tumor tissue, and doesn’t seem to be very predictive.

image(rcc_peaks, mz=810, layout=c(4,2),
      contrast.enhance="suppress", normalize.image="linear")

As can be seen above, each matched pair of tissues belonging to the same subject are on the same slide (and therefore belong to the same run). Note also the the cancer tissue is on the left and the normal tissue is on the right on each slide.

2.2.2 Prinipal components analysis (PCA)

Principal component analysis (PCA) is an unsupervised method for exploring a dataset. PCA is available in Cardinal through the PCA() method.

Below, we calculate the first 2 principal components. Note that PCA does not use any information about the diagnosis.

rcc_pca <- PCA(rcc_peaks, ncomp=2)

We can overlay the PC scores of the first 2 principal components. It doesn’t appear that either component distinguishes the diagnoses.

image(rcc_pca, layout=c(4,2), normalize.image="linear")

We can plot the scores of the first 2 components against each other to see how the separate the diagnoses (or don’t, in this case).

pc_scores <- DataFrame(resultData(rcc_pca, 1, "scores"))

It doesn’t appear that PCA separates cancer versus normal tissue. At least, not the first 2 components.

plot(pc_scores, PC1 ~ PC2, groups=rcc$diagnosis)

PCA is also a useful way to visualize how much each run clusters together. A large amount of variation in the data tends to be variation between experimental runs. This is why it’s useful to have matched pairs on the same slide.

plot(pc_scores, PC1 ~ PC2, groups=run(rcc))

2.3 Classification with spatial shrunken centroids (SSC)

To classify the dataset and automatically select peaks that distinguish each class, we will use the spatialShrunkenCentroids() method provided by Cardinal.

Important parameters to this method include:

  • method The type of spatial weights to use:

    • “gaussian” weights use a simple Gaussian smoothing kernel

    • “adaptive” weights use an adaptive kernel that sometimes preserve edges better

  • r The neighborhood smoothing radius; this should be selected based on the size and granularity of the spatial regions in your dataset

  • s The shrinkage or sparsity parameter; the higher this number, the fewer peaks will be used to determine the classification.

2.3.1 Cross-validation with SSC

In order to avoid over-fitting due to the pre-processing, each CV fold will be processed separately. For each CV fold, peak picking will be performed on the mean spectrum of the training set, and the test set will be binned to the peaks of the training set.

We use the crossValidate() method to perform cross-validation, and we treat each run as a separate fold.

rcc_ssc_cv <- crossValidate(rcc, rcc$diagnosis,
              .fun="spatialShrunkenCentroids",
              r=1, s=c(0,3,6,9,12,15),
              .fold=run(rcc), .process=TRUE,
              .processControl=list(SNR=3,
                                   tolerance=0.5,
                                   units="mz"))

summary(rcc_ssc_cv)
## Cross validation:
##  
##  Classification on 2 classes: cancer normal 
##  Summarized 8 folds: MH0204_33 UH0505_12 ... UH9911_05 UH9912_01
##  
##   r  s  Accuracy Sensitivity Specificity
## 1 1  0 0.9228680   0.7743570   0.9977366
## 2 1  3 0.9251193   0.7822133   0.9982630
## 3 1  6 0.9276070   0.7942608   0.9974212
## 4 1  9 0.9278434   0.8015583   0.9970539
## 5 1 12 0.9234143   0.8029425   0.9931620
## 6 1 15 0.9127316   0.7955741   0.9819103

It appears s = 9 produces the best accuracy.

plot(summary(rcc_ssc_cv), Accuracy ~ s, type='b')
abline(v=9, lty=2, col="red")

Having selected the parameters, we can re-fit the model on the full dataset in order to interpret it.

rcc_ssc <- spatialShrunkenCentroids(rcc_peaks, rcc$diagnosis, r=1, s=9)

summary(rcc_ssc)
## Spatially-aware nearest shrunken centroids:
##  
##  Classification on 2 classes: cancer normal 
##  Method = gaussian 
##  Distance = chebyshev
##  
##   Radius (r) Shrinkage (s) Features/Class  Accuracy Sensitivity Specificity
## 1          1             9             44 0.9425703   0.8745946   0.9996972

We should ignore the accuracy here though, as it is too optimistic.

2.3.2 Plotting the classified images

Now we plot the classified images. Opacity is used to reflect the probability of class membership.

image(rcc_ssc, layout=c(4,2))

2.3.3 Plotting the (shrunken) mean spectra

Below, we plot the centroids for each class separately.

setup.layout(c(2,1))
plot(rcc_ssc, column=1, col=discrete.colors(2)[1],
     lwd=2, layout=NULL)
plot(rcc_ssc, column=2, col=discrete.colors(2)[2],
     lwd=2, layout=NULL)

Although some differences are obvious, it is difficult to tell just from the mean spectra what peaks distinguish each diagnosis.

2.3.4 Plotting and interpretting t-statistics of the m/z values

Plotting the t-statistics tells us exactly the relationship between each class’s centroid and the global mean spectrum. The t-statistics are the difference between a class’s centroid and the global mean, divided by a standard error.

Positive t-statistics indicate that peak is systematically higher in that class as compared to the global mean spectrum.

Negative t-statistics indicate that peak is systematically lower in that class as compared to the global mean spectrum.

Due to the shrinkage parameter s, unimportant peaks will have a t-statistic of 0 and will effectively have no effect on the classification.

plot(rcc_ssc, values="statistic", lwd=2)

This lets us clearly see which peaks are distinguishing cancer versus normal tissue.

2.3.5 Retrieving the top m/z-values

We can use the topFeatures() method to retrive the m/z values associated with each class.

Let’s find the m/z values associated with cancer tissue.

topFeatures(rcc_ssc, class=="cancer")
## Top-ranked features: 
##          mz r k s  class   centers statistic
## 1  885.3817 1 2 9 cancer 475.48551 31.872571
## 2  886.3713 1 2 9 cancer 258.99739 30.741153
## 3  773.3928 1 2 9 cancer  61.46473 21.168045
## 4  750.3847 1 2 9 cancer  64.23589 19.863718
## 5  887.3785 1 2 9 cancer 139.23111 17.780913
## 6  774.4026 1 2 9 cancer  29.75428 12.849240
## 7  751.3974 1 2 9 cancer  27.05720 10.882855
## 8  771.3884 1 2 9 cancer  39.80619 10.773952
## 9  748.4115 1 2 9 cancer  64.50118 10.242484
## 10 888.4068 1 2 9 cancer  55.28092  8.908456
image(rcc_peaks, mz=885, contrast.enhance="suppress", layout=c(4,2))

And let’s find the m/z values associated with normal tissue.

topFeatures(rcc_ssc, class=="normal")
## Top-ranked features: 
##          mz r k s  class   centers statistic
## 1  215.1693 1 2 9 normal  60.20258  27.65943
## 2  217.1913 1 2 9 normal 148.67101  18.92472
## 3  810.2639 1 2 9 normal 139.23387  17.31908
## 4  353.2045 1 2 9 normal  28.29177  17.20477
## 5  738.2757 1 2 9 normal  34.00558  15.03878
## 6  171.2043 1 2 9 normal  41.68940  12.75510
## 7  219.1765 1 2 9 normal  40.53479  12.45263
## 8  271.1910 1 2 9 normal  67.30062  11.94238
## 9  279.3348 1 2 9 normal  87.70449  11.84226
## 10 187.1978 1 2 9 normal  28.26603  11.58026
image(rcc_peaks, mz=215, contrast.enhance="suppress", layout=c(4,2))

3 Additional notes on cross-validation

Because we used a matched pairs experiment, with one subject per run, it was straightforward to treat the experimental runs as our CV folds. However, it does not always work out this way.

In general, a CV fold needs to include examples of both positive and negative classes. In addition, spectra from the same sample should not be split across multiple CV folds. In MS imaging, it is important to remember that the sample size does NOT equal the number of spectra or the number of pixels.

4 Session information

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## Random number generation:
##  RNG:     L'Ecuyer-CMRG 
##  Normal:  Inversion 
##  Sample:  Rejection 
##  
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] CardinalWorkflows_1.18.0 Cardinal_2.4.0          
## [3] ProtGenerics_1.18.0      S4Vectors_0.24.0        
## [5] EBImage_4.28.0           BiocParallel_1.20.0     
## [7] BiocGenerics_0.32.0      BiocStyle_2.14.0        
## 
## loaded via a namespace (and not attached):
##  [1] biglm_0.9-1        locfit_1.5-9.1     tidyselect_0.2.5  
##  [4] xfun_0.10          purrr_0.3.3        lattice_0.20-38   
##  [7] htmltools_0.4.0    viridisLite_0.3.0  yaml_2.2.0        
## [10] rlang_0.4.1        pillar_1.4.2       glue_1.3.1        
## [13] DBI_1.0.0          sp_1.3-1           jpeg_0.1-8.1      
## [16] stringr_1.4.0      htmlwidgets_1.5.1  evaluate_0.14     
## [19] Biobase_2.46.0     knitr_1.25         irlba_2.3.3       
## [22] Rcpp_1.0.2         BiocManager_1.30.9 abind_1.4-5       
## [25] png_0.1-7          digest_0.6.22      stringi_1.4.3     
## [28] tiff_0.1-5         bookdown_0.14      dplyr_0.8.3       
## [31] grid_3.6.1         tools_3.6.1        bitops_1.0-6      
## [34] magrittr_1.5       RCurl_1.95-4.12    tibble_2.1.3      
## [37] crayon_1.3.4       pkgconfig_2.0.3    MASS_7.3-51.4     
## [40] Matrix_1.2-17      matter_1.12.0      assertthat_0.2.1  
## [43] rmarkdown_1.16     R6_2.4.0           fftwtools_0.9-8   
## [46] mclust_5.4.5       signal_0.7-6       nlme_3.1-141      
## [49] compiler_3.6.1