1 Introduction

Single-cell ’omics analysis enables high-resolution characterization of heterogeneous cell populations by quantifying measurements in individual cells, and thus provides a fuller, more nuanced picture of the complexity and heterogeneity among cells. However, these data also present new and significant challenges compared with earlier approaches, especially because single-cell data are much larger and sparser than data generated by bulk sequencing methods. Dimension reduction is a key step in single-cell analysis: it addresses the high dimensionality and sparsity of these data, and it enables the application of more complex, computationally expensive downstream pipelines.

Correspondence analysis (CA) is a matrix factorization method similar to principal components analysis (PCA). Whereas PCA is designed for continuous, approximately normally distributed data, CA is appropriate for non-negative, count-based data that are on the same additive scale. corral implements CA for dimensionality reduction of a single matrix of single-cell data.

See the vignette for corralm for the multi-table adaptation of CA for single-cell batch alignment/integration.

corral can be used with various types of input. When called on a matrix (or other matrix-like object), it returns a list with the SVD output, principal coordinates, and standard coordinates. When called on a SingleCellExperiment, it returns the SingleCellExperiment with the corral embeddings in the reducedDim slot named corral. To retrieve the full list output from a SingleCellExperiment input, the fullout argument can be set to TRUE.

2 Loading packages and data

We will use the Zhengmix4eq dataset from the DuoClustering2018 package. The Zhengmix8eq dataset is also loaded here because it is used in Section 5.4.

library(corral)
library(SingleCellExperiment)
library(ggplot2)
library(DuoClustering2018)
zm4eq.sce <- sce_full_Zhengmix4eq()
zm8eq <- sce_full_Zhengmix8eq()

This dataset includes approximately 4,000 pre-sorted and annotated cells of 4 types, mixed by Duò et al. in approximately equal proportions (Duò, Robinson, and Soneson 2018). The cells were sampled from the data published in “Massively parallel digital transcriptional profiling of single cells” (Zheng et al. 2017).

zm4eq.sce
## class: SingleCellExperiment 
## dim: 15568 3994 
## metadata(1): log.exprs.offset
## assays(3): counts logcounts normcounts
## rownames(15568): ENSG00000237683 ENSG00000228327 ... ENSG00000215700
##   ENSG00000215699
## rowData names(10): id symbol ... total_counts log10_total_counts
## colnames(3994): b.cells1147 b.cells6276 ... regulatory.t1084
##   regulatory.t9696
## colData names(14): dataset barcode ... libsize.drop feature.drop
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):
table(colData(zm4eq.sce)$phenoid)
## 
##         b.cells  cd14.monocytes naive.cytotoxic    regulatory.t 
##             999            1000             998             997

3 corral on SingleCellExperiment

We will run corral directly on the raw count data:

zm4eq.sce <- corral(inp = zm4eq.sce, 
                    whichmat = 'counts')

zm4eq.sce
## class: SingleCellExperiment 
## dim: 15568 3994 
## metadata(1): log.exprs.offset
## assays(3): counts logcounts normcounts
## rownames(15568): ENSG00000237683 ENSG00000228327 ... ENSG00000215700
##   ENSG00000215699
## rowData names(10): id symbol ... total_counts log10_total_counts
## colnames(3994): b.cells1147 b.cells6276 ... regulatory.t1084
##   regulatory.t9696
## colData names(14): dataset barcode ... libsize.drop feature.drop
## reducedDimNames(3): PCA TSNE corral
## mainExpName: NULL
## altExpNames(0):

We can use plot_embedding_sce to visualize the output:

plot_embedding_sce(sce = zm4eq.sce,
                   which_embedding = 'corral',
                   plot_title = 'corral on Zhengmix4eq',
                   color_attr = 'phenoid',
                   color_title = 'cell type',
                   saveplot = FALSE)

Using the scater package, we can also add and visualize UMAP and tSNE embeddings based on the corral output:

library(scater)
## Loading required package: scuttle
library(gridExtra) # so we can arrange the plots side by side

zm4eq.sce <- runUMAP(zm4eq.sce,
                     dimred = 'corral',
                     name = 'corral_UMAP')
zm4eq.sce <- runTSNE(zm4eq.sce,
                     dimred = 'corral',
                     name = 'corral_TSNE')

ggplot_umap <- plot_embedding_sce(sce = zm4eq.sce,
                                  which_embedding = 'corral_UMAP',
                                  plot_title = 'Zhengmix4eq corral with UMAP',
                                  color_attr = 'phenoid',
                                  color_title = 'cell type',
                                  returngg = TRUE,
                                  showplot = FALSE,
                                  saveplot = FALSE)

ggplot_tsne <- plot_embedding_sce(sce = zm4eq.sce,
                                  which_embedding = 'corral_TSNE',
                                  plot_title = 'Zhengmix4eq corral with tSNE',
                                  color_attr = 'phenoid',
                                  color_title = 'cell type',
                                  returngg = TRUE,
                                  showplot = FALSE,
                                  saveplot = FALSE)

grid.arrange(ggplot_umap, ggplot_tsne, ncol = 2)

The corral embeddings stored in the reducedDim slot can be used in downstream analysis, such as for clustering or trajectory analysis.

corral can also be run on a SummarizedExperiment object.

4 corral on matrix

corral can also be performed on a matrix (or matrix-like) input.

zm4eq.countmat <- assay(zm4eq.sce,'counts')
zm4eq.countcorral <- corral(zm4eq.countmat)

The output is in a list format, including the SVD output (u,d,v), the standard coordinates (SCu,SCv), and the principal coordinates (PCu,PCv).

zm4eq.countcorral
## corral output summary===========================================
##   Output "list" includes standard coordinates (SCu, SCv),
##   principal coordinates (PCu, PCv), & SVD output (u, d, v)
## Variance explained----------------------------------------------
##                           PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8
## percent.Var.explained    0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## cumulative.Var.explained 0.01 0.02 0.02 0.02 0.03 0.03 0.03 0.03
## 
## Dimensions of output elements-----------------------------------
##   Singular values (d) :: 30
##   Left singular vectors & coordinates (u, SCu, PCu) :: 15568 30
##   Right singular vectors & coordinates (v, SCv, PCv) :: 3994 30
##   See corral help for details on each output element.
##   Use plot_embedding to visualize; see docs for details.
## ================================================================
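
To make the relationship between these output elements more concrete, the sketch below computes them by hand in base R on a small made-up count matrix. It uses the textbook CA definitions (standard coordinates are singular vectors divided by the square root of the weights; principal coordinates are standard coordinates scaled by the singular values). That these definitions match corral's SCu, SCv, PCu, and PCv exactly is an assumption; see the corral help for the authoritative description.

```r
# Toy count matrix: 4 genes (rows) x 3 cells (columns); values are made up.
x <- matrix(c(10, 0, 3,
               2, 8, 1,
               0, 5, 7,
               4, 4, 4),
            nrow = 4, byrow = TRUE)

P     <- x / sum(x)               # proportions p_ij
row_w <- rowSums(P)               # row weights
col_w <- colSums(P)               # column weights
E     <- outer(row_w, col_w)      # expected proportions under independence
S     <- (P - E) / sqrt(E)        # Pearson residual matrix

sv <- svd(S)
k  <- 2                           # a 4 x 3 table has at most 2 non-trivial axes
u  <- sv$u[, 1:k]
v  <- sv$v[, 1:k]
d  <- sv$d[1:k]

# Standard coordinates: singular vectors rescaled by sqrt of the weights
SCu <- sweep(u, 1, sqrt(row_w), "/")
SCv <- sweep(v, 1, sqrt(col_w), "/")

# Principal coordinates: standard coordinates scaled by the singular values
PCu <- sweep(SCu, 2, d, "*")
PCv <- sweep(SCv, 2, d, "*")
```

With genes in rows and cells in columns, as here, the rows of SCv and PCv correspond to cells and the rows of SCu and PCu correspond to genes.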

We can use plot_embedding to visualize the output. The cell embeddings are in the v matrix because these data have genes in the rows and cells in the columns; if this were reversed, with cells in the rows and genes/features in the columns, then the cell embeddings would instead be in the u matrix.

celltype_vec <- zm4eq.sce$phenoid
plot_embedding(embedding = zm4eq.countcorral$v,
               plot_title = 'corral on Zhengmix4eq',
               color_vec = celltype_vec,
               color_title = 'cell type',
               saveplot = FALSE)

The output is the same as above with the SingleCellExperiment, and can be passed as the low-dimension embedding for downstream analysis. Similarly, UMAP and tSNE can be computed for visualization. (Note that the sign of each SVD axis is arbitrary, so axes may be flipped between runs; corral and corralm use irlba to compute a fast approximate SVD.)
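
The point about axis direction can be checked with a small base R sketch. It uses svd on an arbitrary random matrix rather than corral or irlba (an assumption made to keep the example self-contained), and shows that flipping the sign of a matched pair of singular vectors changes neither the reconstructed matrix nor the distances between embedded points.

```r
set.seed(1)
m  <- matrix(rnorm(20), nrow = 5)   # arbitrary 5 x 4 matrix
sv <- svd(m)

# Flip the sign of the first left AND first right singular vector
u_flip <- sv$u
v_flip <- sv$v
u_flip[, 1] <- -u_flip[, 1]
v_flip[, 1] <- -v_flip[, 1]

# Both factorizations reconstruct the same matrix
recon       <- sv$u  %*% diag(sv$d) %*% t(sv$v)
recon_flip  <- u_flip %*% diag(sv$d) %*% t(v_flip)
same_matrix <- isTRUE(all.equal(recon, recon_flip))

# Pairwise distances between embedded points are unchanged by the flip
same_dist <- isTRUE(all.equal(as.vector(dist(sv$v)),
                              as.vector(dist(v_flip))))
```

Because distances are preserved, a flipped axis affects only how a plot looks, not downstream steps such as clustering that depend on distances between cells.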

5 Updates to CA to address overdispersion

Correspondence analysis is known to be sensitive to “rare objects” (Greenacre, 2013). Sometimes this can be beneficial because the method can detect small perturbations of rare populations. In other cases, however, a few outlier cells can exert undue influence on a particular dimension.

In the corral manuscript, we describe three general approaches, summarized below; see the manuscript for more details and results. In this vignette we also present a fourth approach (trimming extreme values with smooth mode).

5.1 Changing the residual type (rtype)

Standard correspondence analysis decomposes Pearson \(\chi^2\) residuals, computed with the formula: \[r_{p; ij} = \frac{\mathrm{observed} - \mathrm{expected}}{\sqrt{\mathrm{expected}}} = \frac{p_{ij} - p_{i.} \ p_{.j}}{\sqrt{p_{i.} \ p_{.j}}}\]

where \(p_{ij} = \frac{x_{ij}}{N}\), \(N = \sum_{i=1}^m \sum_{j=1}^n x_{ij}\), \(p_{i.} = \mathrm{row \ weights} = \sum_{j=1}^n p_{ij}\), and \(p_{.j} = \mathrm{col \ weights} = \sum_{i=1}^m p_{ij}\).

In corral, this is the default setting. It can also be explicitly selected by setting rtype = 'standardized' or rtype = 'pearson'.
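
As an illustration of the Pearson residual formula, the following base R sketch computes the residual matrix for a small made-up count matrix and decomposes it with SVD. It does not call corral; the variable names are chosen for this example only.

```r
# Toy count matrix: 4 genes (rows) x 3 cells (columns); values are made up.
x <- matrix(c(10, 0, 3,
               2, 8, 1,
               0, 5, 7,
               4, 4, 4),
            nrow = 4, byrow = TRUE)

N     <- sum(x)                 # grand total
P     <- x / N                  # p_ij
row_w <- rowSums(P)             # p_i. (row weights)
col_w <- colSums(P)             # p_.j (column weights)
E     <- outer(row_w, col_w)    # expected proportions p_i. * p_.j

# Pearson chi-squared residuals: (observed - expected) / sqrt(expected)
r_pearson <- (P - E) / sqrt(E)

# CA decomposes this residual matrix with SVD
ca_svd <- svd(r_pearson)

# Total inertia: sum of squared residuals, equal to sum of squared singular values
total_inertia <- sum(r_pearson^2)
```

These residuals equal the residuals reported by base R's chisq.test divided by \(\sqrt{N}\), since that function works on counts rather than proportions.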

Another \(\chi^2\) residual is the Freeman-Tukey: \[r_{f; ij} = \sqrt{p_{ij}} + \sqrt{p_{ij} + \frac{1}{N}} - \sqrt{4 p_{i.} \ p_{.j} + \frac{1}{N}}\]

It is more robust to overdispersion than the Pearson residuals, and therefore outperforms standard CA in many scRNAseq datasets.

In corral, this option can be selected by setting rtype = 'freemantukey'.
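
For comparison with the Pearson form, the Freeman-Tukey residual formula above can be written out in base R as follows. The helper ft_residuals is hypothetical and written only for this illustration; it is not a corral function.

```r
# Hypothetical helper (not part of corral): Freeman-Tukey residuals
# for a non-negative count matrix, following the formula above.
ft_residuals <- function(x) {
  N <- sum(x)
  P <- x / N                              # p_ij
  E <- outer(rowSums(P), colSums(P))      # p_i. * p_.j
  sqrt(P) + sqrt(P + 1 / N) - sqrt(4 * E + 1 / N)
}

# Toy count matrix: 4 genes (rows) x 3 cells (columns); values are made up.
x <- matrix(c(10, 0, 3,
               2, 8, 1,
               0, 5, 7,
               4, 4, 4),
            nrow = 4, byrow = TRUE)

r_ft <- ft_residuals(x)
```

Note that the formula involves no division, so zero counts do not need special handling.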

5.2 Variance stabilization before CA (vst_mth)

Another approach for addressing overdispersed counts is to apply a variance stabilizing transformation. The options included in the package:

  • Square root transform (\(\sqrt{x}\)): vst_mth = 'sqrt'
  • Anscombe transform (\(2 \sqrt{x + \frac{3}{8}}\)): vst_mth = 'anscombe'
  • Freeman-Tukey transform (\(\sqrt{x} + \sqrt{x + 1}\)): vst_mth = 'freemantukey'. Note that this option is different from setting the rtype parameter to 'freemantukey'.
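
The three transformations listed above are simple to write out directly. The sketch below defines them as standalone base R functions (hypothetical names, not corral's internals) and uses simulated Poisson counts to show the variance-stabilizing effect of the Anscombe transform.

```r
# Standalone versions of the three transforms (names are hypothetical)
vst_sqrt         <- function(x) sqrt(x)
vst_anscombe     <- function(x) 2 * sqrt(x + 3 / 8)
vst_freemantukey <- function(x) sqrt(x) + sqrt(x + 1)

# Side-by-side values for a few counts
counts <- c(0, 1, 4, 9, 100)
vst_table <- data.frame(
  count        = counts,
  sqrt         = vst_sqrt(counts),
  anscombe     = vst_anscombe(counts),
  freemantukey = vst_freemantukey(counts)
)

# Simulated Poisson counts: raw variance grows with the mean,
# while the variance after the Anscombe transform stays close to 1.
set.seed(123)
lambdas      <- c(10, 50, 200)
sims         <- lapply(lambdas, function(l) rpois(1e5, l))
var_raw      <- sapply(sims, var)
var_anscombe <- sapply(sims, function(s) var(vst_anscombe(s)))
```

In this sketch the transforms are applied to the counts themselves, before any residuals are computed, which is what distinguishes them from the rtype option.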

5.3 Power deflation (powdef_alpha)

Another approach is to apply a smoothing effect to the \(\chi^2\) residuals by raising the residual matrix to a power \(\alpha \in (0,1)\). To achieve a “soft” smoothing effect, we suggest \(\alpha \in [0.9,0.99]\). This option is controlled by the powdef_alpha parameter, which defaults to NULL (not used). To enable it, set powdef_alpha to the desired numeric value of \(\alpha\); e.g., powdef_alpha = 0.95 sets \(\alpha = 0.95\).
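
A minimal sketch of the idea behind power deflation is shown below. Because residuals can be negative, this sketch assumes that the power is applied to the magnitude of each residual while its sign is kept; whether corral handles the sign in exactly this way is an assumption, and power_deflate is a hypothetical helper rather than a corral function.

```r
# Hypothetical helper (not corral's API): sign-preserving power transform
power_deflate <- function(r, alpha = 0.95) {
  stopifnot(alpha > 0, alpha < 1)
  sign(r) * abs(r)^alpha
}

# Made-up residual values, including one very large value
r          <- c(-4, -0.5, 0, 0.5, 4, 50)
r_deflated <- power_deflate(r, alpha = 0.9)

# The ratio between the largest and a moderate value shrinks after deflation
ratio_before <- abs(r[6] / r[5])
ratio_after  <- abs(r_deflated[6] / r_deflated[5])
```

Every non-zero value is changed by this transform, and the spread between large and small magnitudes is compressed across the whole distribution.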

5.4 Trimming extreme values (smooth mode)

One adaptation (not described in the manuscript) that addresses unduly influential outliers is to apply an alternative smoothing procedure that narrows the range of the \(\chi^2\)-transformed values by symmetrically trimming the top \(n\) fraction of extreme values (\(n\) defaults to \(.01\) and can be set with the pct_trim argument). Since the corral matrix pre-processing procedure transforms the values into standardized \(\chi^2\) space, they can be considered proportional to the significance of difference between observed and expected abundance for a given gene in a given cell. This approach differs from power deflation in that it only adjusts the most extreme values, and explicitly so, whereas power deflation shifts the distribution of all values to be less extreme.

This additional pre-processing step can be applied in corral by setting the smooth argument to TRUE (it defaults to FALSE). This mode works only with the standardized and indexed residual types.

zm8eq.corral <- corral(zm8eq, fullout = TRUE)
zm8eq.corralsmooth <- corral(zm8eq, fullout = TRUE, smooth = TRUE)
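
The trimming idea can be illustrated in base R as follows. This is one possible reading of “symmetrically trimming” (clipping values to the central quantile range, with pct_trim split evenly between the two tails), not corral's actual implementation; trim_extremes is a hypothetical helper, and the input is simulated rather than a real residual matrix.

```r
# Hypothetical helper (not corral's implementation): clip values that fall
# outside the central (1 - pct_trim) quantile range of the matrix.
# Assumption: pct_trim is split evenly between the lower and upper tails.
trim_extremes <- function(mat, pct_trim = 0.01) {
  lower <- quantile(mat, pct_trim / 2, names = FALSE)
  upper <- quantile(mat, 1 - pct_trim / 2, names = FALSE)
  pmin(pmax(mat, lower), upper)
}

# Simulated stand-in for a residual matrix, with one extreme outlier
set.seed(42)
r <- matrix(rnorm(1000), nrow = 50)
r[1, 1] <- 25

r_trim <- trim_extremes(r, pct_trim = 0.01)

# Fraction of entries that were changed by trimming
frac_changed <- mean(r_trim != r)
```

Only the most extreme entries are altered here, while the bulk of the distribution is left untouched, in contrast to power deflation.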

6 Session information

sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.24.0               scuttle_1.6.0              
##  [3] DuoClustering2018_1.13.0    ggplot2_3.3.5              
##  [5] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.0
##  [7] Biobase_2.56.0              GenomicRanges_1.48.0       
##  [9] GenomeInfoDb_1.32.0         IRanges_2.30.0             
## [11] S4Vectors_0.34.0            BiocGenerics_0.42.0        
## [13] MatrixGenerics_1.8.0        matrixStats_0.62.0         
## [15] corral_1.6.0                gridExtra_2.3              
## [17] BiocStyle_2.24.0           
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.16                    ggbeeswarm_0.6.0             
##   [3] colorspace_2.0-3              ellipsis_0.3.2               
##   [5] mclust_5.4.9                  XVector_0.36.0               
##   [7] BiocNeighbors_1.14.0          dichromat_2.0-0              
##   [9] farver_2.1.0                  ggrepel_0.9.1                
##  [11] MultiAssayExperiment_1.22.0   bit64_4.0.5                  
##  [13] RSpectra_0.16-1               interactiveDisplayBase_1.34.0
##  [15] AnnotationDbi_1.58.0          fansi_1.0.3                  
##  [17] sparseMatrixStats_1.8.0       cachem_1.0.6                 
##  [19] knitr_1.38                    jsonlite_1.8.0               
##  [21] dbplyr_2.1.1                  png_0.1-7                    
##  [23] uwot_0.1.11                   shiny_1.7.1                  
##  [25] BiocManager_1.30.17           mapproj_1.2.8                
##  [27] compiler_4.2.0                httr_1.4.2                   
##  [29] assertthat_0.2.1              Matrix_1.4-1                 
##  [31] fastmap_1.1.0                 cli_3.3.0                    
##  [33] BiocSingular_1.12.0           later_1.3.0                  
##  [35] htmltools_0.5.2               tools_4.2.0                  
##  [37] rsvd_1.0.5                    gtable_0.3.0                 
##  [39] glue_1.6.2                    GenomeInfoDbData_1.2.8       
##  [41] reshape2_1.4.4                dplyr_1.0.8                  
##  [43] ggthemes_4.2.4                maps_3.4.0                   
##  [45] rappdirs_0.3.3                Rcpp_1.0.8.3                 
##  [47] jquerylib_0.1.4               vctrs_0.4.1                  
##  [49] Biostrings_2.64.0             ExperimentHub_2.4.0          
##  [51] DelayedMatrixStats_1.18.0     xfun_0.30                    
##  [53] stringr_1.4.0                 beachmat_2.12.0              
##  [55] mime_0.12                     lifecycle_1.0.1              
##  [57] irlba_2.3.5                   AnnotationHub_3.4.0          
##  [59] zlibbioc_1.42.0               scales_1.2.0                 
##  [61] promises_1.2.0.1              parallel_4.2.0               
##  [63] yaml_2.3.5                    curl_4.3.2                   
##  [65] memoise_2.0.1                 sass_0.4.1                   
##  [67] stringi_1.7.6                 RSQLite_2.2.12               
##  [69] BiocVersion_3.15.2            highr_0.9                    
##  [71] ScaledMatrix_1.4.0            filelock_1.0.2               
##  [73] BiocParallel_1.30.0           pals_1.7                     
##  [75] rlang_1.0.2                   pkgconfig_2.0.3              
##  [77] bitops_1.0-7                  evaluate_0.15                
##  [79] lattice_0.20-45               purrr_0.3.4                  
##  [81] labeling_0.4.2                transport_0.12-2             
##  [83] bit_4.0.4                     tidyselect_1.1.2             
##  [85] plyr_1.8.7                    magrittr_2.0.3               
##  [87] bookdown_0.26                 R6_2.5.1                     
##  [89] magick_2.7.3                  generics_0.1.2               
##  [91] DelayedArray_0.22.0           DBI_1.1.2                    
##  [93] pillar_1.7.0                  withr_2.5.0                  
##  [95] KEGGREST_1.36.0               RCurl_1.98-1.6               
##  [97] tibble_3.1.6                  crayon_1.5.1                 
##  [99] utf8_1.2.2                    BiocFileCache_2.4.0          
## [101] rmarkdown_2.14                viridis_0.6.2                
## [103] grid_4.2.0                    data.table_1.14.2            
## [105] FNN_1.1.3                     blob_1.2.3                   
## [107] digest_0.6.29                 xtable_1.8-4                 
## [109] tidyr_1.2.0                   httpuv_1.6.5                 
## [111] munsell_0.5.0                 beeswarm_0.4.0               
## [113] viridisLite_0.4.0             vipor_0.4.5                  
## [115] bslib_0.3.1

References

Duò, A, MD Robinson, and C Soneson. 2018. “A Systematic Performance Evaluation of Clustering Methods for Single-Cell RNA-Seq Data [Version 2; Peer Review: 2 Approved].” F1000Research 7: 1141. https://doi.org/10.12688/f1000research.15666.2.

Zheng, Grace X. Y., Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8 (1): 14049. https://doi.org/10.1038/ncomms14049.