Cell omics such as single-cell genomics, proteomics and microbiomics allow the characterisation of tissue and microbial community composition, which can be compared between conditions to identify biological drivers. This strategy has been critical to unveiling markers of disease progression such as cancer and pathogen infection. For cell omic data, no method for differential variability analysis exists, and methods for differential composition analysis only take a few fundamental data properties into account. Here we introduce sccomp, a generalised method for differential composition and variability analyses able to jointly model data count distribution, compositionality, group-specific variability and proportion mean-variability association, with awareness against outliers. Sccomp is an extensive analysis framework that allows realistic data simulation and cross-study knowledge transfer. Here, we demonstrate that mean-variability association is ubiquitous across technologies showing the inadequacy of the very popular Dirichlet-multinomial modelling and provide mandatory principles for differential variability analysis. We show that sccomp accurately fits experimental data, with a 50% incremental improvement over state-of-the-art algorithms. Using sccomp, we identified novel differential constraints and composition in the microenvironment of primary breast cancer.

Installation

(simple) Suggested for single-cell and CyTOF analyses

Bioconductor

if (!requireNamespace("BiocManager")) {
   install.packages("BiocManager")
 }
 BiocManager::install("sccomp")

Github

devtools::install_github("stemangiola/sccomp")

(more complex and efficient, until further optimisation of the default installation) Suggested for microbiomics

Github

install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
check_cmdstan_toolchain()
install_cmdstan(cores = 2)
# Then, check the correct cmdstanr installation here
# https://mc-stan.org/cmdstanr/articles/cmdstanr.html

# Then install sccomp with the cmdstanr branch
devtools::install_github("stemangiola/sccomp@cmdstanr")

Analysis

sccomp can model changes in composition and variability. Normally the furmula for variability is either ~1, which assumes that the cell-group variability is independent on any covariate, or ~ factor_of_interest, which assumes that the model is dependent on the factor of interest only. However, more complex models for variability are possible, is the sample size is large. In any case the model for variability must be a subset of the model for composition.

From Seurat Object

res =
  seurat_obj |>
  sccomp_glm( 
   formula_composition = ~ type, 
    formula_variability = ~ 1, 
    sample, 
    cell_group 
  )

From SingleCellExperiment Object

res =
  sce_obj |>
  sccomp_glm( 
    formula_composition = ~ type, 
    formula_variability = ~ 1, 
    sample, 
    cell_group 
  )

From data.frame

res =
  seurat_obj[[]] |>
  sccomp_glm(
    formula_composition = ~ type, 
    formula_variability = ~ 1, 
    sample, 
    cell_group 
  )

From counts

res =
  counts_obj |>
  sccomp_glm( 
    formula_composition = ~ type, 
    formula_variability = ~ 1, 
    .sample = sample,
    .cell_group = cell_group,
    .count = count
  )

## sccomp says: outlier identification first pass - step 1/3 [ETA: ~20s]

## sccomp says: outlier identification second pass - step 2/3 [ETA: ~60s]

## sccomp says: outlier-free model fitting - step 3/3 [ETA: ~20s]

## sccomp says: the composition design matrix has columns: (Intercept), typecancer

## sccomp says: the variability design matrix has columns: (Intercept)

res

## # A tibble: 72 × 9
##    cell_group parameter   covariate c_lower c_effect c_upper    c_pH0     c_FDR
##    <chr>      <chr>       <chr>       <dbl>    <dbl>   <dbl>    <dbl>     <dbl>
##  1 B1         (Intercept) <NA>        0.577    0.762  0.946  0        0        
##  2 B1         typecancer  type       -1.28    -0.931 -0.579  0        0        
##  3 B2         (Intercept) <NA>        0.125    0.370  0.633  0.0886   0.00494  
##  4 B2         typecancer  type       -1.07    -0.613 -0.155  0.0358   0.00460  
##  5 B3         (Intercept) <NA>       -0.605   -0.412 -0.209  0.0218   0.00196  
##  6 B3         typecancer  type       -0.596   -0.212  0.153  0.475    0.127    
##  7 BM         (Intercept) <NA>       -1.31    -1.10  -0.887  0        0        
##  8 BM         typecancer  type       -0.760   -0.338  0.0699 0.242    0.0673   
##  9 CD4 1      (Intercept) <NA>        0.331    0.462  0.591  0.000250 0.0000114
## 10 CD4 1      typecancer  type       -0.123    0.108  0.359  0.769    0.221    
## # … with 62 more rows, and 1 more variable: count_data <list>

Of the output table, the estimate columns startwith the prefix c_ indicate composition.

Suggested settings for single-cell RNA sequencing

We reccommend to set bimodal_mean_variability_association = TRUE. The bimodality of the mean-variability association can be confirmed from the plots$credible_intervals_2D (see below).

Suggested settings for CyTOF and microbiome data

We reccommend to set bimodal_mean_variability_association = FALSE (Default).

Visualise data + inference

plots = plot_summary(res)

## Joining, by = c("sample", "cell_group")
## Joining, by = c("cell_group", "type")

## Warning: Ignoring unknown aesthetics: label

Plot of group proportion, faceted by groups. The blue boxplots represent the posterior predictive check. If the model is likely be descriptively adequate to the data, the blue boxplot should roughly overlay with the black boxplot, which represent the observed data. The outliers are coloured in red. A boxplot will be returned for every (discrete) covariates present in formula_composition. The color coding represent the significant associations for composition and/or variability.

plots$boxplot

## [[1]]

plot of chunk unnamed-chunk-11

Plot of estimates of differential composition (c) on the x axis, and differential variability (v) on the y axis. The error bars represent 95% credible intervals. The dashed lines represent the minimal effect that the hypothesis test is based on. An effect is labelled as significant if bigger than the minimal effect according to the 95% credible interval. Facets represent the covariates in the model.

plots$credible_intervals_1D

plot of chunk unnamed-chunk-12

Visualisation of the MCMC chains from the posterior distribution

It is possible to directly evaluate the posterior distribution. In this example we plot the Monte Carlo chain for the slope parameter of the first cell type. We can see that has converged and is negative with probability 1.

res %>% attr("fit") %>% rstan::traceplot("beta[2,1]")

plot of chunk unnamed-chunk-13

Differential variability

We can model the cell-group variability also dependent on type, and so test differences in variability

res = 
  counts_obj |>
  sccomp_glm( 
    formula_composition = ~ type, 
    formula_variability = ~ type, 
    .sample = sample,
    .cell_group = cell_group,
    .count = count
  )

## sccomp says: outlier identification first pass - step 1/3 [ETA: ~20s]

## sccomp says: outlier identification second pass - step 2/3 [ETA: ~60s]

## sccomp says: outlier-free model fitting - step 3/3 [ETA: ~20s]

## sccomp says: the composition design matrix has columns: (Intercept), typecancer

## sccomp says: the variability design matrix has columns: (Intercept), typecancer

res

## # A tibble: 72 × 14
##    cell_group parameter   covariate c_lower c_effect c_upper    c_pH0    c_FDR
##    <chr>      <chr>       <chr>       <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
##  1 B1         (Intercept) <NA>       0.630     0.795  0.967  0        0       
##  2 B1         typecancer  type      -1.26     -0.918 -0.571  0.000501 0.000100
##  3 B2         (Intercept) <NA>       0.181     0.422  0.676  0.0350   0.00183 
##  4 B2         typecancer  type      -1.03     -0.537 -0.0444 0.0926   0.0206  
##  5 B3         (Intercept) <NA>      -0.569    -0.364 -0.160  0.0558   0.00375 
##  6 B3         typecancer  type      -0.511    -0.111  0.290  0.669    0.177   
##  7 BM         (Intercept) <NA>      -1.28     -1.05  -0.827  0        0       
##  8 BM         typecancer  type      -0.714    -0.288  0.149  0.343    0.0864  
##  9 CD4 1      (Intercept) <NA>       0.396     0.531  0.683  0        0       
## 10 CD4 1      typecancer  type      -0.0435    0.247  0.549  0.381    0.109   
## # … with 62 more rows, and 6 more variables: v_lower <dbl>, v_effect <dbl>,
## #   v_upper <dbl>, v_pH0 <dbl>, v_FDR <dbl>, count_data <list>

Plot 1D significance plot

plots = plot_summary(res)

## Joining, by = c("sample", "cell_group")
## Joining, by = c("cell_group", "type")

## Warning: Ignoring unknown aesthetics: label

plots$credible_intervals_1D

plot of chunk unnamed-chunk-15

Plot 2D significance plot. Data points are cell groups. Error bars are the 95% credible interval. The dashed lines represent the default threshold fold change for which the probabilities (c_pH0, v_pH0) are calculated. pH0 of 0 represent the rejection of the null hypothesis, that no effect is observed.

This plot is provided only if differential variability has been tested. The differential variability estimates are reliable only if the linear association between mean and variability for (intercept) (left-hand side facet) is satisfied. A scatterplot (beside the Intercept) is provided for each of the categories of interest. The for each category of interest, the composition and variability effects should be generally uncorrelated.

plots$credible_intervals_2D

plot of chunk unnamed-chunk-16

sessionInfo()

## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rstan_2.21.5         StanHeaders_2.21.0-7 tidyr_1.2.0         
## [4] forcats_0.5.1        ggplot2_3.3.5        sccomp_1.0.0        
## [7] dplyr_1.0.8         
## 
## loaded via a namespace (and not attached):
##  [1] SeuratObject_4.0.4          ggrepel_0.9.1              
##  [3] Rcpp_1.0.8.3                lattice_0.20-45            
##  [5] prettyunits_1.1.1           ps_1.7.0                   
##  [7] digest_0.6.29               assertthat_0.2.1           
##  [9] SingleCellExperiment_1.18.0 utf8_1.2.2                 
## [11] R6_2.5.1                    GenomeInfoDb_1.32.0        
## [13] stats4_4.2.0                evaluate_0.15              
## [15] highr_0.9                   pillar_1.7.0               
## [17] zlibbioc_1.42.0             rlang_1.0.2                
## [19] callr_3.7.0                 S4Vectors_0.34.0           
## [21] Matrix_1.4-1                labeling_0.4.2             
## [23] readr_2.1.2                 stringr_1.4.0              
## [25] loo_2.5.1                   RCurl_1.98-1.6             
## [27] munsell_0.5.0               DelayedArray_0.22.0        
## [29] compiler_4.2.0              xfun_0.30                  
## [31] pkgconfig_2.0.3             BiocGenerics_0.42.0        
## [33] pkgbuild_1.3.1              rstantools_2.2.0           
## [35] tidyselect_1.1.2            SummarizedExperiment_1.26.0
## [37] gridExtra_2.3               tibble_3.1.6               
## [39] GenomeInfoDbData_1.2.8      codetools_0.2-18           
## [41] IRanges_2.30.0              matrixStats_0.62.0         
## [43] fansi_1.0.3                 withr_2.5.0                
## [45] crayon_1.5.1                tzdb_0.3.0                 
## [47] bitops_1.0-7                grid_4.2.0                 
## [49] gtable_0.3.0                lifecycle_1.0.1            
## [51] DBI_1.1.2                   magrittr_2.0.3             
## [53] scales_1.2.0                RcppParallel_5.1.5         
## [55] cli_3.3.0                   stringi_1.7.6              
## [57] farver_2.1.0                XVector_0.36.0             
## [59] ellipsis_0.3.2              generics_0.1.2             
## [61] vctrs_0.4.1                 boot_1.3-28                
## [63] RColorBrewer_1.1-3          tools_4.2.0                
## [65] Biobase_2.56.0              glue_1.6.2                 
## [67] purrr_0.3.4                 hms_1.1.1                  
## [69] MatrixGenerics_1.8.0        processx_3.5.3             
## [71] parallel_4.2.0              inline_0.3.19              
## [73] colorspace_2.0-3            GenomicRanges_1.48.0       
## [75] knitr_1.38                  patchwork_1.1.1