Differential cell-type-specific allelic imbalance with airpart

Wancen Mu, Hirak Sarkar, Avi Srivastava, Kwangbom Choi, Rob Patro, Michael I. Love

May 1, 2021

Abstract

Airpart identifies sets of genes displaying differential cell-type-specific allelic imbalance across cell types or states, utilizing single-cell allelic counts. It makes use of a generalized fused lasso with binomial observations of allelic counts to partition cell types by their allelic imbalance. Alternatively, a nonparametric method for partitioning cell types is offered. The package includes a number of visualizations and quality control functions for examining single cell allelic imbalance datasets.

Real data example

Vignette on Larsson 2019 data can be found here, which has allelic single-cell RNA-seq with 4 cell states.

Simulated data example I

The airpart package takes input data of counts from each of two alleles across genes (rows) and cells (columns) from a single-cell RNA-seq experiment.

For demonstration in the package vignette, we will simulate some data using makeSimulatedData function provided within the airpart package. We will examine the allelic counts and then perform QC steps before analyzing the data for allelic imbalance across groups of cells.

Simulation set-up

The simulated example dataset has 3 gene clusters with differential allelic imbalance (DAI):

the first cluster has pairs of cell types with same allelic ratio with 0.2 and 0.8 (larger DAI)
the second cluster has balanced allelic ratio
the third cluster has pairs of cell types with same allelic ratio with 0.7 and 0.9 (smaller DAI)

Below we specify a number of simulation settings as arguments to the simulation function:

the “noisy” cell count is 2
the normal cell count is 10
4 cell types
20 cells within each cell type
25 genes within each gene cluster
overdispersion parameter theta in rbetabinom is 20 (higher is less dispersion)

library(airpart)
suppressPackageStartupMessages(library(SingleCellExperiment))
p.vec <- rep(c(0.2, 0.8, 0.5, 0.5, 0.7, 0.9), each = 2)
set.seed(2021)
sce <- makeSimulatedData(
  mu1 = 2, mu2 = 10, nct = 4, n = 20,
  ngenecl = 25, theta = 20, ncl = 3,
  p.vec = p.vec
)

unique(rowData(sce)) # the true underlying allelic ratios

## DataFrame with 3 rows and 4 columns
##              ct1       ct2       ct3       ct4
##        <numeric> <numeric> <numeric> <numeric>
## gene1        0.2       0.2       0.8       0.8
## gene26       0.5       0.5       0.5       0.5
## gene51       0.7       0.7       0.9       0.9

table(sce$x) # counts of each cell type

## 
## ct1 ct2 ct3 ct4 
##  20  20  20  20

assays(sce)[["a1"]][1:5, 1:5] # allelic counts for the effect allele

##       cell1 cell2 cell3 cell4 cell5
## gene1     0     2     0     1     0
## gene2     0     1     0     0     2
## gene3     0     0     1     0     0
## gene4     0     2     0     0     0
## gene5     0     1     0     1     0

Required input data

In summary, airpart expects a SingleCellExperiment object with:

discrete cell types recorded as a variable x in the colData(sce)
effect and non-effect allelic counts as assays a1 and a2

The allelic ratio is calculated as a1 / (a1 + a2).

Note: We assume that the cell types have been either provided by the experiment, or identified based on total count. We assume the allelic ratio was not used in determining the cell groupings in x.

assayNames(sce)

## [1] "a1" "a2"

sce$x

##  [1] ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1
## [20] ct1 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2
## [39] ct2 ct2 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3
## [58] ct3 ct3 ct3 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4
## [77] ct4 ct4 ct4 ct4
## Levels: ct1 ct2 ct3 ct4

Create allelic ratio matrix

In the preprocess step, we add a pseudo-count for gene clustering and visualization (not used for inference later on allelic imbalance though, which uses original allelic counts). From the heatmap, we can clearly identify the three gene clusters (across rows), and we also see cell type differences (across columns). Within each cell type, there are some cells with noisier estimates (lower total count) than others. Again, the allelic ratio tells us how much more of the a1 allele is expressed, with 1 indicating all of the expression coming from the a1 allele and 0 indicating all of the expression coming from the a2 allele.

sce <- preprocess(sce)
makeHeatmap(sce)

Quality control steps

QC on cells

We recommend both QC on cells and on genes. We begin with cell allelic ratio quality control. For details on these metrics, see ?cellQC.

cellQCmetrics <- cellQC(sce, mad_detected = 4)
cellQCmetrics

## DataFrame with 80 rows and 7 columns
##               x       sum  detected spikePercent filter_sum filter_detected
##        <factor> <numeric> <numeric>    <numeric>  <logical>       <logical>
## cell1       ct1   2.19590        61            0       TRUE            TRUE
## cell2       ct1   2.23045        64            0       TRUE            TRUE
## cell3       ct1   2.09342        57            0       TRUE            TRUE
## cell4       ct1   2.18184        59            0       TRUE            TRUE
## cell5       ct1   2.19590        63            0       TRUE            TRUE
## ...         ...       ...       ...          ...        ...             ...
## cell76      ct4   2.88309        75            0       TRUE            TRUE
## cell77      ct4   2.83187        73            0       TRUE            TRUE
## cell78      ct4   2.86153        75            0       TRUE            TRUE
## cell79      ct4   2.89763        74            0       TRUE            TRUE
## cell80      ct4   2.83123        73            0       TRUE            TRUE
##        filter_spike
##           <logical>
## cell1          TRUE
## cell2          TRUE
## cell3          TRUE
## cell4          TRUE
## cell5          TRUE
## ...             ...
## cell76         TRUE
## cell77         TRUE
## cell78         TRUE
## cell79         TRUE
## cell80         TRUE

Now define cell filtering automatically or users can manually filter out based on sum,detected and spikePercent.

keep_cell <- (
  cellQCmetrics$filter_sum | # sufficient features (genes)
    cellQCmetrics$filter_detected | # sufficient molecules counted
    # sufficient features expressed compared to spike genes,
    # high quality cells
    cellQCmetrics$filter_spike
)
sce <- sce[, keep_cell]

QC on genes

We also recommend QC on genes for allelic ratio analysis. Note that we require genes to be expressed in at least 25% of cells within each cell type and the genes to have high allelic imbalance variation across cell types. The following code chunk is recommended (not evaluated here though). If users want to estimate homogeneous cell type allelic imbalance, they can set sd = 0 and examine the below summary step to find interesting gene clusters with weighted mean deviating from 0.5.

featureQCmetric <- featureQC(sce)
keep_feature <- (featureQCmetric$filter_celltype &
  featureQCmetric$filter_sd &
  featureQCmetric$filter_spike)
sce <- sce[keep_feature, ]

Gene clustering

airpart provides a function to cluster genes by their allelic imbalance profile across cells (not using cell grouping information, e.g. sce$x). We then recommend providing genes within a cluster to the partition function. Clustering genes increases power for detecting cell type partitions, and improves speed as it reduces the number of times the partition must be estimated.

We provide two methods for gene clustering.

Gaussian Mixture modeling

Gaussian mixture modeling is the default method for gene clustering. The scatter plot is shown based on top 2 PCs of the smoothed allelic ratio data. The argument plot=FALSE can be used to avoid showing the plot.

sce <- geneCluster(sce, G = 1:4)

## model-based optimal number of clusters: 3

metadata(sce)$geneCluster

## [1] 25 25 25

Hierarchical clustering

sce.hc <- geneCluster(sce, method = "hierarchical")

metadata(sce.hc)$geneCluster

## [1] 25 25 25

In this simulated dataset case, the clustering is very similar, but on allelic scRNA-seq datasets, we have found improved clustering with the Gaussian mixture model approach (more similar genes within cluster, based on visual inspection of PCA plot and of allelic ratio heatmaps).

Running airpart for allelic imbalance across groups of cells

Simple summary table of allelic ratio

We first quickly look at the weighted mean of allelic ratio for each gene cluster. From this step we will identify the interesting gene clusters. The mean is calculated, weighting the information from each gene x cell element of the matrices by the total count.

sce <- summaryAllelicRatio(sce)
metadata(sce)$summary

## $`gene cluster 1 with 25 genes`
##     x weighted.mean      mean        var
## 1 ct1     0.2084022 0.2056936 0.06355708
## 2 ct2     0.1981464 0.1879503 0.05634768
## 3 ct3     0.8133540 0.8090369 0.05572735
## 4 ct4     0.8054237 0.8136062 0.05143953
## 
## $`gene cluster 2 with 25 genes`
##     x weighted.mean      mean        var
## 1 ct1     0.5119248 0.5096426 0.09555925
## 2 ct2     0.4949174 0.5000794 0.09566643
## 3 ct3     0.4961165 0.5135395 0.09642713
## 4 ct4     0.5048356 0.4883957 0.08759873
## 
## $`gene cluster 3 with 25 genes`
##     x weighted.mean      mean        var
## 1 ct1     0.7085658 0.7216087 0.07258102
## 2 ct2     0.6942398 0.6803795 0.08580046
## 3 ct3     0.9117748 0.8952586 0.03821103
## 4 ct4     0.8956159 0.9017706 0.02942587

The following step is a complement of the QC on genes step. We recommend users only run airpart when the largest ordered allelic ratio difference > 0.05 for speed concerns. We find that the allelic ratio of most of the gene clusters in such cases (small absolute allelic ratio differences) won’t provide enough evidence to detect differential allelic imbalance.

summary <- metadata(sce)$summary
sapply(1:length(summary), function(i) {
  inst <- summary[[i]]
  inst_order <- inst[order(inst$weighted.mean), ]
  max(diff(inst_order$weighted.mean)) > 0.05
})

## [1]  TRUE FALSE  TRUE

Experiment-wide beta-binomial over-dispersion

We recommend examining the experiment-wide beta-binomial over-dispersion, which helps to inform whether to use a binomial likelihood or a nonparametric approach to partitioning the cell types by allelic imbalance.

We focus on the first gene cluster (if a gene cluster is not provided, estDisp will choose the largest cluster).

The blue trend line gives the typical values of over-dispersion across all the genes in the cluster, and across all the cell types (accounting for differences across the cell types in the expected ratio).

estDisp(sce, genecluster = 1)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Modeling using fused lasso with binomial likelihood

airpart offers a method for partitioning cell types using the generalized fused lasso with binomial likelihood, as implemented in the smurf package. Cell types are merged based on their similarity of allelic ratios, accounting for excess variance on the ratio from low counts. The penalization is determined using deviance on held-out data, with a 1 SE cross-validation rule for favoring smaller models (more fused cell types).

sce_sub <- fusedLasso(sce,
  model = "binomial",
  genecluster = 1, ncores = 1,
  niter = 2
)

The partition groups and the penalty $\lambda$ from the fused lasso are stored in the metadata:

knitr::kable(metadata(sce_sub)$partition, row.names = FALSE)

part1	part2	x	coef1	coef2
1	1	ct1	-1.366500	-1.366500
1	1	ct2	-1.366500	-1.366500
2	2	ct3	1.447169	1.447169
2	2	ct4	1.447169	1.447169

metadata(sce_sub)$lambda

##     part1     part2 
## 0.2053138 0.2053138

Above, ncores is the number of CPU used for parallelization. As a guide, one can specify niter=5 when the cts weighted allelic ratio difference is smaller than 0.1, in order to provide additional estimator robustness.

Consensus partition

If you run niter > 1, you can use our consensus partition function to derive the final partition. This function makes use of ensemble consensus clustering via the clue package.

sce_sub <- consensusPart(sce_sub)
knitr::kable(metadata(sce_sub)$partition, row.names = FALSE)

part	x
1	ct1
1	ct2
2	ct3
2	ct4

Modeling using pairwise Mann-Whitney-Wilcoxon extension

An alternative to the fused lasso with binomial likelihood is an extension we have implemented wherein all pairs cell types are compared with Mann-Whitney-Wilcoxon rank sum tests. In practice, we find that when the allelic counts deviates strongly from a binomial (e.g. large over-dispersion, small values of theta), the wilcoxExt function can offer improved performance, in terms of recovery of the true partition of cell types by allelic imbalance. The partition is decided based on a loss function motivated by the Bayesian Information Criteria.

thrs <- 10^seq(from = -2, to = -0.4, by = 0.2)
sce_sub_w <- wilcoxExt(sce, genecluster = 1, threshold = thrs)
knitr::kable(metadata(sce_sub_w)$partition, row.names = FALSE)

part	x
1	ct1
1	ct2
2	ct3
2	ct4

metadata(sce_sub_w)$threshold

## [1] 0.01

Calculating allelic ratio estimates via beta-binomial model

After airpart determines a partition of cell types either by the fused lasso with binomial likelihood or the nonparametric approach described above, it uses those fused lasso estimates or weighted means as the center of a Cauchy prior for posterior estimation of allelic ratios per cell type and per gene. Posterior mean and credible intervals are provided. The posterior inference makes use of a beta-binomial likelihood, and a moderated estimate of the over-dispersion. The prior from the partition and the moderated estimate of over-dispersion are provided to the apeglm function from the Bioconductor package of the same name.

Note that the estimates and credible intervals are not equal for cell types in the same partition and for genes, because in this step we re-estimate the conditional cell type means per cell type (not per partition) and account for each gene’s moderated estimate of over-dispersion.

sce_sub <- allelicRatio(sce_sub)
makeForest(sce_sub, showtext = TRUE)

## svalue shown in columns per cell type

Allelic ratio estimates (ar) as well as s value and credible interval (lower and upper) are stored in rowData.

genepoi <- paste0("gene", seq_len(5))
tab_cols <- c(grep("ar", colnames(rowData(sce_sub)), value = TRUE))
knitr::kable(rowData(sce_sub)[genepoi, tab_cols])

	ar_ct1	ar_ct2	ar_ct3	ar_ct4
gene1	0.257	0.284	0.751	0.732
gene2	0.312	0.225	0.776	0.711
gene3	0.228	0.235	0.789	0.791
gene4	0.314	0.282	0.756	0.745
gene5	0.352	0.202	0.678	0.697

Allelic ratio partition and posterior inference, example II

To demonstrate showing partition results on a heatmap, let’s make a more complex simulation, with 8 cell types, in 3 true groups by allelic ratio. In the code below, we construct the more complex simulation, run preprocessing, and examine the allelic ratio heatmap.

nct <- 8
p.vec <- (rep(c(
  -3, 0, -3, 3,
  rep(0, nct / 2),
  2, 3, 4, 2
), each = 2) + 5) / 10
sce <- makeSimulatedData(
  mu1 = 2, mu2 = 10, nct = nct, n = 30,
  ngenecl = 50, theta = 20, ncl = 3, p.vec = p.vec
)
sce <- preprocess(sce)

cellQCmetrics <- cellQC(sce, mad_detected = 4)
keep_cell <- (
  cellQCmetrics$filter_sum | # sufficient features (genes)
    cellQCmetrics$filter_detected | # sufficient molecules counted
    # sufficient features expressed compared to spike genes,
    # high quality cells
    cellQCmetrics$filter_spike
)
sce <- sce[, keep_cell]

featureQCmetric <- featureQC(sce)
keep_feature <- (featureQCmetric$filter_celltype &
  featureQCmetric$filter_sd &
  featureQCmetric$filter_spike)
sce <- sce[keep_feature, ]

makeHeatmap(sce)

We can then perform gene clustering:

sce <- geneCluster(sce, G = 1:4)

## model-based optimal number of clusters: 3

table(mcols(sce)$cluster)

## 
##  1  2  3 
## 50 16 50

We check for experiment-wide beta-binomial over-dispersion. Note that larger theta (y-axis) corresponds to less over-dispersion.

We focus on the first gene cluster (if a gene cluster is not provided, estDisp will choose the largest cluster).

sce <- summaryAllelicRatio(sce)
estDisp(sce, genecluster = 1)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We identify an interesting gene cluster and run the fused lasso.

sce_sub <- fusedLasso(sce,
  model = "binomial",
  genecluster = 1, ncores = 1
)

knitr::kable(metadata(sce_sub)$partition, row.names = FALSE)

part	x	coef
1	ct1	-1.3883646
1	ct2	-1.3883646
2	ct3	-0.0226667
2	ct4	-0.0226667
1	ct5	-1.3883646
1	ct6	-1.3883646
3	ct7	1.4140601
3	ct8	1.4140601

Next we estimate allelic ratios per cell type and per gene, with credible intervals. For demonstration, we subset to the first 10 genes.

sce_sub2 <- sce_sub[1:10, ]
sce_sub2 <- allelicRatio(sce_sub2)

We plot all cell types together, but one can set ctpoi=c(1,3,7) to limit the cell types to be plotted when there are too many cell types. And one can set genepoi=c(1,3,7) or genepoi=c("gene1","gene3","gene7") to only plot selected genes.

makeForest(sce_sub2)

## svalue shown in columns per cell type

A violin plot with posterior mean allelic ratios (one estimate per gene) on the y-axis:

makeViolin(sce_sub2)

## Joining, by = "x"

## Warning: position_dodge requires non-overlapping x intervals

Finally, a heatmap as before, but now with the cell types grouped according to the partition:

makeHeatmap(sce_sub2)

The heatmap can also be shown ordered by cell type.

makeHeatmap(sce_sub2, order_by_group = FALSE)

Session Info

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
##  [3] Biobase_2.52.0              GenomicRanges_1.44.0       
##  [5] GenomeInfoDb_1.28.1         IRanges_2.26.0             
##  [7] S4Vectors_0.30.0            BiocGenerics_0.38.0        
##  [9] MatrixGenerics_1.4.2        matrixStats_0.60.1         
## [11] airpart_1.0.1              
## 
## loaded via a namespace (and not attached):
##   [1] ggbeeswarm_0.6.0          colorspace_2.0-2         
##   [3] rjson_0.2.20              ellipsis_0.3.2           
##   [5] mclust_5.4.7              dynamicTreeCut_1.63-1    
##   [7] circlize_0.4.13           scuttle_1.2.1            
##   [9] XVector_0.32.0            GlobalOptions_0.1.2      
##  [11] BiocNeighbors_1.10.0      clue_0.3-59              
##  [13] farver_2.1.0              fansi_0.5.0              
##  [15] mvtnorm_1.1-2             apeglm_1.14.0            
##  [17] codetools_0.2-18          splines_4.1.1            
##  [19] sparseMatrixStats_1.4.2   doParallel_1.0.16        
##  [21] knitr_1.33                scater_1.20.1            
##  [23] jsonlite_1.7.2            speedglm_0.3-3           
##  [25] Cairo_1.5-12.2            catdata_1.2.2            
##  [27] cluster_2.1.2             png_0.1-7                
##  [29] compiler_4.1.1            backports_1.2.1          
##  [31] assertthat_0.2.1          Matrix_1.3-4             
##  [33] BiocSingular_1.8.1        htmltools_0.5.1.1        
##  [35] tools_4.1.1               rsvd_1.0.5               
##  [37] coda_0.19-4               gtable_0.3.0             
##  [39] glue_1.4.2                GenomeInfoDbData_1.2.6   
##  [41] dplyr_1.0.7               Rcpp_1.0.7               
##  [43] bbmle_1.0.24              jquerylib_0.1.4          
##  [45] vctrs_0.3.8               nlme_3.1-152             
##  [47] iterators_1.0.13          DelayedMatrixStats_1.14.2
##  [49] xfun_0.25                 stringr_1.4.0            
##  [51] beachmat_2.8.1            lpSolve_5.6.15           
##  [53] lifecycle_1.0.0           irlba_2.3.3              
##  [55] zlibbioc_1.38.0           MASS_7.3-54              
##  [57] scales_1.1.1              RColorBrewer_1.1-2       
##  [59] ComplexHeatmap_2.8.0      yaml_2.2.1               
##  [61] gridExtra_2.3             pbapply_1.4-3            
##  [63] ggplot2_3.3.5             emdbook_1.3.12           
##  [65] sass_0.4.0                bdsmatrix_1.3-4          
##  [67] stringi_1.7.3             highr_0.9                
##  [69] foreach_1.5.1             ScaledMatrix_1.0.0       
##  [71] checkmate_2.0.0           smurf_1.1.2              
##  [73] BiocParallel_1.26.2       shape_1.4.6              
##  [75] rlang_0.4.11              pkgconfig_2.0.3          
##  [77] bitops_1.0-7              evaluate_0.14            
##  [79] lattice_0.20-44           purrr_0.3.4              
##  [81] labeling_0.4.2            tidyselect_1.1.1         
##  [83] plyr_1.8.6                magrittr_2.0.1           
##  [85] R6_2.5.1                  magick_2.7.3             
##  [87] generics_0.1.0            DelayedArray_0.18.0      
##  [89] DBI_1.1.1                 pillar_1.6.2             
##  [91] mgcv_1.8-36               forestplot_2.0.0         
##  [93] survival_3.2-12           RCurl_1.98-1.4           
##  [95] tibble_3.1.3              crayon_1.4.1             
##  [97] utf8_1.2.2                rmarkdown_2.10           
##  [99] viridis_0.6.1             GetoptLong_1.0.5         
## [101] grid_4.1.1                digest_0.6.27            
## [103] tidyr_1.1.3               numDeriv_2016.8-1.1      
## [105] munsell_0.5.0             glmnet_4.1-2             
## [107] viridisLite_0.4.0         beeswarm_0.4.0           
## [109] vipor_0.4.5               bslib_0.2.5.1