Abstract
Airpart identifies sets of genes displaying differential cell-type-specific allelic imbalance across cell types or states, utilizing single-cell allelic counts. It makes use of a generalized fused lasso with binomial observations of allelic counts to partition cell types by their allelic imbalance. Alternatively, a nonparametric method for partitioning cell types is offered. The package includes a number of visualizations and quality control functions for examining single cell allelic imbalance datasets.Vignette on Larsson 2019 data can be found here, which has allelic single-cell RNA-seq with 4 cell states.
The airpart package takes input data of counts from each of two alleles across genes (rows) and cells (columns) from a single-cell RNA-seq experiment.
For demonstration in the package vignette, we will simulate some data using makeSimulatedData
function provided within the airpart package. We will examine the allelic counts and then perform QC steps before analyzing the data for allelic imbalance across groups of cells.
The simulated example dataset has 3 gene clusters with differential allelic imbalance (DAI):
Below we specify a number of simulation settings as arguments to the simulation function:
theta
in rbetabinom
is 20 (higher is less dispersion)library(airpart)
suppressPackageStartupMessages(library(SingleCellExperiment))
p.vec <- rep(c(0.2, 0.8, 0.5, 0.5, 0.7, 0.9), each = 2)
set.seed(2021)
sce <- makeSimulatedData(
mu1 = 2, mu2 = 10, nct = 4, n = 20,
ngenecl = 25, theta = 20, ncl = 3,
p.vec = p.vec
)
## DataFrame with 3 rows and 4 columns
## ct1 ct2 ct3 ct4
## <numeric> <numeric> <numeric> <numeric>
## gene1 0.2 0.2 0.8 0.8
## gene26 0.5 0.5 0.5 0.5
## gene51 0.7 0.7 0.9 0.9
##
## ct1 ct2 ct3 ct4
## 20 20 20 20
## cell1 cell2 cell3 cell4 cell5
## gene1 0 2 0 1 0
## gene2 0 1 0 0 2
## gene3 0 0 1 0 0
## gene4 0 2 0 0 0
## gene5 0 1 0 1 0
In summary, airpart expects a SingleCellExperiment object with:
x
in the colData(sce)
a1
and a2
The allelic ratio is calculated as a1 / (a1 + a2)
.
Note: We assume that the cell types have been either provided by the experiment, or identified based on total count. We assume the allelic ratio was not used in determining the cell groupings in x
.
## [1] "a1" "a2"
## [1] ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1 ct1
## [20] ct1 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2 ct2
## [39] ct2 ct2 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3 ct3
## [58] ct3 ct3 ct3 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4 ct4
## [77] ct4 ct4 ct4 ct4
## Levels: ct1 ct2 ct3 ct4
In the preprocess
step, we add a pseudo-count for gene clustering and visualization (not used for inference later on allelic imbalance though, which uses original allelic counts). From the heatmap, we can clearly identify the three gene clusters (across rows), and we also see cell type differences (across columns). Within each cell type, there are some cells with noisier estimates (lower total count) than others. Again, the allelic ratio tells us how much more of the a1
allele is expressed, with 1 indicating all of the expression coming from the a1
allele and 0 indicating all of the expression coming from the a2
allele.
We recommend both QC on cells and on genes. We begin with cell allelic ratio quality control. For details on these metrics, see ?cellQC
.
## DataFrame with 80 rows and 7 columns
## x sum detected spikePercent filter_sum filter_detected
## <factor> <numeric> <numeric> <numeric> <logical> <logical>
## cell1 ct1 2.19590 61 0 TRUE TRUE
## cell2 ct1 2.23045 64 0 TRUE TRUE
## cell3 ct1 2.09342 57 0 TRUE TRUE
## cell4 ct1 2.18184 59 0 TRUE TRUE
## cell5 ct1 2.19590 63 0 TRUE TRUE
## ... ... ... ... ... ... ...
## cell76 ct4 2.88309 75 0 TRUE TRUE
## cell77 ct4 2.83187 73 0 TRUE TRUE
## cell78 ct4 2.86153 75 0 TRUE TRUE
## cell79 ct4 2.89763 74 0 TRUE TRUE
## cell80 ct4 2.83123 73 0 TRUE TRUE
## filter_spike
## <logical>
## cell1 TRUE
## cell2 TRUE
## cell3 TRUE
## cell4 TRUE
## cell5 TRUE
## ... ...
## cell76 TRUE
## cell77 TRUE
## cell78 TRUE
## cell79 TRUE
## cell80 TRUE
Now define cell filtering automatically or users can manually filter out based on sum
,detected
and spikePercent
.
We also recommend QC on genes for allelic ratio analysis. Note that we require genes to be expressed in at least 25% of cells within each cell type and the genes to have high allelic imbalance variation across cell types. The following code chunk is recommended (not evaluated here though). If users want to estimate homogeneous cell type allelic imbalance, they can set sd = 0
and examine the below summary step to find interesting gene clusters with weighted mean deviating from 0.5.
airpart provides a function to cluster genes by their allelic imbalance profile across cells (not using cell grouping information, e.g. sce$x
). We then recommend providing genes within a cluster to the partition function. Clustering genes increases power for detecting cell type partitions, and improves speed as it reduces the number of times the partition must be estimated.
We provide two methods for gene clustering.
Gaussian mixture modeling is the default method for gene clustering. The scatter plot is shown based on top 2 PCs of the smoothed allelic ratio data. The argument plot=FALSE
can be used to avoid showing the plot.
## model-based optimal number of clusters: 3
## [1] 25 25 25