The recommended way to install the zinbwave
package is via Bioconductor.
source("https://bioconductor.org/biocLite.R")
biocLite("zinbwave")
Note that zinbwave
requires R (>=3.4) and Bioconductor (>=3.6).
This vignette provides an introductory example on how to work with the zinbwave
package, which implements the ZINB-WaVE method proposed in (Risso D 2017).
First, let’s load the packages and set serial computations.
library(zinbwave)
library(scRNAseq)
library(matrixStats)
library(magrittr)
library(ggplot2)
library(biomaRt)
# Register BiocParallel Serial Execution
BiocParallel::register(BiocParallel::SerialParam())
ZINB-WaVE is a general and flexible model for the analysis of high-dimensional zero-inflated count data, such as those recorded in single-cell RNA-seq assays. Given \(n\) samples (typically, \(n\) single cells) and \(J\) features (typically, \(J\) genes) that can be counted for each sample, we denote with \(Y_{ij}\) the count of feature \(j\) (\(j=1,\ldots,J\)) for sample \(i\) (\(i=1,\ldots,n\)). To account for various technical and biological effects, typical of single-cell sequencing technologies, we model \(Y_{ij}\) as a random variable following a zero-inflated negative binomial (ZINB) distribution with parameters \(\mu_{ij}\), \(\theta_{ij}\), and \(\pi_{ij}\), and consider the following regression models for the parameters:
\[\begin{align} \label{eq:model1} \ln(\mu_{ij}) &= \left( X\beta_\mu + (V\gamma_\mu)^\top + W\alpha_\mu + O_\mu\right)_{ij}\,,\\ \label{eq:model2} \text{logit}(\pi_{ij}) &= \left(X\beta_\pi + (V\gamma_\pi)^\top + W\alpha_\pi + O_\pi\right)_{ij} \,, \\ \label{eq:model3} \ln(\theta_{ij}) &= \zeta_j \,, \end{align}\].
where the elements of the regression models are as follows.
To illustrate the methodology, we will make use of the Fluidigm C1 dataset of (Pollen et al. 2014). The data consist of 65 cells, each sequenced at high and low depth. The data are publicly available as part of the scRNAseq package, in the form of a SummarizedExperiment
object.
data("fluidigm")
fluidigm
## class: SummarizedExperiment
## dim: 26255 130
## metadata(3): sample_info clusters which_qc
## assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
## rownames(26255): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(130): SRR1275356 SRR1274090 ... SRR1275366 SRR1275261
## colData names(28): NREADS NALIGNED ... Cluster1 Cluster2
table(colData(fluidigm)$Coverage_Type)
##
## High Low
## 65 65
First, we filter out the lowly expressed genes, by removing those genes that do not have at least 5 reads in at least 5 samples.
filter <- rowSums(assay(fluidigm)>5)>5
table(filter)
## filter
## FALSE TRUE
## 16127 10128
fluidigm <- fluidigm[filter,]
This leaves us with 10128 genes.
We next identify the 100 most variable genes, which will be the input of our ZINB-WaVE procedure. Although we apply ZINB-WaVE to only these genes primarily for computational reasons, it is generally a good idea to focus on a subset of highly-variable genes, in order to remove transcriptional noise and focus on the more biologically meaningful signals. However, at least 1,000 genes are probably needed for real analyses.
assay(fluidigm) %>% log1p %>% rowVars -> vars
names(vars) <- rownames(fluidigm)
vars <- sort(vars, decreasing = TRUE)
head(vars)
## IGFBPL1 STMN2 EGR1 ANP32E CENPF LDHA
## 13.06109 12.24748 11.90608 11.67819 10.83797 10.72307
fluidigm <- fluidigm[names(vars)[1:100],]
We can now apply the zinbFit
function to our reduced gene expression matrix, to fit the ZINB model.
zinb <- zinbFit(fluidigm, K=2, epsilon=1000)
By default, the zinbFit
function fits a ZINB model with \(X = {\bf 1}_n\) and \(V = {\bf 1}_J\). In this case, the model is a factor model akin to principal component analysis (PCA), where \(W\) is a factor matrix and \(\alpha_\mu\) and \(\alpha_\pi\) are loading matrices. By default, the epsilon
parameter is set to the number of genes. We empirically found that a high epsilon
is often required to obtained a good low-level representation. See ?zinbModel
for details. Here we set epsilon=1000
.
The parameter \(K\) controls how many latent variables we want to infer from the data. In this case, as we specified \(K=2\), we can visualize the resulting \(W\) matrix in a simple plot, color-coded by cell-type.
W <- getW(zinb)
colnames(W) <- paste0("W", 1:2)
data.frame(W, bio=colData(fluidigm)$Biological_Condition,
coverage=colData(fluidigm)$Coverage_Type) %>%
ggplot(aes(W1, W2, colour=bio, shape=coverage)) + geom_point() +
scale_color_brewer(type = "qual", palette = "Set1") + theme_classic()
The ZINB-WaVE model is more general than PCA, allowing the inclusion of additional sample and gene-level covariates that might help to infer the unknown factors.
Typically, one could include batch information as sample-level covariate, to account for batch effects. Here, we illustrate this capability by including the coverage (high or low) as a sample-level covariate.
The column Coverage_Type
in the colData
of fluidigm
contains the coverage information. We can specify a design matrix that includes an intercept and an indicator variable for the coverage, by using the formula interface of zinbFit
.
zinb_cov <- zinbFit(fluidigm, K=2, X="~Coverage_Type", epsilon=1000)
W <- getW(zinb_cov)
colnames(W) <- paste0("W", 1:2)
data.frame(W, bio=colData(fluidigm)$Biological_Condition,
coverage=colData(fluidigm)$Coverage_Type) %>%
ggplot(aes(W1, W2, colour=bio, shape=coverage)) + geom_point() +
scale_color_brewer(type = "qual", palette = "Set1") + theme_classic()
In this case, the inferred \(W\) matrix is essentially the same with or without covariates, indicating that the scaling factor included in the model (the \(\gamma\) parameters associated with the intercept of \(V\)) are enough to achieve a good low-dimensional representation of the data.
Analogously, we can include gene-level covariates, as columns of \(V\). Here, we illustrate this capability by including gene length and GC-content.
We use the biomaRt
package to compute gene length and GC-content.
mart <- useMart("ensembl")
mart <- useDataset("hsapiens_gene_ensembl", mart = mart)
bm <- getBM(attributes=c('hgnc_symbol', 'start_position',
'end_position', 'percentage_gene_gc_content'),
filters = 'hgnc_symbol',
values = rownames(fluidigm),
mart = mart)
bm$length <- bm$end_position - bm$start_position
len <- tapply(bm$length, bm$hgnc_symbol, mean)
len <- len[rownames(fluidigm)]
gcc <- tapply(bm$percentage_gene_gc_content, bm$hgnc_symbol, mean)
gcc <- gcc[rownames(fluidigm)]
We then include the gene-level information as rowData
in the fluidigm
object.
rowData(fluidigm) <- data.frame(gccontent = gcc, length = len)
zinb_gcc <- zinbFit(fluidigm, K=2, V="~gccontent + log(length)", epsilon=1000)
W <- getW(zinb_gcc)
colnames(W) <- paste0("W", 1:2)
data.frame(W, bio=colData(fluidigm)$Biological_Condition,
coverage=colData(fluidigm)$Coverage_Type) %>%
ggplot(aes(W1, W2, colour=bio, shape=coverage)) + geom_point() +
scale_color_brewer(type = "qual", palette = "Set1") + theme_classic()
A t-SNE representation of the data can be obtained by computing the cell distances in the reduced space and running the t-SNE algorithm on the distance.
set.seed(93024)
library(Rtsne)
d <- dist(getW(zinb_gcc))
tsne_data <- Rtsne(d, is_distance = TRUE, pca = FALSE,
perplexity=10, max_iter=5000)
data.frame(Dim1=tsne_data$Y[,1], Dim2=tsne_data$Y[,2],
bio=colData(fluidigm)$Biological_Condition,
coverage=colData(fluidigm)$Coverage_Type) %>%
ggplot(aes(Dim1, Dim2, colour=bio, shape=coverage)) + geom_point() +
scale_color_brewer(type = "qual", palette = "Set1") + theme_classic()
Sometimes it is useful to have normalized values for visualization and residuals for model evaluation. Both quantities can be computed with the zinbwave()
function, which will return a SingleCellExperiment
object with the W matrix, and optionally the residuals and normalized values.
se_norm <- zinbwave(fluidigm, K=2, epsilon=1000, normalizedValues=TRUE,
residuals = TRUE)
If one has already fitted a model with zinbFit
the resulting ZinbModel
object can be passed to zinbwave
to avoid repeating the same computations.
se_norm <- zinbwave(fluidigm, fitted_model=zinb, normalizedValues=TRUE,
residuals = TRUE)
The se_norm
object includes normalized values and residuals as additional assays
.
se_norm
## class: SingleCellExperiment
## dim: 100 130
## metadata(3): sample_info clusters which_qc
## assays(6): tophat_counts cufflinks_fpkm ... normalizedValues
## residuals
## rownames(100): IGFBPL1 STMN2 ... SRSF7 FAM117B
## rowData names(2): gccontent length
## colnames(130): SRR1275356 SRR1274090 ... SRR1275366 SRR1275261
## colData names(28): NREADS NALIGNED ... Cluster1 Cluster2
## reducedDimNames(1): zinbwave
## spikeNames(0):
To retrieve the W matrix, we can use the SingleCellExperiment
reducedDim
method.
head(reducedDim(se_norm))
## W1 W2
## SRR1275356 0.8135522 0.43983154
## SRR1274090 -0.9058600 -0.25480146
## SRR1275251 1.1400352 0.28257709
## SRR1275287 -0.1022971 0.05742253
## SRR1275364 0.5296325 0.09619288
## SRR1275269 -0.8302533 0.43202673
The zinbwave
package uses the BiocParallel
package to allow for parallel computing. Here, we used the register
command to ensure that the vignette runs with serial computations.
However, in real datasets, parallel computations can speed up the computations dramatically, in the presence of many genes and/or many cells.
There are two ways of allowing parallel computations in zinbwave
. The first is to register()
a parallel back-end (see ?BiocParallel::register
for details). Alternatively, one can pass a BPPARAM
object to zinbwave
and zinbFit
, e.g.
library(BiocParallel)
zinb_res <- zinbFit(fluidigm, K=2, BPPARAM=MulticoreParam(2))
We found that MulticoreParam()
may have some performance issues on Mac; hence, we recommend DoparParam()
when working on Mac.
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Rtsne_0.13 biomaRt_2.34.0
## [3] ggplot2_2.2.1 magrittr_1.5
## [5] scRNAseq_1.3.0 zinbwave_1.0.0
## [7] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.0
## [9] DelayedArray_0.4.0 matrixStats_0.52.2
## [11] Biobase_2.38.0 GenomicRanges_1.30.0
## [13] GenomeInfoDb_1.14.0 IRanges_2.12.0
## [15] S4Vectors_0.16.0 BiocGenerics_0.24.0
## [17] BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
## [1] edgeR_3.20.0 bit64_0.9-7 splines_3.4.2
## [4] foreach_1.4.3 assertthat_0.2.0 blob_1.1.0
## [7] GenomeInfoDbData_0.99.1 yaml_2.1.14 progress_1.1.2
## [10] numDeriv_2016.8-1 RSQLite_2.0 backports_1.1.1
## [13] lattice_0.20-35 limma_3.34.0 digest_0.6.12
## [16] RColorBrewer_1.1-2 XVector_0.18.0 colorspace_1.3-2
## [19] htmltools_0.3.6 Matrix_1.2-11 plyr_1.8.4
## [22] pcaPP_1.9-72 XML_3.98-1.9 genefilter_1.60.0
## [25] bookdown_0.5 zlibbioc_1.24.0 xtable_1.8-2
## [28] mvtnorm_1.0-6 scales_0.5.0 copula_0.999-18
## [31] BiocParallel_1.12.0 tibble_1.3.4 annotate_1.56.0
## [34] ADGofTest_0.3 lazyeval_0.2.1 survival_2.41-3
## [37] memoise_1.1.0 evaluate_0.10.1 gsl_1.9-10.3
## [40] tools_3.4.2 prettyunits_1.0.2 softImpute_1.4
## [43] pspline_1.0-18 stringr_1.2.0 munsell_0.4.3
## [46] glmnet_2.0-13 locfit_1.5-9.1 stabledist_0.7-1
## [49] AnnotationDbi_1.40.0 compiler_3.4.2 rlang_0.1.2
## [52] grid_3.4.2 RCurl_1.95-4.8 iterators_1.0.8
## [55] labeling_0.3 bitops_1.0-6 rmarkdown_1.6
## [58] gtable_0.2.0 codetools_0.2-15 DBI_0.7
## [61] R6_2.2.2 knitr_1.17 bit_1.1-12
## [64] rprojroot_1.2 stringi_1.1.5 Rcpp_0.12.13
Pollen, Alex A, Tomasz J Nowakowski, Joe Shuga, Xiaohui Wang, Anne A Leyrat, Jan H Lui, Nianzhen Li, et al. 2014. “Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.” Nature Biotechnology 32 (10): 1053–8.
Risso D, Gribkova S, Perraudeau F. 2017. “ZINB-WaVE: A General and Flexible Method for Signal Extraction from Single-Cell RNA-Seq Data.” BioRxiv.