This vignette shows an example pseudobulk experiment testing cell size scale factors using a small example dataset of single-nucleus RNA-seq data (snRNA-seq) from human cortex (Darmanis et al. (2015)). Predictions are made using lute, and results plots are generated using ggplot2.

1 Experiment setup

1.1 Load example data

In this example, we source a real snRNA-seq dataset of human brain, including cortex and hippocampus published in darmanis_survey_2015. The full data along with other real single-cell RNA-seq datasets may be accessed from the scRNAseq package.

Load a stored subset of the example dataset with the following.

path <- system.file("extdata", "scRNAseq/darmanis_example.rda", package="lute")
data <- get(load(path))

The loaded dataset is of type SingleCellExperiment, which is handled by the lute() function (see ?lute for details). Before calling the framework function, it needs to be processed to (1) define cell types and samples of interest (2) subset on cell type markers, and (3) define pseudobulks for each available sample.

For this experiment, we will consider two principal cell types for brain, neuron and glial cells (a.k.a. “K2”).

1.2 Define cell types of interest

First, identify nuclei labeled from only these types and remove the rest. Then define a new label "k2" using the valid remaining nuclei.

sampleIdVariable <- "experiment_sample_name"
oldTypes <- "cell.type"; newTypes <- "k2"
# remove non-k2 types
filterK2 <- data[[oldTypes]] %in% 
  c("neurons", "oligodendrocytes", "astrocytes", "OPC", "microglia")
data <- data[,filterK2]
# define new k2 variable
data[[newTypes]] <- ifelse(data[[oldTypes]]=="neurons", "neuron", "glial")
data[[newTypes]] <- factor(data[[newTypes]])

1.3 Filter samples

Next, define the samples of interest for the experiment. We will select samples having at least 20 nuclei.

minNuclei <- 20
nucleiPerSample <- table(data[[sampleIdVariable]])
sampleIdVector <- unique(data[[sampleIdVariable]])
sampleIdVector <- sampleIdVector[nucleiPerSample >= minNuclei]
sampleIdVector # view
## [1] "AB_S8"  "AB_S11" "AB_S3"  "AB_S4"  "AB_S5"  "AB_S7"

Next, save samples having non-zero amounts of neuron and glial cells.

sampleIdVector <- unlist(lapply(sampleIdVector, function(sampleId){
  numTypes <- length(
    unique(
      data[,data[[sampleIdVariable]]==sampleId][[newTypes]]))
  if(numTypes==2){sampleId}
}))
sampleIdVector
## [1] "AB_S11" "AB_S4"  "AB_S5"  "AB_S7"

View the summaries by sample id, then save these as the true cell type proportions. These will be used later to assess the predictions.

proportionsList <- lapply(sampleIdVector, function(sampleId){
  prop.table(table(data[,data$experiment_sample_name==sampleId]$k2))
})
dfProportions <- do.call(rbind, proportionsList)
rownames(dfProportions) <- sampleIdVector
colnames(dfProportions) <- paste0(colnames(dfProportions), ".true")
dfProportions <- as.data.frame(dfProportions)
knitr::kable(dfProportions) # view
glial.true neuron.true
AB_S11 0.1525424 0.8474576
AB_S4 0.6724138 0.3275862
AB_S5 0.3571429 0.6428571
AB_S7 0.3000000 0.7000000

1.4 Make pseudobulks with cell size scale factors

Define the cell size scale factors and use these to make the pseudobulks. For demonstration we set these to have large difference (i.e. neuron/glial > 3). While we set these manually, the cell scale factors could also be defined from library sizes or by referencing the cellScaleFactors package (link).

cellScalesVector <- c("glial" = 3, "neuron" = 10)

Next make the pseudobulk datasets.

assayName <- "counts"
pseudobulkList <- lapply(sampleIdVector, function(sampleId){
  dataIteration <- data[,data[[sampleIdVariable]]==sampleId]
  ypb_from_sce(
    singleCellExperiment = dataIteration, assayName = assayName, 
    cellTypeVariable = newTypes, cellScaleFactors = cellScalesVector)
})

dfPseudobulk <- do.call(cbind, pseudobulkList)

dfPseudobulk <- as.data.frame(dfPseudobulk)

colnames(dfPseudobulk) <- sampleIdVector

knitr::kable(head(dfPseudobulk))
AB_S11 AB_S4 AB_S5 AB_S7
ZNHIT3 407.6780 334.3621 342.42857 489.48
ZYG11B 675.4746 835.5517 774.42857 1380.54
ZNF91 1204.6610 679.3793 2608.28571 856.76
ZSCAN18 197.5254 249.0000 450.07143 323.62
ZRANB2 1432.9831 1121.3621 1469.21429 1623.72
ZYX 141.7458 175.3621 39.28571 132.82

2 Get predictions

2.1 Predictions with scaling

Predict the neuron proportions using non-negative least squares (NNLS), the default deconvolution algorithm used by lute(). First, get the scaled proportions by setting the argument cellScaleFactors = cellScalesVector.

scaledResult <- lute(
  singleCellExperiment = data, 
  bulkExpression = as.matrix(dfPseudobulk), 
  cellScaleFactors = cellScalesVector,
  typemarkerAlgorithm = NULL,
  cellTypeVariable = newTypes,
  assayName = assayName)
## Parsing deconvolution arguments...
## Using NNLS...
proportions.scaled <- scaledResult[[1]]@predictionsTable
knitr::kable(proportions.scaled) # view
glial neuron
AB_S11 0.1638850 0.8361150
AB_S4 0.7281654 0.2718346
AB_S5 0.0000000 1.0000000
AB_S7 0.4049374 0.5950626

2.2 Predictions without scaling

Next, get the unscaled result without setting s.

unscaledResult <- lute(
  singleCellExperiment = data, 
  bulkExpression = as.matrix(dfPseudobulk), 
  typemarkerAlgorithm = NULL,
  cellTypeVariable = newTypes,
  assayName = assayName)
## Parsing deconvolution arguments...
## Using NNLS...
proportionsUnscaled <- unscaledResult[[1]]@predictionsTable
knitr::kable(proportionsUnscaled) # view
glial neuron
AB_S11 0.0555366 0.9444634
AB_S4 0.4455571 0.5544429
AB_S5 0.0000000 1.0000000
AB_S7 0.1695378 0.8304622

Note proportions didn’t change for samples which had all glial or all neuron (AB_S8 and AB_S3).

3 Plot differences

3.1 Get the plot data tables

We will show the outcome of performing the cell scale factor adjustments using scatterplots and boxplots. Begin by appending the neuron proportion predictions from scaling treatments (scaled and unscaled) to the true proportions table dfProportions.

dfProportions$neuron.unscaled <- proportionsUnscaled$neuron
dfProportions$neuron.scaled <- proportions.scaled$neuron
knitr::kable(dfProportions) # view
glial.true neuron.true neuron.unscaled neuron.scaled
AB_S11 0.1525424 0.8474576 0.9444634 0.8361150
AB_S4 0.6724138 0.3275862 0.5544429 0.2718346
AB_S5 0.3571429 0.6428571 1.0000000 1.0000000
AB_S7 0.3000000 0.7000000 0.8304622 0.5950626

Calculate bias as the difference between true and predicted neuron proportions. Then calculate the error as the absolute of the bias thus defined.

# get bias
dfProportions$bias.neuron.unscaled <- 
  dfProportions$neuron.true-dfProportions$neuron.unscaled
dfProportions$bias.neuron.scaled <- 
  dfProportions$neuron.true-dfProportions$neuron.scaled
# get error
dfProportions$error.neuron.unscaled <- 
  abs(dfProportions$bias.neuron.unscaled)
dfProportions$error.neuron.scaled <- 
  abs(dfProportions$bias.neuron.scaled)

Make the tall version of dfProportions in order to generate a plot with facets on the scale treatment (either “scaled” or “unscaled”).

dfPlotTall <- rbind(
  data.frame(true = dfProportions$neuron.true, 
             predicted = dfProportions$neuron.scaled,
             error = dfProportions$error.neuron.scaled,
             sampleId = rownames(dfProportions),
             type = rep("scaled", nrow(dfProportions))),
  data.frame(true = dfProportions$neuron.true, 
             predicted = dfProportions$neuron.unscaled, 
             error = dfProportions$error.neuron.unscaled,
             sampleId = rownames(dfProportions),
             type = rep("unscaled", nrow(dfProportions)))
)
dfPlotTall <- as.data.frame(dfPlotTall)

3.2 Make scatterplots of true versus predicted neuron proportions

Show sample results scatterplots of true (x-axis) by predicted (y-axis) neuron proportions. Also include a reference line (slope = 1, yintercept = 0) showing where agreement is absolute between proportions.

Also shows RMSE in plot titles.

dfPlotTallNew <- dfPlotTall

rmseScaled <- 
  rmse(
    dfPlotTallNew[dfPlotTallNew$type=="scaled",]$true,
    dfPlotTall[dfPlotTall$type=="scaled",]$predicted, "mean")

rmseUnscaled <- 
  rmse(
    dfPlotTallNew[dfPlotTallNew$type=="unscaled",]$true,
    dfPlotTallNew[dfPlotTallNew$type=="unscaled",]$predicted, "mean")

dfPlotTallNew$type <- 
  ifelse(grepl("un.*", dfPlotTallNew$type),
         paste0(dfPlotTallNew$type,
                " (RMSE = ", round(rmseScaled, 3), ")"),
         paste0(dfPlotTallNew$type,
                " (RMSE = ", round(rmseUnscaled, 3), ")"))

textSize <- 15
ggplot(dfPlotTallNew, aes(x = true, y = predicted)) + 
  geom_point(size = 4, alpha = 0.5) + geom_abline(slope = 1, intercept = 0) + 
  xlim(0, 1) + ylim(0, 1) + facet_wrap(~type) + theme_bw() +
  xlab("True") + ylab("Predicted") +
  theme(text = element_text(size = textSize),
        axis.text.x = element_text(angle = 45, hjust = 1))

3.3 Make boxplots with jittered points for neuron errors

Show jitters and boxplots by sample, depicting the neuron error (y-axis) by scale treatment (x-axis). The sample IDs are depicted by the point colors.

ggplot(dfPlotTall, aes(x = type, y = error, color = sampleId)) + 
  geom_jitter(alpha = 0.5, size = 4) + theme_bw() +
  geom_boxplot(alpha = 0, color = "cyan") +
  theme(text = element_text(size = textSize),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Type") + ylab("Error (Neuron)")

This process could be readily repeated for the remaining cell types, or just glial cells in this case.

4 Conclusions

This vignette showed how to conduct a basic pseudobulk experiment using cell size scale factors and an example snRNAseq dataset from human brain Darmanis et al. (2015). Some key details include sourcing and snRNA-seq data, defining a new cell type variable, setting the scale factors, making predictions, and performing comparative analyses of the prediction results. Further details about the importance of cell size scale factors are discussed in Maden et al. (2023), and examples of their utilizations may be found in Monaco et al. (2019), Racle and Gfeller (2020), and Sosina et al. (2021).

5 Session info

sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggplot2_3.5.1               lute_1.0.0                 
##  [3] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
##  [5] Biobase_2.64.0              GenomicRanges_1.56.0       
##  [7] GenomeInfoDb_1.40.0         IRanges_2.38.0             
##  [9] S4Vectors_0.42.0            BiocGenerics_0.50.0        
## [11] MatrixGenerics_1.16.0       matrixStats_1.3.0          
## [13] BiocStyle_2.32.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1          farver_2.1.1             
##  [3] dplyr_1.1.4               fastmap_1.1.1            
##  [5] bluster_1.14.0            digest_0.6.35            
##  [7] rsvd_1.0.5                lifecycle_1.0.4          
##  [9] cluster_2.1.6             statmod_1.5.0            
## [11] magrittr_2.0.3            compiler_4.4.0           
## [13] rlang_1.1.3               sass_0.4.9               
## [15] tools_4.4.0               igraph_2.0.3             
## [17] utf8_1.2.4                yaml_2.3.8               
## [19] knitr_1.46                S4Arrays_1.4.0           
## [21] labeling_0.4.3            dqrng_0.3.2              
## [23] DelayedArray_0.30.0       abind_1.4-5              
## [25] BiocParallel_1.38.0       withr_3.0.0              
## [27] grid_4.4.0                fansi_1.0.6              
## [29] beachmat_2.20.0           colorspace_2.1-0         
## [31] edgeR_4.2.0               scales_1.3.0             
## [33] tinytex_0.50              cli_3.6.2                
## [35] rmarkdown_2.26            crayon_1.5.2             
## [37] generics_0.1.3            metapod_1.12.0           
## [39] httr_1.4.7                DelayedMatrixStats_1.26.0
## [41] scuttle_1.14.0            cachem_1.0.8             
## [43] zlibbioc_1.50.0           nnls_1.5                 
## [45] parallel_4.4.0            BiocManager_1.30.22      
## [47] XVector_0.44.0            vctrs_0.6.5              
## [49] Matrix_1.7-0              jsonlite_1.8.8           
## [51] bookdown_0.39             BiocSingular_1.20.0      
## [53] BiocNeighbors_1.22.0      irlba_2.3.5.1            
## [55] magick_2.8.3              locfit_1.5-9.9           
## [57] limma_3.60.0              jquerylib_0.1.4          
## [59] glue_1.7.0                codetools_0.2-20         
## [61] gtable_0.3.5              UCSC.utils_1.0.0         
## [63] ScaledMatrix_1.12.0       munsell_0.5.1            
## [65] tibble_3.2.1              pillar_1.9.0             
## [67] htmltools_0.5.8.1         GenomeInfoDbData_1.2.12  
## [69] R6_2.5.1                  sparseMatrixStats_1.16.0 
## [71] evaluate_0.23             lattice_0.22-6           
## [73] highr_0.10                scran_1.32.0             
## [75] bslib_0.7.0               Rcpp_1.0.12              
## [77] SparseArray_1.4.0         xfun_0.43                
## [79] pkgconfig_2.0.3

Works cited

Darmanis, Spyros, Steven A. Sloan, Ye Zhang, Martin Enge, Christine Caneda, Lawrence M. Shuer, Melanie G. Hayden Gephart, Ben A. Barres, and Stephen R. Quake. 2015. “A Survey of Human Brain Transcriptome Diversity at the Single Cell Level.” Proceedings of the National Academy of Sciences of the United States of America 112 (23): 7285–90. https://doi.org/10.1073/pnas.1507125112.

Maden, Sean K., Sang Ho Kwon, Louise A. Huuki-Myers, Leonardo Collado-Torres, Stephanie C. Hicks, and Kristen R. Maynard. 2023. “Challenges and Opportunities to Computationally Deconvolve Heterogeneous Tissue with Varying Cell Sizes Using Single Cell RNA-Sequencing Datasets.” arXiv. https://doi.org/10.48550/arXiv.2305.06501.

Monaco, Gianni, Bernett Lee, Weili Xu, Seri Mustafah, You Yi Hwang, Christophe Carré, Nicolas Burdin, et al. 2019. “RNA-Seq Signatures Normalized by mRNA Abundance Allow Absolute Deconvolution of Human Immune Cell Types.” Cell Reports 26 (6): 1627–1640.e7. https://doi.org/10.1016/j.celrep.2019.01.041.

Racle, Julien, and David Gfeller. 2020. “EPIC: A Tool to Estimate the Proportions of Different Cell Types from Bulk Gene Expression Data.” Edited by Sebastian Boegel, Methods in Molecular Biology,, 233–48. https://doi.org/10.1007/978-1-0716-0327-7_17.

Sosina, Olukayode A., Matthew N. Tran, Kristen R. Maynard, Ran Tao, Margaret A. Taub, Keri Martinowich, Stephen A. Semick, et al. 2021. “Strategies for Cellular Deconvolution in Human Brain RNA Sequencing Data,” no. 10:750 (August). https://doi.org/10.12688/f1000research.50858.1.