Introducing mRNA bias into simulations with scaling factors

Alexander Dietrich

11/29/2021

Using scaling factors

This package allows the user to introduce an mRNA bias into pseudo-bulk RNA-seq samples. Different cell types contain different amounts of mRNA (dendritic cells, for example, contain much more than neutrophils); this bias can be added to simulations artificially in different ways.
The scaling factors are always applied to the single-cell dataset first, altering its expression profiles accordingly, and then the pseudo-bulk samples are generated by summing up the count data from the sampled cells.
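
The order of operations described here (scale the single cells first, then sum the sampled cells) can be sketched in base R. This is only an illustration under stated assumptions, not SimBu's internal implementation: the toy matrix, the factor values and all variable names are made up for this example.

```r
set.seed(1)

# Toy single-cell count matrix: 5 genes x 6 cells (hypothetical data)
toy_counts <- matrix(
  rpois(30, 5),
  nrow = 5,
  dimnames = list(paste0("gene_", 1:5), paste0("cell_", 1:6))
)

# One scaling factor per cell (hypothetical values)
toy_factors <- c(
  cell_1 = 1, cell_2 = 1, cell_3 = 1,
  cell_4 = 2, cell_5 = 2, cell_6 = 2
)

# Step 1: multiply each cell (column) by its scaling factor
toy_scaled <- sweep(toy_counts, 2, toy_factors[colnames(toy_counts)], FUN = "*")

# Step 2: sample cells and sum their scaled profiles into one pseudo-bulk sample
sampled_cells <- sample(colnames(toy_scaled), size = 4)
pseudo_bulk <- rowSums(toy_scaled[, sampled_cells, drop = FALSE])
```

Because the scaling happens before the summation, cells with a larger factor contribute proportionally more signal to the resulting pseudo-bulk sample.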

# Example data: 1000 genes x 300 cells of Poisson-distributed counts
counts <- Matrix::Matrix(matrix(rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(rpois(3e5, 5), ncol = 300), sparse = TRUE)
# Normalize each cell (column) to a total of 1e6
tpm <- Matrix::t(1e6 * Matrix::t(tpm) / Matrix::colSums(tpm))
colnames(counts) <- paste0("cell_", 1:300)
colnames(tpm) <- paste0("cell_", 1:300)
rownames(counts) <- paste0("gene_", 1:1000)
rownames(tpm) <- paste0("gene_", 1:1000)
# Cell annotation; "spikes", "add_1" and "add_2" hold random per-cell values
annotation <- data.frame(
  "ID" = paste0("cell_", 1:300),
  "cell_type" = c(rep("T cells CD4", 150), rep("T cells CD8", 150)),
  "spikes" = runif(300),
  "add_1" = runif(300),
  "add_2" = runif(300)
)
# Build the SimBu dataset from the annotation and the raw counts
ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  name = "test_dataset"
)
#> Filtering genes...
#> Created dataset.

Pre-defined scaling factors

Some studies have proposed scaling factors for immune cells. Examples are EPIC (Racle et al. 2017) and quanTIseq (Finotello et al. 2019), deconvolution tools that correct for the mRNA bias internally using these values.

When you want to apply one of these scaling factors in your simulation (thereby increasing or decreasing the expression signals of the cell types), you can use the scaling_factor parameter. Note that these pre-defined scaling factors only offer values for a limited set of cell types, and the annotation in your dataset has to match these names exactly. All cell types in your dataset that are not present in the chosen scaling factor remain unscaled, and a warning message will appear.
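
Before running a simulation, it can help to check which cell types in your annotation would stay unscaled. The following base R sketch shows one way to do that; the factor values are placeholders (they are NOT the published EPIC or quanTIseq values), and the variable names are assumptions made for this example.

```r
# Hypothetical per-cell-type scaling factors (placeholder values only)
predefined <- c("T cells CD4" = 0.5, "B cells" = 0.8)

# Cell types as they appear in the dataset annotation
dataset_types <- c("T cells CD4", "T cells CD8")

# Cell types without an exact name match would remain unscaled
unmatched <- setdiff(dataset_types, names(predefined))
if (length(unmatched) > 0) {
  message("No scaling factor available for: ", paste(unmatched, collapse = ", "))
}
```

The comparison is an exact string match, so differences in spelling, capitalization or whitespace count as a mismatch.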

We can also try out custom scaling factors, for example increasing the expression levels of a single cell type (T cells CD8) 10-fold compared to the rest. All cell types that are not mentioned in the named list given to custom_scaling_vector are assigned a scaling factor of 1, meaning nothing changes for them.

Important: Make sure that the cell-type annotation names in your dataset are identical to those in the scaling factor! Otherwise the scaling factor will not be applied or, even worse, will be applied to a different cell type.
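
The effect of a custom scaling factor, including the default of 1 for unlisted cell types, can be illustrated with a few lines of base R. This is a sketch of the idea rather than SimBu's own code; the toy annotation, the toy matrix and the variable names are hypothetical.

```r
# Cell-type label of each cell (hypothetical toy annotation)
cell_types <- c("T cells CD4", "T cells CD4", "T cells CD8", "T cells CD8")

# Named factors, one entry per cell type that should be changed
custom <- c("T cells CD8" = 10)

# Look up each cell's factor by cell-type name; unlisted types yield NA ...
per_cell_factor <- unname(custom[cell_types])
# ... and are set to 1, meaning their expression is left unchanged
per_cell_factor[is.na(per_cell_factor)] <- 1

# Toy expression matrix: 3 genes x 4 cells, every value equal to 2
toy_expr <- matrix(2, nrow = 3, ncol = 4)
toy_expr_scaled <- sweep(toy_expr, 2, per_cell_factor, FUN = "*")
```

In this sketch a misspelled name such as "T cells CD8 " (with a trailing space) would silently fall back to a factor of 1, which is why exact name matching matters.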

Dataset specific scaling factors

You can also calculate scaling factors that depend on your provided single-cell dataset. Compared to the previous section, this gives a unique value for each cell rather than for each cell type, which may make the scaling more sensitive.

Reads and genes

Two straightforward approaches are the number of reads and the number of expressed genes/features per cell. As these values are easily obtainable from the provided count data, SimBu already calculates them during dataset generation.
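
Both quantities follow directly from the count matrix. As a rough sketch in base R (the toy matrix and variable names are assumptions for this example, and SimBu may compute these values differently internally):

```r
# Toy count matrix: 3 genes (rows) x 3 cells (columns), hypothetical values
toy_counts <- matrix(
  c(0, 3, 1,
    2, 0, 0,
    5, 4, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene_", 1:3), paste0("cell_", 1:3))
)

# Number of reads per cell: sum of all counts in a column
n_reads <- colSums(toy_counts)

# Number of expressed genes per cell: genes with at least one count
n_genes <- colSums(toy_counts > 0)
```

Here cell_1 and cell_2 both have 7 reads across 2 expressed genes, while cell_3 has 1 read from a single gene.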

You can also use other numerical measurements you have for the single cells as scaling factors, such as weight or size. Let's pretend add_1 and add_2 are such measurements. With the additional_cols parameter, they can be added to the SimBu dataset and used as scaling factors as well.
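
Conceptually, using an annotation column as a scaling factor means that each cell is multiplied by its own value from that column. The base R sketch below illustrates this, including the alignment of annotation rows to matrix columns by cell ID. It does not call SimBu; the toy data and variable names are hypothetical.

```r
# Toy annotation with one extra numeric measurement per cell (hypothetical)
toy_annotation <- data.frame(
  ID = c("cell_2", "cell_1", "cell_3"),
  add_1 = c(0.5, 2, 1)
)

# Toy count matrix: 2 genes x 3 cells, every value equal to 4
toy_counts <- matrix(
  4,
  nrow = 2, ncol = 3,
  dimnames = list(c("gene_1", "gene_2"), c("cell_1", "cell_2", "cell_3"))
)

# Align annotation rows to the column order of the matrix before scaling
idx <- match(colnames(toy_counts), toy_annotation$ID)
factors <- toy_annotation$add_1[idx]

# Multiply each cell (column) by its own measurement
toy_scaled <- sweep(toy_counts, 2, factors, FUN = "*")
```

The explicit match() step guards against annotation rows that are ordered differently from the matrix columns.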

Spike-ins

Spike-ins are another numerical measurement that can be used. Usually, the number of reads mapped to spike-in molecules per cell is given in the cell annotations. If this is the case, it can be stored in the dataset annotation using the spike_in_col parameter, which takes the name of the column of the annotation data frame that holds the spike-in information. To calculate a scaling factor from this, the number of reads is also necessary, so we add this information as well (using the read_number_col parameter).

The scaling factor with spike-ins is calculated as the fraction of reads NOT mapped to spike-in molecules, or (n_reads - n_spike_in) / n_reads for each cell.
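
As a worked example of this formula, the following base R sketch computes the spike-in scaling factor for three cells. The read numbers are invented for illustration, and the variable names simply mirror the formula.

```r
# Total reads and spike-in reads per cell (hypothetical values)
n_reads <- c(cell_1 = 1000, cell_2 = 1000, cell_3 = 500)
n_spike_in <- c(cell_1 = 100, cell_2 = 500, cell_3 = 0)

# Fraction of reads NOT mapped to spike-in molecules
spike_in_factor <- (n_reads - n_spike_in) / n_reads
```

This gives 0.9 for cell_1, 0.5 for cell_2 and 1 for cell_3: the larger the share of spike-in reads in a cell, the smaller its scaling factor.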

Census - estimate mRNA counts per cell

Census is an approach that tries to convert TPM values into relative transcript counts. This basically means that you get the mRNA counts per cell, which can differ between cell types.
Qiu et al. (2017) state in their paper that it should only be applied to TPM/FPKM-normalized data, but I also tried it with raw expression counts, which worked as well.
Census calculates a vector with a scaling value for each cell in a sample. You can switch this feature on by setting the scaling_factor parameter to census.

In our analysis we found that Census is basically a complicated way of estimating the number of expressed genes per cell. It is up to the user to decide whether to use Census or simply the number of expressed genes (as shown above) as the scaling factor.

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] SimBu_1.0.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.28.0 tidyselect_1.2.0           
#>  [3] xfun_0.37                   bslib_0.4.2                
#>  [5] purrr_1.0.1                 lattice_0.20-45            
#>  [7] colorspace_2.1-0            vctrs_0.5.2                
#>  [9] generics_0.1.3              htmltools_0.5.4            
#> [11] stats4_4.2.2                yaml_2.3.7                 
#> [13] utf8_1.2.3                  rlang_1.0.6                
#> [15] jquerylib_0.1.4             pillar_1.8.1               
#> [17] glue_1.6.2                  withr_2.5.0                
#> [19] BiocParallel_1.32.5         RColorBrewer_1.1-3         
#> [21] BiocGenerics_0.44.0         matrixStats_0.63.0         
#> [23] GenomeInfoDbData_1.2.9      lifecycle_1.0.3            
#> [25] zlibbioc_1.44.0             MatrixGenerics_1.10.0      
#> [27] munsell_0.5.0               gtable_0.3.1               
#> [29] proxyC_0.3.3                codetools_0.2-19           
#> [31] evaluate_0.20               labeling_0.4.2             
#> [33] Biobase_2.58.0              knitr_1.42                 
#> [35] IRanges_2.32.0              fastmap_1.1.1              
#> [37] GenomeInfoDb_1.34.9         parallel_4.2.2             
#> [39] fansi_1.0.4                 highr_0.10                 
#> [41] Rcpp_1.0.10                 scales_1.2.1               
#> [43] cachem_1.0.7                DelayedArray_0.24.0        
#> [45] S4Vectors_0.36.2            RcppParallel_5.1.6         
#> [47] jsonlite_1.8.4              XVector_0.38.0             
#> [49] farver_2.1.1                ggplot2_3.4.1              
#> [51] digest_0.6.31               dplyr_1.1.0                
#> [53] GenomicRanges_1.50.2        grid_4.2.2                 
#> [55] cli_3.6.0                   tools_4.2.2                
#> [57] bitops_1.0-7                magrittr_2.0.3             
#> [59] sass_0.4.5                  RCurl_1.98-1.10            
#> [61] tibble_3.1.8                tidyr_1.3.0                
#> [63] pkgconfig_2.0.3             Matrix_1.5-3               
#> [65] data.table_1.14.8           sparseMatrixStats_1.10.0   
#> [67] rmarkdown_2.20              R6_2.5.1                   
#> [69] compiler_4.2.2

References

Finotello, Francesca, Clemens Mayer, Christina Plattner, Gerhard Laschober, Dietmar Rieder, Hubert Hackl, Anne Krogsdam, et al. 2019. “Molecular and Pharmacological Modulators of the Tumor Immune Contexture Revealed by Deconvolution of RNA-Seq Data.” Genome Medicine 11 (1): 34. https://doi.org/10.1186/s13073-019-0638-6.

Qiu, Xiaojie, Andrew Hill, Jonathan Packer, Dejun Lin, Yi-An Ma, and Cole Trapnell. 2017. “Single-Cell mRNA Quantification and Differential Analysis with Census.” Nature Methods 14 (3): 309–15. https://doi.org/10.1038/nmeth.4150.

Racle, Julien, Kaat de Jonge, Petra Baumgaertner, Daniel E. Speiser, and David Gfeller. 2017. “Simultaneous Enumeration of Cancer and Immune Cell Types from Bulk Tumor Gene Expression Data.” eLife 6 (November): e26476. https://doi.org/10.7554/eLife.26476.