PhEMD is a package for comparing multiple heterogeneous single-cell samples to one another. It currently does so by first defining cell subtypes and relating them to one another using Monocle 2. It then computes Earth Mover’s Distance (EMD) between each pair of samples, incorporating information about intrinsic cell subtype dissimilarity (i.e. manifold distance between cell subtypes) and differences between samples with respect to relative abundance of each cell subtype. PhEMD uses these pairwise distances to construct a network of single-cell samples that may be visually related in 2D or 3D using a diffusion map and partitioned to identify groups of similar samples.
PhEMD requires R version >= 3.4.0 (recommended 3.5.0+), Bioconductor version >= 3.5 (recommended 3.7+), and Monocle 2 version >= 2.4.0 (recommended 2.8.0)
###Install from Bioconductor
BiocManager::install("phemd")
###Install from Github (though direct installation from Bioconductor is preferred)
library(devtools)
install_github("wschen/phemd")
###Load library after installation
library('phemd')
library('monocle')
PhEMD expects single-cell data to be represented as an R list of samples. Each sample (i.e. list element) is expected to be a matrix of dimension num_cells x num_markers, where markers may represent genes or cytometry protein markers. For this vignette, we will be demonstrating our analysis pipeline on a melanoma dataset consisting of tumor-infiltrating immune cell scRNA-seq data (selected genes) that were log-transformed following TPM-normalization (first published by Tirosh et al., 2016).
We first start by creating a PhEMD data object, specifying the multi-sample expression data (R list), marker names (i.e. column names of the data matrices in the list of expression data), and sample names (in the order they appear in the list of expression data).
load('melanomaData.RData')
myobj <- createDataObj(all_expn_data, all_genes, as.character(snames))
We can optionally remove samples in the PhEMD data object that have fewer than min_sz number of cells as follows:
myobj <- removeTinySamples(myobj, min_sz = 20)
## [1] "Mel78 removed because only contains 3 cells"
## [1] "Mel59 removed because only contains 12 cells"
Note that samples that don’t meet the meet the minimum cell yield criteria are removed from rawExpn(myobj) and from the list of sample names in sampleNames(myobj).
Next, aggregate data from all samples into a matrix that is stored in the PhEMD data object (in slot ‘data_aggregate’). This aggregated data will then be used for initial cell subtype definition and embedding. If there are more cells collectively in all samples than max_cells, an equal number of cells from each sample will be subsampled and stored in pooledCells(myobj).
myobj <- aggregateSamples(myobj, max_cells=12000)
Now that we have aggregated single-cell data from all samples, we are ready to perform cell subtype definition and dimensionality reduction to visually and spatially relate cells and cell subtypes. For this, we use Monocle 2. Before we begin, we first perform feature selection by selecting 44 important genes. Suggestions on how to choose important genes can be found here: http://cole-trapnell-lab.github.io/monocle-release/docs/#trajectory-step-1-choose-genes-that-define-a-cell-s-progress
myobj <- selectFeatures(myobj, selected_genes)
We are now ready to generate a Monocle 2 embedding. Our embedCells() function is a wrapper function for the reduceDimension() function in Monocle 2. For our example dataset, we specify the expression distribution model as ‘gaussianff’ as is recommended in the Monocle 2 tutorial for log-transformed scRNA-seq TPM values (http://cole-trapnell-lab.github.io/monocle-release/docs/#choosing-a-distribution-for-your-data-required). ‘negbinomial_sz’ is the recommended data type for most unnormalized scRNA-seq data (raw read counts) and ‘gaussianff’ is recommended for log-transformed data or arcsin-transformed mass cytometry data. See above link for more details.
Additional parameters may be passed to Monocle 2 reduceDimension() as optional named parameters in embed_cells(). We found that Monocle 2 is robust to a range of parameters. Sigma can be thought of as a “noise” parameter and we empirically found that sigma in the range of [0.01, 0.1] often works well for log-transformed scRNA-seq data or normalized CyTOF data. Greater values of sigma generally result in fewer total number of clusters. See Monocle 2 publication (Qiu et al., 2017) for additional details on parameter selection.
# generate 2D cell embedding and cell subtype assignments
myobj <- embedCells(myobj, data_model = 'gaussianff', pseudo_expr=0, sigma=0.02)
# generate pseudotime ordering of cells
myobj <- orderCellsMonocle(myobj)
The result of the code above is a Monocle 2 object stored in monocleInfo(myobj). This object contains cell subtype and pseudotime assignments for each cell in the aggregated data matrix (stored in pooledCells(myobj)). A 2D embedding of these cells has also been generated. We can visualize the embedding by writing them to file in this way:
cmap <- plotEmbeddings(myobj, cell_model='monocle2')
To visualize the expression profiles of the cell subtypes, we can plot a heatmap and save to file as such:
plotHeatmaps(myobj, cell_model='monocle2', selected_genes=heatmap_genes)
Now that we have identified a comprehensive set of cell subtypes across all single-cell samples and related them in a low-dimensional embedding by aggregating cells from all samples, we want to perform deconvolution to determine the abundance of each cell subtype on a per sample basis. To do so, we call this function:
# Determine cell subtype breakdown of each sample
myobj <- clusterIndividualSamples(myobj)
The results of this process are stored in celltypeFreqs(myobj). Row i column j represents the fraction of all cells in sample i assigned to cell subtype j.
To compare single-cell samples, we use Earth Mover’s Distance, which is a metric that takes into account both the difference in relative frequencies of matching cell subtypes (e.g. % of all cells in each sample that are CD8+ T-cells) and the dissimilarity of the cell subtypes themselves (e.g. intrinsic dissimilarity between CD8+ and CD4+ T-cells). To compute the intrinsic dissimilarity between cell subtypes, we call the following function:
# Determine (dis)similarity of different cell subtypes
myobj <- generateGDM(myobj)
generateGDM() stores the pairwise dissimilarity (i.e. “ground-distance” or “tree-distance”) between cell subtypes in GDM(myobj).
We are now ready to compare single-cell samples using EMD. To do so, we simply call the function compareSamples():
# Perform inter-sample comparisons using EMD
my_distmat <- compareSamples(myobj)
compareSamples() returns a distance matrix representing the pairwise EMD between single-cell samples; my_distmat[i,j] represents the dissimilarity between samples i and j (i.e. samples represented by rows i and j in celltypeFreqs(myobj)). We can use this distance matrix to identify similar groups of samples as such:
## Identify similar groups of inhibitors
group_assignments <- groupSamples(my_distmat, distfun = 'hclust', ncluster=5)
We can also use the PhEMD-based distance matrix to generate an embedding of single-cell samples, colored by group assignments.
dmap_obj <- plotGroupedSamplesDmap(my_distmat, group_assignments, pt_sz = 1.5)
To retrieve the cell subtype distribution on a per-sample basis, use the following function. Histograms can be subsequently plotted for a given sample as desired.
# Plot cell subtype distribution (histogram) for each sample
sample.cellfreqs <- getSampleHistsByCluster(myobj, group_assignments, cell_model='monocle2')
To plot cell subtype histograms summarizing groups of similar samples (bin-wise mean of each cell subtype across all samples assigned to a particular group), use the following plotting function:
# Plot representative cell subtype distribution for each group of samples
plotSummaryHistograms(myobj, group_assignments, cell_model='monocle2', cmap,
ncol.plot = 5, ax.lab.sz=1.3, title.sz=2)
To plot cell yield of each samples as a barplot, use the following function:
# Plot cell yield of each experimental condition
plotCellYield(myobj, group_assignments, font_sz = 0.7, w=8, h=5)
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] splines stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] phemd_1.4.0 monocle_2.16.0 DDRTree_0.1.5
## [4] irlba_2.3.3 VGAM_1.1-2 ggplot2_3.3.0
## [7] Biobase_2.48.0 BiocGenerics_0.34.0 Matrix_1.2-18
## [10] BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 RcppEigen_0.3.3.7.0
## [3] plyr_1.8.6 igraph_1.2.5
## [5] lazyeval_0.2.2 sp_1.4-1
## [7] RcppHNSW_0.2.0 listenv_0.8.0
## [9] densityClust_0.3 GenomeInfoDb_1.24.0
## [11] fastICA_1.2-2 digest_0.6.25
## [13] htmltools_0.4.0 magick_2.3
## [15] viridis_0.5.1 gdata_2.18.0
## [17] phateR_1.0.0 magrittr_1.5
## [19] cluster_2.1.0 ROCR_1.0-7
## [21] openxlsx_4.1.4 limma_3.44.0
## [23] globals_0.12.5 matrixStats_0.56.0
## [25] docopt_0.6.1 xts_0.12-0
## [27] colorspace_1.4-1 rappdirs_0.3.1
## [29] ggrepel_0.8.2 haven_2.2.0
## [31] xfun_0.13 dplyr_0.8.5
## [33] sparsesvd_0.2 crayon_1.3.4
## [35] RCurl_1.98-1.2 jsonlite_1.6.1
## [37] hexbin_1.28.1 survival_3.1-12
## [39] zoo_1.8-7 ape_5.3
## [41] glue_1.4.0 gtable_0.3.0
## [43] zlibbioc_1.34.0 XVector_0.28.0
## [45] leiden_0.3.3 DelayedArray_0.14.0
## [47] car_3.0-7 future.apply_1.5.0
## [49] SingleCellExperiment_1.10.0 DEoptimR_1.0-8
## [51] abind_1.4-5 VIM_5.1.1
## [53] scales_1.1.0 ggplot.multistats_1.0.0
## [55] pheatmap_1.0.12 ggthemes_4.2.0
## [57] Rcpp_1.0.4.6 viridisLite_0.3.0
## [59] laeken_0.5.1 reticulate_1.15
## [61] rsvd_1.0.3 foreign_0.8-79
## [63] proxy_0.4-24 tsne_0.1-3
## [65] vcd_1.4-7 htmlwidgets_1.5.1
## [67] httr_1.4.1 FNN_1.1.3
## [69] gplots_3.0.3 RColorBrewer_1.1-2
## [71] ellipsis_0.3.0 Seurat_3.1.5
## [73] ica_1.0-2 farver_2.0.3
## [75] pkgconfig_2.0.3 uwot_0.1.8
## [77] nnet_7.3-14 labeling_0.3
## [79] tidyselect_1.0.0 rlang_0.4.5
## [81] reshape2_1.4.4 munsell_0.5.0
## [83] cellranger_1.1.0 tools_4.0.0
## [85] ranger_0.12.1 ggridges_0.5.2
## [87] evaluate_0.14 stringr_1.4.0
## [89] maptree_1.4-7 yaml_2.2.1
## [91] npsurv_0.4-0 transport_0.12-2
## [93] knitr_1.28 fitdistrplus_1.0-14
## [95] zip_2.0.4 robustbase_0.93-6
## [97] caTools_1.18.0 purrr_0.3.4
## [99] RANN_2.6.1 pbapply_1.4-2
## [101] future_1.17.0 nlme_3.1-147
## [103] slam_0.1-47 pracma_2.2.9
## [105] compiler_4.0.0 png_0.1-7
## [107] plotly_4.9.2.1 curl_4.3
## [109] lsei_1.2-0 e1071_1.7-3
## [111] smoother_1.1 tibble_3.0.1
## [113] stringi_1.4.6 RSpectra_0.16-0
## [115] forcats_0.5.0 lattice_0.20-41
## [117] HSMMSingleCell_1.7.0 vctrs_0.2.4
## [119] pillar_1.4.3 lifecycle_0.2.0
## [121] BiocManager_1.30.10 combinat_0.0-8
## [123] lmtest_0.9-37 RcppAnnoy_0.0.16
## [125] data.table_1.12.8 cowplot_1.0.0
## [127] bitops_1.0-6 patchwork_1.0.0
## [129] GenomicRanges_1.40.0 R6_2.4.1
## [131] pcaMethods_1.80.0 bookdown_0.18
## [133] KernSmooth_2.23-17 gridExtra_2.3
## [135] rio_0.5.16 IRanges_2.22.0
## [137] codetools_0.2-16 boot_1.3-25
## [139] MASS_7.3-51.6 gtools_3.8.2
## [141] assertthat_0.2.1 destiny_3.2.0
## [143] SummarizedExperiment_1.18.0 withr_2.2.0
## [145] sctransform_0.2.1 qlcMatrix_0.9.7
## [147] S4Vectors_0.26.0 GenomeInfoDbData_1.2.3
## [149] hms_0.5.3 grid_4.0.0
## [151] rpart_4.1-15 tidyr_1.0.2
## [153] class_7.3-17 rmarkdown_2.1
## [155] carData_3.0-3 Rtsne_0.15
## [157] TTR_0.23-6 scatterplot3d_0.3-41