spatialHeatmap 2.4.0
The primary utility of the spatialHeatmap package is the generation of spatial heatmaps (SHM) for visualizing cell-, tissue- and organ-specific abundance patterns of biological molecules (e.g. RNAs) in anatomical images (Zhang et al. 2022). This is useful for identifying molecules with spatially enriched (SE) abundance patterns as well as clusters and/or network modules composed of molecules sharing similar abundance patterns such as similar gene expression patterns. These functionalities are introduced in the main vignette of the spatialHeatmap package. The following describes extended functionalities for integrating tissue with single cell data by co-visualizing them in composite plots that combine spatial heatmaps with embedding plots of high-dimensional data. The resulting spatial context information is important for gaining insights into the tissue-level organization of single cell data.
The required quantitative assay data, such as gene expression values, can be
provided in a variety of widely used tabular data structures (e.g.
data.frame
, SummarizedExperiment
or SingleCellExperiment
). The corresponding
anatomic images need to be supplied as annotated SVG (aSVG) images and can be stored in a specific S4 class SVG
. The
creation of aSVGs is described in the main vignette of this package. For the
embedding plots of single cell data, several dimensionality reduction
algorithms (e.g. PCA, UMAP or tSNE) are supported. To associate single cells
with their source tissues, the user can choose among three major methods including
annotation-based, manual and automated methods (Figure 1). Similar
to other functionalities in spatialHeatmap, these functionalities are available within
R as well as the corresponding Shiny app (Chang et al. 2021).
To co-visualize single cell data with tissue features (Figure
1), the individual cells of the single cell data are mapped
via their group labels to the corresponding tissue features in an aSVG image. If
the feature labels in an aSVG are different than the corresponding group labels
used for the single cell data, e.g. due to variable terminologies, a
translation map can be used to avoid manual relabelling. Throughout this
vignette the usage of the term feature is a generalization referring in most
cases to tissues or organs. For the implementation of the co-visualization
tool, spatialHeatmap takes advantage of efficient and reusable S4 classes for both assay data and aSVGs respectively. The former includes the Bioconductor
core data structures such as the widely used SingleCellExperiment
(SCE
)
container illustrated in Figure 1.1 (Amezquita et al. 2020). The slots
assays
, colData
, rowData
and reducedDims
in an SCE
contain expression data,
cell metadata, molecule metadata and reduced dimensionality embedding results,
respectively. The cell group labels are stored in the colData
slot as shown
in Figure 1.1. The S4 class SVG
(Figure 1.3) is developed specifically in spatialHeatmap
for storing aSVG instances. The two most important slots coordinate
and attribute
stores the aSVG feature coordinates and respective attributes such as fill color, line withs, etc. respectively, while other slots dimension
, svg
, and raster
stores image dimension, aSVG file paths, and raster image paths respectively. For handling cell-to-tissue grouping
information, three general methods are available including (a)
annotation-based, (b) manual and (c) automated.
The annotation-based and manual methods are similar by using known cell group
labels. The main difference is how the cell labels are provided. In the
annotation-based method, existing group labels are available and can be
uploaded and/or stored in the SCE
object, as is the case in some of the SCE
instances provided by the scRNAseq
package (Risso and Cole 2022). The manual method
allows users to create the cell to tissue associations one-by-one or import
them from a tabular file. In contrast to this, the automated method aims to
assign single cells to the corresponding source tissues computationally by a
co-clustering algorithm (Figure 8). This co-clustering is
experimental and requires bulk expression data that are obtained from the
tissues represented in the single cell data. The grouping information is
visualized by using for each group the same color in both the single cell
embedding plot and the tissue spatial heatmap plot (Figure
1.5). The colors can represent any type of custom or numeric
information. In a typical use case, either fixed tissue-specific colors or a
heat color gradient is used that is proportional to the numeric expression
information obtained from the single cell or bulk expression data of a chosen
gene. When the expression values among groups are very similar, toggling
between the two coloring option is important to track the tissue origin in
the single cell data. To color by single cell data, one often wants to first
summarize the expression of a given gene across the cells within each group via a
meaningful summary statistics, such as mean or median. Cells and tissues with
the same group label will be colored the same. When coloring by tissues
the color used for each tissue feature will be applied to the corresponding
cell groups represented in the embedding plot.
The spatialHeatmap
package can be installed with the BiocManager::install
command.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("spatialHeatmap")
Next, the packages required for running the sample code in this vignette need to be loaded.
library(spatialHeatmap); library(SummarizedExperiment); library(scran); library(scater); library(igraph); library(SingleCellExperiment); library(BiocParallel)
library(kableExtra)
The following lists the vignette(s) of this package in an HTML browser. Clicking the name of the corresponding vignette will open it.
browseVignettes('spatialHeatmap')
To reduce runtime, intermediate results can be cached under ~/.cache/shm
.
cache.pa <- '~/.cache/shm' # Set path of the cache directory
To obtain for examples with randomized data or parameters always the same results, a fixed seed is set.
set.seed(10)
This quick start example is demonstrated on cell-to-tissue mapping by using a single cell data set from oligodendrocytes of mouse brain (Marques et al. 2016). This data set is obtained from the scRNAseq
(Risso and Cole 2022) package with minor modificatons, which is included in spatialHeatmap
.
Prior to co-visualization the single cell data is pre-processed by the process_cell_meta
function that applies common QC, normalization and dimensionality reduction routines.
The details of these pre-processing methods are described in the corresponding
help file. Additional background information on these topics can be found in the
OSCA tutorial.
sce.pa <- system.file("extdata/shinyApp/example", "cell_mouse_brain.rds", package="spatialHeatmap")
sce <- readRDS(sce.pa)
sce.dimred.quick <- process_cell_meta(sce, qc.metric=list(subsets=list(Mt=rowData(sce)$featureType=='mito'), threshold=1))
colData(sce.dimred.quick)[1:3, 1:2]
## DataFrame with 3 rows and 2 columns
## label age
## <character> <character>
## C1.1772078.029.F11 hypothalamus p22
## C1.1772089.202.E04 SN.VTA p22
## C1.1772099.091.D10 dorsal.horn p20
To visualize single cell data in spatial heatmaps, the single cell expression values
are summarized using common summary statistics such as mean or median by their source tissue grouping, which is in the label
column of colData
slot.
sce.aggr.quick <- aggr_rep(sce.dimred.quick, assay.na='logcounts', sam.factor='label', aggr='mean')
The summarized values are then used to color the tissue features of the corresponding
aSVG file of mouse brain that is included in the spatialHeatmap
package. The
function read_svg
is used for parsing the aSVG and storing it in an SVG
object svg.mus.brain
.
svg.mus.brain.pa <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
svg.mus.brain <- read_svg(svg.mus.brain.pa)
A subset of features and related attributes are returned from svg.mus.brain
, where fill
and stroke
refer to color and line width respectively.
tail(attribute(svg.mus.brain)[[1]])[, 1:4]
## # A tibble: 6 × 4
## feature id fill stroke
## <chr> <chr> <chr> <dbl>
## 1 brainstem UBERON_0002298 none 0.05
## 2 midbrain UBERON_0001891 none 0.05
## 3 dorsal.plus.ventral.thalamus UBERON_0001897 none 0.05
## 4 hypothalamus UBERON_0001898 none 0.05
## 5 nose UBERON_0000004 none 0.05
## 6 corpora.quadrigemina UBERON_0002259 none 0.05
To map cell group labels to aSVG features, a list
with named components is used, where cell labels are in name slots. Note, each cell label can be matched to multiple aSVG features but not vice versa.
lis.match.quick <- list(hypothalamus=c('hypothalamus'), cortex.S1=c('cerebral.cortex', 'nose'))
The co-visualization is created on gene Eif5b
using the function covis
. The embedding plot includes all cells before aggregation. The hypothalamus
and cortex.S1
cells are colored according to their respecitive aggregated expression values in data=sce.aggr.quick
. In the nearby spatial heatmap plot, aSVG features are filled by the same color as the matching cells defined in lis.match.quick
. The cell.group
argument indicates cell group labels in the colData
slot of sce.aggr.quick
, tar.cell
suggests the target cell groups to visualize, and dimred
specifies the embedding plot.
shm.lis.quick <- covis(svg=svg.mus.brain, data=sce.aggr.quick, ID=c('Eif5b'), sce.dimred=sce.dimred.quick, dimred='PCA', cell.group='label', tar.cell=names(lis.match.quick), lis.rematch=lis.match.quick, assay.na='logcounts', bar.width=0.11, dim.lgd.nrow=1, height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=3)
The annotation-based mapping uses existing cell-to-tissue labels. Those can be imported
(e.g. from a tabular file) and then stored in the colData
slot of an SCE
object.
Alternatively, an SCE
object might already contain this information. This is the
case in some of the SCE
objects provided by the scRNAseq
package (Risso and Cole 2022).
The Quick Start is domenstrated using the annotation-based method, since the single cell data are downloaded from scRNAseq
package. In this example, the tissue-to-cell mapping will be showcased. The single cell data and aSVG file are the same as the Quick Start, and the bulk data are modified from a research on mouse cerebellar development(Vacher et al. 2021).
The bulk count data are imported in a SummarizedExperiment
and partial are shown. Note, replicates
are indicated by the same sample names (e.g. cerebral.cortex
).
blk.mus.pa <- system.file("extdata/shinyApp/example", "bulk_mouse_cocluster.rds", package="spatialHeatmap")
blk.mus <- readRDS(blk.mus.pa)
assay(blk.mus)[1:3, 1:5]
## cerebral.cortex hippocampus hypothalamus cerebellum cerebral.cortex
## AI593442 177 256 50 24 285
## Actr3b 513 1465 228 244 666
## Adcy1 701 1243 57 1910 836
colData(blk.mus)[1:3, , drop=FALSE]
## DataFrame with 3 rows and 1 column
## tissue
## <character>
## cerebral.cortex cerebral.cortex
## hippocampus hippocampus
## hypothalamus hypothalamus
The bulk count data are normalized using the ESF
option, which refers to estimateSizeFactors
from DESeq2 (Love, Huber, and Anders 2014). The expression values for each gene are summarized by mean across replicates (here aggr='mean'
).
# Normalization.
blk.mus.nor <- norm_data(data=blk.mus, norm.fun='ESF', log2.trans=TRUE)
# Aggregation.
blk.mus.aggr <- aggr_rep(blk.mus.nor, sam.factor='tissue', aggr='mean')
assay(blk.mus.aggr)[1:2, ]
The single cell data from the Quick Start are used in this example. Its metadata in colData
are partially shown.
colData(sce.dimred.quick)[1:3, 1:2]
## DataFrame with 3 rows and 2 columns
## label age
## <character> <character>
## C1.1772078.029.F11 hypothalamus p22
## C1.1772089.202.E04 SN.VTA p22
## C1.1772099.091.D10 dorsal.horn p20
The single cell data are plotted at the UMAP dimensions using plot_dim
, where cells are represented by dots and are colored by the grouping information stored in the colData
slot of the SCE
object, here color.by="label"
.
plot_dim(sce.dimred.quick, color.by="label", dim='UMAP')
The aSVG instance of mouse brain from the Quick Start are used. Partial of the aSVG features are shown.
tail(attribute(svg.mus.brain)[[1]])[1:3, 1:4]
## # A tibble: 3 × 4
## feature id fill stroke
## <chr> <chr> <chr> <dbl>
## 1 brainstem UBERON_0002298 none 0.05
## 2 midbrain UBERON_0001891 none 0.05
## 3 dorsal.plus.ventral.thalamus UBERON_0001897 none 0.05
Same with conventions in the main vignette, at least one bulk tissue identifier should match with aSVG features so as to successfully plot spatial heatmaps.
colnames(blk.mus) %in% attribute(svg.mus.brain)[[1]]$feature
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
In practice the labels used for features in the aSVG image and the cell group labels of the single cell data may not be the same. To resolve this without manual relabeling, a translation list is used to make them match. In tissue-to-cell mapping, the feature and cell labels are expected to be the names and elements of the list components, respectively.
lis.match.blk <- list(cerebral.cortex=c('cortex.S1'), hypothalamus=c('corpus.callosum', 'hypothalamus'))
The following plots the corresponding co-visualization for sample gene Actr3b. The legend under the embedding plot shows the cell labels in the matching list (lis.match.blk
). The source tissue information is indicated by using the same colors in the embedding and tissue plots on the left and right, respectively. In contrast to the Quick Start, the tar.bulk
indicates target bulk tissues to show.
covis(svg=svg.mus.brain, data=blk.mus.aggr, ID=c('Actr3b'), sce.dimred=sce.dimred.quick, dimred='UMAP', cell.group='label', tar.bulk=names(lis.match.blk), lis.rematch=lis.match.blk, bar.width=0.09, dim.lgd.nrow=2, dim.lgd.text.size=12, height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=2)
In scenarios where expression values are similar across tissues, the mapping between cells and tissues can be indicated by fixed contrasting colors by setting profile=FALSE
.
covis(svg=svg.mus.brain, data=blk.mus.aggr, ID=c('Actr3b'), profile=FALSE, sce.dimred=sce.dimred.quick, dimred='UMAP', cell.group='label', lis.rematch=lis.match.blk, tar.bulk=names(lis.match.blk), bar.width=0.09, dim.lgd.nrow=2, dim.lgd.text.size=12, height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=2)
To provide additional flexibility for defining cell groupings, several manual
options are provided. Here users can assign cell groups manually or by
clustering methods for single cell embedding data that are often used in the
analysis of single cell data. The resulting cell grouping or cluster
information needs to be stored in a tabular file, that will be imported into an
SCE
object (here manual_group
function). The following demonstration uses
the same single cell and aSVG instance as the annotation example above. The only
difference is an additional clustering step. For demonstration purposes a small
example of a cluster file is included in the spatialHeatmap
package. In this
case the group labels were created by the cluster_cell
function. The details
of this function are available in its help file. The cluster file contains at
least two columns: a column (here cell
) with single cell identifiers used under
colData
and a column (here cluster
) with the cell group labels. For practical
reasons of building this vignette a pure manual example could not be used here.
However, the chosen clustering example can be easily adapted to manual or
hybrid grouping approaches since the underlying tabular data structure is the
same for both that can be generated in most text or spreadsheet programs.
manual.clus.mus.sc.pa <- system.file("extdata/shinyApp/example", "manual_cluster_mouse_brain.txt", package="spatialHeatmap")
manual.clus.mus.sc <- read.table(manual.clus.mus.sc.pa, header=TRUE, sep='\t')
manual.clus.mus.sc[1:3, ]
## cell cluster
## 1 C1.1772078.029.F11 clus7
## 2 C1.1772089.202.E04 clus7
## 3 C1.1772099.091.D10 clus1
The manual_group
function can be used to append the imported group labels to the colData
slot of an SCE
object without interfering with other functions and methods operating
on SCE
objects.
sce.clus <- manual_group(sce=sce.dimred.quick, df.group=manual.clus.mus.sc, cell='cell', cell.group='cluster')
colData(sce.clus)[1:3, c('cluster', 'label', 'expVar')]
## DataFrame with 3 rows and 3 columns
## cluster label expVar
## <character> <character> <character>
## C1.1772078.029.F11 clus7 hypothalamus control
## C1.1772089.202.E04 clus7 SN.VTA control
## C1.1772099.091.D10 clus1 dorsal.horn control
An embedding plot of single cell data is created. The cells represented as dots are
colored by the grouping information stored in the cluster
column of the colData
slot of SCE
.
plot_dim(sce.clus, color.by="cluster", dim='UMAP')
The same mouse brain aSVG as above is used here.
tail(attribute(svg.mus.brain)[[1]])[1:3, 1:4]
## # A tibble: 3 × 4
## feature id fill stroke
## <chr> <chr> <chr> <dbl>
## 1 brainstem UBERON_0002298 none 0.05
## 2 midbrain UBERON_0001891 none 0.05
## 3 dorsal.plus.ventral.thalamus UBERON_0001897 none 0.05
Similarly as above, a mapping list is used to match the cell clusters with aSVG features.
lis.match.clus <- list(clus1=c('cerebral.cortex'), clus3=c('brainstem', 'medulla.oblongata'))
This example is demonstrated through cell-to-tissue mapping, so gene expression values need to be summarized for the cells within each group label. Any grouping column in colData
can be used as labels for summarizing. In this manual method, the cluster
labels are chosen.
If additional experimental variables are provided, the summarizing will consider them as well (here expVar
). The following example uses the clsuter
and expVar
columns as group labels and experimental variables, respectively.
sce.clus.aggr <- aggr_rep(sce.clus, assay.na='logcounts', sam.factor='cluster', con.factor='expVar', aggr='mean')
colData(sce.clus.aggr)[1:3, c('cluster', 'label', 'expVar')]
## DataFrame with 3 rows and 3 columns
## cluster label expVar
## <character> <character> <character>
## clus7__control clus7 hypothalamus control
## clus1__control clus1 dorsal.horn control
## clus5__control clus5 corpus.callosum control
The co-visualization is plotted for gene Tcea1
. In this example the coloring
is based on the gene expression summary for each cell group obtained by clustering.
Completely manual groupings can be provided the same way.
covis(svg=svg.mus.brain, data=sce.clus.aggr, ID=c('Tcea1'), sce.dimred=sce.clus, dimred='UMAP', cell.group='cluster', assay.na='logcounts', tar.cell=names(lis.match.clus), lis.rematch=lis.match.clus, bar.width=0.09, dim.lgd.nrow=1, height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=3)
If both single cell and bulk gene expression data are available for the same or overlapping tissues then co-clustering can be used to assign cells to tissues automatically (Figure 8). Subsequently, the predicted cell-to-tissue assignments can be used for co-visualizing single cell embedding data with tissue level bulk data. This approach is useful for predicting the source tissues of unassigned cells without prior knowledge as is required for the annotation and manual approaches introduced above. While attractive there are various challenges to overcome to reliably co-cluster single cell data with the corresponding tissue-level bulk data. This is due to the different properties of single cell and bulk gene expression data, such as lower sensitivity and higher sparsity in single cell compared to bulk data. This section introduces a co-clustering method that is largely based on parameter optimization including three major steps. First, both data are preprocessed to retain the most reliable expression values (Figure 8.1a-b). Second, the genes in the bulk data are reduced to those robustly expressed in the single cell data (Figure 8.1c). Third, bulk and cell data are co-clustered by using optimal default settings (Table 1) that are obtained through optimization on real data with known cell-to-tissue assignments. The following introduces the three steps of this method in more detail using the example of RNA-Seq data.
The raw count matrices of bulk and single cells are column-wise combined for joint normalization (Figure 8.1a). After separated from bulk data, the single cell data are reduced to genes with robust expression across X% cells and to cells with robust expression across Y% genes (Figure 8.1b). In the bulk data, genes are filtered according to expression values \(\ge\) A at a proportion of \(\ge\) p across bulk samples and a coefficient of variance (CV) between CV1 and CV2 (Figure 8.1b).
The bulk data are subsetted to the same genes as the single cell data (Figure 8.1c). This and the previous filtering steps in single cell data reduce the sparsity in the single cell data and the bulk data are made more compareable to the single cell data by subsetting it to the same genes.
Bulk and single cell data are column-wise combined for joint embedding using PCA (UMAP, or other). Co-clustering is performed on the embedding data and three types of clusters are produced. First, only one bulk tissue is co-clustered with cells ((Figure 8.2a). This bulk is assigned to all cells in the same cocluster. Second, multiple bulk tissues are co-clustered with multiple cells (Figure 8.2b). The bulk tissues are assigned according to the nearest neighbor bulk of each cell, which is measured by Spearman’s correlation coefficients. Third, no bulk tissue is co-clustered with cells (Figure 8.2c). All these cells are un-labeled, which are candidates for discovering novel cell types.
After co-clustering, cells are labeled by bulk tissues or un-labeled (Figure 8.3) and these labels are used for co-visualization (Figure 8.4), where cells in embedding plot and matching tissues in spatial heatmap are filled by same colors.
After revising the method outline for the co-clustering please revise the figure 7 above accordingly. The legend can just refer to the main text to avoid duplication. PLEASE keep it simple and clear. If a figure raises more questions than it answers then it defeats the purpose of guiding the reader and providing clarity.
To obtain reasonably robust default settings for co-clustering, four parameters shown in Table 1 are optimized, where bold text indicates optimal settings that are treated as robust default settings. The reason to choose these parameters is they are most relevant to the co-clustering step. The details of this optimization are given here. The following demonstration applies the default settings (bold in Table 1) using single cell and bulk data from mouse brain (Vacher et al. 2021; Ortiz et al. 2020). Both data sets have been simplified for demonstraton purposes.
Parameter | Settings | Description |
---|---|---|
dimensionReduction | denoisePCA (PCA, scran), runUMAP (UMAP, scater) | Dimension reduction methods |
topDimensions | 5 to 80 (11, 19, 33, 22) | Number of top dimensions selected for co-clustering |
graphBuilding | buildKNNGraph (knn), buildSNNGraph (snn) (scran) | Methods for building a graph where nodes are cells and edges are connections between nearest neighbors |
clusterDetection | cluster_walktrap (wt), cluster_fast_greedy (fg), cluster_leading_eigen (le) (igraph) | Methods for partitioning the graph to generate clusters |
To obtain reproducible results, a fixed seed is set for generating random numbers.
set.seed(10)
The bulk data (blk.mus
) is the same with the tissue-to-cell mapping in the Annotation-based method. The following imports example single cell from the spatialHeatmap
package and shows its partial metadata in colData
slot.
sc.mus.pa <- system.file("extdata/shinyApp/example", "cell_mouse_cocluster.rds", package="spatialHeatmap")
sc.mus <- readRDS(sc.mus.pa)
colData(sc.mus)[1:3, , drop=FALSE]
## DataFrame with 3 rows and 1 column
## cell
## <character>
## isocort isocort
## isocort isocort
## isocort isocort
Bulk and single cell raw count data are jointly normalized by the function norm_cell
, which is a wrapper of computeSumFactors
from scran
(Lun, McCarthy, and Marioni 2016). com=FALSE
means bulk and single cells are separated after normalization for subsequent separate filtering.
#mus.lis.nor <- read_cache(cache.pa, 'mus.lis.nor')
#if (is.null(mus.lis.nor)) {
mus.lis.nor <- norm_cell(sce=sc.mus, bulk=blk.mus, com=FALSE)
# save_cache(dir=cache.pa, overwrite=TRUE, mus.lis.nor)
#}
The normalized single cell and bulk data (log2-scale) are filtered to reduce sparsity
and low expression values. In the bulk data, replicates are first aggregated by taking means using the function aggr_rep
. Then the filtering retains genes in
the bulk data to have expression values of \(\ge\) 1 at a proportion of \(\ge 10\%\) (pOA
) across bulk samples and a coefficient of variance (CV
) between \(0.1-50\) (Gentleman et al. 2018).
In the single cell data the filtering is set to retain genes with expression values of \(\ge 1\) (cutoff=1
) in \(\ge 1\%\) (p.in.gen=0.01
) of cells. In
addition, only those cells are retained that have expression values \(\ge 1\) (cutoff=1
)
in \(\ge 10\%\) (p.in.cell=0.1
) of their genes.
# Aggregate bulk replicates
blk.mus.aggr <- aggr_rep(data=mus.lis.nor$bulk, assay.na='logcounts', sam.factor='sample', aggr='mean')
# Filter bulk
blk.mus.fil <- filter_data(data=blk.mus.aggr, pOA=c(0.1, 1), CV=c(0.1, 50), verbose=FALSE)
# Filter cell and subset bulk to genes in cell
blk.sc.mus.fil <- filter_cell(sce=mus.lis.nor$cell, bulk=blk.mus.fil, cutoff=1, p.in.cell=0.1, p.in.gen=0.01, verbose=FALSE)
Compared to bulk RNA-Seq data, single cell data has a much higher level of
sparsity. This difference is reduced by the above filtering and then subsetting
the bulk data to the genes remaining in the filtered single cell data. This
entire process is accomplished by the filter_cell
function.
The same aSVG instance of mouse brain as in the Quick Start section above is used here and the aSVG importing is omitted for brevity.
tail(attribute(svg.mus.brain)[[1]])[1:3, 1:4] # Partial features are shown.
## # A tibble: 3 × 4
## feature id fill stroke
## <chr> <chr> <chr> <dbl>
## 1 brainstem UBERON_0002298 none 0.05
## 2 midbrain UBERON_0001891 none 0.05
## 3 dorsal.plus.ventral.thalamus UBERON_0001897 none 0.05
The co-clustering process is performed by calling the function cocluster
. In the following, default settings obtained from optimization are shown, where min.dim
, dimred
, graph.meth
, and cluster
refers to topDimensions
, dimensionReduction
, graphBuilding
, and clusterDetection
in Table 1 respectively.
#coclus.mus <- read_cache(cache.pa, 'coclus.mus')
#if (is.null(coclus.mus)) {
coclus.mus <- cocluster(bulk=blk.sc.mus.fil$bulk, cell=blk.sc.mus.fil$cell, min.dim=11, dimred='PCA', graph.meth='knn', cluster='fg')
# save_cache(dir=cache.pa, overwrite=TRUE, coclus.mus)
#}
The co-clustering results are stored in the colData
slot of coclus.mus
. The cluster
indicates cluster labels, the bulkCell
indicates bulk tissues or single cells, the sample
suggests original labels of bulk and cells, the assignedBulk
refers to bulk tissues assigned to cells with none
suggesting un-assigned, and the similarity
refers to Spearman’s correlation coefficients for assignments between bulk and cells, which is a measure of assignment strigency.
colData(coclus.mus)
## DataFrame with 4458 rows and 7 columns
## cluster bulkCell sample assignedBulk
## <character> <character> <character> <character>
## cerebral.cortex clus8 bulk cerebral.cortex none
## hippocampus clus5 bulk hippocampus none
## hypothalamus clus7 bulk hypothalamus none
## cerebellum clus7 bulk cerebellum none
## isocort clus8 cell isocort cerebral.cortex
## ... ... ... ... ...
## retrohipp clus8 cell retrohipp cerebral.cortex
## retrohipp clus5 cell retrohipp hippocampus
## retrohipp clus5 cell retrohipp hippocampus
## retrohipp clus5 cell retrohipp hippocampus
## retrohipp clus5 cell retrohipp hippocampus
## similarity index sizeFactor
## <character> <integer> <numeric>
## cerebral.cortex none 1 45.811788
## hippocampus none 2 100.736625
## hypothalamus none 3 28.249392
## cerebellum none 4 44.415396
## isocort 0.545 5 0.711862
## ... ... ... ...
## retrohipp 0.373 4454 1.006589
## retrohipp -0.1 4455 0.674636
## retrohipp 0.136 4456 0.646310
## retrohipp 0.127 4457 0.564467
## retrohipp 0.327 4458 0.636357
The strigency of assignments between bulk and cells can be controled by filtering the similarity
. This utility is impletmented in function filter_asg
, where min.sim
is the similarity
cutoff. Utilities are also developed to manually tailor the assignments and are explained in the Supplementary Section.
coclus.mus <- filter_asg(coclus.mus, min.sim=0.1)
The co-clusters including bulk and cells can be visualized with the function plot_dim
. The dim
argument specifies embedding methods and cocluster.only=TRUE
sets cells without bulk assignments in gray. In the embedding plot, bulk tissues and cells are indicated by large and small circles respectively.
plot_dim(coclus.mus, dim='PCA', color.by='cluster', cocluster.only=TRUE)
Bulk assignments are treated as group labels for cells in co-visualization. Similar with annotation-based and manual methods, co-visualization options include cell-to-bulk mapping, bulk-to-cell mapping, inclusion or exclusion of expression values.
In cell-to-bulk mapping, single cell data are separated from bulk and expression values are summarized by means within each cell group, i.e. bulk assignement.
# Separate single cell data.
coclus.sc <- subset(coclus.mus, , bulkCell=='cell')
# Summarize expression values in each cell group.
sc.aggr.coclus <- aggr_rep(data=coclus.sc, assay.na='logcounts', sam.factor='assignedBulk', aggr='mean')
colData(sc.aggr.coclus)
## DataFrame with 5 rows and 7 columns
## cluster bulkCell sample assignedBulk similarity
## <character> <character> <character> <character> <character>
## cerebral.cortex clus8 cell isocort cerebral.cortex 0.545
## none clus2 cell olfa none none
## cerebellum clus7 cell striatum cerebellum 0.3
## hypothalamus clus7 cell pallidum hypothalamus 0.418
## hippocampus clus5 cell olfa hippocampus 0.609
## index sizeFactor
## <integer> <numeric>
## cerebral.cortex 5 0.711862
## none 9 0.687270
## cerebellum 301 0.690039
## hypothalamus 403 0.715932
## hippocampus 500 1.055465
The co-visualization through cell-to-bulk mapping is built on gene Adcy1
. Cells in the embedding plot and respective assigned bulk tissues in spatial heatmap are revealed by same colors.
covis(svg=svg.mus.brain, data=sc.aggr.coclus, ID=c('Adcy1'), sce.dimred=coclus.sc, dimred='UMAP', tar.cell=c('hippocampus', 'hypothalamus', 'cerebellum', 'cerebral.cortex'), dim.lgd.text.size=10, dim.lgd.nrow=2, bar.width=0.1, legend.nrow=4)
In bulk-to-cell mapping, bulk data are separated from cell, where replicates are already aggregated in preprocessing.
coclus.blk <- subset(coclus.mus, , bulkCell=='bulk')
Same with conventions in the main vignette, at least one bulk tissue identifier should match with aSVG features so as to successfully plot spatial heatmaps.
colnames(coclus.blk) %in% attribute(svg.mus.brain)[[1]]$feature
## [1] TRUE TRUE TRUE TRUE
The co-visualization through bulk-to-cell mapping is built on gene Adcy1
. Cells in the embedding plot and respective assigned bulk tissues in spatial heatmap are revealed by same colors.
covis(svg=svg.mus.brain, data=coclus.blk, ID=c('Adcy1'), sce.dimred=coclus.sc, dimred='UMAP', tar.bulk=colnames(coclus.blk), assay.na='logcounts', legend.nrow=4, dim.lgd.text.size=10, dim.lgd.nrow=2, bar.width=0.08)
By setting profile=FALSE
, the co-visualization is created without expression values, where colors in the embedding plot and spatial heatmap only indicate matching between cells and bulk tissues.
covis(svg=svg.mus.brain, data=coclus.blk, ID=c('Adcy1'), sce.dimred=coclus.sc, dimred='UMAP', profile=FALSE, assay.na='logcounts', legend.nrow=4, dim.lgd.text.size=10, dim.lgd.nrow=2, bar.width=0.1)
The co-visualization feature is included in the integrated Shiny app that is an GUI implementation of spatialHeatmap
, including annotation-based, manual, and automated methods. To start this app, simply call shiny_shm()
in R. Below is a screenshot of the co-visulization output.
When using the Shiny app, if bulk and single cell raw count data are provided, column-wise combine them in a SCE
object, and format the metadata in the colData
slot according to the following rules:
In the bulkCell
column, use bulk
and cell
to indicate bulk and cell samples respectively. If no bulk
is included in this column, all samples are considered cells.
If multiple cell group labels (annotation-based, or manually created) are provided, include them in columns of label
, label1
, label2
, and so on respectively. In each of these label columns include corresponding aSVG features as bulk tissue labels.
After formatting the metadata, save the SCE
object as an .rds
file using saveRDS
, then upload the .rds
and aSVG file to the app. An example of bulk and single cell data for use in the Shiny app are included in spatialHeatmap
and shown below.
shiny.dat.pa <- system.file("extdata/shinyApp/example", "shiny_covis_bulk_cell_mouse_brain.rds", package="spatialHeatmap")
shiny.dat <- readRDS(shiny.dat.pa)
colData(shiny.dat)
## DataFrame with 1061 rows and 4 columns
## label label1 bulkCell expVar
## <character> <character> <character> <character>
## cerebral.cortex cerebral.cortex cerebral.cortex bulk control
## hippocampus hippocampus hippocampus bulk control
## hypothalamus hypothalamus hypothalamus bulk control
## cerebellum cerebellum cerebellum bulk control
## cerebral.cortex cerebral.cortex cerebral.cortex bulk control
## ... ... ... ... ...
## retrohipp retrohipp clus4 cell control
## retrohipp retrohipp clus4 cell control
## cere cere clus2 cell control
## cere cere clus2 cell control
## midbrain midbrain clus2 cell control
The bulk-cell assignments in the automated method can be manually tailored, which is optional. The tailoring can be performed in command line or on a Shiny app. This section illustrates the command-line based tailoring. First visualize single cells in an embedding plot as shown below. In order to define more accurate coordinates in the next step, tune the x-y axis breaks (x.break
, y.break
) and set panel.grid=TRUE
.
plot_dim(coclus.mus, dim='UMAP', color.by='sample', x.break=seq(-10, 10, 1), y.break=seq(-10, 10, 1), panel.grid=TRUE)
Define desired bulk tissues (desiredBulk
) for cells selected by x-y coordinate ranges (x.min
, x.max
, y.min
, y.max
) in the embedding plot in form of a data.frame
(df.desired.bulk
). The dimred
reveals where the coordinates come from and is required. In this example, cell populations near the bulk hippocampus
are selected and hippocampus
is the desired bulk tissue.
df.desired.bulk <- data.frame(x.min=c(6), x.max=c(8), y.min=c(-1), y.max=c(0.5), desiredBulk=c('hippocampus'), dimred='UMAP')
df.desired.bulk
## x.min x.max y.min y.max desiredBulk dimred
## 1 6 8 -1 0.5 hippocampus UMAP
Incorporate desired bulk assignments to co-clustering results by calling refine_asg
. The similarities corresponding to desired bulk are internally set at the maximum of 1. After inclusion of desired bulk, separate single cells from bulk for co-visualization.
# Incorporate desired bulk
coclus.mus.tailor <- refine_asg(sce.all=coclus.mus, df.desired.bulk=df.desired.bulk)
# Separate cells from bulk
coclus.sc.tailor <- subset(coclus.mus.tailor, , bulkCell=='cell')
After tailoring, the co-visualization through bulk-to-cell mapping is built on gene Adcy1
. To reveal the tailoring result, only hippocampus
is shown through the argument tar.bulk
in the plot, where cells defined in df.desired.bulk
have the same color as the desired bulk hippocampus
. hippocampus
cells in the embedding plot include tailored cells in df.desired.bulk
and those labeled hippocampus
in coclustering.
covis(svg=svg.mus.brain, data=coclus.blk, ID=c('Adcy1'), sce.dimred=coclus.sc.tailor, dimred='UMAP', tar.bulk=c('hippocampus'), assay.na='logcounts', legend.nrow=4, dim.lgd.text.size=10, dim.lgd.nrow=2, bar.width=0.08)
This section describes tailoring co-clustering results on the convenience Shiny app, which is lauched by calling desired_bulk_shiny
.
Figure 16 is the screenshot of the Shiny app. The file to upload is the co-clustering result returned by cocluster
, here coclus.mus
. It should be saved in an .rds
file by using saveRDS
before uploaded to the app. On the left embedding plot, cells are selected with the “Lasso Select” tool. On the right, selected cells and their coordinates are listed in a table, and the desired bulk tissues (aSVG features) can be selected from the dropdown list, here hippocampus
. To download the table just click the “Download” button. The “Help” button gives more instructions.
An example of desired bulk downloaded from the convenience Shiny app is shown below. The x-y coordinates refer to single cells in embbeding plots (dimred
). The df.desired.bulk
is ready to use in the tailoring section.
desired.blk.pa <- system.file("extdata/shinyApp/example", "selected_cells_with_desired_bulk.txt", package="spatialHeatmap")
df.desired.bulk <- read.table(desired.blk.pa, header=TRUE, row.names=1, sep='\t')
df.desired.bulk[1:3, ]
## x y key desiredBulk dimred
## 1 -7.564745 0.07702301 1039 hippocampus UMAP
## 2 -7.541277 0.07503040 1040 hippocampus UMAP
## 3 -7.554182 0.05069406 1046 hippocampus UMAP
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] kableExtra_1.3.4 BiocParallel_1.32.0
## [3] igraph_1.3.5 scater_1.26.0
## [5] ggplot2_3.3.6 scran_1.26.0
## [7] scuttle_1.8.0 SingleCellExperiment_1.20.0
## [9] SummarizedExperiment_1.28.0 Biobase_2.58.0
## [11] GenomicRanges_1.50.0 GenomeInfoDb_1.34.0
## [13] IRanges_2.32.0 S4Vectors_0.36.0
## [15] BiocGenerics_0.44.0 MatrixGenerics_1.10.0
## [17] matrixStats_0.62.0 spatialHeatmap_2.4.0
## [19] knitr_1.40 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 shinydashboard_0.7.2
## [3] tidyselect_1.2.0 RSQLite_2.2.18
## [5] AnnotationDbi_1.60.0 htmlwidgets_1.5.4
## [7] grid_4.2.1 Rtsne_0.16
## [9] munsell_0.5.0 ScaledMatrix_1.6.0
## [11] codetools_0.2-18 preprocessCore_1.60.0
## [13] interp_1.1-3 statmod_1.4.37
## [15] withr_2.5.0 colorspace_2.0-3
## [17] filelock_1.0.2 highr_0.9
## [19] rstudioapi_0.14 labeling_0.4.2
## [21] GenomeInfoDbData_1.2.9 farver_2.1.1
## [23] bit64_4.0.5 vctrs_0.5.0
## [25] generics_0.1.3 xfun_0.34
## [27] BiocFileCache_2.6.0 fastcluster_1.2.3
## [29] R6_2.5.1 doParallel_1.0.17
## [31] ggbeeswarm_0.6.0 rsvd_1.0.5
## [33] locfit_1.5-9.6 rsvg_2.3.2
## [35] bitops_1.0-7 cachem_1.0.6
## [37] gridGraphics_0.5-1 DelayedArray_0.24.0
## [39] assertthat_0.2.1 promises_1.2.0.1
## [41] scales_1.2.1 nnet_7.3-18
## [43] beeswarm_0.4.0 gtable_0.3.1
## [45] beachmat_2.14.0 WGCNA_1.71
## [47] rlang_1.0.6 genefilter_1.80.0
## [49] systemfonts_1.0.4 splines_4.2.1
## [51] lazyeval_0.2.2 impute_1.72.0
## [53] checkmate_2.1.0 BiocManager_1.30.19
## [55] yaml_2.3.6 reshape2_1.4.4
## [57] backports_1.4.1 httpuv_1.6.6
## [59] Hmisc_4.7-1 tools_4.2.1
## [61] bookdown_0.29 ggplotify_0.1.0
## [63] ellipsis_0.3.2 gplots_3.1.3
## [65] jquerylib_0.1.4 RColorBrewer_1.1-3
## [67] ggdendro_0.1.23 dynamicTreeCut_1.63-1
## [69] Rcpp_1.0.9 plyr_1.8.7
## [71] base64enc_0.1-3 sparseMatrixStats_1.10.0
## [73] visNetwork_2.1.2 zlibbioc_1.44.0
## [75] purrr_0.3.5 RCurl_1.98-1.9
## [77] rpart_4.1.19 deldir_1.0-6
## [79] viridis_0.6.2 ggrepel_0.9.1
## [81] cluster_2.1.4 magrittr_2.0.3
## [83] magick_2.7.3 data.table_1.14.4
## [85] grImport_0.9-5 mime_0.12
## [87] evaluate_0.17 xtable_1.8-4
## [89] XML_3.99-0.12 jpeg_0.1-9
## [91] gridExtra_2.3 compiler_4.2.1
## [93] tibble_3.1.8 KernSmooth_2.23-20
## [95] crayon_1.5.2 htmltools_0.5.3
## [97] later_1.3.0 Formula_1.2-4
## [99] geneplotter_1.76.0 tidyr_1.2.1
## [101] DBI_1.1.3 dbplyr_2.2.1
## [103] MASS_7.3-58.1 rappdirs_0.3.3
## [105] Matrix_1.5-1 cli_3.4.1
## [107] parallel_4.2.1 metapod_1.6.0
## [109] pkgconfig_2.0.3 flashClust_1.01-2
## [111] foreign_0.8-83 plotly_4.10.0
## [113] xml2_1.3.3 foreach_1.5.2
## [115] svglite_2.1.0 annotate_1.76.0
## [117] vipor_0.4.5 bslib_0.4.0
## [119] dqrng_0.3.0 webshot_0.5.4
## [121] XVector_0.38.0 rvest_1.0.3
## [123] yulab.utils_0.0.5 stringr_1.4.1
## [125] digest_0.6.30 RcppAnnoy_0.0.20
## [127] Biostrings_2.66.0 rmarkdown_2.17
## [129] htmlTable_2.4.1 uwot_0.1.14
## [131] edgeR_3.40.0 DelayedMatrixStats_1.20.0
## [133] curl_4.3.3 shiny_1.7.3
## [135] gtools_3.9.3 lifecycle_1.0.3
## [137] jsonlite_1.8.3 BiocNeighbors_1.16.0
## [139] viridisLite_0.4.1 limma_3.54.0
## [141] fansi_1.0.3 pillar_1.8.1
## [143] lattice_0.20-45 KEGGREST_1.38.0
## [145] fastmap_1.1.0 httr_1.4.4
## [147] survival_3.4-0 GO.db_3.16.0
## [149] glue_1.6.2 FNN_1.1.3.1
## [151] UpSetR_1.4.0 png_0.1-7
## [153] iterators_1.0.14 bluster_1.8.0
## [155] bit_4.0.4 stringi_1.7.8
## [157] sass_0.4.2 blob_1.2.3
## [159] DESeq2_1.38.0 BiocSingular_1.14.0
## [161] latticeExtra_0.6-30 caTools_1.18.2
## [163] memoise_2.0.1 dplyr_1.0.10
## [165] irlba_2.3.5.1
This project has been funded by NSF awards: PGRP-1546879, PGRP-1810468, PGRP-1936492.
Amezquita, Robert, Aaron Lun, Etienne Becht, Vince Carey, Lindsay Carpp, Ludwig Geistlinger, Federico Marini, et al. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” Nature Methods 17: 137–45. https://www.nature.com/articles/s41592-019-0654-x.
Chang, Winston, Joe Cheng, JJ Allaire, Cars on Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2021. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.
Gentleman, R, V Carey, W Huber, and F Hahne. 2018. “Genefilter: Methods for Filtering Genes from High-Throughput Experiments.” http://bioconductor.uib.no/2.7/bioc/html/genefilter.html.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell Rna-Seq Data with Bioconductor.” F1000Res. 5: 2122. https://doi.org/10.12688/f1000research.9501.2.
Marques, Sueli, Amit Zeisel, Simone Codeluppi, David van Bruggen, Ana Mendanha Falcão, Lin Xiao, Huiliang Li, et al. 2016. “Oligodendrocyte Heterogeneity in the Mouse Juvenile and Adult Central Nervous System.” Science 352 (6291): 1326–9.
Ortiz, Cantin, Jose Fernandez Navarro, Aleksandra Jurek, Antje Märtin, Joakim Lundeberg, and Konstantinos Meletis. 2020. “Molecular Atlas of the Adult Mouse Brain.” Science Advances 6 (26): eabb3446.
Risso, Davide, and Michael Cole. 2022. ScRNAseq: Collection of Public Single-Cell Rna-Seq Datasets.
Vacher, Claire-Marie, Helene Lacaille, Jiaqi J O’Reilly, Jacquelyn Salzbank, Dana Bakalar, Sonia Sebaoui, Philippe Liere, et al. 2021. “Placental Endocrine Function Shapes Cerebellar Development and Social Behavior.” Nat. Neurosci. 24 (10): 1392–1401.
Zhang, Jianhai, Jordan Hayes, Le Zhang, Bing Yang, Wolf Frommer, Julia Bailey-Serres, and Thomas Girke. 2022. SpatialHeatmap: SpatialHeatmap. https://github.com/jianhaizhang/spatialHeatmap.