spatialHeatmap 1.2.0
The spatialHeatmap package provides functionalities for visualizing cell-,
tissue- and organ-specific data of biological assays by coloring the
corresponding spatial features defined in anatomical images according to a
numeric color key. The color scheme used to represent the assay values can be
customized by the user. This core functionality of the package is called a
spatial heatmap (SHM) plot. It is enhanced with visualization tools for groups
of measured items (e.g. gene modules) sharing related abundance profiles, including
matrix heatmaps combined with hierarchical clustering dendrograms and network representations.
The functionalities of spatialHeatmap can be used either in a command-driven mode
from within R or a graphical user interface (GUI) provided by a Shiny App that
is also part of this package. While the R-based mode provides flexibility to
customize and automate analysis routines, the Shiny App includes a variety of
convenience features that will appeal to experimentalists and other users less
familiar with R. Moreover, the Shiny App can be used on both local computers as
well as centralized server-based deployments (e.g. cloud-based or custom
servers) that can be accessed remotely as a public web service for using
spatialHeatmap’s functionalities with community and/or private data. The
functionalities of the spatialHeatmap
package are illustrated in Figure
1.
As anatomical images the package supports both tissue maps from public repositories and custom images provided by the user. In general any type of image can be used as long as it can be provided in SVG (Scalable Vector Graphics) format, where the corresponding spatial features have been defined (see aSVG below). The numeric values plotted onto an SHM are usually quantitative measurements from a wide range of profiling technologies, such as microarrays, next generation sequencing (e.g. RNA-Seq and scRNA-Seq), proteomics, metabolomics, or many other small- or large-scale experiments. For convenience, several preprocessing and normalization methods for the most common use cases are included that support raw and/or preprocessed data. Currently, the main application domains of the spatialHeatmap package are numeric data sets and spatially mapped images from biological, agricultural and biomedical areas. Moreover, the package has been designed to also work with many other spatial data types, such a population data plotted onto geographic maps. This high level of flexibility is one of the unique features of spatialHeatmap. Related software tools for biological applications in this field are largely based on pure web applications (Lekschas et al. 2015; Papatheodorou et al. 2018; Winter et al. 2007; Waese et al. 2017) or local tools (Maag 2018; Muschelli, Sweeney, and Crainiceanu 2014) that typically lack customization functionalities. These restrictions limit users to utilizing pre-existing expression data and/or fixed sets of anatomical image collections. To close this gap for biological use cases, we have developed spatialHeatmap as a generic R/Bioconductor package for plotting quantitative values onto any type of spatially mapped images in a programmable environment and/or in an intuitive to use GUI application.
The core feature of spatialHeatmap
is to map assay values (e.g.
gene expression data) of one or many items (e.g. genes) measured under
different conditions in form of numerically graded colors onto the
corresponding cell types or tissues represented in a chosen SVG image. In the
gene profiling field, this feature supports comparisons of the expression
values among multiple genes by plotting their SHMs next to each
other. Similarly, one can display the expression values of a single or multiple
genes across multiple conditions in the same plot (Figure 4). This level of flexibility is
very efficient for visualizing complicated expression patterns across genes,
cell types and conditions. In case of more complex anatomical images with
overlapping multiple layer tissues, it is important to visually expose the
tissue layer of interest in the plots. To address this, several default and
customizable layer viewing options are provided. They allow to hide features in
the top layers by making them transparent in order to expose features below
them. This transparency viewing feature is highlighted below in the mouse
example (Figure 5). Except for spatial data, this package also works on spatiotemporal data and generates spatiotemporal heatmaps (STHMs, Figure 9). Moreover, one can plot multiple distinct
aSVGs in a single SHM plot as shown in Figure 11. This is
particularly useful for displaying abundance trends across multiple development
stages, where each is represented by its own aSVG image. In addition to
static SHM representations, one can visualize them in form of dynamic animations
embedded in interactive HTML files or generate videos for them.
To maximize reusability and extensibility, the package organizes large-scale
omics assay data along with the associated experimental design information in a
SummarizedExperiment
object (Figure 1A). The latter is one of the core S4 classes within
the Bioconductor ecosystem that has been widely adapted by many other software
packages dealing with gene-, protein- and metabolite-level profiling data
(Morgan et al. 2018). In case of gene expression data, the assays
slot of
the SummarizedExperiment
container is populated with a gene expression
matrix, where the rows and columns represent the genes and tissue/conditions,
respectively, while the colData
slot contains sample data including replicate
information. The tissues and/or cell type information in the object maps via
colData
to the corresponding features in the SVG images using unique
identifiers for the spatial features (e.g. tissues or cell types). This
allows to color the features of interest in an SVG image according to the
numeric data stored in a SummarizedExperiment
object. For simplicity the
numeric data can also be provided as numeric vectors
or data.frames
. This
can be useful for testing purposes and/or the usage of simple data sets that
may not require the more advanced features of the SummarizedExperiment
class,
such as measurements with only one or a few data points. The details about how to
access the SVG images and properly format the associated expression data are
provided in the Supplementary Section of this vignette.
SHMs are images where colors encode numeric values in features of any shape. For plotting SHMs, Scalable Vector Graphics (SVG) has been chosen as image format since it is a flexible and widely adapted vector graphics format that provides many advantages for computationally embedding numerical and other information in images. SVG is based on XML formatted text describing all components present in images, including lines, shapes and colors. In case of biological images suitable for SHMs, the shapes often represent anatomical or cell structures. To assign colors to specific features in SHMs, annotated SVG (aSVG) files are used where the shapes of interest are labeled according to certain conventions so that they can be addressed and colored programmatically. SVGs and aSVGs of anatomical structures can be downloaded from many sources including the repositories described below. Alternatively, users can generate them themselves with vector graphics software such as Inkscape. Typically, in aSVGs one or more shapes of a feature of interest, such as the cell shapes of an organ, are grouped together by a common feature identifier. Via these group identifiers one or many feature types can be colored simultaneously in an aSVG according to biological experiments assaying the corresponding feature types with the required spatial resolution. Correct assignment of image features and assay results is assured by using for both the same feature identifiers. The color gradient used to visually represent the numeric assay values is controlled by a color gradient parameter. To visually interpret the meaning of the colors, the corresponding color key is included in the SHM plots. Additional details for properly formatting and annotating both aSVG images and assay data are provided in the Supplementary Section section of this vignette.
If not generated by the user, SHMs can be generated with data downloaded from
various public repositories. This includes gene, protein and metabolic
profiling data from databases, such as GEO,
BAR and Expression
Atlas from EMBL-EBI (Papatheodorou et al. 2018). A
particularly useful resource, when working with spatialHeatmap
, is the EBI
Expression Atlas. This online service contains both assay and anatomical
images. Its assay data include mRNA and protein profiling experiments for
different species, tissues and conditions. The corresponding anatomical image
collections are also provided for a wide range of species including animals and
plants. In spatialHeatmap
several import functions are provided to work with
the expression and aSVG repository from the Expression Atlas
directly. The aSVG images developed by the spatialHeatmap
project are
available in its own repository called spatialHeatmap aSVG
Repository,
where users can contribute their aSVG images that are formatted according to
our guidlines.
The following sections of this vignette showcase the most important
functionalities of the spatialHeatmap
package using as initial example a simple
to understand toy data set, and then more complex mRNA profiling data from the
Expression Atlas and GEO databases. First, SHM plots are generated for both the toy
and mRNA expression data. The latter include gene expression data sets from
RNA-Seq and microarray experiments of Human Brain, Mouse
Organs, Chicken Organs, and Arabidopsis Shoots. The
first three are RNA-Seq data from the Expression
Atlas, while the last one is a microarray data
set from GEO. Second, gene context
analysis tools are introduced, which facilitate the visualization of
gene modules sharing similar expression patterns. This includes the
visualization of hierarchical clustering results with traditional matrix
heatmaps (Matrix Heatmap) as well co-expression network plots
(Network). Third, an overview of the corresponding Shiny App
is presented that provides access to the same functionalities as the R
functions, but executes them in an interactive GUI environment (Chang et al., n.d.; Chang and Borges Ribeiro 2018). Fourth, more advanced features for plotting customized
SHMs are covered using the Human Brain data set as an example.
The spatialHeatmap
package should be installed from an R (version \(\ge\) 3.6)
session with the BiocManager::install
command.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("spatialHeatmap")
Next, the packages required for running the sample code in this vignette need to be loaded.
library(spatialHeatmap); library(SummarizedExperiment); library(ExpressionAtlas); library(GEOquery)
The following lists the vignette(s) of this package in an HTML browser. Clicking the corresponding name will open this vignette.
browseVignettes('spatialHeatmap')
SHMs are plotted with the spatial_hm
function. To provide a quick
and intuitive overview how these plots are generated, the following uses a
generalized toy example where a small vector of random numeric values is
generated that are used to color features in an aSVG image. The image chosen
for this example is an aSVG depicting the human brain. The corresponding image
file ‘homo_sapiens.brain.svg’ is included in this package for testing purposes.
The path to this image on a user's system, where spatialHeatmap
is
installed, can be obtained with the system.file
function.
The following commands obtain the directory of the aSVG collection and the full path to the chosen target aSVG image on a user’s system, respectively.
svg.dir <- system.file("extdata/shinyApp/example", package="spatialHeatmap")
svg.hum <- system.file("extdata/shinyApp/example", 'homo_sapiens.brain.svg', package="spatialHeatmap")
To identify feature labels of interest in annotated aSVG images, the return_feature
function can be used. The following searches the aSVG images stored in dir
for the query terms ‘lobe’ and ‘homo sapiens’ under the feature
and species
fields, respectively. The identified matches are returned as a data.frame
.
feature.df <- return_feature(feature=c('lobe'), species=c('homo sapiens'), remote=FALSE, dir=svg.dir)
## Accessing features...
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0000451 LAYER_EFO
##
## homo.sapiens_brain.shiny_shm.svg, Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360
##
## Duplicated title text detected: hippocampus
## homo_sapiens.brain.svg,
feature.df
## feature stroke color id element parent order
## 1 occipital.lobe 0.08000000 none UBERON_0002021 path LAYER_EFO 3
## 2 parietal.lobe 0.08000000 none UBERON_0001872 g LAYER_EFO 4
## 3 temporal.lobe 0.08000000 none UBERON_0001871 path LAYER_EFO 8
## 4 occipital.lobe 0.01600000 none UBERON_0002021 path LAYER_EFO 7
## 5 parietal.lobe 0.07060588 none UBERON_0001872 g LAYER_EFO 8
## 6 temporal.lobe 0.01600000 none UBERON_0001871 path LAYER_EFO 24
## SVG
## 1 homo.sapiens_brain.shiny_shm.svg
## 2 homo.sapiens_brain.shiny_shm.svg
## 3 homo.sapiens_brain.shiny_shm.svg
## 4 homo_sapiens.brain.svg
## 5 homo_sapiens.brain.svg
## 6 homo_sapiens.brain.svg
fnames <- feature.df[, 1]
The following example generates a small numeric toy vector, where the data slot
contains four numbers and its name slot is populated with the three feature
names obtained from the above aSVG image. In addition, a non-matching entry
(here ‘notMapped’) is included for demonstration purposes. Note, the numbers
are mapped to features via matching names among the numeric vector and the aSVG,
respectively. Accordingly, only numbers and features with matching name
counterparts can be colored in the aSVG image. Entries without name matches
are indicated by a message printed to the R console, here “notMapped”. This
behavior can be turned off with verbose=FALSE
in the corresponding function
call. In addition, a summary of the numeric assay to feature mappings is stored
in the result data.frame
returned by the spatial_hm
function (see below).
my_vec <- sample(1:100, length(unique(fnames))+1)
names(my_vec) <- c(unique(fnames), 'notMapped')
my_vec
## occipital.lobe parietal.lobe temporal.lobe notMapped
## 60 77 66 56
Next, the SHM is plotted with the spatial_hm
function (Figure
2). Internally, the numbers in my_vec
are translated into
colors based on the color key assigned to the col.com
argument, and then
painted onto the corresponding features in the aSVG, where the path to the image
file is defined by svg.path=svg.hum
. The remaining arguments used here include:
ID
for defining the title of the plot; ncol
for setting the column-wise layout
of the plot excluding the feature legend plot on the right; and height
for defining
the height of the SHM relative to its width. In addition, the outline feature g4320
covers all tissue features due to its default color, so it is set transparent through ft.trans
. More details of the transparency function is explained in the mouse example (Figure 5). In the given example
(Figure 2) only three features in my_vec
(‘occipital lobe’,
‘parietal lobe’, and ‘temporal lobe’) have matching entries in the corresponding
aSVG.
shm.lis <- spatial_hm(svg.path=svg.hum, data=my_vec, ID='toy', ncol=1, height=0.9, width=0.8, sub.title.size=20, legend.nrow=2, ft.trans=c('g4320'))
## Coordinates: homo_sapiens.brain.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360
##
## Duplicated title text detected: hippocampus
## Features in data not mapped: notMapped
## ggplots/grobs: homo_sapiens.brain.svg ...
## ggplot: toy, con
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## toy_con_1
The named numeric values in my_vec
, that have name matches with the features in the
chosen aSVG, are stored in the mapped_feature
slot. The attributes of features are stored in feature_attribute
slot.
# The SHM, mapped features, and feature attributes are stored in a list
names(shm.lis)
## [1] "spatial_heatmap" "mapped_feature" "feature_attribute"
# Mapped features
shm.lis[['mapped_feature']]
## rowID featureSVG value SVG
## 1 toy occipital.lobe 60 homo_sapiens.brain.svg
## 2 toy parietal.lobe 77 homo_sapiens.brain.svg
## 3 toy temporal.lobe 66 homo_sapiens.brain.svg
# Feature attributes
shm.lis[['feature_attribute']][1:3, ]
## feature stroke color id element parent order
## 1 g4320 0.080 none g4320 g LAYER_OUTLINE 1
## 2 locus.ceruleus 0.016 none UBERON_0002148 path LAYER_EFO 1
## 3 diencephalon 0.016 none UBERON_0001894 path LAYER_EFO 2
## SVG
## 1 homo_sapiens.brain.svg
## 2 homo_sapiens.brain.svg
## 3 homo_sapiens.brain.svg
This subsection introduces how to find cell- and tissue-specific assay data in
the Expression Atlas database. After choosing a gene expression experiment, the
data is downloaded directly into a user's R session. Subsequently, the
expression values for selected genes can be plotted onto a chosen aSVG image with
or without prior preprocessing steps (e.g. normalization). For querying and
downloading expression data from the Expression Atlas database, functions from
the ExpressionAtlas
package are used (Keays 2019).
The following example searches the Expression Atlas for expression data derived from specific tissues and species of interest, here ‘cerebellum’ and ‘Homo sapiens’, respectively.
To avoid repetitive downloading, the downloaded data sets are cached in ~/.cache/shm
in all the following examples.
cache.pa <- '~/.cache/shm' # The path of cache.
all.hum <- read_cache(cache.pa, 'all.hum') # Retrieve data from cache.
if (is.null(all.hum)) { # Save downloaded data to cache if it is not cached.
all.hum <- searchAtlasExperiments(properties="cerebellum", species="Homo sapiens")
save_cache(dir=cache.pa, overwrite=TRUE, all.hum)
}
The search result is stored in a DFrame
containing 13
accessions matching the above query. For the following sample code, the
accession
‘E-GEOD-67196’
from Prudencio et al. (2015) has been chosen, which corresponds
to an RNA-Seq profiling experiment of ‘cerebellum’ and ‘frontal cortex’ brain
tissue from patients with amyotrophic lateral sclerosis (ALS). Details about the
corresponding record can be returned as follows.
all.hum[2, ]
## DataFrame with 1 row and 4 columns
## Accession Species Type Title
## <character> <character> <character> <character>
## 1 E-MTAB-3358 Homo sapiens RNA-seq of coding RNA RNA-Seq CAGE (Cap An..
The getAtlasData
function allows to download the chosen RNA-Seq experiment
from the Expression Atlas and import it into a RangedSummarizedExperiment
object of a user's R session.
rse.hum <- read_cache(cache.pa, 'rse.hum') # Read data from cache.
if (is.null(rse.hum)) { # Save downloaded data to cache if it is not cached.
rse.hum <- getAtlasData('E-GEOD-67196')[[1]][[1]]
save_cache(dir=cache.pa, overwrite=TRUE, rse.hum)
}
The design of the downloaded RNA-Seq experiment is described in the colData
slot of
rse.hum
. The following returns only its first five rows and columns.
colData(rse.hum)[1:5, 1:5]
## DataFrame with 5 rows and 5 columns
## AtlasAssayGroup organism individual organism_part
## <character> <character> <character> <character>
## SRR1927019 g1 Homo sapiens individual1 cerebellum
## SRR1927020 g2 Homo sapiens individual1 frontal cortex
## SRR1927021 g1 Homo sapiens individual2 cerebellum
## SRR1927022 g2 Homo sapiens individual2 frontal cortex
## SRR1927023 g1 Homo sapiens individual34 cerebellum
## disease
## <character>
## SRR1927019 amyotrophic lateral ..
## SRR1927020 amyotrophic lateral ..
## SRR1927021 amyotrophic lateral ..
## SRR1927022 amyotrophic lateral ..
## SRR1927023 amyotrophic lateral ..
The following example shows how to download from the above described SVG
repositories an aSVG image that matches the tissues and species
assayed in the gene expression data set downloaded in the previous subsection.
The return_feature
function queries the repository for feature- and
species-related keywords, here c('frontal cortex', 'cerebellum')
and c('homo sapiens', 'brain')
, respectively. To return matching aSVGs, the argument
keywords.any
is set to TRUE
by default. When return.all=FALSE
, only aSVGs
matching the query keywords are returned and saved under dir
. Otherwise, all
aSVGs are returned regardless of the keywords. To avoid overwriting of existing
SVG files, it is recommended to start with an empty target directory, here
tmp.dir
. To search a local directory for matching aSVG images, the argument
setting remote=FALSE
needs to be used, while specifying the path of the
corresponding directory under dir
. All or only matching features are returned
if match.only
is set to FALSE
or TRUE
, respectively.
tmp.dir <- paste0(normalizePath(tempdir(check=TRUE), winslash="/", mustWork=FALSE), '/shm') # Create empty directory
feature.df <- return_feature(feature=c('frontal cortex', 'cerebellum'), species=c('homo sapiens', 'brain'), keywords.any=TRUE, return.all=FALSE, dir=tmp.dir, remote=TRUE, match.only=TRUE, desc=FALSE) # Query aSVGs
feature.df[1:8, ] # Return first 8 rows for checking
unique(feature.df$SVG) # Return all matching aSVGs
To build this vignettes according to the R/Bioconductor package requirements, the
following code section uses the aSVG file instance included in the
spatialHeatmap
package rather than the downloaded instance from the previous
example.
feature.df <- return_feature(feature=c('frontal cortex', 'cerebellum'), species=c('homo sapiens', 'brain'), keywords.any=TRUE, return.all=FALSE, dir=svg.dir, remote=FALSE)
## Accessing features...
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0000451 LAYER_EFO
##
## homo.sapiens_brain.shiny_shm.svg, Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360
##
## Duplicated title text detected: hippocampus
## homo_sapiens.brain.svg,
Note, the target tissues frontal cortex
and cerebellum
are included in both
the experimental design slot of the downloaded expression data as well as the
annotations of the aSVG. This way these features can be colored in the downstream
SHM plots. If necessary users can also change from within R the feature identifiers
and names in an aSVG. Details on this utility are provided in the Supplementary Section.
feature.df
## feature stroke color id element parent order
## 1 middle.frontal.gyrus 0.08000000 none UBERON_0002702 path LAYER_EFO 2
## 2 prefrontal.cortex 0.08230311 none UBERON_0000451 g LAYER_EFO 5
## 3 frontal.cortex 0.08000000 none UBERON_0001870 path LAYER_EFO 6
## 4 cerebral.cortex 0.08000000 none UBERON_0000956 g LAYER_EFO 7
## 5 cerebellum 0.08000000 none UBERON_0002037 g LAYER_EFO 9
## 6 middle.frontal.gyrus 0.01600000 none UBERON_0002702 path LAYER_EFO 6
## 7 cingulate.cortex 0.01600000 none UBERON_0003027 path LAYER_EFO 19
## 8 prefrontal.cortex 0.01600000 none UBERON_0000451 g LAYER_EFO 21
## 9 frontal.cortex 0.01600000 none UBERON_0001870 path LAYER_EFO 22
## 10 cerebral.cortex 0.01600000 none UBERON_0000956 g LAYER_EFO 23
## 11 cerebellum 0.01600000 none UBERON_0002037 g LAYER_EFO 25
## SVG
## 1 homo.sapiens_brain.shiny_shm.svg
## 2 homo.sapiens_brain.shiny_shm.svg
## 3 homo.sapiens_brain.shiny_shm.svg
## 4 homo.sapiens_brain.shiny_shm.svg
## 5 homo.sapiens_brain.shiny_shm.svg
## 6 homo_sapiens.brain.svg
## 7 homo_sapiens.brain.svg
## 8 homo_sapiens.brain.svg
## 9 homo_sapiens.brain.svg
## 10 homo_sapiens.brain.svg
## 11 homo_sapiens.brain.svg
Since the Expression Atlas supports the cross-species anatomy
ontology, the corresponding UBERON identifiers are
included in the id
column of the data.frame
returned by the above function
call of return_feature
(Mungall et al. 2012). This ontology is also supported
by the rols
Bioconductor package (Gatto 2019).
For organizing experimental designs and downstream plotting purposes, it can be
desirable to customize the text in certain columns of colData
. This way one can
use the source data for displaying ‘pretty’ sample names in columns and legends
of all downstream tables and plots, respectively, in a consistent and automated
manner. To achieve this, the following example imports a ‘targets’ file that
can be generated and edited by the user in a text or spreadsheet program. In
the following example the target file content is used to replace the text in the
colData
slot of the RangedSummarizedExperiment
object with a version containing
shorter sample names that are more suitable for plotting purposes.
The following imports a custom target file containing simplified sample labels and experimental design information.
hum.tar <- system.file('extdata/shinyApp/example/target_human.txt', package='spatialHeatmap')
target.hum <- read.table(hum.tar, header=TRUE, row.names=1, sep='\t')
Load custom target data into colData
slot.
colData(rse.hum) <- DataFrame(target.hum)
A slice of the simplified colData
object is shown below, where the disease
column contains now shorter labels than in the original data set. Additional
details for generating and using target files in spatialHeatmap
are provided
in the Supplementary Section of this vignette.
colData(rse.hum)[c(1:3, 41:42), 4:5]
## DataFrame with 5 rows and 2 columns
## organism_part disease
## <character> <character>
## SRR1927019 cerebellum ALS
## SRR1927020 frontal cortex ALS
## SRR1927021 cerebellum ALS
## SRR1927059 cerebellum normal
## SRR1927060 frontal cortex normal
The actual gene expression data of the downloaded RNA-Seq experiment is stored
in the assay
slot of rse.hum
. Since it contains raw count data, it can be
desirable to apply basic preprocessing routines prior to plotting spatial
heatmaps. The following shows how to normalize the count data, aggregate
replicates and then remove genes with unreliable expression responses. These
preprocessing steps are optional and can be skipped if needed. For this,
the expression data can be provided to the spatial_hm
function directly, where
it is important to assign to the sam.factor
and con.factor
arguments
the corresponding sample and condition column names (Table 2).
For normalizing raw count data from RNA-Seq experiments, the norm_data
function can be used. It supports the following pre-existing functions from
widely used packages for analyzing count data in the next generation sequencing
(NGS) field: calcNormFactors
(CNF) from edgeR
(Robinson, McCarthy, and Smyth 2010); as well as
estimateSizeFactors
(ESF), varianceStabilizingTransformation
(VST), and
rlog
from DESeq2 (Love, Huber, and Anders 2014). The argument norm.fun
specifies one of the
four internal normalizing methods: CNF
, ESF
, VST
, and rlog
. If
norm.fun='none'
, no normalization is applied. The arguments for each
normalizing function are provided via a parameter.list
, which is a list
with named slots. For example, norm.fun='ESF'
and
parameter.list=list(type='ratio')
is equivalent to
estimateSizeFactors(object, type='ratio')
. If paramter.list=NULL
, the
default arguments are used by the normalizing function assigned to norm.fun
.
For additional details, users want to consult the help file of the norm_data
function by typing ?norm_data
in the R console.
The following example uses the ESF
normalization option. This method has been
chosen mainly due to its good time performance.
se.nor.hum <- norm_data(data=rse.hum, norm.fun='ESF', log2.trans=TRUE)
## Normalising: ESF
## type
## "ratio"
Replicates are aggregated with the aggr_rep
function, where the summary
statistics can be chosen under the aggr
argument (e.g. aggr='mean'
). The
columns specifying replicates can be assigned to the sam.factor
and
con.factor
arguments corresponding to samples and conditions, respectively.
For tracking, the corresponding sample/condition labels are used as column
titles in the aggregated assay
instance, where they are concatenated with a
double underscore as separator. In addition, the corresponding rows in the
colData
slot are collapsed accordingly.
se.aggr.hum <- aggr_rep(data=se.nor.hum, sam.factor='organism_part', con.factor='disease', aggr='mean')
## Syntactically valid column names are made!
assay(se.aggr.hum)[1:3, ]
## cerebellum__ALS frontal.cortex__ALS cerebellum__normal
## ENSG00000000003 7.024054 7.091484 6.406157
## ENSG00000000005 0.000000 1.540214 0.000000
## ENSG00000000419 7.866582 8.002549 8.073264
## frontal.cortex__normal
## ENSG00000000003 7.004446
## ENSG00000000005 1.403110
## ENSG00000000419 7.955709
To remove unreliable expression measures, filtering can be applied.
The following example retains genes with expression values
larger than 5 (log2 space) in at least 1% of all samples (pOA=c(0.01, 5)
), and
a coefficient of variance (CV) between 0.30 and 100 (CV=c(0.30, 100)
).
se.fil.hum <- filter_data(data=se.aggr.hum, sam.factor='organism_part', con.factor='disease', pOA=c(0.01, 5), CV=c(0.3, 100), dir=NULL)
## Syntactically valid column names are made!
## All values before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.287 2.442 4.268 19.991
## All coefficient of variances (CVs) before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0007742 0.0767696 0.4019655 0.6217813 0.9956157 2.0000000
## All values after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.654 4.976 4.779 6.451 14.695
## All coefficient of variances (CVs) after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3001 0.3648 0.4637 0.5651 0.7392 1.1548
To inspect the results, the following returns three selected rows of the fully preprocessed data matrix (Table 1).
assay(se.fil.hum)[c(5, 733:734), ]
cerebellum__ALS | frontal.cortex__ALS | cerebellum__normal | frontal.cortex__normal | |
---|---|---|---|---|
ENSG00000006047 | 1.134172 | 5.2629629 | 0.5377534 | 5.3588310 |
ENSG00000268433 | 5.324064 | 0.3419665 | 3.4780744 | 0.1340332 |
ENSG00000268555 | 5.954572 | 2.6148548 | 4.9349736 | 2.0351776 |
The preprocessed expression values for any gene in the assay
slot of
se.fil.hum
can be plotted as an SHM. The following uses gene
ENSG00000268433
as an example. The chosen aSVG is a depiction of the human
brain where the assayed featured are colored by the corresponding expression
values in se.fil.hum
.
shm.lis <- spatial_hm(svg.path=svg.hum, data=se.fil.hum, ID=c('ENSG00000268433'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=2, ft.trans=c('g4320'))
## Coordinates: homo_sapiens.brain.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360
##
## Duplicated title text detected: hippocampus
## ggplots/grobs: homo_sapiens.brain.svg ...
## ggplot: ENSG00000268433, ALS normal
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## ENSG00000268433_ALS_1 ENSG00000268433_normal_1
The plotting instructions of the SHM along with the corresponding
mapped features and feature attributes are stored as a list
, here named shm.lis
. Its components
can be accessed as follows.
names(shm.lis) # All slots.
## [1] "spatial_heatmap" "mapped_feature" "feature_attribute"
shm.lis[['mapped_feature']] # Mapped features.
## rowID featureSVG condition value SVG
## 1 ENSG00000268433 cerebellum ALS 5.3240638 homo_sapiens.brain.svg
## 2 ENSG00000268433 frontal.cortex ALS 0.3419665 homo_sapiens.brain.svg
## 3 ENSG00000268433 cerebellum normal 3.4780744 homo_sapiens.brain.svg
## 4 ENSG00000268433 frontal.cortex normal 0.1340332 homo_sapiens.brain.svg
shm.lis[['feature_attribute']][1:3, ] # Feature attributes.
## feature stroke color id element parent order
## 1 g4320 0.080 none g4320 g LAYER_OUTLINE 1
## 2 locus.ceruleus 0.016 none UBERON_0002148 path LAYER_EFO 1
## 3 diencephalon 0.016 none UBERON_0001894 path LAYER_EFO 2
## SVG
## 1 homo_sapiens.brain.svg
## 2 homo_sapiens.brain.svg
## 3 homo_sapiens.brain.svg
In the above example, the normalized expression values of gene ENSG00000268433
are colored in the frontal cortex and cerebellum, where the different conditions,
here normal and ALS, are given in separate SHMs plotted next to
each other. The color and feature mappings are defined
by the corresponding color key and legend plot on the left and right, respectively.
SHMs for multiple genes can be plotted by providing the
corresponding gene IDs under the ID
argument as a character vector. The
spatial_hm
function will then sequentially arrange the SHMs for
each gene in a single composite plot. To facilitate comparisons among expression
values across genes and/or conditions, the lay.shm
parameter can be assigned
'gene'
or 'con'
, respectively. For instance, in Figure 4 the
SHMs of the genes ENSG00000268433
and ENSG00000006047
are organized
by condition in a horizontal view. This functionality is particularly useful when
comparing gene families. Users can also customize the order of the SHM subplots, by
assigning lay.shm='none'
. With this setting the SHM subplots are organized according
to the gene and condition ordering under ID
and data
, respectively.
spatial_hm(svg.path=svg.hum, data=se.fil.hum, ID=c('ENSG00000268433', 'ENSG00000006047'), lay.shm='con', width=0.8, height=1, legend.r=1.5, legend.nrow=2, ft.trans=c('g4320'))
## Coordinates: homo_sapiens.brain.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360
##
## Duplicated title text detected: hippocampus
## ggplots/grobs: homo_sapiens.brain.svg ...
## ggplot: ENSG00000268433, ALS normal
## ggplot: ENSG00000006047, ALS normal
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## ENSG00000268433_ALS_1 ENSG00000268433_normal_1 ENSG00000006047_ALS_1 ENSG00000006047_normal_1
SHMs can be saved to interactive HTML files as well as video files. To trigger
this export behavior, the argument out.dir
needs to be assinged a directory
path where the HTML and video files will be stored. Each HTML file
contains an interactive SHM with zoom in and out functionality. Hovering over
graphics features will display data, gene, condition and other information. The
video will play the SHM subplots in the order specified under the lay.shm
argument.
The following example saves the interactive HTML and video files under
a directory named tmp.dir
.
tmp.dir <- paste0(normalizePath(tempdir(check=TRUE), winslash="/"), '/shm')
spatial_hm(svg.path=svg.hum, data=se.fil.hum, ID=c('ENSG00000268433', 'ENSG00000006047'), lay.shm='con', width=0.8, height=1, legend.r=1.5, legend.nrow=2, out.dir=tmp.dir, ft.trans=c('g4320'))
To provide a high level of flexibility, the spatial_hm
contains many arguments.
An overview of important arguments and their utility is provided in Table 2.
argument | description |
---|---|
svg.path | Path of aSVG |
data | Input data of SummarizedExperiment (SE), data frame, or vector |
sam.factor | Applies to SE. Column name of sample replicates in colData slot. Default is NULL |
con.factor | Applies to SE. Column name of condition replicates in colData slot. Default is NULL |
ID | A character vector of row items for plotting spatial heatmaps |
col.com | A character vector of color components for building colour scale. Default is c(‘yellow’, ‘orange’,‘red’) |
col.bar | ‘selected’ or ‘all’, the former means use values of ID to build the colour scale while the latter use all values in data. Default is ‘selected’. |
bar.width | A numeric of colour bar width. Default is 0.7 |
trans.scale | One of ‘log2’, ‘exp2’, ‘row’, ‘column’, or NULL, which means transform the data by ‘log2’ or ‘2-base expoent’, scale by ‘row’ or ‘column’, or no manipuation respectively. |
ft.trans | A vector of aSVG features to be transparent. Default is NULL. |
legend.r | A numeric to adjust the dimension of the legend plot. Default is 1. The larger, the higher ratio of width to height. |
sub.title.size | The title size of each spatial heatmap subplot. Default is 11. |
lay.shm | ‘gen’ or ‘con’, applies to multiple genes or conditions respectively. ‘gen’ means spatial heatmaps are organised by genes while ‘con’ organised by conditions. Default is ‘gen’ |
ncol | The total column number of spatial heatmaps, not including legend plot. Default is 2. |
ft.legend | ‘identical’, ‘all’, or a vector of samples/features in aSVG to show in legend plot. ‘identical’ only shows matching features while ‘all’ shows all features. |
legend.ncol, legend.nrow | Two numbers of columns and rows of legend keys respectively. Default is NULL, NULL, since they are automatically set. |
legend.position | the position of legend keys (‘none’, ‘left’, ‘right’,‘bottom’, ‘top’), or two-element numeric vector. Default is ‘bottom’. |
legend.key.size, legend.text.size | The size of legend keys and labels respectively. Default is 0.5 and 8 respectively. |
line.size, line.color | The size and colour of all plogyon outlines respectively. Default is 0.2 and ‘grey70’ respectively. |
verbose | TRUE or FALSE. Default is TRUE and the aSVG features not mapped are printed to R console. |
out.dir | The directory to save HTML and video files of spatial heatmaps. Default is NULL. |
This section generates an SHM plot for mouse data from the Expression Atlas. The code components are very similar to the previous Human Brain example. For brevity, the corresponding text explaining the code has been reduced to a minimum.
The chosen mouse RNA-Seq data compares tissue level gene expression across mammalian species (Merkin et al. 2012). The following searches the Expression Atlas for expression data from ‘heart’ and ‘Mus musculus’.
all.mus <- read_cache(cache.pa, 'all.mus') # Retrieve data from cache.
if (is.null(all.mus)) { # Save downloaded data to cache if it is not cached.
all.mus <- searchAtlasExperiments(properties="heart", species="Mus musculus")
save_cache(dir=cache.pa, overwrite=TRUE, all.mus)
}
Among the many matching entries, accession ‘E-MTAB-2801’ will be downloaded.
all.mus[7, ]
## DataFrame with 1 row and 4 columns
## Accession Species Type Title
## <character> <character> <character> <character>
## 1 E-MTAB-2801 Mus musculus RNA-seq of coding RNA Strand-specific RNA-..
rse.mus <- read_cache(cache.pa, 'rse.mus') # Read data from cache.
if (is.null(rse.mus)) { # Save downloaded data to cache if it is not cached.
rse.mus <- getAtlasData('E-MTAB-2801')[[1]][[1]]
save_cache(dir=cache.pa, overwrite=TRUE, rse.mus)
}
The design of the downloaded RNA-Seq experiment is described in the colData
slot of
rse.mus
. The following returns only its first three rows.
colData(rse.mus)[1:3, ]
## DataFrame with 3 rows and 4 columns
## AtlasAssayGroup organism organism_part strain
## <character> <character> <character> <character>
## SRR594393 g7 Mus musculus brain DBA/2J
## SRR594394 g21 Mus musculus colon DBA/2J
## SRR594395 g13 Mus musculus heart DBA/2J
The following example shows how to download from the above described SVG
repositories an aSVG image that matches the tissues and species
assayed in the gene expression data set downloaded in the previous subsection.
As before the image is saved to a directory named tmp.dir
.
tmp.dir <- paste0(normalizePath(tempdir(check=TRUE), winslash="/", mustWork=FALSE), '/shm')
feature.df <- return_feature(feature=c('heart', 'kidney'), species=c('Mus musculus'), keywords.any=TRUE, return.all=FALSE, dir=tmp.dir, remote=TRUE, match.only=FALSE)
To build this vignettes according to the R/Bioconductor package requirements, the
following code section uses the aSVG file instance included in the
spatialHeatmap
package rather than the downloaded instance from the example in
the previous step.
feature.df <- return_feature(feature=c('heart', 'kidney'), species=NULL, keywords.any=TRUE, return.all=FALSE, dir=svg.dir, remote=FALSE, match.only=FALSE)
## Accessing features...
## arabidopsis.thaliana_organ_shm.svg, arabidopsis.thaliana_organ_shm1.svg, arabidopsis.thaliana_organ_shm2.svg, arabidopsis.thaliana_root.cross_shm.svg, Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## LAYER_OUTLINE
##
## arabidopsis.thaliana_root.ebi_shm.svg, Extracted tiny path from path2027 is removed!
## Extracted tiny path from path2027-3 is removed!
## arabidopsis.thaliana_root.roottip_shm.svg, Extracted tiny path from path1146-6 is removed!
## Extracted tiny path from path1146 is removed!
## arabidopsis.thaliana_shoot.root_shm.svg, Extracted tiny path from path1146 is removed!
## arabidopsis.thaliana_shoot_shm.svg, Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0014892
##
## gallus_gallus.svg,
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0000451 LAYER_EFO
##
## homo.sapiens_brain.shiny_shm.svg, Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360
##
## Duplicated title text detected: hippocampus
## homo_sapiens.brain.svg, Element "a" is removed: a4174 !
## mus_musculus.male.svg, oryza.sativa_coleoptile.ANT_shm.svg, oryza.sativa_coleoptile.NT_shm.svg, us_map_shm.svg,
Return the names of the matching aSVG files.
unique(feature.df$SVG)
## [1] "gallus_gallus.svg" "mus_musculus.male.svg"
The following first selects mus_musculus.male.svg
as target aSVG, then
returns the first three rows of the resulting feature.df
, and finally prints
the unique set of all aSVG features.
feature.df <- subset(feature.df, SVG=='mus_musculus.male.svg')
feature.df[1:3, ]
## feature stroke color id element parent order
## 10 kidney 0.05 none UBERON_0002113 g LAYER_EFO 3
## 11 heart 0.05 none UBERON_0000948 path LAYER_EFO 13
## 12 path4204 0.05 none path4204 g LAYER_OUTLINE 1
## SVG
## 10 mus_musculus.male.svg
## 11 mus_musculus.male.svg
## 12 mus_musculus.male.svg
unique(feature.df[, 1])
## [1] "kidney" "heart" "path4204" "spleen"
## [5] "adrenal.gland" "colon" "caecum" "esophagus"
## [9] "tongue" "testis" "penis" "lung"
## [13] "diaphragm" "liver" "brain" "skeletal.muscle"
Obtain path of target aSVG on user system.
svg.mus <- system.file("extdata/shinyApp/example", "mus_musculus.male.svg", package="spatialHeatmap")
The following imports a sample target file that is included in this package. To inspect its content, the first three rows of the target file are printed to the screen.
mus.tar <- system.file('extdata/shinyApp/example/target_mouse.txt', package='spatialHeatmap')
target.mus <- read.table(mus.tar, header=TRUE, row.names=1, sep='\t')
target.mus[1:3, ]
## AtlasAssayGroup organism organism_part strain
## SRR594393 g7 Mus musculus brain DBA.2J
## SRR594394 g21 Mus musculus colon DBA.2J
## SRR594395 g13 Mus musculus heart DBA.2J
unique(target.mus[, 3])
## [1] "brain" "colon" "heart" "kidney"
## [5] "liver" "lung" "skeletal muscle" "spleen"
## [9] "testis"
Load custom target data into colData
slot.
colData(rse.mus) <- DataFrame(target.mus)
The raw RNA-Seq count are preprocessed with the following steps: (1) normalization, (2) aggregation of replicates, and (3) filtering of reliable expression data. The details of these steps are explained in the sub-section above using data from human.
se.nor.mus <- norm_data(data=rse.mus, norm.fun='ESF', log2.trans=TRUE) # Normalization
## Normalising: ESF
## type
## "ratio"
se.aggr.mus <- aggr_rep(data=se.nor.mus, sam.factor='organism_part', con.factor='strain', aggr='mean') # Aggregation of replicates
## Syntactically valid column names are made!
se.fil.mus <- filter_data(data=se.aggr.mus, sam.factor='organism_part', con.factor='strain', pOA=c(0.01, 5), CV=c(0.6, 100), dir=NULL) # Filtering of genes with low counts and variance
## Syntactically valid column names are made!
## All values before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.838 5.282 21.716
## All coefficient of variances (CVs) before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01741 0.29033 1.09806 1.51730 2.34078 5.09902
## All values after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.9781 2.1806 3.4151 21.7158
## All coefficient of variances (CVs) after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6001 0.8869 1.3158 1.4953 1.9883 5.0990
The pre-processed expression data for gene ‘ENSMUSG00000000263’ is plotted in form of an SHM. In this case the plot includes expression data for 8 tissues across 3 mouse strains.
shm.lis <- spatial_hm(svg.path=svg.mus, data=se.fil.mus, ID=c('ENSMUSG00000000263'), height=0.7, legend.width=0.7, legend.text.size=10, sub.title.size=9, ncol=3, ft.trans=c('skeletal muscle', 'path4204'), legend.nrow=4, line.size=0.2, line.color='grey70')
## Coordinates: mus_musculus.male.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
## ggplots/grobs: mus_musculus.male.svg ...
## ggplot: ENSMUSG00000000263, DBA.2J C57BL.6 CD1
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## ENSMUSG00000000263_DBA.2J_1 ENSMUSG00000000263_C57BL.6_1 ENSMUSG00000000263_CD1_1
The SHM plots in Figures 5 and below demonstrate
the usage of the transparency feature via the ft.trans
parameter. Except for the outline layer path4204
interfering with other tissues, the
corresponding mouse organ aSVG image includes overlapping tissue layers. In
this case the skelectal muscle layer partially overlaps with lung and heart
tissues. To view lung and heart in Figure 5, the skelectal
muscle tissue and outline are set transparent with ft.trans=c('skeletal muscle', 'path4204')
. To view
in the same aSVG the skeletal muscle tissue instead, ft.trans
is assigned
only path4204
as shown below.
To fine control the visual effects in feature rich aSVGs, the line.size
and
line.color
parameters are useful. This way one can adjust the thickness and
color of complex structures.
spatial_hm(svg.path=svg.mus, data=se.fil.mus, ID=c('ENSMUSG00000000263'), height=0.6, legend.text.size=10, sub.title.size=9, ncol=3, legend.ncol=2, line.size=0.1, line.color='grey70', ft.trans='path4204')
## Coordinates: mus_musculus.male.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
## ggplots/grobs: mus_musculus.male.svg ...
## ggplot: ENSMUSG00000000263, DBA.2J C57BL.6 CD1
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## ENSMUSG00000000263_DBA.2J_1 ENSMUSG00000000263_C57BL.6_1 ENSMUSG00000000263_CD1_1
This section generates an SHM plot for chicken data from the Expression Atlas. The code components are very similar to the Human Brain example. For brevity, the corresponding text explaining the code has been reduced to a minimum.
The chosen chicken RNA-Seq experiment compares the developmental changes across nine time points of seven organs (Cardoso-Moreira et al. 2019).
The following searches the Expression Atlas for expression data from ‘heart’ and ‘gallus’.
all.chk <- read_cache(cache.pa, 'all.chk') # Retrieve data from cache.
if (is.null(all.chk)) { # Save downloaded data to cache if it is not cached.
all.chk <- searchAtlasExperiments(properties="heart", species="gallus")
save_cache(dir=cache.pa, overwrite=TRUE, all.chk)
}
Among the matching entries, accession ‘E-MTAB-6769’ will be downloaded.
all.chk[3, ]
## DataFrame with 1 row and 4 columns
## Accession Species Type Title
## <character> <character> <character> <character>
## 1 E-MTAB-6769 Gallus gallus RNA-seq of coding RNA Chicken RNA-seq time..
rse.chk <- read_cache(cache.pa, 'rse.chk') # Read data from cache.
if (is.null(rse.chk)) { # Save downloaded data to cache if it is not cached.
rse.chk <- getAtlasData('E-MTAB-6769')[[1]][[1]]
save_cache(dir=cache.pa, overwrite=TRUE, rse.chk)
}
The design of the downloaded RNA-Seq experiment is described in the colData
slot of rse.chk
. The following returns only its first three rows.
colData(rse.chk)[1:3, ]
## DataFrame with 3 rows and 8 columns
## AtlasAssayGroup organism strain genotype
## <character> <character> <character> <character>
## ERR2576379 g1 Gallus gallus Red Junglefowl wild type genotype
## ERR2576380 g1 Gallus gallus Red Junglefowl wild type genotype
## ERR2576381 g2 Gallus gallus Red Junglefowl wild type genotype
## developmental_stage age sex organism_part
## <character> <character> <character> <character>
## ERR2576379 embryo 10 day female brain
## ERR2576380 embryo 10 day female brain
## ERR2576381 embryo 10 day female cerebellum
The following example shows how to download from the above introduced SVG
repositories an aSVG image that matches the tissues and species
assayed in the gene expression data set downloaded in the previous subsection.
As before the image is saved to a directory named tmp.dir
.
tmp.dir <- paste0(normalizePath(tempdir(check=TRUE), winslash="/", mustWork=FALSE), '/shm')
# Query aSVGs.
feature.df <- return_feature(feature=c('heart', 'kidney'), species=c('gallus'), keywords.any=TRUE, return.all=FALSE, dir=tmp.dir, remote=TRUE, match.only=FALSE)
To build this vignettes according to the R/Bioconductor package requirements, the
following code section uses the aSVG file instance included in the
spatialHeatmap
package rather than the downloaded instance from the previous
step.
feature.df <- return_feature(feature=c('heart', 'kidney'), species=c('gallus'), keywords.any=TRUE, return.all=FALSE, dir=svg.dir, remote=FALSE, match.only=FALSE)
## Accessing features...
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0014892
##
## gallus_gallus.svg,
feature.df[1:3, ] # A slice of the features.
## feature stroke color id element parent order
## 1 heart 0 none UBERON_0000948 path LAYER_EFO 2
## 2 kidney 0 none UBERON_0002113 path LAYER_EFO 3
## 3 chicken_outline 0 #a0a1a2 chicken_outline g LAYER_OUTLINE 1
## SVG
## 1 gallus_gallus.svg
## 2 gallus_gallus.svg
## 3 gallus_gallus.svg
Obtain path of target aSVG on user system.
svg.chk <- system.file("extdata/shinyApp/example", "gallus_gallus.svg", package="spatialHeatmap")
The following imports a sample target file that is included in this package. To inspect its content, the first three rows of the target file are printed to the screen.
chk.tar <- system.file('extdata/shinyApp/example/target_chicken.txt', package='spatialHeatmap')
target.chk <- read.table(chk.tar, header=TRUE, row.names=1, sep='\t')
target.chk[1:3, ]
## AtlasAssayGroup organism strain genotype
## ERR2576379 g1 Gallus gallus Red Junglefowl wild type genotype
## ERR2576380 g1 Gallus gallus Red Junglefowl wild type genotype
## ERR2576381 g2 Gallus gallus Red Junglefowl wild type genotype
## developmental_stage age sex organism_part
## ERR2576379 embryo day10 female brain
## ERR2576380 embryo day10 female brain
## ERR2576381 embryo day10 female cerebellum
Load custom target data into colData
slot.
colData(rse.chk) <- DataFrame(target.chk)
Return samples used for plotting SHMs.
unique(colData(rse.chk)[, 'organism_part'])
## [1] "brain" "cerebellum" "heart" "kidney" "ovary"
## [6] "testis" "liver"
Return conditions considered for plotting downstream SHM.
unique(colData(rse.chk)[, 'age'])
## [1] "day10" "day12" "day14" "day17" "day0" "day155" "day35" "day7"
## [9] "day70"
The raw RNA-Seq count are preprocessed with the following steps: (1) normalization, (2) aggregation of replicates, and (3) filtering of reliable expression data. The details of these steps are explained in the above sub-section on human data.
se.nor.chk <- norm_data(data=rse.chk, norm.fun='ESF', log2.trans=TRUE) # Normalization
## Normalising: ESF
## type
## "ratio"
se.aggr.chk <- aggr_rep(data=se.nor.chk, sam.factor='organism_part', con.factor='age', aggr='mean') # Replicate agggregation using mean
se.fil.chk <- filter_data(data=se.aggr.chk, sam.factor='organism_part', con.factor='age', pOA=c(0.01, 5), CV=c(0.6, 100), dir=NULL) # Filtering of genes with low counts and varince
## All values before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.6718 5.4389 5.2246 9.0067 23.0323
## All coefficient of variances (CVs) before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01497 0.08457 0.29614 0.79232 1.02089 7.87401
## All values after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2118 1.0459 2.0432 2.9370 23.0323
## All coefficient of variances (CVs) after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6001 0.7793 1.0556 1.3299 1.5950 5.4224
The expression profile for gene ENSGALG00000006346
is plotted across nine time
points in four organs in form of a composite SHM with 9 panels. Their layout in
three columns is controlled with the argument setting ncol=3
. The target organs are labeled by text in legend plot via label=TRUE
.
spatial_hm(svg.path=svg.chk, data=se.fil.chk, ID='ENSGALG00000006346', width=0.9, legend.width=0.9, legend.r=1.5, sub.title.size=9, ncol=3, legend.nrow=2, label=TRUE)
## Coordinates: gallus_gallus.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## UBERON_0014892
##
## Features in data not mapped: cerebellum, ovary, testis
## ggplots/grobs: gallus_gallus.svg ...
## ggplot: ENSGALG00000006346, day10 day12 day14 day17 day0 day155 day35 day7 day70
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## ENSGALG00000006346_day10_1 ENSGALG00000006346_day12_1 ENSGALG00000006346_day14_1 ENSGALG00000006346_day17_1 ENSGALG00000006346_day0_1 ENSGALG00000006346_day155_1 ENSGALG00000006346_day35_1 ENSGALG00000006346_day7_1 ENSGALG00000006346_day70_1
This section generates an SHM for Arabidopsis thaliana tissues with gene expression
data from the Affymetrix microarray technology. The chosen experiment used
ribosome-associated mRNAs from several cell populations of shoots and roots that were
exposed to hypoxia stress (Mustroph et al. 2009). In this case the expression data
will be downloaded from GEO with utilites
from the GEOquery
package (Davis and Meltzer 2007). The data preprocessing routines are
specific to the Affymetrix technology. The remaining code components for
generating SHMs are very similar to the previous examples. For brevity, the
text in this section explains mainly the steps that are specific to this data
set.
The GSE14502 data set will be downloaded with the getGEO
function from the
GEOquery
package. Intermediately, the expression data is stored in an
ExpressionSet
container (Huber et al. 2015), and then converted to a
SummarizedExperiment
object.
gset <- read_cache(cache.pa, 'gset') # Retrieve data from cache.
if (is.null(gset)) { # Save downloaded data to cache if it is not cached.
gset <- getGEO("GSE14502", GSEMatrix=TRUE, getGPL=TRUE)[[1]]
save_cache(dir=cache.pa, overwrite=TRUE, gset)
}
se.sh <- as(gset, "SummarizedExperiment")
The gene symbol identifiers are extracted from the rowData
component to be used
as row names. Similarly, one can work with AGI identifiers by providing below AGI
under Gene.Symbol
.
rownames(se.sh) <- make.names(rowData(se.sh)[, 'Gene.Symbol'])
The following returns a slice of the experimental design stored in the
colData
slot. Both the samples and conditions are contained in the title
column.
The samples include promoters (pGL2, pCO2, pSCR, pWOL, p35S), tissues
and organs (root atrichoblast epidermis, root cortex meristematic zone, root
endodermis, root vasculature, root_total and shoot_total); and the conditions
are control and hypoxia.
colData(se.sh)[60:63, 1:4]
## DataFrame with 4 rows and 4 columns
## title geo_accession status
## <character> <character> <character>
## GSM362227 shoot_hypoxia_pGL2_r.. GSM362227 Public on Oct 12 2009
## GSM362228 shoot_hypoxia_pGL2_r.. GSM362228 Public on Oct 12 2009
## GSM362229 shoot_control_pRBCS_.. GSM362229 Public on Oct 12 2009
## GSM362230 shoot_control_pRBCS_.. GSM362230 Public on Oct 12 2009
## submission_date
## <character>
## GSM362227 Jan 21 2009
## GSM362228 Jan 21 2009
## GSM362229 Jan 21 2009
## GSM362230 Jan 21 2009
In this example, the aSVG image has been generated in Inkscape from
the corresponding figure in Mustroph et al. (2009). The resulting custom figure
has been included as a sample aSVG file in the spatialHeatmap
package. Detailed
instructions for generating custom aSVG images in Inkscape are provided in the
SVG tutorial.
The annotations in the corresponding aSVG file located under svg.dir
can be
queried with the return_features
function.
feature.df <- return_feature(feature=c('pGL2', 'pRBCS'), species=c('shoot'), keywords.any=TRUE, return.all=FALSE, dir=svg.dir, remote=FALSE, match.only=FALSE)
## Accessing features...
## Extracted tiny path from path1146-6 is removed!
## Extracted tiny path from path1146 is removed!
## arabidopsis.thaliana_shoot.root_shm.svg, Extracted tiny path from path1146 is removed!
## arabidopsis.thaliana_shoot_shm.svg,
The unique set of the matching aSVG files can be returned as follows.
unique(feature.df$SVG)
## [1] "arabidopsis.thaliana_shoot.root_shm.svg"
## [2] "arabidopsis.thaliana_shoot_shm.svg"
The aSVG file arabidopsis.thaliana_shoot_shm.svg
is chosen to generate the SHM in this section.
feature.df <- subset(feature.df, SVG=='arabidopsis.thaliana_shoot_shm.svg')
feature.df[1:3, ]
## feature stroke color id element parent order
## 17 shoot_pGL2 0.1500001 #10ddeb shoot_pGL2 g container 2
## 18 shoot_pRBCS 0.1500001 #7227ab shoot_pRBCS g container 3
## 19 g258 0.1500001 #f7fcf5 g258 g container 1
## SVG
## 17 arabidopsis.thaliana_shoot_shm.svg
## 18 arabidopsis.thaliana_shoot_shm.svg
## 19 arabidopsis.thaliana_shoot_shm.svg
Obtain full path of target aSVG on user system.
svg.sh <- system.file("extdata/shinyApp/example", "arabidopsis.thaliana_shoot_shm.svg", package="spatialHeatmap")
The following imports a sample target file that is included in this package. To inspect its content, four selected rows of this target file are printed to the screen.
sh.tar <- system.file('extdata/shinyApp/example/target_arab.txt', package='spatialHeatmap')
target.sh <- read.table(sh.tar, header=TRUE, row.names=1, sep='\t')
target.sh[60:63, ]
## col.name sample condition
## shoot_hypoxia_pGL2_rep1 GSM362227 shoot_pGL2 hypoxia
## shoot_hypoxia_pGL2_rep2 GSM362228 shoot_pGL2 hypoxia
## shoot_control_pRBCS_rep1 GSM362229 shoot_pRBCS control
## shoot_control_pRBCS_rep2 GSM362230 shoot_pRBCS control
Return all samples present in target file.
unique(target.sh[, 'sample'])
## [1] "root_total" "root_p35S" "root_pSCR" "root_pSHR"
## [5] "root_pWOL" "root_pGL2" "root_pSUC2" "root_pSultr2.2"
## [9] "root_pCO2" "root_pPEP" "root_pRPL11C" "shoot_total"
## [13] "shoot_p35S" "shoot_pGL2" "shoot_pRBCS" "shoot_pSUC2"
## [17] "shoot_pSultr2.2" "shoot_pCER5" "shoot_pKAT1"
Return all conditions present in target file.
unique(target.sh[, 'condition'])
## [1] "control" "hypoxia"
Load custom target data into colData
slot.
colData(se.sh) <- DataFrame(target.sh)
The downloaded GSE14502 data set has already been normalized with the RMA algorithm (Gautier et al. 2004). Thus, the pre-processing steps can be restricted to the aggregation of replicates and filtering of reliably expressed genes. For the latter, the following code will retain genes with expression values larger than 6 (log2 space) in at least 3% of all samples (pOA=c(0.03, 6)), and with a coefficient of variance (CV) between 0.30 and 100 (CV=c(0.30, 100)).
se.aggr.sh <- aggr_rep(data=se.sh, sam.factor='sample', con.factor='condition', aggr='mean') # Replicate agggregation using mean
se.fil.arab <- filter_data(data=se.aggr.sh, sam.factor='sample', con.factor='condition', pOA=c(0.03, 6), CV=c(0.30, 100), dir=NULL) # Filtering of genes with low intensities and variance
## All values before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.345 4.879 6.481 6.763 8.263 15.107
## All coefficient of variances (CVs) before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01047 0.03424 0.05347 0.07706 0.09526 0.54344
## All values after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.644 4.838 6.249 7.364 9.756 15.004
## All coefficient of variances (CVs) after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3008 0.3203 0.3385 0.3531 0.3735 0.5434
The expression profile for the HRE2 gene is plotted for the control and the hypoxia treatment across six cell types (Figure 8).
spatial_hm(svg.path=svg.sh, data=se.fil.arab, ID=c("HRE2"), height=0.7, legend.nrow=3, legend.text.size=11)
## Coordinates: arabidopsis.thaliana_shoot_shm.svg ...
## CPU cores: 1
## Extracted tiny path from path1146 is removed!
## Features in data not mapped: root_total, root_p35S, root_pSCR, root_pSHR, root_pWOL, root_pGL2, root_pSUC2, root_pSultr2.2, root_pCO2, root_pPEP, root_pRPL11C, shoot_total, shoot_p35S
## ggplots/grobs: arabidopsis.thaliana_shoot_shm.svg ...
## ggplot: HRE2, control hypoxia
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## HRE2_control_1 HRE2_hypoxia_1
The above examples are SHMs plotted at the single spatial dimension. This section showcases the application of SHMs at spatial and temporal dimesions, i.e. data assayed in spatial feature(s) across different development stages.
The data at single spatial dimension contains only two factors: samples and conditions. By contrast, the spatiotemporal data contains three factors: samples, conditions, and times (development stages). There are three alternatives to organize the three factors: 1) combine samples and conditions; 2) combine samples and times; or 3) combine samples, conditions, and times. More details are provided in the Supplementary Section.
Which option to choose depends on the specific data and aSVGs, and the chosen one should achieve optimal visualization. In this example, the third is the best choice and will be showcased in the first part. Meanwhile, for demonstration purpose the second choice will also be illustrated in the second part. In the spatiotemporal application, different development stages can be represented in different aSVG images, and this feature will be presented in the third part.
The data is from the transcriptome analysis on rice coleoptile during germinating and developing stages under anoxia and re-oxygenation (Narsai et al. 2017), which is also downloaded with ExpressionAtlas
.
rse.clp <- read_cache(cache.pa, 'rse.clp') # Retrieve data from cache.
if (is.null(rse.clp)) { # Save downloaded data to cache if it is not cached.
rse.clp <- getAtlasData('E-GEOD-115371')[[1]][[1]]
save_cache(dir=cache.pa, overwrite=TRUE, rse.clp)
}
The targets file was prepared according to the experiment design stored in colData
slot of res.clp
by using the convenient function edit_tar
, and pre-packaged in spatialHeatmap
.
clp.tar <- system.file('extdata/shinyApp/example/target_coleoptile.txt', package='spatialHeatmap')
target.clp <- read_fr(clp.tar)
The helper function edit_tar
is designed to edit entries in the targets file. Below is an example of editing the tissue entries.
cdat <- colData(rse.clp) # Original targets file.
unique(cdat$organism_part) # Original tissues.
## [1] "plant embryo" "plant embryo coleoptile"
## [3] "coleoptile"
cdat <- edit_tar(cdat, column='organism_part', old=c('plant embryo', 'plant embryo coleoptile'), new=c('embryo', 'embryoColeoptile')) # Replace old entries with desired ones.
unique(cdat$organism_part) # New tissue entries.
## [1] "embryo" "embryoColeoptile" "coleoptile"
Inspect the tissues, conditions, and times, where “A” and “N” denote “aerobic” and “anaerobic” respectively.
target.clp[1:3, c(6, 7, 9, 10)] # A slice of the targets file.
## age organism_part stimulus con
## SRR7265373 0h embryo aerobic A
## SRR7265374 0h embryo aerobic A
## SRR7265375 0h embryo aerobic A
unique(target.clp[, 'age']) # All development stages.
## [1] "0h" "1h" "3h" "12h" "24h" "48h" "72h" "96h"
## [9] "72N24A"
unique(target.clp[, 'organism_part']) # All tissues.
## [1] "embryo" "embryoColeoptile" "coleoptile"
unique(target.clp[, 'stimulus']) # All conditions.
## [1] "aerobic" "anaerobic" "NA"
Combine sample, time, condition factors using the helper function com_factor
. The targets file including the new composite factors (samTimeCon
) is loaded to the colData
slot in rse.clp
internally.
rse.clp <- com_factor(rse.clp, target.clp, factors2com=c('organism_part', 'age', 'con'), sep='.', factor.new='samTimeCon')
## New combined factors: embryo.0h.A embryoColeoptile.1h.A embryoColeoptile.1h.N embryoColeoptile.3h.A embryoColeoptile.3h.N embryoColeoptile.12h.A embryoColeoptile.12h.N embryoColeoptile.24h.A embryoColeoptile.24h.N coleoptile.48h.A coleoptile.48h.N coleoptile.72h.A coleoptile.72h.N coleoptile.96h.A coleoptile.96h.N coleoptile.72N24A.NA
colData(rse.clp)[1:3, c(6, 7, 9:11)]
## DataFrame with 3 rows and 5 columns
## age organism_part stimulus con samTimeCon
## <character> <character> <character> <character> <character>
## SRR7265373 0h embryo aerobic A embryo.0h.A
## SRR7265374 0h embryo aerobic A embryo.0h.A
## SRR7265375 0h embryo aerobic A embryo.0h.A
Inspect the sample-time-condition composite factors. At least one of the composite factors should have a matching feature counterpart in the aSVG file, otherwise no aSVG file will be returned in the next section.
target.clp <- colData(rse.clp)
unique(target.clp$samTimeCon)
## [1] "embryo.0h.A" "embryoColeoptile.1h.A" "embryoColeoptile.1h.N"
## [4] "embryoColeoptile.3h.A" "embryoColeoptile.3h.N" "embryoColeoptile.12h.A"
## [7] "embryoColeoptile.12h.N" "embryoColeoptile.24h.A" "embryoColeoptile.24h.N"
## [10] "coleoptile.48h.A" "coleoptile.48h.N" "coleoptile.72h.A"
## [13] "coleoptile.72h.N" "coleoptile.96h.A" "coleoptile.96h.N"
## [16] "coleoptile.72N24A.NA"
Similar with the Arabidopsis Shoot example, the aSVG image has been generated in Inkscape from
the corresponding figure in Narsai et al. (2017) according to the SVG tutorial, and the resulting custom figure
has been included in spatialHeatmap
.
Query the aSVG files with one composite factor embryo.0h.A
.
feature.df <- return_feature(feature=c('embryo.0h.A', 'embryoColeoptile.1h.A'), species=c('oryza', 'sativa'), keywords.any=FALSE, return.all=FALSE, dir=svg.dir, remote=FALSE, match.only=FALSE)
## Accessing features...
## oryza.sativa_coleoptile.ANT_shm.svg, oryza.sativa_coleoptile.NT_shm.svg,
feature.df[1:2, ] # The first two rows of the query results.
## feature stroke color id element parent
## 1 embryo.0h.A 0.2 none embryo.0h.A g container
## 2 embryoColeoptile.1h.A 0.2 none embryoColeoptile.1h.A path container
## order SVG
## 1 25 oryza.sativa_coleoptile.ANT_shm.svg
## 2 27 oryza.sativa_coleoptile.ANT_shm.svg
Only one aSVG file oryza.sativa_coleoptile.ANT_shm.svg
is retrieved.
unique(feature.df$SVG)
## [1] "oryza.sativa_coleoptile.ANT_shm.svg"
Note no matter how the factors are combined, the composite factors of interest should always have matching counterparts in the aSVG file. In this example, all composite factors are matched to the aSVG.
unique(target.clp$samTimeCon) %in% unique(feature.df$feature)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE
Obtain full path of the retrieved aSVG on user system.
svg.clp1 <- system.file("extdata/shinyApp/example", "oryza.sativa_coleoptile.ANT_shm.svg", package="spatialHeatmap")
The raw RNA-Seq count are preprocessed with the following steps: (1)
normalization, (2) aggregation of replicates, and (3) filtering of reliable
expression data. The details of these steps are explained in the above
sub-section on human data. The normalization step is same for combined factors sample-time and sample-time-condition, while the aggregation and filtering are different. The difference is reflected by sam.factor
and con.factor
, and subsequently the column names in resulting assay
slot of the SummarizedExperiment
object.
se.nor.clp <- norm_data(data=rse.clp, norm.fun='ESF', log2.trans=TRUE) # Normalization
## Normalising: ESF
## type
## "ratio"
Aggregation and filtering by sample-time-condition factor.
se.aggr.clp1 <- aggr_rep(data=se.nor.clp, sam.factor='samTimeCon', con.factor=NULL, aggr='mean') # Replicate agggregation using mean
se.fil.clp1 <- filter_data(data=se.aggr.clp1, sam.factor='samTimeCon', con.factor=NULL, pOA=c(0.07, 7), CV=c(0.7, 100), dir=NULL) # Filtering of genes with low counts and varince.
## All values before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2995 4.0283 4.3614 7.7436 18.0561
## All coefficient of variances (CVs) before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.008408 0.087498 0.257247 0.654965 0.847843 4.000000
## All values after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4401 2.6058 3.8745 7.1382 15.6921
## All coefficient of variances (CVs) after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7005 0.7846 0.9030 0.9768 1.0853 1.9504
Since all three factors (conditions, times, tissues) are combined, the resulting data table loses the double underscore string.
assay(se.fil.clp1)[1:3, 1:3] # A slice of the resulting data table.
## embryo.0h.A embryoColeoptile.1h.A embryoColeoptile.1h.N
## Os01g0106300 2.549855 0.2403387 1.902315
## Os01g0111600 12.116707 12.9343197 12.708776
## Os01g0127600 6.495876 7.3024594 7.443524
The expression profile of gene Os12g0630200
and Os01g0106300
in coleoptile is plotted across eight time
points under anoxia and re-oxygenation in form of a composite STHM.
shm.lis <- spatial_hm(svg.path=svg.clp1, data=se.fil.clp1, ID=c('Os12g0630200', 'Os01g0106300'), legend.r=0.7, legend.key.size=0.01, legend.text.size=8, legend.nrow=8, ncol=1, width=0.8, line.size=0)
## Coordinates: oryza.sativa_coleoptile.ANT_shm.svg ...
## CPU cores: 1
## ggplots/grobs: oryza.sativa_coleoptile.ANT_shm.svg ...
## ggplot: Os12g0630200, con
## ggplot: Os01g0106300, con
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## Os12g0630200_con_1 Os01g0106300_con_1
This STHM example is also deployed as an interacive Shiny app, where STHMs are provided in form of static images, interactive HTML files, and video files. Click here to see this app.
This part showcases the usage of sample-time factor. The sample-condition factor could be applied similarly. To obtain optimal result, the data under aerobic is excluded. Since most steps are similar with the sample-time-condition factor, the following process is reduced to minimum.
The same RNA-seq count data from Narsai et al. (2017) is downloaded.
rse.clp <- read_cache(cache.pa, 'rse.clp') # Retrieve data from cache.
if (is.null(rse.clp)) { # Save downloaded data to cache if it is not cached.
rse.clp <- getAtlasData('E-GEOD-115371')[[1]][[1]]
save_cache(dir=cache.pa, overwrite=TRUE, rse.clp)
}
The same targets file with sample-time factor is imported.
clp.tar <- system.file('extdata/shinyApp/example/target_coleoptile.txt', package='spatialHeatmap')
target.clp <- read_fr(clp.tar)
Inspect the samples, conditions, and times.
target.clp[1:3, c(6, 7, 9, 10)] # A slice of the targets file.
## age organism_part stimulus con
## SRR7265373 0h embryo aerobic A
## SRR7265374 0h embryo aerobic A
## SRR7265375 0h embryo aerobic A
unique(target.clp[, 'age']) # All development stages.
## [1] "0h" "1h" "3h" "12h" "24h" "48h" "72h" "96h"
## [9] "72N24A"
unique(target.clp[, 'organism_part']) # All tissues.
## [1] "embryo" "embryoColeoptile" "coleoptile"
unique(target.clp[, 'stimulus']) # All conditions.
## [1] "aerobic" "anaerobic" "NA"
Combine sample and time factors, which is the essential difference with sample-time-condition example.
rse.clp <- com_factor(rse.clp, target.clp, factors2com=c('organism_part', 'age'), factor.new='samTime')
## New combined factors: embryo.0h embryoColeoptile.1h embryoColeoptile.3h embryoColeoptile.12h embryoColeoptile.24h coleoptile.48h coleoptile.72h coleoptile.96h coleoptile.72N24A
target.clp <- colData(rse.clp)
target.clp[1:3, ]
## DataFrame with 3 rows and 11 columns
## AtlasAssayGroup genotype organism cultivar
## <character> <character> <character> <character>
## SRR7265373 g1 wild type genotype Oryza sativa Amaroo
## SRR7265374 g1 wild type genotype Oryza sativa Amaroo
## SRR7265375 g1 wild type genotype Oryza sativa Amaroo
## developmental_stage age organism_part environmental_history
## <character> <character> <character> <character>
## SRR7265373 seed 0h embryo etiolation
## SRR7265374 seed 0h embryo etiolation
## SRR7265375 seed 0h embryo etiolation
## stimulus con samTime
## <character> <character> <character>
## SRR7265373 aerobic A embryo.0h
## SRR7265374 aerobic A embryo.0h
## SRR7265375 aerobic A embryo.0h
Similarly the custom aSVG image was generated in Inkscape from
the corresponding figure in Narsai et al. (2017) according to the SVG tutorial and included in spatialHeatmap
.
Query the aSVG files.
feature.df <- return_feature(feature=c('embryo.0h', 'embryoColeoptile1h'), species=c('oryza', 'sativa'), keywords.any=FALSE, return.all=FALSE, dir=svg.dir, remote=FALSE, match.only=FALSE)
## Accessing features...
## oryza.sativa_coleoptile.ANT_shm.svg, oryza.sativa_coleoptile.NT_shm.svg,
feature.df[1:2, ] # The first two rows of the query results.
## feature stroke color id element parent order
## 1 embryo.0h 0.2 none embryo.0h g container 25
## 2 rect1033_LGD 0.2 #3f1d38 rect1033_LGD path container 1
## SVG
## 1 oryza.sativa_coleoptile.NT_shm.svg
## 2 oryza.sativa_coleoptile.NT_shm.svg
Only one aSVG file oryza.sativa_coleoptile.NT_shm.svg
is retrieved.
unique(feature.df$SVG)
## [1] "oryza.sativa_coleoptile.NT_shm.svg"
Note no matter how the factors are combined, the composite factor of interest should always have matching counterparts in the aSVG file.
unique(target.clp$samTime) %in% unique(feature.df$feature)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Obtain full path of the retrieved aSVG on user system.
svg.clp2 <- system.file("extdata/shinyApp/example", "oryza.sativa_coleoptile.NT_shm.svg", package="spatialHeatmap")
The raw RNA-Seq count are preprocessed with the following steps: (1)
normalization, (2) aggregation of replicates, and (3) filtering of reliable
expression data. The normalization step is the same for composite factors sample-time and sample-time-condition, while the aggregation and filtering are different. The difference is reflected by sam.factor
and con.factor
, and subsequently the column names in the assay
slot of the resulting SummarizedExperiment
object.
se.nor.clp <- norm_data(data=rse.clp, norm.fun='ESF', log2.trans=TRUE) # Normalization
## Normalising: ESF
## type
## "ratio"
Aggregation and filtering by sample-time factor.
se.aggr.clp2 <- aggr_rep(data=se.nor.clp, sam.factor='samTime', con.factor='stimulus', aggr='mean') # Replicate agggregation using mean.
se.fil.clp2 <- filter_data(data=se.aggr.clp2, sam.factor='samTime', con.factor='stimulus', pOA=c(0.07, 7), CV=c(0.7, 100), dir=NULL) # Filtering of genes with low counts and varince.
## All values before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2995 4.0283 4.3614 7.7436 18.0561
## All coefficient of variances (CVs) before filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.008408 0.087498 0.257247 0.654965 0.847843 4.000000
## All values after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4401 2.6058 3.8745 7.1382 15.6921
## All coefficient of variances (CVs) after filtering:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7005 0.7846 0.9030 0.9768 1.0853 1.9504
Since only sample and time factors are combined, the resulting data table preserves the double underscore string, which is different from the sample-time-condition example.
df.fil.clp <- assay(se.fil.clp2)
df.fil.clp[1:3, 1:3] # A slice of the resulting data table.
## embryo.0h__aerobic embryoColeoptile.1h__aerobic
## Os01g0106300 2.549855 0.2403387
## Os01g0111600 12.116707 12.9343197
## Os01g0127600 6.495876 7.3024594
## embryoColeoptile.1h__anaerobic
## Os01g0106300 1.902315
## Os01g0111600 12.708776
## Os01g0127600 7.443524
The optimal viusalization on complete data is achieved on sample-time-condition factor. To also obtain the best result on sample-time factor, the data under aerobic is excluded.
df.fil.clp1 <- df.fil.clp[, !grepl('__aerobic', colnames(df.fil.clp))] # Exclude aerobic data.
df.fil.clp1[1:3, 1:3] # A slice of the data table without aerobic data.
## embryoColeoptile.1h__anaerobic embryoColeoptile.3h__anaerobic
## Os01g0106300 1.902315 1.357282
## Os01g0111600 12.708776 12.531359
## Os01g0127600 7.443524 6.919786
## embryoColeoptile.12h__anaerobic
## Os01g0106300 2.9825250
## Os01g0111600 9.7997716
## Os01g0127600 0.9402776
The expression profile of gene Os12g0630200
in coleoptile is plotted across eight time
points under anoxia and re-oxygenation respectively.
shm.lis <- spatial_hm(svg.path=svg.clp2, data=df.fil.clp1, ID=c('Os12g0630200'), legend.r=0.9, legend.key.size=0.02, legend.text.size=9, legend.nrow=8, ncol=1, line.size=0)
## Coordinates: oryza.sativa_coleoptile.NT_shm.svg ...
## CPU cores: 1
## ggplots/grobs: oryza.sativa_coleoptile.NT_shm.svg ...
## ggplot: Os12g0630200, anaerobic NA
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## Os12g0630200_anaerobic_1 Os12g0630200_NA_1
In the spatiotemporal application, different development stages may need to be represented in separate aSVG images. In that case, the spatial_hm
function is able to arrange multiple aSVGs in a single SHM plot. To organize the subplots, the names
of the separate aSVG files are expected to include the following suffixes: *_shm1.svg
,
*_shm2.svg
, etc. The paths to the aSVG files are provided under the
svg.path
argument. By default, every aSVG image will have a legend plot on
the right. The legend
argument provides fine control over which legend plots
to display.
As a simple toy example, the following stores random numbers in a
data.frame
.
df.random <- data.frame(matrix(sample(x=1:100, size=50, replace=TRUE), nrow=10))
colnames(df.random) <- c('shoot_totalA__condition1', 'shoot_totalA__condition2', 'shoot_totalB__condition1', 'shoot_totalB__condition2', 'notMapped') # Assign column names
rownames(df.random) <- paste0('gene', 1:10) # Assign row names
df.random[1:3, ]
## shoot_totalA__condition1 shoot_totalA__condition2
## gene1 66 17
## gene2 95 40
## gene3 74 9
## shoot_totalB__condition1 shoot_totalB__condition2 notMapped
## gene1 17 91 5
## gene2 90 69 64
## gene3 88 72 37
Next the paths to the aSVG files are obtained, here for younger and older plants using *_shm1
and *_shm1
, respectively, which are made from Mustroph et al. (2009).
svg.sh1 <- system.file("extdata/shinyApp/example", "arabidopsis.thaliana_organ_shm1.svg", package="spatialHeatmap")
svg.sh2 <- system.file("extdata/shinyApp/example", "arabidopsis.thaliana_organ_shm2.svg", package="spatialHeatmap")
The following generates the corresponding SHMs plot for gene1
. The orginal
image dimensions can be preserved by assigning TRUE
to the preserve.scale
argument.
spatial_hm(svg.path=c(svg.sh1, svg.sh2), data=df.random, ID=c('gene1'), width=0.7, legend.r=0.9, legend.width=1, preserve.scale=TRUE)
## Coordinates: arabidopsis.thaliana_organ_shm1.svg ...
## CPU cores: 1
## Coordinates: arabidopsis.thaliana_organ_shm2.svg ...
## CPU cores: 1
## Features in data not mapped: shoot_totalB, notMapped
## Features in data not mapped: shoot_totalA, notMapped
## ggplots/grobs: arabidopsis.thaliana_organ_shm1.svg ...
## ggplot: gene1, condition1 condition2
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## gene1_condition1_1 gene1_condition2_1
## ggplots/grobs: arabidopsis.thaliana_organ_shm2.svg ...
## ggplot: gene1, condition1 condition2
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## gene1_condition1_2 gene1_condition2_2
Note in Figure 11 shoots have thicker outlines than roots. This is another function of spatial_hm
, i.e. preserves the outline thicknesses defined in aSVG files. This feature is particularly useful in cellular SHMs where different cell types have different cell-wall thicknesses. The outline widths can be updated with update_feature
programatically or with Inkscape manually. The former is illustrated in the Supplementary Section.
In principle, the spatialHeatmap
is extendable to as many factors (e.g. samples, conditions, times) as possible. The most common scenario involves only two factors of samples and conditions (Section 3). If more factors are relevant such as development stages, geographical locations, genotypes, etc, all or selected factors should be combined to generate composite factors. The spatiotemporal section is an illustration of three factors. Similar combining strategies should be appied in cases of four or more factors. A rule of thumb is the composite factors of interest must have a matching counterpart in the aSVG file, otherwise target spatial features are not painted in spatial heatmaps.
SHMs are suitable for comparing assay profiles among small number of items
(e.g. few genes or proteins) across cell types and conditions. To also
support analysis routines of larger number of items, spatialHeatmap
integrates
functionalities for identifying groups of items with similar and/or dissimilar
assay profiles, and subsequently analyzing the results with data mining methods
methods that scale to larger numbers of items than SHMs, such as hierarchical
clustering and network analysis methods. The following introduces both
approaches using as sample data the gene expression data from Arabidopsis
shoots from the previous section.
To identify similar items, the submatrix
function can be used. The latter
identifies items sharing similar profiles with one or more query items of
interest. The given example uses correlation coefficients as similarity metric.
Three filtering parameters are provided to control the similarity and number of
items in the result. First, the p
argument allows to restrict the number of
similar items to return based on a percentage cutoff relative to the number of
items in the assay data set (e.g. 1% of the top most similar items). Second,
a fixed number n
of the most similar items can be returned. Third, all items
above a user definable correlation coefficient cutoff value can be returned,
such as a Pearson correlation coefficient (PCC) of at least 0.6. If several
query items are provided then the function returns the similar genes for each
query, while assuring uniqueness among items in the result.
The following example uses as query the two genes: ‘RCA’ and ‘HRE2’. The ann
argument corresponds to the column name in the rowData
slot containing gene
annotation information. The latter is only relevant if users want to see these
annotations when mousing over a node in the interactive network
plot generated in the next section.
sub.mat <- submatrix(data=se.fil.arab, ann='Target.Description', ID=c('RCA', 'HRE2'), p=0.1)
The result from the previous step is the assay matrix subsetted to the genes sharing similar assay profiles with the query genes ‘RCA’ and ‘HRE2’.
sub.mat[c('RCA', 'HRE2'), c(1:3, 37)] # Subsetted assay matrix
## root_total__control root_total__hypoxia root_p35S__control
## RCA 6.569305 6.416811 7.443822
## HRE2 5.486920 11.370161 5.578123
## Target.Description
## RCA hypothetical protein ;supported by full-length cDNA: Ceres:7114.
## HRE2 putative AP2 domain transcription factor
Subsequently, hierarchical clustering is applied to the subsetted assay matrix
containing only the genes that share profile similarities with the query genes
‘RCA’ and ‘HRE2’. The clustering result is displayed as a matrix heatmap where
the rows and columns are sorted by the corresponding hierarchical clustering
dendrograms (Figure 12). The position of the query genes (‘RCA’
and ‘HRE2’) is indicated in the heatmap by black lines. Setting static=FALSE
will launch the interactive mode, where users can zoom into the heatmap by
selecting subsections in the image or zoom out by double clicking.
matrix_hm(ID=c('RCA', 'HRE2'), data=sub.mat, angleCol=80, angleRow=35, cexRow=0.8, cexCol=0.8, margin=c(10, 6), static=TRUE, arg.lis1=list(offsetRow=0.01, offsetCol=0.01))
Network analysis is performed with the WGCNA algorithm (Langfelder and Horvath 2008; Ravasz et al. 2002) using as input the subsetted assay matrix generated in the
previous section. The objective is to identify network modules that can be
visualized in the following step in form of network graphs. Applied to the gene
expression sample data used here, these network modules represent groups of
genes sharing highly similar expression profiles. Internally, the network
module identification includes five major steps. First, a correlation matrix is
computed using distance or correlation-based similarity metrics. Second, the
obtained matrix is transformed into an adjacency matrix defining the
connections among items. Third, the adjacency matrix is used to calculate a
topological overlap matrix (TOM) where shared neighborhood information among
items is used to preserve robust connections, while removing spurious
connections. Fourth, the distance transformed TOM is used for hierarchical
clustering. To maximize time performance, the hierarchical clustering is
performed with the flashClust
package (Langfelder and Horvath 2012). Fifth, network modules
are identified with the dynamicTreeCut
package (Langfelder, Zhang, and Steve Horvath 2016). Its ds
(deepSplit
) argument can be assigned integer values from 0
to 3
, where
higher values increase the stringency of the module identification process. To
simplify the network module idenification process with WGCNA, the individual
steps can be executed with a single function called adj_mod
. The result is a
list containing the adjacency matrix and the final module assignments stored in
a data.frame
. Since the interactive network feature used in the
visualization step below performs best on smaller modules, only modules are
returned that were obtained with stringent ds
settings (here ds=2
and ds=3
).
adj.mod <- adj_mod(data=sub.mat)
A slice of the adjacency matrix is shown below.
adj.mod[['adj']][1:3, 1:3]
## CA1 PSAH.1 AT2G26500
## CA1 1.0000000 0.9514016 0.9636366
## PSAH.1 0.9514016 1.0000000 0.9611725
## AT2G26500 0.9636366 0.9611725 1.0000000
The module assignments are stored in a data frame
. Its columns contain the results
for the ds=2
and ds=3
settings. Integer values \(>0\) are the module labels, while \(0\)
indicates unassigned items. The following returns the first three rows of the module
assignment result.
adj.mod[['mod']][1:3, ]
## 2 3
## CA1 1 1
## PSAH.1 1 1
## AT2G26500 1 1
Network modules can be visualized with the network
function. To plot a module
containing an item (gene) of interest, its ID (e.g. ‘HRE2’) needs to be
provided under the corresponding argument. Typically, this could be one of the
items chosen for the above SHM plots. There are two modes to visualize the
selected module: static or interactive. Figure 13 was generated
with static=TRUE
. Setting static=FALSE
will generate the interactive
version. In the network plot shown below the nodes and edges represent genes
and their interactions, respectively. The thickness of the edges denotes the
adjacency levels, while the size of the nodes indicates the degree of
connectivity of each item in the network. The adjacency and connectivity levels
are also indicated by colors. For example, in Figure 13
connectivity increases from “turquoise” to “violet”, while the adjacency
increases from “yellow” to “blue”. If a gene of interest has been assigned
under ID
, then its ID is labeled in the plot with the suffix tag: ’*_target’.
network(ID="HRE2", data=sub.mat, adj.mod=adj.mod, adj.min=0.90, vertex.label.cex=1.2, vertex.cex=2, static=TRUE)
Setting static=FALSE
launches the interactive network. In this mode there
is an interactive color bar that denotes the gene connectivity. To modify it,
the color lables need to be provided in a comma separated format, e.g.:
yellow, orange, red
. The latter would indicate that the gene connectivity
increases from yellow to red.
If the subsetted expression matrix contains gene/protein annotation information in the last column, then it will be shown when moving the cursor over a network node.
network(ID="HRE2", data=sub.mat, adj.mod=adj.mod, static=FALSE)
In additon to running spatialHeatmap
from R, the package includes a Shiny
App that provides access to the same
functionalities from an intuitive-to-use web browser interface. Apart from
being very user-friendly, this App conveniently organizes the results of the
entire visualization workflow in a single browser window with options to adjust
the parameters of the individual components interactively. For instance, genes
can be selected and replotted in the SHM simply by clicking the corresponding
rows in the expression table included in the same window.
This representation is very efficient in guiding the interpretation of the results
in a visual and user-friendly manner. For testing purposes, the spatialHeatmap
Shiny App also includes ready-to-use sample expression data and aSVG images
along with embedded user instructions.
The Shiny App of spatialHeatmap
can be launched from an R session with the following function call.
shiny_shm()
The dashboard panels of the Shiny App are organized as follows:
A screenshot is shown below depicting an SHM plot generated with the Shiny App of spatialHeatmap
(Figure 14).
After launching, the Shiny App displays by default one of the included data sets.
The assay data and aSVG images are uploaded to the Shiny App as tabular files
(e.g. in CSV or TSV format) and SVG files, respectively. To also allow users
to upload gene expression data stored in SummarizedExperiment
objects, one
can export it from R to a tabular file with the filter_data
function, where
the user specifies the directory path under the dir
argument. This will create
in the target directory a tablular file named customData.txt
in TSV format.
The column names in this file preserve the experimental design information from the
colData
slot by concatenating the corresponding sample and condition
information separated by double underscores. An example of this format is shown
in Table 1.
se.fil.arab <- filter_data(data=se.aggr.sh, ann="Target.Description", sam.factor='sample', con.factor='condition', pOA=c(0.03, 6), CV=c(0.30, 100), dir='./')
To interactively access gene-, transcript- or protein-level annotations in the plots and
tables of the Shiny App, such as viewing functional descriptions by moving the
cursor over network nodes, the corresponding annotation column needs to be
present in the rowData
slot and its column name assigned to the ann
argument. In the exported tabular file, the extra annotation column is appended
to the expression matrix.
As most Shiny Apps, spatialHeatmap
can be deployed as a centralized web
service. A major advantage of a web server deployment is that the
functionalities can be accessed remotely by anyone on the internet without the
need to use R on the user system. For deployment one can use custom web
servers or cloud services, such as AWS, GCP or
shinysapps.io. An example web instance for testing
spatialHeatmap
online is available
here.
The spatialHeatmap
package also allows users to create customized Shiny App
instances using the custom_shiny
function. This function provides options to include
custom example data and aSVGs, and define default values within most visualization
panels (e.g. color schemes, image dimensions). For details users want
to consult the help file of the custom_shiny
function. To maximize
flexibilty, the default parameters are stored in a yaml file on the Shiny App.
This makes it easy to refine and optimize default parameters simply by changing
this yaml file.
To maintain scalability, the customized Shiny app is designed to work with HDF5-based database (Fischer, Smith, and Pau 2020; Pagès 2020), which enables users to incorporate data and aSVGs in a batch. The database is constructed with the function write_hdf5
. Basically, the accepted data formats are data.frame
or SummarizedExperiment
. Each data set is saved in an independent HDF5 database. Meanwhile, a pairing table describing matchings between data and aSVGs is required. All individual databases and the pairing table are then compressed in the file “data_shm.tar”. Accordingly, all aSVG files should be compressed in another tar file such as “aSVGs.tar”. Finally, these two tar files are included in the Shiny app by feeding their paths in a list to custom_shiny
.
Except for user-provided data, the database is able to store data sets downloaded from GEO and Expression Atlas. The detailed examples of the database utility are documented in the help file of write_hdf5
.
The numceric data used to color the features in aSVG images can be provided as
three different object types including vector
, data.frame
and
SummerizedExperiment
. When working with complex omics-based assay data then
the latter provides the most flexibility, and thus should be the preferred
container class for managing numeric data in spatialHeatmap
. Both
data.frame
and SummarizedExperiment
can hold data from many measured items,
such as many genes or proteins. In contrast to this, the
vector
class is only suitable for data from single items. Due to its
simplicity this less complex container is often useful for testing or when
dealing with simple data sets.
In data assayed only at spatial dimension, there are two factors samples and conditions, while data assayed at spatial and temporal dimension contains an additional factor time points or development stages. The spatialHeatmap
is able to work with both data types. In this section, the application of SHMs on spatial data is explained first in terms of the three object types, which is more popular. Later, the spatiotemporal usage of SHMs is presented in a specific subsection.
This subsection refers to data assayed only at spatial dimension, where two factors samples and conditions are involved.
vector
When using numeric vectors as input to spatial_hm
, then their name slot needs
to be populated with strings matching the feature names in the corresponding aSVG.
To also specify conditions, their labels need to be appended to the feature names
with double underscores as separator, i.e. ’feature__condition’.
The following example replots the toy example for two spatial features (‘occipital lobe’ and ‘parietal lobe’) and two conditions (‘1’ and ‘2’).
vec <- sample(x=1:100, size=5) # Random numeric values
names(vec) <- c('occipital lobe__condition1', 'occipital lobe__condition2', 'parietal lobe__condition1', 'parietal lobe__condition2', 'notMapped') # Assign unique names to random values
vec
## occipital lobe__condition1 occipital lobe__condition2
## 13 7
## parietal lobe__condition1 parietal lobe__condition2
## 43 37
## notMapped
## 51
With this configuration the resulting plot contains two spatial heatmap plots
for the human brain, one for ‘condition 1’ and another one for ‘contition 2’.
To keep the build time and storage size of this package to a minimum, the
spatial_hm
function call in the code block below is not evaluated. Thus,
the corresponding SHM is not shown in this vignette.
spatial_hm(svg.path=svg.hum, data=vec, ID='toy', ncol=1, legend.r=1.2, sub.title.size=14, ft.trans='g4320')
data.frame
Compared to the above vector input, data.frames
are structured here like row-wise
appended vectors, where the name slot information in the vectors is stored in the
column names. Each row also contains a name that corresponds to the corresponding
item name, such as a gene ID. The naming of spatial features and conditions in the
column names follows the same conventions as the naming used for the name slots in
the above vector example.
The following illustrates this with an example where a numeric data.frame
with
random numbers is generated containing 20 rows and 5 columns. To avoid name clashes,
the values in the rows and columns should be unique.
df.test <- data.frame(matrix(sample(x=1:1000, size=100), nrow=20)) # Create numeric data.frame
colnames(df.test) <- names(vec) # Assign column names
rownames(df.test) <- paste0('gene', 1:20) # Assign row names
df.test[1:3, ]
## occipital lobe__condition1 occipital lobe__condition2
## gene1 142 934
## gene2 798 147
## gene3 5 386
## parietal lobe__condition1 parietal lobe__condition2 notMapped
## gene1 987 968 751
## gene2 646 444 32
## gene3 790 756 407
With the resulting data.frame
one can generate the same SHM as in the previous
example, that used a named vector as input to the spatial_hm
function. Additionally,
one can now select each row by its name (here gene ID) under the ID
argument.
spatial_hm(svg.path=svg.hum, data=df.test, ID=c('gene1'), ncol=1, legend.r=1.2, sub.title.size=14)
Additional information can be appended to the data.frame
column-wise, such as
annotation data including gene descriptions. This information can then be displayed
interactively in the network plots of the Shiny App by placing the curser over
network nodes.
df.test$ann <- paste0('ann', 1:20)
df.test[1:3, ]
## occipital lobe__condition1 occipital lobe__condition2
## gene1 142 934
## gene2 798 147
## gene3 5 386
## parietal lobe__condition1 parietal lobe__condition2 notMapped ann
## gene1 987 968 751 ann1
## gene2 646 444 32 ann2
## gene3 790 756 407 ann3
SummarizedExperiment
The SummarizedExperiment
class is a much more extensible and flexible container
for providing metadata for both rows and columns of numeric data stored in tabular
format.
To import experimental design information from tabular files, users can provide
a target file that will be stored in the colData
slot of the
SummarizedExperiment
(SE, Morgan et al. (2018)) object. In other words, the
target file provides the metadata for the columns of the numeric assay data. Usually,
the target file contains at least two columns: one for the features/samples and
one for the conditions. Replicates are indicated by identical entries in these
columns. The actual numeric matrix representing the assay data is stored in
the assay
slot, where the rows correspond to items, such as gene IDs.
Optionally, additional annotation information for the rows (e.g. gene
descriptions) can be stored in the rowData
slot.
For constructing a valid SummarizedExperiment
object, that can be used by
the spatial_hm
function, the target file should meet the following requirements.
It should be imported with read.table
or read.delim
into a data.frame
or the data.frame
can be constructed in R on the fly (as shown below).
It should contain at least two columns. One column represents the features/samples
and the other one the conditions. The rows in the target file
correspond to the columns of the numeric data stored in the assay
slot. If
the condition column is empty, then the same condition is assumed under the
corresponding features/samples entry.
The feature/sample names must have matching entries in corresponding aSVG to be considered in the final SHM. Note, the double underscore is a special string that is reserved for specific purposes in spatialHeatmap, and thus should be avoided for naming feature/samples and conditions.
The following example illustrates the design of a valid SummarizedExperiment
object for generating SHMs. In this example, the ‘occipital lobe’ tissue has 2
conditions and each condition has 2 replicates. Thus, there are 4 assays for
occipital lobe
, and the same design applies to the parietal lobe
tissue.
sample <- c(rep('occipital lobe', 4), rep('parietal lobe', 4))
condition <- rep(c('condition1', 'condition1', 'condition2', 'condition2'), 2)
target.test <- data.frame(sample=sample, condition=condition, row.names=paste0('assay', 1:8))
target.test
## sample condition
## assay1 occipital lobe condition1
## assay2 occipital lobe condition1
## assay3 occipital lobe condition2
## assay4 occipital lobe condition2
## assay5 parietal lobe condition1
## assay6 parietal lobe condition1
## assay7 parietal lobe condition2
## assay8 parietal lobe condition2
The assay
slot is populated with a 8 x 20
data.frame
containing random
numbers. Each column corresponds to an assay in the target file (here imported
into colData
), while each row corresponds to a gene.
df.se <- data.frame(matrix(sample(x=1:1000, size=160), nrow=20))
rownames(df.se) <- paste0('gene', 1:20)
colnames(df.se) <- row.names(target.test)
df.se[1:3, ]
## assay1 assay2 assay3 assay4 assay5 assay6 assay7 assay8
## gene1 723 427 313 167 92 131 617 80
## gene2 13 739 712 324 700 227 534 682
## gene3 221 223 809 906 961 572 772 721
Next, the final SummarizedExperiment
object is constructed by providing the
numeric and target data under the assays
and colData
arguments,
respectively.
se <- SummarizedExperiment(assays=df.se, colData=target.test)
se
## class: SummarizedExperiment
## dim: 20 8
## metadata(0):
## assays(1): ''
## rownames(20): gene1 gene2 ... gene19 gene20
## rowData names(0):
## colnames(8): assay1 assay2 ... assay7 assay8
## colData names(2): sample condition
If needed row-wise annotation information (e.g. for genes) can be included in
the SummarizedExperiment
object as well. This can be done during the
construction of the initial object, or as below by injecting the information
into an existing SummarizedExperiment
object.
rowData(se) <- df.test['ann']
In this simple example, possible normalization and filtering steps are skipped. Yet, the aggregation of replicates is performed as shown below.
se.aggr <- aggr_rep(data=se, sam.factor='sample', con.factor='condition', aggr='mean')
## Syntactically valid column names are made!
assay(se.aggr)[1:3, ]
## occipital.lobe__condition1 occipital.lobe__condition2
## gene1 575 240.0
## gene2 376 518.0
## gene3 222 857.5
## parietal.lobe__condition1 parietal.lobe__condition2
## gene1 111.5 348.5
## gene2 463.5 608.0
## gene3 766.5 746.5
With the fully configured SummarizedExperiment
object, a similar SHM is plotted as
in the previous examples.
spatial_hm(svg.path=svg.hum, data=se.aggr, ID=c('gene1'), ncol=1, legend.r=1.2, sub.title.size=14, ft.trans=c('g4320'))
The above explanations on the three object types are applicable to data at a single spatial dimension directly. If the data is measured at spatial and temporal dimension, there are three factors: samples, conditions, and time points such as development stages. The three object types are still applicable, but the formatting convention is slightly different.
Specifically, there are three options to format the spatiotemporal data: 1) Combine samples and conditions. In vector
names and data.frame
column names, the composite sample-condition factor and time factor should be concatenated by double underscore, while in SummarizedExperiment
the former and latter should be regarded as the “sample” and “condition” columns respectively; 2) Combine samples and times. In vector
names and data.frame
column names, the composite sample-time factor and condition factor should be concatenated by double underscore (see here), while in SummarizedExperiment
the former and latter should be regarded as the “sample” and “condition” columns respectively; or 3) Combine samples, conditions, and times. The composite sample-time-condition factor will be the full names in vector
and full column names in data.frame
without the double underscore (see here), while in SummarizedExperiment
they will be the “sample” column and the “condition” column will be ignored (see here).
Which option to choose depends on the specific data and aSVGs, and the chosen option is expected to achieve optimal visualization. Regardless of the options, the pivotal requirement that target spatial features in aSVG must have matching counterparts in data should always be fulfilled (see here).
A public aSVG repository, that can be used by spatialHeatmap
directly, is the one
maintained by the EBI Gene Expression
Group.
It contains annatomical aSVG images from different species. The same aSVG
images are also used by the web service of the Expression Atlas project. In addition, the
spatialHeatmap
has its own repository called spatialHeatmap aSVG
Repository,
that stores custom aSVG files developed for this project (e.g. Figure
8).
If users cannot find a suitable aSVG image in these two repositories, we have developed a step-by-step SVG tutorial for creating custom aSVG images. For example, the BAR eFP browser at University of Toronto contains many anatomical images. These images can be used as templates in the SVG tutorial to construct custom aSVGs.
Moreover, in the future we will add more aSVGs to our repository, and users are welcome to deposit their own aSVGs to this repository to share them with the community.
To create and edit existing feature identifiers in aSVGs, the update_feature
function
can be used. The demonstration below, creates an empty folder tmp.dir1
and copies
into it the homo_sapiens.brain.svg
image provided by the spatialHeatmap
package. Subsequently, selected feature annotations are updated in this file.
tmp.dir1 <- paste0(normalizePath(tempdir(check=TRUE), winslash="/", mustWork=FALSE), '/shm1')
if (!dir.exists(tmp.dir1)) dir.create(tmp.dir1)
svg.hum <- system.file("extdata/shinyApp/example", 'homo_sapiens.brain.svg', package="spatialHeatmap")
file.copy(from=svg.hum, to=tmp.dir1, overwrite=TRUE) # Copy "homo_sapiens.brain.svg" file into 'tmp.dir1'
Query the above aSVG with feature and species keywords, and return the resulting matches
in a data.frame
.
feature.df <- return_feature(feature=c('frontal cortex', 'prefrontal cortex'), species=c('homo sapiens', 'brain'), dir=tmp.dir1, remote=FALSE, keywords.any=FALSE)
feature.df
Subsequently, create a character vector of new feature identifiers corresponding to each of the returned features.
Sample code that creates new feature names and stores them in a character vector.
f.new <- c('prefrontalCortex', 'frontalCortex')
Similarly, if the stroke (outline thickness) or color need to update, make vectors respectively and make sure each entry corresponds to each returned feature.
s.new <- c('0.05', '0.1') # New strokes.
c.new <- c('red', 'green') # New colors.
Next, new features, strokes, and colors are added to the feature data.frame
as three columns with the names featureNew
, strokeNew
, and colorNew
respectively. The three column names are used by the update_feature
function to look up new entries.
feature.df.new <- cbind(featureNew=f.new, strokeNew=s.new, colorNew=c.new, feature.df)
feature.df.new
Finally, the features, strokes, and colors are updated in the aSVG file(s) located in the tmp.dir1
directory.
update_feature(df.new=feature.df.new, dir=tmp.dir1)
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GEOquery_2.60.0 ExpressionAtlas_1.20.0
## [3] xml2_1.3.2 limma_3.48.0
## [5] SummarizedExperiment_1.22.0 Biobase_2.52.0
## [7] GenomicRanges_1.44.0 GenomeInfoDb_1.28.0
## [9] IRanges_2.26.0 S4Vectors_0.30.0
## [11] BiocGenerics_0.38.0 MatrixGenerics_1.4.0
## [13] matrixStats_0.58.0 spatialHeatmap_1.2.0
## [15] knitr_1.33 BiocStyle_2.20.0
##
## loaded via a namespace (and not attached):
## [1] rols_2.20.0 backports_1.2.1 Hmisc_4.5-0
## [4] av_0.6.0 BiocFileCache_2.0.0 igraph_1.2.6
## [7] lazyeval_0.2.2 shinydashboard_0.7.1 splines_4.1.0
## [10] BiocParallel_1.26.0 ggplot2_3.3.3 digest_0.6.27
## [13] foreach_1.5.1 htmltools_0.5.1.1 magick_2.7.2
## [16] GO.db_3.13.0 fansi_0.4.2 magrittr_2.0.1
## [19] checkmate_2.0.0 memoise_2.0.0 cluster_2.1.2
## [22] doParallel_1.0.16 readr_1.4.0 fastcluster_1.1.25
## [25] Biostrings_2.60.0 annotate_1.70.0 prettyunits_1.1.1
## [28] jpeg_0.1-8.1 colorspace_2.0-1 blob_1.2.1
## [31] rappdirs_0.3.3 xfun_0.23 dplyr_1.0.6
## [34] crayon_1.4.1 RCurl_1.98-1.3 jsonlite_1.7.2
## [37] genefilter_1.74.0 impute_1.66.0 survival_3.2-11
## [40] iterators_1.0.13 glue_1.4.2 gtable_0.3.0
## [43] zlibbioc_1.38.0 XVector_0.32.0 DelayedArray_0.18.0
## [46] Rhdf5lib_1.14.0 HDF5Array_1.20.0 scales_1.1.1
## [49] DBI_1.1.1 edgeR_3.34.0 Rcpp_1.0.6
## [52] progress_1.2.2 viridisLite_0.4.0 xtable_1.8-4
## [55] htmlTable_2.2.1 gridGraphics_0.5-1 flashClust_1.01-2
## [58] foreign_0.8-81 bit_4.0.4 preprocessCore_1.54.0
## [61] Formula_1.2-4 rsvg_2.1.2 htmlwidgets_1.5.3
## [64] httr_1.4.2 gplots_3.1.1 RColorBrewer_1.1-2
## [67] ellipsis_0.3.2 farver_2.1.0 pkgconfig_2.0.3
## [70] XML_3.99-0.6 nnet_7.3-16 sass_0.4.0
## [73] dbplyr_2.1.1 locfit_1.5-9.4 utf8_1.2.1
## [76] dynamicTreeCut_1.63-1 labeling_0.4.2 later_1.2.0
## [79] ggplotify_0.0.7 tidyselect_1.1.1 rlang_0.4.11
## [82] AnnotationDbi_1.54.0 visNetwork_2.0.9 munsell_0.5.0
## [85] tools_4.1.0 cachem_1.0.5 generics_0.1.0
## [88] RSQLite_2.2.7 evaluate_0.14 stringr_1.4.0
## [91] fastmap_1.1.0 ggdendro_0.1.22 yaml_2.2.1
## [94] bit64_4.0.5 caTools_1.18.2 purrr_0.3.4
## [97] KEGGREST_1.32.0 mime_0.10 compiler_4.1.0
## [100] rstudioapi_0.13 plotly_4.9.3 filelock_1.0.2
## [103] curl_4.3.1 png_0.1-7 tibble_3.1.2
## [106] geneplotter_1.70.0 bslib_0.2.5.1 stringi_1.6.2
## [109] highr_0.9 lattice_0.20-44 Matrix_1.3-3
## [112] vctrs_0.3.8 pillar_1.6.1 lifecycle_1.0.0
## [115] rhdf5filters_1.4.0 BiocManager_1.30.15 jquerylib_0.1.4
## [118] data.table_1.14.0 bitops_1.0-7 grImport_0.9-3
## [121] httpuv_1.6.1 R6_2.5.0 latticeExtra_0.6-29
## [124] promises_1.2.0.1 bookdown_0.22 KernSmooth_2.23-20
## [127] gridExtra_2.3 codetools_0.2-18 MASS_7.3-54
## [130] gtools_3.8.2 assertthat_0.2.1 rhdf5_2.36.0
## [133] DESeq2_1.32.0 withr_2.4.2 GenomeInfoDbData_1.2.6
## [136] hms_1.1.0 grid_4.1.0 rpart_4.1-15
## [139] tidyr_1.1.3 rmarkdown_2.8 rvcheck_0.1.8
## [142] shiny_1.6.0 WGCNA_1.70-3 base64enc_0.1-3
This project has been funded by NSF awards: PGRP-1546879, PGRP-1810468, PGRP-1936492.
Cardoso-Moreira, Margarida, Jean Halbert, Delphine Valloton, Britta Velten, Chunyan Chen, Yi Shao, Angélica Liechti, et al. 2019. “Gene Expression Across Mammalian Organ Development.” Nature 571 (7766): 505–9.
Chang, Winston, and Barbara Borges Ribeiro. 2018. Shinydashboard: Create Dashboards with ’Shiny’. https://CRAN.R-project.org/package=shinydashboard.
Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. n.d. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.
Davis, Sean, and Paul Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (Geo) and Bioconductor.” Bioinformatics 14: 1846–7.
Fischer, Bernd, Mike Smith, and Gregoire Pau. 2020. Rhdf5: R Interface to Hdf5. https://github.com/grimbough/rhdf5.
Gatto, Laurent. 2019. Rols: An R Interface to the Ontology Lookup Service. http://lgatto.github.com/rols/.
Gautier, Laurent, Leslie Cope, Benjamin M. Bolstad, and Rafael A. Irizarry. 2004. “Affy—Analysis of Affymetrix Genechip Data at the Probe Level.” Bioinformatics 20 (3): 307–15. https://doi.org/10.1093/bioinformatics/btg405.
Huber, W., V. J. Carey, R. Gentleman, S. An ders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis Wit H Bioconductor.” Nature Methods 12 (2): 115–21. http://www.nature.com/nmeth/journal/v12/n2/full/nmeth.3252.html.
Keays, Maria. 2019. ExpressionAtlas: Download Datasets from Embl-Ebi Expression Atlas.
Langfelder, Peter, and Steve Horvath. 2008. “WGCNA: An R Package for Weighted Correlation Network Analysis.” BMC Bioinformatics 9 (December): 559.
Langfelder, Peter, Bin Zhang, and with contributions from Steve Horvath. 2016. DynamicTreeCut: Methods for Detection of Clusters in Hierarchical Clustering Dendrograms. https://CRAN.R-project.org/package=dynamicTreeCut.
Langfelder, P., and S. Horvath. 2012. “Fast R Functions for Robust Correlations and Hierarchical Clustering.” J. Stat. Softw. 46 (11). https://www.ncbi.nlm.nih.gov/pubmed/23050260.
Lekschas, Fritz, Harald Stachelscheid, Stefanie Seltmann, and Andreas Kurtz. 2015. “Semantic Body Browser: Graphical Exploration of an Organism and Spatially Resolved Expression Data Visualization.” Bioinformatics 31 (5): 794–96.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.
Maag, Jesper L V. 2018. “Gganatogram: An R Package for Modular Visualisation of Anatograms and Tissues Based on Ggplot2.” F1000Res. 7 (September): 1576.
Merkin, Jason, Caitlin Russell, Ping Chen, and Christopher B Burge. 2012. “Evolutionary Dynamics of Gene and Isoform Regulation in Mammalian Tissues.” Science 338 (6114): 1593–9.
Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2018. SummarizedExperiment: SummarizedExperiment Container.
Mungall, Christopher J, Carlo Torniai, Georgios V Gkoutos, Suzanna E Lewis, and Melissa A Haendel. 2012. “Uberon, an integrative multi-species anatomy ontology.” Genome Biol. 13 (1): R5. https://doi.org/10.1186/gb-2012-13-1-r5.
Muschelli, John, Elizabeth Sweeney, and Ciprian Crainiceanu. 2014. “BrainR: Interactive 3 and 4D Images of High Resolution Neuroimage Data.” R J. 6 (1): 41–48.
Mustroph, Angelika, M Eugenia Zanetti, Charles J H Jang, Hans E Holtan, Peter P Repetti, David W Galbraith, Thomas Girke, and Julia Bailey-Serres. 2009. “Profiling Translatomes of Discrete Cell Populations Resolves Altered Cellular Priorities During Hypoxia in Arabidopsis.” Proc Natl Acad Sci U S A 106 (44): 18843–8.
Narsai, Reena, David Secco, Matthew D Schultz, Joseph R Ecker, Ryan Lister, and James Whelan. 2017. “Dynamic and Rapid Changes in the Transcriptome and Epigenome During Germination and in Developing Rice ( Oryza Sativa ) Coleoptiles Under Anoxia and Re-Oxygenation.” Plant J. 89 (4): 805–24.
Pagès, Hervé. 2020. HDF5Array: HDF5 Backend for Delayedarray Objects.
Papatheodorou, Irene, Nuno A Fonseca, Maria Keays, Y Amy Tang, Elisabet Barrera, Wojciech Bazant, Melissa Burke, et al. 2018. “Expression Atlas: gene and protein expression across multiple studies and organisms.” Nucleic Acids Res. 46 (D1): D246–D251. https://doi.org/10.1093/nar/gkx1158.
Prudencio, Mercedes, Veronique V Belzil, Ranjan Batra, Christian A Ross, Tania F Gendron, Luc J Pregent, Melissa E Murray, et al. 2015. “Distinct Brain Transcriptome Profiles in C9orf72-Associated and Sporadic ALS.” Nat. Neurosci. 18 (8): 1175–82.
Ravasz, E, A L Somera, D A Mongru, Z N Oltvai, and A L Barabási. 2002. “Hierarchical Organization of Modularity in Metabolic Networks.” Science 297 (5586): 1551–5.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Waese, Jamie, Jim Fan, Asher Pasha, Hans Yu, Geoffrey Fucile, Ruian Shi, Matthew Cumming, et al. 2017. “EPlant: Visualizing and Exploring Multiple Levels of Data for Hypothesis Generation in Plant Biology.” Plant Cell 29 (8): 1806–21.
Winter, Debbie, Ben Vinegar, Hardeep Nahal, Ron Ammar, Greg V Wilson, and Nicholas J Provart. 2007. “An ‘Electronic Fluorescent Pictograph’ Browser for Exploring and Analyzing Large-Scale Biological Data Sets.” PLoS One 2 (8): e718.