Table of Contents

Make Enriched Heatmaps

Author: Zuguang Gu ( z.gu@dkfz.de )

Date: 2015-10-14


Enriched heatmap is a special type of heatmap which visualizes the enrichment of genomic signals on specific target regions. It is broadly used to visualize e.g. how histone marks are enriched to transcription start sites.

There are several tools that can make such heatmap (e.g. ngs.plot or deepTools). Here we implement Enriched heatmap by ComplexHeatmap package. Since this type of heatmap is just a normal heatmap but with some special settings, with the functionality of ComplexHeatmap, it would be much easier to customize the heatmap as well as concatenating to a list of heatmaps to show correspondance between different data sources.

library(EnrichedHeatmap)

Load the example data that we will use for demostration:

set.seed(123)
load(paste0(system.file("extdata", "chr21_test_data.RData", package = "EnrichedHeatmap")))
ls()
## [1] "H3K4me3" "cgi"     "genes"   "meth"    "rpkm"

The example data are all GRanges objects:

In order to build the vignette fast, the data only include chromosome 21.

Single heatmap for histone marks

Heatmap for the enrichment of H3K4me3 histone mark on TSS:

tss = promoters(genes, upstream = 0, downstream = 1)
tss[1:5]
## GRanges object with 5 ranges and 0 metadata columns:
##                      seqnames               ranges strand
##                         <Rle>            <IRanges>  <Rle>
##    ENSG00000141956.9    chr21 [43299591, 43299591]      -
##   ENSG00000141959.12    chr21 [45719934, 45719934]      +
##    ENSG00000142149.4    chr21 [33245628, 33245628]      +
##   ENSG00000142156.10    chr21 [47401651, 47401651]      +
##    ENSG00000142166.8    chr21 [34696734, 34696734]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
H3K4me3[1:5]
## GRanges object with 5 ranges and 1 metadata column:
##       seqnames             ranges strand |  coverage
##          <Rle>          <IRanges>  <Rle> | <integer>
##   [1]    chr21 [9825467, 9825470]      * |        10
##   [2]    chr21 [9825470, 9825488]      * |        13
##   [3]    chr21 [9825488, 9825489]      * |        12
##   [4]    chr21 [9825489, 9825490]      * |        13
##   [5]    chr21 [9825490, 9825493]      * |        14
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
mat1 = normalizeToMatrix(H3K4me3, tss, value_column = "coverage", 
    extend = 5000, mean_mode = "w0", w = 50)
mat1
## Normalize H3K4me3 to tss:
##   Upstream 5000bp
##   Downstream 5000bp
##   Not show target regions
##   720 signal regions
EnrichedHeatmap(mat1, name = "H3K4me3")

plot of chunk h3k4me3

normalizeToMatrix() converts the association between genomic signals (H3K4me3) and targets(tss) in to a matrix. It first splits the extended targets regions ( the extension to upstream and downstream is controlled by extend argument) into a list of small windows (the width of the windows is controlled by w), then overlaps genomic signals to these small windows and calculates the value for every small window which is the mean value of genomic signals that intersects with the window (the value corresponds to genomic signals are controlled by value_column and how to calcualte the mean value is controlled by mean_mode).

There are several modes for mean_mode according to different types of genomic signals. It will be explained in latter sections.

EnrichedHeatmap() returns a EnrichedHeatmap class instance which is inherited from Heatmap class, so parameters and methods for Heatmap class can be directly applied to EnrichedHeatmap class.

There is a special column annotation function anno_enriched() which shows mean values of columns in the normalized matrix.

EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3",
    top_annotation = HeatmapAnnotation(lines = anno_enriched()), 
    top_annotation_height = unit(2, "cm"))

plot of chunk anno

Rows can be smoothed by setting smooth to TRUE when generating the matrix. But since data range will change after smoothing, you need to manually adjust the color mapping if you want to make figures comparable.

In following example, we use colorRamp2() from circlize package to define a color mapping function.

library(circlize)
mat1_smoothed = normalizeToMatrix(H3K4me3, tss, value_column = "coverage", 
    extend = 5000, mean_mode = "w0", w = 50, smooth = TRUE)
# note we set the color range same as the unsmoothed matrix
EnrichedHeatmap(mat1_smoothed, col = colorRamp2(range(mat1), c("white", "red")), 
    name = "H3K4me3")

plot of chunk smooth

Since EnrichedHeatmap class is inherited from Heatmap class, it is easy to customize the heatmap, e.g. split rows, make clustering on rows, add titles, …

Split rows by a vector or a data frame:

EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3", 
    split = sample(c("A", "B"), length(genes), replace = TRUE),
    column_title = "Enrichment of H3K4me3") 

plot of chunk split

Split rows by k-means clustering:

EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3", km = 3,
    column_title = "Enrichment of H3K4me3", row_title_rot = 0)

plot of chunk kmeans

Split rows and add column annotation as well:

EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3",
    # note we have three row-clusters, so we assign three colors for the annotation lines
    top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = 2:4))), 
    top_annotation_height = unit(2, "cm"),
    km = 3, row_title_rot = 0)

plot of chunk kmeans_anno

Cluster on rows:

EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3", 
    cluster_rows = TRUE, show_row_dend = FALSE, column_title = "Enrichment of H3K4me3")     

plot of chunk cluster

Some graphic settings specific for the EnrichedHeatmap object:

EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3", 
    pos_line_gp = gpar(col = "blue", lwd = 2), axis_name = c("-5kb", "TSS", "5kb"), 
    axis_name_rot = -45, border = FALSE)

plot of chunk aixs

Extension to upstream and downstream can be controled by extend either by a single value or a vector of length 2.

# upstream 1kb, downstream 2kb
mat12 = normalizeToMatrix(H3K4me3, tss, value_column = "coverage", 
    extend = c(1000, 2000), mean_mode = "w0", w = 50)
EnrichedHeatmap(mat12, name = "H3K4me3", col = c("white", "red"))

plot of chunk extend

Single heatmap for methylation

Following heatmap visualizes the enrichment of low methylated regions on TSS:

meth[1:5]
## GRanges object with 5 ranges and 1 metadata column:
##       seqnames             ranges strand |              meth
##          <Rle>          <IRanges>  <Rle> |         <numeric>
##   [1]    chr21 [9432427, 9432427]      * | 0.267104203931852
##   [2]    chr21 [9432428, 9432428]      * | 0.267106771955287
##   [3]    chr21 [9432964, 9432964]      * | 0.272709911227985
##   [4]    chr21 [9432965, 9432965]      * |   0.2727345344132
##   [5]    chr21 [9433315, 9433315]      * | 0.285114797969136
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
mat2 = normalizeToMatrix(meth, tss, value_column = "meth", mean_mode = "absolute",
    extend = 5000, w = 50, empty_value = 0.5)
EnrichedHeatmap(mat2, name = "methylation", column_title = "methylation near TSS")

plot of chunk meth

With window size set to 50bp, it can be possible that there is no CpG site in some of the windows. In this case, empty_value is used to fill such windows. Since high methylation and low methylation are the major methylation types in the genome, we set the empty value to 0.5.

Here we set mean_mode to absolute. Following illustrates different settings for mean_mode:

        4      5      2     values in signal
     ++++++   +++   +++++   signal
       ================     window (16bp)
        4      3     3      overlap

  value for this window:
     absolute: (4 + 5 + 2)/3
     weighted: (4*4 + 5*3 + 2*3)/(4 + 3 + 3)
     w0:       (4*4 + 5*3 + 2*3)/16

So, according to above rules, if the genomic signals are from single base or very small regions, setting mean_mode to absolute seems to be reasonable. For other case, mean_mode can be set to weighted or w0.

The values for those windows which contain no CpG sites can be fitted by locfit method:

mat2 = normalizeToMatrix(meth, tss, value_column = "meth", mean_mode = "absolute",
    extend = 5000, w = 50, empty_value = NA, smooth = TRUE)
EnrichedHeatmap(mat2, name = "methylation", column_title = "methylation near TSS")

plot of chunk meth2

The target of the enrichment can not only be a single point but also can be regions with width larger than 1. Following heatmap visualizes the enrichment of low methylation on CpG islands:

mat3 = normalizeToMatrix(meth, cgi, value_column = "meth", mean_mode = "absolute",
    extend = 5000, w = 50, empty_value = NA, smooth = TRUE)
EnrichedHeatmap(mat3, name = "methylation", column_title = "methylation near CGI")

plot of chunk cgi

Width of the target regions can be controlled by target_ratio which is relative to the width of the complete heatmap.

Target regions are also splitted into small windows, but since width of the target regions are different from each other, they are splitted by percent to their full width (the percent value is calculated automatically).

mat3 = normalizeToMatrix(meth, cgi, value_column = "meth", mean_mode = "absolute",
    extend = 5000, w = 50, empty_value = NA, smooth = TRUE, target_ratio = 0.3)
EnrichedHeatmap(mat3, name = "methylation", axis_name_rot = 90,
    column_title = "methylation near CGI")

plot of chunk fat_cgi

Multiple heatmaps

Thanks to the functionality of ComplexHeatmap package, heatmaps can be concatenated by + operator. EnrichedHeatmap objects and Heatmap objects can be mixed.

Following heatmaps visualizes correspondance between H3K4me3 mark, methylation and gene expression. It is quite straightforward to see high expression correlates with low methylation and high promoter activity around TSS.

EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3", width = 1) + 
EnrichedHeatmap(mat2, name = "methylation", width = 1) +
Heatmap(log2(rpkm+1), col = c("white", "orange"), name = "log2(rpkm+1)", 
    show_row_names = FALSE, width = unit(5, "mm"))

plot of chunk list

Of course you can split rows by splitting the main heatmap:

EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3", km = 3, width = 1,
    top_annotation = HeatmapAnnotation(lines = anno_enriched()), 
    top_annotation_height = unit(2, "cm"), row_title_rot = 0,
    column_title = "H3K4me3") + 
EnrichedHeatmap(mat2, name = "methylation", width = 1,
    column_title = "Methylation") +
Heatmap(log2(rpkm+1), col = c("white", "orange"), name = "log2(rpkm+1)", 
    show_row_names = FALSE, width = unit(5, "mm"))

plot of chunk list_split

Restrict overlapping by providing mapping

By default every genomic signal tries to intersect to every target region, but if mapping is provided, only those genomic signals that are mapped to the corresponding target region will be kept.

To illustrate it more clearly, we load the example data. gene column in neg_cr is used to map to the names of all_tss.

load(paste0(system.file("/extdata/neg_cr.RData", package = "EnrichedHeatmap")))
all_tss = promoters(all_genes, upstream = 0, downstream = 1)
all_tss = all_tss[unique(neg_cr$gene)]
neg_cr
## GRanges object with 5712 ranges and 1 metadata column:
##          seqnames                 ranges strand   |               gene
##             <Rle>              <IRanges>  <Rle>   |        <character>
##      [1]     chr1     [ 901460,  902041]      *   |  ENSG00000187583.5
##      [2]     chr1     [1238870, 1239998]      *   | ENSG00000131584.14
##      [3]     chr1     [1241678, 1242375]      *   | ENSG00000131584.14
##      [4]     chr1     [1371300, 1371689]      *   | ENSG00000179403.10
##      [5]     chr1     [1849952, 1850382]      *   |  ENSG00000178821.8
##      ...      ...                    ...    ... ...                ...
##   [5708]     chrX [153989953, 153990135]      *   | ENSG00000130826.11
##   [5709]     chrX [154113991, 154114210]      *   |  ENSG00000197932.3
##   [5710]     chrX [154423204, 154424142]      *   |  ENSG00000155959.6
##   [5711]     chrX [154426343, 154426667]      *   |  ENSG00000155959.6
##   [5712]     chrX [154488727, 154489937]      *   |  ENSG00000155961.4
##   -------
##   seqinfo: 17 sequences from an unspecified genome; no seqlengths
all_tss
## GRanges object with 3601 ranges and 0 metadata columns:
##                      seqnames                 ranges strand
##                         <Rle>              <IRanges>  <Rle>
##    ENSG00000187583.5     chr1     [ 901877,  901877]      +
##   ENSG00000131584.14     chr1     [1244989, 1244989]      -
##   ENSG00000179403.10     chr1     [1370241, 1370241]      +
##    ENSG00000178821.8     chr1     [1850712, 1850712]      -
##   ENSG00000142609.13     chr1     [1935276, 1935276]      -
##                  ...      ...                    ...    ...
##   ENSG00000126890.13     chrX [153881853, 153881853]      -
##   ENSG00000130826.11     chrX [153991031, 153991031]      +
##    ENSG00000197932.3     chrX [154114635, 154114635]      +
##    ENSG00000155959.6     chrX [154425284, 154425284]      +
##    ENSG00000155961.4     chrX [154493874, 154493874]      -
##   -------
##   seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths

In this example, neg_cr contains regions that show negative correlation between methylation and expression for the genes. The negative correlated regions are detected as:

  1. extend genes to upstream 5kb and downtream 5kb;
  2. for every gene, use a sliding window to go through left to right and find correlated regions by looking at the correlation between methylation in the window and expression for the gene.

Since genes may be close to each other, it is possible that one corrlated region for gene A overlaps with gene B. This is not what we want and by specifying the mapping, we can correspond correlated regions to the correct genes.

mat4 = normalizeToMatrix(neg_cr, all_tss, mapping_column = "gene", w = 50, mean_mode = "w0")
EnrichedHeatmap(mat4, col = c("white", "green"), name = "neg_cr", cluster_rows = TRUE,
    top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = "green"))), 
    top_annotation_height = unit(2, "cm"))

plot of chunk neg_cr

Above heatmap shows negative correlated regions are enriched at some distance after the TSS. We guess it is because genes have alternative transcripts and negative correlated regions are actually enriched at the start sites of transcripts.

Next we add another heatmap showing how transcripts are distributed to gene TSS. Maybe here the heatmap is not a nice way for showing transcripts, but according to the annotation graphs on the both top, we can see there is a perfect fitting for the peaks of negative correlated regions and transcripts.

mat5 = normalizeToMatrix(tx, all_tss, mapping_column="gene", w = 50, mean_mode = "w0")
ht_list = EnrichedHeatmap(mat4, col = c("white", "green"), name = "neg_cr", cluster_rows = TRUE,
              top_annotation = HeatmapAnnotation(lines1 = anno_enriched(gp = gpar(col = "green"))), 
              top_annotation_height = unit(2, "cm")) +
          EnrichedHeatmap(mat5, col = c("white", "black"), name = "tx",
              top_annotation = HeatmapAnnotation(lines2 = anno_enriched(gp = gpar(col = "black"))), 
              top_annotation_height = unit(2, "cm"))
draw(ht_list, gap = unit(1, "cm"))

plot of chunk neg_cr_with_tx

Features coming from ComplexHeatmap package

Since EnrichedHeatmap is built upon the ComplexHeatmap package, features in ComplexHeatmap can be used directly for EnrichedHeatmap. As shown before, heatmaps can be split either by km or spilt arguments.

The order of rows can be retrieved by row_order().

ht_list = draw(ht_list)
row_order(ht_list)

If you are interested in a small cluster, under the interactive mode, you can use mouse to select this region by select() function, and it will give you the order of rows for the selected sub-region.

draw(ht_list)
pos = select()

Since EnrichedHeatmap and EnrichedHeamtapList class are inherited from Heamtap and HeamtapList class respectively, all advanced parameters in the latter two classes can be directly used in the former two classes.

E.g. to change graphic settings for the heatmap title:

EnrichedHeatmap(..., column_title_gp = ...)

To change graphic settings for legends:

EnrichedHeatmap(..., heatmap_legend_param = ...)
# or set is globally
ht_global_opt(...)
EnrichedHeatmap(...)
ht_global_opt(RESET = TRUE)

To set the width of the heatmaps if there are more than one heatmaps:

EnrichedHeatmap(..., width = unit(...)) + EnrichedHeatmap(...)

For more advanced settings, please directly go to the vignettes in the ComplexHeamtap package.

Together with above features, you can make very complex heatmaps. Following example is from a real-world dataset. Some information is hidden because the data is not published yet, but still, you can see it is really easy to correspond different sources of information.

Notice

Due to findOverlaps() in GenomicRanges package, normalizeToMatrix() may use a lot of memory if your GRanges object contains too many regions. s argument in normalizeToMatrix() can split the data into s parts but it will increase the running time a lot.

If you generate the plot for the whole genome, I suggest you first save the figure as pdf format and then convert to png by convert software, instead of directly saving as png format.

Session info

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils     datasets  methods  
## [10] base     
## 
## other attached packages:
##  [1] circlize_0.3.1        EnrichedHeatmap_1.0.0 locfit_1.5-9.1        GenomicRanges_1.22.0 
##  [5] GenomeInfoDb_1.6.0    IRanges_2.4.0         S4Vectors_0.8.0       BiocGenerics_0.16.0  
##  [9] ComplexHeatmap_1.6.0  knitr_1.11            markdown_0.7.7       
## 
## loaded via a namespace (and not attached):
##  [1] whisker_0.3-2       XVector_0.10.0      magrittr_1.5        zlibbioc_1.16.0    
##  [5] lattice_0.20-33     colorspace_1.2-6    rjson_0.2.15        stringr_1.0.0      
##  [9] tools_3.2.2         matrixStats_0.14.2  RColorBrewer_1.1-2  formatR_1.2.1      
## [13] GlobalOptions_0.0.8 dendextend_1.1.0    shape_1.4.2         evaluate_0.8       
## [17] stringi_0.5-5       GetoptLong_0.1.0