Contents

1 Introduction

Example of epistack outputs

Example of epistack outputs

The epistack package main objective is the visualizations of stacks of genomic tracks (such as, but not restricted to, ChIP-seq or DNA methyation data) centered at genomic regions of interest. epistack needs three different inputs:

Each inputs are then combined in a single RangedSummarizedExperiment object use by epistack’s ploting functions.

After introducing epistack’s plotting capacity, this document will present two use cases:

2 Epistack visualisation

Epistack is a visualisation package. It uses a RangedSummarizedExperiment object as input, with matrices embeded as assays. We will discuss how to build such input obects in the next section. For now on, we will focus on the visualisation functions using the example dataset included in the package.

The dataset can be accessed with:

library(GenomicRanges)
library(SummarizedExperiment)
library(epistack)

data("stackepi")
stackepi
#> class: RangedSummarizedExperiment 
#> dim: 693 51 
#> metadata(0):
#> assays(1): DNAme
#> rownames(693): ENSSSCG00000016737 ENSSSCG00000036350 ...
#>   ENSSSCG00000024209 ENSSSCG00000048227
#> rowData names(3): gene_id exp score
#> colnames(51): window_1 window_2 ... window_50 window_51
#> colData names(0):

2.1 The plotEpisatck() function

This dataset can be visualised with the plotEpistack() function. The first parameter is the input RangedSummarizedExperiment object.

The second (optionnal) parameter, assays specifies which assay(s) should be displayed as heatmap(s). In the stackepi dataset, only one track is present, with an assay names DNAme. Note that it is possible to have several different tracks embeded in the same RangedSummarizedExperiment object, as demonstarted in the next sections.

An aditional metric_col is used, to display score associated with each anchor region, such as expression values or peak scores. Optionaly, the metric_col can be transformed before ploting using the metric_transfunc parameters.

plotEpistack(
  stackepi,
  assay = "DNAme", metric_col = "exp",
  ylim = c(0, 1), zlim = c(0, 1),
  x_labels = c("-2.5kb", "TSS", "+2.5kb"),
  titles = "DNA methylation", legends = "%mCpG",
  metric_title = "Expression", metric_label = "log10(TPM+1)",
  metric_transfunc = function(x) log10(x+1)
)

If a bin column is present, it is used to generate one average profile per bin.

stackepi <- addBins(stackepi, nbins = 5)

plotEpistack(
  stackepi,
  assay = "DNAme", metric_col = "exp",
  ylim = c(0, 1), zlim = c(0, 1),
  x_labels = c("-2.5kb", "TSS", "+2.5kb"),
  titles = "DNA methylation", legends = "%mCpG",
  metric_title = "Expression", metric_label = "log10(TPM+1)",
  metric_transfunc = function(x) log10(x+1)
)

Colours can be changed using dedicated parameters:

plotEpistack(
  stackepi,
  assay = "DNAme", metric_col = "exp",
  ylim = c(0, 1), zlim = c(0, 1),
  x_labels = c("-2.5kb", "TSS", "+2.5kb"),
  titles = "DNA methylation", legends = "%mCpG",
  metric_title = "Expression", metric_label = "log10(TPM+1)",
  metric_transfunc = function(x) log10(x+1),
  tints = "dodgerblue",
  bin_palette = rainbow
)

Text size, and other graphical parameters, can be changed using cex inside of plotEpistack(). Indeed, additional arguments will be passed internaly to par() (see ?par for more details).

plotEpistack(
  stackepi,
  assay = "DNAme", metric_col = "exp",
  ylim = c(0, 1), zlim = c(0, 1),
  x_labels = c("-2.5kb", "TSS", "+2.5kb"),
  titles = "DNA methylation", legends = "%mCpG",
  metric_title = "Expression", metric_label = "log10(TPM+1)",
  metric_transfunc = function(x) log10(x+1),
  cex = 0.4, cex.main = 0.6
)

2.2 Custom panel arrangements

Each panel can be plotted individually using dedicated functions. For example:

plotAverageProfile(
  stackepi,
  ylim = c(0, 1),
  assay = "DNAme",
  x_labels = c("-2.5kb", "TSS", "+2.5kb"),
)

And:

plotStackProfile(
  stackepi,
  assay = "DNAme",
  x_labels = c("-2.5kb", "TSS", "+2.5kb"),
  palette = viridisLite::viridis,
  zlim = c(0, 1)
)

It is therefore possible to arrange panels as you whish, using the multipanel framework of your choice (layout, grid, patchwork, etc.).

layout(matrix(1:3, ncol = 1), heights = c(1.5, 3, 0.5))
old_par <- par(mar = c(2.5, 4, 0.6, 0.6))

plotAverageProfile(
  stackepi, assay = "DNAme",
  x_labels = c("-2.5kb", "TSS", "+2.5kb"), ylim = c(0, 1),
)

plotStackProfile(
  stackepi, assay = "DNAme",
  x_labels = c("-2.5kb", "TSS", "+2.5kb"), zlim = c(0, 1),
  palette = viridisLite::viridis
)

plotStackProfileLegend(
  zlim = c(0, 1),
  palette = viridisLite::viridis
)

par(old_par)
layout(1)

3 Example 1: ChIP-seq coverage at peaks

3.1 data origin

In this part, we will use example ChIP-seq data from the 2016 CSAMA course “Basics of ChIP-seq data analysis” by Aleksandra Pekowska and Simon Anders. Data can be found in this github repository: https://github.com/Bioconductor/CSAMA2016/tree/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles

There is no need to download the data as it can be remotely parsed. The data consists of two H3K27ac ChIP-seq replicates, an input control, and one list of peak for each replicates. It has been generated in mouse Embryonic Stem cells and been subseted to have only data from chromosome 6 to allow fast vignette generation (but epistack can deal with whole genome ChIP-seq data!).

path_reads <- c(
    rep1 = "https://raw.githubusercontent.com/Bioconductor/CSAMA2016/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles/H3K27ac_rep1_filtered_ucsc_chr6.bed",
    rep2 = "https://raw.githubusercontent.com/Bioconductor/CSAMA2016/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles/H3K27ac_rep2_filtered_ucsc_chr6.bed",
    input = "https://raw.githubusercontent.com/Bioconductor/CSAMA2016/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles/ES_input_filtered_ucsc_chr6.bed"
)

path_peaks <- c(
    peak1 = "https://raw.githubusercontent.com/Bioconductor/CSAMA2016/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles/Rep1_peaks_ucsc_chr6.bed",
    peak2 = "https://raw.githubusercontent.com/Bioconductor/CSAMA2016/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles/Rep2_peaks_ucsc_chr6.bed"
)

3.2 data loading

We first read the peaks using rtracklayer:

library(rtracklayer)
peaks <- lapply(path_peaks, import)

Peaks from each replicates can be merged using GenomicRanges union() function (loaded with rtracklayer). We then rescue peaks metadata, and compute a new mean_score that we use to arrange our peak list.

merged_peaks <- GenomicRanges::union(peaks[[1]], peaks[[2]])

scores_rep1 <- double(length(merged_peaks))
scores_rep1[findOverlaps(peaks[[1]], merged_peaks, select = "first")] <- peaks[[1]]$score

scores_rep2 <- double(length(merged_peaks))
scores_rep2[findOverlaps(peaks[[2]], merged_peaks, select = "first")] <- peaks[[2]]$score

peak_type <- ifelse(
    scores_rep1 != 0 & scores_rep2 != 0, "Both", ifelse(
        scores_rep1 != 0, "Rep1 only", "Rep2 only"
    )
)

mcols(merged_peaks) <- DataFrame(scores_rep1, scores_rep2, peak_type)
merged_peaks$mean_scores <- apply((mcols(merged_peaks)[, c("scores_rep1", "scores_rep2")]), 1, mean)
merged_peaks <- merged_peaks[order(merged_peaks$mean_scores, decreasing = TRUE), ]
rm(scores_rep1, scores_rep2, peak_type)

merged_peaks
#> GRanges object with 3534 ranges and 4 metadata columns:
#>          seqnames              ranges strand | scores_rep1 scores_rep2
#>             <Rle>           <IRanges>  <Rle> |   <numeric>   <numeric>
#>      [1]     chr6   91638787-91646132      * |     3182.78     3203.04
#>      [2]     chr6   56746391-56748809      * |     3217.75     3100.00
#>      [3]     chr6   29296913-29298755      * |     3132.86     3100.00
#>      [4]     chr6     5445473-5448143      * |     3100.00     3100.00
#>      [5]     chr6   29996830-29999535      * |     3100.00     3100.00
#>      ...      ...                 ...    ... .         ...         ...
#>   [3530]     chr6   94556781-94557433      * |        0.00       50.10
#>   [3531]     chr6 119142482-119143018      * |       50.09        0.00
#>   [3532]     chr6 100424986-100425605      * |       50.08        0.00
#>   [3533]     chr6   66963731-66964304      * |        0.00       50.03
#>   [3534]     chr6   96545103-96545640      * |       50.01        0.00
#>            peak_type mean_scores
#>          <character>   <numeric>
#>      [1]        Both     3192.91
#>      [2]        Both     3158.88
#>      [3]        Both     3116.43
#>      [4]        Both     3100.00
#>      [5]        Both     3100.00
#>      ...         ...         ...
#>   [3530]   Rep2 only      25.050
#>   [3531]   Rep1 only      25.045
#>   [3532]   Rep1 only      25.040
#>   [3533]   Rep2 only      25.015
#>   [3534]   Rep1 only      25.005
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We then import the ChIP-seq reads. In the example datasets, they are provided as .bed files, but for ChIP-seq we recommand importing .bigwig coverage files. Bam files can also be imported using GenomicAlignments::readGAlignment*().

reads <- lapply(path_reads, import)

3.3 Generating coverage matrices

We generate coverage matrices with ChIP-seq coverage signal summarized around each ChIP-seq peaks. Several tools exists to generate such coverage matrices. We will demonstrate the normalizeToMatrix() method from EnrichedHeatmap. Other alternatives include RepitoolsfeatureScores(), or seqplotsgetPlotSetArray().

We will focus on the regions around peak centers, extended from +/-5kb with a window size of 250 bp. We keep track of the extent of our region of interest in a xlab variable, and specify unique column names for each matrix.

Note: when using ChIP-seq data from a bigwig file, use the value_column parameter of the normalizeToMatrix() function.

library(EnrichedHeatmap)

coverage_matrices <- lapply(
    reads,
    function(x) {
        normalizeToMatrix(
            x,
            resize(merged_peaks, width = 1, fix = "center"),
            extend = 5000, w = 250, 
            mean_mode = "coverage"
        )
    }
)

xlabs <- c("-5kb", "peak center", "+5kb")

We then add the peak coordinates and each matrix to a RangedSummarizedExperiment object.

merged_peaks <- SummarizedExperiment(
    rowData = merged_peaks,
    assays = coverage_matrices
)

3.4 Plotting

The plotEpistack() function will use the merged_peaks object to generate a complex representation of the ChIP-seq signals around the genomic feature of interests (here ChIP-seq peaks).

plotEpistack(
    merged_peaks,
    assays = c("rep1", "rep2", "input"),
    tints = c("dodgerblue", "firebrick1", "grey"), 
    titles = c("Rep1", "Rep2" , "Input"),
    x_labels = xlabs,
    zlim = c(0, 4), ylim = c(0, 4), 
    metric_col = "mean_scores", metric_title = "Peak score",
    metric_label = "log10(score)",
    metric_transfunc = function(x) log10(x)
)

If a bin column is present in the input GRanges object, it will be used to annotate the figure and to generate one average profile per bin in the lower panels. Here we use the peak_type as our bin column.


rowRanges(merged_peaks)$bin <- rowRanges(merged_peaks)$peak_type

plotEpistack(
    merged_peaks,
    assays = c("rep1", "rep2", "input"),
    tints = c("dodgerblue", "firebrick1", "grey"), 
    titles = c("Rep1", "Rep2" , "Input"),
    x_labels = xlabs,
    zlim = c(0, 4), ylim = c(0, 4), 
    metric_col = "mean_scores", metric_title = "Peak score", metric_label = "log10(score)",
    metric_transfunc = function(x) log10(x),
    bin_palette = colorRampPalette(c("darkorchid1", "dodgerblue", "firebrick1")),
    npix_height = 300
)

We can also sort on the bins first, and then on peak score. Epistack will respect the order in the GRanges object.

merged_peaks <- merged_peaks[order(
  rowRanges(merged_peaks)$bin, rowRanges(merged_peaks)$mean_scores,
  decreasing = c(FALSE, TRUE), method = "radix"
), ]

plotEpistack(
    merged_peaks,
    patterns = c("rep1", "rep2", "input"),
    tints = c("dodgerblue", "firebrick1", "grey"), 
    titles = c("Rep1", "Rep2" , "Input"),
    x_labels = xlabs,
    zlim = c(0, 4), ylim = c(0, 4), 
    metric_col = "mean_scores", metric_title = "Peak score", metric_label = "log10(score)",
    metric_transfunc = function(x) log10(x),
    bin_palette = colorRampPalette(c("darkorchid1", "dodgerblue", "firebrick1")),
    npix_height = 300
)

4 Example 2: DNA methylation levels and ChIP-seq coverage at gene promoters sorted according to expression levels

4.1 Obtaining the TSS coordinates

In this part, we will plot the epigenetic signals at gene promoters, or more precisely around gene Transcription Start Sites (TSS). TSS coordinates can be obtained from various sources. One can access the Ensembl annotations using biomaRt, download a .gtf file and parse it using rtracklayer’s import(), or use AnnotationHub and ensembldb. It is however important to work with the same genome version has the one used to align the ChIP-seq reads.

For simplicity, we will use EnrichedHeatmap example data.

load(
    system.file("extdata", "chr21_test_data.RData",
                package = "EnrichedHeatmap"),
    verbose = TRUE
)
#> Loading objects:
#>   H3K4me3
#>   cgi
#>   genes
#>   meth
#>   rpkm

tss <- promoters(genes, upstream = 0, downstream = 1)
tss$gene_id <- names(tss)

tss
#> GRanges object with 720 ranges and 1 metadata column:
#>                      seqnames    ranges strand |            gene_id
#>                         <Rle> <IRanges>  <Rle> |        <character>
#>    ENSG00000141956.9    chr21  43299591      - |  ENSG00000141956.9
#>   ENSG00000141959.12    chr21  45719934      + | ENSG00000141959.12
#>    ENSG00000142149.4    chr21  33245628      + |  ENSG00000142149.4
#>   ENSG00000142156.10    chr21  47401651      + | ENSG00000142156.10
#>    ENSG00000142166.8    chr21  34696734      + |  ENSG00000142166.8
#>                  ...      ...       ...    ... .                ...
#>    ENSG00000270533.1    chr21  10476061      - |  ENSG00000270533.1
#>    ENSG00000270652.1    chr21  38315567      + |  ENSG00000270652.1
#>    ENSG00000270835.1    chr21  39577700      - |  ENSG00000270835.1
#>    ENSG00000271308.1    chr21  11169720      + |  ENSG00000271308.1
#>    ENSG00000271486.1    chr21  20993009      + |  ENSG00000271486.1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

4.2 Adding expression data to the TSS coordinates

Epistack can work with any units and any scores, not limited to expression data. Here we will use gene expression data from an RNA-seq experiment, in RPKM units, as it is the data format available in EnrichedHeatmap example dataset. To join the expression data to the TSS coordinates, we will use an epistack utility function addMetricAndArrangeGRanges():

expr <- data.frame(
    gene_id = names(rpkm),
    expr = rpkm
)

epidata <- addMetricAndArrangeGRanges(
    tss, expr,
    gr_key = "gene_id",
    order_key = "gene_id", order_value = "expr"
)

epidata
#> GRanges object with 720 ranges and 2 metadata columns:
#>                      seqnames    ranges strand |            gene_id      expr
#>                         <Rle> <IRanges>  <Rle> |        <character> <numeric>
#>    ENSG00000198618.4    chr21  20230097      + |  ENSG00000198618.4  461.0990
#>    ENSG00000160307.5    chr21  48025121      - |  ENSG00000160307.5  145.4498
#>   ENSG00000142192.16    chr21  27543446      - | ENSG00000142192.16  137.3838
#>    ENSG00000183255.7    chr21  46293752      - |  ENSG00000183255.7  131.7230
#>   ENSG00000142168.10    chr21  33031935      + | ENSG00000142168.10   97.5167
#>                  ...      ...       ...    ... .                ...       ...
#>    ENSG00000186965.3    chr21  31859755      - |  ENSG00000186965.3         0
#>    ENSG00000229086.3    chr21  34542541      - |  ENSG00000229086.3         0
#>    ENSG00000198434.1    chr21  23007049      - |  ENSG00000198434.1         0
#>    ENSG00000233783.2    chr21  27306401      + |  ENSG00000233783.2         0
#>    ENSG00000233300.1    chr21  28740266      - |  ENSG00000233300.1         0
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We create 5 bins of genes of equal sizes depending on the expression levels, with epistack utility function addBins():

epidata <- addBins(epidata, nbins = 5)
epidata
#> GRanges object with 720 ranges and 3 metadata columns:
#>                      seqnames    ranges strand |            gene_id      expr
#>                         <Rle> <IRanges>  <Rle> |        <character> <numeric>
#>    ENSG00000198618.4    chr21  20230097      + |  ENSG00000198618.4  461.0990
#>    ENSG00000160307.5    chr21  48025121      - |  ENSG00000160307.5  145.4498
#>   ENSG00000142192.16    chr21  27543446      - | ENSG00000142192.16  137.3838
#>    ENSG00000183255.7    chr21  46293752      - |  ENSG00000183255.7  131.7230
#>   ENSG00000142168.10    chr21  33031935      + | ENSG00000142168.10   97.5167
#>                  ...      ...       ...    ... .                ...       ...
#>    ENSG00000186965.3    chr21  31859755      - |  ENSG00000186965.3         0
#>    ENSG00000229086.3    chr21  34542541      - |  ENSG00000229086.3         0
#>    ENSG00000198434.1    chr21  23007049      - |  ENSG00000198434.1         0
#>    ENSG00000233783.2    chr21  27306401      + |  ENSG00000233783.2         0
#>    ENSG00000233300.1    chr21  28740266      - |  ENSG00000233300.1         0
#>                            bin
#>                      <numeric>
#>    ENSG00000198618.4         1
#>    ENSG00000160307.5         1
#>   ENSG00000142192.16         1
#>    ENSG00000183255.7         1
#>   ENSG00000142168.10         1
#>                  ...       ...
#>    ENSG00000186965.3         5
#>    ENSG00000229086.3         5
#>    ENSG00000198434.1         5
#>    ENSG00000233783.2         5
#>    ENSG00000233300.1         5
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

4.3 Extracting the epigenetic signals

As previously described, we use EnrichedHeatmap’s normalizeToMatrix() function to extract the signals, rename the signal colmun names, and add them to the epidata GRanges object:

methstack <- normalizeToMatrix(
    meth, epidata, value_column = "meth",
    extend = 5000, w = 250, mean_mode = "absolute"
)

h3k4me3stack <- normalizeToMatrix(
    H3K4me3, epidata, value_column = "coverage",
    extend = 5000, w = 250, mean_mode = "coverage"
)

epidata <- SummarizedExperiment(
    rowData = epidata,
    assays = list(DNAme = methstack, H3K4me3 = h3k4me3stack)
)

4.4 Epistack plotting

The epidata GRanges object is now ready to plot:

plotEpistack(
    epidata,
    tints = c("tomato", "springgreen"),
    zlim = list(c(0, 1), c(0, 25)), ylim = list(c(0, 1), c(0, 50)),
    x_labels = c("-5kb", "TSS", "+5kb"),
    metric_col = "expr", metric_title = "Gene expression",
    metric_label = "log10(RPKM + 1)",
    metric_transfunc = function(x) log10(x + 1),
    npix_height = 300
)

5 sessionInfo()

sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] grid      stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] EnrichedHeatmap_1.28.0      ComplexHeatmap_2.14.0      
#>  [3] rtracklayer_1.58.0          epistack_1.4.0             
#>  [5] SummarizedExperiment_1.28.0 Biobase_2.58.0             
#>  [7] MatrixGenerics_1.10.0       matrixStats_0.62.0         
#>  [9] GenomicRanges_1.50.0        GenomeInfoDb_1.34.0        
#> [11] IRanges_2.32.0              S4Vectors_0.36.0           
#> [13] BiocGenerics_0.44.0         BiocStyle_2.26.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] locfit_1.5-9.6           Rcpp_1.0.9               lattice_0.20-45         
#>  [4] circlize_0.4.15          png_0.1-7                Rsamtools_2.14.0        
#>  [7] Biostrings_2.66.0        digest_0.6.30            foreach_1.5.2           
#> [10] R6_2.5.1                 evaluate_0.17            highr_0.9               
#> [13] GlobalOptions_0.1.2      zlibbioc_1.44.0          rlang_1.0.6             
#> [16] jquerylib_0.1.4          magick_2.7.3             GetoptLong_1.0.5        
#> [19] Matrix_1.5-1             rmarkdown_2.17           BiocParallel_1.32.0     
#> [22] stringr_1.4.1            RCurl_1.98-1.9           DelayedArray_0.24.0     
#> [25] compiler_4.2.1           xfun_0.34                shape_1.4.6             
#> [28] htmltools_0.5.3          GenomeInfoDbData_1.2.9   bookdown_0.29           
#> [31] codetools_0.2-18         XML_3.99-0.12            viridisLite_0.4.1       
#> [34] crayon_1.5.2             GenomicAlignments_1.34.0 bitops_1.0-7            
#> [37] jsonlite_1.8.3           magrittr_2.0.3           cli_3.4.1               
#> [40] stringi_1.7.8            cachem_1.0.6             XVector_0.38.0          
#> [43] doParallel_1.0.17        bslib_0.4.0              rjson_0.2.21            
#> [46] restfulr_0.0.15          RColorBrewer_1.1-3       iterators_1.0.14        
#> [49] tools_4.2.1              plotrix_3.8-2            parallel_4.2.1          
#> [52] fastmap_1.1.0            yaml_2.3.6               clue_0.3-62             
#> [55] colorspace_2.0-3         cluster_2.1.4            BiocManager_1.30.19     
#> [58] knitr_1.40               sass_0.4.2               BiocIO_1.8.0