1 Overview of this vignette

This vignette is organized hierarchically in terms of level of detail:

  • In the Quickstart section, we show a basic analysis with chromswitch using a small dataset included in the package.

  • In the next section, we give a brief overview of the method for detecting chromatin state switches, referencing specific functions in chromswitch which implement the different steps of the method.

  • In the Walkthrough, we demonstrate a basic analysis chromswitch with a discussion of data import, input, the most important parameters available to the user, and interpretation of chromswitch output. In this section we use the wrapper functions which provide one-line commands to call chromatin state switches based on a single mark or type of input data.

  • In last section, Step-by-step, we demonstrate how the steps of the method can be run individually to gain finer control over the analysis and show the intermediate results from chromswitch available to the user.

2 Quickstart

Load chromswitch:

library(chromswitch)

We will use the package rtracklayer to import data from BED files:

library(rtracklayer)

We’ll start with a toy dataset containing MACS2 narrow peak calls for H3K4me3 ChIP-seq in 3 brain tissues and 3 other adult tissues from the Roadmap Epigenomics Project, restricted to a short region on chromosome 19. In the code below, we import the input to chromswitch and run chromswitch on our dataset. This involves constructing a feature matrix, and we will describe two ways of doing so. We can then call chromatin state switches by thresholding on the value of the Consensus score, which scores the similarity between the cluster assignments and the biological condition labels.

Chromswitch essentially requires 3 inputs:

  1. Genomic query region(s): a GRanges object storing one or more regions of interest
  2. Reference metadata dataframe: a dataframe with at least two columns: Sample which stores sample IDs (these can be any strings), and Condition, which stores the biological condition labels of the samples (the conditions must be strings with only two different values in the column, e.g. ‘Fetal’/‘Adult’ or ‘WT’/‘Mut’). Additional columns are not used.
  3. Reference peaks or epigenomic features: a list of GRanges objects, each of which stores peaks or features for one sample, with elements named according to the sample IDs as specified in the metadata

The latter two inputs define the dataset on which the query is performed.

Each of these inputs can be imported from TSV or BED files. Learn more about GRanges objects by checking out GenomicRanges.

Here we use the import() function from rtracklayer to import query regions stored in a BED file.

# Path to BED file in chromswitch installation
query_path <- system.file("extdata/query.bed", package = "chromswitch")

# Read in BED file, creating a GRanges object
query <- rtracklayer::import(con = query_path, format = "BED")
query
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr19 54924105-54929104      *
##   [2]    chr19 54874319-54877536      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Path to TSV in chromswitch
meta_path <- system.file("extdata/metadata.tsv", package = "chromswitch")

# Read in the table from a 2-column TSV file
metadata <- read.delim(meta_path, sep = "\t", header = TRUE)
metadata

Here we assume that peaks are in the ENCODE narrowPeak format.

Note: If the metadata file has an additional column containing the path for each sample, then that column can be passed to this function, e.g. readNarrowPeak(paths = metadata$path, metadata = metadata).

# Paths to the BED files containing peak calls for each sample
peak_paths <- system.file("extdata", paste0(metadata$Sample, ".H3K4me3.bed"),
                     package = "chromswitch")

# Import BED files containing MACS2 narrow peak calls using rtracklayer
peaks <- readNarrowPeak(paths = peak_paths,  # Paths to files,
                        metadata = metadata) # Metadata dataframe

2.1 Using the summary strategy

Run chromswitch using the summary strategy:

callSummary(query = query,       # Input 1: Query regions
            metadata = metadata, # Input 2: Metadata dataframe
            peaks = peaks,       # Input 3: Peaks
            mark = "H3K4me3")    # Arbitrary string describing the data type

Chromswitch outputs a measure of cluster quality (Average_Silhouette), the score predicting a chromatin state switch (Consensus), and the cluster assignment for each sample.

Looking at the Consensus score in each case, which represents the similarity between the cluster assignments and the biological groups of the samples, here we see a good agreement in the first region, indicating a switch, and poor agreement in the second, indicating the absence of a switch. This score takes on values between -1 and 1, where 1 represents a perfect agreement between cluster assignments and the biological conditions of the sample

2.2 Using the binary strategy

Run chromswitch using the binary strategy:

callBinary(query = query,       # Input 1: Query regions
           metadata = metadata, # Input 2: Metadata dataframe
           peaks = peaks)       # Input 3: Peaks

The output has the same format for both strategies. In this case, both strategies predict a switch for the first region, and a non-switch for the second region.

Both of these wrapper functions have additional parameters which allow for greater sensitivity and finer control in the analysis. These are explored in the rest of the vignette.

3 Overview of the method

Our method for detecting chromatin state switches involves three steps. These are illustrated in the figure below, which is a schematic for the analysis performed on one query region, based on the reference metadata and peaks or features.

As input, chromswitch requires epigenetic features represented by their genomic coordinates and optionally, some associated statistics. Possible examples include ChIP-seq or DNase-seq peaks, or previously-learned chromatin state segmentations such as from ChromHMM. Here we’ll refer to peaks for simplicity, but the analysis is the same for other types of epigenetic features given as intervals.

In the pre-processing phase, the user can set thresholds on any statistics associated with peaks and filter out peaks below these thresholds (filterPeaks()). These statistics can then be normalized genome-wide for each sample (normalizePeaks()), which is strongly recommended. More detailed discussion of the normalization process can be found in the documentation of that function. Both these steps are optional. We then retrieve all the peaks in each sample which overlap the query region (retrievePeaks()).

Next, chromswitch transforms the peaks in the query region into a sample-by-feature matrix using one of two strategies. In the summary strategy (summarizePeaks()), we compute a set of summary statistics from the peaks in each sample in the query region. These can include the mean, median, and max of the statistics associated with the peaks in the input, as well as the fraction of the region overlapped by peak and the number of peaks. Genome-wide normalization of the data is therefore extremely important if choosing this strategy.

In the binary strategy (binarizePeaks()) we construct a binary matrix where each feature corresponds to a unique peak in the region, and the matrix holds the binary presence or absence calls of each unique peak in each sample. We obtain the unique peaks by collapsing the union of all peaks in the region observed in all samples using a parameter p which specifies how much reciprocal overlap is required between two peaks to call them the same. Since regions corresponding to the same biological event can occasionally result in separate peaks during the process of interpretation of raw signal, peak calling, etc., we also introduce an option to combine peaks which are separated by less than gap base pairs (reducePeaks()).

Finally, chromswitch clusters samples hierarchically and then scores the similarity between the inferred cluster assignments and the known biological condition labels of the samples (cluster()).