This vignette is organized hierarchically in terms of level of detail:
In the Quickstart section, we show a basic analysis with chromswitch using a small dataset included in the package.
In the next section, we give a brief overview of the method for detecting chromatin state switches, referencing specific functions in chromswitch which implement the different steps of the method.
In the Walkthrough, we demonstrate a basic analysis chromswitch with a discussion of data import, input, the most important parameters available to the user, and interpretation of chromswitch output. In this section we use the wrapper functions which provide one-line commands to call chromatin state switches based on a single mark or type of input data.
In last section, Step-by-step, we demonstrate how the steps of the method can be run individually to gain finer control over the analysis and show the intermediate results from chromswitch available to the user.
Load chromswitch:
library(chromswitch)
We will use the package rtracklayer to import data from BED files:
library(rtracklayer)
We’ll start with a toy dataset containing MACS2 narrow peak calls for H3K4me3 ChIP-seq in 3 brain tissues and 3 other adult tissues from the Roadmap Epigenomics Project, restricted to a short region on chromosome 19. In the code below, we import the input to chromswitch and run chromswitch on our dataset. This involves constructing a feature matrix, and we will describe two ways of doing so. We can then call chromatin state switches by thresholding on the value of the Consensus score, which scores the similarity between the cluster assignments and the biological condition labels.
Chromswitch essentially requires 3 inputs:
GRanges
object storing one or more regions of
interestGRanges
objects, each of which stores
peaks or features for one sample, with elements named according to the sample IDs as
specified in the metadataThe latter two inputs define the dataset on which the query is performed.
Each of these inputs can be imported from TSV or BED files. Learn more about GRanges
objects by checking out GenomicRanges.
Here we use the import()
function from rtracklayer to import
query regions stored in a BED file.
# Path to BED file in chromswitch installation
query_path <- system.file("extdata/query.bed", package = "chromswitch")
# Read in BED file, creating a GRanges object
query <- rtracklayer::import(con = query_path, format = "BED")
query
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr19 54924105-54929104 *
## [2] chr19 54874319-54877536 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Path to TSV in chromswitch
meta_path <- system.file("extdata/metadata.tsv", package = "chromswitch")
# Read in the table from a 2-column TSV file
metadata <- read.delim(meta_path, sep = "\t", header = TRUE)
metadata
Here we assume that peaks are in the ENCODE narrowPeak format.
Note: If the metadata file has an additional column containing the path for each sample, then that column can be passed to this function,
e.g. readNarrowPeak(paths = metadata$path, metadata = metadata)
.
# Paths to the BED files containing peak calls for each sample
peak_paths <- system.file("extdata", paste0(metadata$Sample, ".H3K4me3.bed"),
package = "chromswitch")
# Import BED files containing MACS2 narrow peak calls using rtracklayer
peaks <- readNarrowPeak(paths = peak_paths, # Paths to files,
metadata = metadata) # Metadata dataframe
Run chromswitch using the summary strategy:
callSummary(query = query, # Input 1: Query regions
metadata = metadata, # Input 2: Metadata dataframe
peaks = peaks, # Input 3: Peaks
mark = "H3K4me3") # Arbitrary string describing the data type
Chromswitch outputs a measure of cluster quality (Average_Silhouette), the score predicting a chromatin state switch (Consensus), and the cluster assignment for each sample.
Looking at the Consensus score in each case, which represents the similarity between the cluster assignments and the biological groups of the samples, here we see a good agreement in the first region, indicating a switch, and poor agreement in the second, indicating the absence of a switch. This score takes on values between -1 and 1, where 1 represents a perfect agreement between cluster assignments and the biological conditions of the sample
Run chromswitch using the binary strategy:
callBinary(query = query, # Input 1: Query regions
metadata = metadata, # Input 2: Metadata dataframe
peaks = peaks) # Input 3: Peaks
The output has the same format for both strategies. In this case, both strategies predict a switch for the first region, and a non-switch for the second region.
Both of these wrapper functions have additional parameters which allow for greater sensitivity and finer control in the analysis. These are explored in the rest of the vignette.
Our method for detecting chromatin state switches involves three steps. These are illustrated in the figure below, which is a schematic for the analysis performed on one query region, based on the reference metadata and peaks or features.
As input, chromswitch requires epigenetic features represented by their genomic coordinates and optionally, some associated statistics. Possible examples include ChIP-seq or DNase-seq peaks, or previously-learned chromatin state segmentations such as from ChromHMM. Here we’ll refer to peaks for simplicity, but the analysis is the same for other types of epigenetic features given as intervals.
In the pre-processing phase, the user can set thresholds on any statistics
associated with peaks and filter out peaks below these thresholds
(filterPeaks()
). These statistics can then be normalized genome-wide for each
sample (normalizePeaks()
), which is strongly recommended. More detailed discussion of the normalization process
can be found in the documentation of that function.
Both these steps are optional. We then retrieve all the peaks in each sample
which overlap the query region (retrievePeaks()
).
Next, chromswitch transforms the peaks in the query region into a
sample-by-feature matrix using one of two strategies. In the summary strategy
(summarizePeaks()
), we compute a set of summary statistics from the peaks
in each sample in the query region. These can include the mean, median, and max
of the statistics associated with the peaks in the input, as well as the
fraction of the region overlapped by peak and the number of peaks. Genome-wide normalization of the data is therefore extremely
important if choosing this strategy.
In the binary strategy (binarizePeaks()
) we
construct a binary matrix where each feature corresponds to a unique peak in the
region, and the matrix holds the binary presence or absence calls of each
unique peak in each sample. We obtain the unique peaks by collapsing the union
of all peaks in the region observed in all samples using a parameter p
which specifies how much reciprocal overlap is required between two peaks to
call them the same. Since regions corresponding to the same biological event
can occasionally result in separate peaks during the process of interpretation
of raw signal, peak calling, etc., we also introduce an option to combine
peaks which are separated by less than gap
base pairs (reducePeaks()
).
Finally, chromswitch clusters samples hierarchically
and then scores the similarity between the inferred cluster assignments and the
known biological condition labels of the samples (cluster()
).