An important question in comparative analysis of epigenomic data is how chromatin state differs between biological conditions. The package chromswitch implements a method for integrating epigenomic data to identify chromatin state switches in genomic regions of interest between samples in two biological conditions. Chromswitch is flexible in its input; possible data types include ChIP-seq peaks for histone modification, DNase-seq peaks, or previously-learned chromatin state segmentations. Chromswitch transforms epigenomic data in the query region into a sample-by-feature matrix using one of two strategies, clusters samples hierarchically, and then uses external cluster validity measures to predict a switch in chromatin state. Chromswitch is robust to small sample sizes and high class imbalance, free from data-intensive training steps, and suitable for analyzing marks with diverse binding profiles.
This vignette is organized hierarchically in terms of level of detail:
In the Quickstart section, we show a basic analysis with chromswitch using a small dataset included in the package.
In the next section, we give a brief overview of the method for detecting chromatin state switches, referencing specific functions in chromswitch which implement the different steps of the method.
In the Walkthrough, we demonstrate a basic analysis chromswitch with a discussion of data import, input, the most important parameters available to the user, and interpretation of chromswitch output. In this section we use the wrapper functions which provide one-line commands to call chromatin state switches based on a single mark or type of input data.
In last section, Step-by-step, we demonstrate how the steps of the method can be run individually to gain finer control over the analysis and show the intermediate results from chromswitch available to the user.
Load chromswitch:
library(chromswitch)
We will use the package rtracklayer to import data from BED files:
library(rtracklayer)
We’ll start with a toy dataset containing MACS2 narrow peak calls for H3K4me3 ChIP-seq in 3 brain tissues and 3 other adult tissues from the Roadmap Epigenomics Project, restricted to a short region on chromosome 19. In the code below, we import the input to chromswitch and run chromswitch on our dataset. This involves constructing a feature matrix, and we will describe two ways of doing so. We can then call chromatin state switches by thresholding on the value of the Consensus score, which scores the similarity between the cluster assignments and the biological condition labels.
Chromswitch essentially requires 3 inputs:
GRanges
object storing one or more regions of interestGRanges
objects, each of which stores peaks or features for one sample, with elements named according to the sample IDs as specified in the metadataThe latter two inputs define the dataset on which the query is performed.
Each of these inputs can be imported from TSV or BED files. Learn more about GRanges
objects by checking out GenomicRanges.
Here we use the import()
function from rtracklayer to import query regions stored in a BED file.
# Path to BED file in chromswitch installation
query_path <- system.file("extdata/query.bed", package = "chromswitch")
# Read in BED file, creating a GRanges object
query <- rtracklayer::import(con = query_path, format = "BED")
query
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr19 [54924105, 54929104] *
## [2] chr19 [54874319, 54877536] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Path to TSV in chromswitch
meta_path <- system.file("extdata/metadata.tsv", package = "chromswitch")
# Read in the table from a 2-column TSV file
metadata <- read.delim(meta_path, sep = "\t", header = TRUE)
metadata
Here we assume that peaks are in the ENCODE narrowPeak format.
Note: If the metadata file has an additional column containing the path for each sample, then that column can be passed to this function, e.g. readNarrowPeak(paths = metadata$path, metadata = metadata)
.
# Paths to the BED files containing peak calls for each sample
peak_paths <- system.file("extdata", paste0(metadata$Sample, ".H3K4me3.bed"),
package = "chromswitch")
# Import BED files containing MACS2 narrow peak calls using
# a helper function from chromswitch
peaks <- readNarrowPeak(paths = peak_paths, # Paths to files,
metadata = metadata) # Metadata dataframe
Run chromswitch using the summary strategy:
callSummary(query = query, # Input 1: Query regions
metadata = metadata, # Input 2: Metadata dataframe
peaks = peaks, # Input 3: Peaks
mark = "H3K4me3") # Arbitrary string describing the data type
Chromswitch outputs a measure of cluster quality (Average_Silhouette), the score predicting a chromatin state switch (Consensus), and the cluster assignment for each sample.
Looking at the Consensus score in each case, which represents the similarity between the cluster assignments and the biological groups of the samples, here we see a good agreement in the first region, indicating a switch, and poor agreement in the second, indicating the absence of a switch. This score takes on values between -1 and 1, where 1 represents a perfect agreement between cluster assignments and the biological conditions of the sample
Run chromswitch using the binary strategy:
callBinary(query = query, # Input 1: Query regions
metadata = metadata, # Input 2: Metadata dataframe
peaks = peaks) # Input 3: Peaks
The output has the same format for both strategies. In this case, both strategies predict a switch for the first region, and a non-switch for the second region.
Both of these wrapper functions have additional parameters which allow for greater sensitivity and finer control in the analysis. These are explored in the rest of the vignette.
Our method for detecting chromatin state switches involves three steps. These are illustrated in the figure below, which is a schematic for the analysis performed on one query region, based on the reference metadata and peaks or features.
As input, chromswitch requires epigenetic features represented by their genomic coordinates and optionally, some associated statistics. Possible examples include ChIP-seq or DNase-seq peaks, or previously-learned chromatin state segmentations such as from ChromHMM. Here we’ll refer to peaks for simplicity, but the analysis is the same for other types of epigenetic features given as intervals.
In the pre-processing phase, the user can set thresholds on any statistics associated with peaks and filter out peaks below these thresholds (filterPeaks()
). These statistics can then be normalized genome-wide for each sample (normalizePeaks()
), which is strongly recommended. More detailed discussion of the normalization process can be found in the documentation of that function. Both these steps are optional. We then retrieve all the peaks in each sample which overlap the query region (retrievePeaks()
).
Next, chromswitch transforms the peaks in the query region into a sample-by-feature matrix using one of two strategies. In the summary strategy (summarizePeaks()
), we compute a set of summary statistics from the peaks in each sample in the query region. These can include the mean, median, and max of the statistics associated with the peaks in the input, as well as the fraction of the region overlapped by peak and the number of peaks. Genome-wide normalization of the data is therefore extremely important if choosing this strategy.
In the binary strategy (binarizePeaks()
) we construct a binary matrix where each feature corresponds to a unique peak in the region, and the matrix holds the binary presence or absence calls of each unique peak in each sample. We obtain the unique peaks by collapsing the union of all peaks in the region observed in all samples using a parameter p
which specifies how much reciprocal overlap is required between two peaks to call them the same. Since regions corresponding to the same biological event can occasionally result in separate peaks during the process of interpretation of raw signal, peak calling, etc., we also introduce an option to combine peaks which are separated by less than gap
base pairs (reducePeaks()
).
Finally, chromswitch clusters samples hierarchically and then scores the similarity between the inferred cluster assignments and the known biological condition labels of the samples (cluster()
).
Schematic of the method outlined above.
The package ships with a small dataset that we will analyze to detect brain- specific chromatin state switches. The dataset contains MACS2 narrow peak calls for H3K4me3 in a short section of chromosome 19 for six samples, 3 adult brain tissues, and 3 other adult tissues. The peaks are available as BED files, stored in the extdata folder of chromswitch, as well as in the object H3K4me3
, which is a list of length 6. Each element of the list stores the peak calls as GRanges
objects. GRanges
objects are sets of genomic ranges, and here, one range describes one peak. The data packaged with chromswitch is described in the manual, and the documentation can be accessed by running ??chromswitch::H3K4me3
in the console.
Genome browser screenshot of H3K4me3 dataset included in the chromswitch package. Red boxes indicate the genes studied in the demo analysis in this section.
Chromswitch essentially requires 3 inputs:
The specification for the inputs and examples of how to import them are described below.
Chromswitch expects query regions in the form of a GRanges
object storing one or more regions of interest. GRanges objects are containers for genomic regions, implemented in GenomicRanges. An introduction to these objects can be obtained by running ??GenomicRanges::GRanges_and_GRangesList_slides
in the console.
We will apply chromswitch to 5kbp windows surrounding chromatin state switches in three genes on chromosome 19. Here, we will read in the query regions from a BED file using the import()
function from rtracklayer.
# Path to BED file in chromswitch installation
query_path <- system.file("extdata/query.bed", package = "chromswitch")
# Read in BED file, creating a GRanges object
query <- rtracklayer::import(con = query_path, format = "BED")
query
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr19 [54924105, 54929104] *
## [2] chr19 [54874319, 54877536] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If your query regions are stored in another tabular format, this table can be read in using read.delim()
and passed to the makeGRangesFromDataFrame()
function from GenomicRanges, which converts query regions stored in a dataframe with at least 3 columns, chr, start, and end, into a GRanges
object (remember the keep.extra.columns = TRUE
argument to preserve any additional data associated with the regions, such as gene symbols). Any metadata columns in the GRanges
object passed to the query
argument will be included in the chromswitch output (but not used for the analysis).
Chromswitch accepts metadata in the form of a dataframe with at least two columns: Sample which stores sample IDs (these can be any strings), and Condition, which stores the biological condition labels of the samples (these must be strings, with only two possible values in the column). Additional columns are not used.
In the code below, we read in the metadata from a TSV file. The resulting dataframe can be passed to the metadata
argument of any chromswitch functions that require it.
# Path to TSV in chromswitch
meta_path <- system.file("extdata/metadata.tsv", package = "chromswitch")
# Read in the table from a 2-column TSV file
metadata <- read.delim(meta_path, sep = "\t", header = TRUE)
metadata
Chromswitch expects epigenomic features in the form of a named list of GRanges
objects, each of which stores peaks or features for one sample, with elements named according to the sample IDs as specified in the metadata. The peaks should be in the same order (with respect to samples) as in the metadata.
The nature of these features are flexible: for example, they can include peak calls from a ChIP-seq or DNase-seq experiment, or assignments of a certain state obtained from a previously learned chromatin state segmentation such as from ChromHMM. These features may be attached to certain metrics quantifying enrichment, significance, or probabilities, and the format of the data will control how we import it. Here we focus on narrow peaks for H3K4me3 ChIP-seq, but demonstrate three ways of importing data into GRanges
objects, each suitable for different formats.
Option 1: To import BED files containing peak MACS2 narow peak calls, we can use a helper function implemented in chromswitch, which processes peaks which follow exactly the ENCODE narrowPeak format.
# Paths to the BED files containing peak calls for each sample
peak_paths <- system.file("extdata", paste0(metadata$Sample, ".H3K4me3.bed"),
package = "chromswitch")
# Import BED files containing MACS2 narrow peak calls using
# a helper function from chromswitch
peaks <- readNarrowPeak(paths = peak_paths, # Paths to files,
metadata = metadata)
# Ensure the list is named by sample
names(peaks) <- metadata$Sample
Option 2: To read in features in other formats where the first three columns are chr, start, and end, use rtracklayer and specify the identity and type of the other columns. For example, we can read in the narrow peak calls manually. The same process can be applied to BED files containing epigenomic features other than peaks (for example, chromatin state segmentations); the extraCols
argument to rtracklayer::import
should be modified to fit the data. More information about importing BED files can be obtained by running ??rtracklayer::BEDFile
in the console to access the rtracklayer
documentation.
extra_cols <- c("name" = "character",
"score" = "integer",
"strand" = "character",
"signalValue" = "numeric",
"pValue" = "numeric",
"qValue" = "numeric",
"peak" = "numeric")
# Obtain a list of GRanges objects containing peak calls
peaks <- lapply(peak_paths, rtracklayer::import, format = "bed",
extraCols = extra_cols)
# Ensure the list is named by sample
names(peaks) <- metadata$Sample
Option 3: Alternatively, if your epigenomic features are stored in another tabular format, read in the files using read.delim()
and convert them to GRanges
objects afterwards. We demonstrate on the same narrow peak calls:
# Read in all files into dataframes
df <- lapply(peak_paths, read.delim, sep = "\t",
header = FALSE,
col.names = c("chr", "start", "end", "name", "score",
"strand", "signalValue", "pValue",
"qValue", "peak"))
# Convert the dataframes into GRanges objects, retaining
# additional columns besides chr, start, end
peaks <- lapply(df, makeGRangesFromDataFrame, keep.extra.columns = TRUE)
# Ensure the list is named by sample
names(peaks) <- metadata$Sample
All three methods described above produce an identical peaks
object.
We will first run a basic analysis using the summary strategy for constructing feature matrices, using the default features under the summary matrix construction strategy: the number of peaks in the region, and the fraction of the region overlapped by peaks. Note that whne the number of peaks is large (\(n >> 1\)), the output, particularly the heatmap, may be difficult to interpret. All the computations described in the method are wrapped in one command; in later sections we’ll explore running each step of the method (preprocessing, feature matrix construction, clustering) individually.
Note that the column names passed to arguments in the wrappers (e.g. normalize_columns
, summarize_columns
, etc.) must match exactly the column names in the BED files.
out <- callSummary(query = query, # Input 1: Query regions
metadata = metadata, # Input 2: Metadata dataframe
peaks = peaks, # Input 3: Peaks
mark = "H3K4me3") # Arbitrary string describing the data type
out
We can threshold on the consensus score to subset the query regions to those containing putative chromatin state switches:
out[out$Consensus >= 0.75, ]
Now, we’ll construct the feature matrix by summarizing on the qValue
and signalValue
statistics. This means that for each sample, the features used to the cluster samples in the region will be the mean, median, and maximum of these two statistics across peaks. We will apply genome-wide normalization to the same columns we will use in the feature matrices. Normalization is an optional step, but strongly recommended.
Let’s explore some more options for detecting chromatin state switches with chromswitch. The options are briefly described in the comments, but you can obtain additional explanation of arguments and explore others not covered here by running ??chromswitch::callSummary
. Note that wherever column names are passed to chromswitch functions, these must match the columns in the peaks/features data exactly (this is case sensitive).
out2 <- callSummary(
# Standard arguments of the function
query = query,
metadata = metadata,
peaks = peaks,
# Arbitrary string describing data type
mark = "H3K4me3",
# For quality control, filter peaks based on associated stats
# prior to constructing feature matrices
filter = TRUE,
# Provide column names and thresholds to use in the same order
filter_columns = c("qValue", "signalValue"),
filter_thresholds = c(10, 4),
# Normalization options
normalize = TRUE, # Strongly recommended
# By default, set to equal summarize_columns, below
normalize_columns = c("qValue", "signalValue"),
# Columns to use for for feature matrix construction
summarize_columns = c("qValue", "signalValue"),
# In addition to summarizing peak statistics,
# we can also optionally compute the
# fraction of the region overlapped by peaks
# and the number of peaks
fraction = TRUE,
n = FALSE,
# TRUE by default, return the optimal number
# of clusters, otherwise require k = 2
optimal_clusters = TRUE,
# Set this to TRUE to save a PDF of the heatmap
# for each region to the current working directory
heatmap = FALSE,
# Chromswitch uses BiocParallel as a backend for
# parallel computations. Analysis is parallelized at the
# level of query regions.
BPPARAM = BiocParallel::SerialParam())
out2
The summary approach can be applied to epigenomic data where there are no statistics associated with the features (peaks, states, etc.). In this case, set summarize_columns = NULL
, filter = FALSE
, normalize = FALSE
, and ensure that either n
or fraction
(or both) are set to TRUE
.
The binary strategy requires approximately the same basic input as the summary strategy. It also uses two tuning parameters:
gap
which is the distance between peaks in the same sample below which two peaks should be merged. This preprocessing step is optional, and is controlled by the option reduce
p
which is the fraction of reciprocal overlap required to call two peaks the same. This rule is used to obtain a set of unique peaks observed across samples in the query region, and to assign binary presence or absence of each peak in each sample to construct the feature matrix.We use default values of gap = 300
and p = 0.4
, but the method is robust to changes in these parameters within reasonable ranges.
The other option unique to this strategy is n_features
. The number of features in the matrix used for clustering corresponds to the number of unique peaks that chromswitch identifies for these samples in the query region, and this option controls whether to include an additional column recording the number of features in the output.
Additional parameters can be explored by running ??chromswitch::callBinary
in the console.
out3 <- callBinary(
# Standard arguments of the function
query = query,
metadata = metadata,
peaks = peaks,
# Logical, controls whether to
# reduce gaps between peaks to eliminate noise
reduce = TRUE,
# Peaks in the same sample which are within this many bp
# of each other will be merged
gap = 300,
# The fraction of reciprocal overlap required to define
# two peaks as non unique, used to construct a binary ft matrix
p = 0.4,
# Include in output the number of features obtained in
# each query region
n_features = TRUE)
out3
Again, threshold the output to obtain putative switches:
out3[out3$Consensus >= 0.75, ]
The basic output of chromswitch is a tidy dataframe which includes:
optimal_clusters = FALSE
, otherwise, the optimal set of clusters is obtained by selecting the clusters with the highest average Silhouette width, displayed in the next column)If chromswitch performs hierarchical clustering on a feature matrix with more than one feature, both wrapper functions, callSummary()
and callBinary()
as well as cluster()
can optionally produce a heatmap with the resulting dendrogram as a PDF. Whether this heatmap is produced or not is passed as a logical value to the argument heatmap
. The title of the heatmap and prefix of the file name can be passed as a string to title
, while the path to the output directory can be passed to outdir
.
Chromswitch is implemented in a relatively modular way, with functions for each step of the method. In this section, we repeat the analysis we performed in the previous sections, except that instead of using the wrapper functions, callBinary()
and callSummary()
, we perform each step individually, and allocate some more discussion to options and parameters at each step.
This section leverages the modularity of chromswitch, so the pipe operator %>%
from magrittr is helpful here.
library(magrittr)
We will also inspect some of the intermediate objects returned by chromswitch functions, so we load the data manipulation package, dplyr.
library(dplyr)
After preparing the metadata dataframe (meta
) and importing data into GenomicRanges objects (H3K4me3
), we can pre-process the data. First, since the peak calls from MACS2 are associated with a fold change of enrichment of ChIP-seq signal, a q value for enrichment, etc., we’ll set some thresholds on these statistics and filter out peaks which do not meet them. The user can pass the names of any numeric columns in the data to the columns
argument, ensuring that a numeric threshold for each is passed to thresholds
, in the same order. If too few threshold values provided, they will not be recycled; chromswitch will return an error.
# Number of peaks in each sample prior to filtering
lapply(H3K4me3, length) %>% as.data.frame()
H3K4me3_filt <- filterPeaks(peaks = H3K4me3,
columns = c("qValue", "signalValue"),
thresholds = c(10, 4))
# Number of peaks in each sample after filtering
lapply(H3K4me3_filt, length) %>% as.data.frame()
There are some additional pre-processing steps which are specific to the approach used to construct feature matrices from the data, and we explain these below.
In the summary approach, the features for each sample are a set of summary statistics which represent the peaks observed in that sample in the query region. It’s important, therefore, that the features have comparable ranges. We normalize each statistic genome-wide for each sample. The normalization process essentially involves rescaling the central part of the data to the range [0, 1] and bounding the lower and upper outliers to 0 and 1 respectively. The amount of outliers in each tail to bound can be specified by the user in the tail
parameter, which expects a fraction in [0, 1] (for example, tail = 0.005
results in bounding the upper and lower 0.5% of the data, the default).
This normalization is optional, but is strongly recommended. The effect of not normalizing is that the hierarchical clustering algorithm will be influenced by small changes between samples, which may lead to false positive chromatin state switch calls.
# Summary of the two statistics we will use downstream in raw data in one sample
H3K4me3_filt %>% lapply(as.data.frame) %>%
lapply(select, signalValue, qValue) %>%
lapply(summary) %>%
`[`(1)
## $E068
## signalValue qValue
## Min. : 5.745 Min. : 10.97
## 1st Qu.: 9.431 1st Qu.: 20.06
## Median :18.124 Median : 58.10
## Mean :17.965 Mean : 62.62
## 3rd Qu.:24.403 3rd Qu.: 89.93
## Max. :38.874 Max. :160.87
H3K4me3_norm <- normalizePeaks(H3K4me3_filt,
columns = c("qValue", "signalValue"),
tail = 0.005)
# Summary after normalization
H3K4me3_norm %>% lapply(as.data.frame) %>%
lapply(select, signalValue, qValue) %>%
lapply(summary) %>%
`[`(1)
## $E068
## signalValue qValue
## Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1113 1st Qu.:0.06064
## Median :0.3737 Median :0.31438
## Mean :0.3689 Mean :0.34455
## 3rd Qu.:0.5632 3rd Qu.:0.52675
## Max. :1.0000 Max. :1.00000
Next, we’ll retrieve the peaks in the query region. Here we’ll consider the 5 kbp window around the transcription start site of TTYH1, which is a known brain-specific gene. This returns an object of the LocalPeaks
class. A LocalPeaks
object is a container for the peaks for a set of samples in a specific genomic region of interest, as well as the genomic region itself, and the sample IDs. These components are needed to convert sets of peaks into rectangular feature-by-sample matrices which we can then use for downstream analysis - and in particular, as input to a clustering algorithm in order to call a chromatin state switch.
# TTYH1
ttyh1 <- query[2]
ttyh1
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr19 [54874319, 54877536] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
ttyh1_pk <- retrievePeaks(peaks = H3K4me3_norm,
metadata = metadata,
region = ttyh1)
ttyh1_pk
## An object of class "LocalPeaks"
## Slot "region":
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr19 [54874319, 54877536] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## Slot "peaks":
## $E068
## GRanges object with 0 ranges and 7 metadata columns:
## seqnames ranges strand | name score signalValue pValue
## <Rle> <IRanges> <Rle> | <character> <integer> <numeric> <numeric>
## qValue peak length
## <numeric> <integer> <integer>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $E071
## GRanges object with 1 range and 7 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr19 [54874977, 54876577] * | Rank_22081 194
## signalValue pValue qValue peak length
## <numeric> <numeric> <numeric> <integer> <integer>
## [1] 0.02973599 19.4892 0.02148628 354 1601
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $E074
## GRanges object with 0 ranges and 7 metadata columns:
## seqnames ranges strand | name score signalValue pValue
## <Rle> <IRanges> <Rle> | <character> <integer> <numeric> <numeric>
## qValue peak length
## <numeric> <integer> <integer>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $E101
## GRanges object with 1 range and 7 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr19 [54875141, 54876835] * | Rank_19868 256
## signalValue pValue qValue peak length
## <numeric> <numeric> <numeric> <integer> <integer>
## [1] 0.1070939 25.60248 0.05735531 1242 1695
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $E102
## GRanges object with 1 range and 7 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr19 [54875949, 54876551] * | Rank_18632 194
## signalValue pValue qValue peak length
## <numeric> <numeric> <numeric> <integer> <integer>
## [1] 0.1103389 19.45267 0.0408516 467 603
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $E110
## GRanges object with 0 ranges and 7 metadata columns:
## seqnames ranges strand | name score signalValue pValue
## <Rle> <IRanges> <Rle> | <character> <integer> <numeric> <numeric>
## qValue peak length
## <numeric> <integer> <integer>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
##
## Slot "samples":
## [1] "E068" "E071" "E074" "E101" "E102" "E110"
We can now construct a sample-by-feature matrix from the filtered and normalized data in the query region. To do so, we need to select some summary statistics to be the features in the matrix. If there are any values associated with the peaks or data, specifying these in the cols
parameter to summarizePeaks()
results in taking the mean, median, and maximum of each statistic as a separate feature. There are also two measures which can be calculated from genomic ranges alone: the number of peaks overlapping the query region (argument n
), and the fraction of the region overlapped by peaks (argument fraction
). These parameters take logical values specifying whether or not they should be included.
Here, we take the mean, median, and max of the q value and fold change of peaks in the query region, as well as the fraction of the region overlapped by peaks. When selecting which peak statistics to use in feature construction, it’s important to consider whether statistics are redundant, for example, here we avoid using both the p and q values.
summary_mat <- summarizePeaks(localpeaks = ttyh1_pk,
mark = "H3K4me3",
cols = c("qValue", "signalValue"),
fraction = TRUE,
n = FALSE)
# The sample-by-feature matrix
summary_mat
## H3K4me3_qValue_mean H3K4me3_signalValue_mean H3K4me3_qValue_median
## E068 0.00000000 0.00000000 0.00000000
## E071 0.02148628 0.02973599 0.02148628
## E074 0.00000000 0.00000000 0.00000000
## E101 0.05735531 0.10709394 0.05735531
## E102 0.04085160 0.11033891 0.04085160
## E110 0.00000000 0.00000000 0.00000000
## H3K4me3_signalValue_median H3K4me3_qValue_max H3K4me3_signalValue_max
## E068 0.00000000 0.00000000 0.00000000
## E071 0.02973599 0.02148628 0.02973599
## E074 0.00000000 0.00000000 0.00000000
## E101 0.10709394 0.05735531 0.10709394
## E102 0.11033891 0.04085160 0.11033891
## E110 0.00000000 0.00000000 0.00000000
## H3K4me3_fraction_region_in_peaks
## E068 0.0000000
## E071 0.4975140
## E074 0.0000000
## E101 0.5267247
## E102 0.1873835
## E110 0.0000000
Finally, we can cluster over samples using this matrix, and call a chromatin state switch by assessing the agreement between the inferred cluster assignments and the known biological condition labels of the samples. Since hierarchical clustering results in a dendrogram, we choose the partition of the samples which maximizes the average Silhouette width, which is an internal measure of cluster goodness based on cluster compactness and separation.
cluster(ft_mat = summary_mat,
query = ttyh1,
metadata = metadata,
heatmap = TRUE,
title = "TTYH1 - summary",
optimal_clusters = TRUE)
## Warning: Column `Sample` joining character vector and factor, coercing into
## character vector
The consensus score is equal to 1, indicating a perfect agreement between the inferred clusters and the biological groups (Brain/Other), so we can infer a chromatin state switch around the TSS of TTYH1.
The cluster
function has a heatmap
argument which controls whether a heatmap is produced as a PDF file in the current working directory or at a path specified by outdir
.
Heatmap showing hierarchical clustering result from chromswitch applied to TTYH1 using the summary strategy
In the binary approach, the features correspond to unique peaks observed in the region across samples. This can be sensitive to noisy data in the region, so we propose an additional pre-processing step prior to this approach. When considering peaks, often, we observe two scenarios:
The way chromswitch handles these is controlled by two tuning parameters. The first is the gap
parameter, and is used to decide when nearby peaks should be joined and replaced by one peak. The gap
parameter is the distance between two peaks below which they should be joined.
Example of a transformation on peaks to join nearby peaks which are likely due to the same biological event, implemented in chromswitch as reducePeaks()
.
The reducePeaks()
function takes a LocalPeaks
object as input, so we can use the peaks in the TTYH1 window that we’ve already obtained.
ttyh1_pk_red <- reducePeaks(localpeaks = ttyh1_pk,
gap = 300)
In this approach, we convert the data in the query region into a binary representation by modelling the presence or absence of each unique peak. In this function, first the unique peaks are obtained by collapsing down the union of all peaks supplied in the region, and then we look for each unique peak in each sample to construct the binary matrix. This involves using the second tuning parameter, p
, which specifies how much reciprocal overlap two peaks must have in order to be considered the same peak.
binary_mat <- binarizePeaks(ttyh1_pk_red,
p = 0.4)
# Chromswitch finds a single unique peak in the region
binary_mat
In terms of selecting values for these tuning parameters, our experiments indicate that chromswitch is very robust to changes in these parameters, within reasonable ranges. Visual inspection of the data in a genome browser can be useful for determining exact values, and can help to control the resolution of the analysis.
Again, chromswitch finds good agreement between the cluster assignments and the biological groups in our analysis.
cluster(ft_mat = binary_mat,
metadata = metadata,
query = ttyh1,
optimal_clusters = TRUE)
## Warning: Column `Sample` joining character vector and factor, coercing into
## character vector
These steps can also be easily run in a pipeline using the %>%
operator, which is convenient when using individual chromswitch functions to compose an analysis rather than the two wrapper functions.
Bug reports and questions about usage, the method, etc. are welcome. Please open an issue on the GitHub repository for the development version of chromswitch: https://github.com/sjessa/chromswitch/issues/new.
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] dplyr_0.7.4 magrittr_1.5 bigmemory_4.5.19
## [4] bigmemory.sri_0.1.3 Biobase_2.38.0 bindrcpp_0.2
## [7] rtracklayer_1.38.0 chromswitch_1.0.0 GenomicRanges_1.30.0
## [10] GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0
## [13] BiocGenerics_0.24.0 BiocParallel_1.12.0 BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
## [1] tidyr_0.7.2 jsonlite_1.5
## [3] foreach_1.4.3 gtools_3.5.0
## [5] assertthat_0.2.0 expm_0.999-2
## [7] GenomeInfoDbData_0.99.1 Rsamtools_1.30.0
## [9] yaml_2.1.14 backports_1.1.1
## [11] lattice_0.20-35 glue_1.2.0
## [13] digest_0.6.12 RColorBrewer_1.1-2
## [15] XVector_0.18.0 colorspace_1.3-2
## [17] htmltools_0.3.6 Matrix_1.2-11
## [19] plyr_1.8.4 XML_3.98-1.9
## [21] pkgconfig_2.0.1 bookdown_0.5
## [23] zlibbioc_1.24.0 mvtnorm_1.0-6
## [25] purrr_0.2.4 xtable_1.8-2
## [27] scales_0.5.0 gdata_2.18.0
## [29] manipulate_1.0.1 tibble_1.3.4
## [31] pkgmaker_0.22 ggplot2_2.2.1
## [33] SummarizedExperiment_1.8.0 lazyeval_0.2.1
## [35] mclust_5.3 evaluate_0.10.1
## [37] gplots_3.0.1 MASS_7.3-47
## [39] doParallel_1.0.11 foreign_0.8-69
## [41] NMF_0.20.6 tools_3.4.2
## [43] registry_0.3 matrixStats_0.52.2
## [45] gridBase_0.4-7 stringr_1.2.0
## [47] munsell_0.4.3 cluster_2.0.6
## [49] rngtools_1.2.4 DelayedArray_0.4.0
## [51] Biostrings_2.46.0 compiler_3.4.2
## [53] caTools_1.17.1 rlang_0.1.2
## [55] grid_3.4.2 RCurl_1.95-4.8
## [57] iterators_1.0.8 bitops_1.0-6
## [59] rmarkdown_1.6 boot_1.3-20
## [61] DescTools_0.99.22 gtable_0.2.0
## [63] codetools_0.2-15 reshape2_1.4.2
## [65] R6_2.2.2 GenomicAlignments_1.14.0
## [67] knitr_1.17 bindr_0.1
## [69] rprojroot_1.2 KernSmooth_2.23-15
## [71] stringi_1.1.5 Rcpp_0.12.13