BiocManager::install("TADCompare")
# Or, for the developmental version
devtools::install_github("dozmorovlab/TADCompare")
library(dplyr)
library(SpectralTAD)
library(TADCompare)
TADCompare
is a function that allows users to automatically identify differential TAD boundaries between two datasets. For each differential boundary, we provide unique classification (complex, merge, split, shifted, and strength change) defining how the TAD boundaries change between the datasets.
The only required input is two contact matrices in one of the permitted forms specified in the Input data vignette. TADCompare
function will automatically determine the type of matrix and convert it to an appropriate form, given it is one of the supported formats. The only requirement is that all matrices be in the same format. For the fastest results, we suggest using \(n \times n\) matrices. Additionally, we recommend users to provide resolution of their data. If the resolution is not provided, we estimate it using the numeric column names of contact matrices. We illustrate running TADCompare
using the data from GM12878 cell line, chromosome 22, 50kb resolution (Rao et al. 2014).
# Get the rao contact matrices built into the package
data("rao_chr22_prim")
data("rao_chr22_rep")
# We see these are n x n matrices
dim(rao_chr22_prim)
## [1] 704 704
dim(rao_chr22_rep)
## [1] 704 704
# And, get a glimpse how the data looks like
rao_chr22_prim[100:105, 100:105]
## 2.1e+07 21050000 21100000 21150000 21200000 21250000
## 2.1e+07 6215 1047 1073 714 458 383
## 21050000 1047 4542 2640 1236 619 489
## 21100000 1073 2640 13468 4680 1893 1266
## 21150000 714 1236 4680 13600 4430 2124
## 21200000 458 619 1893 4430 11313 4465
## 21250000 383 489 1266 2124 4465 11824
# Running the algorithm with resolution specified
results = TADCompare(rao_chr22_prim, rao_chr22_rep, resolution = 50000)
# Repeating without specifying resolution
no_res = TADCompare(rao_chr22_prim, rao_chr22_rep)
## Estimating resolution
# We can see below that resolution can be estimated automatically if necessary
identical(results$Diff_Loci, no_res$Diff_Loci)
## [1] TRUE
TADCompare
function returns a list with two data frames and one plot. The first data frame contains all regions of the genome containing a TAD boundary in at least one of the contact matrices. The results are shown below:
head(results$TAD_Frame)
## Boundary Gap_Score TAD_Score1 TAD_Score2 Differential Enriched_In
## 1 16300000 -18.6589859 0.7698503 7.704411 Differential Matrix 2
## 2 16850000 0.2361100 7.8431724 7.580056 Non-Differential Matrix 1
## 3 17350000 0.9683905 3.6628411 3.220253 Non-Differential Matrix 1
## 4 18000000 -0.1033894 2.3616107 2.347392 Non-Differential Matrix 2
## 5 18750000 3.0426660 4.7997622 3.558975 Differential Matrix 1
## 6 18850000 2.6911660 3.7674844 2.680707 Differential Matrix 1
## Type
## 1 <NA>
## 2 Non-Differential
## 3 Non-Differential
## 4 Non-Differential
## 5 Strength Change
## 6 Strength Change
The “Boundary” column contains genomic coordinates of the given boundary. “Gap_Score” corresponds to the differential boundary score (Z-score of the difference between boundary scores). “TAD_Score1” and “TAD_Score2” corresponds to the boundary score of the two contact matrices. “Differential” simply indicates whether the boundary is differential or not differential. “Enriched_In” indicates which matrix contains the TAD boundary. “Type” identifies the type of TAD boundary change. Note: The first boundary will always be classified as “NA” since specific classification is impossible without a reference boundary to the left and right.
The second data frame contains the same information as the first data frame but includes every region of the genome. We show it below:
head(results$Boundary_Scores)
## Boundary TAD_Score1 TAD_Score2 Gap_Score Differential Enriched_In
## 1 16300000 0.7698503 7.7044114 -18.658985912 Differential Matrix 2
## 2 16850000 7.8431724 7.5800558 0.236110024 Non-Differential Matrix 1
## 3 16900000 -0.4402121 -0.4337948 0.009161239 Non-Differential Matrix 1
## 4 16950000 -0.7460531 -0.5551327 -0.467725582 Non-Differential Matrix 2
## 5 17000000 -0.6803509 -0.6512025 -0.037456910 Non-Differential Matrix 2
## 6 17050000 -0.6626352 -0.5562997 -0.245694062 Non-Differential Matrix 2
## Type
## 1 <NA>
## 2 Non-Differential
## 3 Non-Differential
## 4 Non-Differential
## 5 Non-Differential
## 6 Non-Differential
Finally, we include a plot that contains a stacked barplot indicating the prevalence of each type of TAD boundary. The barplot is a ggplot2
object, making it completely customizable. We show this below:
p <- results$Count_Plot
class(p)
## [1] "gg" "ggplot"
plot(p)
## Warning: Removed 1 rows containing missing values (position_stack).
We recognize that users may like to use their own TAD caller for the identification of TAD boundaries prior to boundary comparison. As a result, we provide the option for pre-specification of TADs in a form of two data frames with chromosome, start, and end coordinates of TAD boundaries for the two contact matrices. With this option, only provided TAD boundaries will be tested. We provide an example below using the SpectralTAD TAD caller (Cresswell, Stansfield, and Dozmorov 2019–1AD):
# Call TADs using SpectralTAD
bed_coords1 = bind_rows(SpectralTAD::SpectralTAD(rao_chr22_prim, chr = "chr22", levels = 3))
## Estimating resolution
## Warning: Unknown or uninitialised column: `Group`.
## Warning: Unknown or uninitialised column: `Group`.
bed_coords2 = bind_rows(SpectralTAD(rao_chr22_rep, chr = "chr22", levels = 3))
## Estimating resolution
# Combining the TAD boundaries for both contact matrices
Combined_Bed = list(bed_coords1, bed_coords2)
# Printing the combined list of TAD boundaries
Above, we see the dataset output by SpectralTAD contains a column named “start” and “end” indicating the start and end coordinate of each TAD. This is required, though the output of any TAD caller can be used effectively with some data manipulation. The “Level” column indicates the level of TAD in a hierarchy.
# Running TADCompare with pre-specified TADs
TD_Compare = TADCompare(rao_chr22_prim, rao_chr22_rep, resolution = 50000, pre_tads = Combined_Bed)
# Returning the boundaries
head(TD_Compare$TAD_Frame)
## Boundary Gap_Score TAD_Score1 TAD_Score2 Differential Enriched_In
## 1 16850000 0.23611002 7.8431724 7.5800558 Non-Differential Matrix 1
## 2 17250000 -0.08875306 0.1103987 0.1409999 Non-Differential Matrix 2
## 3 17600000 -0.06733403 -0.4153272 -0.3809658 Non-Differential Matrix 2
## 4 18000000 -0.10338944 2.3616107 2.3473922 Non-Differential Matrix 2
## 5 18350000 -0.04284963 -0.7767081 -0.7433986 Non-Differential Matrix 2
## 6 19100000 0.01857312 -0.8269469 -0.8153997 Non-Differential Matrix 1
## Type
## 1 Non-Overlap
## 2 Non-Overlap
## 3 Non-Overlap
## 4 Non-Differential
## 5 Non-Differential
## 6 Non-Overlap
# Testing to show that all pre-specified boundaries are returned
table(TD_Compare$TAD_Frame$Boundary %in% bind_rows(Combined_Bed)$end)
##
## TRUE
## 94
Here, we see that the only boundaries that are returned are those we pre-specified.
We provide means for visualizing differential TAD boundaries using the DiffPlot
function. As an input, DiffPlot
takes the output from the TADCompare
function and the original contact matrices. As an output, it returns a ggplot2
object that can be further customized. We demonstrate visualization of the differences between GM12878 and IMR90 cell lines, on the subset of chromosome 2, 40kb resolution data processed in Schmitt et al. (Schmitt et al. 2016):
data("GM12878.40kb.raw.chr2")
data("IMR90.40kb.raw.chr2")
mtx1 <- GM12878.40kb.raw.chr2
mtx2 <- IMR90.40kb.raw.chr2
res <- 40000 # Set resolution
# Globally normalize matrices for better visual comparison (does not affect TAD calling)
mtx1_total <- sum(mtx1)
mtx2_total <- sum(mtx2)
(scaling_factor <- mtx1_total / mtx2_total)
## [1] 0.3837505
# Rescale matrices depending on which matrix is smaller
if (mtx1_total > mtx2_total) {
mtx2 <- mtx2 * scaling_factor
} else {
mtx1 <- mtx1 * (1 / scaling_factor)
}
# Coordinates of interesting regions
start_coord <- 8000000
end_coord <- 16000000
# Another interesting region
# start_coord <- 40000000
# end_coord <- 48000000
# Running TADCompare as-is
TD_Compare <- TADCompare(mtx1, mtx2, resolution = res)
# Running the plotting algorithm with pre-specified TADs
p <- DiffPlot(tad_diff = TD_Compare,
cont_mat1 = mtx1,
cont_mat2 = mtx2,
resolution = res,
start_coord = start_coord,
end_coord = end_coord,
show_types = TRUE,
point_size = 5,
max_height = 5,
rel_heights = c(1, 2),
palette = "RdYlBu")
plot(p)
From these results, we can see that boundary scores from both matrices (“Boundary score 1” and “Boundary score 2”) frequently detected as significant boundaries in both matrices (boundary scores above the 1.5 threshold), but are non-differential (black dots). The differential boundaries correspond to the “Differential boundary score” track, where absolute boundary score differences above the 2.0 threshold are detected. The different types of differential boundaries are defined according to a schema described in the TADCompare
manuscript. Note that coloring by types may be disabled by setting “show_types” to FALSE; only differential and non-differential labels will be shown.
We can also use pre-specified TADs by providing a list of TAD coordinates containing TAD boundaries for each contact matrix. The list should be of length 2. We show how to do this below, using SpectralTAD for pre-specification:
# Call TADs using SpectralTAD
bed_coords1 = bind_rows(SpectralTAD(mtx1, chr = "chr2", levels = 3))
## Estimating resolution
bed_coords2 = bind_rows(SpectralTAD::SpectralTAD(mtx2, chr = "chr2", levels = 3))
## Estimating resolution
# Placing the data in a list for the plotting procedure
Combined_Bed = list(bed_coords1, bed_coords2)
# Running TADCompare with pre-specified TADs
TD_Compare <- TADCompare(mtx1, mtx2, resolution = res, pre_tads = Combined_Bed)
# Running the plotting algorithm with pre-specified TADs
p <- DiffPlot(tad_diff = TD_Compare,
cont_mat1 = mtx1,
cont_mat2 = mtx2,
resolution = res,
start_coord = start_coord,
end_coord = end_coord,
pre_tad = Combined_Bed,
show_types = FALSE,
point_size = 5,
max_height = 10,
rel_heights = c(1.5, 2),
palette = "RdYlBu")
plot(p)