1 Introduction

The aim of HiCDOC is to detect significant A/B compartment changes, using Hi-C data with replicates.

HiCDOC normalizes intrachromosomal Hi-C matrices, uses unsupervised learning to predict A/B compartments from multiple replicates, and detects significant compartment changes between experiment conditions.

It provides a collection of functions assembled into a pipeline:

  1. Filter:
    1. Remove chromosomes which are too small to be useful.
    2. Filter sparse replicates to remove uninformative replicates with few interactions.
    3. Filter positions (bins) which have too few interactions.
  2. Normalize:
    1. Normalize technical biases (inter-matrix normalization) using cyclic loess normalization (Stansfield, Cresswell, and Dozmorov 2019), so that matrices are comparable.
    2. Normalize biological biases (intra-matrix normalization) using Knight-Ruiz matrix balancing (Knight and Ruiz 2012), so that all the bins are comparable.
    3. Normalize the distance effect, which results from higher interaction proportions between closer regions, with a MD loess.
  3. Predict:
    1. Predict compartments using constrained K-means (Wagstaff et al. 2001).
    2. Detect significant differences between experiment conditions.
  4. Visualize:
    1. Plot the interaction matrices of each replicate.
    2. Plot the overall distance effect on the proportion of interactions.
    3. Plot the compartments in each chromosome, along with their concordance (confidence measure) in each replicate, and significant changes between experiment conditions.
    4. Plot the overall distribution of concordance differences.
    5. Plot the result of the PCA on the compartments’ centroids.
    6. Plot the boxplots of self interaction ratios (differences between self interactions and the medians of other interactions) of each compartment, which is used for the A/B classification.

2 Installation

HiCDOC can be installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("HiCDOC")

The package can then be loaded:

library(HiCDOC)
#> Loading required package: InteractionSet
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#>     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#>     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians

3 Importing Hi-C data

HiCDOC can import Hi-C data sets in various different formats: - Tabular .tsv files. - Cooler .cool or .mcool files. - Juicer .hic files. - HiC-Pro .matrix and .bed files.

3.1 Tabular files

A tabular file is a tab-separated multi-replicate sparse matrix with a header:

chromosome position 1 position 2 C1.R1 C1.R2 C2.R1 … Y 1500000 7500000 145 184 72 … …

The number of interactions between position 1 and position 2 of chromosome are reported in each condition.replicate column. There is no limit to the number of conditions and replicates.

To load Hi-C data in this format:

hic.experiment <- HiCDOCDataSetFromTabular('path/to/data.tsv')

3.2 Cooler files

To load .cool or .mcool files generated by Cooler (Abdennur and Mirny 2019):

# Path to each file
paths = c(
    'path/to/condition-1.replicate-1.cool',
    'path/to/condition-1.replicate-2.cool',
    'path/to/condition-2.replicate-1.cool',
    'path/to/condition-2.replicate-2.cool',
    'path/to/condition-3.replicate-1.cool'
)

# Replicate and condition of each file. Can be names instead of numbers.
replicates <- c(1, 2, 1, 2, 1)
conditions <- c(1, 1, 2, 2, 3)

# Resolution to select in .mcool files
binSize = 500000

# Instantiation of data set
hic.experiment <- HiCDOCDataSetFromCool(
    paths,
    replicates = replicates,
    conditions = conditions,
    binSize = binSize # Specified for .mcool files.
)

3.3 Juicer files

To load .hic files generated by Juicer (Durand 2016):

# Path to each file
paths = c(
    'path/to/condition-1.replicate-1.hic',
    'path/to/condition-1.replicate-2.hic',
    'path/to/condition-2.replicate-1.hic',
    'path/to/condition-2.replicate-2.hic',
    'path/to/condition-3.replicate-1.hic'
)

# Replicate and condition of each file. Can be names instead of numbers.
replicates <- c(1, 2, 1, 2, 1)
conditions <- c(1, 1, 2, 2, 3)

# Resolution to select
binSize <- 500000

# Instantiation of data set
hic.experiment <- HiCDOCDataSetFromHiC(
    paths,
    replicates = replicates,
    conditions = conditions,
    binSize = binSize
)

3.4 HiC-Pro files

To load .matrix and .bed files generated by HiC-Pro (Servant 2015):

# Path to each matrix file
matrixPaths = c(
    'path/to/condition-1.replicate-1.matrix',
    'path/to/condition-1.replicate-2.matrix',
    'path/to/condition-2.replicate-1.matrix',
    'path/to/condition-2.replicate-2.matrix',
    'path/to/condition-3.replicate-1.matrix'
)

# Path to each bed file
bedPaths = c(
    'path/to/condition-1.replicate-1.bed',
    'path/to/condition-1.replicate-2.bed',
    'path/to/condition-2.replicate-1.bed',
    'path/to/condition-2.replicate-2.bed',
    'path/to/condition-3.replicate-1.bed'
)

# Replicate and condition of each file. Can be names instead of numbers.
replicates <- c(1, 2, 1, 2, 1)
conditions <- c(1, 1, 2, 2, 3)

# Instantiation of data set
hic.experiment <- HiCDOCDataSetFromHiCPro(
    matrixPaths = matrixPaths,
    bedPaths = bedPaths,
    replicates = replicates,
    conditions = conditions
)

4 Running the HiCDOC pipeline

An example dataset can be loaded from the HiCDOC package:

data(exampleHiCDOCDataSet)

Once your data is loaded, you can run all the filtering, normalization, and prediction steps with the command : HiCDOC(exampleHiCDOCDataSet). This one-liner runs all the steps detailed below.

4.1 Filtering data

Remove small chromosomes of length smaller than 100 positions (100 is the default value):

hic.experiment <- filterSmallChromosomes(exampleHiCDOCDataSet, threshold = 100)
#> Keeping chromosomes with at least 100 positions.
#> Kept 3 chromosomes: X, Y, Z
#> Removed 1 chromosome: W

Remove sparse replicates filled with less than 30% non-zero interactions (30% is the default value):

hic.experiment <- filterSparseReplicates(hic.experiment, threshold = 0.3)
#> Keeping replicates filled with at least 30% non-zero interactions.
#> 
#> Removed interactions matrix of chromosome X, condition 1, replicate R2 filled at 2.347%.
#> Removed 1 replicate in total.

Remove weak positions with less than 1 interaction in average (1 is the default value):

hic.experiment <- filterWeakPositions(hic.experiment, threshold = 1)
#> Keeping positions with interactions average greater or equal to 1.
#> Chromosome X: 2 positions removed, 118 positions remaining.
#> Chromosome Y: 3 positions removed, 157 positions remaining.
#> Chromosome Z: 0 positions removed, 200 positions remaining.
#> Removed 5 positions in total.

4.2 Normalizing biases

4.2.1 Technical biases

Normalize technical biases such as sequencing depth (inter-matrix normalization) so that matrices are comparable :

suppressWarnings(hic.experiment <- normalizeTechnicalBiases(hic.experiment))
#> Normalizing technical biases.

This normalization uses uses cyclic loess normalization from [multiHiCcompare package] (Stansfield, Cresswell, and Dozmorov 2019). Note : For large dataset, it is highly recommended to set a value for cycleLoessSpan parameter to reduce computing time and necessary memory. See ?HiCDOC::normalizeTechnicalBiases

4.2.2 Biological biases

Normalize biological biases, such as GC content, number of restriction sites, etc. (intra-matrix normalization):

hic.experiment <- normalizeBiologicalBiases(hic.experiment)
#> Chromosome X: normalizing biological biases.
#> Chromosome Y: normalizing biological biases.
#> Chromosome Z: normalizing biological biases.

4.2.3 Distance effect

Normalize the linear distance effect resulting from more interactions between closer genomic regions (20000 is the default value for loessSampleSize):

hic.experiment <- 
    normalizeDistanceEffect(hic.experiment, loessSampleSize = 20000)
#> Chromosome X: normalizing distance effect.
#> Chromosome Y: normalizing distance effect.
#> Chromosome Z: normalizing distance effect.

4.3 Predicting compartments and differences

Predict A and B compartments and detect significant differences, here using the default values as parameters:

hic.experiment <- detectCompartments(hic.experiment)
#> Clustering genomic positions.
#> Predicting A/B compartments.
#> Detecting significant differences.

5 Visualizing data and results

Plot the interaction matrix of each replicate:

p <- plotInteractions(hic.experiment, chromosome = "Y")
p