`HiCcompare`

Vignette- Introduction
- How to use
`HiCcompare`

`HiCcompare`

provides functions for the joint normalization and detection of differential chromatin interactions between two or multiple Hi-C datasets. `HiCcompare`

operates on processed Hi-C data in the form of chromosome-specific chromatin interaction matrices. It accepts three-column tab-separated text files storing chromatin interaction matrices in a sparse matrix format (see Creating the hic.table object). Functions to convert popular Hi-C data formats (`.hic`

, `.cool`

) to sparse format are available (see ?cooler2sparse). `HiCcompare`

differs from other packages that attempt to compare Hi-C data in that it works on processed data in chromatin interaction matrix format instead of pre-processed sequencing data. In addition, `HiCcompare`

provides a non-parametric method for the joint normalization and removal of biases between two Hi-C datasets for the purpose of comparative analysis. `HiCcompare`

also provides a simple yet robust permutation method for detecting differences between Hi-C datasets.

The `hic_loess`

function outputs normalized chromatin interactions for both matrices ready for the comparative analysis. The `hic_diff`

function performs the comparative analysis and outputs genomic coordinates of pairs of regions detected as differentially interacting, interaction frequencies, the difference and the corresponding permutation p-value.

`HiCcompare`

Install `HiCcompare`

from Bioconductor.

```
source("https://bioconductor.org/biocLite.R")
biocLite("HiCcompare")
library(HiCcompare)
```

You will need processed Hi-C data in the form of sparse upper triangular matrices or BEDPE files in order to use `HiCcompare`

. Data is available from several sources and two examples for downloading and extracting data are listed below. If you have full Hi-C contact matrices you can convert them to sparse upper triangular format using the full `full2sparse`

function as shown in additional functions

`.hic`

filesHi-C data is available from several sources and in many formats. `HiCcompare`

is built to work with the sparse upper triangular matrix format popularized by the lab of Erez Lieberman-Aiden http://aidenlab.org/data.html. If you already have Hi-C data either in the form of a sparse upper triangular matrix or a full contact matrix you can skip to the Creating the hic.table object section. If you obtain data from the Aiden Lab in the `.hic`

format you will need to first extract the matrices that you wish to compare.

- Download the
`straw`

software from https://github.com/theaidenlab/straw/wiki and install it. - Use
`straw`

to extract a Hi-C sparse upper triangular matrix. An example is below:

Say we downloaded the `GSE63525_K562_combined_30.hic`

file from GEO https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525

To extract the raw matrix corresponding to chromosome 22 at 500kb resolution we would use the following command within the terminal

`./straw NONE GSE63525_K562_combined_30.hic 22 22 BP 500000 > K562.chr22.500kb.txt`

This will extract the matrix from the `.hic`

file and save it to the `K562.chr22.500kb.txt`

text file, in the sparse upper triangular matrix format. See more examples on on how to use `straw`

at https://github.com/theaidenlab/straw/wiki/CPP#running. Straw requires several inputs for the extraction of data from a `.hic`

file.

`<NONE/VC/VC_SQRT/KR> <hicFile(s)> <chr1>[:x1:x2] <chr2>[:y1:y2] <BP/FRAG> <binsize>`

The first argument is the normalization method. For use in `HiCcompare`

you want the raw data so you should selected `NONE`

. The second argument is the `.hic`

file name. Next is the chromosome numbers of the matrix you want. For an intrachromosomal contact map both should be the same as in the above example. If you want a matrix of interchromosomal interactions you can use different chromosomes i.e. interactions between chromosome 1 and chromosome 2 (Note that `HiCcompare`

is only meant to be used on intrachromosomal interactions at this point in development). The next argument is whether you want basepair or fragment files. For `HiCcompare`

use `BP`

. The final argument is the binsize of the matrix (the resolution). To extract a matrix at a resolution of 1MB enter `10000000`

. Typical bin sizes are 1MB, 500KB, 100KB, 50KB, 5KB, 1KB. Note that most matrices with resolutions higher than 100KB (i.e. matrices with resolutions of 1KB - 50KB) are typically too sparse (due to insufficient sequencing coverage) for analysis in `HiCcompare`

.

From here we can just import the matrix into R as you would normally for any tab-delimited file.

- Import the data into R
`K562.chr22 <- read.table('K562.chr22.500kb.txt', header=FALSE)`

- Repeat these steps for any other Hi-C dataset that you wish to compare to the first dataset using
`HiCcompare`

, then proceed to Creating the`hic.table`

object.

`.cool`

filesThe `cooler`

software, http://cooler.readthedocs.io/en/latest/index.html, allows access to a large collection of Hi-C data. The cooler index ftp://cooler.csail.mit.edu/coolers contains Hi-C data for `hg19`

and `mm9`

from many different sources. To use data in the `.cool`

format in `HiCcompare`

follow these steps:

- Download and install
`cooler`

from http://cooler.readthedocs.io/en/latest/index.html - Download a
`.cool`

file from the cooler index ftp://cooler.csail.mit.edu/coolers. - Say we downloaded the
`Dixon2012-H1hESC-HindIII-allreps-filtered.1000kb.cool`

file. See`cooler dump --help`

for data extraction options. To extract the contact matrix we use the following commands in the terminal:

`cooler dump --join Dixon2012-H1hESC-HindIII-allreps-filtered.1000kb.cool > dixon.hESC.1000kb.txt`

- Read in the text file as you would any tab-delimited file in R

`hesc1000kb <- read.table("dixon.hESC.1000kb.txt", header = FALSE)`

- Convert to a sparse upper triangular matrix using the
`HiCcompare::cooler2sparse`

function.

`sparse <- cooler2sparse(hesc1000kb)`

- Repeat the steps for another Hi-C dataset that you wish to compare to the first dataset then proceed to Creating the
`hic.table`

object.

HiC-Pro is another tool for processing raw Hi-C data into usable matrix files. HiC-Pro will produce a `.matrix`

file and a `.bed`

file for the data. These `.matrix`

files are in sparse upper triangular format similar to the results of Juicer and the dumped contents of a `.hic`

file, however instead of using the genomic start coordinates for the first two columns of the sparse matrix they use an ID number. The `.bed`

file contains the mappings for each of these IDs to their genomic coordinates. `HiCcompare`

includes a function to convert the results of HiC-Pro into a usable format for analysis in `HiCcompare`

. See the help using `?hicpro2bedpe`

for more details.

Before you start a `HiCcompare`

analysis, you may want to detect CNV and exclude these regions along with any other regions known to exhibit changes or undesirable sequencing characteristics which could cause false positives. `HiCcompare`

provides a function to perform a CNV detection analysis utilizing the `QDNAseq`

R package. You will need to have your Hi-C data in `.bam`

files. This can be accomplished by downloading the raw sequencing results and aligning them using one of the many Hi-C processing pipelines. Once you have `.bam`

files place them in a specified folder and you can then run `get_CNV`

. Make sure to specify the bin size you will be using in kilobase pairs. The bin size should be the same as the resolution of the Hi-C data that will be used in the `HiCcompare`

analysis. The `CNV.level`

option will allow you to choose which level of CNV you would like to exclude. The CNV calls are defined as -2 = deletion, -1 = loss, 0 = normal, 1 = gain, 2 = amplification. In order to exclude amplifications and deletions set `CNV.level = 2`

. To exclude any amount of CNV set `CNV.level = 1`

.

```
cnv <- get_CNV(path2bam = 'path/to/bamfiles', out.file = 'path/to/bamfiles/outfile',
bin.size = 1000, genome = 'hg19', CNV.level = 2)
```

This function will run the CNV detection steps provided in `QDNAseq`

and export a `.txt`

file containing the copy number calls and a `.bed`

file containing the regions detected at or above the chosen CNV level. It will also return a data.frame containing the regions at or above the chosen CNV level. It is recommended to exclude regions with scores of -2 or 2, however this decision is left up to the user.

You may want to exclude the regions with CNV along with blacklisted regions from any further analysis. `HiCcompare`

has the ENCODE blacklists for hg19 and hg38 included (available using `data("hg19_blacklist")`

or `data("hg38_blacklist")`

). We can now create a data.frame (or `GenomicRanges`

object) containing all of the regions we want to exclude from any further analysis.

```
data('hg19_blacklist')
# combine cnv excluded regions with blacklist regions
exclude <- cbind(cnv, hg19_blacklist)
```

Now that we have a data.frame containing the regions to be excluded we just simply set the `exclude.regions`

options in the `create.hic.table`

function to the `exclude`

data.frame.

`hic.table`

objectA sparse matrix format represents a relatively compact and human-readable way to store pair-wise interactions. It is a tab-delimited text format containing three columns: “region1” - a start coordinate (in bp) of the first region, “region2” a start coordinate of the second region, and “IF” - the interaction frequency between them (IFs). Zero IFs are dropped (hence, the *sparse* format). Since the full matrix of chromatin interactions is symmetric, only the upper triangular portion, including the diagonal, is stored.

If you have two full Hi-C contact matrices you can convert them to sparse upper triangular matrices using the `HiCcompare::full2sparse`

function. Once you have two sparse matrices you are ready to create a `hic.table`

object. This will be illustrated using the included sparse matrices at 500kb resolution for chromosome 22 from the HMEC and NHEK cell lines.

```
library(`HiCcompare`)
# load the data
data("HMEC.chr22")
data("NHEK.chr22")
head(HMEC.chr22)
```

```
## region1 region2 IF
## 1 16000000 16000000 5
## 2 16000000 16500000 2
## 3 16500000 16500000 297
## 4 16000000 17000000 5
## 5 16500000 17000000 92
## 6 17000000 17000000 5690
```

Now that we have 2 sparse upper triangular matrices we can create the `hic.table`

object. `create.hic.table`

requires the input of 2 sparse matrices and the chromosome name.

```
# create the `hic.table` object
chr22.table <- create.hic.table(HMEC.chr22, NHEK.chr22, chr = 'chr22')
head(chr22.table)
```

```
## chr1 start1 end1 chr2 start2 end2 IF1 IF2 D M
## 1: chr22 16000000 16500000 chr22 16000000 16500000 5 5.203323 0 0.05750528
## 2: chr22 16000000 16500000 chr22 16500000 17000000 2 8.672206 1 2.11639897
## 3: chr22 16500000 17000000 chr22 16500000 17000000 297 480.440193 0 0.69389392
## 4: chr22 16000000 16500000 chr22 17000000 17500000 5 14.742750 2 1.56000562
## 5: chr22 16500000 17000000 chr22 17000000 17500000 92 277.510581 1 1.59283701
## 6: chr22 17000000 17500000 chr22 17000000 17500000 5690 9435.359752 0 0.72964887
```

By default, all regions for the data entered are used as is. It may be desirable to exclude certain regions from any further analysis because they contain CNV or are overlapping blacklisted regions with known sequencing issues. To exclude the CNV and blacklisted regions we defined above, create the `hic.table`

object as follows.

`chr22.table <- create.hic.table(HMEC.chr22, NHEK.chr22, chr = 'chr22', exclude.regions = exclude, exclude.overlap = 0.2)`

The `exclude.overlap`

option controls the percentage overlap required with the regions listed in your data in order to be exclude. CNV regions should in the same resolution and bins as your data so they will overlap any of your regions by 100%. Blacklisted regions tend to be small and may only overlap your regions by a few basepairs. If you want any amount of overlap to result in exclusion set `exclude.overlap = 0`

. To require 100% overlap before exclusion set `exclude.overlap = 1`

. By default this option will exclude regions with 20% or more overlap.

We can also list multiple `hic.tables`

in order to utilize parallel computing. First we create another `hic.table`

as was done above. Then we combine these two `hic.tables`

into a list.

```
# create list of `hic.table` objects
data("HMEC.chr10")
data("NHEK.chr10")
# create the `hic.table` object
chr10.table <- create.hic.table(HMEC.chr10, NHEK.chr10, chr = 'chr10')
hic.list <- list(chr10.table, chr22.table)
head(hic.list)
```

```
## [[1]]
## chr1 start1 end1 chr2 start2 end2 IF1 IF2 D M
## 1: chr10 0 500000 chr10 0 500000 11160 1.491828e+04 0 0.4187441
## 2: chr10 0 500000 chr10 500000 1000000 3339 4.766792e+03 1 0.5136024
## 3: chr10 500000 1000000 chr10 500000 1000000 15871 1.998235e+04 0 0.3323333
## 4: chr10 0 500000 chr10 1000000 1500000 1064 6.501731e+02 2 -0.7106025
## 5: chr10 500000 1000000 chr10 1000000 1500000 2963 2.097443e+03 1 -0.4984269
## ---
## 35366: chr10 134000000 134500000 chr10 135000000 135500000 1810 1.273570e+03 2 -0.5071112
## 35367: chr10 134500000 135000000 chr10 135000000 135500000 1971 3.634653e+03 1 0.8828898
## 35368: chr10 135000000 135500000 chr10 135000000 135500000 16716 1.188368e+04 0 -0.4922479
## 35369: chr10 134000000 134500000 chr10 135500000 136000000 1 6.865608e-01 3 -0.5425406
## 35370: chr10 135000000 135500000 chr10 135500000 136000000 4 3.432804e+00 1 -0.2206125
##
## [[2]]
## chr1 start1 end1 chr2 start2 end2 IF1 IF2 D M
## 1: chr22 16000000 16500000 chr22 16000000 16500000 5 5.203323 0 0.05750528
## 2: chr22 16000000 16500000 chr22 16500000 17000000 2 8.672206 1 2.11639897
## 3: chr22 16500000 17000000 chr22 16500000 17000000 297 480.440193 0 0.69389392
## 4: chr22 16000000 16500000 chr22 17000000 17500000 5 14.742750 2 1.56000562
## 5: chr22 16500000 17000000 chr22 17000000 17500000 92 277.510581 1 1.59283701
## ---
## 2479: chr22 49000000 49500000 chr22 51000000 51500000 21 54.634896 4 1.37943338
## 2480: chr22 49500000 50000000 chr22 51000000 51500000 35 71.112086 3 1.02273986
## 2481: chr22 50000000 50500000 chr22 51000000 51500000 394 339.083241 2 -0.21655615
## 2482: chr22 50500000 51000000 chr22 51000000 51500000 4066 2741.284207 1 -0.56875831
## 2483: chr22 51000000 51500000 chr22 51000000 51500000 9916 7132.021930 0 -0.47544713
```

The `hic.table`

object contains a summary of the differences between the two matrices. “IF1” and “IF2” correspond to interaction frequencies in the first and second matrices, “D” is the unit distance (length of each unit is equivalent to the resolution of the data, e.g., 500kb), “M” is the \(log_2(IF2)-log_2(IF1)\) difference.

A `hic.table`

object can also be created using data in the 7 column BEDPE format. An example of BEDPE data for the HMEC dataset used above is shown below.

```
## chr1 start1 end1 chr2 start2 end2 IF1
## 1: chr22 16000000 16500000 chr22 16000000 16500000 5
## 2: chr22 16000000 16500000 chr22 16500000 17000000 2
## 3: chr22 16500000 17000000 chr22 16500000 17000000 297
## 4: chr22 16000000 16500000 chr22 17000000 17500000 5
## 5: chr22 16500000 17000000 chr22 17000000 17500000 92
## 6: chr22 17000000 17500000 chr22 17000000 17500000 5690
```

To create a `hic.table`

object using BEDPE data is very similar to using data in the sparse upper triangular format.

```
bed.hic.tab <- create.hic.table(HMEC.chr22_BEDPE, NHEK.chr22_BEDPE)
head(bed.hic.tab)
```

```
## chr1 start1 end1 chr2 start2 end2 IF1 IF2 D M
## 1: chr22 16000000 16500000 chr22 16000000 16500000 5 5.203323 0 0.05750528
## 2: chr22 16000000 16500000 chr22 16500000 17000000 2 8.672206 1 2.11639897
## 3: chr22 16500000 17000000 chr22 16500000 17000000 297 480.440193 0 0.69389392
## 4: chr22 16000000 16500000 chr22 17000000 17500000 5 14.742750 2 1.56000562
## 5: chr22 16500000 17000000 chr22 17000000 17500000 92 277.510581 1 1.59283701
## 6: chr22 17000000 17500000 chr22 17000000 17500000 5690 9435.359752 0 0.72964887
```

A `hic.table`

can also be created using an `InteractionSet`

object. Simply enter two `InteractionSets`

representing two Hi-C matrices into the `create.hic.table`

function and they will be converted to the proper format. See the `InteractionSet`

vignette for creating `InteractionSet`

objects here

```
data("hmec.IS")
data("nhek.IS")
head(hmec.IS)
```

```
## GInteractions object with 6 interactions and 1 metadata column:
## seqnames1 ranges1 seqnames2 ranges2 | IF
## <Rle> <IRanges> <Rle> <IRanges> | <numeric>
## [1] chr22 [16000000, 16499999] --- chr22 [16000000, 16499999] | 5
## [2] chr22 [16000000, 16499999] --- chr22 [16500000, 16999999] | 2
## [3] chr22 [16500000, 16999999] --- chr22 [16500000, 16999999] | 297
## [4] chr22 [16000000, 16499999] --- chr22 [17000000, 17499999] | 5
## [5] chr22 [16500000, 16999999] --- chr22 [17000000, 17499999] | 92
## [6] chr22 [17000000, 17499999] --- chr22 [17000000, 17499999] | 5690
## -------
## regions: 71 ranges and 0 metadata columns
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

`IS.hic.tab <- create.hic.table(hmec.IS, nhek.IS)`

To shorten computational time, or if one is only interested in a subsection of a Hi-C matrix, one may wish to use a subset of the `hic.table`

object. Use the `subset.dist`

or `subset.index`

options, see help for the `create.hic.table`

function.

If you are performing your `HiCcompare`

analysis on the entire genome you can perform total sum scaling on the data before performing loess normalization. This requires the use of a list of `hic.table`

objects, one for each chromosome. Additionally, when you create these `hic.table`

objects you must set `scale = FALSE`

. Once you have your list of `hic.table`

objects you can then perform total sum scaling.

`hic.list <- total_sum(hic.list)`

It is recommended to use total sum scaling when performing a `HiCcompare`

analysis on the entire genome. If you are only using data for a single chromosome then total sum scaling should be equivalent to the `scale`

option in the `create.hic.table`

function.

Now that you have created a `hic.table`

object you can jointly normalize your two Hi-C matrices. The `hic_loess`

function has many options and can accept a single `hic.table`

or a list of hic.tables. If for example you wish to perform joint normalization for every chromosome on two cell lines while utilizing parallel computing you can create a list containing the hic.tables for each chromosome.

To change the degree of the polynomial for the loess joint normalization you can utilize the `degree`

option, default is 1 (linear regression). A user-defined span, or the amount of data used to build the loess model, can also be set with the `span`

option. However if `span = NA`

(the default) the automatic smoothing parameter selection process will run and determine the optimal span for the data. The type of selection process can be changed using the `loess.criterion`

option. Available settings are `gcv`

(the default) for generalized cross-validation or `aicc`

for Akaike information criterion. The loess automatic smoothing parameter selection uses a customized version of the `fANCOVA::loess.as`

function. For more information on parameter selection please see the `fANCOVA`

reference manual. It is recommended to use the default settings.

`hic_loess`

can utilize the `BiocParallel`

package to perform parallel computing and lower computation time. The `parallel`

option (FALSE by default) will only speed up computation if a list of hic.tables is entered into the function, i.e., it parallelizes processing of several chromosome-specific matrices. This is useful for performing joint normalization for every chromosome between two Hi-C datasets. For more information on `BiocParallel`

see the reference manual here.

The basis of `HiCcompare`

rests on the novel concept termed the MD plot. The MD plot is similar to the MA plot or the Bland-Altman plot. \(M\) is the log2 difference between the interaction frequencies from the two datasets. \(D\) is the unit distance between the two interacting regions. Loess is performed on the data after it is represented in the MD coordinate system. To visualize the MD plot the `Plot`

option can be set to TRUE.

If you wish to detect differences between the two Hi-C datasets immediately following joint normalization you can set the `check.differences`

option to TRUE. However, if you only want to jointly normalize the data for now keep this option set to FALSE. The difference detection process will be covered in the next section.

```
# Jointly normalize data for a single chromosome
hic.table <- hic_loess(chr22.table, Plot = TRUE, Plot.smooth = FALSE)
```

`## Span for loess: 0.219052588777294`

`## GCV for loess: 0.000118686898199309`

`## AIC for loess: -0.212037314500057`

```
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
```

`head(hic.table)`

```
## chr1 start1 end1 chr2 start2 end2 IF1 IF2 D M adj.IF1 adj.IF2 adj.M mc
## 1: chr22 16000000 16500000 chr22 16000000 16500000 5 5.203323 0 0.05750528 5.103586 5.097713 -0.001661128 0.05916641
## 2: chr22 16000000 16500000 chr22 16500000 17000000 2 8.672206 1 2.11639897 2.079392 8.341096 2.004074812 0.11232416
## 3: chr22 16500000 17000000 chr22 16500000 17000000 297 480.440193 0 0.69389392 303.153008 470.688840 0.634727512 0.05916641
## 4: chr22 16000000 16500000 chr22 17000000 17500000 5 14.742750 2 1.56000562 5.294664 13.922271 1.394783581 0.16522204
## 5: chr22 16500000 17000000 chr22 17000000 17500000 92 277.510581 1 1.59283701 95.652053 266.915059 1.480512855 0.11232416
## 6: chr22 17000000 17500000 chr22 17000000 17500000 5690 9435.359752 0 0.72964887 5807.880851 9243.853028 0.670482465 0.05916641
```

```
# Multiple hic.tables can be processed in parallel by entering a list of hic.tables
hic.list <- hic_loess(hic.list, parallel = TRUE)
```

The `hic_loess`

joint normalization function extends the `hic.table`

with the adjusted interaction frequencies, adjusted “M”, and the “mc” correction factor.

This is an optional step of the analysis process for advanced users. This step will require the user to look at several plots to decide if weighting is necessary and if so, decide on the input parameters for the weighting. Weighting of M values may not be necessary for data at low resolutions such as 1MB or 500KB. For higher resolution data this process will probably be more necessary. However, you will need to look at the data and plots in order to determine this for yourself. If you perform a standard `HiCcompare`

analysis without weighting of M values on your data and find that a large number of detected differences have low average expression it may be helpful to use M weighting.

M values are defined as \(log2(IF2 / IF1)\). Since M values are the result of a ratio, differences on the log scale can appear identical when on the raw scale they are actually very different. For instance The M values for IF1 = 1 and IF2 = 10 and for IF1 = 10 and IF2 = 100 are both 3.32, despite the fact that the second pair of interaction frequencies has a much larger difference on the raw scale. This could potentially lead to issues in the analysis of Hi-C data, as IFs with lower average expression values are less trustworthy and more prone to display differences due solely to chance or technical biases rather than true biological differences. Therefore, M values derived from IFs with large average expression should be treated as more trustworthy than those derived from IFs exhibiting small average expression. One solution to this issue to weight M values based on their corresponding average expression before performing difference detection.

To weight M values, first we calculate the average expression associated with each M value, \(A = (IF1 + IF2) / 2\). Simple plotting of A values indicates that average expression between two Hi-C datasets is roughly distributed in a powerlaw fashion, with the majority of A values being rather small and a long right tail of very large A values. A values less than 1 should not be trusted at all and so they can safely be ignored for now. At this point we will need to plot the distribution of A in order to determine a minimum value of A at which to start the weighting and at which quantile of A we should stop weighting.

We can plot the distribution of A using the `plot_A`

function on a list of hic.table objects. Here we use Hi-C data comparing two different regions of the brain for chromosomes 18 and 22 at 100KB resolution. The data has already been read into a `hic.table`

list and jointly normalized with `hic_loess`

.

```
data("brain_table")
aplot <- plot_A(brain_table)
```

From looking at the plots we can see that the density levels out by the 75th quantile so we can focus on the values of A < 75th quantile. Next we will need to determine the minimum value of A at which to start fitting the powerlaw for weighting. From the 4th plot we can see that after a value of 4 the density of A starts to more or less uniformly decrease so this should be a safe guess for the minimum A value. The peak of the log2(A) histogram looks to occur at about 1.5 so if we take the anti-log we get 2.8 further suggesting that a value between 2.8 and 4 should be okay for fitting the weights.

Now we can perform weighting of the M values.

`brain_table <- weight_M(brain_table, A.min = 4, quant = 0.75, Plot.diagnostic = TRUE, Plot.MD = TRUE)`

`## M values corresponding to Average expression < 4 being weighted to 0.`

`## M values corresponding to Average expression < 4 being weighted to 0.`

The diagnostic plots are the histogram of raw A values between the `A.min`

value and the `quant`

quantile. The solid line represents the powerlaw fit to the density. You should always check these plots to make sure the fit of the weights is okay. You can also see the new MD plots after M weighting above. The diagnostic plots indicate that the parameters chosen worked for the data. You can now proceed to difference detection on your data.

Weighting of the M values occurs as follows. We estimate the kernel density for the A values which fall into the range \(A \le A_{min} \ and \ A < A_{quant}\). The approximate probability density function, \(f_A(.)\), of the kernel density is then interpolated from the kernel density. Next, we fit a powerlaw to this subset of A. The powerlaw is defined as \(P(A) = cA^{-\alpha}\). We get the best fit of the powerlaw to the kernel density by minimizing the sum of squared residuals (SSR) between the powerlaw fit and the kernel density. The weights, \(w\), can be calculated as follows \[w = \begin{cases} 0, \ A < A_{min} \\ 1 - \frac{P(A)}{max(P(A))}, \ A_{min} \le A < A_{0.75} \\ 1, \ A \ge A_{0.75} \end{cases} \] where \(P(A)\) is the powerlaw fit based on the minimizing the SSR. This weighting scheme will produce weights on [0, 1] corresponding to the powerlaw fit to the estimated kernel density of A. The weights can then be multiplied to the normalized M values before difference detection is performed. The weights can only make the M values closer to 0 and thus the differences represented will become less significant if their corresponding A value is low. M values corresponding to high A values will be unchanged.

Difference detection also makes use of the MD plot. The jointly normalized datasets are once again plotted on the MD plot. Difference detection takes place on a per-unit-distance basis using a permutation test. Permutations are broken into blocks for each unit distance. See methods section of the manuscript here for more details. Difference detection can be performed at the same time as joint normalization if the `check.differences`

option in `hic_loess`

is set to TRUE. Otherwise a `hic.table`

object output from the `hic_loess`

function can be input into the `hic_diff`

function for difference detection.

`hic_diff`

accepts a `hic.table`

object or a list of hic.table objects output from `hic_loess`

. `hic_diff`

can also utilize the `BiocParallel`

package the same way as the `hic_loess`

loess function. The same limits and uses apply as stated above. If the `Plot`

option is set to TRUE, an MD plot will be displayed showing the normalized MD plot along with coloring based on the significance of each difference. The dotted lines on this MD plot represent the `diff.thresh`

threshold. This threshold is automatically calculated as 2 standard deviations of M from 0 on the MD plot by default but can be set by the user or turned off. This option helps to deal with false positives by restricting the regions that can be labelled as significant when they fall inside the threshold. The threshold can be thought of as a range in which differences do not exceed the level of the expected noise in the Hi-C data. If you used M weighting on your data you should use a `diff_thresh`

otherwise you will get many false positives for longer range interactions. The function will return a hic.table or a list of hic.tables with an additional column containing the p-value for the significance of the difference between the IFs in each cell of the two Hi-C matrices.

If you wish to visualize the p-values in the form of a Hi-C contact map you can use the `visualize_pvals`

function to transform the results to a contact matrix with the p-values inserted instead of IFs as illustrated in the example below.

```
# input hic.table object into hic_diff
hic.table <- hic_diff(hic.table, Plot = TRUE, Plot.smooth = FALSE)
```

```
# input a list of hic.tables
hic.list <- hic_diff(hic.list)
# visualize results as contact map of p-values
visualize_pvals(hic.table)
```

`HiCcompare`

results to `InteractionSet`

objectsIf after running `hic_loess`

or `hic_diff`

on your data you wish to perform additional analyses which require the `GRanges`

object class you can convert the results using the `make_InteractionSet`

function. This function will produce an `InteractionSet`

object with the genomic ranges contained in the `hic.table`

along with several metadata files containing the additional information produced by `hic_loess`

or `hic_diff`

.

`IntSet <- make_InteractionSet(hic.table)`

`HiCcompare`

includes functions for simulating Hi-C data. The `hic_simulate`

function allows you to simulate two Hi-C matrices with added bias and true differences at a specified fold change. As an example we will simulate two matrices with 250 true differences added at a fold change of 4.

```
number_of_unitdistances <- 100 # The dimensions of the square matrix to be simualted
number_of_changes <- 250 # How many cells in the matrix will have changes
i.range <- sample(1:number_of_unitdistances, number_of_changes, replace = TRUE) # Indexes of cells to have controlled changes
j.range <- sample(1:number_of_unitdistances, number_of_changes, replace = TRUE) # Indexes of cells to have controlled changes
sim_results <- hic_simulate(nrow = number_of_unitdistances, medianIF = 50000, sdIF = 14000, powerlaw.alpha = 1.8, fold.change = 4, i.range = i.range, j.range = j.range, Plot = TRUE, alpha = 0.1)
```

`## Span for loess: 0.189983528888389`

`## GCV for loess: 0.000103274505170277`

`## AIC for loess: 0.291630989785669`

`## True Positives: 225 Total added differences: 240 True Negatives: 4300`

`## TPR: 0.9375`

`## SPC: 0.955343257053988`

The results of the simulation are saved in a list.

`names(sim_results)`

`## [1] "TPR" "SPC" "pvals" "hic.table" "true.diff" "truth" "sim.table"`

TPR is the true positive rate, SPC is the specificity, pvals is a vector of the p-values for every cell in the matrices, hic.table is the resulting hic.table object after `hic_loess`

and `hic_diff`

have been applied to the simulated data, true.diff is a table for the cells that had the specified fold change applied to them, truth is a vector of 0’s and 1’s indicating if a cell had a true difference applied - this is useful for input into ROC packages, sim.table is the simulated data in a hic.table object before being scaled, normalized, and analyzed for differences.

`HiCcompare`

contains some additional functions that may be useful.

If you do not choose to show the MD plots when you initially run `hic_loess`

or `hic_diff`

you can use the `MD.plot1`

or `MD.plot2`

functions. `MD.plot1`

will create a side by side MD plot showing before and after loess normalization. Enter your original M and D vectors along with the M correction factor, `mc`

, calculated by `hic_loess`

. The `smooth`

option controls if the plot is plotted using `smoothScatter`

or as a `ggplot2`

scatter plot.

`MD.plot1(M = hic.table$M, D = hic.table$D, mc = hic.table$mc, smooth = TRUE)`

`MD.plot2`

will create a standard MD plot with optional coloring based on p-value. Just enter an M and D vector and a p-value vector if desired.

```
# no p-value coloring
MD.plot2(M = hic.table$adj.M, D = hic.table$D, smooth = FALSE)
```

```
# p-value coloring
MD.plot2(M = hic.table$adj.M, D = hic.table$D, hic.table$p.value, smooth = FALSE)
```

There are two matrix transformation functions included. `sparse2full`

will transform a sparse upper triangular matrix to a full Hi-C matrix. `full2sparse`

will transform a full Hi-C matrix to sparse upper triangular format.

`full.NHEK <- sparse2full(NHEK.chr22)`

`## Matrix dimensions: 71x71`

`full.NHEK[1:5, 1:5]`

```
## 1.6e+07 16500000 1.7e+07 17500000 1.8e+07
## 1.6e+07 6 10 17 13 1
## 16500000 10 554 320 118 23
## 1.7e+07 17 320 10880 2922 576
## 17500000 13 118 2922 27545 3693
## 1.8e+07 1 23 576 3693 25794
```

```
sparse.NHEK <- full2sparse(full.NHEK)
head(sparse.NHEK)
```

```
## region1 region2 IF
## 1: 16000000 16000000 6
## 2: 16000000 16500000 10
## 3: 16500000 16500000 554
## 4: 16000000 17000000 17
## 5: 16500000 17000000 320
## 6: 17000000 17000000 10880
```

`KRnorm`

will perform Knight-Ruiz normalization on a Hi-C matrix. Just enter the full Hi-C matrix to be normalized.

`KR.NHEK <- KRnorm(full.NHEK)`

`SCN`

will perform Sequential Component Normalization on a Hi-C matrix. Just enter the full Hi-C matrix to be normalized.

`SCN.NHEK <- SCN(full.NHEK)`

`MA_norm`

will perform MA normalization on a `hic.table`

object.

`result <- MA_norm(hic.table, Plot = TRUE)`