This vignette shows you how GenomicInteractions can be used to investigate significant interactions in HiC data that has been analysed using HOMER software [1]. GenomicInteractions can take a HOMER interaction file as input.
HiC data comes from chromosome conformation capture followed by high-throughput sequencing. Unlike 3C, 4C or 5C, which target specific regions, it can provide genome-wide information about the spatial proximity of regions of the genome. The raw data takes the form of paired-end reads connecting restriction fragments. The resolution of a HiC experiment is limited by the number of paired-end sequencing reads produced and by the sizes of restriction fragments. To increase the power to distinguish real interactions from random noise, HiC data is commonly analysed in bins from 20kb - 1Mb. There are a variety of tools available for binning the data, controlling for noise (e.g. self-ligations of restriction fragments), and finding significant interactions.
The data we are using comes from this paper [2] and can be downloaded from GEO. It is HiC data from wild type mouse double positive thymocytes. The experiment was carried out using the HindIII restriction enzyme. The paired-end reads were aligned to the mm9 mouse genome assembly and HOMER software was used to filter reads and detect significant interactions at a resolution of 100kb. For the purposes of this vignette, we will consider only data from chromosomes 14 and 15.
Load the data by specifying the file location and experiment type. You can also include an experiment name and description.
library(GenomicInteractions)
library(GenomicRanges)
hic_file <- system.file("extdata", "Seitan2013_WT_100kb_interactions.txt",
package="GenomicInteractions")
hic_data <- makeGenomicInteractionsFromFile(hic_file,
type="homer",
experiment_name = "HiC 100kb",
description = "HiC 100kb resolution")
seqlengths(hic_data) <- c(chr15 = 103494974, chr14 = 125194864)
The GenomicInteractions
object consists of two linked GenomicRanges
objects containing the anchors of each interaction, and the number of reads supporting each interaction. Metadata for each interaction (in this case, p-values and FDRs) is stored as a DataFrame
accessed by mcols()
or elementMetadata()
, similar to the metadata of a simple GRanges
. You can also access single metadata columns using $
.
hic_data
## GenomicInteractions object with 23276 interactions and 2 metadata columns:
## Name: HiC 100kb
## Description: HiC 100kb resolution
## Sum of interactions: 447000
## Annotated: no
## Interactions:
## Anchor One Anchor Two Counts
## [1] chr15:97600000..97699999 --- chr15:97500000..97599999 344
## [2] chr15:74800000..74899999 --- chr15:74700000..74799999 373
## [3] chr14:55000000..55099999 --- chr14:54800000..54899999 258
## [4] chr15:80400000..80499999 --- chr15:80300000..80399999 397
## [5] chr14:55100000..55199999 --- chr14:54800000..54899999 213
## ... ... ... ... ...
## [23272] chr15:82900000..82999999 --- chr15:6500000..6599999 7
## [23273] chr15:100500000..100599999 --- chr15:89000000..89099999 9
## [23274] chr15:46500000..46599999 --- chr15:19200000..19299999 9
## [23275] chr14:58500000..58599999 --- chr14:15100000..15199999 8
## [23276] chr14:72100000..72199999 --- chr14:47000000..47099999 10
## | p.value fdr
## [1] | 4.60038639163309e-143 0
## [2] | 4.40907012071168e-127 0
## [3] | 4.51853818556112e-124 0
## [4] | 5.4911744572704e-120 0
## [5] | 1.95389921464777e-118 0
## ... ... ... ...
## [23272] | 0.000999543383263305 1
## [23273] | 0.000999613353749072 1
## [23274] | 0.000999951280168987 1
## [23275] | 0.000999991279020165 1
## [23276] | 0.000999994278998502 1
##
## -------
## seqinfo: 2 sequences from an unspecified genome
mcols(hic_data)
## DataFrame with 23276 rows and 2 columns
## p.value fdr
## <numeric> <numeric>
## 1 4.6004e-143 0
## 2 4.4091e-127 0
## 3 4.5185e-124 0
## 4 5.4912e-120 0
## 5 1.9539e-118 0
## ... ... ...
## 23272 0.00099954 1
## 23273 0.00099961 1
## 23274 0.00099995 1
## 23275 0.00099999 1
## 23276 0.00099999 1
head(hic_data$fdr)
## [1] 0 0 0 0 0 0
We can check that the anchors are of the expected size (100kb).
summary(width(anchorOne(hic_data)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 89500 100000 100000 100000 100000 100000
summary(width(anchorTwo(hic_data)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100000 100000 100000 100000 100000 100000
Some anchors are shorter than 100kb due to the bin being at the end of a chromosome. There are 23276 interactions in total, with a total of 447000 reads supporting them. To calculate the average number of reads per interaction, first use interactionCounts()
to get the interactionCounts of reads supporting each individual interaction.
head(interactionCounts(hic_data))
## [1] 344 373 258 397 213 441
mean(interactionCounts(hic_data))
## [1] 19.204
However, since we have FDRs and p-values, it is probably more informative to use these to find interactions of interest.
plot(density(hic_data$p.value))
plot(density(hic_data$fdr))
The package provides some functions to plot summary statistics of your data that may be of interest, such as the percentage of interactions that are between regions on the same chromosome (cis-interactions) or on different chromosomes (trans-interactions), or the number of reads supporting each interaction. These plots can be used to assess the level of noise in your dataset - the presence of many interactions with high FDRs or low read counts suggests that the data may be noisy and contain a lot of false positive interactions. You can subset the GenomicInteractions object by FDR or by number of reads.
sum(hic_data$fdr < 0.1)
## [1] 8171
hic_data_subset <- hic_data[hic_data$fdr < 0.1]
plotCisTrans(hic_data)
plotCisTrans(hic_data_subset)
plotCounts(hic_data, cut=30)
plotCounts(hic_data_subset, cut=30)
Subsetting by FDR will tend to remove interactions that are supported by fewer reads. Trans interactions tend to have lower read support than cis interactions, so the percentage of trans interactions decreases.
One of the most powerful features of GenomicInteractions is that it allows you to annotate interactions by whether the anchors overlap genomic features of interest, such as promoters or enhancers. However this process can be slow for large datasets, so here we annotate with just promoters.
Genome annotation data can be obtained from, for example, UCSC databases using the GenomicFeatures package. We will use promoters of Refseq genes extended to a width of 5kb. Downloading all the data can be a slow process, so the data for promoters for chromosomes 14 and 15 is provided with this package.
## Not run
library(GenomicFeatures)
mm9.refseq.db <- makeTranscriptDbFromUCSC(genome="mm9", table="refGene")
refseq.genes = genes(mm9.refseq.db)
refseq.transcripts = transcriptsBy(mm9.refseq.db, by="gene")
refseq.transcripts = refseq.transcripts[ names(refseq.transcripts) %in% unlist(refseq.genes$gene_id) ]
mm9_refseq_promoters <- promoters(refseq.transcripts, 2500,2500)
mm9_refseq_promoters <- unlist(mm9_refseq_promoters[seqnames(mm9_refseq_promoters) %in% c("chr14", "chr15")])
annotateInteractions
takes a list of features in GRanges or GRangesList format and annotates the interaction anchors based on overlap with these features. The list of annotation features should have descriptive names, as these names are stored in the annotated GenomicInteractions object and used to assign anchor (node) classes.
data("mm9_refseq_promoters")
annotation.features <- list(promoter = mm9_refseq_promoters)
annotateInteractions(hic_data_subset, annotation.features)
## Annotating with promoter ...
hic_data_subset
## GenomicInteractions object with 8171 interactions and 2 metadata columns:
## Name: HiC 100kb
## Description: HiC 100kb resolution
## Sum of interactions: 284294
## Annotated: yes
## Annotated with: promoter, distal
## Interactions:
## Anchor One Anchor Two Counts
## [1] chr15:97600000..97699999 --- chr15:97500000..97599999 344
## [2] chr15:74800000..74899999 --- chr15:74700000..74799999 373
## [3] chr14:55000000..55099999 --- chr14:54800000..54899999 258
## [4] chr15:80400000..80499999 --- chr15:80300000..80399999 397
## [5] chr14:55100000..55199999 --- chr14:54800000..54899999 213
## ... ... ... ... ...
## [8167] chr14:64400000..64499999 --- chr14:56600000..56699999 12
## [8168] chr15:93200000..93299999 --- chr15:51700000..51799999 11
## [8169] chr15:103400000..103490927 --- chr15:95300000..95399999 12
## [8170] chr15:62600000..62699999 --- chr15:60900000..60999999 24
## [8171] chr14:80200000..80299999 --- chr14:74800000..74899999 13
## | p.value fdr
## [1] | 4.60038639163309e-143 0
## [2] | 4.40907012071168e-127 0
## [3] | 4.51853818556112e-124 0
## [4] | 5.4911744572704e-120 0
## [5] | 1.95389921464777e-118 0
## ... ... ... ...
## [8167] | 0.0000252334011149263 0.09979
## [8168] | 0.0000252645078228975 0.099881
## [8169] | 0.000025265922674951 0.099885
## [8170] | 0.0000252662006016291 0.099885
## [8171] | 0.000025277396008095 0.099916
##
## -------
## seqinfo: 2 sequences from an unspecified genome
In addition, the features themselves should have names or IDs. For GRangesLists these names can be the names()
of the items in the list. GRanges objects should have an “id” metadata column (note lowercase). These names or IDs for each feature are stored in the metadata for the anchors of the GenomicInteractions object. As each anchor may overlap multiple promoters, the “promoter.id” column is stored as a list. Promoter IDs can be accessed from the metadata of the anchors.
anchorOne(hic_data_subset)
## GRanges object with 8171 ranges and 2 metadata columns:
## seqnames ranges strand | node.class
## <Rle> <IRanges> <Rle> | <character>
## [1] chr15 [97600000, 97699999] * | promoter
## [2] chr15 [74800000, 74899999] * | promoter
## [3] chr14 [55000000, 55099999] * | promoter
## [4] chr15 [80400000, 80499999] * | promoter
## [5] chr14 [55100000, 55199999] * | promoter
## ... ... ... ... ... ...
## [8167] chr14 [ 64400000, 64499999] * | promoter
## [8168] chr15 [ 93200000, 93299999] * | promoter
## [8169] chr15 [103400000, 103490927] * | distal
## [8170] chr15 [ 62600000, 62699999] * | distal
## [8171] chr14 [ 80200000, 80299999] * | promoter
## promoter.id
## <list>
## [1] ########
## [2] ########
## [3] ########
## [4] ########
## [5] ########
## ... ...
## [8167] ########
## [8168] ########
## [8169] ########
## [8170] ########
## [8171] ########
## -------
## seqinfo: 2 sequences from an unspecified genome
head(anchorOne(hic_data_subset)$promoter.id)
## [[1]]
## [1] "223864" "67739" "56233"
##
## [[2]]
## [1] "57248" "110454" "17067"
##
## [[3]]
## [1] "20540" "68836" "17387" "65107" "140743"
##
## [[4]]
## [1] "17444" "102466773" "213956"
##
## [[5]]
## [1] "27374" "219072" "16475"
##
## [[6]]
## [1] "20491"
Node classes (or anchor classes) are assigned to each anchor based on overlap with annotation features and the order of those features within the list passed to the annotation function. If the list is list(promoter=..., transcript=...)
then an anchor which overlaps both a promoter and a transcript will be given the node class “promoter”. The features earlier in the list take priority. Any anchors which are not annotated with any of the given features will be assigned the class “distal”. In this case we have only annotated with one type of feature, so all anchors are either “promoter” or “distal”.
As the anchors are large, most of them overlap at least one promoter.
table(anchorOne(hic_data_subset)$node.class)
##
## distal promoter
## 1944 6227
table(anchorTwo(hic_data_subset)$node.class)
##
## distal promoter
## 2042 6129
Interaction types are determined by the classes of the interacting nodes. As we only have two node classes, we have three possible interaction classes, summarised in the plot below. Most of the interactions are between promoters. We can subset the data to look at interaction types that are of particular interest.
plotInteractionAnnotations(hic_data_subset)
Distal regions interacting with a promoter may contain distal regulatory elements such as enhancers or insulators. To get all promoter–distal interactions:
length(hic_data_subset[isInteractionType(hic_data_subset, "promoter", "distal")])
## [1] 1794
As this is a common type of interaction of interest, there is a function specifically for identifying these interactions (see the reference manual or help(isInteractionType)
for additional built in interaction types). isInteractionType
can be used with any pair of node classes. There are also functions for identifying cis or trans interactions.
length(hic_data_subset[is.pd(hic_data_subset)])
## [1] 1794
sum(is.trans(hic_data_subset))
## [1] 6
To find the strongest promoter–distal interaction:
hic_data_pd <- hic_data_subset[is.pd(hic_data_subset)]
max(interactionCounts(hic_data_pd))
## [1] 385
most_counts <- hic_data_pd[which.max(interactionCounts(hic_data_pd))]
most_counts
## GenomicInteractions object with 1 interaction and 2 metadata columns:
## Name: HiC 100kb
## Description: HiC 100kb resolution
## Sum of interactions: 385
## Annotated: yes
## Annotated with: promoter, distal
## Interactions:
## Anchor One Anchor Two Counts |
## [1] chr15:59400000..59499999 --- chr15:59300000..59399999 385 |
## p.value fdr
## [1] 1.22660624765005e-56 0
##
## -------
## seqinfo: 2 sequences from an unspecified genome
Or the most significant promoter–distal interaction:
min(hic_data_pd$p.value)
## [1] 9.9935e-102
min_pval <- hic_data_pd[which.min(hic_data_pd$p.value)]
min_pval
## GenomicInteractions object with 1 interaction and 2 metadata columns:
## Name: HiC 100kb
## Description: HiC 100kb resolution
## Sum of interactions: 250
## Annotated: yes
## Annotated with: promoter, distal
## Interactions:
## Anchor One Anchor Two Counts |
## [1] chr15:59100000..59199999 --- chr15:58800000..58899999 250 |
## p.value fdr
## [1] 9.99353601404246e-102 0
##
## -------
## seqinfo: 2 sequences from an unspecified genome
The distance between these interacting regions, or any interacting regions, can be found using calculateDistances
. For trans interactions the distance will be NA.
calculateDistances(most_counts, method="midpoint")
## [1] 99999
calculateDistances(min_pval,method="midpoint")
## [1] 299999
summary(calculateDistances(hic_data_subset,method="midpoint"))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 100000 1100000 6200000 15300000 22700000 113000000 6
The interactions can be exported to BED12 format for viewing in a genome browser. Anchors are visualised as thick blocks connected by thinner interactions.
## Not run
export.bed12(hic_data_subset, fn="hic_data_FDR0.1.bed", drop.trans = TRUE)
Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589.
Seitan, V. C. et al. Cohesin-based chromatin interactions enable regulated gene expression within pre-existing architectural compartments. Genome Res. 23, 2066-77 (2013).