diffloop: Identifying differential DNA loops from ChIA-PET data

Preprocessing

The raw FASTQ read files were preprocessed with the dnaloop (https://github.com/aryeelab/dnaloop) package, a PyPi package that relies on samtools, bedtools, cutadapt, and MACS2 to align reads, call anchor peaks, and summarize PETs per sample in the ChIA-PET experiment.

To use the loopsMake function from a different preprocessing step, have files X.loop_counts.bedpe,Y.loop_counts.bedpe, Z.loop_counts.bedpe in bed_dir for samples = (X,Y,Z) where the first lines should resemble:

1 10002272 10004045 10 120968807 120969483 . 1
1 10002272 10004045 10 99551498 99552470 . 1
1 10002272 10004045 1 10002272 10004045 . 17

where the first three columns specify the position (chr:start:stop) of the first anchor, the second three columns specify the position (chr:start:stop) of the secnd anchor, the 7th column is “.” and the 8th column is the number of paired-end reads support that particular PET.

Loading data

We read in data from the preprocessing pipeline using the loopsMake function. The example below uses sample data included in the diffloopdata package. It contains interactions involving chromosome 1 from a set of six Cohesin ChIA-PET samples.\(^{1,2}\) You would normally set bed_dir to your real data directory.

bed_dir <- system.file("extdata", "esc_jurkat", package="diffloopdata")
bed_dir

## [1] "/home/biocbuild/bbs-3.3-bioc/R/library/diffloopdata/extdata/esc_jurkat"

samples <- c("naive_esc_1", "naive_esc_2", "primed_esc_1", "primed_esc_2",
            "jurkat_1", "jurkat_2")

Here, we use 1500 bp as a value to merge nearby anchors into a single peak. In the loopsMake function, the default is zero and should be parameterized based on the resolution of the ChIA-PET Experiment.

full <- loopsMake(bed_dir, samples, 1500)
celltypes <- c("naive", "naive", "primed", "primed", "jurkat", "jurkat")
full <- updateLDGroups(full, celltypes)
head(full, 3)

## An object of class "loops"
## Slot "anchors":
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames           ranges strand
##          <Rle>        <IRanges>  <Rle>
##   [1]        1 [713970, 714336]      *
##   [2]        1 [804939, 805810]      *
##   [3]        1 [839766, 841153]      *
##   -------
##   seqinfo: 51 sequences from an unspecified genome; no seqlengths
## 
## Slot "interactions":
##      left right
## [1,]    1     1
## [2,]    1     2
## [3,]    1     3
## 
## Slot "counts":
##      naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## [1,]           2           2            8            9        2        6
## [2,]           0           1            1            0        0        0
## [3,]           0           1            0            0        0        0
## 
## Slot "colData":
##              sizeFactor groups
## naive_esc_1           1  naive
## naive_esc_2           1  naive
## primed_esc_1          1 primed
## primed_esc_2          1 primed
## jurkat_1              1 jurkat
## jurkat_2              1 jurkat
## 
## Slot "rowData":
##   loopWidth
## 1         0
## 2     90604
## 3    125431

Alternatively, one could run full <- loopsMake(bed_dir) and the samples would be inferred from the directory contents. The default call to loopsMake does not merge nearby anchors but instead keeps them distinct. If we want to group nearby anchors after the initial data read, one can run the mergeAnchors command with the appropriate gap value.

We’ll now look at a PCA plot of the raw data:

pcaPlot(full) + geom_text(aes(label=samples))

From this initial PCA plot, it’s not clear how the samples separate based on their topologies, suggesting some quality control may be warranted.

QC

First, we normalize the loops object, which updates the sizeFactor of each sample by the number of reads identical to the normalization technique employed in DESeq2. We recommend computing the size factors on the original loops object to generate the best normalization factors.

full <- calcLDSizeFactors(full)
head(full,3)

## An object of class "loops"
## Slot "anchors":
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames           ranges strand
##          <Rle>        <IRanges>  <Rle>
##   [1]        1 [713970, 714336]      *
##   [2]        1 [804939, 805810]      *
##   [3]        1 [839766, 841153]      *
##   -------
##   seqinfo: 51 sequences from an unspecified genome; no seqlengths
## 
## Slot "interactions":
##      left right
## [1,]    1     1
## [2,]    1     2
## [3,]    1     3
## 
## Slot "counts":
##      naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## [1,]           2           2            8            9        2        6
## [2,]           0           1            1            0        0        0
## [3,]           0           1            0            0        0        0
## 
## Slot "colData":
##              sizeFactor groups
## naive_esc_1   0.9164864  naive
## naive_esc_2   1.3742819  naive
## primed_esc_1  0.8946324 primed
## primed_esc_2  1.3785983 primed
## jurkat_1      0.5986950 jurkat
## jurkat_2      1.1775918 jurkat
## 
## Slot "rowData":
##   loopWidth
## 1         0
## 2     90604
## 3    125431

Looking at the full dataset, we can see that a majority of the counts were within self-loops:

loopMetrics(full)

##        naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## self        131766      195502       120366       189452    89371   192755
## unique      139058      196861       202804       249808    78828   167673

We define “valid” looping in a combined experiment as those supported by at least 3 replicates in 2 or more samples. We can generate valid loops by using the filterLoops command

full_5kb <- filterLoops(full, width=5000, nreplicates=3, nsamples = 1)
dim(full_5kb)

##   anchors interactions samples colData rowData
## 1    3321         3457       6       2       1

valid_full <- filterLoops(full, width=5000, nreplicates=3, nsamples = 2)
dim(valid_full)

##   anchors interactions samples colData rowData
## 1    1508         1223       6       2       1

We can also see the proportion of loops that occur within the same chromosome (intrachromosomal) or have loops that bridge chromosomes (interchromosomal).

dim(intrachromosomal(valid_full))

##   anchors interactions samples colData rowData
## 1    1506         1221       6       2       1

dim(interchromosomal(valid_full))

##   anchors interactions samples colData rowData
## 1       3            2       6       2       1

Finally, we can determine the number of anchors supported by each sample as well as the loops with different thresholds of counts.

numLoops(valid_full,1:5)

##   nloops naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1
## 1      1         641         948          813         1043      422
## 2      2         443         777          605          919      260
## 3      3         249         519          380          641      138
## 4      4         145         353          239          470       79
## 5      5          89         261          175          335       53
##   jurkat_2
## 1      782
## 2      663
## 3      479
## 4      377
## 5      294

numAnchors(valid_full)

##   naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## 1         546         222          332          143      896      423

Post-Normalization

pcaPlot(valid_full) + geom_text(aes(label=samples)) +
    expand_limits(x = c(-20, 34))

PCA plot of normalized ChIA-PET loop data Here, the PC plot is much better as the first principal component separates the stem cells and the jurkat samples while the second PC separates the two stem cell samples.

Differential Looping

Here, we describe the most general association method first fitting the model then doing pairwise comparisons by our choice of parameterizing the loopTest function.

# Fit edgeR Models Pairwise between cell types
fit <- loopFit(valid_full)

## The coefficients of the fitted GLM object are:
## (Intercept) groupsnaive groupsprimed

jn_res <- loopTest(fit,coef=2)
jp_res <- loopTest(fit,coef=3)
np_res <- loopTest(fit,contrast=c(0,-1,1))

head(np_res)

## An object of class "loops"
## Slot "anchors":
## GRanges object with 6 ranges and 0 metadata columns:
##       seqnames             ranges strand
##          <Rle>          <IRanges>  <Rle>
##   [1]        1 [ 919114,  920102]      *
##   [2]        1 [ 967614,  969793]      *
##   [3]        1 [ 999024,  999947]      *
##   [4]        1 [1283357, 1285466]      *
##   [5]        1 [1306905, 1308004]      *
##   [6]        1 [1875093, 1876136]      *
##   -------
##   seqinfo: 51 sequences from an unspecified genome; no seqlengths
## 
## Slot "interactions":
##      left right
## [1,]    1     3
## [2,]    2     3
## [3,]    4     5
## [4,]    6     8
## [5,]    7     8
## [6,]    9    10
## 
## Slot "counts":
##      naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## [1,]           0           1            1            3        4        5
## [2,]           0           2            3            2        1        3
## [3,]           4           3            0            3        1        1
## [4,]           1           3            3            3        0        4
## [5,]          16          14            4            7        3        6
## [6,]           3           4            3           14        3       13
## 
## Slot "colData":
##              sizeFactor groups
## naive_esc_1   0.9164864  naive
## naive_esc_2   1.3742819  naive
## primed_esc_1  0.8946324 primed
## primed_esc_2  1.3785983 primed
## jurkat_1      0.5986950 jurkat
## jurkat_2      1.1775918 jurkat
## 
## Slot "rowData":
##   loopWidth      logFC    logCPM         F     PValue       FDR
## 1     78923  1.5469317 10.161615 1.1079860 0.29258948 0.8458779
## 2     29232  1.0051558  9.978743 0.5979946 0.43939401 0.9124109
## 3     21440 -1.3806368 10.044638 1.7261493 0.18898555 0.7721481
## 4     99664  0.3338366 10.150988 0.1070243 0.74357632 0.9833146
## 5     83277 -1.7004822 11.427281 8.5938853 0.00339378 0.2755610
## 6    185283  0.9849968 11.139303 1.9651851 0.16104446 0.7721481

We could also have computed pairwise associations by subsetting the data and using quickAssoc, which only allows loops objects with two group classes.

np <- valid_full[,1:4]
summary(np[1:3,])

##   chr_1 start_1   end_1 chr_2 start_2   end_2 naive_esc_1 naive_esc_2
## 1     1  919114  920102     1  999024  999947           0           1
## 2     1  967614  969793     1  999024  999947           0           2
## 3     1 1283357 1285466     1 1306905 1308004           4           3
##   primed_esc_1 primed_esc_2 loopWidth
## 1            1            3     78923
## 2            3            2     29232
## 3            0            3     21440

np_res2 <- quickAssoc(np)

Instead of determining the loops that distinguish these conditions from each other pairwise, this framework allows us to determine the loops that are significantly different between all of these conditions:

jnp_res <- loopTest(fit,coef=2:3)
head(jnp_res)

## An object of class "loops"
## Slot "anchors":
## GRanges object with 6 ranges and 0 metadata columns:
##       seqnames             ranges strand
##          <Rle>          <IRanges>  <Rle>
##   [1]        1 [ 919114,  920102]      *
##   [2]        1 [ 967614,  969793]      *
##   [3]        1 [ 999024,  999947]      *
##   [4]        1 [1283357, 1285466]      *
##   [5]        1 [1306905, 1308004]      *
##   [6]        1 [1875093, 1876136]      *
##   -------
##   seqinfo: 51 sequences from an unspecified genome; no seqlengths
## 
## Slot "interactions":
##      left right
## [1,]    1     3
## [2,]    2     3
## [3,]    4     5
## [4,]    6     8
## [5,]    7     8
## [6,]    9    10
## 
## Slot "counts":
##      naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## [1,]           0           1            1            3        4        5
## [2,]           0           2            3            2        1        3
## [3,]           4           3            0            3        1        1
## [4,]           1           3            3            3        0        4
## [5,]          16          14            4            7        3        6
## [6,]           3           4            3           14        3       13
## 
## Slot "colData":
##              sizeFactor groups
## naive_esc_1   0.9164864  naive
## naive_esc_2   1.3742819  naive
## primed_esc_1  0.8946324 primed
## primed_esc_2  1.3785983 primed
## jurkat_1      0.5986950 jurkat
## jurkat_2      1.1775918 jurkat
## 
## Slot "rowData":
##   loopWidth logFC.groupsnaive logFC.groupsprimed    logCPM          F
## 1     78923        -3.0513552         -1.5044234 10.161615 3.29854703
## 2     29232        -1.0658210         -0.0606652  9.978743 0.38590748
## 3     21440         1.5542538          0.1736170 10.044638 1.23257697
## 4     99664        -0.1392212          0.1946153 10.150988 0.05492481
## 5     83277         1.5988669         -0.1016154 11.427281 5.57025931
## 6    185283        -1.2822134         -0.2972166 11.139303 1.72873888
##        PValue        FDR
## 1 0.037046364 0.21887779
## 2 0.679861009 0.82161069
## 3 0.291661009 0.53700916
## 4 0.946557097 0.96844419
## 5 0.003841779 0.06581858
## 6 0.177652683 0.43629051

Differential Looping Regions

What regions are significantly different between jurkat and naive? We can answer this by doing a sliding window test by specifying the window size and the step size in the slidingWindowTest function.

sw_jn <- slidingWindowTest(jn_res,200000,200000)
head(sw_jn)

##     chr     start      stop n         FDR       PValue
## 443   1  89519114  89719114 2 0.001010453 2.252961e-06
## 444   1  89719114  89919114 3 0.001010453 3.379441e-06
## 843   1 169519114 169719114 5 0.003292691 1.651852e-05
## 783   1 157519114 157719114 2 0.003365320 2.251050e-05
## 155   1  31919114  32119114 6 0.005379398 4.592769e-05
## 731   1 147119114 147319114 2 0.005379398 5.397389e-05

What gene bodies have significantly different looping between the cell lines? Whereas the slidingWindowTest makes no prior assumptions about the regions to be tested, featureTest requires a GRanges object of coordinates to determine what regions will be tested for association.

chr1 <- getHumanGenes(c("1"))
ft_jnp <- featureTest(jnp_res,chr1)
head(ft_jnp,10)

##      chr     start      stop n      feature          FDR       PValue
## 1533   1 169514166 169586588 4           F5 6.611069e-07 1.969778e-09
## 1535   1 169662007 169854080 5     C1orf112 6.611069e-07 2.462223e-09
## 863    1  89524836  89597864 2       LRRC8B 1.157464e-06 6.466280e-09
## 865    1  89633140  89933250 5 RP11-302M6.4 1.736196e-06 1.616570e-08
## 866    1  89821014  89936611 5       LRRC8D 1.736196e-06 1.616570e-08
## 1378   1 157513377 157552520 2        FCRL5 5.504059e-06 6.149787e-08
## 1108   1 147175351 147243050 1         FMO5 2.686776e-04 3.502316e-06
## 432    1  31906421  31944856 1       PTP4A2 9.383969e-04 1.572732e-05
## 1617   1 183248237 183418602 2       NMNAT2 9.383969e-04 1.556837e-05
## 1512   1 167094045 167129165 1       DUSP27 1.951215e-03 3.633548e-05

Quantifying Associations

How many differential loops are present pairwise between these conditions?

np_sig_res <- topLoops(np_res, FDR = 0.2)
dim(np_sig_res)

##   anchors interactions samples colData rowData
## 1      14            7       6       2       6

dim(topLoops(jp_res, FDR = 0.2))

##   anchors interactions samples colData rowData
## 1     295          169       6       2       6

dim(topLoops(jn_res, FDR = 0.2))

##   anchors interactions samples colData rowData
## 1     373          208       6       2       6

summary(np_sig_res)

##   chr_1   start_1     end_1 chr_2   start_2     end_2 naive_esc_1
## 1     1  15388841  15391302     1  15426303  15427898          10
## 2     1 113423172 113424209     1 113466536 113468099           5
## 3     1 120428104 120429676     1 120510345 120511491           4
## 4     1 161039764 161042346     1 161066522 161068559          12
## 5     1 183250284 183251529     1 183287447 183288171           6
## 6     1 201528472 201529494     1 201761758 201764758           1
## 7     1 202070278 202071285     1 202087247 202088661           5
##   naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2 loopWidth
## 1           8            1            1        0        2     35002
## 2           6            0            0        0        0     42328
## 3           7            0            0        0        1     80670
## 4          24            4            7        2        6     24177
## 5          10            0            1        0        0     35919
## 6           0            8            8        2        6    232265
## 7           8            0            0        0        0     15963
##       logFC    logCPM        F       PValue       FDR
## 1 -3.247621 10.556695 13.85304 0.0002006411 0.0817947
## 2 -5.530195  9.986114 11.69001 0.0006352147 0.1449510
## 3 -5.527957 10.043625 12.42061 0.0004298078 0.1314137
## 4 -1.911138 11.519839 11.47942 0.0007111252 0.1449510
## 5 -3.896584 10.316748 14.25452 0.0001622045 0.0817947
## 6  3.476419 10.662947 10.91372 0.0009637006 0.1683723
## 7 -5.764451 10.104400 14.98909 0.0001100176 0.0817947

At FDR < 0.2, only 7 loops are significantly different from the naive and primed stem cells whereas the jurkat differs between the primed stem cells by 169 significant loops and the naive by 208 significant loops on chromosome 1. These results are in line with the underlying biology of these cell types. Now we can visualize a PC plot using only loops that are significantly different between celltypes

pcaPlot(topLoops(jnp_res, FDR = 0.05)) + geom_text(aes(label=samples))

PCA plot of differential loops Here, we see a much tighter cluster of the cell types, which may be expected since the only variables introduced into the principal components are the loops that are statistically different between the cell types.