Contents

In this guide, we illustrate here two common downstream analysis workflows for ChIP-seq experiments, one is for comparing and combining peaks for single transcription factor (TF) with replicates, and the other is for comparing binding profiles from ChIP-seq experiments with multiple TFs.

1 Workflow for ChIP-seq experiments of single transcription factor with replicates

This workflow shows how to convert BED/GFF files to GRanges, find overlapping peaks between two peak sets, and visualize the number of common and specific peaks with Venn diagram.

1.1 Import data and obtain overlapping peaks from replicates

The input for ChIPpeakAnno1 is a list of called peaks identified from ChIP-seq experiments or any other experiments that yield a set of chromosome coordinates. Although peaks are represented as GRanges in ChIPpeakAnno, other common peak formats such as BED, GFF and MACS can be converted to GRanges easily using a conversion toGRanges method. For detailed information on how to use this method, please type ?toGRanges.

The following examples illustrate the usage of this method to convert BED and GFF file to GRanges, add metadata from orignal peaks to the overlap GRanges using function addMetadata, and visualize the overlapping using function makeVennDiagram.

library(ChIPpeakAnno)
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE) 
## one can also try import from rtracklayer
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
## must keep the class exactly same as gr1$score, i.e., numeric.
gr2$score <- as.numeric(gr2$score) 
ol <- findOverlapsOfPeaks(gr1, gr2)
## add metadata (mean of score) to the overlapping peaks
ol <- addMetadata(ol, colNames="score", FUN=mean) 
ol$peaklist[["gr1///gr2"]][1:2]
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames        ranges strand |                           peakNames
##          <Rle>     <IRanges>  <Rle> |                     <CharacterList>
##   [1]     chr1 713791-715578      * | gr1__MACS_peak_13,gr2__001,gr2__002
##   [2]     chr1 724851-727191      * |          gr2__003,gr1__MACS_peak_14
##                  score
##              <numeric>
##   [1] 850.203333333333
##   [2]            29.17
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
makeVennDiagram(ol, fill=c("#009E73", "#F0E442"), # circle fill color
                col=c("#D55E00", "#0072B2"), #circle border color
                cat.col=c("#D55E00", "#0072B2")) # label color, keep same as circle border color
Figure 1. Venn diagram of overlaps for replicated experiments

Figure 1: Figure 1
Venn diagram of overlaps for replicated experiments

## $p.value
##      gr1 gr2 pval
## [1,]   1   1    0
## 
## $vennCounts
##      gr1 gr2 Counts count.gr1 count.gr2
## [1,]   0   0      0         0         0
## [2,]   0   1     61         0        61
## [3,]   1   0     62        62         0
## [4,]   1   1    166       168       169
## attr(,"class")
## [1] "VennCounts"

1.2 Prepare annotation data

Annotation data should be an object of GRanges. Same as import peaks, we use the method toGRanges, which can return an object of GRanges, to represent the annotation data. An annotation data be constructed from not only BED, GFF or user defined readable text files, but also EnsDb or TxDb object, by calling the toGRanges method. Please type ?toGRanges for more information.

library(EnsDb.Hsapiens.v75) ##(hg19)
## create annotation file from EnsDb or TxDb
annoData <- toGRanges(EnsDb.Hsapiens.v75, feature="gene")
annoData[1:2]
## GRanges object with 2 ranges and 1 metadata column:
##                   seqnames      ranges strand |   gene_name
##                      <Rle>   <IRanges>  <Rle> | <character>
##   ENSG00000223972     chr1 11869-14412      + |     DDX11L1
##   ENSG00000227232     chr1 14363-29806      - |      WASH7P
##   -------
##   seqinfo: 273 sequences from GRCh37 genome

1.3 Visualize binding site distribution relative to features

After finding the overlapping peaks, the distribution of the distance of overlapped peaks to the nearest feature such as the transcription start sites (TSS) can be plotted by binOverFeature function. The sample code here plots the distribution of peaks around the TSS.

overlaps <- ol$peaklist[["gr1///gr2"]]
binOverFeature(overlaps, annotationData=annoData,
               radius=5000, nbins=20, FUN=length, errFun=0,
               ylab="count", 
               main="Distribution of aggregated peak numbers around TSS")
Figure 2. Distribution of peaks around transcript start sites.

Figure 2: Figure 2
Distribution of peaks around transcript start sites.

In addition, assignChromosomeRegion can be used to summarize the distribution of peaks over different type of features such as exon, intron, enhancer, proximal promoter, 5’ UTR and 3’ UTR. This distribution can be summarized in peak centric or nucleotide centric view using the function assignChromosomeRegion. Please note that one peak might span multiple type of features, leading to the number of annotated features greater than the total number of input peaks. At the peak centric view, precedence will dictate the annotation order when peaks span multiple type of features.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
aCR<-assignChromosomeRegion(overlaps, nucleotideLevel=FALSE, 
                           precedence=c("Promoters", "immediateDownstream", 
                                         "fiveUTRs", "threeUTRs", 
                                         "Exons", "Introns"), 
                           TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
barplot(aCR$percentage, las=3)
Figure 3. Peak distribution over different genomic features.

Figure 3: Figure 3
Peak distribution over different genomic features.

1.4 Annotate peaks

As shown from the distribution of aggregated peak numbers around TSS and the distribution of peaks in different of chromosome regions, most of the peaks locate around TSS. Therefore, it is reasonable to use annotatePeakInBatch or annoPeaks to annotate the peaks to the promoter regions of Hg19 genes. Promoters can be specified with bindingRegion. For the following example, promoter region is defined as upstream 2000 and downstream 500 from TSS (bindingRegion=c(-2000, 500)).

overlaps.anno <- annotatePeakInBatch(overlaps, 
                                     AnnotationData=annoData, 
                                     output="nearestBiDirectionalPromoters",
                                     bindingRegion=c(-2000, 500))
library(org.Hs.eg.db)
overlaps.anno <- addGeneIDs(overlaps.anno,
                            "org.Hs.eg.db",
                            IDs2Add = "entrez_id")
head(overlaps.anno)
## GRanges object with 6 ranges and 11 metadata columns:
##       seqnames        ranges strand |                           peakNames
##          <Rle>     <IRanges>  <Rle> |                     <CharacterList>
##    X1     chr1 713791-715578      * | gr1__MACS_peak_13,gr2__001,gr2__002
##    X1     chr1 713791-715578      * | gr1__MACS_peak_13,gr2__001,gr2__002
##    X3     chr1 839467-840090      * |          gr1__MACS_peak_16,gr2__004
##    X4     chr1 856361-856999      * |          gr1__MACS_peak_17,gr2__005
##    X5     chr1 859315-860144      * |          gr2__006,gr1__MACS_peak_18
##   X10     chr1 901118-902861      * |          gr2__012,gr1__MACS_peak_23
##                  score        peak         feature feature.ranges
##              <numeric> <character>     <character>      <IRanges>
##    X1 850.203333333333          X1 ENSG00000228327  700237-714006
##    X1 850.203333333333          X1 ENSG00000237491  714150-745440
##    X3            73.12          X3 ENSG00000272438  840214-851356
##    X4            54.69          X4 ENSG00000223764  852245-856396
##    X5           81.485          X5 ENSG00000187634  860260-879955
##   X10           119.91         X10 ENSG00000187583  901877-911245
##       feature.strand  distance insideFeature distanceToSite     gene_name
##                <Rle> <integer>      <factor>      <integer>   <character>
##    X1              -         0  overlapStart              0 RP11-206L10.2
##    X1              +         0  overlapStart              0 RP11-206L10.9
##    X3              +       123      upstream            123  RP11-54O7.16
##    X4              -         0  overlapStart              0   RP11-54O7.3
##    X5              +       115      upstream            115        SAMD11
##   X10              +         0  overlapStart              0       PLEKHN1
##         entrez_id
##       <character>
##    X1        <NA>
##    X1   105378580
##    X3        <NA>
##    X4   100130417
##    X5      148398
##   X10       84069
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
write.csv(as.data.frame(unname(overlaps.anno)), "anno.csv")

The distribution of the common peaks around features can be visualized using a pie chart.

pie1(table(overlaps.anno$insideFeature))
Figure 4. Pie chart of the distribution of common peaks around features.

Figure 4: Figure 4
Pie chart of the distribution of common peaks around features.

1.5 Obtain enriched GO terms and Pathways

The following example shows how to use getEnrichedGO to obtain a list of enriched GO terms with annotated peaks. For pathway analysis, please use function getEnrichedPATH with reactome or KEGG database. Please note that by default feature_id_type is set as “ensembl_gene_id”. If you are using TxDb as annotation data, please set it to “entrez_id”.

over <- getEnrichedGO(overlaps.anno, orgAnn="org.Hs.eg.db", 
                     maxP=.05, minGOterm=10, 
                     multiAdjMethod="BH", condense=TRUE)
head(over[["bp"]][, -c(3, 10)])
## [1] go.id              go.term            Ontology           pvalue            
## [5] count.InDataset    count.InGenome     totaltermInDataset totaltermInGenome 
## [9] EntrezID          
## <0 rows> (or 0-length row.names)
library(reactome.db)
path <- getEnrichedPATH(overlaps.anno, "org.Hs.eg.db", "reactome.db", maxP=.05)
head(path)
##         path.id EntrezID count.InDataset count.InGenome     pvalue
## 1 R-HSA-1296041     2782               1             25 0.04341304
## 2 R-HSA-1296059     2782               1             25 0.04341304
## 3  R-HSA-196854    55229               2            189 0.04479537
## 4  R-HSA-196854   375790               2            189 0.04479537
## 5 R-HSA-1971475   375790               1             26 0.04511000
## 6  R-HSA-199220    55229               1             17 0.02972888
##   totaltermInDataset totaltermInGenome PATH
## 1                197            111075   NA
## 2                197            111075   NA
## 3                197            111075   NA
## 4                197            111075   NA
## 5                197            111075   NA
## 6                197            111075   NA

1.6 Obtain the sequences surrounding the peaks

Here is an example to get the sequences of the peaks plus 20 bp upstream and downstream for PCR validation or motif discovery.

library(BSgenome.Hsapiens.UCSC.hg19)
seq <- getAllPeakSequence(overlaps, upstream=20, downstream=20, genome=Hsapiens)
write2FASTA(seq, "test.fa")

1.7 Output a summary of consensus in the peaks

Here is an example to get the Z-scores for short oligos3.

## summary of the short oligos
freqs <- oligoFrequency(Hsapiens$chr1, MarkovOrder=3)
os <- oligoSummary(seq, oligoLength=6, MarkovOrder=3, 
                   quickMotif=FALSE, freqs=freqs)
## plot the results
zscore <- sort(os$zscore)
h <- hist(zscore, breaks=100, xlim=c(-50, 50), main="Histogram of Z-score")
text(zscore[length(zscore)], max(h$counts)/10, 
     labels=names(zscore[length(zscore)]), adj=1)
Figure 5. Histogram of Z-score of 6-mer

Figure 5: Figure 5
Histogram of Z-score of 6-mer

## We can also try simulation data
seq.sim.motif <- list(c("t", "g", "c", "a", "t", "g"), 
                      c("g", "c", "a", "t", "g", "c"))
set.seed(1)
seq.sim <- sapply(sample(c(2, 1, 0), 1000, replace=TRUE, prob=c(0.07, 0.1, 0.83)), 
                  function(x){
    s <- sample(c("a", "c", "g", "t"), 
                sample(100:1000, 1), replace=TRUE)
    if(x>0){
        si <- sample.int(length(s), 1)
        if(si>length(s)-6) si <- length(s)-6
        s[si:(si+5)] <- seq.sim.motif[[x]]
    }
    paste(s, collapse="")
})
os <- oligoSummary(seq.sim, oligoLength=6, MarkovOrder=3, 
                   quickMotif=TRUE)
zscore <- sort(os$zscore, decreasing=TRUE)
h <- hist(zscore, breaks=100, main="Histogram of Z-score")
text(zscore[1:2], rep(5, 2), 
     labels=names(zscore[1:2]), adj=0, srt=90)
Figure 6. Histogram of Z-score of simulation data

Figure 6: Figure 6
Histogram of Z-score of simulation data

## generate the motifs
library(motifStack)
pfms <- mapply(function(.ele, id)
    new("pfm", mat=.ele, name=paste("SAMPLE motif", id)), 
    os$motifs, 1:length(os$motifs))
motifStack(pfms[[1]])
Figure 7. Motif of simulation data

(#fig:simulation.motif)Figure 7. Motif of simulation data

1.8 Find peaks with bi-directional promoters

Bidirectional promoters are the DNA regions located between TSS of two adjacent genes that are transcribed on opposite directions and often co-regulated by this shared promoter region5. Here is an example to find peaks near bi-directional promoters.

bdp <- peaksNearBDP(overlaps, annoData, maxgap=5000)
c(bdp$percentPeaksWithBDP, 
  bdp$n.peaks, 
  bdp$n.peaksWithBDP)
## [1]   0.1084337 166.0000000  18.0000000
bdp$peaksWithBDP[1:2]
## GRangesList object of length 2:
## $`1`
## GRanges object with 2 ranges and 11 metadata columns:
##      seqnames        ranges strand |                           peakNames
##         <Rle>     <IRanges>  <Rle> |                     <CharacterList>
##   X1     chr1 713791-715578      * | gr1__MACS_peak_13,gr2__001,gr2__002
##   X1     chr1 713791-715578      * | gr1__MACS_peak_13,gr2__001,gr2__002
##                 score   bdp_idx        peak         feature feature.ranges
##             <numeric> <integer> <character>     <character>      <IRanges>
##   X1 850.203333333333         1          X1 ENSG00000228327  700237-714006
##   X1 850.203333333333         1          X1 ENSG00000237491  714150-745440
##      feature.strand  distance insideFeature distanceToSite     gene_name
##               <Rle> <integer>      <factor>      <integer>   <character>
##   X1              -         0  overlapStart              0 RP11-206L10.2
##   X1              +         0  overlapStart              0 RP11-206L10.9
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $`4`
## GRanges object with 2 ranges and 11 metadata columns:
##      seqnames        ranges strand |                  peakNames     score
##         <Rle>     <IRanges>  <Rle> |            <CharacterList> <numeric>
##   X4     chr1 856361-856999      * | gr1__MACS_peak_17,gr2__005     54.69
##   X4     chr1 856361-856999      * | gr1__MACS_peak_17,gr2__005     54.69
##        bdp_idx        peak         feature feature.ranges feature.strand
##      <integer> <character>     <character>      <IRanges>          <Rle>
##   X4         4          X4 ENSG00000223764  852245-856396              -
##   X4         4          X4 ENSG00000187634  860260-879955              +
##       distance insideFeature distanceToSite   gene_name
##      <integer>      <factor>      <integer> <character>
##   X4         0  overlapStart              0 RP11-54O7.3
##   X4      3260      upstream           3260      SAMD11
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

1.9 Find possible enhancers with DNA interaction data

There are several techniques available to determine the spatial organization of chromosomes at high resolution such as 3C, 5C and HiC6. These techniques make it possible to search peaks binding to the potential enhancer regions. Here is an example to find peaks binding to the potential enhancer regions.

DNA5C <- system.file("extdata", 
                     "wgEncodeUmassDekker5CGm12878PkV2.bed.gz",
                     package="ChIPpeakAnno")
DNAinteractiveData <- toGRanges(gzfile(DNA5C))
findEnhancers(overlaps, annoData, DNAinteractiveData)
## GRanges object with 5 ranges and 14 metadata columns:
##      seqnames              ranges strand |                   peakNames
##         <Rle>           <IRanges>  <Rle> |             <CharacterList>
##   X1     chr1 151591700-151591800      * | gr2__229,gr1__MACS_peak_229
##   X1     chr1 151591700-151591800      * | gr2__229,gr1__MACS_peak_229
##   X1     chr1 151591700-151591800      * | gr2__229,gr1__MACS_peak_229
##   X1     chr1 151591700-151591800      * | gr2__229,gr1__MACS_peak_229
##   X1     chr1 151630186-151630286      * | gr2__230,gr1__MACS_peak_230
##          score         feature      feature.ranges feature.strand
##      <numeric>     <character>           <IRanges>          <Rle>
##   X1    78.675 ENSG00000207606 151518272-151518367              +
##   X1    78.675 ENSG00000143390 151313116-151319833              -
##   X1    78.675 ENSG00000143376 151584541-151671567              +
##   X1    78.675 ENSG00000143367 151512781-151556059              +
##   X1    78.675 ENSG00000143393 151264273-151300191              -
##      feature.shift.ranges feature.shift.strand  distance insideFeature
##                 <IRanges>                <Rle> <integer>      <factor>
##   X1  151594534-151594629                    +      2733      upstream
##   X1  151595209-151601927                    +      3408      upstream
##   X1  151500588-151587615                    -      4084      upstream
##   X1  151595902-151639180                    +      4101      upstream
##   X1  151594247-151630165                    -        20      upstream
##      distanceToSite   gene_name        peak DNAinteractive.ranges
##           <integer> <character> <character>             <IRanges>
##   X1           2733      MIR554          X1   151516086-151603110
##   X1           3408        RFX5          X1   151309062-151603110
##   X1           4084       SNX27          X1   151546428-151636526
##   X1           4101       TUFT1          X1   151546428-151636526
##   X1             20       PI4KB          X1   151283079-151636526
##      DNAinteractive.blocks
##              <IRangesList>
##   X1   1-19082,76263-87025
##   X1 1-13633,283287-294049
##   X1    1-6978,72324-90099
##   X1    1-6978,72324-90099
##   X1  1-5699,335673-353448
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

2 Workflow for comparing binding profiles from multiple transcription factors (TFs)

Given two or more peak lists from different TFs, one may be interested in finding whether DNA binding profile of those TFs are correlated, and if correlated, what is the common binding pattern. The workflow here shows how to test the correlation of binding profiles of three TFs and how to discover the common binding pattern.

2.1 Import data

path <- system.file("extdata", package="ChIPpeakAnno")
files <- dir(path, "broadPeak")
data <- sapply(file.path(path, files), toGRanges, format="broadPeak")
names(data) <- gsub(".broadPeak", "", files)

2.2 Determine if there is a significant overlap among multiple sets of peaks

2.2.1 Hypergeometric test

When we test the association between two sets of data based on hypergeometric distribution, the number of all potential binding sites is required. The parameter totalTest in the function makeVennDiagram indicates how many potential peaks in total will be used in the hypergeometric test. It should be larger than the largest number of peaks in the peak list. The smaller it is set, the more stringent the test is. The time used to calculate p-value does not depend on the value of the totalTest. For practical guidance on how to choose totalTest, please refer to the post. The following example makes an assumption that there are 3% of coding region plus promoter region. Because the sample data is only a subset of chromosome 2, we estimate that the total binding sites is 1/24 of possible binding region in the genome.

ol <- findOverlapsOfPeaks(data, connectedPeaks="keepAll")
averagePeakWidth <- mean(width(unlist(GRangesList(ol$peaklist))))
tot <- ceiling(3.3e+9 * .03 / averagePeakWidth / 24)
makeVennDiagram(ol, totalTest=tot, connectedPeaks="keepAll", 
                fill=c("#CC79A7", "#56B4E9", "#F0E442"), # circle fill color
                col=c("#D55E00", "#0072B2", "#E69F00"), #circle border color
                cat.col=c("#D55E00", "#0072B2", "#E69F00"))