Contents

1 Introduction

Chromatin immunoprecipitation (ChIP) followed by DNA sequencing (ChIP-seq) and ChIP followed by genome tiling array analysis (ChIP-chip) have become prevalent high throughput technologies for identifying the binding sites of DNA-binding proteins genome-wise. A number of algorithms have been published to facilitate the identification of the binding sites of the DNA-binding proteins of interest. The identified binding sites as the list of peaks are usually converted to BED or bigwig files to be loaded to the UCSC genome browser as custom tracks for investigators to view the proximity to various genomic features such as genes, exons or conserved elements. However, clicking through the genome browser is a daunting task when the number of peaks gets large or the peaks spread widely across the genome.

Here we developed ChIPpeakAnno, a Bioconductor1 package, to facilitate the batch annotation of the peaks identified from ChIP-seq or ChIP-chip experiments. We implemented functionality to find the nearest gene, exon, miRNA or other custom features supplied by users such as the most conserved elements and other transcription factor binding sites leveraging GRanges. Since the genome annotation gets updated frequently, we have leveraged the biomaRt package to retrieve the annotation data on the fly. The users also have the flexibility to pass their own annotation data or annotation from GenomicFeatures as GRanges. We have also leveraged BSgenome and biomaRt to retrieve the sequences around the identified peak for peak validation or motif discovery2. To understand whether the identified peaks are enriched around genes with certain GO terms, we have implemented the Gene Ontology (GO) enrichment test in the ChIPpeakAnno package leveraging the hypergeometric test phyper in the stats package and integrated with the GO annotation from the GO.db package and multiplicity adjustment functions from the multtest package3–8. The pathway analysis using reactome or KEGG is also supported. Starting 3.4, we also implement the functions for permutation test to determine whether there is a significant overlap between two sets of peaks. In addition, binding patterns of multiple transcription factors (TFs) or distributions of multiple epigenetic markers around genomic features could be visualized and compared easily using a side-by-side heatmap and density plot.

2 Quick start

library(ChIPpeakAnno)
## import the MACS output
macs <- system.file("extdata", "MACS_peaks.xls", package="ChIPpeakAnno")
macsOutput <- toGRanges(macs, format="MACS")
## annotate the peaks with precompiled ensembl annotation
data(TSS.human.GRCh38)
macs.anno <- annotatePeakInBatch(macsOutput, AnnotationData=TSS.human.GRCh38)
## add gene symbols
library(org.Hs.eg.db)
macs.anno <- addGeneIDs(annotatedPeak=macs.anno, 
                        orgAnn="org.Hs.eg.db", 
                        IDs2Add="symbol")

if(interactive()){## annotate the peaks with UCSC annotation
    library(GenomicFeatures)
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    ucsc.hg38.knownGene <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
    macs.anno <- annotatePeakInBatch(macsOutput, 
                                     AnnotationData=ucsc.hg38.knownGene)
    macs.anno <- addGeneIDs(annotatedPeak=macs.anno, 
                            orgAnn="org.Hs.eg.db", 
                            feature_id_type="entrez_id",
                            IDs2Add="symbol")
}

3 An example of ChIP-seq analysis workflow using ChIPpeakAnno

We illustrate here a common downstream analysis workflow for ChIP-seq experiments. The input of ChIPpeakAnno is a list of called peaks identified from ChIP-seq experiments. The peaks are represented by GRanges in ChIPpeakAnno. We implemented a conversion functions toGRanges to convert commonly used peak file formats, such as BED, GFF, or other user defined formats such as MACS (a popular peak calling program) output file to GRanges. Please type ?toGRanges for more information.

The workflow here exemplifies converting the BED and GFF files to GRanges, finding the overlapping peaks between the two peak sets, and visualizing the number of common and specific peaks with Venn diagram.

bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE) 
## one can also try import from rtracklayer
library(rtracklayer)
gr1.import <- import(bed, format="BED")
identical(start(gr1), start(gr1.import))
## [1] TRUE
gr1[1:2]
## GRanges object with 2 ranges and 1 metadata column:
##               seqnames      ranges strand |     score
##                  <Rle>   <IRanges>  <Rle> | <numeric>
##   MACS_peak_1     chr1 28341-29610      * |    160.81
##   MACS_peak_2     chr1 90821-91234      * |    133.12
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr1.import[1:2] #note the name slot is different from gr1
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames      ranges strand |        name     score
##          <Rle>   <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr1 28341-29610      * | MACS_peak_1    160.81
##   [2]     chr1 90821-91234      * | MACS_peak_2    133.12
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
makeVennDiagram(ol,
                fill=c("#009E73", "#F0E442"), # circle fill color
                col=c("#D55E00", "#0072B2"), #circle border color
                cat.col=c("#D55E00", "#0072B2"))
venn diagram of overlaps for duplicated experiments

Figure 1: venn diagram of overlaps for duplicated experiments

## $p.value
##      gr1 gr2 pval
## [1,]   1   1    0
## 
## $vennCounts
##      gr1 gr2 Counts count.gr1 count.gr2
## [1,]   0   0      0         0         0
## [2,]   0   1     61         0        61
## [3,]   1   0     62        62         0
## [4,]   1   1    166       168       169
## attr(,"class")
## [1] "VennCounts"

A pie chart is used to demonstrate the overlap features of the common peaks.

pie1(table(ol$overlappingPeaks[["gr1///gr2"]]$overlapFeature))
Pie chart of common peaks among features

Figure 2: Pie chart of common peaks among features

After finding the overlapping peaks, you can use annotatePeakInBatch to annotate the overlapping peaks with the genomic features in the AnnotationData within certain distance away specified by maxgap, which is 5kb in the following example.

overlaps <- ol$peaklist[["gr1///gr2"]]
## ============== old style ===========
## data(TSS.human.GRCh37) 
## overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData, 
##                                      output="overlapping", maxgap=5000L)
## overlaps.anno <- addGeneIDs(overlaps.anno, "org.Hs.eg.db", "symbol")
## ============== new style ===========
library(EnsDb.Hsapiens.v75) ##(hg19)
## create annotation file from EnsDb or TxDb
annoData <- toGRanges(EnsDb.Hsapiens.v75, feature="gene")
annoData[1:2]
## GRanges object with 2 ranges and 1 metadata column:
##                   seqnames      ranges strand |   gene_name
##                      <Rle>   <IRanges>  <Rle> | <character>
##   ENSG00000223972     chr1 11869-14412      + |     DDX11L1
##   ENSG00000227232     chr1 14363-29806      - |      WASH7P
##   -------
##   seqinfo: 273 sequences from GRCh37 genome
overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData, 
                                    output="overlapping", maxgap=5000L)
overlaps.anno$gene_name <- 
    annoData$gene_name[match(overlaps.anno$feature,
                             names(annoData))]
head(overlaps.anno)
## GRanges object with 6 ranges and 11 metadata columns:
##                        seqnames        ranges strand |
##                           <Rle>     <IRanges>  <Rle> |
##   X001.ENSG00000228327     chr1 713791-715578      * |
##   X001.ENSG00000237491     chr1 713791-715578      * |
##   X002.ENSG00000237491     chr1 724851-727191      * |
##   X003.ENSG00000272438     chr1 839467-840090      * |
##   X004.ENSG00000223764     chr1 856361-856999      * |
##   X004.ENSG00000187634     chr1 856361-856999      * |
##                                                  peakNames        peak
##                                            <CharacterList> <character>
##   X001.ENSG00000228327 gr1__MACS_peak_13,gr2__001,gr2__002         001
##   X001.ENSG00000237491 gr1__MACS_peak_13,gr2__001,gr2__002         001
##   X002.ENSG00000237491          gr2__003,gr1__MACS_peak_14         002
##   X003.ENSG00000272438          gr1__MACS_peak_16,gr2__004         003
##   X004.ENSG00000223764          gr1__MACS_peak_17,gr2__005         004
##   X004.ENSG00000187634          gr1__MACS_peak_17,gr2__005         004
##                                feature start_position end_position
##                            <character>      <integer>    <integer>
##   X001.ENSG00000228327 ENSG00000228327         700237       714006
##   X001.ENSG00000237491 ENSG00000237491         714150       745440
##   X002.ENSG00000237491 ENSG00000237491         714150       745440
##   X003.ENSG00000272438 ENSG00000272438         840214       851356
##   X004.ENSG00000223764 ENSG00000223764         852245       856396
##   X004.ENSG00000187634 ENSG00000187634         860260       879955
##                        feature_strand insideFeature distancetoFeature
##                           <character>      <factor>         <numeric>
##   X001.ENSG00000228327              -  overlapStart               215
##   X001.ENSG00000237491              +  overlapStart              -359
##   X002.ENSG00000237491              +        inside             10701
##   X003.ENSG00000272438              +      upstream              -747
##   X004.ENSG00000223764              -  overlapStart                35
##   X004.ENSG00000187634              +      upstream             -3899
##                        shortestDistance fromOverlappingOrNearest     gene_name
##                               <integer>              <character>   <character>
##   X001.ENSG00000228327              215              Overlapping RP11-206L10.2
##   X001.ENSG00000237491              359              Overlapping RP11-206L10.9
##   X002.ENSG00000237491            10701              Overlapping RP11-206L10.9
##   X003.ENSG00000272438              124              Overlapping  RP11-54O7.16
##   X004.ENSG00000223764               35              Overlapping   RP11-54O7.3
##   X004.ENSG00000187634             3261              Overlapping        SAMD11
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Once the peaks are annotated, the distribution of the distance to the nearest feature such as the transcription start sites (TSS) can be plotted. The sample code here plots the distribution of the aggregated peak scores and the number of peaks around the TSS.

gr1.copy <- gr1
gr1.copy$score <- 1
binOverFeature(gr1, gr1.copy, annotationData=annoData,
               radius=5000, nbins=10, FUN=c(sum, length),
               ylab=c("score", "count"), 
               main=c("Distribution of aggregated peak scores around TSS", 
                      "Distribution of aggregated peak numbers around TSS"))
Distribution of aggregated peak scores or peak numbers around transcript start sites.

Figure 3: Distribution of aggregated peak scores or peak numbers around transcript start sites

The distribution of the peaks over exon, intron, enhancer, proximal promoter, 5’ UTR and 3’ UTR can be summarized in peak centric or nucleotide centric view using the function assignChromosomeRegion. Please note that setting nucleotideLevel = TRUE will give a nucleotide level distribution over different features.

if(require(TxDb.Hsapiens.UCSC.hg19.knownGene)){
    aCR<-assignChromosomeRegion(gr1, nucleotideLevel=FALSE, 
                           precedence=c("Promoters", "immediateDownstream", 
                                         "fiveUTRs", "threeUTRs", 
                                         "Exons", "Introns"), 
                           TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
    barplot(aCR$percentage)
}
Peak distribution over different genomic features.

Figure 4: Peak distribution over different genomic features

4 Detailed Use Cases and Scenarios

Here we describe some details in using different functions in ChIPpeakAnno for different tasks. As shown in the last section, the common workflow includes: loading called peaks from BED, GFF, or other formats; evaluating and visualizing the concordance among the biological replicates; combining peaks from replicates; preparing genomic annotation(s) as GRanges; associating/annotating peaks with the annotation(s); summarizing peak distributions over exon, intron, enhancer, proximal promoter, 5’UTR and 3’UTR regions; retrieving the sequences around the peaks; and enrichment analysis of GO and biological pathway. We also implemented the functions to plot the heatmap of given peak ranges, and perform permutation test to determine if there is a significant overlap between two sets of peaks.

4.1 Determine the overlapping peaks and visualize the overlaps with Venn diagram

Prior to associating features of interest with the peaks, it is a common practice to evaluate the concordance among the peaks from biological replicates and combine the peaks from biological replicates. Also, it is biologically interesting to obtain overlapping peaks from different ChIP-seq experiments to imply the potential formation of transcription factor complexes. ChIPpeakAnno implemented functions to achieve those goals and quantitatively determine the significance of peak overlaps and generate a Venn diagram for visualization.

Here is the sample code to obtain the overlapping peaks with maximum gap of 1kb for two peak ranges.

peaks1 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", 
                              "2", "6", "6", "6", "6", "5"),
                   ranges=IRanges(start=c(967654, 2010897, 2496704, 3075869, 
                                          3123260, 3857501, 201089, 1543200, 
                                          1557200, 1563000, 1569800, 167889600),
                                  end= c(967754, 2010997, 2496804, 3075969, 
                                         3123360, 3857601, 201089, 1555199,
                                         1560599, 1565199, 1573799, 167893599),
                                  names=paste("Site", 1:12, sep="")),
                  strand="+")

peaks2 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", "1", "2", "3", 
                                     "4", "5", "6", "6", "6", "6", "6", "5"),
                          ranges=IRanges(start=c(967659, 2010898, 2496700, 
                                                 3075866, 3123260, 3857500, 
                                                 96765, 201089, 249670, 307586, 
                                                 312326, 385750, 1549800, 
                                                 1554400, 1565000, 1569400,
                                                 167888600), 
                                         end=c(967869, 2011108, 2496920, 
                                               3076166,3123470, 3857780, 
                                               96985, 201299, 249890, 307796, 
                                               312586, 385960, 1550599, 1560799,
                                               1565399, 1571199, 167888999), 
                                         names=paste("t", 1:17, sep="")),
                          strand=c("+", "+", "+", "+", "+", "+", "-", "-", "-", 
                                   "-", "-", "-", "+", "+", "+", "+", "+"))

ol <- findOverlapsOfPeaks(peaks1, peaks2, maxgap=1000)
peaklist <- ol$peaklist

The function findOverlapsOfPeaks returns an object of overlappingPeaks, which contains there elements: venn_cnt, peaklist (a list of
overlapping peaks or unique peaks), and overlappingPeaks (a list of data frame consists of the annotation of all the overlapping peaks).

Within the overlappingPeaks element of the overlappingPeaks object ol (which is also a list), the element “peaks1///peaks2” is a data frame representing the overlapping peaks with maximum gap of 1kb between the two peak lists. Using the overlapFeature column in this data frame, a pie graph can be generated to describe the distribution of the features of the relative positions of peaks1 to peaks2 for the overlapping peaks.

overlappingPeaks <- ol$overlappingPeaks
names(overlappingPeaks)
## [1] "peaks1///peaks2"
dim(overlappingPeaks[["peaks1///peaks2"]])
## [1] 13 14
overlappingPeaks[["peaks1///peaks2"]][1:2, ]
##          peaks1 seqnames   start     end width strand peaks2 seqnames   start
## Site1_t1  Site1        1  967654  967754   101      +     t1        1  967659
## Site2_t2  Site2        2 2010897 2010997   101      +     t2        2 2010898
##              end width strand overlapFeature shortestDistance
## Site1_t1  967869   211      +   overlapStart                5
## Site2_t2 2011108   211      +   overlapStart                1
pie1(table(overlappingPeaks[["peaks1///peaks2"]]$overlapFeature))
Pie chart of common peaks among features.

Figure 5: Pie chart of common peaks among features

The following code returns the merged overlapping peaks from the peaklist object.

peaklist[["peaks1///peaks2"]]
## GRanges object with 11 ranges and 1 metadata column:
##        seqnames              ranges strand |
##           <Rle>           <IRanges>  <Rle> |
##    [1]        1       967654-967869      + |
##    [2]        2       201089-201299      * |
##    [3]        2     2010897-2011108      + |
##    [4]        3     2496700-2496920      + |
##    [5]        4     3075866-3076166      + |
##    [6]        5     3123260-3123470      + |
##    [7]        5 167888600-167893599      + |
##    [8]        6     1543200-1560799      + |
##    [9]        6     1563000-1565399      + |
##   [10]        6     1569400-1573799      + |
##   [11]        6     3857500-3857780      + |
##                                        peakNames
##                                  <CharacterList>
##    [1]                  peaks1__Site1,peaks2__t1
##    [2]                  peaks1__Site7,peaks2__t8
##    [3]                  peaks1__Site2,peaks2__t2
##    [4]                  peaks2__t3,peaks1__Site3
##    [5]                  peaks2__t4,peaks1__Site4
##    [6]                  peaks1__Site5,peaks2__t5
##    [7]                peaks2__t17,peaks1__Site12
##    [8] peaks1__Site8,peaks2__t13,peaks2__t14,...
##    [9]                peaks1__Site10,peaks2__t15
##   [10]                peaks2__t16,peaks1__Site11
##   [11]                  peaks2__t6,peaks1__Site6
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths

The peaks in peaks1 but not overlap with the peaks in peaks2 can be obtained with:

peaklist[["peaks1"]]
## NULL

The peaks in peaks2 but not overlap with the peaks in peaks1 can be obtained with:

peaklist[["peaks2"]]
## GRanges object with 5 ranges and 1 metadata column:
##       seqnames        ranges strand |       peakNames
##          <Rle>     <IRanges>  <Rle> | <CharacterList>
##   [1]        1   96765-96985      - |      peaks2__t7
##   [2]        3 249670-249890      - |      peaks2__t9
##   [3]        4 307586-307796      - |     peaks2__t10
##   [4]        5 312326-312586      - |     peaks2__t11
##   [5]        6 385750-385960      - |     peaks2__t12
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths

Venn diagram can be generated by the function makeVennDiagram using the output of findOverlapsOfPeaks as an input.

The makeVennDiagram also outputs p-values indicating whether the overlapping is significant.

makeVennDiagram(ol, totalTest=1e+2,
                fill=c("#009E73", "#F0E442"), # circle fill color
                col=c("#D55E00", "#0072B2"), #circle border color
                cat.col=c("#D55E00", "#0072B2"))