ChIPpeakAnno 3.12.7
Chromatin immunoprecipitation (ChIP) followed by DNA sequencing (ChIP-seq) and ChIP followed by genome tiling array analysis (ChIP-chip) have become prevalent high throughput technologies for identifying the binding sites of DNA-binding proteins genome-wise. A number of algorithms have been published to facilitate the identification of the binding sites of the DNA-binding proteins of interest. The identified binding sites as the list of peaks are usually converted to BED or bigwig files to be loaded to the UCSC genome browser as custom tracks for investigators to view the proximity to various genomic features such as genes, exons or conserved elements. However, clicking through the genome browser is a daunting task when the number of peaks gets large or the peaks spread widely across the genome.
Here we developed ChIPpeakAnno, a Bioconductor1 package, to facilitate the batch annotation of the peaks identified from ChIP-seq or ChIP-chip experiments. We implemented functionality to find the nearest gene, exon, miRNA or other custom features supplied by users such as the most conserved elements and other transcription factor binding sites leveraging GRanges. Since the genome annotation gets updated frequently, we have leveraged the biomaRt package to retrieve the annotation data on the fly. The users also have the flexibility to pass their own annotation data or annotation from GenomicFeatures as GRanges. We have also leveraged BSgenome and biomaRt to retrieve the sequences around the identified peak for peak validation or motif discovery2. To understand whether the identified peaks are enriched around genes with certain GO terms, we have implemented the Gene Ontology (GO) enrichment test in the ChIPpeakAnno package leveraging the hypergeometric test phyper in the stats package and integrated with the GO annotation from the GO.db package and multiplicity adjustment functions from the multtest package3–8. The pathway analysis using reactome or KEGG is also supported. Starting 3.4, we also implement the functions for permutation test to determine whether there is a significant overlap between two sets of peaks. In addition, binding patterns of multiple transcription factors (TFs) or distributions of multiple epigenetic markers around genomic features could be visualized and compared easily using a side-by-side heatmap and density plot.
library(ChIPpeakAnno)
## import the MACS output
macs <- system.file("extdata", "MACS_peaks.xls", package="ChIPpeakAnno")
macsOutput <- toGRanges(macs, format="MACS")
## annotate the peaks with precompiled ensembl annotation
data(TSS.human.GRCh38)
macs.anno <- annotatePeakInBatch(macsOutput, AnnotationData=TSS.human.GRCh38)
## add gene symbols
library(org.Hs.eg.db)
macs.anno <- addGeneIDs(annotatedPeak=macs.anno,
orgAnn="org.Hs.eg.db",
IDs2Add="symbol")
if(interactive()){## annotate the peaks with UCSC annotation
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
ucsc.hg38.knownGene <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
macs.anno <- annotatePeakInBatch(macsOutput,
AnnotationData=ucsc.hg38.knownGene)
macs.anno <- addGeneIDs(annotatedPeak=macs.anno,
orgAnn="org.Hs.eg.db",
feature_id_type="entrez_id",
IDs2Add="symbol")
}
We illustrate here a common downstream analysis workflow for ChIP-seq
experiments. The input of ChIPpeakAnno is a list of called peaks identified
from ChIP-seq experiments. The peaks are represented by GRanges
in ChIPpeakAnno. We implemented a conversion functions toGRanges
to convert commonly used peak file formats, such as
BED, GFF, or other user defined formats such as MACS (a popular peak calling
program) output file to GRanges. Please type ?toGRanges
for more information.
The workflow here exemplifies converting the BED and GFF files to GRanges, finding the overlapping peaks between the two peak sets, and visualizing the number of common and specific peaks with Venn diagram.
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE)
## one can also try import from rtracklayer
library(rtracklayer)
gr1.import <- import(bed, format="BED")
identical(start(gr1), start(gr1.import))
## [1] TRUE
gr1[1:2]
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## MACS_peak_1 chr1 [28341, 29610] * | 160.81
## MACS_peak_2 chr1 [90821, 91234] * | 133.12
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr1.import[1:2] #note the name slot is different from gr1
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 [28341, 29610] * | MACS_peak_1 160.81
## [2] chr1 [90821, 91234] * | MACS_peak_2 133.12
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
makeVennDiagram(ol)
## $p.value
## gr1 gr2 pval
## [1,] 1 1 0
##
## $vennCounts
## gr1 gr2 Counts
## [1,] 0 0 0
## [2,] 0 1 61
## [3,] 1 0 62
## [4,] 1 1 166
## attr(,"class")
## [1] "VennCounts"
A pie chart is used to demonstrate the overlap features of the common peaks.
pie1(table(ol$overlappingPeaks[["gr1///gr2"]]$overlapFeature))
After finding the overlapping peaks, you can use annotatePeakInBatch
to annotate the overlapping peaks with the genomic features in the AnnotationData within certain distance away specified by maxgap, which is 5kb in the following example.
overlaps <- ol$peaklist[["gr1///gr2"]]
## ============== old style ===========
## data(TSS.human.GRCh37)
## overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData,
## output="overlapping", maxgap=5000L)
## overlaps.anno <- addGeneIDs(overlaps.anno, "org.Hs.eg.db", "symbol")
## ============== new style ===========
library(EnsDb.Hsapiens.v75) ##(hg19)
## create annotation file from EnsDb or TxDb
annoData <- toGRanges(EnsDb.Hsapiens.v75, feature="gene")
annoData[1:2]
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_name
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000223972 chr1 [11869, 14412] + | DDX11L1
## ENSG00000227232 chr1 [14363, 29806] - | WASH7P
## -------
## seqinfo: 273 sequences from GRCh37 genome
overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData,
output="overlapping", maxgap=5000L)
overlaps.anno$gene_name <-
annoData$gene_name[match(overlaps.anno$feature,
names(annoData))]
head(overlaps.anno)
## GRanges object with 6 ranges and 11 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## X001.ENSG00000228327 chr1 [713791, 715578] * |
## X001.ENSG00000237491 chr1 [713791, 715578] * |
## X002.ENSG00000237491 chr1 [724851, 727191] * |
## X003.ENSG00000272438 chr1 [839467, 840090] * |
## X004.ENSG00000223764 chr1 [856361, 856999] * |
## X004.ENSG00000187634 chr1 [856361, 856999] * |
## peakNames
## <CharacterList>
## X001.ENSG00000228327 gr1__MACS_peak_13,gr2__region_0,gr2__region_1
## X001.ENSG00000237491 gr1__MACS_peak_13,gr2__region_0,gr2__region_1
## X002.ENSG00000237491 gr2__region_2,gr1__MACS_peak_14
## X003.ENSG00000272438 gr1__MACS_peak_16,gr2__region_3
## X004.ENSG00000223764 gr1__MACS_peak_17,gr2__region_4
## X004.ENSG00000187634 gr1__MACS_peak_17,gr2__region_4
## peak feature start_position
## <character> <character> <integer>
## X001.ENSG00000228327 001 ENSG00000228327 700237
## X001.ENSG00000237491 001 ENSG00000237491 714150
## X002.ENSG00000237491 002 ENSG00000237491 714150
## X003.ENSG00000272438 003 ENSG00000272438 840214
## X004.ENSG00000223764 004 ENSG00000223764 852245
## X004.ENSG00000187634 004 ENSG00000187634 860260
## end_position feature_strand insideFeature
## <integer> <character> <factor>
## X001.ENSG00000228327 714006 - overlapStart
## X001.ENSG00000237491 745440 + overlapStart
## X002.ENSG00000237491 745440 + inside
## X003.ENSG00000272438 851356 + upstream
## X004.ENSG00000223764 856396 - overlapStart
## X004.ENSG00000187634 879955 + upstream
## distancetoFeature shortestDistance
## <numeric> <integer>
## X001.ENSG00000228327 215 215
## X001.ENSG00000237491 -359 359
## X002.ENSG00000237491 10701 10701
## X003.ENSG00000272438 -747 124
## X004.ENSG00000223764 35 35
## X004.ENSG00000187634 -3899 3261
## fromOverlappingOrNearest gene_name
## <character> <character>
## X001.ENSG00000228327 Overlapping RP11-206L10.2
## X001.ENSG00000237491 Overlapping RP11-206L10.9
## X002.ENSG00000237491 Overlapping RP11-206L10.9
## X003.ENSG00000272438 Overlapping RP11-54O7.16
## X004.ENSG00000223764 Overlapping RP11-54O7.3
## X004.ENSG00000187634 Overlapping SAMD11
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Once the peaks are annotated, the distribution of the distance to the nearest feature such as the transcription start sites (TSS) can be plotted. The sample code here plots the distribution of the aggregated peak scores and the number of peaks around the TSS.
gr1.copy <- gr1
gr1.copy$score <- 1
binOverFeature(gr1, gr1.copy, annotationData=annoData,
radius=5000, nbins=10, FUN=c(sum, length),
ylab=c("score", "count"),
main=c("Distribution of aggregated peak scores around TSS",
"Distribution of aggregated peak numbers around TSS"))
The distribution of the peaks over exon, intron, enhancer, proximal promoter,
5’ UTR and 3’ UTR can be summarized in peak centric or nucleotide centric view using
the function assignChromosomeRegion
.
Please note that setting nucleotideLevel = TRUE will give a nucleotide level distribution over
different features.
if(require(TxDb.Hsapiens.UCSC.hg19.knownGene)){
aCR<-assignChromosomeRegion(gr1, nucleotideLevel=FALSE,
precedence=c("Promoters", "immediateDownstream",
"fiveUTRs", "threeUTRs",
"Exons", "Introns"),
TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
barplot(aCR$percentage)
}
Here we describe some details in using different functions in ChIPpeakAnno
for different tasks. As shown in the last section, the common workflow includes:
loading called peaks from BED, GFF, or other formats; evaluating and visualizing
the concordance among the biological replicates; combining peaks from
replicates; preparing genomic annotation(s) as GRanges; associating/annotating
peaks with the annotation(s); summarizing peak
distributions over exon, intron, enhancer, proximal promoter, 5’UTR and 3’UTR
regions; retrieving the sequences around the peaks; and enrichment analysis of GO and
biological pathway. We also implemented the functions to plot the heatmap of given
peak ranges, and perform permutation test to determine if there is a significant overlap between two sets of peaks.
Prior to associating features of interest with the peaks, it is a common practice to evaluate the concordance among the peaks from biological replicates and combine the peaks from biological replicates. Also, it is biologically
interesting to obtain overlapping peaks from different ChIP-seq experiments
to imply the potential formation of transcription factor complexes. ChIPpeakAnno
implemented
functions to achieve those goals and quantitatively determine the significance of
peak overlaps and generate a Venn diagram for visualization.
Here is the sample code to obtain the overlapping peaks with maximum gap of 1kb for two peak ranges.
peaks1 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6",
"2", "6", "6", "6", "6", "5"),
ranges=IRanges(start=c(967654, 2010897, 2496704, 3075869,
3123260, 3857501, 201089, 1543200,
1557200, 1563000, 1569800, 167889600),
end= c(967754, 2010997, 2496804, 3075969,
3123360, 3857601, 201089, 1555199,
1560599, 1565199, 1573799, 167893599),
names=paste("Site", 1:12, sep="")),
strand="+")
peaks2 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", "1", "2", "3",
"4", "5", "6", "6", "6", "6", "6", "5"),
ranges=IRanges(start=c(967659, 2010898, 2496700,
3075866, 3123260, 3857500,
96765, 201089, 249670, 307586,
312326, 385750, 1549800,
1554400, 1565000, 1569400,
167888600),
end=c(967869, 2011108, 2496920,
3076166,3123470, 3857780,
96985, 201299, 249890, 307796,
312586, 385960, 1550599, 1560799,
1565399, 1571199, 167888999),
names=paste("t", 1:17, sep="")),
strand=c("+", "+", "+", "+", "+", "+", "-", "-", "-",
"-", "-", "-", "+", "+", "+", "+", "+"))
ol <- findOverlapsOfPeaks(peaks1, peaks2, maxgap=1000)
peaklist <- ol$peaklist
The function findOverlapsOfPeaks
returns an object of overlappingPeaks,
which contains there elements: venn_cnt, peaklist (a list of
overlapping peaks or unique peaks), and overlappingPeaks (a list of data frame
consists of the annotation of all the overlapping peaks).
Within the overlappingPeaks element of the overlappingPeaks object ol (which is also a list), the element “peaks1///peaks2” is a data frame representing the overlapping peaks with maximum gap of 1kb between the two peak lists. Using the overlapFeature column in this data frame, a pie graph can be generated to describe the distribution of the features of the relative positions of peaks1 to peaks2 for the overlapping peaks.
overlappingPeaks <- ol$overlappingPeaks
names(overlappingPeaks)
## [1] "peaks1///peaks2"
dim(overlappingPeaks[["peaks1///peaks2"]])
## [1] 13 14
overlappingPeaks[["peaks1///peaks2"]][1:2, ]
## peaks1 seqnames start end width strand peaks2 seqnames
## Site1_t1 Site1 1 967654 967754 101 + t1 1
## Site2_t2 Site2 2 2010897 2010997 101 + t2 2
## start end width strand overlapFeature shortestDistance
## Site1_t1 967659 967869 211 + overlapStart 5
## Site2_t2 2010898 2011108 211 + overlapStart 1
pie1(table(overlappingPeaks[["peaks1///peaks2"]]$overlapFeature))
The following code returns the merged overlapping peaks from the peaklist object.
peaklist[["peaks1///peaks2"]]
## GRanges object with 11 ranges and 1 metadata column:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] 1 [ 967654, 967869] + |
## [2] 2 [ 201089, 201299] * |
## [3] 2 [ 2010897, 2011108] + |
## [4] 3 [ 2496700, 2496920] + |
## [5] 4 [ 3075866, 3076166] + |
## [6] 5 [ 3123260, 3123470] + |
## [7] 5 [167888600, 167893599] + |
## [8] 6 [ 1543200, 1560799] + |
## [9] 6 [ 1563000, 1565399] + |
## [10] 6 [ 1569400, 1573799] + |
## [11] 6 [ 3857500, 3857780] + |
## peakNames
## <CharacterList>
## [1] peaks1__Site1,peaks2__t1
## [2] peaks1__Site7,peaks2__t8
## [3] peaks1__Site2,peaks2__t2
## [4] peaks2__t3,peaks1__Site3
## [5] peaks2__t4,peaks1__Site4
## [6] peaks1__Site5,peaks2__t5
## [7] peaks2__t17,peaks1__Site12
## [8] peaks1__Site8,peaks2__t13,peaks2__t14,...
## [9] peaks1__Site10,peaks2__t15
## [10] peaks2__t16,peaks1__Site11
## [11] peaks2__t6,peaks1__Site6
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
The peaks in peaks1 but not overlap with the peaks in peaks2 can be obtained with:
peaklist[["peaks1"]]
## NULL
The peaks in peaks2 but not overlap with the peaks in peaks1 can be obtained with:
peaklist[["peaks2"]]
## GRanges object with 5 ranges and 1 metadata column:
## seqnames ranges strand | peakNames
## <Rle> <IRanges> <Rle> | <CharacterList>
## [1] 1 [ 96765, 96985] - | peaks2__t7
## [2] 3 [249670, 249890] - | peaks2__t9
## [3] 4 [307586, 307796] - | peaks2__t10
## [4] 5 [312326, 312586] - | peaks2__t11
## [5] 6 [385750, 385960] - | peaks2__t12
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
Venn diagram can be generated by the function makeVennDiagram
using the output
of findOverlapsOfPeaks
as an input.
The makeVennDiagram
also outputs p-values indicating whether the overlapping is significant.
makeVennDiagram(ol, totalTest=1e+2)
## $p.value
## peaks1 peaks2 pval
## [1,] 1 1 5.890971e-12
##
## $vennCounts
## peaks1 peaks2 Counts
## [1,] 0 0 83
## [2,] 0 1 5
## [3,] 1 0 0
## [4,] 1 1 12
## attr(,"class")
## [1] "VennCounts"
Alternatively, users have the option to use other tools to plot Venn diagram. The following code demonstrates how to use a third party R package Vernerable with the output from the function findOverlapsOfPeaks
.
# install.packages("Vennerable", repos="http://R-Forge.R-project.org",
# type="source")
# library(Vennerable)
# venn_cnt2venn <- function(venn_cnt){
# n <- which(colnames(venn_cnt)=="Counts") - 1
# SetNames=colnames(venn_cnt)[1:n]
# Weight=venn_cnt[,"Counts"]
# names(Weight) <- apply(venn_cnt[,1:n], 1, base::paste, collapse="")
# Venn(SetNames=SetNames, Weight=Weight)
# }
#
# v <- venn_cnt2venn(ol$venn_cnt)
# plot(v)
The findOverlapsOfPeaks
function accepts
up to 5 peak lists for overlapping peaks. The following code is an example for 3 peak lists.
peaks3 <- GRanges(seqnames=c("1", "2", "3", "4", "5",
"6", "1", "2", "3", "4"),
ranges=IRanges(start=c(967859, 2010868, 2496500, 3075966,
3123460, 3851500, 96865, 201189,
249600, 307386),
end= c(967969, 2011908, 2496720, 3076166,
3123470, 3857680, 96985, 201299,
249890, 307796),
names=paste("p", 1:10, sep="")),
strand=c("+", "+", "+", "+", "+",
"+", "-", "-", "-", "-"))
ol <- findOverlapsOfPeaks(peaks1, peaks2, peaks3, maxgap=1000,
connectedPeaks="min")
makeVennDiagram(ol, totalTest=1e+2)
## $p.value
## peaks1 peaks2 peaks3 pval
## [1,] 0 1 1 1.123492e-09
## [2,] 1 0 1 5.131347e-06
## [3,] 1 1 0 5.890971e-12
##
## $vennCounts
## peaks1 peaks2 peaks3 Counts
## [1,] 0 0 0 83
## [2,] 0 0 1 0
## [3,] 0 1 0 2
## [4,] 0 1 1 3
## [5,] 1 0 0 0
## [6,] 1 0 1 0
## [7,] 1 1 0 5
## [8,] 1 1 1 7
## attr(,"class")
## [1] "VennCounts"
The parameter totalTest in the function makeVennDiagram
indicates the total number of potential peaks used in the hypergeometric test. It should be
larger than the largest number of peaks in the replicates. The smaller it is
set, the more stringent the test is. The time used to calculate p-value does not
depend on the value