Abstract
The package is for facilitating the downstream analysis for ChIP-seq experiments. It includes functions to find the nearest gene, exon, miRNA or custom features such as the most conserved elements and other transcription factor binding sites supplied by users, retrieve the sequences around the peak, obtain enriched Gene Ontology (GO) terms or pathways. Starting 2.0.5, new functions have been added for finding the peaks with bi-directional promoters with summary statistics (peaksNearBDP), for summarizing the occurrence of motifs in peaks (summarizePatternInPeaks) and for adding other IDs to annotated peaks or enrichedGO (addGeneIDs). Starting 3.4, permutation test has been added to determine whether there is a significant overlap between two sets of peaks. In addition, binding patterns of multiple transcription factors (TFs) or distributions of multiple epigenetic markers around genomic features could be visualized and compared easily using a side-by-side heatmap and density plot.
Chromatin immunoprecipitation (ChIP) followed by DNA sequencing (ChIP-seq) and ChIP followed by genome tiling array analysis (ChIP-chip) have become prevalent high throughput technologies for identifying the binding sites of DNA-binding proteins genome-wise. A number of algorithms have been published to facilitate the identification of the binding sites of the DNA-binding proteins of interest. The identified binding sites as the list of peaks are usually converted to BED or bigwig files to be loaded to the UCSC genome browser as custom tracks for investigators to view the proximity to various genomic features such as genes, exons or conserved elements. However, clicking through the genome browser is a daunting task when the number of peaks gets large or the peaks spread widely across the genome.
Here we developed ChIPpeakAnno, a Bioconductor1 package, to facilitate the batch annotation of the peaks identified from ChIP-seq or ChIP-chip experiments. We implemented functionality to find the nearest gene, exon, miRNA or other custom features supplied by users such as the most conserved elements and other transcription factor binding sites leveraging GRanges. Since the genome annotation gets updated frequently, we have leveraged the biomaRt package to retrieve the annotation data on the fly. The users also have the flexibility to pass their own annotation data or annotation from GenomicFeatures as GRanges. We have also leveraged BSgenome and biomaRt to retrieve the sequences around the identified peak for peak validation or motif discovery2. To understand whether the identified peaks are enriched around genes with certain GO terms, we have implemented the Gene Ontology (GO) enrichment test in the ChIPpeakAnno package leveraging the hypergeometric test phyper in the stats package and integrated with the GO annotation from the GO.db package and multiplicity adjustment functions from the multtest package3–8. The pathway analysis using reactome or KEGG is also supported. Starting 3.4, we also implement the functions for permutation test to determine whether there is a significant overlap between two sets of peaks. In addition, binding patterns of multiple transcription factors (TFs) or distributions of multiple epigenetic markers around genomic features could be visualized and compared easily using a side-by-side heatmap and density plot.
library(ChIPpeakAnno)
## import the MACS output
macs <- system.file("extdata", "MACS_peaks.xls", package="ChIPpeakAnno")
macsOutput <- toGRanges(macs, format="MACS")
## annotate the peaks with precompiled ensembl annotation
data(TSS.human.GRCh38)
macs.anno <- annotatePeakInBatch(macsOutput, AnnotationData=TSS.human.GRCh38)
## add gene symbols
library(org.Hs.eg.db)
macs.anno <- addGeneIDs(annotatedPeak=macs.anno,
orgAnn="org.Hs.eg.db",
IDs2Add="symbol")
if(interactive()){## annotate the peaks with UCSC annotation
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
ucsc.hg38.knownGene <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
macs.anno <- annotatePeakInBatch(macsOutput,
AnnotationData=ucsc.hg38.knownGene)
macs.anno <- addGeneIDs(annotatedPeak=macs.anno,
orgAnn="org.Hs.eg.db",
feature_id_type="entrez_id",
IDs2Add="symbol")
}
We illustrate here a common downstream analysis workflow for ChIP-seq experiments. The input of ChIPpeakAnno is a list of called peaks identified from ChIP-seq experiments. The peaks are represented by GRanges in ChIPpeakAnno. We implemented a conversion functions toGRanges
to convert commonly used peak file formats, such as BED, GFF, or other user defined formats such as MACS (a popular peak calling program) output file to GRanges. Please type ?toGRanges
for more information.
The workflow here exemplifies converting the BED and GFF files to GRanges, finding the overlapping peaks between the two peak sets, and visualizing the number of common and specific peaks with Venn diagram.
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE)
## one can also try import from rtracklayer
library(rtracklayer)
gr1.import <- import(bed, format="BED")
identical(start(gr1), start(gr1.import))
## [1] TRUE
gr1[1:2]
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## MACS_peak_1 chr1 28341-29610 * | 160.81
## MACS_peak_2 chr1 90821-91234 * | 133.12
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr1.import[1:2] #note the name slot is different from gr1
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 28341-29610 * | MACS_peak_1 160.81
## [2] chr1 90821-91234 * | MACS_peak_2 133.12
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
makeVennDiagram(ol,
fill=c("#009E73", "#F0E442"), # circle fill color
col=c("#D55E00", "#0072B2"), #circle border color
cat.col=c("#D55E00", "#0072B2"))
## INFO [2023-04-27 16:19:33] $fill
## INFO [2023-04-27 16:19:33] [1] "#009E73" "#F0E442"
## INFO [2023-04-27 16:19:33]
## INFO [2023-04-27 16:19:33] $col
## INFO [2023-04-27 16:19:33] [1] "#D55E00" "#0072B2"
## INFO [2023-04-27 16:19:33]
## INFO [2023-04-27 16:19:33] $cat.col
## INFO [2023-04-27 16:19:33] [1] "#D55E00" "#0072B2"
## INFO [2023-04-27 16:19:33]
## INFO [2023-04-27 16:19:33] $cat.cex
## INFO [2023-04-27 16:19:33] [1] 1
## INFO [2023-04-27 16:19:33]
## INFO [2023-04-27 16:19:33] $cat.fontface
## INFO [2023-04-27 16:19:33] [1] "plain"
## INFO [2023-04-27 16:19:33]
## INFO [2023-04-27 16:19:33] $cat.fontfamily
## INFO [2023-04-27 16:19:33] [1] "serif"
## INFO [2023-04-27 16:19:33]
## INFO [2023-04-27 16:19:33] $x
## INFO [2023-04-27 16:19:33] $x$gr1
## INFO [2023-04-27 16:19:33] [1] 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## INFO [2023-04-27 16:19:33] [19] 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
## INFO [2023-04-27 16:19:33] [37] 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
## INFO [2023-04-27 16:19:33] [55] 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133
## INFO [2023-04-27 16:19:33] [73] 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151
## INFO [2023-04-27 16:19:33] [91] 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169
## INFO [2023-04-27 16:19:33] [109] 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187
## INFO [2023-04-27 16:19:33] [127] 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205
## INFO [2023-04-27 16:19:33] [145] 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
## INFO [2023-04-27 16:19:33] [163] 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241
## INFO [2023-04-27 16:19:33] [181] 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259
## INFO [2023-04-27 16:19:33] [199] 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277
## INFO [2023-04-27 16:19:33] [217] 278 279 280 281 282 283 284 285 286 287 288 289
## INFO [2023-04-27 16:19:33]
## INFO [2023-04-27 16:19:33] $x$gr2
## INFO [2023-04-27 16:19:33] [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## INFO [2023-04-27 16:19:33] [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## INFO [2023-04-27 16:19:33] [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## INFO [2023-04-27 16:19:33] [55] 55 56 57 58 59 60 61 124 125 126 127 128 129 130 131 132 133 134
## INFO [2023-04-27 16:19:33] [73] 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152
## INFO [2023-04-27 16:19:33] [91] 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
## INFO [2023-04-27 16:19:33] [109] 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
## INFO [2023-04-27 16:19:33] [127] 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206
## INFO [2023-04-27 16:19:33] [145] 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## INFO [2023-04-27 16:19:33] [163] 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242
## INFO [2023-04-27 16:19:33] [181] 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## INFO [2023-04-27 16:19:33] [199] 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278
## INFO [2023-04-27 16:19:33] [217] 279 280 281 282 283 284 285 286 287 288 289
## INFO [2023-04-27 16:19:33]
## INFO [2023-04-27 16:19:33]
## INFO [2023-04-27 16:19:33] $filename
## INFO [2023-04-27 16:19:33] NULL
## INFO [2023-04-27 16:19:33]
## INFO [2023-04-27 16:19:33] $disable.logging
## INFO [2023-04-27 16:19:33] [1] TRUE
## INFO [2023-04-27 16:19:33]
## $p.value
## gr1 gr2 pval
## [1,] 1 1 0
##
## $vennCounts
## gr1 gr2 Counts count.gr1 count.gr2
## [1,] 0 0 0 0 0
## [2,] 0 1 61 0 61
## [3,] 1 0 62 62 0
## [4,] 1 1 166 168 169
## attr(,"class")
## [1] "VennCounts"
A pie chart is used to demonstrate the overlap features of the common peaks.
pie1(table(ol$overlappingPeaks[["gr1///gr2"]]$overlapFeature))
After finding the overlapping peaks, you can use annotatePeakInBatch
to annotate the overlapping peaks with the genomic features in the AnnotationData within certain distance away specified by maxgap, which is 5kb in the following example.
overlaps <- ol$peaklist[["gr1///gr2"]]
## ============== old style ===========
## data(TSS.human.GRCh37)
## overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData,
## output="overlapping", maxgap=5000L)
## overlaps.anno <- addGeneIDs(overlaps.anno, "org.Hs.eg.db", "symbol")
## ============== new style ===========
library(EnsDb.Hsapiens.v75) ##(hg19)
## create annotation file from EnsDb or TxDb
annoData <- toGRanges(EnsDb.Hsapiens.v75, feature="gene")
annoData[1:2]
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_name
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000223972 chr1 11869-14412 + | DDX11L1
## ENSG00000227232 chr1 14363-29806 - | WASH7P
## -------
## seqinfo: 273 sequences (1 circular) from 2 genomes (hg19, GRCh37)
overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData,
output="overlapping", maxgap=5000L)
overlaps.anno$gene_name <-
annoData$gene_name[match(overlaps.anno$feature,
names(annoData))]
head(overlaps.anno)
## GRanges object with 6 ranges and 11 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## X001.ENSG00000228327 chr1 713791-715578 * |
## X001.ENSG00000237491 chr1 713791-715578 * |
## X002.ENSG00000237491 chr1 724851-727191 * |
## X003.ENSG00000272438 chr1 839467-840090 * |
## X004.ENSG00000223764 chr1 856361-856999 * |
## X004.ENSG00000187634 chr1 856361-856999 * |
## peakNames peak
## <CharacterList> <character>
## X001.ENSG00000228327 gr1__MACS_peak_13,gr2__001,gr2__002 001
## X001.ENSG00000237491 gr1__MACS_peak_13,gr2__001,gr2__002 001
## X002.ENSG00000237491 gr2__003,gr1__MACS_peak_14 002
## X003.ENSG00000272438 gr1__MACS_peak_16,gr2__004 003
## X004.ENSG00000223764 gr1__MACS_peak_17,gr2__005 004
## X004.ENSG00000187634 gr1__MACS_peak_17,gr2__005 004
## feature start_position end_position
## <character> <integer> <integer>
## X001.ENSG00000228327 ENSG00000228327 700237 714006
## X001.ENSG00000237491 ENSG00000237491 714150 745440
## X002.ENSG00000237491 ENSG00000237491 714150 745440
## X003.ENSG00000272438 ENSG00000272438 840214 851356
## X004.ENSG00000223764 ENSG00000223764 852245 856396
## X004.ENSG00000187634 ENSG00000187634 860260 879955
## feature_strand insideFeature distancetoFeature
## <character> <character> <numeric>
## X001.ENSG00000228327 - overlapStart 215
## X001.ENSG00000237491 + overlapStart -359
## X002.ENSG00000237491 + inside 10701
## X003.ENSG00000272438 + upstream -747
## X004.ENSG00000223764 - overlapStart 35
## X004.ENSG00000187634 + upstream -3899
## shortestDistance fromOverlappingOrNearest gene_name
## <integer> <character> <character>
## X001.ENSG00000228327 215 Overlapping RP11-206L10.2
## X001.ENSG00000237491 359 Overlapping RP11-206L10.9
## X002.ENSG00000237491 10701 Overlapping RP11-206L10.9
## X003.ENSG00000272438 124 Overlapping RP11-54O7.16
## X004.ENSG00000223764 35 Overlapping RP11-54O7.3
## X004.ENSG00000187634 3261 Overlapping SAMD11
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Once the peaks are annotated, the distribution of the distance to the nearest feature such as the transcription start sites (TSS) can be plotted. The sample code here plots the distribution of the aggregated peak scores and the number of peaks around the TSS.
gr1.copy <- gr1
gr1.copy$score <- 1
binOverFeature(gr1, gr1.copy, annotationData=annoData,
radius=5000, nbins=10, FUN=c(sum, length),
ylab=c("score", "count"),
main=c("Distribution of aggregated peak scores around TSS",
"Distribution of aggregated peak numbers around TSS"))
The distribution of the peaks over exon, intron, enhancer, proximal promoter, 5’ UTR and 3’ UTR can be summarized in peak centric or nucleotide centric view using the function assignChromosomeRegion
. Please note that setting nucleotideLevel = TRUE will give a nucleotide level distribution over different features.
if(require(TxDb.Hsapiens.UCSC.hg19.knownGene)){
aCR<-assignChromosomeRegion(gr1, nucleotideLevel=FALSE,
precedence=c("Promoters", "immediateDownstream",
"fiveUTRs", "threeUTRs",
"Exons", "Introns"),
TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
barplot(aCR$percentage)
}
Here we describe some details in using different functions in ChIPpeakAnno
for different tasks. As shown in the last section, the common workflow includes: loading called peaks from BED, GFF, or other formats; evaluating and visualizing the concordance among the biological replicates; combining peaks from replicates; preparing genomic annotation(s) as GRanges; associating/annotating peaks with the annotation(s); summarizing peak distributions over exon, intron, enhancer, proximal promoter, 5’UTR and 3’UTR regions; retrieving the sequences around the peaks; and enrichment analysis of GO and biological pathway. We also implemented the functions to plot the heatmap of given peak ranges, and perform permutation test to determine if there is a significant overlap between two sets of peaks.
Prior to associating features of interest with the peaks, it is a common practice to evaluate the concordance among the peaks from biological replicates and combine the peaks from biological replicates. Also, it is biologically interesting to obtain overlapping peaks from different ChIP-seq experiments to imply the potential formation of transcription factor complexes. ChIPpeakAnno
implemented functions to achieve those goals and quantitatively determine the significance of peak overlaps and generate a Venn diagram for visualization.
Here is the sample code to obtain the overlapping peaks with maximum gap of 1kb for two peak ranges.
peaks1 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6",
"2", "6", "6", "6", "6", "5"),
ranges=IRanges(start=c(967654, 2010897, 2496704, 3075869,
3123260, 3857501, 201089, 1543200,
1557200, 1563000, 1569800, 167889600),
end= c(967754, 2010997, 2496804, 3075969,
3123360, 3857601, 201089, 1555199,
1560599, 1565199, 1573799, 167893599),
names=paste("Site", 1:12, sep="")),
strand="+")
peaks2 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", "1", "2", "3",
"4", "5", "6", "6", "6", "6", "6", "5"),
ranges=IRanges(start=c(967659, 2010898, 2496700,
3075866, 3123260, 3857500,
96765, 201089, 249670, 307586,
312326, 385750, 1549800,
1554400, 1565000, 1569400,
167888600),
end=c(967869, 2011108, 2496920,
3076166,3123470, 3857780,
96985, 201299, 249890, 307796,
312586, 385960, 1550599, 1560799,
1565399, 1571199, 167888999),
names=paste("t", 1:17, sep="")),
strand=c("+", "+", "+", "+", "+", "+", "-", "-", "-",
"-", "-", "-", "+", "+", "+", "+", "+"))
ol <- findOverlapsOfPeaks(peaks1, peaks2, maxgap=1000)
peaklist <- ol$peaklist
The function findOverlapsOfPeaks
returns an object of overlappingPeaks, which contains there elements: venn_cnt, peaklist (a list of
overlapping peaks or unique peaks), and overlappingPeaks (a list of data frame consists of the annotation of all the overlapping peaks).
Within the overlappingPeaks element of the overlappingPeaks object ol (which is also a list), the element “peaks1///peaks2” is a data frame representing the overlapping peaks with maximum gap of 1kb between the two peak lists. Using the overlapFeature column in this data frame, a pie graph can be generated to describe the distribution of the features of the relative positions of peaks1 to peaks2 for the overlapping peaks.
overlappingPeaks <- ol$overlappingPeaks
names(overlappingPeaks)
## [1] "peaks1///peaks2"
dim(overlappingPeaks[["peaks1///peaks2"]])
## [1] 13 14
overlappingPeaks[["peaks1///peaks2"]][1:2, ]
## peaks1 seqnames start end width strand peaks2 seqnames start
## Site1_t1 Site1 1 967654 967754 101 + t1 1 967659
## Site2_t2 Site2 2 2010897 2010997 101 + t2 2 2010898
## end width strand overlapFeature shortestDistance
## Site1_t1 967869 211 + overlapStart 5
## Site2_t2 2011108 211 + overlapStart 1
pie1(table(overlappingPeaks[["peaks1///peaks2"]]$overlapFeature))
The following code returns the merged overlapping peaks from the peaklist object.
peaklist[["peaks1///peaks2"]]
## GRanges object with 11 ranges and 1 metadata column:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] 1 967654-967869 + |
## [2] 2 201089-201299 * |
## [3] 2 2010897-2011108 + |
## [4] 3 2496700-2496920 + |
## [5] 4 3075866-3076166 + |
## [6] 5 3123260-3123470 + |
## [7] 5 167888600-167893599 + |
## [8] 6 1543200-1560799 + |
## [9] 6 1563000-1565399 + |
## [10] 6 1569400-1573799 + |
## [11] 6 3857500-3857780 + |
## peakNames
## <CharacterList>
## [1] peaks1__Site1,peaks2__t1
## [2] peaks1__Site7,peaks2__t8
## [3] peaks1__Site2,peaks2__t2
## [4] peaks2__t3,peaks1__Site3
## [5] peaks2__t4,peaks1__Site4
## [6] peaks1__Site5,peaks2__t5
## [7] peaks2__t17,peaks1__Site12
## [8] peaks1__Site8,peaks2__t13,peaks2__t14,...
## [9] peaks1__Site10,peaks2__t15
## [10] peaks2__t16,peaks1__Site11
## [11] peaks2__t6,peaks1__Site6
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
The peaks in peaks1 but not overlap with the peaks in peaks2 can be obtained with:
peaklist[["peaks1"]]
## NULL
The peaks in peaks2 but not overlap with the peaks in peaks1 can be obtained with:
peaklist[["peaks2"]]
## GRanges object with 5 ranges and 1 metadata column:
## seqnames ranges strand | peakNames
## <Rle> <IRanges> <Rle> | <CharacterList>
## [1] 1 96765-96985 - | peaks2__t7
## [2] 3 249670-249890 - | peaks2__t9
## [3] 4 307586-307796 - | peaks2__t10
## [4] 5 312326-312586 - | peaks2__t11
## [5] 6 385750-385960 - | peaks2__t12
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
Venn diagram can be generated by the function makeVennDiagram
using the output of findOverlapsOfPeaks
as an input.
The makeVennDiagram
also outputs p-values indicating whether the overlapping is significant.
makeVennDiagram(ol, totalTest=1e+2,
fill=c("#009E73", "#F0E442"), # circle fill color
col=c("#D55E00", "#0072B2"), #circle border color
cat.col=c("#D55E00", "#0072B2"))
## INFO [2023-04-27 16:19:56] $fill
## INFO [2023-04-27 16:19:56] [1] "#009E73" "#F0E442"
## INFO [2023-04-27 16:19:56]
## INFO [2023-04-27 16:19:56] $col
## INFO [2023-04-27 16:19:56] [1] "#D55E00" "#0072B2"
## INFO [2023-04-27 16:19:56]
## INFO [2023-04-27 16:19:56] $cat.col
## INFO [2023-04-27 16:19:56] [1] "#D55E00" "#0072B2"
## INFO [2023-04-27 16:19:56]
## INFO [2023-04-27 16:19:56] $cat.cex
## INFO [2023-04-27 16:19:56] [1] 1
## INFO [2023-04-27 16:19:56]
## INFO [2023-04-27 16:19:56] $cat.fontface
## INFO [2023-04-27 16:19:56] [1] "plain"
## INFO [2023-04-27 16:19:56]
## INFO [2023-04-27 16:19:56] $cat.fontfamily
## INFO [2023-04-27 16:19:56] [1] "serif"
## INFO [2023-04-27 16:19:56]
## INFO [2023-04-27 16:19:56] $x
## INFO [2023-04-27 16:19:56] $x$peaks1
## INFO [2023-04-27 16:19:56] [1] 6 7 8 9 10 11 12 13 14 15 16 17
## INFO [2023-04-27 16:19:56]
## INFO [2023-04-27 16:19:56] $x$peaks2
## INFO [2023-04-27 16:19:56] [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## INFO [2023-04-27 16:19:56]
## INFO [2023-04-27 16:19:56]
## INFO [2023-04-27 16:19:56] $filename
## INFO [2023-04-27 16:19:56] NULL
## INFO [2023-04-27 16:19:56]
## INFO [2023-04-27 16:19:56] $disable.logging
## INFO [2023-04-27 16:19:56] [1] TRUE
## INFO [2023-04-27 16:19:56]
## $p.value
## peaks1 peaks2 pval
## [1,] 1 1 5.890971e-12
##
## $vennCounts
## peaks1 peaks2 Counts count.peaks1 count.peaks2
## [1,] 0 0 83 0 0
## [2,] 0 1 5 0 5
## [3,] 1 0 0 0 0
## [4,] 1 1 12 12 12
## attr(,"class")
## [1] "VennCounts"
Alternatively, users have the option to use other tools to plot Venn diagram. The following code demonstrates how to use a third party R package Vernerable with the output from the function findOverlapsOfPeaks
.
# install.packages("Vennerable", repos="http://R-Forge.R-project.org",
# type="source")
# library(Vennerable)
# venn_cnt2venn <- function(venn_cnt){
# n <- which(colnames(venn_cnt)=="Counts") - 1
# SetNames=colnames(venn_cnt)[1:n]
# Weight=venn_cnt[,"Counts"]
# names(Weight) <- apply(venn_cnt[,1:n], 1, base::paste, collapse="")
# Venn(SetNames=SetNames, Weight=Weight)
# }
#
# v <- venn_cnt2venn(ol$venn_cnt)
# plot(v)
The findOverlapsOfPeaks
function accepts up to 5 peak lists for overlapping peaks. The following code is an example for 3 peak lists.
peaks3 <- GRanges(seqnames=c("1", "2", "3", "4", "5",
"6", "1", "2", "3", "4"),
ranges=IRanges(start=c(967859, 2010868, 2496500, 3075966,
3123460, 3851500, 96865, 201189,
249600, 307386),
end= c(967969, 2011908, 2496720, 3076166,
3123470, 3857680, 96985, 201299,
249890, 307796),
names=paste("p", 1:10, sep="")),
strand=c("+", "+", "+", "+", "+",
"+", "-", "-", "-", "-"))
ol <- findOverlapsOfPeaks(peaks1, peaks2, peaks3, maxgap=1000,
connectedPeaks="min")
makeVennDiagram(ol, totalTest=1e+2,
fill=c("#CC79A7", "#56B4E9", "#F0E442"), # circle fill color
col=c("#D55E00", "#0072B2", "#E69F00"), #circle border color
cat.col=c("#D55E00", "#0072B2", "#E69F00"))
## INFO [2023-04-27 16:19:58] $fill
## INFO [2023-04-27 16:19:58] [1] "#CC79A7" "#56B4E9" "#F0E442"
## INFO [2023-04-27 16:19:58]
## INFO [2023-04-27 16:19:58] $col
## INFO [2023-04-27 16:19:58] [1] "#D55E00" "#0072B2" "#E69F00"
## INFO [2023-04-27 16:19:58]
## INFO [2023-04-27 16:19:58] $cat.col
## INFO [2023-04-27 16:19:58] [1] "#D55E00" "#0072B2" "#E69F00"
## INFO [2023-04-27 16:19:58]
## INFO [2023-04-27 16:19:58] $cat.cex
## INFO [2023-04-27 16:19:58] [1] 1
## INFO [2023-04-27 16:19:58]
## INFO [2023-04-27 16:19:58] $cat.fontface
## INFO [2023-04-27 16:19:58] [1] "plain"
## INFO [2023-04-27 16:19:58]
## INFO [2023-04-27 16:19:58] $cat.fontfamily
## INFO [2023-04-27 16:19:58] [1] "serif"
## INFO [2023-04-27 16:19:58]
## INFO [2023-04-27 16:19:58] $x
## INFO [2023-04-27 16:19:58] $x$peaks1
## INFO [2023-04-27 16:19:58] [1] 6 7 8 9 10 11 12 13 14 15 16 17
## INFO [2023-04-27 16:19:58]
## INFO [2023-04-27 16:19:58] $x$peaks2
## INFO [2023-04-27 16:19:58] [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## INFO [2023-04-27 16:19:58]
## INFO [2023-04-27 16:19:58] $x$peaks3
## INFO [2023-04-27 16:19:58] [1] 3 4 5 11 12 13 14 15 16 17
## INFO [2023-04-27 16:19:58]
## INFO [2023-04-27 16:19:58]
## INFO [2023-04-27 16:19:58] $filename
## INFO [2023-04-27 16:19:58] NULL
## INFO [2023-04-27 16:19:58]
## INFO [2023-04-27 16:19:58] $disable.logging
## INFO [2023-04-27 16:19:58] [1] TRUE
## INFO [2023-04-27 16:19:58]
## $p.value
## peaks1 peaks2 peaks3 pval
## [1,] 0 1 1 1.123492e-09
## [2,] 1 0 1 5.131347e-06
## [3,] 1 1 0 5.890971e-12
##
## $vennCounts
## peaks1 peaks2 peaks3 Counts
## [1,] 0 0 0 83
## [2,] 0 0 1 0
## [3,] 0 1 0 2
## [4,] 0 1 1 3
## [5,] 1 0 0 0
## [6,] 1 0 1 0
## [7,] 1 1 0 5
## [8,] 1 1 1 7
## attr(,"class")
## [1] "VennCounts"
The parameter totalTest in the function makeVennDiagram
indicates the total number of potential peaks used in the hypergeometric test. It should be larger than the largest number of peaks in the replicates. The smaller it is set, the more stringent the test is. The time used to calculate p-value does not depend on the value of the totalTest. For practical guidance on how to choose totalTest, please refer to the post. Hypergeometric test requires users to input an estimate of the total potential binding sites (peaks) for a given TF. To circumvent this requirement, we implemented a permutation test called permTest
. For more details about the permTest
, go to section 4.11.
One main function of the ChIPpeakAnno package is to annotate peaks to known genomic features, such as TSS, 5’UTR, 3’UTR etc. Constructing and choosing the appropriate annotation data is crucial for this process.
To simplify this process, we precompiled a list of annotation data for the transcriptional starting sites (TSS) of various species (with different genome assembly versions), such as TSS.human.NCBI36, TSS.human.GRCh37, TSS.human.GRCh38, TSS.mouse.NCBIM37, TSS.mouse.GRCm38, TSS.rat.RGSC3.4, TSS.rat.Rnor_5.0, TSS.zebrafish.Zv8, and TSS.zebrafish.Zv9. The precompiled annotations can be loaded by R data()
function, e.g., data(TSS.human.GRCh38).
To annotate the peaks with other genomic features, please use function getAnnotation
with the argument featureType, e.g., “Exon” to obtain the nearest exon, “miRNA” to find the nearest miRNA, and “5utr” or “3utr” to locate the overlapping “5’UTR” or “3’UTR”. Another parameter for getAnnotation
is the name of the appropriate biomaRt dataset, for example, drerio_gene_ensembl for zebrafish genome, mmusculus_gene_ensembl for mouse genome and rnorvegicus_gene_ensembl for rat genome. For a list of available biomaRt and dataset, please refer to the biomaRt package documentation2. For the detailed usage of getAnnotation
, please type ?getAnnotation
in R.
In addition, a custom annotation dataset as GRanges, can be used in annotatePeakInBatch
. We implemented toGRanges
function for converting custom annotation dataset in other formats, such as UCSC BED/GFF format, or any user defined dataset such as RangedDate, to GRanges. For example, if you have a list of transcription factor binding sites from literatures and are interested in locating the nearest TSS and the distance to it for the peak lists.
An GRanges object can be also constructed from EnsDb or TxDb object by calling the toGRanges
method. Use ?toGRanges
for more information.
Here is the code snippet to build annotation data containing only the known genes, i.e., excluding other transcript products such as pseudo genes using TranscriptDb TxDb.Hsapiens.UCSC.hg19.knownGene with toGRanges
is:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
annoData <- toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature="gene")
annoData
## GRanges object with 23056 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## 1 chr19 58858172-58874214 -
## 10 chr8 18248755-18258723 +
## 100 chr20 43248163-43280376 -
## 1000 chr18 25530930-25757445 -
## 10000 chr1 243651535-244006886 -
## ... ... ... ...
## 9991 chr9 114979995-115095944 -
## 9992 chr21 35736323-35743440 +
## 9993 chr22 19023795-19109967 -
## 9994 chr6 90539619-90584155 +
## 9997 chr22 50961997-50964905 -
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
With the annotation data, you can annotate the peaks identified from ChIP-seq or ChIP-chip experiments to retrieve the nearest gene and distance to the corresponding TSS of the gene.
For example, using the GRanges object generated in the previous section as AnnotationData, the first 6 peaks in the myPeakList are annotated with the following code:
data(myPeakList)
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6],
AnnotationData = annoData)
annotatedPeak[1:3]
## GRanges object with 3 ranges and 9 metadata columns:
## seqnames ranges strand | peak
## <Rle> <IRanges> <Rle> | <character>
## X1_93_556427.100288069 chr1 556660-556760 * | X1_93_556427
## X1_41_559455.100288069 chr1 559774-559874 * | X1_41_559455
## X1_12_703729.100288069 chr1 703885-703985 * | X1_12_703729
## feature start_position end_position feature_strand
## <character> <integer> <integer> <character>
## X1_93_556427.100288069 100288069 700245 714068 -
## X1_41_559455.100288069 100288069 700245 714068 -
## X1_12_703729.100288069 100288069 700245 714068 -
## insideFeature distancetoFeature shortestDistance
## <character> <numeric> <integer>
## X1_93_556427.100288069 downstream 157408 143485
## X1_41_559455.100288069 downstream 154294 140371
## X1_12_703729.100288069 inside 10183 3640
## fromOverlappingOrNearest
## <character>
## X1_93_556427.100288069 NearestLocation
## X1_41_559455.100288069 NearestLocation
## X1_12_703729.100288069 NearestLocation
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
As discussed in the previous section, all the genomic locations of the human genes have been precompiled, such as TSS.human.NCBI36 dataset, using function getAnnotation
. You can pass it to the argument annotaionData of the annotatePeakInBatch
function.
data(TSS.human.NCBI36)
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6],
AnnotationData=TSS.human.NCBI36)
annotatedPeak[1:3]
## GRanges object with 3 ranges and 9 metadata columns:
## seqnames ranges strand | peak
## <Rle> <IRanges> <Rle> | <character>
## X1_93_556427.ENSG00000212875 chr1 556660-556760 * | X1_93_556427
## X1_41_559455.ENSG00000212678 chr1 559774-559874 * | X1_41_559455
## X1_12_703729.ENSG00000197049 chr1 703885-703985 * | X1_12_703729
## feature start_position end_position
## <character> <integer> <integer>
## X1_93_556427.ENSG00000212875 ENSG00000212875 556318 557859
## X1_41_559455.ENSG00000212678 ENSG00000212678 559620 560165
## X1_12_703729.ENSG00000197049 ENSG00000197049 711184 712376
## feature_strand insideFeature distancetoFeature
## <character> <character> <numeric>
## X1_93_556427.ENSG00000212875 + inside 342
## X1_41_559455.ENSG00000212678 + inside 154
## X1_12_703729.ENSG00000197049 + upstream -7299
## shortestDistance fromOverlappingOrNearest
## <integer> <character>
## X1_93_556427.ENSG00000212875 342 NearestLocation
## X1_41_559455.ENSG00000212678 154 NearestLocation
## X1_12_703729.ENSG00000197049 7199 NearestLocation
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
You can also pass the user defined features as annotationData. A pie chart can be plotted to show the peak distribution among the features after annotation.
myPeak1 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6",
"2", "6", "6", "6", "6", "5"),
ranges=IRanges(start=c(967654, 2010897, 2496704, 3075869,
3123260, 3857501, 201089, 1543200,
1557200, 1563000, 1569800, 167889600),
end= c(967754, 2010997, 2496804, 3075969,
3123360, 3857601, 201089, 1555199,
1560599, 1565199, 1573799, 167893599),
names=paste("Site", 1:12, sep="")))
TFbindingSites <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", "1", "2",
"3", "4", "5", "6", "6", "6", "6", "6",
"5"),
ranges=IRanges(start=c(967659, 2010898, 2496700,
3075866, 3123260, 3857500,
96765, 201089, 249670, 307586,
312326, 385750, 1549800,
1554400, 1565000, 1569400,
167888600),
end=c(967869, 2011108, 2496920,
3076166,3123470, 3857780,
96985, 201299, 249890, 307796,
312586, 385960, 1550599, 1560799,
1565399, 1571199, 167888999),
names=paste("t", 1:17, sep="")),
strand=c("+", "+", "+", "+", "+", "+", "-", "-", "-",
"-", "-", "-", "+", "+", "+", "+", "+"))
annotatedPeak2 <- annotatePeakInBatch(myPeak1, AnnotationData=TFbindingSites)
annotatedPeak2[1:3]
## GRanges object with 3 ranges and 9 metadata columns:
## seqnames ranges strand | peak feature
## <Rle> <IRanges> <Rle> | <character> <character>
## Site1.t1 1 967654-967754 * | Site1 t1
## Site2.t2 2 2010897-2010997 * | Site2 t2
## Site3.t3 3 2496704-2496804 * | Site3 t3
## start_position end_position feature_strand insideFeature
## <integer> <integer> <character> <character>
## Site1.t1 967659 967869 + overlapStart
## Site2.t2 2010898 2011108 + overlapStart
## Site3.t3 2496700 2496920 + inside
## distancetoFeature shortestDistance fromOverlappingOrNearest
## <numeric> <integer> <character>
## Site1.t1 -5 5 NearestLocation
## Site2.t2 -1 1 NearestLocation
## Site3.t3 4 4 NearestLocation
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
pie1(table(as.data.frame(annotatedPeak2)$insideFeature))
Another example of using user defined AnnotationData is to annotate peaks by promoters, defined as upstream 5K to downstream 500bp from TSS. The sample code here demonstrates using the GenomicFeatures::promoters
function to build a custom annotation dataset and annotate the peaks with this user defined promoter annotations.
library(ChIPpeakAnno)
data(myPeakList)
data(TSS.human.NCBI36)
annotationData <- promoters(TSS.human.NCBI36, upstream=5000, downstream=500)
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6,],
AnnotationData=annotationData,
output="overlapping")
annotatedPeak[1:3]
## GRanges object with 3 ranges and 9 metadata columns:
## seqnames ranges strand | peak
## <Rle> <IRanges> <Rle> | <character>
## X1_93_556427.ENSG00000209341 chr1 556660-556760 * | X1_93_556427
## X1_93_556427.ENSG00000209344 chr1 556660-556760 * | X1_93_556427
## X1_93_556427.ENSG00000209346 chr1 556660-556760 * | X1_93_556427
## feature start_position end_position
## <character> <integer> <integer>
## X1_93_556427.ENSG00000209341 ENSG00000209341 554314 559813
## X1_93_556427.ENSG00000209344 ENSG00000209344 555569 561068
## X1_93_556427.ENSG00000209346 ENSG00000209346 555643 561142
## feature_strand insideFeature distancetoFeature
## <character> <character> <numeric>
## X1_93_556427.ENSG00000209341 - inside 3153
## X1_93_556427.ENSG00000209344 - inside 4408
## X1_93_556427.ENSG00000209346 - inside 4482
## shortestDistance fromOverlappingOrNearest
## <integer> <character>
## X1_93_556427.ENSG00000209341 2346 Overlapping
## X1_93_556427.ENSG00000209344 1091 Overlapping
## X1_93_556427.ENSG00000209346 1017 Overlapping
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
In the function annotatyePeakInBatch
, various parameters can be adjusted to specify the way to calculate the distance and how the features are selected. For example, PeakLocForDistance is to specify the location of the peak for distance calculation: “middle” (recommended) means using the middle of the peak, and “start” (default, for backward compatibility) means using the start of the peak to calculate the distance to the features. Similarly, FeatureLocForDistance is to specify the location of the feature for distance calculation: “middle” means using the middle of the feature, “start” means using the start of the feature to calculate the distance from the peak to the feature; “TSS” (default) means using the start of the feature when the feature is on plus strand and using the end of feature when the feature is on minus strand; “geneEnd” means using end of the feature when feature is on plus strand and using start of feature when feature is on minus strand.
The argument “output” specifies the characteristics of the output of the annotated features. The default is “nearestLocation”, which means to output the nearest features calculated as PeakLocForDistance-FeatureLocForDistance; “overlapping” will output the overlapping features within the maximum gap specified as maxgap between the peak range and feature range; “shortestDistance” will output the nearest features; “both” will output all the nearest features, in addition, will output any features that overlap the peak that are not the nearest features. other options see ?annotatePeakInBatch.
In addition to annotating peaks to nearest genes, ChIPpeakAnno can also reports all overlapping and flanking genes by setting output=“both” and maxgap in annotatePeakInBatch
. For example, it outputs all overlapping and flanking genes within 5kb plus nearest genes if set maxgap = 5000 and output =“both”.
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6],
AnnotationData = annoData,
output="both", maxgap=5000)
head(annotatedPeak)
## GRanges object with 6 ranges and 9 metadata columns:
## seqnames ranges strand | peak
## <Rle> <IRanges> <Rle> | <character>
## X1_93_556427.100288069 chr1 556660-556760 * | X1_93_556427
## X1_41_559455.100288069 chr1 559774-559874 * | X1_41_559455
## X1_12_703729.100288069 chr1 703885-703985 * | X1_12_703729
## X1_20_925025.84808 chr1 926058-926158 * | X1_20_925025
## X1_11_1041174.54991 chr1 1041646-1041746 * | X1_11_1041174
## X1_14_1269014.1855 chr1 1270239-1270339 * | X1_14_1269014
## feature start_position end_position feature_strand
## <character> <integer> <integer> <character>
## X1_93_556427.100288069 100288069 700245 714068 -
## X1_41_559455.100288069 100288069 700245 714068 -
## X1_12_703729.100288069 100288069 700245 714068 -
## X1_20_925025.84808 84808 910579 917473 -
## X1_11_1041174.54991 54991 1017198 1051736 -
## X1_14_1269014.1855 1855 1270658 1284492 -
## insideFeature distancetoFeature shortestDistance
## <character> <numeric> <integer>
## X1_93_556427.100288069 downstream 157408 143485
## X1_41_559455.100288069 downstream 154294 140371
## X1_12_703729.100288069 inside 10183 3640
## X1_20_925025.84808 upstream -8585 8585
## X1_11_1041174.54991 inside 10090 9990
## X1_14_1269014.1855 downstream 14253 319
## fromOverlappingOrNearest
## <character>
## X1_93_556427.100288069 NearestLocation
## X1_41_559455.100288069 NearestLocation
## X1_12_703729.100288069 NearestLocation
## X1_20_925025.84808 NearestLocation
## X1_11_1041174.54991 NearestLocation
## X1_14_1269014.1855 Overlapping
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
Additional annotations features such as entrez ID, gene symbol and gene name can be added with the function addGeneIDs
. The annotated peaks can be saved as an Excel file or plotted for visualizing the peak distribution relative to the genomic features of interest. Here is an example to add gene symbol to the annotated peaks. Please type ?addGeneIDs
in a R session for more information.
data(annotatedPeak)
library(org.Hs.eg.db)
addGeneIDs(annotatedPeak[1:6], orgAnn="org.Hs.eg.db", IDs2Add=c("symbol"))
## GRanges object with 6 ranges and 9 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## X1_11_100272487.ENSG00000202254 1 100272801-100272900 + |
## X1_11_108905539.ENSG00000186086 1 108906026-108906125 + |
## X1_11_110106925.ENSG00000065135 1 110107267-110107366 + |
## X1_11_110679983.ENSG00000197106 1 110680469-110680568 + |
## X1_11_110681677.ENSG00000197106 1 110682125-110682224 + |
## X1_11_110756560.ENSG00000116396 1 110756823-110756922 + |
## peak feature start_position
## <character> <character> <numeric>
## X1_11_100272487.ENSG00000202254 1_11_100272487 ENSG00000202254 100257218
## X1_11_108905539.ENSG00000186086 1_11_108905539 ENSG00000186086 108918435
## X1_11_110106925.ENSG00000065135 1_11_110106925 ENSG00000065135 110091233
## X1_11_110679983.ENSG00000197106 1_11_110679983 ENSG00000197106 110693108
## X1_11_110681677.ENSG00000197106 1_11_110681677 ENSG00000197106 110693108
## X1_11_110756560.ENSG00000116396 1_11_110756560 ENSG00000116396 110753965
## end_position insideFeature distancetoFeature
## <numeric> <character> <numeric>
## X1_11_100272487.ENSG00000202254 100257309 downstream 15582
## X1_11_108905539.ENSG00000186086 109013624 upstream -12410
## X1_11_110106925.ENSG00000065135 110136975 inside 16033
## X1_11_110679983.ENSG00000197106 110744824 upstream -12640
## X1_11_110681677.ENSG00000197106 110744824 upstream -10984
## X1_11_110756560.ENSG00000116396 110776666 inside 2857
## shortestDistance fromOverlappingOrNearest
## <numeric> <character>
## X1_11_100272487.ENSG00000202254 15491 NearestStart
## X1_11_108905539.ENSG00000186086 12310 NearestStart
## X1_11_110106925.ENSG00000065135 16033 NearestStart
## X1_11_110679983.ENSG00000197106 12540 NearestStart
## X1_11_110681677.ENSG00000197106 10884 NearestStart
## X1_11_110756560.ENSG00000116396 2857 NearestStart
## symbol
## <character>
## X1_11_100272487.ENSG00000202254 <NA>
## X1_11_108905539.ENSG00000186086 NBPF6
## X1_11_110106925.ENSG00000065135 GNAI3
## X1_11_110679983.ENSG00000197106 SLC6A17
## X1_11_110681677.ENSG00000197106 SLC6A17
## X1_11_110756560.ENSG00000116396 KCNC4
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
addGeneIDs(annotatedPeak$feature[1:6], orgAnn="org.Hs.eg.db",
IDs2Add=c("symbol"))
## ensembl_gene_id symbol
## 1 ENSG00000065135 GNAI3
## 2 ENSG00000116396 KCNC4
## 3 ENSG00000197106 SLC6A17
## 4 ENSG00000186086 NBPF6
## 5 ENSG00000202254 <NA>
Here is an example to get the sequences of the peaks plus 20 bp upstream and downstream for PCR validation or motif discovery.
peaks <- GRanges(seqnames=c("NC_008253", "NC_010468"),
ranges=IRanges(start=c(100, 500),
end=c(300, 600),
names=c("peak1", "peak2")))
library(BSgenome.Ecoli.NCBI.20080805)
peaksWithSequences <- getAllPeakSequence(peaks, upstream=20,
downstream=20, genome=Ecoli)
The obtained sequences can be converted to fasta format for motif discovery by calling the function write2FASTA
.
write2FASTA(peaksWithSequences,"test.fa")
You can easily visualize and compare the binding patterns of raw signals of multiple ChIP-Seq experiments using function featureAlignedHeatmap
and featureAlignedDistribution
.
path <- system.file("extdata", package="ChIPpeakAnno")
files <- dir(path, "broadPeak")
data <- sapply(file.path(path, files), toGRanges, format="broadPeak")
names(data) <- gsub(".broadPeak", "", files)
ol <- findOverlapsOfPeaks(data)
#makeVennDiagram(ol)
features <- ol$peaklist[[length(ol$peaklist)]]
wid <- width(features)
feature.recentered <- feature.center <- features
start(feature.center) <- start(features) + floor(wid/2)
width(feature.center) <- 1
start(feature.recentered) <- start(feature.center) - 2000
end(feature.recentered) <- end(feature.center) + 2000
## here we also suggest importData function in bioconductor trackViewer package
## to import the coverage.
## compare rtracklayer, it will save you time when handle huge dataset.
library(rtracklayer)
files <- dir(path, "bigWig")
if(.Platform$OS.type != "windows"){
cvglists <- sapply(file.path(path, files), import,
format="BigWig",
which=feature.recentered,
as="RleList")
}else{## rtracklayer can not import bigWig files on Windows
load(file.path(path, "cvglist.rds"))
}
names(cvglists) <- gsub(".bigWig", "", files)
sig <- featureAlignedSignal(cvglists, feature.center,
upstream=2000, downstream=2000)
heatmap <- featureAlignedHeatmap(sig, feature.center,
upstream=2000, downstream=2000,
upper.extreme=c(3,.5,4))
featureAlignedDistribution(sig, feature.center,
upstream=2000, downstream=2000,
type="l")
Here is an example to search the motifs in the binding peaks. The motif patterns to be searched are saved in the file examplepattern.fa.
peaks <- GRanges(seqnames=c("NC_008253", "NC_010468"),
ranges=IRanges(start=c(100, 500),
end=c(300, 600),
names=c("peak1", "peak2")))
filepath <- system.file("extdata", "examplePattern.fa", package="ChIPpeakAnno")
readLines(filepath)
## [1] ">ExamplePattern" "GGNCCK" ">ExamplePattern" "AACCNM"
library(BSgenome.Ecoli.NCBI.20080805)
summarizePatternInPeaks(patternFilePath=filepath, format="fasta", skip=0L,
BSgenomeName=Ecoli, peaks=peaks)
## ExamplePattern GGNCCK not found in the input sequences!
## chr motifStart motifEnd motif name motif motifStartOffset motifEndOffset
## 1 1 135 140 ExamplePattern AACCNM 36 41
## motif found motifFoundStrand seqnames start end width strand
## 1 AACCAA + NC_008253 100 300 201 *
With the annotated peak data, you can call the function getEnrichedGO
to obtain a list of enriched GO terms. For pathway analysis, you can call function getEnrichedPATH
using reactome or KEGG database.
In the following sample code, we used a subset of the annotatedPeak (the first 500 peaks) for demonstration. All annotated peaks should be used in the real situation.
library(org.Hs.eg.db)
over <- getEnrichedGO(annotatedPeak[1:500], orgAnn="org.Hs.eg.db",
maxP=0.01, minGOterm=10,
multiAdjMethod="BH",
condense=FALSE)
head(over[["bp"]][, -3])
## [1] go.id go.term Ontology
## [4] pvalue count.InDataset count.InGenome
## [7] totaltermInDataset totaltermInGenome BH.adjusted.p.value
## [10] EntrezID
## <0 rows> (or 0-length row.names)
head(over[["cc"]][, -3])
## [1] go.id go.term Ontology
## [4] pvalue count.InDataset count.InGenome
## [7] totaltermInDataset totaltermInGenome BH.adjusted.p.value
## [10] EntrezID
## <0 rows> (or 0-length row.names)
head(over[["mf"]][, -3])
## [1] go.id go.term Ontology
## [4] pvalue count.InDataset count.InGenome
## [7] totaltermInDataset totaltermInGenome BH.adjusted.p.value
## [10] EntrezID
## <0 rows> (or 0-length row.names)
Please note that the default setting of feature_id_type is “ensembl_gene_id”. If you are using TxDb as annotation data, please set feature id type to “entrez_id”.
Please also note that org.Hs.eg.db is the GO gene mapping for Human, for other organisms, please refer to released organism annotations, or call function egOrgMap
to get the name of annotation database. For example, here is how to obtain the GO gene mapping for mouse and human.
egOrgMap("Mus musculus")
## [1] "org.Mm.eg.db"
egOrgMap("Homo sapiens")
## [1] "org.Hs.eg.db"
To obtain enriched the pathways, use the following sample code.
library(reactome.db)
enriched.PATH <-
getEnrichedPATH(annotatedPeak[1:500],
orgAnn="org.Hs.eg.db",
feature_id_type="ensembl_gene_id",
pathAnn="reactome.db",
maxP=0.01,
minPATHterm=10,
multiAdjMethod=NULL)#Try to change the method to BH
To add gene symbols to the enriched pathways. Use below.
```r {getEnrichedPath2} ann <- addGeneIDs(enriched.PATH[,2], feature_id_type = “entrez_id”, orgAnn = org.Hs.eg.db, IDs2Add = “symbol”) # enriched.PATH <- merge(ann, enriched.PATH, by.x = “entrez_id”, by.y = “EntrezID”) # before v3.31.1
enriched.PATH <- merge(ann, enriched.PATH, by.x = “entrez_id”, by.y = “entrez_id”) # after v3.31.1
head(enriched.PATH)
## Find peaks with bi-directional promoters
Bidirectional promoters are the DNA regions located between the transcription start sites (TSS) of two adjacent genes that are transcribed on the opposite directions and often co-regulated by this shared promoter region[@robertson2007].
Here is an example to find peaks near bi-directional promoters and
output the percentage of the peaks near bi-directional promoters.
```r
data(myPeakList)
data(TSS.human.NCBI36)
seqlevelsStyle(TSS.human.NCBI36) <- seqlevelsStyle(myPeakList)
annotatedBDP <- peaksNearBDP(myPeakList[1:10,],
AnnotationData=TSS.human.NCBI36,
MaxDistance=5000)
annotatedBDP$peaksWithBDP
## GRangesList object of length 2:
## $`1`
## GRanges object with 2 ranges and 9 metadata columns:
## seqnames ranges strand | bdp_idx peak
## <Rle> <IRanges> <Rle> | <integer> <character>
## X1_93_556427 chr1 556660-556760 * | 1 X1_93_556427
## X1_93_556427 chr1 556660-556760 * | 1 X1_93_556427
## feature feature.ranges feature.strand distance
## <character> <IRanges> <Rle> <integer>
## X1_93_556427 ENSG00000209349 556240-556304 - 355
## X1_93_556427 ENSG00000209351 557933-557999 + 1172
## insideFeature distanceToSite description
## <character> <integer> <character>
## X1_93_556427 upstream 355
## X1_93_556427 upstream 1172
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
##
## $`8`
## GRanges object with 2 ranges and 9 metadata columns:
## seqnames ranges strand | bdp_idx peak
## <Rle> <IRanges> <Rle> | <integer> <character>
## X1_14_1300250 chr1 1300503-1300603 * | 8 X1_14_1300250
## X1_14_1300250 chr1 1300503-1300603 * | 8 X1_14_1300250
## feature feature.ranges feature.strand distance
## <character> <IRanges> <Rle> <integer>
## X1_14_1300250 ENSG00000175756 1298974-1300443 - 59
## X1_14_1300250 ENSG00000218550 1303908-1304275 + 3304
## insideFeature distanceToSite description
## <character> <integer> <character>
## X1_14_1300250 upstream 59 Aurora kinase A-inte..
## X1_14_1300250 upstream 3304
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
c(annotatedBDP$percentPeaksWithBDP,
annotatedBDP$n.peaks,
annotatedBDP$n.peaksWithBDP)
## [1] 0.2 10.0 2.0
Given two peak lists from two transcript factors (TFs), one common question is whether there is a significant overlap between DNA binding sites of the two TFs, which can be determined using hypergeometric test. As we have discussed in section 4.1, the hypergeometric test requires users to input an estimate of the total potential binding sites for a given TF. To circumvent this requirement, we implemented a permutation test called peakPermTest
. Before performing a permutation test, users need to generate a random peak list using the distribution discovered from the input peaks for a given feature type (transcripts or exons), to make sure the binding positions relative to features, such as TSS and geneEnd, and the width of the random peaks follow the distribution of that of the input peaks.
Following are the sample codes to do the peakPermTest
:
if(interactive()){
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
cds <- unique(unlist(cdsBy(txdb)))
utr5 <- unique(unlist(fiveUTRsByTranscript(txdb)))
utr3 <- unique(unlist(threeUTRsByTranscript(txdb)))
set.seed(123)
utr3 <- utr3[sample.int(length(utr3), 1000)]
pt <- peakPermTest(utr3,
utr5[sample.int(length(utr5), 1000)],
maxgap=500,
TxDb=txdb, seed=1,
force.parallel=FALSE)
plot(pt)
## highly relevant peaks
ol <- findOverlaps(cds, utr3, maxgap=1)
pt1 <- peakPermTest(utr3,
c(cds[sample.int(length(cds), 500)],
cds[queryHits(ol)][sample.int(length(ol), 500)]),
maxgap=500,
TxDb=txdb, seed=1,
force.parallel=FALSE)
plot(pt1)
}
Alternatively, a peak pool representing all potential binding sites can be created with associated binding probabilities using random peak sampling using preparePool
. Here is an example to build a peak pool for human genome using the transcription factor binding site clusters (V3) (see ?wgEncodeTfbsV3
) downloaded from ENCODE with the HOT spots (?HOT.spots
) removed. HOT spots are the genomic regions with high probability of being bound by many TFs in ChIP-seq experiments9. We suggest remove those HOT spots from the peak lists before performing permutation test to avoid the overestimation of the association between two input peak lists. Users can also choose to remove ENCODE blacklist for a given species. The blacklists were constructed by identifying consistently problematic regions over independent cell lines and types of experiments for each species in the ENCODE and modENCODE datasets10. Please note that some of the blacklists may need to be converted to the correct genome assembly using liftover utility. Following are the sample codes to do the permutation test using peakPermTest
:
if(interactive()){
data(HOT.spots)
data(wgEncodeTfbsV3)
hotGR <- reduce(unlist(HOT.spots))
removeOl <- function(.ele){
ol <- findOverlaps(.ele, hotGR)
if(length(ol)>0) .ele <- .ele[-unique(queryHits(ol))]
.ele
}
temp <- tempfile()
download.file(file.path("http://hgdownload.cse.ucsc.edu",
"goldenPath", "hg19", "encodeDCC",
"wgEncodeRegTfbsClustered",
"wgEncodeRegTfbsClusteredV3.bed.gz"), temp)
data <- toGRanges(gzfile(temp, "r"), header=FALSE, format="others",
colNames = c("seqnames", "start", "end", "TF"))
unlink(temp)
data <- split(data, data$TF)
TAF1 <- removeOl(data[["TAF1"]])
TEAD4 <- removeOl(data[["TEAD4"]])
pool <- new("permPool", grs=GRangesList(wgEncodeTfbsV3), N=length(TAF1))
pt <- peakPermTest(TAF1, TEAD4, pool=pool, ntimes=1000)
plot(pt)
}
Please cite ChIPpeakAnno
in your publication as follows:
## Please cite the paper below for the ChIPpeakAnno package.
##
## Lihua J Zhu, Claude Gazin, Nathan D Lawson, Herve Pages, Simon M Lin,
## David S Lapointe and Michael R Green, ChIPpeakAnno: a Bioconductor
## package to annotate ChIP-seq and ChIP-chip data. BMC Bioinformatics.
## 2010, 11:237
##
## Zhu LJ. Integrative analysis of ChIP-chip and ChIP-seq dataset.
## Methods Mol Biol. 2013;1067:105-24.
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [2] reactome.db_1.84.0
## [3] BSgenome.Ecoli.NCBI.20080805_1.3.1000
## [4] BSgenome_1.68.0
## [5] Biostrings_2.68.0
## [6] XVector_0.40.0
## [7] EnsDb.Hsapiens.v75_2.99.0
## [8] ensembldb_2.24.0
## [9] AnnotationFilter_1.24.0
## [10] GO.db_3.17.0
## [11] rtracklayer_1.60.0
## [12] TxDb.Hsapiens.UCSC.hg38.knownGene_3.17.0
## [13] GenomicFeatures_1.52.0
## [14] org.Hs.eg.db_3.17.0
## [15] AnnotationDbi_1.62.0
## [16] Biobase_2.60.0
## [17] ChIPpeakAnno_3.34.1
## [18] GenomicRanges_1.52.0
## [19] GenomeInfoDb_1.36.0
## [20] IRanges_2.34.0
## [21] S4Vectors_0.38.0
## [22] BiocGenerics_0.46.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.1.3 bitops_1.0-7
## [3] RBGL_1.76.0 formatR_1.14
## [5] biomaRt_2.56.0 rlang_1.1.0
## [7] magrittr_2.0.3 matrixStats_0.63.0
## [9] compiler_4.3.0 RSQLite_2.3.1
## [11] png_0.1-8 vctrs_0.6.2
## [13] ProtGenerics_1.32.0 stringr_1.5.0
## [15] pkgconfig_2.0.3 crayon_1.5.2
## [17] fastmap_1.1.1 dbplyr_2.3.2
## [19] utf8_1.2.3 Rsamtools_2.16.0
## [21] rmarkdown_2.21 graph_1.78.0
## [23] bit_4.0.5 xfun_0.39
## [25] zlibbioc_1.46.0 cachem_1.0.7
## [27] jsonlite_1.8.4 progress_1.2.2
## [29] blob_1.2.4 highr_0.10
## [31] DelayedArray_0.26.0 BiocParallel_1.34.0
## [33] parallel_4.3.0 prettyunits_1.1.1
## [35] R6_2.5.1 bslib_0.4.2
## [37] stringi_1.7.12 jquerylib_0.1.4
## [39] Rcpp_1.0.10 SummarizedExperiment_1.30.0
## [41] knitr_1.42 VennDiagram_1.7.3
## [43] Matrix_1.5-4 splines_4.3.0
## [45] tidyselect_1.2.0 yaml_2.3.7
## [47] codetools_0.2-19 curl_5.0.0
## [49] lattice_0.21-8 tibble_3.2.1
## [51] regioneR_1.32.0 InteractionSet_1.28.0
## [53] KEGGREST_1.40.0 evaluate_0.20
## [55] lambda.r_1.2.4 survival_3.5-5
## [57] futile.logger_1.4.3 BiocFileCache_2.8.0
## [59] xml2_1.3.4 pillar_1.9.0
## [61] BiocManager_1.30.20 filelock_1.0.2
## [63] MatrixGenerics_1.12.0 generics_0.1.3
## [65] RCurl_1.98-1.12 hms_1.1.3
## [67] ggplot2_3.4.2 munsell_0.5.0
## [69] scales_1.2.1 BiocStyle_2.28.0
## [71] glue_1.6.2 lazyeval_0.2.2
## [73] tools_4.3.0 BiocIO_1.10.0
## [75] mirbase.db_1.2.0 GenomicAlignments_1.36.0
## [77] FDb.UCSC.tRNAs_1.0.1 XML_3.99-0.14
## [79] grid_4.3.0 colorspace_2.1-0
## [81] GenomeInfoDbData_1.2.10 restfulr_0.0.15
## [83] cli_3.6.1 rappdirs_0.3.3
## [85] futile.options_1.0.1 fansi_1.0.4
## [87] dplyr_1.1.2 gtable_0.3.3
## [89] sass_0.4.5 digest_0.6.31
## [91] rjson_0.2.21 memoise_2.0.1
## [93] htmltools_0.5.5 multtest_2.56.0
## [95] lifecycle_1.0.3 httr_1.4.5
## [97] bit64_4.0.5 MASS_7.3-59
1. Gentleman, R. et al. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 5, R80 (2004).
2. Durinck, S. et al. BioMart and bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics 21, 3439–3440 (2005).
3. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, pp. 289–300 (1995).
4. Benjamini, Y. & Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29, 1165–1188 (2001).
5. Johnson, N. L., Kemp, A. W. & Kotz, S. Univariate discrete distributions. 444, (John Wiley & Sons, 2005).
6. Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, pp. 65–70 (1979).
7. Hochberg, Y. A sharper bonferroni procedure for multiple tests of significance. Biometrika 75, 800–802 (1988).
8. Dudoit, S., Shaffer, J. P. & Boldrick, J. C. Multiple hypothesis testing in microarray experiments. Statist. Sci. 18, 71–103 (2003).
9. Yip, K. Y. et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol 13, R48 (2012).
10. Consortium, E. P. & others. An integrated encyclopedia of dna elements in the human genome. Nature 489, 57–74 (2012).