ChIPseeker is an R package for annotating ChIP-seq data analysis. It supports annotating ChIP peaks and provides functions to visualize ChIP peaks coverage over chromosomes and profiles of peaks binding to TSS regions. Comparison of ChIP peak profiles and annotation are also supported. Moreover, it supports evaluating significant overlap among ChIP-seq datasets. Currently, ChIPseeker contains 17,000 bed file information from GEO database. These datasets can be downloaded and compare with user’s own data to explore significant overlap datasets for inferring co-regulation or transcription factor complex for further investigation.
If you use ChIPseeker1 in published research, please cite:
G Yu, LG Wang, QY He. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 2015, 31(14):2382-2383. doi:[10.1093/bioinformatics/btv145](http://dx.doi.org/10.1093/bioinformatics/btv145)
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) has become standard technologies for genome wide identification of DNA-binding protein target sites. After read mappings and peak callings, the peak should be annotated to answer the biological questions. Annotation also create the possibility of integrate expression profile data to predict gene expression regulation. ChIPseeker1 was developed for annotating nearest genes and genomic features to peaks.
ChIP peak data set comparison is also very important. We can use it as an index to estimate how well biological replications are. Even more important is applying to infer cooperative regulation. If two ChIP seq data, obtained by two different binding proteins, overlap significantly, these two proteins may form a complex or have interaction in regulation chromosome remodelling or gene expression. ChIPseeker1 support statistical testing of significant overlap among ChIP seq data sets, and incorporate open access database GEO for users to compare their own dataset to those deposited in database. Protein interaction hypothesis can be generated by mining data deposited in database. Converting genome coordinations from one genome version to another is also supported, making this comparison available for different genome version and different species.
Several visualization functions are implemented to visualize the coverage of the ChIP seq data, peak annotation, average profile and heatmap of peaks binding to TSS region.
Functional enrichment analysis of the peaks can be performed by my Bioconductor packages DOSE2, ReactomePA3, clusterProfiler4.
## loading packages
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(clusterProfiler)
The datasets CBX6 and CBX7 in this vignettes were downloaded from GEO (GSE40740)5 while ARmo_0M, ARmo_1nM and ARmo_100nM were downloaded from GEO (GSE48308)6 . ChIPseeker provides readPeakFile
to load the peak and store in GRanges
object.
files <- getSampleFiles()
print(files)
## $ARmo_0M
## [1] "/tmp/RtmpUWv1jJ/Rinst7c751b35c329/ChIPseeker/extdata/GEO_sample_data/GSM1174480_ARmo_0M_peaks.bed.gz"
##
## $ARmo_1nM
## [1] "/tmp/RtmpUWv1jJ/Rinst7c751b35c329/ChIPseeker/extdata/GEO_sample_data/GSM1174481_ARmo_1nM_peaks.bed.gz"
##
## $ARmo_100nM
## [1] "/tmp/RtmpUWv1jJ/Rinst7c751b35c329/ChIPseeker/extdata/GEO_sample_data/GSM1174482_ARmo_100nM_peaks.bed.gz"
##
## $CBX6_BF
## [1] "/tmp/RtmpUWv1jJ/Rinst7c751b35c329/ChIPseeker/extdata/GEO_sample_data/GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz"
##
## $CBX7_BF
## [1] "/tmp/RtmpUWv1jJ/Rinst7c751b35c329/ChIPseeker/extdata/GEO_sample_data/GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz"
peak <- readPeakFile(files[[4]])
peak
## GRanges object with 1331 ranges and 2 metadata columns:
## seqnames ranges strand | V4 V5
## <Rle> <IRanges> <Rle> | <factor> <numeric>
## [1] chr1 [ 815093, 817883] * | MACS_peak_1 295.76
## [2] chr1 [1243288, 1244338] * | MACS_peak_2 63.19
## [3] chr1 [2979977, 2981228] * | MACS_peak_3 100.16
## [4] chr1 [3566182, 3567876] * | MACS_peak_4 558.89
## [5] chr1 [3816546, 3818111] * | MACS_peak_5 57.57
## ... ... ... ... . ... ...
## [1327] chrX [135244783, 135245821] * | MACS_peak_1327 55.54
## [1328] chrX [139171964, 139173506] * | MACS_peak_1328 270.19
## [1329] chrX [139583954, 139586126] * | MACS_peak_1329 918.73
## [1330] chrX [139592002, 139593238] * | MACS_peak_1330 210.88
## [1331] chrY [ 13845134, 13845777] * | MACS_peak_1331 58.39
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
After peak calling, we would like to know the peak locations over the whole genome, covplot
function calculates the coverage of peak regions over chromosomes and generate a figure to visualize. GRangesList is also supported and can be used to compare coverage of multiple bed files.
covplot(peak, weightCol="V5")
covplot(peak, weightCol="V5", chrs=c("chr17", "chr18"), xlim=c(4.5e7, 5e7))
First of all, for calculate the profile of ChIP peaks binding to TSS regions, we should prepare the TSS regions, which are defined as the flanking sequence of the TSS sites. Then align the peaks that are mapping to these regions, and generate the tagMatrix.
## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrix <- getTagMatrix(peak, windows=promoter)
##
## to speed up the compilation of this vignettes, we use a precalculated tagMatrix
data("tagMatrixList")
tagMatrix <- tagMatrixList[[4]]
In the above code, you should notice that tagMatrix is not restricted to TSS regions. The regions can be other types that defined by the user.
tagHeatmap(tagMatrix, xlim=c(-3000, 3000), color="red")
ChIPseeker provide a one step function to generate this figure from bed file. The following function will generate the same figure as above.
peakHeatmap(files[[4]], TxDb=txdb, upstream=3000, downstream=3000, color="red")
plotAvgProf(tagMatrix, xlim=c(-3000, 3000),
xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")
The function plotAvgProf2
provide a one step from bed file to average profile plot. The following command will generate the same figure as shown above.
plotAvgProf2(files[[4]], TxDb=txdb, upstream=3000, downstream=3000,
xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")
Confidence interval estimated by bootstrap method is also supported for characterizing ChIP binding profiles.
plotAvgProf(tagMatrix, xlim=c(-3000, 3000), conf = 0.95, resample = 1000)
Referring to the issue #16, we developed getBioRegion
function to support centering all peaks to the start region of Exon/Intron. Users can also create heatmap or average profile of ChIP peaks binding to these regions.
peakAnno <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000),
TxDb=txdb, annoDb="org.Hs.eg.db")
## >> loading peak file... 2016-12-16 07:32:40 PM
## >> preparing features information... 2016-12-16 07:32:40 PM
## >> identifying nearest features... 2016-12-16 07:32:41 PM
## >> calculating distance from peak to TSS... 2016-12-16 07:32:41 PM
## >> assigning genomic annotation... 2016-12-16 07:32:41 PM
## >> adding gene annotation... 2016-12-16 07:32:58 PM
## >> assigning chromosome lengths 2016-12-16 07:32:59 PM
## >> done... 2016-12-16 07:32:59 PM
Peak Annotation is performed by annotatePeak
. User can define TSS (transcription start site) region, by default TSS is defined from -3kb to +3kb. The output of annotatePeak
is csAnno
instance. ChIPseeker provides as.GRanges
to convert csAnno
to GRanges
instance, and as.data.frame
to convert csAnno
to data.frame
which can be exported to file by write.table
.
TxDb
object contained transcript-related features of a particular genome. Bioconductor provides several package that containing TxDb
object of model organisms with multiple commonly used genome version, for instance TxDb.Hsapiens.UCSC.hg38.knownGene, TxDb.Hsapiens.UCSC.hg19.knownGene for human genome hg38 and hg19, TxDb.Mmusculus.UCSC.mm10.knownGene and TxDb.Mmusculus.UCSC.mm9.knownGene for mouse genome mm10 and mm9, etc. User can also prepare their own TxDb
object by retrieving information from UCSC Genome Bioinformatics and BioMart data resources by R function makeTxDbFromBiomart
and makeTxDbFromUCSC
. TxDb
object should be passed for peak annotation.
All the peak information contained in peakfile will be retained in the output of annotatePeak
. The position and strand information of nearest genes are reported. The distance from peak to the TSS of its nearest gene is also reported. The genomic region of the peak is reported in annotation column. Since some annotation may overlap, ChIPseeker adopted the following priority in genomic annotation.
Downstream is defined as the downstream of gene end.
ChIPseeker also provides parameter genomicAnnotationPriority for user to prioritize this hierachy.
annotatePeak
report detail information when the annotation is Exon or Intron, for instance “Exon (uc002sbe.3/9736, exon 69 of 80)”, means that the peak is overlap with an Exon of transcript uc002sbe.3, and the corresponding Entrez gene ID is 9736 (Transcripts that belong to the same gene ID may differ in splice events), and this overlaped exon is the 69th exon of the 80 exons that this transcript uc002sbe.3 prossess.
Parameter annoDb is optional, if provided, extra columns including SYMBOL, GENENAME, ENSEMBL/ENTREZID will be added. The geneId column in annotation output will be consistent with the geneID in TxDb. If it is ENTREZID, ENSEMBL will be added if annoDb is provided, while if it is ENSEMBL ID, ENTREZID will be added.
To annotate the location of a given peak in terms of genomic features, annotatePeak
assigns peaks to genomic annotation in “annotation” column of the output, which includes whether a peak is in the TSS, Exon, 5’ UTR, 3’ UTR, Intronic or Intergenic. Many researchers are very interesting in these annotations. TSS region can be defined by user and annotatePeak
output in details of which exon/intron of which genes as illustrated in previous section.
Pie and Bar plot are supported to visualize the genomic annotation.
plotAnnoPie(peakAnno)
plotAnnoBar(peakAnno)
Since some annotation overlap, user may interested to view the full annotation with their overlap, which can be partially resolved by vennpie
function.
vennpie(peakAnno)
We extend UpSetR to view full annotation overlap. User can user upsetplot
function.
upsetplot(peakAnno)
We can combine vennpie
with upsetplot
by setting vennpie = TRUE.
upsetplot(peakAnno, vennpie=TRUE)
The distance from the peak (binding site) to the TSS of the nearest gene is calculated by annotatePeak
and reported in the output. We provide plotDistToTSS
to calculate the percentage of binding sites upstream and downstream from the TSS of the nearest genes, and visualize the distribution.
plotDistToTSS(peakAnno,
title="Distribution of transcription factor-binding loci\nrelative to TSS")
Once we have obtained the annotated nearest genes, we can perform functional enrichment analysis to identify predominant biological themes among these genes by incorporating biological knowledge provided by biological ontologies. For instance, Gene Ontology (GO)7 annotates genes to biological processes, molecular functions, and cellular components in a directed acyclic graph structure, Kyoto Encyclopedia of Genes and Genomes (KEGG)8 annotates genes to pathways, Disease Ontology (DO)9 annotates genes with human disease association, and Reactome10 annotates gene to pathways and reactions.
ChIPseeker also provides a function, seq2gene, for linking genomc regions to genes in a many-to-many mapping. It consider host gene (exon/intron), promoter region and flanking gene from intergenic region that may under control via cis-regulation. This function is designed to link both coding and non-coding genomic regions to coding genes and facilitate functional analysis.
Enrichment analysis is a widely used approach to identify biological themes. I have developed several Bioconductor packages for investigating whether the number of selected genes associated with a particular biological term is larger than expected, including DOSE2 for Disease Ontology, ReactomePA for reactome pathway, clusterProfiler4 for Gene Ontology and KEGG enrichment analysis.
library(ReactomePA)
pathway1 <- enrichPathway(as.data.frame(peakAnno)$geneId)
head(pathway1, 2)
## ID Description GeneRatio BgRatio pvalue
## 1266738 1266738 Developmental Biology 44/291 438/6748 7.850446e-08
## 195721 195721 Signaling by Wnt 28/291 245/6748 1.840754e-06
## p.adjust qvalue
## 1266738 4.985033e-05 4.985033e-05
## 195721 5.844395e-04 5.844395e-04
## geneID
## 1266738 27022/1944/1942/2494/55740/4838/219699/3983/1793/6712/7225/64221/1280/3651/1282/8874/3175/4756/56963/9423/9611/6928/9969/6929/390874/2034/785/1290/3642/4821/57167/1291/6092/2042/5361/6657/2049/2044/4825/57556/1879/1808/63978/9282
## 195721 79971/607/5692/23401/55553/144165/11197/5717/6662/57690/5714/9736/7474/6657/51176/6423/5145/6885/23072/51384/83595/7976/64321/340419/23513/7088/4920/6658
## Count
## 1266738 44
## 195721 28
gene <- seq2gene(peak, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathway2 <- enrichPathway(gene)
head(pathway2, 2)
## ID Description GeneRatio BgRatio pvalue
## 1266738 1266738 Developmental Biology 33/235 438/6748 1.792286e-05
## 195721 195721 Signaling by Wnt 22/235 245/6748 3.826936e-05
## p.adjust qvalue
## 1266738 0.01073579 0.01071598
## 195721 0.01093603 0.01091585
## geneID
## 1266738 27022/2494/55740/4838/219699/3983/7225/1280/3651/56963/9423/9611/2034/785/4821/1291/6092/2042/4825/57556/63978/1944/1282/8874/6928/57167/2044/1808/1942/4756/9969/9282/3642
## 195721 5692/23401/55553/144165/11197/5717/6662/5714/9736/6423/23072/51384/340419/4920/6658/607/5145/7976/51176/64321/23513/7088
## Count
## 1266738 33
## 195721 22
dotplot(pathway2)
More information can be found in the vignettes of Bioconductor packages DOSE2, ReactomePA, clusterProfiler4, which also provide several methods to visualize enrichment results. The clusterProfiler4 is designed for comparing and visualizing functional profiles among gene clusters, and can directly applied to compare biological themes at GO, DO, KEGG, Reactome perspective.
Function plotAvgProf
and tagHeatmap
can accept a list of tagMatrix
and visualize profile or heatmap among several ChIP experiments, while plotAvgProf2
and peakHeatmap
can accept a list of bed files and perform the same task in one step.
## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrixList <- lapply(files, getTagMatrix, windows=promoter)
##
## to speed up the compilation of this vigenette, we load a precaculated tagMatrixList
data("tagMatrixList")
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000))
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000), conf=0.95,resample=500, facet="row")