Contents

1 Abstract

ChIPseeker is an R package for annotating ChIP-seq data analysis. It supports annotating ChIP peaks and provides functions to visualize ChIP peaks coverage over chromosomes and profiles of peaks binding to TSS regions. Comparison of ChIP peak profiles and annotation are also supported. Moreover, it supports evaluating significant overlap among ChIP-seq datasets. Currently, ChIPseeker contains 17,000 bed file information from GEO database. These datasets can be downloaded and compare with user’s own data to explore significant overlap datasets for inferring co-regulation or transcription factor complex for further investigation.

2 Citation

If you use ChIPseeker1 in published research, please cite G. Yu (2015). In addition please cite G. Yu (2012) and G. Yu (2015) when performing enrichment analysis by using clusterProfiler and DOSE.

G Yu, LG Wang, QY He.
ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization.
Bioinformatics 2015, 31(14):2382-2383.

URL: http://dx.doi.org/10.1093/bioinformatics/btv145

G Yu, LG Wang, Y Han, QY He.
clusterProfiler: an R package for comparing biological themes among gene clusters.
OMICS: A Journal of Integrative Biology 2012, 16(5):284-287.

URL: http://dx.doi.org/10.1089/omi.2011.0118

G Yu, LG Wang, GR Yan, QY He.
DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis.
Bioinformatics 2015, 31(4):608-609.

URL: http://dx.doi.org/10.1093/bioinformatics/btu684

3 Introduction

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) has become standard technologies for genome wide identification of DNA-binding protein target sites. After read mappings and peak callings, the peak should be annotated to answer the biological questions. Annotation also create the possibility of integrate expression profile data to predict gene expression regulation. ChIPseeker1 was developed for annotating nearest genes and genomic features to peaks.

ChIP peak data set comparison is also very important. We can use it as an index to estimate how well biological replications are. Even more important is applying to infer cooperative regulation. If two ChIP seq data, obtained by two different binding proteins, overlap significantly, these two proteins may form a complex or have interaction in regulation chromosome remodelling or gene expression. ChIPseeker1 support statistical testing of significant overlap among ChIP seq data sets, and incorporate open access database GEO for users to compare their own dataset to those deposited in database. Protein interaction hypothesis can be generated by mining data deposited in database. Converting genome coordinations from one genome version to another is also supported, making this comparison available for different genome version and different species.

Several visualization functions are implemented to visualize the coverage of the ChIP seq data, peak annotation, average profile and heatmap of peaks binding to TSS region.

Functional enrichment analysis of the peaks can be performed by my Bioconductor packages DOSE2, ReactomePA, clusterProfiler3.

## loading packages
require(ChIPseeker)
require(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
require(clusterProfiler)

4 ChIP profiling

The datasets CBX6 and CBX7 in this vignettes were downloaded from GEO (GSE40740)4 while ARmo_0M, ARmo_1nM and ARmo_100nM were downloaded from GEO (GSE48308)5 . ChIPseeker provides readPeakFile to load the peak and store in GRanges object.

files <- getSampleFiles()
print(files)
## $ARmo_0M
## [1] "/tmp/RtmpG302jm/Rinst6e0a3f3ff3d5/ChIPseeker/extdata/GEO_sample_data/GSM1174480_ARmo_0M_peaks.bed.gz"
## 
## $ARmo_1nM
## [1] "/tmp/RtmpG302jm/Rinst6e0a3f3ff3d5/ChIPseeker/extdata/GEO_sample_data/GSM1174481_ARmo_1nM_peaks.bed.gz"
## 
## $ARmo_100nM
## [1] "/tmp/RtmpG302jm/Rinst6e0a3f3ff3d5/ChIPseeker/extdata/GEO_sample_data/GSM1174482_ARmo_100nM_peaks.bed.gz"
## 
## $CBX6_BF
## [1] "/tmp/RtmpG302jm/Rinst6e0a3f3ff3d5/ChIPseeker/extdata/GEO_sample_data/GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz"
## 
## $CBX7_BF
## [1] "/tmp/RtmpG302jm/Rinst6e0a3f3ff3d5/ChIPseeker/extdata/GEO_sample_data/GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz"
peak <- readPeakFile(files[[4]])
peak
## GRanges object with 1331 ranges and 2 metadata columns:
##          seqnames                 ranges strand   |             V4
##             <Rle>              <IRanges>  <Rle>   |       <factor>
##      [1]     chr1     [ 815093,  817883]      *   |    MACS_peak_1
##      [2]     chr1     [1243288, 1244338]      *   |    MACS_peak_2
##      [3]     chr1     [2979977, 2981228]      *   |    MACS_peak_3
##      [4]     chr1     [3566182, 3567876]      *   |    MACS_peak_4
##      [5]     chr1     [3816546, 3818111]      *   |    MACS_peak_5
##      ...      ...                    ...    ... ...            ...
##   [1327]     chrX [135244783, 135245821]      *   | MACS_peak_1327
##   [1328]     chrX [139171964, 139173506]      *   | MACS_peak_1328
##   [1329]     chrX [139583954, 139586126]      *   | MACS_peak_1329
##   [1330]     chrX [139592002, 139593238]      *   | MACS_peak_1330
##   [1331]     chrY [ 13845134,  13845777]      *   | MACS_peak_1331
##                 V5
##          <numeric>
##      [1]    295.76
##      [2]     63.19
##      [3]    100.16
##      [4]    558.89
##      [5]     57.57
##      ...       ...
##   [1327]     55.54
##   [1328]    270.19
##   [1329]    918.73
##   [1330]    210.88
##   [1331]     58.39
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

4.1 ChIP peaks coverage plot

After peak calling, we would like to know the peak locations over the whole genome, covplot function calculates the coverage of peak regions over chromosomes and generate a figure to visualize.

covplot(peak, weightCol="V5")

covplot(peak, weightCol="V5", chrs=c("chr17", "chr18"), xlim=c(4.5e7, 5e7))

4.2 Profile of ChIP peaks binding to TSS regions

First of all, for calculate the profile of ChIP peaks binding to TSS regions, we should prepare the TSS regions, which are defined as the flanking sequence of the TSS sites. Then align the peaks that are mapping to these regions, and generate the tagMatrix.

## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrix <- getTagMatrix(peak, windows=promoter)
##
## to speed up the compilation of this vignettes, we use a precalculated tagMatrix
data("tagMatrixList")
tagMatrix <- tagMatrixList[[4]]

In the above code, you should notice that tagMatrix is not restricted to TSS regions. The regions can be other types that defined by the user.

4.2.1 Heatmap of ChIP binding to TSS regions

tagHeatmap(tagMatrix, xlim=c(-3000, 3000), color="red")
Heatmap of ChIP peaks binding to TSS regions

Heatmap of ChIP peaks binding to TSS regions

ChIPseeker provide a one step function to generate this figure from bed file. The following function will generate the same figure as above.

peakHeatmap(files[[4]], TxDb=txdb, upstream=3000, downstream=3000, color="red")

4.2.2 Average Profile of ChIP peaks binding to TSS region

plotAvgProf(tagMatrix, xlim=c(-3000, 3000), 
            xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")
Average Profile of ChIP peaks binding to TSS region

Average Profile of ChIP peaks binding to TSS region

The function plotAvgProf2 provide a one step from bed file to average profile plot. The following command will generate the same figure as shown above.

plotAvgProf2(files[[4]], TxDb=txdb, upstream=3000, downstream=3000, 
             xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")

Confidence interval estimated by bootstrap method is also supported for characterizing ChIP binding profiles.

plotAvgProf(tagMatrix, xlim=c(-3000, 3000), conf = 0.95, resample = 500)
## >> Running bootstrapping for tag matrix...        2016-03-01 08:18:21 PM
Average Profile of ChIP peaks binding to TSS region

Average Profile of ChIP peaks binding to TSS region

5 Peak Annotation

peakAnno <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000), 
                         TxDb=txdb, annoDb="org.Hs.eg.db")
## >> loading peak file...               2016-03-01 08:18:25 PM 
## >> preparing features information...      2016-03-01 08:18:25 PM 
## >> identifying nearest features...        2016-03-01 08:18:25 PM 
## >> calculating distance from peak to TSS...   2016-03-01 08:18:25 PM 
## >> assigning genomic annotation...        2016-03-01 08:18:25 PM 
## >> adding gene annotation...          2016-03-01 08:18:40 PM 
## >> assigning chromosome lengths           2016-03-01 08:18:40 PM 
## >> done...                    2016-03-01 08:18:40 PM

Peak Annotation is performed by annotatePeak. User can define TSS (transcription start site) region, by default TSS is defined from -3kb to +3kb. The output of annotatePeak is csAnno instance. ChIPseeker provides as.GRanges to convert csAnno to GRanges instance, and as.data.frame to convert csAnno to data.frame which can be exported to file by write.table.

TxDb object contained transcript-related features of a particular genome. Bioconductor provides several package that containing TxDb object of model organisms with multiple commonly used genome version, for instance TxDb.Hsapiens.UCSC.hg38.knownGene, TxDb.Hsapiens.UCSC.hg19.knownGene for human genome hg38 and hg19, TxDb.Mmusculus.UCSC.mm10.knownGene and TxDb.Mmusculus.UCSC.mm9.knownGene for mouse genome mm10 and mm9, etc. User can also prepare their own TxDb object by retrieving information from UCSC Genome Bioinformatics and BioMart data resources by R function makeTxDbFromBiomart and makeTxDbFromUCSC. TxDb object should be passed for peak annotation.

All the peak information contained in peakfile will be retained in the output of annotatePeak. The position and strand information of nearest genes are reported. The distance from peak to the TSS of its nearest gene is also reported. The genomic region of the peak is reported in annotation column. Since some annotation may overlap, ChIPseeker adopted the following priority in genomic annotation.

Downstream is defined as the downstream of gene end.

ChIPseeker also provides parameter genomicAnnotationPriority for user to prioritize this hierachy.

annotatePeak report detail information when the annotation is Exon or Intron, for instance “Exon (uc002sbe.3/9736, exon 69 of 80)”, means that the peak is overlap with an Exon of transcript uc002sbe.3, and the corresponding Entrez gene ID is 9736 (Transcripts that belong to the same gene ID may differ in splice events), and this overlaped exon is the 69th exon of the 80 exons that this transcript uc002sbe.3 prossess.

Parameter annoDb is optional, if provided, extra columns including SYMBOL, GENENAME, ENSEMBL/ENTREZID will be added. The geneId column in annotation output will be consistent with the geneID in TxDb. If it is ENTREZID, ENSEMBL will be added if annoDb is provided, while if it is ENSEMBL ID, ENTREZID will be added.

5.1 Visualize Genomic Annotation

To annotate the location of a given peak in terms of genomic features, annotatePeak assigns peaks to genomic annotation in “annotation” column of the output, which includes whether a peak is in the TSS, Exon, 5’ UTR, 3’ UTR, Intronic or Intergenic. Many researchers are very interesting in these annotations. TSS region can be defined by user and annotatePeak output in details of which exon/intron of which genes as illustrated in previous section.

Pie and Bar plot are supported to visualize the genomic annotation.

plotAnnoPie(peakAnno)
Genomic Annotation by pieplot

Genomic Annotation by pieplot

plotAnnoBar(peakAnno)
Genomic Annotation by barplot

Genomic Annotation by barplot

Since some annotation overlap, user may interested to view the full annotation with their overlap, which can be partially resolved by vennpie function.

vennpie(peakAnno)
Genomic Annotation by vennpie

Genomic Annotation by vennpie

We extend UpSetR to view full annotation overlap. User can user upsetplot function.

upsetplot(peakAnno)

We can combine vennpie with upsetplot by setting vennpie = TRUE.

upsetplot(peakAnno, vennpie=TRUE)

5.2 Visualize distribution of TF-binding loci relative to TSS

The distance from the peak (binding site) to the TSS of the nearest gene is calculated by annotatePeak and reported in the output. We provide plotDistToTSS to calculate the percentage of binding sites upstream and downstream from the TSS of the nearest genes, and visualize the distribution.

plotDistToTSS(peakAnno, 
              title="Distribution of transcription factor-binding loci\nrelative to TSS")
Distribution of Binding Sites

Distribution of Binding Sites

6 Functional enrichment analysis

Once we have obtained the annotated nearest genes, we can perform functional enrichment analysis to identify predominant biological themes among these genes by incorporating biological knowledge provided by biological ontologies. For instance, Gene Ontology (GO)6 annotates genes to biological processes, molecular functions, and cellular components in a directed acyclic graph structure, Kyoto Encyclopedia of Genes and Genomes (KEGG)7 annotates genes to pathways, Disease Ontology (DO)8 annotates genes with human disease association, and Reactome9 annotates gene to pathways and reactions.

ChIPseeker also provides a function, seq2gene, for linking genomc regions to genes in a many-to-many mapping. It consider host gene (exon/intron), promoter region and flanking gene from intergenic region that may under control via cis-regulation. This function is designed to link both coding and non-coding genomic regions to coding genes and facilitate functional analysis.

Enrichment analysis is a widely used approach to identify biological themes. I have developed several Bioconductor packages for investigating whether the number of selected genes associated with a particular biological term is larger than expected, including DOSE2 for Disease Ontology, ReactomePA for reactome pathway, clusterProfiler3 for Gene Ontology and KEGG enrichment analysis.

require(clusterProfiler)
bp <- enrichGO(as.data.frame(peakAnno)$geneId, ont="BP", readable=TRUE)
head(summary(bp), n=3)
##                    ID                           Description GeneRatio
## GO:0032502 GO:0032502                 developmental process   469/786
## GO:0044767 GO:0044767 single-organism developmental process   462/786
## GO:0007275 GO:0007275  multicellular organismal development   417/786
##               BgRatio       pvalue     p.adjust       qvalue
## GO:0032502 5679/18679 2.157108e-67 1.158798e-63 8.789648e-64
## GO:0044767 5595/18679 9.326736e-66 2.505161e-62 1.900200e-62
## GO:0007275 4795/18679 6.385841e-63 1.143491e-59 8.673541e-60
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      geneID
## GO:0032502 SP9/DNM1L/EDIL3/OLIG2/SPRY1/CDKN2A/CCNO/WARS2/KLF2/MERTK/ZBTB18/HOXB13/IGF2BP1/IGF2BP3/LBX1/TBR1/SIX2/NEK6/PNPLA6/VAX1/GLMN/SOX21/WIF1/CHN1/CHRNB1/OSBPL11/CCDC151/ADCYAP1/OSR2/OLIG1/ZNF488/ADD2/DBX1/CLN5/FOXN4/ZPBP2/WDR81/CNR1/DMBX1/COL2A1/COL4A1/COL5A2/COL6A1/M1AP/COL19A1/TRIM71/ADM/CPM/NKX2-6/GATA5/SLC32A1/ZBTB46/CRYGD/PRICKLE1/CSNK1D/ADRA1A/NKX2-5/CUX1/CXADR/SHISA3/ADRB1/CYP24A1/NKX2-3/DAB1/DACH1/DDX6/OLIG3/ARX/DLX1/DLX4/DOCK1/DPYSL2/DRD1/DUSP4/EBF1/ECE1/EDNRA/EFNA1/EFNA3/MEGF8/ELF2/SPAG17/FOXD4L1/EMX1/CTTN/EN2/EPAS1/EPHA3/EPHA5/EPHB3/ERBB4/EREG/EYA4/ERG/MECOM/EYA1/F2R/ACSL3/UNC5B/EFEMP1/LEMD2/SP8/THSD7A/PRR15/RFX6/FGF9/FGF10/FHL1/IKZF3/ADGRL1/SLITRK3/FOXJ3/ZBTB1/FOXG1/RAB18/FOXF2/FOXC1/FOXD4/RIMS1/FOXC2/FOXE1/FOXD2/EXPH5/ARHGAP26/SMC5/LPIN1/FLT1/POFUT2/TBC1D30/FRAT2/OTP/HEY1/PHF3/CBX7/HEY2/SCRIB/SPO11/FLRT3/ALK/VGLL2/ALOX15B/NR5A2/NR5A1/ZNF396/EBF3/G6PC/NPNT/ALX3/BAMBI/VAX2/GAS1/GATA2/GATA3/GATA4/GATA6/GBX2/LHX6/AATF/GFI1/GFRA2/FOXD3/FOXB1/SND1/GJB2/FOXP1/DKK2/SALL3/RBM20/MKX/SYPL2/DLL1/IGSF10/CCDC66/ANK3/OSTM1/MYLIP/ABT1/PARVB/SNX10/TLX3/RAX/HHEX/MNX1/HLA-DOA/HLF/HMX2/FOXA1/ONECUT1/C14orf39/TLX1/TLX2/HOXA1/HOXB9/APBB1/HOXC10/HOXC12/HOXC13/HOXD10/HOXD11/HOXD12/HPGD/ZAR1/HTR2C/VSX2/ID4/UNCX/RSPO2/HMX3/FMN1/KRT27/BARHL2/IFNA2/NOTO/EVX2/MCIDAS/TUBB2B/IGF1R/IGFBP2/IGFBP5/PLET1/RTN4RL2/IL6ST/IMPDH1/INSIG1/INSM1/PDX1/IRF4/JAK1/KIF5C/ARF6/KRT3/INSC/SP5/SPATA31D1/AFF3/BSX/ONECUT3/HES3/GDF6/LEPR/ARHGDIA/LHX1/ABLIM1/LMO2/LMX1B/MAF/MAK/MBP/MEIS1/MEOX2/MITF/MLF1/LHX8/NRARP/FOXB2/CITED1/MSX1/MSX2/ASTN1/ZFHX3/ATF3/NEFL/NEO1/NEUROD2/NEUROG1/NFIA/NFIB/NGFR/NHLH2/NKX2-2/NKX6-1/NODAL/ROR2/NR4A2/OPCML/SIX6/OTX1/OTX2/NOX4/CNTN3/PAX2/PAX3/PAX5/PAX6/PAX9/LEF1/WNT16/PDE3B/PDE6A/ASB4/ATP5G3/PFKFB3/PITX1/PKP1/PLAG1/PLAGL1/BCL11A/PLXNA1/BRWD1/SNTG2/POU3F1/POU3F2/POU3F3/POU4F1/POU4F2/TMEM106B/FNBP1L/CASZ1/BANP/FEZF2/AVPR1A/RBM41/SMPD3/SOX6/CHD7/TENM3/ENAH/BARX1/PRL/CTNNBL1/CYP26B1/BARHL1/C8orf4/PSMB4/SMYD2/LHX9/RGMA/TBX20/PSMD8/SALL4/PSMD11/PTBP1/PTGER4/CGN/SEMA6A/RDH14/EPB41L5/TNRC6C/ADGRB3/PTPN3/NKX3-2/PTPRZ1/RASGRF2/PRDM12/PRDM13/EDA2R/PROK2/BCL9/ROBO2/SATB1/CEACAM1/BICD1/DMRTA2/NEUROG2/PRDM14/NPAS3/LHX5/ROBO3/SFRP2/RAD21L1/SOX17/ARAP3/DRGX/NARFL/NKX2-4/EBF2/BMI1/ISL2/BCL11B/SIM1/SIM2/SIX1/SIX3/BMP4/SLC6A4/SMARCD3/SOX2/SOX3/SOX9/SPAG1/SPTBN2/TACC1/MAP3K7/TAL1/TBX2/TBX15/TBX3/HNF1B/TCF3/TCF21/TFAP2C/NR2F1/NR2F2/TGFB2/THBS4/TCHH/TLE1/NR2E1/TMOD1/TRPC6/PHLDA2/VGF/WNT5A/WT1/ZFX/ZIC1/ZIC2/ZNF16/CA4/VEZF1/PAX8/CACNB4/NKAP/CALCA/FZD3/FBXO31/LPCAT1/KAT6A/WLS/RMI1/PREX2/EFHD1/PDGFD/SP6/DDHD1/HMGA2/ALX1/KAZALD1/MADCAM1/SBF2/EOMES/SOX7/MAD1L1/IMMP2L/CTTNBP2/ADGRV1/DTNBP1/USP42/HOPX/NTNG2/CBX2/DISP1/STRIP1/BARX2/ZIC5/STON2/TP63/TNFRSF11A/ALDH1A2/ARHGEF7/SPHK1/SQSTM1/TRIM15/LHX4/SOCS3/LGI1/MED14/PRDM6/PTER/LHX2/FOXP2/NTN1/ONECUT2/TBX4/SPAG6/NCOR1/EIF4A3/KIAA0319/MAFB
## GO:0044767                                         SP9/DNM1L/EDIL3/OLIG2/SPRY1/CDKN2A/CCNO/WARS2/KLF2/MERTK/ZBTB18/HOXB13/IGF2BP1/LBX1/TBR1/SIX2/NEK6/PNPLA6/VAX1/GLMN/SOX21/WIF1/CHN1/CHRNB1/OSBPL11/CCDC151/ADCYAP1/OSR2/OLIG1/ZNF488/ADD2/DBX1/CLN5/FOXN4/ZPBP2/WDR81/CNR1/DMBX1/COL2A1/COL4A1/COL5A2/COL6A1/M1AP/COL19A1/TRIM71/ADM/NKX2-6/GATA5/SLC32A1/ZBTB46/CRYGD/PRICKLE1/CSNK1D/ADRA1A/NKX2-5/CUX1/CXADR/SHISA3/ADRB1/CYP24A1/NKX2-3/DAB1/DACH1/DDX6/OLIG3/ARX/DLX1/DLX4/DOCK1/DPYSL2/DRD1/DUSP4/EBF1/ECE1/EDNRA/EFNA1/EFNA3/MEGF8/ELF2/SPAG17/FOXD4L1/EMX1/CTTN/EN2/EPAS1/EPHA3/EPHA5/EPHB3/ERBB4/EREG/EYA4/ERG/MECOM/EYA1/ACSL3/UNC5B/EFEMP1/LEMD2/SP8/THSD7A/PRR15/RFX6/FGF9/FGF10/FHL1/IKZF3/ADGRL1/SLITRK3/FOXJ3/ZBTB1/FOXG1/RAB18/FOXF2/FOXC1/FOXD4/RIMS1/FOXC2/FOXE1/FOXD2/EXPH5/ARHGAP26/SMC5/LPIN1/FLT1/POFUT2/TBC1D30/FRAT2/OTP/HEY1/PHF3/CBX7/HEY2/SCRIB/SPO11/FLRT3/ALK/VGLL2/ALOX15B/NR5A2/NR5A1/ZNF396/EBF3/G6PC/NPNT/ALX3/BAMBI/VAX2/GAS1/GATA2/GATA3/GATA4/GATA6/GBX2/LHX6/AATF/GFI1/GFRA2/FOXD3/FOXB1/SND1/GJB2/FOXP1/DKK2/SALL3/RBM20/MKX/SYPL2/DLL1/IGSF10/CCDC66/ANK3/OSTM1/MYLIP/ABT1/PARVB/SNX10/TLX3/RAX/HHEX/MNX1/HLA-DOA/HLF/HMX2/FOXA1/ONECUT1/C14orf39/TLX1/TLX2/HOXA1/HOXB9/APBB1/HOXC10/HOXC12/HOXC13/HOXD10/HOXD11/HOXD12/HPGD/ZAR1/HTR2C/VSX2/ID4/UNCX/RSPO2/HMX3/FMN1/KRT27/BARHL2/IFNA2/NOTO/EVX2/MCIDAS/TUBB2B/IGF1R/IGFBP2/IGFBP5/PLET1/RTN4RL2/IL6ST/IMPDH1/INSIG1/INSM1/PDX1/IRF4/JAK1/KIF5C/ARF6/KRT3/INSC/SP5/SPATA31D1/AFF3/BSX/ONECUT3/HES3/GDF6/LEPR/ARHGDIA/LHX1/ABLIM1/LMO2/LMX1B/MAF/MAK/MBP/MEIS1/MEOX2/MITF/MLF1/LHX8/NRARP/FOXB2/CITED1/MSX1/MSX2/ASTN1/ZFHX3/ATF3/NEFL/NEO1/NEUROD2/NEUROG1/NFIA/NFIB/NGFR/NHLH2/NKX2-2/NKX6-1/NODAL/ROR2/NR4A2/OPCML/SIX6/OTX1/OTX2/NOX4/CNTN3/PAX2/PAX3/PAX5/PAX6/PAX9/LEF1/WNT16/PDE3B/PDE6A/ASB4/ATP5G3/PFKFB3/PITX1/PKP1/PLAG1/PLAGL1/BCL11A/PLXNA1/BRWD1/SNTG2/POU3F1/POU3F2/POU3F3/POU4F1/POU4F2/TMEM106B/FNBP1L/CASZ1/BANP/FEZF2/AVPR1A/SMPD3/SOX6/CHD7/TENM3/ENAH/BARX1/PRL/CTNNBL1/CYP26B1/BARHL1/C8orf4/PSMB4/SMYD2/LHX9/RGMA/TBX20/PSMD8/SALL4/PSMD11/PTBP1/PTGER4/CGN/SEMA6A/RDH14/EPB41L5/TNRC6C/ADGRB3/PTPN3/NKX3-2/PTPRZ1/RASGRF2/PRDM12/PRDM13/EDA2R/PROK2/BCL9/ROBO2/SATB1/CEACAM1/DMRTA2/NEUROG2/PRDM14/LHX5/ROBO3/SFRP2/RAD21L1/SOX17/ARAP3/DRGX/NARFL/NKX2-4/EBF2/BMI1/ISL2/BCL11B/SIM1/SIM2/SIX1/SIX3/BMP4/SLC6A4/SMARCD3/SOX2/SOX3/SOX9/SPAG1/SPTBN2/TACC1/MAP3K7/TAL1/TBX2/TBX15/TBX3/HNF1B/TCF3/TCF21/TFAP2C/NR2F1/NR2F2/TGFB2/THBS4/TCHH/TLE1/NR2E1/TMOD1/TRPC6/PHLDA2/VGF/WNT5A/WT1/ZFX/ZIC1/ZIC2/ZNF16/CA4/VEZF1/PAX8/CACNB4/NKAP/CALCA/FZD3/FBXO31/LPCAT1/KAT6A/WLS/RMI1/PREX2/EFHD1/PDGFD/SP6/HMGA2/ALX1/KAZALD1/MADCAM1/SBF2/EOMES/SOX7/MAD1L1/IMMP2L/CTTNBP2/ADGRV1/DTNBP1/USP42/HOPX/NTNG2/CBX2/DISP1/STRIP1/BARX2/ZIC5/STON2/TP63/TNFRSF11A/ALDH1A2/ARHGEF7/SPHK1/SQSTM1/TRIM15/LHX4/SOCS3/LGI1/MED14/PRDM6/PTER/LHX2/FOXP2/NTN1/ONECUT2/TBX4/SPAG6/NCOR1/EIF4A3/KIAA0319/MAFB
## GO:0007275                                                                                                                                                                                                                                                                                                                                 SP9/EDIL3/OLIG2/SPRY1/CDKN2A/WARS2/KLF2/MERTK/ZBTB18/HOXB13/IGF2BP1/LBX1/TBR1/SIX2/PNPLA6/VAX1/GLMN/SOX21/WIF1/CHN1/CCDC151/ADCYAP1/OSR2/OLIG1/ZNF488/ADD2/DBX1/CLN5/FOXN4/WDR81/CNR1/DMBX1/COL2A1/COL4A1/COL5A2/COL6A1/COL19A1/TRIM71/ADM/NKX2-6/GATA5/ZBTB46/CRYGD/PRICKLE1/ADRA1A/NKX2-5/CUX1/CXADR/SHISA3/NKX2-3/DAB1/DACH1/DDX6/OLIG3/ARX/DLX1/DLX4/DOCK1/DPYSL2/DRD1/DUSP4/EBF1/ECE1/EDNRA/EFNA1/EFNA3/MEGF8/EMX1/CTTN/EN2/EPAS1/EPHA3/EPHA5/EPHB3/ERBB4/EREG/EYA4/ERG/MECOM/EYA1/ACSL3/UNC5B/EFEMP1/LEMD2/SP8/THSD7A/PRR15/RFX6/FGF9/FGF10/FHL1/IKZF3/ADGRL1/SLITRK3/ZBTB1/FOXG1/RAB18/FOXF2/FOXC1/RIMS1/FOXC2/FOXE1/EXPH5/ARHGAP26/LPIN1/FLT1/POFUT2/FRAT2/OTP/HEY1/PHF3/CBX7/HEY2/SCRIB/SPO11/FLRT3/ALK/VGLL2/ALOX15B/NR5A2/NR5A1/ZNF396/EBF3/NPNT/ALX3/BAMBI/VAX2/GATA2/GATA3/GATA4/GATA6/GBX2/LHX6/AATF/GFI1/GFRA2/FOXD3/FOXB1/GJB2/FOXP1/DKK2/SALL3/RBM20/MKX/SYPL2/DLL1/IGSF10/CCDC66/ANK3/OSTM1/MYLIP/ABT1/SNX10/TLX3/RAX/HHEX/MNX1/HLA-DOA/HLF/HMX2/FOXA1/ONECUT1/C14orf39/TLX1/TLX2/HOXA1/HOXB9/APBB1/HOXC10/HOXC12/HOXC13/HOXD10/HOXD11/HOXD12/HPGD/ZAR1/VSX2/ID4/UNCX/RSPO2/HMX3/FMN1/KRT27/BARHL2/IFNA2/NOTO/EVX2/TUBB2B/IGF1R/IGFBP5/RTN4RL2/IL6ST/IMPDH1/INSIG1/INSM1/PDX1/IRF4/JAK1/KIF5C/ARF6/INSC/SP5/AFF3/BSX/ONECUT3/HES3/GDF6/LEPR/ARHGDIA/LHX1/ABLIM1/LMO2/LMX1B/MAF/MAK/MBP/MEIS1/MEOX2/MITF/MLF1/LHX8/NRARP/CITED1/MSX1/MSX2/ASTN1/ZFHX3/ATF3/NEFL/NEO1/NEUROD2/NEUROG1/NFIA/NFIB/NGFR/NHLH2/NKX2-2/NKX6-1/NODAL/ROR2/NR4A2/OPCML/SIX6/OTX1/OTX2/NOX4/CNTN3/PAX2/PAX3/PAX5/PAX6/PAX9/LEF1/WNT16/PDE3B/PDE6A/ASB4/PFKFB3/PITX1/PKP1/PLAG1/PLAGL1/BCL11A/PLXNA1/SNTG2/POU3F1/POU3F2/POU3F3/POU4F1/POU4F2/TMEM106B/CASZ1/BANP/FEZF2/AVPR1A/SMPD3/SOX6/CHD7/TENM3/ENAH/BARX1/PRL/CTNNBL1/CYP26B1/BARHL1/C8orf4/PSMB4/SMYD2/LHX9/RGMA/TBX20/PSMD8/SALL4/PSMD11/PTGER4/SEMA6A/EPB41L5/TNRC6C/ADGRB3/PTPN3/NKX3-2/PTPRZ1/RASGRF2/PRDM12/PRDM13/EDA2R/PROK2/BCL9/ROBO2/SATB1/CEACAM1/DMRTA2/NEUROG2/PRDM14/LHX5/ROBO3/SFRP2/RAD21L1/SOX17/DRGX/NARFL/NKX2-4/EBF2/BMI1/ISL2/BCL11B/SIM1/SIM2/SIX1/SIX3/BMP4/SLC6A4/SMARCD3/SOX2/SOX3/SOX9/SPTBN2/TACC1/MAP3K7/TAL1/TBX2/TBX15/TBX3/HNF1B/TCF3/TCF21/TFAP2C/NR2F1/NR2F2/TGFB2/THBS4/TCHH/TLE1/NR2E1/TMOD1/TRPC6/PHLDA2/VGF/WNT5A/WT1/ZFX/ZIC1/ZIC2/ZNF16/CA4/VEZF1/PAX8/CACNB4/NKAP/CALCA/FZD3/FBXO31/LPCAT1/KAT6A/WLS/PREX2/EFHD1/PDGFD/SP6/HMGA2/ALX1/KAZALD1/MADCAM1/SBF2/EOMES/SOX7/MAD1L1/IMMP2L/CTTNBP2/ADGRV1/DTNBP1/HOPX/NTNG2/CBX2/DISP1/BARX2/ZIC5/STON2/TP63/TNFRSF11A/ALDH1A2/ARHGEF7/SPHK1/TRIM15/LHX4/SOCS3/LGI1/PRDM6/LHX2/FOXP2/NTN1/ONECUT2/TBX4/NCOR1/EIF4A3/KIAA0319/MAFB
##            Count
## GO:0032502   469
## GO:0044767   462
## GO:0007275   417

More information can be found in the vignettes of Bioconductor packages DOSE2, ReactomePA, clusterProfiler3, which also provide several methods to visualize enrichment results. The clusterProfiler3 is designed for comparing and visualizing functional profiles among gene clusters, and can directly applied to compare biological themes at GO, DO, KEGG, Reactome perspective.

7 ChIP peak data set comparison

7.1 Profile of several ChIP peak data binding to TSS region

Function plotAvgProf and tagHeatmap can accept a list of tagMatrix and visualize profile or heatmap among several ChIP experiments, while plotAvgProf2 and peakHeatmap can accept a list of bed files and perform the same task in one step.

7.1.1 Average profiles

## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrixList <- lapply(files, getTagMatrix, windows=promoter)
## 
## to speed up the compilation of this vigenette, we load a precaculated tagMatrixList
data("tagMatrixList")
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000))
Average Profiles of ChIP peaks among different experiments

Average Profiles of ChIP peaks among different experiments

## resample = 500 by default, here use 100 to speed up the compilation of this vignette.
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000), conf=0.95,resample=100, facet="row")
## >> Running bootstrapping for tag matrix...        2016-03-01 08:19:21 PM 
## >> Running bootstrapping for tag matrix...        2016-03-01 08:19:26 PM 
## >> Running bootstrapping for tag matrix...        2016-03-01 08:19:30 PM 
## >> Running bootstrapping for tag matrix...        2016-03-01 08:19:46 PM 
## >> Running bootstrapping for tag matrix...        2016-03-01 08:20:05 PM
Average Profiles of ChIP peaks among different experiments

Average Profiles of ChIP peaks among different experiments

7.1.2 Peak heatmaps

tagHeatmap(tagMatrixList, xlim=c(-3000, 3000), color=NULL)
Heatmap of ChIP peaks among different experiments

Heatmap of ChIP peaks among different experiments

7.2 ChIP peak annotation comparision

The plotAnnoBar and plotDistToTSS can also accept input of a named list of annotated peaks (output of annotatePeak).

peakAnnoList <- lapply(files, annotatePeak, TxDb=txdb, 
                       tssRegion=c(-3000, 3000), verbose=FALSE)

We can use plotAnnoBar to comparing their genomic annotation.

plotAnnoBar(peakAnnoList)
Genomic Annotation among different ChIPseq data

Genomic Annotation among different ChIPseq data

R function plotDistToTSS can use to comparing distance to TSS profiles among ChIPseq data.

plotDistToTSS(peakAnnoList)
Distribution of Binding Sites among different ChIPseq data

Distribution of Binding Sites among different ChIPseq data

7.3 Functional profiles comparison

As shown in section 4, the annotated genes can analyzed by clusterProfiler3, DOSE2 and ReactomePA for Gene Ontology, KEGG, Disease Ontology and Reactome Pathway enrichment analysis.

The clusterProfiler3 package provides compareCluster function for comparing biological themes among gene clusters, and can be easily adopted to compare different ChIP peak experiments.

genes = lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
names(genes) = sub("_", "\n", names(genes))
compMF <- compareCluster(geneCluster   = genes, 
                         fun           = "enrichGO",
                         ont           = "MF",
                         pvalueCutoff  = 0.05, 
                         pAdjustMethod = "BH")
plot(compMF, showCategory = 20, title = "Molecular Function Enrichment")

7.4 Overlap of peaks and annotated genes

User may want to compare the overlap peaks of replicate experiments or from different experiments. ChIPseeker provides peak2GRanges that can read peak file and stored in GRanges object. Several files can be read simultaneously using lapply, and then passed to vennplot to calculate their overlap and draw venn plot.

vennplot accept a list of object, can be a list of GRanges or a list of vector. Here, I will demonstrate using vennplot to visualize the overlap of the nearest genes stored in peakAnnoList.

genes= lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
vennplot(genes)
Overlap of annotated genes

Overlap of annotated genes

8 Statistical testing of ChIP seq overlap

Overlap is very important, if two ChIP experiment by two different proteins overlap in a large fraction of their peaks, they may cooperative in regulation. Calculating the overlap is only touch the surface. ChIPseeker implemented statistical methods to measure the significance of the overlap.

8.1 Shuffle genome coordination

p <- GRanges(seqnames=c("chr1", "chr3"), 
             ranges=IRanges(start=c(1, 100), end=c(50, 130)))
shuffle(p, TxDb=txdb)
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames                 ranges strand
##          <Rle>              <IRanges>  <Rle>
##   [1]     chr1 [193408767, 193408817]      *
##   [2]     chr3 [160424344, 160424375]      *
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

We implement the shuffle function to randomly permute the genomic locations of ChIP peaks defined in a genome which stored in TxDb object.

8.2 Peak overlap enrichment analysis

With the ease of this shuffle method, we can generate thousands of random ChIP data and calculate the background null distribution of the overlap among ChIP data sets.

enrichPeakOverlap(queryPeak     = files[[5]], 
                  targetPeak    = unlist(files[1:4]), 
                  TxDb          = txdb, 
                  pAdjustMethod = "BH", 
                  nShuffle      = 50, 
                  chainFile     = NULL,
                  verbose       = FALSE)
##                                                       qSample
## ARmo_0M    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## ARmo_1nM   GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## ARmo_100nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## CBX6_BF    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
##                                                       tSample qLen tLen
## ARmo_0M                       GSM1174480_ARmo_0M_peaks.bed.gz 1663  812
## ARmo_1nM                     GSM1174481_ARmo_1nM_peaks.bed.gz 1663 2296
## ARmo_100nM                 GSM1174482_ARmo_100nM_peaks.bed.gz 1663 1359
## CBX6_BF    GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz 1663 1331
##            N_OL     pvalue   p.adjust
## ARmo_0M       0 0.86274510 0.86274510
## ARmo_1nM      8 0.17647059 0.35294118
## ARmo_100nM    3 0.50980392 0.67973856
## CBX6_BF     968 0.01960784 0.07843137

Parameter queryPeak is the query ChIP data, while targetPeak is bed file name or a vector of bed file names from comparison; nShuffle is the number to shuffle the peaks in targetPeak. To speed up the compilation of this vignettes, we only set nShuffle to 50 as an example for only demonstration. User should set the number to 1000 or above for more robust result. Parameter chainFile are chain file name for mapping the targetPeak to the genome version consistent with queryPeak when their genome version are different. This creat the possibility of comparison among different genome version and cross species.

In the output, qSample is the name of queryPeak and qLen is the the number of peaks in queryPeak. N_OL is the number of overlap between queryPeak and targetPeak.

9 Data Mining with ChIP seq data deposited in GEO

There are many ChIP seq data sets that have been published and deposited in GEO database. We can compare our own dataset to those deposited in GEO to search for significant overlap data. Significant overlap of ChIP seq data by different binding proteins may be used to infer cooperative regulation and thus can be used to generate hypotheses.

We collect about 17,000 bed files deposited in GEO, user can use getGEOspecies to get a summary based on speices.

9.1 GEO data collection

getGEOspecies()
##                                               species Freq
## 1                                       Aedes aegypti   11
## 2                                            Anabaena    6
## 3                                 Anolis carolinensis    2
## 4                                   Anopheles gambiae    2
## 5                                      Apis mellifera    5
## 6                           Apis mellifera scutellata    1
## 7                                  Arabidopsis lyrata    4
## 8                                Arabidopsis thaliana  174
## 9                                Atelerix albiventris    2
## 10                                         Bos taurus    4
## 11                                      Brassica rapa    8
## 12                             Caenorhabditis elegans  164
## 13                                   Candida albicans   25
## 14                               Candida dubliniensis   20
## 15                             Canis lupus familiaris    7
## 16                          Chlamydomonas reinhardtii   51
## 17                               Chlorocebus aethiops    2
## 18                                 Cleome hassleriana    1
## 19                                      Columba livia    6
## 20                                  Crassostrea gigas    1
## 21                            Cryptococcus neoformans   51
## 22                                        Danio rerio  143
## 23                            Drosophila melanogaster  688
## 24                           Drosophila pseudoobscura    7
## 25                                Drosophila simulans   12
## 26                                 Drosophila virilis   26
## 27                                  Drosophila yakuba    8
## 28                                     Equus caballus    1
## 29                                   Escherichia coli   13
## 30                           Escherichia coli BW25113    4
## 31                              Escherichia coli K-12    2
## 32          Escherichia coli str. K-12 substr. MG1655    8
## 33                                      Gallus gallus   55
## 34                       Geobacter sulfurreducens PCA    3
## 35                                    Gorilla gorilla    2
## 36                                  Histophilus somni    1
## 37                                       Homo sapiens 9179
## 38                 Homo sapiens;\tHuman herpesvirus 8    6
## 39                               Human herpesvirus 6B    2
## 40                                Human herpesvirus 8    6
## 41                                Larimichthys crocea    7
## 42                             Legionella pneumophila    5
## 43                             Leishmania amazonensis    4
## 44                                   Leishmania major    2
## 45                              Leishmania tarentolae   15
## 46                                     Macaca mulatta   85
## 47                              Monodelphis domestica    8
## 48                         Moraxella catarrhalis O35E    6
## 49                                       Mus musculus 6911
## 50                         Mus musculus x Mus spretus    1
## 51                         Mycobacterium tuberculosis    2
## 52                                    Myotis brandtii   15
## 53                              Naumovozyma castellii    1
## 54                             Nematostella vectensis   23
## 55                              Oreochromis niloticus    1
## 56                           Ornithorhynchus anatinus    5
## 57                                       Oryza sativa   23
## 58                                    Oryzias latipes    2
## 59                                    Pan troglodytes   51
## 60                                       Papio anubis    1
## 61                              Plasmodium falciparum    5
## 62                          Plasmodium falciparum 3D7   29
## 63                          Pseudomonas putida KT2440    2
## 64                                 Pseudozyma aphidis   11
## 65                                Pyrococcus furiosus    4
## 66                                  Rattus norvegicus   55
## 67                         Rhodopseudomonas palustris    6
## 68                  Rhodopseudomonas palustris CGA009    3
## 69                           Saccharomyces cerevisiae  473
## 70 Saccharomyces cerevisiae x Saccharomyces paradoxus   16
## 71            Saccharomyces cerevisiae;\tMus musculus   12
## 72                         Saccharomyces kudriavzevii    1
## 73                            Saccharomyces paradoxus    8
## 74                               Saccharomyces uvarum    1
## 75                      Schizosaccharomyces japonicus    2
## 76                          Schizosaccharomyces pombe  130
## 77                             Schmidtea mediterranea    7
## 78                                    Sorghum bicolor    2
## 79                      Streptomyces coelicolor A3(2)    6
## 80                                         Sus scrofa   23
## 81                                   Tupaia chinensis    7
## 82                                    Vibrio cholerae    2
## 83                      Xenopus (Silurana) tropicalis   62
## 84                                     Xenopus laevis    2
## 85                                           Zea mays   63

The summary can also based on genome version as illustrated below:

getGEOgenomeVersion()
##                         organism genomeVersion Freq
## 1            Anolis carolinensis       anoCar2    2
## 2                     Bos taurus       bosTau4    2
## 3                     Bos taurus       bosTau7    2
## 4         Caenorhabditis elegans          ce10    4
## 5         Caenorhabditis elegans           ce6   64
## 6                    Danio rerio       danRer6    6
## 7                    Danio rerio       danRer7   61
## 8        Drosophila melanogaster           dm3  461
## 9                  Gallus gallus       galGal3   32
## 10                 Gallus gallus       galGal4   15
## 11                  Homo sapiens          hg18 2459
## 12                  Homo sapiens          hg19 6093
## 13                  Homo sapiens          hg38    4
## 14                  Mus musculus          mm10   83
## 15                  Mus musculus           mm8  505
## 16                  Mus musculus           mm9 5751
## 17         Monodelphis domestica       monDom5    8
## 18               Pan troglodytes       panTro3   48
## 19                Macaca mulatta       rheMac2   81
## 20             Rattus norvegicus           rn5    1
## 21      Saccharomyces cerevisiae       sacCer2  141
## 22      Saccharomyces cerevisiae       sacCer3  158
## 23                    Sus scrofa       susScr2   17
## 24 Xenopus (Silurana) tropicalis       xenTro3    3

User can access the detail information by getGEOInfo, for each genome version.

hg19 <- getGEOInfo(genome="hg19", simplify=TRUE)
head(hg19)
##     series_id        gsm     organism
## 111  GSE16256  GSM521889 Homo sapiens
## 112  GSE16256  GSM521887 Homo sapiens
## 113  GSE16256  GSM521883 Homo sapiens
## 114  GSE16256 GSM1010966 Homo sapiens
## 115  GSE16256  GSM896166 Homo sapiens
## 116  GSE16256  GSM910577 Homo sapiens
##                                                                                                       title
## 111          Reference Epigenome: ChIP-Seq Analysis of H3K27me3 in IMR90 Cells; renlab.H3K27me3.IMR90-02.01
## 112            Reference Epigenome: ChIP-Seq Analysis of H3K27ac in IMR90 Cells; renlab.H3K27ac.IMR90-03.01
## 113            Reference Epigenome: ChIP-Seq Analysis of H3K14ac in IMR90 Cells; renlab.H3K14ac.IMR90-02.01
## 114                      polyA RNA sequencing of STL003 Pancreas Cultured Cells; polyA-RNA-seq_STL003PA_r1a
## 115          Reference Epigenome: ChIP-Seq Analysis of H4K8ac in hESC H1 Cells; renlab.H4K8ac.hESC.H1.01.01
## 116 Reference Epigenome: ChIP-Seq Analysis of H3K4me1 in Human Spleen Tissue; renlab.H3K4me1.STL003SX.01.01
##                                                                                                     supplementary_file
## 111         ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521889/suppl/GSM521889_UCSD.IMR90.H3K27me3.SK05.bed.gz
## 112          ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521887/suppl/GSM521887_UCSD.IMR90.H3K27ac.YL58.bed.gz
## 113          ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521883/suppl/GSM521883_UCSD.IMR90.H3K14ac.SK17.bed.gz
## 114 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1010nnn/GSM1010966/suppl/GSM1010966_UCSD.Pancreas.mRNA-Seq.STL003.bed.gz
## 115              ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM896nnn/GSM896166/suppl/GSM896166_UCSD.H1.H4K8ac.AY17.bed.gz
## 116       ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM910nnn/GSM910577/suppl/GSM910577_UCSD.Spleen.H3K4me1.STL003.bed.gz
##     genomeVersion pubmed_id
## 111          hg19  19829295
## 112          hg19  19829295
## 113          hg19  19829295
## 114          hg19  19829295
## 115          hg19  19829295
## 116          hg19  19829295

If simplify is set to FALSE, extra information including source_name, extract_protocol, description, data_processing and submission_date will be incorporated.

9.2 Download GEO ChIP data sets

ChIPseeker provide function downloadGEObedFiles to download all the bed files of a particular genome.

downloadGEObedFiles(genome="hg19", destDir="hg19")

Or a vector of GSM accession number by downloadGSMbedFiles.

gsm <- hg19$gsm[sample(nrow(hg19), 10)]
downloadGSMbedFiles(gsm, destDir="hg19")

9.3 Overlap significant testing

After download the bed files from GEO, we can pass them to enrichPeakOverlap for testing the significant of overlap. Parameter targetPeak can be the folder, e.g. hg19, that containing bed files. enrichPeakOverlap will parse the folder and compare all the bed files. It is possible to test the overlap with bed files that are mapping to different genome or different genome versions, enrichPeakOverlap provide a parameter chainFile that can pass a chain file and liftOver the targetPeak to the genome version consistent with queryPeak. Signifcant overlap can be use to generate hypothesis of cooperative regulation.By mining the data deposited in GEO, we can identify some putative complex or interacted regulators in gene expression regulation or chromsome remodelling for further validation.

10 External documents

11 Bugs/Feature Requests

If you have any, let me know.

12 Session Information

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.2.3 (2015-12-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GO.db_3.2.2                            
##  [2] ChIPseeker_1.6.7                       
##  [3] clusterProfiler_2.4.3                  
##  [4] org.Hs.eg.db_3.2.3                     
##  [5] RSQLite_1.0.0                          
##  [6] DBI_0.3.1                              
##  [7] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [8] GenomicFeatures_1.22.13                
##  [9] AnnotationDbi_1.32.3                   
## [10] Biobase_2.30.0                         
## [11] GenomicRanges_1.22.4                   
## [12] GenomeInfoDb_1.6.3                     
## [13] IRanges_2.4.8                          
## [14] S4Vectors_0.8.11                       
## [15] BiocGenerics_0.16.1                    
## [16] BiocStyle_1.8.0                        
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.1.0                 splines_3.2.3             
##  [3] topGO_2.22.0               gtools_3.5.0              
##  [5] assertthat_0.1             DO.db_2.9                 
##  [7] Rsamtools_1.22.0           yaml_2.1.13               
##  [9] lattice_0.20-33            digest_0.6.9              
## [11] RColorBrewer_1.1-2         XVector_0.10.0            
## [13] qvalue_2.2.2               colorspace_1.2-6          
## [15] htmltools_0.3              plyr_1.8.3                
## [17] XML_3.98-1.4               biomaRt_2.26.1            
## [19] SparseM_1.7                zlibbioc_1.16.0           
## [21] scales_0.4.0               gdata_2.17.0              
## [23] BiocParallel_1.4.3         KEGGREST_1.10.1           
## [25] ggplot2_2.0.0              UpSetR_1.1.1              
## [27] SummarizedExperiment_1.0.2 lazyeval_0.1.10           
## [29] magrittr_1.5               DOSE_2.8.2                
## [31] evaluate_0.8               gplots_2.17.0             
## [33] graph_1.48.0               tools_3.2.3               
## [35] formatR_1.2.1              gridBase_0.4-7            
## [37] stringr_1.0.0              munsell_0.4.3             
## [39] plotrix_3.6-1              lambda.r_1.1.7            
## [41] Biostrings_2.38.4          caTools_1.17.1            
## [43] futile.logger_1.4.1        grid_3.2.3                
## [45] RCurl_1.95-4.8             igraph_1.0.1              
## [47] bitops_1.0-6               labeling_0.3              
## [49] rmarkdown_0.9.5            boot_1.3-18               
## [51] gtable_0.2.0               reshape2_1.4.1            
## [53] R6_2.1.2                   GenomicAlignments_1.6.3   
## [55] gridExtra_2.2.1            knitr_1.12.3              
## [57] dplyr_0.4.3                rtracklayer_1.30.2        
## [59] futile.options_1.0.0       KernSmooth_2.23-15        
## [61] GOSemSim_1.28.2            stringi_1.0-1             
## [63] Rcpp_0.12.3                png_0.1-7

References

1.Yu, G., Wang, L.-G. & He, Q.-Y. ChIPseeker: An r/Bioconductor package for chIP peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383 (2015).

2.Yu, G., Wang, L.-G., Yan, G.-R. & He, Q.-Y. DOSE: An r/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 31, 608–609 (2015).

3.Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: an r package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16, 284–287 (2012).

4.Pemberton, H. et al. Genome-wide co-localization of polycomb orthologs and their effects on gene expression in human fibroblasts. Genome Biol. 15, R23 (2014).

5.Urbanucci, A. et al. Overexpression of androgen receptor enhances the binding of the receptor to the chromatin in prostate cancer. Oncogene 31, 2153–2163 (2012).

6.Ashburner, M. et al. Gene ontology: Tool for the unification of biology. Nat Genet 25, 25–29 (2000).

7.Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y. & Hattori, M. The KEGG resource for deciphering the genome. Nucl. Acids Res. 32, D277–D280 (2004).

8.Schriml, L. M. et al. Disease ontology: A backbone for disease semantic integration. Nucleic Acids Research 40, D940–D946 (2011).

9.Croft, D. et al. The reactome pathway knowledgebase. Nucleic Acids Research 42, D472–D477 (2013).