Contents

1 Introduction

Gene set enrichment (GSE) testing enables the interpretation of lists of differentially expressed genes (e.g. from RNA-seq), or lists of peaks (e.g. from ChIP-seq), in terms of pathways and other biologically meaningful sets of genes. The chipenrich package was originally designed to perform GSE for ChIP-seq peaks, but it can also be used for genomic regions with different biological meaning. The primary innovation of chipenrich is its accounting for biases that are known to affect the Type I error of such testing. In particular, the length of a gene’s regulatory region affects the probability that a peak will be assigned to it, the number of peaks that will be assigned to it, or the proportion of it covered by peaks.

The chipenrich package includes different enrichment methods for different use cases:

2 Concepts and Usage

library(chipenrich)
## 
## 
## 
## 
## 

2.1 Peaks

A ChIP-seq peak is a genomic region that represents a transcription factor binding event or the presence of a histone complex with a particular histone modification. Typically peaks are called with a peak caller (such as MACS2 or PePr) and represent relative enrichment of reads in a sample where the antibody is present versus input. Typically, peaks are output by a peak caller in BED-like format.

The primary user input for chipenrich(), broadenrich(), or polyenrich() are the peaks called from reads in a ChIP-seq experiment. Lists of genomic regions having other biological meaning can be used, but we shall continue to refer to ‘peaks’. Peaks can be input as either a file path or a data.frame.

If a file path, the following formats are fully supported via their file extensions: .bed, .broadPeak, .narrowPeak, .gff3, .gff2, .gff, and .bedGraph or .bdg. BED3 through BED6 files are supported under the .bed extension (BED specification). Files without these extensions are supported under the conditions that the first 3 columns correspond to chr, start, and end and that there is either no header column, or it is commented out. Files may be compressed with gzip, and so might end in .narrowPeak.gz, for example. For files with extension support, the rtracklayer::import() function is used to read peaks, so adherence to the mentioned file formats is necessary.

If peaks are already in the R environment as a data.frame, the GenomicRanges::makeGRangesFromDataFrame() function is used to convert to a GRanges object. For the acceptable column names needed for correct interpretation, see ?GenomicRanges::makeGRangesFromDataFrame.

For the purpose of the vignette, we’ll load some ChIP-seq peaks from the chipenrich.data companion package:

data(peaks_E2F4, package = 'chipenrich.data')
data(peaks_H3K4me3_GM12878, package = 'chipenrich.data')

head(peaks_E2F4)
##   chrom     start       end
## 1  chr1 156186314 156186469
## 2  chr1  10490456  10490550
## 3  chr1  46713352  46713436
## 4  chr1 226496843 226496924
## 5  chr1 200589825 200589928
## 6  chr1  47779789  47779907
head(peaks_H3K4me3_GM12878)
##   chrom    start      end
## 1 chr22 16846080 16871326
## 2 chr22 17305402 17306803
## 3 chr22 17517008 17517744
## 4 chr22 17518172 17518768
## 5 chr22 17518987 17520014
## 6 chr22 17520113 17520375

2.2 Genomes

Genomes for fly, human, mouse, rat, and zebrafish are supported. Particular supported genome builds are given by:

supported_genomes()
##  [1] "danRer10" "dm3"      "dm6"      "hg19"     "hg38"     "mm10"    
##  [7] "mm9"      "rn4"      "rn5"      "rn6"

2.3 Locus Definitions

A locus definition is a way of defining a gene regulatory region, and enables us to associate peaks with genes. The terms ‘gene’, ‘gene regulatory region’, and ‘gene locus’ are used interchangeably in the vignette. A trivial locus definition might be the gene bodies from the transcription start sites (TSS) to the transcript end sites (TES) for each gene. A locus definition can also express how one expects a transcription factor to regulate genes. For example, a locus definition defined as 1kb upstream and downstream of a TSS (the 1kb definition) would capture TFs binding in proximal-promoter regions.

2.3.1 Built-in locus definitions

A number of locus definitions representing different regulatory paradigms are included in the package:

  • nearest_tss: The locus is the region spanning the midpoints between the TSSs of adjacent genes.
  • nearest_gene: The locus is the region spanning the midpoints between the boundaries of each gene, where a gene is defined as the region between the furthest upstream TSS and furthest downstream TES for that gene. If gene loci overlap, the midpoint of the overlap is used as a border. If a gene locus is nested in another, the larger locus is split in two.
  • exon: Each gene has multiple loci corresponding to its exons. Overlaps between different genes are allowed.
  • intron: Each gene has multiple loci corresponding to its introns. Overlaps between different genes are allowed.
  • 1kb: The locus is the region within 1kb of any of the TSSs belonging to a gene. If TSSs from two adjacent genes are within 2 kb of each other, we use the midpoint between the two TSSs as the boundary for the locus for each gene.
  • 1kb_outside_upstream: The locus is the region more than 1kb upstream from a TSS to the midpoint between the adjacent TSS.
  • 1kb_outside: The locus is the region more than 1kb upstream or downstream from a TSS to the midpoint between the adjacent TSS.
  • 5kb: The locus is the region within 5kb of any of the TSSs belonging to a gene. If TSSs from two adjacent genes are within 10 kb of each other, we use the midpoint between the two TSSs as the boundary for the locus for each gene.
  • 5kb_outside_upstream: The locus is the region more than 5kb upstream from a TSS to the midpoint between the adjacent TSS.
  • 5kb_outside: The locus is the region more than 5kb upstream or downstream from a TSS to the midpoint between the adjacent TSS.
  • 10kb: The locus is the region within 10kb of any of the TSSs belonging to a gene. If TSSs from two adjacent genes are within 20 kb of each other, we use the midpoint between the two TSSs as the boundary for the locus for each gene.
  • 10kb_outside_upstream: The locus is the region more than 10kb upstream from a TSS to the midpoint between the adjacent TSS.
  • 10kb_outside: The locus is the region more than 10kb upstream or downstream from a TSS to the midpoint between the adjacent TSS.

The complete listing of genome build and locus definition pairs can be listed with supported_locusdefs():

# Take head because it's long
head(supported_locusdefs())
##     genome              locusdef
## 1 danRer10                  10kb
## 2 danRer10          10kb_outside
## 3 danRer10 10kb_outside_upstream
## 4 danRer10                   1kb
## 5 danRer10           1kb_outside
## 6 danRer10  1kb_outside_upstream

2.3.2 Custom locus definitions

Users can create custom locus definitions for any of the supported_genomes(), and pass the file path as the value of the locusdef parameter in broadenrich(), chipenrich(), or polyenrich(). Custom locus definitions should be defined in a tab-delimited text file with column names chr, start, end, and gene_id. For example:

chr start   end geneid
chr1    839460  839610  148398
chr1    840040  840190  148398
chr1    840040  840190  57801
chr1    840800  840950  148398
chr1    841160  841310  148398

2.3.3 Selecting a locus definition

For a transcription factor ChIP-seq experiment, selecting a particular locus definition for use in enrichment testing implies how the TF is assumed to regulate genes. For example, selecting the 1kb locus definition will imply that the biological processes found enriched are a result of TF regulation near the promoter. In contrast, selecting the 5kb_outside locus definition will imply that the biological processes found enriched are a result of TF regulation distal from the promoter.

Selecting a locus definition can also help reduce the noise in the enrichment tests. For example, if a TF is known to primarily regulate genes by binding around the promoter, then selecting the 1kb locus definition can help to reduce the noise from TSS-distal peaks in the enrichment testing.

The plot_dist_to_tss() QC plot displays where peak midpoints fall relative to TSSs genome-wide, and can help inform the choice of locus definition. For example, if many peaks fall far from the TSS, the nearest_tss locus definition may be a good choice because it will capture all peaks, whereas the 1kb locus definition may not capture many of the peaks and adversely affect the enrichment testing.

2.4 Gene Sets

Gene sets are sets of genes that represent a particular biological function. Popular gene sets used by chipenrich include: KEGG Pathways, Gene Ontology, and Reactome Pathways.

2.4.1 Built-in gene sets

Gene sets for fly, human, mouse, rat, and zebrafish are built in to chipenrich. Some organisms have gene sets that others do not, so check with:

# Take head because it's long
head(supported_genesets())
##     geneset organism
## 1      GOBP      dme
## 6      GOCC      dme
## 11     GOMF      dme
## 49 reactome      dme
## 2      GOBP      dre
## 7      GOCC      dre

2.4.2 Custom gene sets

Users can perform GSE on custom gene sets for any supported organism by passing the file path as the value of genesets parameter in broadenrich(), chipenrich(), or polyenrich(). Custom gene set definitions should be defined in a tab-delimited text file with a header. The first column should be the geneset ID or name, and the second column should be the Entrez IDs belonging to the geneset. For example:

gs_id   gene_id
GO:0006631  30
GO:0006631  31
GO:0006631  32
GO:0006631  33
GO:0006631  34
GO:0006631  35
GO:0006631  36
GO:0006631  37
GO:0006631  51
GO:0006631  131
GO:0006631  183
GO:0006631  207
GO:0006631  208
GO:0006631  215
GO:0006631  225

2.5 Mappability

We define base pair mappability as the average read mappability of all possible reads of size K that encompass a specific base pair location, \(b\). Mappability files from UCSC Genome Browser mappability track were used to calculate base pair mappability. The mappability track provides values for theoretical read mappability, or the number of places in the genome that could be mapped by a read that begins with the base pair location \(b\). For example, a value of 1 indicates a Kmer read beginning at \(b\) is mappable to one area in the genome. A value of 0.5 indicates a Kmer read beginning at \(b\) is mappable to two areas in the genome. For our purposes, we are only interested in uniquely mappable reads; therefore, all reads with mappability less than 1 were set to 0 to indicate non-unique mappability. Then, base pair mappability is calculated as:

\[ \begin{equation} M_{i} = (\frac{1}{2K-1}) \sum_{j=i-K+1}^{i+(K-1)} M_{j} \end{equation} \]

where \(M_{i}\) is the mappability of base pair \(i\), and \(M_{j}\) is mappability (from UCSC’s mappability track) of read \(j\) where j is the start position of the K length read.

2.5.1 Built-in mappability

Base pair mappability for reads of lengths 24, 36, 40, 50, 75, and 100 base pairs for hg19 and for reads of lengths 36, 40, 50, 75, and 100 base pairs mm9 a included. See the complete list with:

# Take head because it's long
head(supported_read_lengths())
##   genome locusdef read_length
## 1   hg19     10kb         100
## 2   hg19     10kb          24
## 3   hg19     10kb          36
## 4   hg19     10kb          40
## 5   hg19     10kb          50
## 6   hg19     10kb          75

2.5.2 Custom mappability

Users can use custom mappability with any built-in locus definition (if, for example, the read length needed is not present), or with a custom locus definition. Custom mappability should be defined in a tab-delimited text file with columns named gene_id and mappa. Gene IDs should be Entrez Gene IDs, and mappability should be in [0,1]. A check is performed to verify that the gene IDs in the locus definition and mappability overlap by at least 95%. An example custom mappability file looks like:

mappa   gene_id
0.8 8487
0.1 84
0.6 91
1   1000

2.6 Testing for enrichment

As stated in the introduction, the chipenrich package includes three classes of methods for doing GSE testing. For each method, we describe the intended use case, the model used for enrichment, and an example using the method.

2.6.1 broadenrich()

Broad-Enrich is designed for use with broad peaks that may intersect multiple gene loci, and cumulatively cover greater than 5% of the genome. For example, ChIP-seq experiments for histone modifications.

The Broad-Enrich method uses the cumulative peak coverage of genes in its logistic regression model for enrichment: GO ~ ratio + s(log10_length). Here, GO is a binary vector indicating whether a gene is in the gene set being tested, ratio is a numeric vector indicating the ratio of the gene covered by peaks, and s(log10_length) is a binomial cubic smoothing spline which adjusts for the relationship between gene coverage and locus length.

gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = broadenrich(peaks = peaks_H3K4me3_GM12878, genome = 'hg19', genesets = gs_path,
    locusdef = "nearest_tss", qc_plots = FALSE, out_name = NULL, n_cores=1)
results.be = results$results
print(results.be[1:5,1:5])
##    Geneset.Type Geneset.ID Description      P.value          FDR
## 1 user-supplied GO:0002521  GO:0002521 7.159376e-06 5.966147e-05
## 2 user-supplied GO:0031400  GO:0031400 6.916624e-05 4.322890e-04
## 3 user-supplied GO:0022411  GO:0022411 1.561766e-04 7.808829e-04
## 4 user-supplied GO:0071845  GO:0071845 5.317505e-04 1.783107e-03
## 5 user-supplied GO:0022604  GO:0022604 3.334969e-03 8.337422e-03

2.6.2 chipenrich()

ChIP-Enrich is designed for use with 1,000s or 10,000s of narrow peaks which results in fewer gene loci containing a peak overall. For example, ChIP-seq experiments for transcription factors.

The ChIP-Enrich method uses the presence of a peak in its logistic regression model for enrichment: peak ~ GO + s(log10_length). Here, GO is a binary vector indicating whether a gene is in the gene set being tested, peak is a binary vector indicating the presence of a peak in a gene, and s(log10_length) is a binomial cubic smoothing spline which adjusts for the relationship between the presence of a peak and locus length.

# Without mappability
gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = chipenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path,
    locusdef = "nearest_tss", qc_plots = FALSE, out_name = NULL, n_cores = 1)
results.ce = results$results
print(results.ce[1:5,1:5])
##    Geneset.Type Geneset.ID Description      P.value          FDR
## 1 user-supplied GO:0034660  GO:0034660 5.116330e-05 0.0004263609
## 2 user-supplied GO:0007346  GO:0007346 8.344453e-05 0.0005215283
## 3 user-supplied GO:0031400  GO:0031400 1.216352e-03 0.0060817587
## 4 user-supplied GO:0009314  GO:0009314 2.350799e-02 0.0734624730
## 5 user-supplied GO:0051129  GO:0051129 1.274418e-01 0.2739779165
# With mappability
gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = chipenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path,
    locusdef = "nearest_tss", mappability=24, qc_plots = FALSE,
    out_name = NULL,n_cores=1)
results.cem = results$results
print(results.cem[1:5,1:5])
##    Geneset.Type Geneset.ID Description      P.value          FDR
## 1 user-supplied GO:0034660  GO:0034660 4.270420e-05 0.0003558683
## 2 user-supplied GO:0007346  GO:0007346 6.858707e-05 0.0004286692
## 3 user-supplied GO:0031400  GO:0031400 1.086810e-03 0.0054340518
## 4 user-supplied GO:0009314  GO:0009314 2.547550e-02 0.0796109494
## 5 user-supplied GO:0043623  GO:0043623 1.156425e-01 0.2409219495

2.6.3 polyenrich()

Poly-Enrich is also designed for narrow peaks, but where there are 100,000s of peaks which results in nearly every gene locus containing a peak. For example, ChIP-seq experiments for transcription factors.

The Poly-Enrich method uses the number of peaks in genes in its negative binomial regression model for enrichment: num_peaks ~ GO + s(log10_length). Here, GO is a binary vector indicating whether a gene is in the gene set being tested, num_peaks is a numeric vector indicating the number of peaks in each gene, and s(log10_length) is a negative binomial cubic smoothing spline which adjusts for the relationship between the number of peaks in a gene and locus length.

gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = polyenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path,  method = 'polyenrich',
    locusdef = "nearest_tss", qc_plots = FALSE, out_name = NULL, n_cores = 1)
results.pe = results$results
print(results.pe[1:5,1:5])
##    Geneset.Type Geneset.ID Description      P.value          FDR
## 1 user-supplied GO:0007346  GO:0007346 7.211038e-05 0.0006009198
## 2 user-supplied GO:0031400  GO:0031400 4.716383e-04 0.0029477392
## 3 user-supplied GO:0009314  GO:0009314 1.277385e-03 0.0053224386
## 4 user-supplied GO:0051129  GO:0051129 1.634697e-01 0.4434484188
## 5 user-supplied GO:0043623  GO:0043623 2.128552e-01 0.4434484188

2.7 QC Plots

Each enrich function outputs QC plots if qc_plots = TRUE. There are also stand-alone functions to make the QC plots without the need for GSE testing. The QC plots can be used to help determine which locus definition to use, or which enrichment method is more appropriate.

2.7.1 Peak distance to TSS distribution

This plot gives a distribution of the distance of the peak midpoints to the TSSs. It can help in selecting a locus definition. For example, if genes are primarily within 5kb of TSSs, then the 5kb locus definition may be a good choice. In contrast, if most genes fall far from TSSs, the nearest_tss locus definition may be a good choice.

# Output in chipenrich and polyenrich
plot_dist_to_tss(peaks = peaks_E2F4, genome = 'hg19')
E2F4 peak distances to TSS

Figure 1: E2F4 peak distances to TSS

2.7.2 Presence of peak versus locus length

This plot visualizes the relationship between the presence of at least one peak in a gene locus and the locus length (on the log10 scale). For clarity of visualization, each point represents 25 gene loci binned after sorting by locus length. The expected fit under the assumptions of Fisher’s Exact Test (horizontal line), and a binomial-based test (gray curve) are displayed to indicate how the dataset being enriched conforms to the assumption of each. The empirical spline used in the chipenrich method is in orange.

# Output in chipenrich
plot_chipenrich_spline(peaks = peaks_E2F4, locusdef = 'nearest_tss', genome = 'hg19')
E2F4 chipenrich spline without mappability

Figure 2: E2F4 chipenrich spline without mappability

2.7.3 Number of peaks versus locus length

This plot visualizes the relationship between the number of peaks assigned to a gene and the locus length (on the log10 scale). For clarity of visualization, each point represents 25 gene loci binned after sorting by locus length. The empirical spline used in the polyenrich method is in orange.

If many gene loci have multiple peaks assigned to them, polyenrich is likely an appropriate method. If there are a low number of peaks per gene, then chipenrich() may be the appropriate method.

# Output in polyenrich
plot_polyenrich_spline(peaks = peaks_E2F4, locusdef = 'nearest_tss', genome = 'hg19')
E2F4 polyenrich spline without mappability

Figure 3: E2F4 polyenrich spline without mappability

2.7.4 Gene coverage versus locus length

This plot visualizes the relationship between proportion of the gene locus covered by peaks and the locus length (on the log10 scale). For clarity of visualization, each point represents 25 gene loci binned after sorting by locus length.

# Output in broadenrich
plot_gene_coverage(peaks = peaks_H3K4me3_GM12878, locusdef = 'nearest_tss',  genome = 'hg19')
H3K4me3 gene coverage

Figure 4: H3K4me3 gene coverage

2.8 Output

The output of broadenrich(), chipenrich(), and polyenrich() is a list with components corresponding to each section below. If out_name is not NULL, then a file for each component will be written to the out_path with prefixes of out_name.

2.8.1 Assigned peaks

Peak assignments are stored in $peaks. This is a peak-level summary. Each line corresponds to a peak intersecting a particular gene locus defined in the selected locus definition. In the case of broadenrich() peaks may be assigned to multiple gene loci. Doing table() on the peak_id column will indicate how many genes are assigned to each peak.

head(results$peaks)
##   peak_id  chr peak_start peak_end gene_id gene_symbol gene_locus_start
## 1  peak:1 chr1     816846   816937  284593      FAM41C           787681
## 2  peak:2 chr1     859131   859258  148398      SAMD11           836357
## 3  peak:3 chr1     877208   877306  148398      SAMD11           836357
## 4  peak:4 chr1     895928   896050  339451      KLHL17           895324
## 5  peak:5 chr1     968519   968706  375790        AGRN           952176
## 6  peak:6 chr1     968678   968804  375790        AGRN           952176
##   gene_locus_end nearest_tss dist_to_tss nearest_tss_gene_id
## 1         836356      860530      -43638              148398
## 2         879482      860530       -1335              148398
## 3         879482      882440        5182               26155
## 4         899806      896829        -839              339451
## 5         982595     1009687       41074              401934
## 6         982595     1009687       40945              401934
##   nearest_tss_symbol nearest_tss_gene_strand
## 1             SAMD11                       +
## 2             SAMD11                       +
## 3              NOC2L                       -
## 4             KLHL17                       +
## 5             RNF223                       -
## 6             RNF223                       -

2.8.2 Peaks-per-gene

Peak information aggregated over gene loci is stored in $peaks_per_gene. This is a gene-level summary. Each line corresponds to aggregated peak information over the gene_id such as the number of peaks assigned to the gene locus or the ratio of the gene locus covered in the case of broadenrich().

head(results$peaks_per_gene)
##       gene_id  length log10_length num_peaks peak
## 15683  144245 2469903     6.392680        21    1
## 19852  643955 2785792     6.444949        18    1
## 15546  139886 2421902     6.384157        14    1
## 18007  340441 2378457     6.376295        13    1
## 19911  645367 1508636     6.178584        13    1
## 13904   84920 2558900     6.408053        12    1

2.8.3 Gene set enrichment results

GSE results are stored in $results. For convenience, gene set descriptions are provided in addition to the gene set ID (which is the same as the ID from the originating database). The Status column takes values of enriched if the Effect is > 0 and depleted if < 0, with enriched results being of primary importance. Finally, the Geneset.Peak.Genes column gives a list of gene IDs that had signal contributing to the test for enrichment. This list can be used to cross reference information from $peaks or $peaks_per_gene if desired.

head(results$results)
##    Geneset.Type Geneset.ID Description      P.value          FDR     Effect
## 1 user-supplied GO:0007346  GO:0007346 7.211038e-05 0.0006009198 0.26845843
## 2 user-supplied GO:0031400  GO:0031400 4.716383e-04 0.0029477392 0.23700191
## 3 user-supplied GO:0009314  GO:0009314 1.277385e-03 0.0053224386 0.22358776
## 4 user-supplied GO:0051129  GO:0051129 1.634697e-01 0.4434484188 0.09638929
## 5 user-supplied GO:0043623  GO:0043623 2.128552e-01 0.4434484188 0.09084678
## 6 user-supplied GO:0016055  GO:0016055 2.427236e-01 0.4667761884 0.08105132
##   Odds.Ratio   Status N.Geneset.Genes N.Geneset.Peak.Genes
## 1   1.307947 enriched             296                  186
## 2   1.267444 enriched             294                  181
## 3   1.250555 enriched             282                  167
## 4   1.101188 enriched             293                  173
## 5   1.095101 enriched             278                  158
## 6   1.084427 enriched             282                  169
##   Geneset.Avg.Gene.Length
## 1                117873.0
## 2                132200.6
## 3                126212.6
## 4                170940.7
## 5                131632.5
## 6                196058.8
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Geneset.Peak.Genes
## 1 3398, 5569, 5586, 9184, 55159, 415116, 1026, 1029, 3265, 5578, 5727, 150094, 323, 996, 1111, 2296, 3146, 4091, 6659, 7314, 8451, 8621, 8626, 8877, 9099, 10296, 10950, 23026, 29117, 55023, 57026, 79791, 351, 595, 596, 598, 652, 655, 675, 699, 835, 994, 995, 1104, 1869, 1977, 1978, 3275, 4085, 4609, 5300, 5311, 5347, 5682, 5689, 5690, 5698, 5709, 5713, 6233, 6597, 6776, 6790, 7153, 7161, 7324, 8091, 8318, 8379, 8558, 8881, 9735, 9787, 10783, 23137, 23326, 23411, 23476, 23513, 26271, 26574, 27085, 54623, 55743, 56475, 63967, 64326, 80895, 81620, 90381, 118611, 332, 472, 701, 890, 891, 900, 983, 990, 991, 1017, 1027, 1063, 1263, 1540, 1613, 1761, 1814, 1877, 2273, 2290, 3364, 3553, 3643, 4193, 4751, 5037, 5048, 5119, 5424, 5684, 5686, 5702, 5704, 5707, 5708, 5711, 5714, 5715, 5717, 5718, 5728, 5883, 5925, 6777, 7013, 7040, 7175, 7272, 7311, 7321, 7480, 8697, 8883, 9088, 9183, 9491, 9585, 9587, 9700, 10201, 10213, 10361, 10393, 10459, 10537, 11065, 11130, 22974, 23087, 25906, 26013, 29882, 29945, 51143, 51203, 51379, 51433, 51499, 51512, 51529, 54443, 54962, 54998, 55022, 55055, 55367, 55726, 64682, 79027, 79577, 140609, 143471, 286053, 340152, 340533
## 2                                200734, 3480, 3720, 4616, 7026, 9184, 10221, 10253, 1026, 1029, 1846, 1849, 3725, 5079, 7128, 8165, 10114, 56940, 57732, 64092, 996, 1844, 2729, 4067, 4091, 4092, 6497, 6659, 7314, 9021, 9683, 11221, 25998, 51347, 161742, 25, 652, 655, 1647, 1786, 1789, 1843, 1847, 1850, 2776, 2810, 3659, 4085, 4298, 4763, 4771, 5347, 5524, 5682, 5689, 5690, 5698, 5709, 5713, 5987, 6233, 6622, 7161, 7249, 7324, 8613, 8651, 8692, 8881, 9353, 10252, 11213, 23411, 26271, 26524, 51160, 51343, 51422, 54206, 54880, 64780, 80824, 153090, 408, 409, 672, 701, 836, 857, 891, 991, 1027, 1487, 1816, 1855, 2629, 2873, 2950, 3553, 3643, 3984, 4221, 5037, 5167, 5170, 5580, 5590, 5611, 5663, 5664, 5684, 5686, 5702, 5704, 5707, 5708, 5711, 5714, 5715, 5717, 5718, 5728, 5795, 5925, 5998, 5999, 6418, 6422, 6423, 6609, 6895, 7040, 7291, 7311, 7320, 7321, 7375, 7471, 7531, 8697, 9093, 9113, 9241, 9370, 9467, 9474, 9491, 9529, 9655, 10213, 10393, 10399, 10459, 10507, 10614, 10724, 11065, 11261, 23376, 23560, 23636, 26277, 26973, 29882, 29945, 51433, 51529, 51654, 51763, 55022, 55364, 58509, 58533, 64359, 64682, 64853, 80725, 81848, 116496, 171392, 374969
## 3                                                                                                                                    2902, 7528, 22976, 641, 55159, 777, 1026, 2625, 3265, 3725, 3726, 7398, 8243, 10277, 51776, 840, 993, 1111, 1153, 1432, 2353, 3593, 4255, 4968, 5366, 7056, 8553, 8626, 9021, 9575, 25788, 55075, 57646, 64782, 84236, 351, 595, 596, 598, 675, 847, 1407, 1647, 1789, 1843, 1958, 2073, 2235, 2793, 2903, 2956, 3383, 4609, 4734, 4763, 5153, 5187, 5796, 5810, 5970, 6310, 6794, 7161, 8091, 8493, 8600, 8692, 10276, 23411, 27023, 51720, 57654, 64421, 64859, 79840, 83759, 84142, 90381, 118611, 120227, 261734, 147, 301, 388, 472, 545, 672, 836, 839, 857, 1032, 1263, 1396, 1586, 1588, 1643, 1814, 1816, 1894, 1956, 2063, 2071, 2138, 2140, 2177, 2189, 2237, 2547, 3014, 3091, 3364, 3553, 3592, 3713, 3815, 4144, 4157, 4221, 4308, 4311, 4436, 4867, 4893, 5424, 5591, 5599, 5883, 5936, 6422, 6423, 6506, 6591, 7040, 7159, 7222, 7320, 7391, 7516, 7518, 8438, 8533, 9025, 9212, 9577, 9600, 10206, 10413, 11073, 11284, 23596, 25896, 51150, 51455, 51514, 54962, 56852, 58493, 60626, 63943, 79035, 83695, 84268, 85453, 121457, 165918, 195814, 221927
## 4                                                                                             6904, 6711, 3720, 6092, 9184, 23493, 1902, 2672, 4214, 5079, 5525, 6714, 6772, 11252, 25913, 53335, 382, 396, 996, 1111, 2932, 3146, 4091, 4092, 5494, 5962, 6497, 6709, 7533, 8543, 22902, 26586, 598, 652, 655, 664, 699, 830, 899, 1630, 1786, 1789, 1901, 1978, 2043, 2067, 2288, 2395, 3196, 4085, 4135, 5747, 5879, 6311, 6622, 6695, 7143, 7324, 8379, 8408, 8881, 9139, 9353, 9711, 10395, 11078, 11104, 11213, 23122, 23189, 23332, 26271, 50964, 51185, 54880, 55755, 57142, 57731, 80312, 120892, 153090, 118, 335, 341, 347, 357, 387, 395, 472, 672, 701, 832, 891, 991, 1063, 1191, 1265, 1487, 1809, 2039, 3066, 3956, 4070, 4323, 4761, 4804, 4869, 4976, 5048, 5580, 5587, 5590, 5663, 5728, 5734, 5756, 5792, 5894, 6259, 6418, 6419, 6422, 6423, 7013, 7014, 7040, 7175, 7267, 7272, 7291, 7320, 7321, 7473, 8697, 8898, 9138, 9168, 9181, 9370, 9423, 9700, 9856, 10062, 10201, 10393, 10399, 10459, 11065, 11075, 11344, 22919, 23063, 25906, 26277, 26278, 29882, 29945, 50944, 51143, 51433, 51529, 51763, 54386, 54998, 55022, 58526, 64131, 64682, 65057, 65078, 116985, 192683, 286527
## 5                                                                                                                                                    6904, 6711, 284, 1060, 6047, 9659, 10383, 10718, 4214, 6714, 57580, 382, 1175, 1729, 2241, 3320, 4087, 4091, 5494, 5962, 6709, 10059, 10376, 57180, 811, 830, 1058, 2288, 3084, 3092, 3329, 3717, 4086, 4089, 4292, 4690, 5581, 5747, 5879, 6249, 6622, 6812, 7019, 7273, 8195, 8775, 8867, 8932, 9353, 9997, 10435, 10959, 11078, 23189, 25909, 25915, 26054, 26271, 27338, 55172, 55835, 57731, 79003, 84790, 133746, 137682, 203068, 347733, 644096, 118, 229, 395, 466, 832, 857, 1027, 1062, 1063, 1236, 1352, 1730, 2039, 2280, 2475, 3308, 4724, 5018, 5063, 5580, 5618, 5711, 5715, 5756, 5921, 6341, 6729, 6902, 6950, 7013, 7040, 7186, 7280, 7283, 7454, 8440, 8522, 8618, 8624, 8936, 9113, 9131, 9168, 9212, 9372, 9493, 10092, 10095, 10109, 10273, 10381, 10552, 10576, 10972, 11047, 11065, 11075, 11076, 11344, 22919, 23126, 23149, 23165, 27175, 29078, 29127, 51371, 54443, 54968, 55036, 55154, 55706, 55971, 57505, 57606, 58526, 63971, 79133, 79738, 80152, 81873, 115548, 123872, 163126, 203547, 285521, 286527, 374291, 387103
## 6                                                                                 3090, 4300, 6788, 7091, 6662, 2254, 2625, 4851, 5218, 51176, 51741, 441478, 650, 1456, 1499, 2869, 2932, 5494, 6262, 6382, 6469, 6497, 6659, 7476, 7481, 8945, 9209, 10042, 29117, 55959, 79718, 80319, 80326, 85458, 440193, 595, 1454, 1455, 1958, 2010, 2263, 2870, 2887, 4609, 5336, 6299, 6468, 6657, 6789, 6794, 6801, 6928, 7088, 7089, 7249, 7472, 7477, 8452, 8613, 8658, 8932, 8994, 9368, 9736, 9863, 11197, 23189, 23213, 23291, 23401, 25776, 26524, 27130, 29969, 30851, 50964, 51588, 54623, 54764, 59343, 65062, 79412, 80351, 84870, 90780, 153090, 219287, 340419, 857, 898, 1453, 1482, 1500, 1540, 1601, 1613, 1855, 1856, 1857, 2116, 2487, 3087, 3219, 4035, 4038, 4188, 4435, 4920, 5562, 5626, 5728, 5754, 5868, 6259, 6422, 6423, 6591, 6663, 6907, 7097, 7320, 7471, 7473, 7479, 7480, 7855, 8313, 8321, 8324, 8326, 8532, 8840, 8861, 9113, 9241, 10023, 10076, 10297, 10399, 23043, 23168, 23499, 25805, 51339, 51366, 51444, 53944, 55681, 55897, 56033, 57680, 57822, 59349, 64321, 64359, 65981, 79577, 80114, 80139, 81029, 81847, 83999, 84133, 84445, 85407, 91355, 144165, 205147, 219771

2.9 Assessing Type I Error with Randomizations

Randomization of locus definitions allows for the assessment of Type I Error under the null hypothesis of no true gene set enrichment. A well-calibrated Type I Error means that the false positive rate is controlled, and the p-values reported for actual data can be trusted. In both Welch & Lee, et al. and Cavalcante, et al., we demonstrated that both chipenrich() and broadenrich() have well-calibrated Type I Error over dozens of publicly available ENCODE ChIP-seq datasets. Unpublished data suggests the same is true for polyenrich().

Within chipenrich(), broadenrich(), and polyenrich(), the randomization parameters can be used to assess the Type I Error for the data being analyzed.

The randomization codes, and their effects are:

  • NULL: No randomizations, the default.
  • complete: Shuffle the gene_id and symbol columns of the locusdef together, without regard for the chromosome location, or locus length. The null hypothesis is that there is no true gene set enrichment.
  • bylength: Shuffle the gene_id and symbol columns of the locusdef together, within bins of 100 genes sorted by locus length. The null hypothesis is that there is no true gene set enrichment, but with preserved locus length relationship.
  • bylocation: Shuffle the gene_id and symbol columns of the locusdef together, within bins of 50 genes sorted by genomic location. The null hypothesis is that there is no true gene set enrichment, but with preserved genomic location.

The return value of chipenrich(), broadenrich(), or polyenrich() with a selected randomization is the same list object described above. To assess the Type I error, the alpha level for the particular data set can be calculated by dividing the total number of gene sets with p-value < alpha by the total number of tests tested. Users may want to perform multiple randomizations for a set of peaks and take the median of the alpha values.

# Assessing if alpha = 0.05
gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = chipenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path,
    locusdef = "nearest_tss", qc_plots = FALSE, randomization = 'complete',
    out_name = NULL, n_cores = 1)
alpha = sum(results$results$P.value < 0.05) / nrow(results$results)
# NOTE: This is for
print(alpha)
## [1] 0.04

3 References

R.P. Welch^, C. Lee^, R.A. Smith, P. Imbriano, S. Patil, T. Weymouth, L.J. Scott, M.A. Sartor. “ChIP-Enrich: gene set enrichment testing for ChIP-seq data.” Nucl. Acids Res. (2014) 42(13):e105. doi:10.1093/nar/gku463

R.G. Cavalcante, C. Lee, R.P. Welch, S. Patil, T. Weymouth, L.J. Scott, and M.A. Sartor. “Broad-Enrich: functional interpretation of large sets of broad genomic regions.” Bioinformatics (2014) 30(17):i393-i400 doi:10.1093/bioinformatics/btu444