GenoGAM 2.2.0
Note: if you use GenoGAM in published research, please cite:
Stricker and Engelhardt, et al. (2017) GenoGAM: Genome-wide generalized additive models for ChIP-seq analysis Bioinformatics, 33:15. 10.1093/bioinformatics/btx150
and
Stricker, Galinier and Gagneur (2018) GenoGAM 2.0: scalable and efficient implementation of genome-wide generalized additive models for gigabase-scale genomes BMC Bioinformatics, 19:247. 10.1186/s12859-018-2238-7
This is the brief version of the usual workflow of GenoGAM. It involves:
Reading in data through GenoGAMDataSet()
to get a GenoGAMDataSet
object. This is done with the HDF5 backend.
Computing size factors with computeSizeFactors()
Compute the model with genogam()
to get the result GenoGAM
object
compute position-wise log p-values with computeSignificance()
The example dataset is restricted to the first 100kb of the first yeast chromosome.
library(GenoGAM)
## A.
## specify folder and experiment design path
wd <- system.file("extdata/Set1", package='GenoGAM')
folder <- file.path(wd, "bam")
expDesign <- file.path(wd, "experimentDesign.txt")
## B.
## register parallel backend (default is "the number of cores" - 2)
BiocParallel::register(BiocParallel::MulticoreParam(worker = 2))
## C.
## specify chunk and overhang size.
chunkSize <- 50000
overhangSize <- 1000
## D.
## the experiment design reflecting the underlying GAM
## x = position
design <- ~ s(x) + s(x, by = genotype)
GenoGAMDataSet
object.## build the GenoGAMDataSet
ggd <- GenoGAMDataSet(
expDesign, directory = folder,
chunkSize = chunkSize, overhangSize = overhangSize,
design = design)
## INFO [2019-05-02 21:58:43] Creating GenoGAMDataSet
## INFO [2019-05-02 21:58:44] Reading in data
## INFO [2019-05-02 21:58:44] Reading in wt_1
## INFO [2019-05-02 21:58:45] Reading in wt_2
## INFO [2019-05-02 21:58:46] Reading in mutant_1
## INFO [2019-05-02 21:58:46] Reading in mutant_2
## INFO [2019-05-02 21:58:46] Finished reading in data
## INFO [2019-05-02 21:58:47] GenoGAMDataSet created
ggd
## class: GenoGAMDataSet
## dimension: 100000 4
## samples(4): wt_1 wt_2 mutant_1 mutant_2
## design variables(1): genotype
## tiles size: 52kbp
## number of tiles: 2
## chromosomes: chrI
## size factors:
## wt_1 wt_2 mutant_1 mutant_2
## 0 0 0 0
## formula:
## ~ s(x) + s(x, by = genotype)
## compute size factors
ggd <- computeSizeFactors(ggd)
## INFO [2019-05-02 21:58:47] Computing size factors
## INFO [2019-05-02 21:58:47] DONE
ggd
## class: GenoGAMDataSet
## dimension: 100000 4
## samples(4): wt_1 wt_2 mutant_1 mutant_2
## design variables(1): genotype
## tiles size: 52kbp
## number of tiles: 2
## chromosomes: chrI
## size factors:
## wt_1 wt_2 mutant_1 mutant_2
## -0.04156987 0.04210819 -0.25589212 0.14328182
## formula:
## ~ s(x) + s(x, by = genotype)
## the data
assay(ggd)
## DataFrame with 100000 rows and 4 columns
## wt_1 wt_2 mutant_1 mutant_2
## <Rle> <Rle> <Rle> <Rle>
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## ... ... ... ... ...
## 99996 0 0 0 0
## 99997 0 0 0 0
## 99998 0 0 0 0
## 99999 0 0 0 0
## 100000 1 1 0 0
## compute model without parameter estimation to save time in vignette
result <- genogam(ggd, lambda = 4601, theta = 4.51)
## INFO [2019-05-02 21:58:47] Initializing the model
## INFO [2019-05-02 21:58:47] Done
## INFO [2019-05-02 21:58:48] Fitting model
## INFO [2019-05-02 21:58:54] Done
## INFO [2019-05-02 21:58:54] Processing fits
## INFO [2019-05-02 21:58:55] Processing done
## INFO [2019-05-02 21:58:55] Finished
result
## Family: nb
## Formula: ~ s(x) + s(x, by = genotype)
## Class: GenoGAM
## Dimension: 100000 4
## Samples(4): wt_1 wt_2 mutant_1 mutant_2
## Design variables(1): genotype
## Smooth functions(2): s(x) s(x):genotype
## Chromosomes: chrI
##
## Size factors:
## wt_1 wt_2 mutant_1 mutant_2
## -0.04156987 0.04210819 -0.25589212 0.14328182
##
## Cross Validation: Not performed
##
## Spline Parameters:
## Knot spacing: 20
## B-spline order: 2
## Penalization order: 2
##
## Tile settings:
## Chunk size: 50000
## Tile size: 52000
## Overhang size: 1000
## Number of tiles: 2
## Evaluated genome ranges:
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 1-100000 *
## -------
## seqinfo: 1 sequence from an unspecified genome
## the fit and standard error
fits(result)
## DataFrame with 100000 rows and 2 columns
## s(x) s(x):genotype
## <numeric> <numeric>
## 1 -2.81190225503088 0.5320456038111
## 2 -2.81128719366336 0.531848806972624
## 3 -2.81067225244489 0.531651893033589
## 4 -2.81005744734502 0.531454845516499
## 5 -2.80944279433325 0.531257647943864
## ... ... ...
## 99996 -1.86772191646085 -0.110004776821936
## 99997 -1.875276615882 -0.110350587546044
## 99998 -1.88283132577617 -0.110696460554156
## 99999 -1.89038604404279 -0.111042389677715
## 100000 -1.89794076858129 -0.111388368748164
se(result)
## DataFrame with 100000 rows and 2 columns
## s(x) s(x):genotype
## <numeric> <numeric>
## 1 0.258647538432571 0.317642013209663
## 2 0.257521587708836 0.316405090894515
## 3 0.256401263130459 0.31517351476289
## 4 0.255286594725364 0.31394731225843
## 5 0.254177609315401 0.312726508234327
## ... ... ...
## 99996 0.156906848359635 0.209381149691971
## 99997 0.157924553240842 0.210544652601457
## 99998 0.158948126782311 0.211713323560403
## 99999 0.159977531447117 0.212887133067987
## 100000 0.161012724757534 0.214066047892396
computeSignificance(result)
## INFO [2019-05-02 21:58:56] Computing positionwise p-values
## INFO [2019-05-02 21:58:56] DONE
pvalue(result)
## DataFrame with 100000 rows and 2 columns
## s(x) s(x):genotype
## <numeric> <numeric>
## 1 1.57484620773576e-27 0.0939371770243043
## 2 9.5912533137095e-28 0.0927801973572222
## 3 5.81947376361327e-28 0.091631313734527
## 4 3.51774570066058e-28 0.0904906101030138
## 5 2.11845197903288e-28 0.0893581659938518
## ... ... ...
## 99996 1.13648308616769e-32 0.599318694195086
## 99997 1.6057301625264e-32 0.600195329046113
## 99998 2.26867759564966e-32 0.601071576336622
## 99999 3.2049678853473e-32 0.601947355046144
## 100000 4.52673255409177e-32 0.602822579980591
Plot results of the center 10kb, where both tiles are joined together.
ranges = GRanges("chrI", IRanges(45000, 55000))
plotGenoGAM(result, ranges = ranges)
Additional to the basic smoothing and point-wise significance computation this version of GenoGAM now also supports differential analysis and peak calling with position-wise confidence intervals on ChIP-Seq data.
A small dataset is provided to illustrate the ChIP-Seq functionalities. This is a subset of the data published by Thornton et al (Thornton et al. 2014), who assayed histone H3 Lysine 4 trimethylation (H3K4me3) by ChIP-Seq comparing wild type yeast versus a mutant with a truncated form of Set1, the yeast H3 Lysine 4 methylase. The goal of this analysis is the identification of genomic positions that are significantly differentially methylated in the mutant compared to the wild type strain.
To this end, we will build a GenoGAM model that models the logarithm of the expected ChIP-seq fragment counts \(y\) as sums of smooth functions of the genomic position \(x\). Specifically, we write (with simplified notations) that:
\[\begin{equation} \log(\operatorname{E}(y)) = f(x) + \text{genotype} \times f_\text{mutant/wt}(x) \tag{1} \end{equation}\]where genotype is 1 for data from the mutant samples, and 0 for the wild type. Here the function \(f(x)\) is the reference level, i.e. the log-rate in the wild type strain. The function \(f_\text{mutant/wt}(x)\) is the log-ratio of the mutant over wild-type. We will then statistically test the null hypothesis \(f_\text{mutant/wt}(x) = 0\) at each position \(x\). In the following we show how to build the dataset, perform the fitting of the model and perform the testing.
The parallel backend is registered using the BiocParallel package. See the documentation in the BiocParallel
class for the correct use. Also note, that BiocParallel
is just an interface to multiple parallel packages. For example in order to use GenoGAM on a cluster, the BatchJobs package might be required. The parallel backend can be registered at anytime as GenoGAM will just call the current one.
IMPORTANT: According to this and this posts on the Bioconductor support page and R-devel mailing list, the most important core feature of the multicore backend, shared memory, is compromised by Rs own garbage collector, resulting in a replication of the entire workspace across all cores. Given that the amount of data in memory is big (without the HDF5 backend) it might crash the entire system. We highly advice to register the SnowParam backend to avoid this if working on larger organisms. This way the overhead is a little bigger, but only necessary data is copied to the workers, keeping memory consumption relatively low. We never experienced a higher load than 4GB per core, usually it was around 2GB on human genome, which is even lower with the HDF5 backend.
library(GenoGAM)
## On multicore machines by default the number of available cores - 2 are registered on the default backend 'Multicore'
BiocParallel::registered()[1]
## $MulticoreParam
## class: MulticoreParam
## bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
## bpexportglobals: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: FORK
For this small example we would like to assign less workers. Check BiocParallel for other possible backends and more options for SnowParam
BiocParallel::register(BiocParallel::SnowParam(workers = 3))
If we check the current registered backend, we see that the number of workers and the backend has changed.
BiocParallel::registered()[1]
## $SnowParam
## class: SnowParam
## bpisup: FALSE; bpnworkers: 3; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
## bpexportglobals: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: SOCK
BAM files restricted to the first 100kb of yeast chromosome I are provided in the inst/extdata
folder of the GenoGAM package. This folder also contains a flat file describing the experimental design.
We start by loading the experimental design from the tab-separated text file experimentDesign.txt
into a data frame:
folder <- system.file("extdata/Set1", package='GenoGAM')
expDesign <- read.delim(
file.path(folder, "experimentDesign.txt")
)
expDesign
## ID file paired genotype
## 1 wt_1 H3K4ME3_Full_length_Set1_Rep_1_chr1_100k.bam FALSE 0
## 2 wt_2 H3K4ME3_Full_length_Set1_Rep_2_chr1_100k.bam FALSE 0
## 3 mutant_1 H3K4ME3_aa762-1080_Set1_Rep_1_chr1_100k.bam FALSE 1
## 4 mutant_2 H3K4ME3_aa762-1080_Set1_Rep_2_chr1_100k.bam FALSE 1
Each row of the experiment design corresponds to the alignment files in BAM format of one ChIP sample. In case of multiplexed sequencing, the BAM files must have been demultiplexed. The experiment design have named columns. Three column names have a fixed meaning for GenoGAMDataSet
and must be provided: ID, file, paired. The field ID stores a unique identifier for each alignment file. It is recommended to use short and easy to understand identifiers because they are subsequently used for labelling data and plots. The field file stores the BAM file name without the directory. The field paired is boolean with values TRUE for paired-end sequencing data, and FALSE for single-end sequencing data. All other columns are taken as the experimental design and must be of binary type. The names of those columns must be identical to the names provided later in the design formula. It will allow us modeling the differential occupancy or call peaks later on. Here the important one is the genotype column.
We will now count sequencing fragment centers per genomic position and store these counts into a GenoGAMDataSet
class. GenoGAM reduces ChIP-Seq data to fragment center counts rather than full base coverage so that each fragment is counted only once. This reduces artificial correlation between adjacent nucleotides. For single-end libraries, the fragment center is estimated by shifting the read end position by a constant. The estimation of this constant makes use of methods from the chipseq package (See help of the chipseq::estimate.mean.fraglen
method).
The parameters needed to create a GenoGAMDataSet
are:
## build the GenoGAMDataSet
wd_folder <- file.path(folder, "bam")
ggd <- GenoGAMDataSet(
expDesign, directory = wd_folder,
design = ~ s(x) + s(x, by = genotype)
)
## INFO [2019-05-02 21:58:58] Creating GenoGAMDataSet
## INFO [2019-05-02 21:58:58] Reading in data
## INFO [2019-05-02 21:58:58] Reading in wt_1
## INFO [2019-05-02 21:58:59] Reading in wt_2
## INFO [2019-05-02 21:58:59] Reading in mutant_1
## INFO [2019-05-02 21:58:59] Reading in mutant_2
## INFO [2019-05-02 21:59:00] Finished reading in data
## INFO [2019-05-02 21:59:00] GenoGAMDataSet created
assay(ggd)
## DataFrame with 100000 rows and 4 columns
## wt_1 wt_2 mutant_1 mutant_2
## <Rle> <Rle> <Rle> <Rle>
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## ... ... ... ... ...
## 99996 0 0 0 0
## 99997 0 0 0 0
## 99998 0 0 0 0
## 99999 0 0 0 0
## 100000 1 1 0 0
## overhang size
getOverhangSize(ggd)
## [1] 1000
## chunk size
getChunkSize(ggd)
## [1] 98000
The GenoGAMDataSet
class stores this count data into a structure that index genomic positions over tiles, defined by chunkSize
and overhangSize
. A bit of background is required to understand these parameters. The smoothing in GenoGAM is based on splines (Figure 1), which are piecewise polynomials. The knots are the positions where the polynomials connect. In our experience, one knot every 20 to 50 bp is required for enough resolution of the smooth fits in typical applications. The fitting of generalized additive models involves steps demanding a number of operations proportional to the square of the number of knots, preventing fits along whole chromosomes. To make the fitting of GAMs genome-wide, GenoGAM performs fitting on overlapping intervals (tiles), and join the fit at the midpoint of the overlap of consecutive tiles. The parameters chunkSize
and overhangSize
defines the tiles, where the chunk is the core part of a tile that does not overlap other tiles, and the overhangs are the two overlapping parts.
Both variables influence the computation speed and memory usage as they define into how many (and how big) tiles the overall genome is being split for parallel computation. Additionally, a too small overhang size might cause jigged fits at junction points where the chunks are glued back together. Therefore it is advised to keep overhang size to be around 1kb (and is set like this by default). Chunk size is by default roughly estimated to fit the overall modelling problem without using too much memory. However, this estimation is a very simple heuristic, thus if problems are experienced it is advised to set the chunk size yourself. We had good exerience with numbers between 50