GenoGAM 2.6.0
Note: if you use GenoGAM in published research, please cite:
Stricker and Engelhardt, et al. (2017) GenoGAM: Genome-wide generalized additive models for ChIP-seq analysis Bioinformatics, 33:15. 10.1093/bioinformatics/btx150
and
Stricker, Galinier and Gagneur (2018) GenoGAM 2.0: scalable and efficient implementation of genome-wide generalized additive models for gigabase-scale genomes BMC Bioinformatics, 19:247. 10.1186/s12859-018-2238-7
This is the brief version of the usual workflow of GenoGAM. It involves:
Reading in data through GenoGAMDataSet()
to get a GenoGAMDataSet
object. This is done with the HDF5 backend.
Computing size factors with computeSizeFactors()
Compute the model with genogam()
to get the result GenoGAM
object
compute position-wise log p-values with computeSignificance()
The example dataset is restricted to the first 100kb of the first yeast chromosome.
library(GenoGAM)
## A.
## specify folder and experiment design path
wd <- system.file("extdata/Set1", package='GenoGAM')
folder <- file.path(wd, "bam")
expDesign <- file.path(wd, "experimentDesign.txt")
## B.
## register parallel backend (default is "the number of cores" - 2)
BiocParallel::register(BiocParallel::MulticoreParam(worker = 2))
## C.
## specify chunk and overhang size.
chunkSize <- 50000
overhangSize <- 1000
## D.
## the experiment design reflecting the underlying GAM
## x = position
design <- ~ s(x) + s(x, by = genotype)
GenoGAMDataSet
object.## build the GenoGAMDataSet
ggd <- GenoGAMDataSet(
expDesign, directory = folder,
chunkSize = chunkSize, overhangSize = overhangSize,
design = design)
## INFO [2020-04-27 22:18:58] Creating GenoGAMDataSet
## INFO [2020-04-27 22:19:00] Reading in data
## INFO [2020-04-27 22:19:00] Reading in wt_1
## INFO [2020-04-27 22:19:02] Reading in wt_2
## INFO [2020-04-27 22:19:03] Reading in mutant_1
## INFO [2020-04-27 22:19:03] Reading in mutant_2
## INFO [2020-04-27 22:19:03] Finished reading in data
## INFO [2020-04-27 22:19:04] GenoGAMDataSet created
ggd
## class: GenoGAMDataSet
## dimension: 100000 4
## samples(4): wt_1 wt_2 mutant_1 mutant_2
## design variables(1): genotype
## tiles size: 52kbp
## number of tiles: 2
## chromosomes: chrI
## size factors:
## wt_1 wt_2 mutant_1 mutant_2
## 0 0 0 0
## formula:
## ~ s(x) + s(x, by = genotype)
## compute size factors
ggd <- computeSizeFactors(ggd)
## INFO [2020-04-27 22:19:04] Computing size factors
## INFO [2020-04-27 22:19:04] DONE
ggd
## class: GenoGAMDataSet
## dimension: 100000 4
## samples(4): wt_1 wt_2 mutant_1 mutant_2
## design variables(1): genotype
## tiles size: 52kbp
## number of tiles: 2
## chromosomes: chrI
## size factors:
## wt_1 wt_2 mutant_1 mutant_2
## -0.04156987 0.04210819 -0.25589212 0.14328182
## formula:
## ~ s(x) + s(x, by = genotype)
## the data
assay(ggd)
## DataFrame with 100000 rows and 4 columns
## wt_1 wt_2 mutant_1 mutant_2
## <Rle> <Rle> <Rle> <Rle>
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## ... ... ... ... ...
## 99996 0 0 0 0
## 99997 0 0 0 0
## 99998 0 0 0 0
## 99999 0 0 0 0
## 100000 1 1 0 0
## compute model without parameter estimation to save time in vignette
result <- genogam(ggd, lambda = 4601, theta = 4.51)
## INFO [2020-04-27 22:19:04] Initializing the model
## INFO [2020-04-27 22:19:04] Done
## INFO [2020-04-27 22:19:04] Fitting model
## INFO [2020-04-27 22:19:12] Done
## INFO [2020-04-27 22:19:12] Processing fits
## INFO [2020-04-27 22:19:12] Processing done
## INFO [2020-04-27 22:19:12] Finished
result
## Family: nb
## Formula: ~ s(x) + s(x, by = genotype)
## Class: GenoGAM
## Dimension: 100000 4
## Samples(4): wt_1 wt_2 mutant_1 mutant_2
## Design variables(1): genotype
## Smooth functions(2): s(x) s(x):genotype
## Chromosomes: chrI
##
## Size factors:
## wt_1 wt_2 mutant_1 mutant_2
## -0.04156987 0.04210819 -0.25589212 0.14328182
##
## Cross Validation: Not performed
##
## Spline Parameters:
## Knot spacing: 20
## B-spline order: 2
## Penalization order: 2
##
## Tile settings:
## Chunk size: 50000
## Tile size: 52000
## Overhang size: 1000
## Number of tiles: 2
## Evaluated genome ranges:
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 1-100000 *
## -------
## seqinfo: 1 sequence from an unspecified genome
## the fit and standard error
fits(result)
## DataFrame with 100000 rows and 2 columns
## s(x) s(x):genotype
## <numeric> <numeric>
## 1 -2.81190 0.532046
## 2 -2.81129 0.531849
## 3 -2.81067 0.531652
## 4 -2.81006 0.531455
## 5 -2.80944 0.531258
## ... ... ...
## 99996 -1.86772 -0.110005
## 99997 -1.87528 -0.110351
## 99998 -1.88283 -0.110696
## 99999 -1.89039 -0.111042
## 100000 -1.89794 -0.111388
se(result)
## DataFrame with 100000 rows and 2 columns
## s(x) s(x):genotype
## <numeric> <numeric>
## 1 0.258648 0.317642
## 2 0.257522 0.316405
## 3 0.256401 0.315174
## 4 0.255287 0.313947
## 5 0.254178 0.312727
## ... ... ...
## 99996 0.156907 0.209381
## 99997 0.157925 0.210545
## 99998 0.158948 0.211713
## 99999 0.159978 0.212887
## 100000 0.161013 0.214066
result <- computeSignificance(result)
## INFO [2020-04-27 22:19:13] Computing positionwise p-values
## INFO [2020-04-27 22:19:13] DONE
pvalue(result)
## DataFrame with 100000 rows and 2 columns
## s(x) s(x):genotype
## <numeric> <numeric>
## 1 1.57485e-27 0.0939372
## 2 9.59125e-28 0.0927802
## 3 5.81947e-28 0.0916313
## 4 3.51775e-28 0.0904906
## 5 2.11845e-28 0.0893582
## ... ... ...
## 99996 1.13648e-32 0.599319
## 99997 1.60573e-32 0.600195
## 99998 2.26868e-32 0.601072
## 99999 3.20497e-32 0.601947
## 100000 4.52673e-32 0.602823
Plot results of the center 10kb, where both tiles are joined together.
ranges = GRanges("chrI", IRanges(45000, 55000))
plotGenoGAM(result, ranges = ranges)
Additional to the basic smoothing and point-wise significance computation this version of GenoGAM now also supports differential analysis and peak calling with position-wise confidence intervals on ChIP-Seq data.
A small dataset is provided to illustrate the ChIP-Seq functionalities. This is a subset of the data published by Thornton et al (Thornton et al. 2014), who assayed histone H3 Lysine 4 trimethylation (H3K4me3) by ChIP-Seq comparing wild type yeast versus a mutant with a truncated form of Set1, the yeast H3 Lysine 4 methylase. The goal of this analysis is the identification of genomic positions that are significantly differentially methylated in the mutant compared to the wild type strain.
To this end, we will build a GenoGAM model that models the logarithm of the expected ChIP-seq fragment counts \(y\) as sums of smooth functions of the genomic position \(x\). Specifically, we write (with simplified notations) that:
\[\begin{equation} \log(\operatorname{E}(y)) = f(x) + \text{genotype} \times f_\text{mutant/wt}(x) \tag{1} \end{equation}\]where genotype is 1 for data from the mutant samples, and 0 for the wild type. Here the function \(f(x)\) is the reference level, i.e. the log-rate in the wild type strain. The function \(f_\text{mutant/wt}(x)\) is the log-ratio of the mutant over wild-type. We will then statistically test the null hypothesis \(f_\text{mutant/wt}(x) = 0\) at each position \(x\). In the following we show how to build the dataset, perform the fitting of the model and perform the testing.
The parallel backend is registered using the BiocParallel package. See the documentation in the BiocParallel
class for the correct use. Also note, that BiocParallel
is just an interface to multiple parallel packages. For example in order to use GenoGAM on a cluster, the BatchJobs package might be required. The parallel backend can be registered at anytime as GenoGAM will just call the current one.
IMPORTANT: According to this and this posts on the Bioconductor support page and R-devel mailing list, the most important core feature of the multicore backend, shared memory, is compromised by Rs own garbage collector, resulting in a replication of the entire workspace across all cores. Given that the amount of data in memory is big (without the HDF5 backend) it might crash the entire system. We highly advice to register the SnowParam backend to avoid this if working on larger organisms. This way the overhead is a little bigger, but only necessary data is copied to the workers, keeping memory consumption relatively low. We never experienced a higher load than 4GB per core, usually it was around 2GB on human genome, which is even lower with the HDF5 backend.
library(GenoGAM)
## On multicore machines by default the number of available cores - 2 are registered on the default backend 'Multicore'
BiocParallel::registered()[1]
## $MulticoreParam
## class: MulticoreParam
## bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
## bpexportglobals: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: FORK
For this small example we would like to assign less workers. Check BiocParallel for other possible backends and more options for SnowParam
BiocParallel::register(BiocParallel::SnowParam(workers = 3))
If we check the current registered backend, we see that the number of workers and the backend has changed.
BiocParallel::registered()[1]
## $SnowParam
## class: SnowParam
## bpisup: FALSE; bpnworkers: 3; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
## bpexportglobals: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: SOCK
BAM files restricted to the first 100kb of yeast chromosome I are provided in the inst/extdata
folder of the GenoGAM package. This folder also contains a flat file describing the experimental design.
We start by loading the experimental design from the tab-separated text file experimentDesign.txt
into a data frame:
folder <- system.file("extdata/Set1", package='GenoGAM')
expDesign <- read.delim(
file.path(folder, "experimentDesign.txt")
)
expDesign
## ID file paired genotype
## 1 wt_1 H3K4ME3_Full_length_Set1_Rep_1_chr1_100k.bam FALSE 0
## 2 wt_2 H3K4ME3_Full_length_Set1_Rep_2_chr1_100k.bam FALSE 0
## 3 mutant_1 H3K4ME3_aa762-1080_Set1_Rep_1_chr1_100k.bam FALSE 1
## 4 mutant_2 H3K4ME3_aa762-1080_Set1_Rep_2_chr1_100k.bam FALSE 1
Each row of the experiment design corresponds to the alignment files in BAM format of one ChIP sample. In case of multiplexed sequencing, the BAM files must have been demultiplexed. The experiment design have named columns. Three column names have a fixed meaning for GenoGAMDataSet
and must be provided: ID, file, paired. The field ID stores a unique identifier for each alignment file. It is recommended to use short and easy to understand identifiers because they are subsequently used for labelling data and plots. The field file stores the BAM file name without the directory. The field paired is boolean with values TRUE for paired-end sequencing data, and FALSE for single-end sequencing data. All other columns are taken as the experimental design and must be of binary type. The names of those columns must be identical to the names provided later in the design formula. It will allow us modeling the differential occupancy or call peaks later on. Here the important one is the genotype column.
We will now count sequencing fragment centers per genomic position and store these counts into a GenoGAMDataSet
class. GenoGAM reduces ChIP-Seq data to fragment center counts rather than full base coverage so that each fragment is counted only once. This reduces artificial correlation between adjacent nucleotides. For single-end libraries, the fragment center is estimated by shifting the read end position by a constant. The estimation of this constant makes use of methods from the chipseq package (See help of the chipseq::estimate.mean.fraglen
method).
The parameters needed to create a GenoGAMDataSet
are:
## build the GenoGAMDataSet
wd_folder <- file.path(folder, "bam")
ggd <- GenoGAMDataSet(
expDesign, directory = wd_folder,
design = ~ s(x) + s(x, by = genotype)
)
## INFO [2020-04-27 22:19:14] Creating GenoGAMDataSet
## INFO [2020-04-27 22:19:15] Reading in data
## INFO [2020-04-27 22:19:15] Reading in wt_1
## INFO [2020-04-27 22:19:15] Reading in wt_2
## INFO [2020-04-27 22:19:16] Reading in mutant_1
## INFO [2020-04-27 22:19:16] Reading in mutant_2
## INFO [2020-04-27 22:19:16] Finished reading in data
## INFO [2020-04-27 22:19:16] GenoGAMDataSet created
assay(ggd)
## DataFrame with 100000 rows and 4 columns
## wt_1 wt_2 mutant_1 mutant_2
## <Rle> <Rle> <Rle> <Rle>
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## ... ... ... ... ...
## 99996 0 0 0 0
## 99997 0 0 0 0
## 99998 0 0 0 0
## 99999 0 0 0 0
## 100000 1 1 0 0
## overhang size
getOverhangSize(ggd)
## [1] 1000
## chunk size
getChunkSize(ggd)
## [1] 98000
The GenoGAMDataSet
class stores this count data into a structure that index genomic positions over tiles, defined by chunkSize
and overhangSize
. A bit of background is required to understand these parameters. The smoothing in GenoGAM is based on splines (Figure 1), which are piecewise polynomials. The knots are the positions where the polynomials connect. In our experience, one knot every 20 to 50 bp is required for enough resolution of the smooth fits in typical applications. The fitting of generalized additive models involves steps demanding a number of operations proportional to the square of the number of knots, preventing fits along whole chromosomes. To make the fitting of GAMs genome-wide, GenoGAM performs fitting on overlapping intervals (tiles), and join the fit at the midpoint of the overlap of consecutive tiles. The parameters chunkSize
and overhangSize
defines the tiles, where the chunk is the core part of a tile that does not overlap other tiles, and the overhangs are the two overlapping parts.
Both variables influence the computation speed and memory usage as they define into how many (and how big) tiles the overall genome is being split for parallel computation. Additionally, a too small overhang size might cause jigged fits at junction points where the chunks are glued back together. Therefore it is advised to keep overhang size to be around 1kb (and is set like this by default). Chunk size is by default roughly estimated to fit the overall modelling problem without using too much memory. However, this estimation is a very simple heuristic, thus if problems are experienced it is advised to set the chunk size yourself. We had good exerience with numbers between 50 - 100kb.
For most genome sizes beyond the yeast genome, keeping all data in memory can be impractical or even impossible to fit. Therefore, the HDF5 backend is employed to store all relevant data on hard drive. This can be simply done by setting the parameter hdf5
to true. This will also trigger a split of the underlying data into a list of data frames by chromosome. Splitting data is theoretically not necessary. However, due to problems of representing long integers in the Bioconductor infrastructure data frames of size biger than 2^31 (around 2.14 gigabase pairs) can not be stored easily or only through workarounds that would require more storage. Moreover, because HDF5 data is stored on hard drive it can be easily moved to a different device but has to be read-in again. To makes this easy, the parameter fromHDF5
can be set to true, to trigger GenoGAMDataSet
to read from already present HDF5 files.
ggd <- GenoGAMDataSet(
expDesign, directory = wd_folder,
design = ~ s(x) + s(x, by = genotype), hdf5 = TRUE
)
## INFO [2020-04-27 22:19:16] Creating GenoGAMDataSet
## INFO [2020-04-27 22:19:17] Reading in data
## INFO [2020-04-27 22:19:17] Reading in wt_1
## INFO [2020-04-27 22:19:17] Reading in wt_2
## INFO [2020-04-27 22:19:18] Reading in mutant_1
## INFO [2020-04-27 22:19:18] Reading in mutant_2
## INFO [2020-04-27 22:19:19] Finished reading in data
## INFO [2020-04-27 22:19:19] Writing chrI to HDF5
## INFO [2020-04-27 22:19:20] chrI written
## INFO [2020-04-27 22:19:21] GenoGAMDataSet created
## Note the difference in data type compared to the DataFrame above
assay(ggd)
## $chrI
## <100000 x 4> matrix of class DelayedMatrix and type "integer":
## wt_1 wt_2 mutant_1 mutant_2
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
## [4,] 0 0 0 0
## [5,] 0 0 0 0
## ... . . . .
## [99996,] 0 0 0 0
## [99997,] 0 0 0 0
## [99998,] 0 0 0 0
## [99999,] 0 0 0 0
## [100000,] 1 1 0 0
By default, the HDF5 files are stored in the temp folder, which is impractical if the data has to be kept and shared. Therefore the HDF5 folder can be specified explicitly in the GenoGAMSettings
object. This object holds a lot of meta parmeters which are responsible for the functionality of GenoGAM. In particular it guards the following areas:
All settings can be seen by just printing the GenoGAMSettings
object. Additionally to the meta parameters it will also show the parallel backend information and the logger threshold, although both are not part of the GenoGAMSettings
object.
GenoGAMSettings()
## -------------------- Read-in parameters -----------------
## Center: TRUE
## Chromosomes:
## Custom process function: disabled
##
## -------------------- BAM parameters ---------------------
## class: ScanBamParam
## bamFlag (NA unless specified):
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:
## bamTagFilter:
## bamWhich: 0 ranges
## bamWhat: pos, qwidth
## bamMapqFilter: NA
##
## -------------------- Parallel backend -------------------
## class: SnowParam
## bpisup: FALSE; bpnworkers: 3; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
## bpexportglobals: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: SOCK
##
## -------------------- Logger settings --------------------
## Logger threshold: INFO
##
## -------------------- HDF5 settings ----------------------
## HDF5 Directory: /tmp/RtmpSSdsH8/HDF5Array_dump
## HDF5 Compression Level (0-9): 6
##
## -------------------- Data settings ----------------------
## Region size: 4000
##
## Basepairs per Knot: 20
##
## -------------------- Optimization parameters ------------
## Optimization method: Nelder-Mead
## Optimization control:
## maxit: 60
## fnscale: -1
## trace: 1
## Parameter estimation control:
## eps: 1e-06
## maxiter: 1000
Reading in data is guarded by the bamParams
slot. It makes direct use of the class ScanBamParam
from the Rsamtools package. Unless the data should be restricted to specific chromosomes, it is the easiest way to specify particular regions that should be read in. Chromosomes can be simply restricted through the chromosomeList
slot. Finally, the ‘center’ slot just specifies if fragments should be centered or not.
range <- GRanges("chrI", IRanges(30000, 50000))
params <- Rsamtools::ScanBamParam(which = range)
settings <- GenoGAMSettings(center = FALSE,
bamParams = params)
ggd <- GenoGAMDataSet(
expDesign, directory = wd_folder,
design = ~ s(x) + s(x, by = genotype),
settings = settings)
## INFO [2020-04-27 22:19:21] Creating GenoGAMDataSet
## INFO [2020-04-27 22:19:22] Reading in data
## INFO [2020-04-27 22:19:22] Reading in wt_1
## INFO [2020-04-27 22:19:22] Reading in wt_2
## INFO [2020-04-27 22:19:22] Reading in mutant_1
## INFO [2020-04-27 22:19:23] Reading in mutant_2
## INFO [2020-04-27 22:19:23] Finished reading in data
## INFO [2020-04-27 22:19:23] GenoGAMDataSet created
## Note the higher counts
assay(ggd)
## DataFrame with 20001 rows and 4 columns
## wt_1 wt_2 mutant_1 mutant_2
## <Rle> <Rle> <Rle> <Rle>
## 1 20 6 3 5
## 2 20 6 3 5
## 3 20 6 3 5
## 4 20 6 3 5
## 5 20 6 3 5
## ... ... ... ... ...
## 19997 14 5 6 7
## 19998 14 5 6 7
## 19999 14 5 6 7
## 20000 13 5 6 7
## 20001 13 5 6 7
rowRanges(ggd)
## StitchedGPos object with 20001 positions and 0 metadata columns:
## seqnames pos strand
## <Rle> <integer> <Rle>
## [1] chrI 30000 *
## [2] chrI 30001 *
## [3] chrI 30002 *
## [4] chrI 30003 *
## [5] chrI 30004 *
## ... ... ... ...
## [19997] chrI 49996 *
## [19998] chrI 49997 *
## [19999] chrI 49998 *
## [20000] chrI 49999 *
## [20001] chrI 50000 *
## -------
## seqinfo: 1 sequence from an unspecified genome
## Now with chromosome specification. As only chrI is present
## in the example file, the read in will log an error and generate
## an empty GenoGAMDataSet object
settings <- GenoGAMSettings(chromosomeList = c("chrII", "chrIII"))
ggd <- GenoGAMDataSet(
expDesign, directory = wd_folder,
design = ~ s(x) + s(x, by = genotype),
settings = settings)
## INFO [2020-04-27 22:19:23] Creating GenoGAMDataSet
## ERROR [2020-04-27 22:19:23] No chromosomes to read in. Check either the specified settings or the header of BAM file
## INFO [2020-04-27 22:19:23] GenoGAMDataSet created
## Empty GenoGAMDataSet due to lack of data in the example BAM file
length(ggd)
## [1] 0
ggd
## class: GenoGAMDataSet
## dimension: 0 0
## samples(0):
## design variables(0):
## size factors:
## numeric(0)
## formula:
## ~ s(x)
Cross-validaton is performed to estimate the penalization parameter \(\lambda\) and the dispersion parameter \(\theta\). This is done iteratively maximising the likelihood, where each iteration involved performing one set of k-fold cross-validation. The default method maximising the likelihood function is Nelder-Mead, as the likelihood function can not be analytically specified and has no derivatives. The method can be changed to any method in Rs optim
function, but the only reasonable candidate is L-BFGS-B. Moreover, the number of iterations can be changed. This is also true for the model fitting function. There, the number of maximum iterations and the threshold value can be changed.
## Note, optimControl and estimControl is a list of more parameters. However none of them besides the shown are important
settings <- GenoGAMSettings(optimMethod = "L-BFGS-B", optimControl = list(maxit = 100L),
estimControl = list(eps = 1e-10, maxiter = 500L))
settings
## -------------------- Read-in parameters -----------------
## Center: TRUE
## Chromosomes:
## Custom process function: disabled
##
## -------------------- BAM parameters ---------------------
## class: ScanBamParam
## bamFlag (NA unless specified):
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:
## bamTagFilter:
## bamWhich: 0 ranges
## bamWhat: pos, qwidth
## bamMapqFilter: NA
##
## -------------------- Parallel backend -------------------
## class: SnowParam
## bpisup: FALSE; bpnworkers: 3; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
## bpexportglobals: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: SOCK
##
## -------------------- Logger settings --------------------
## Logger threshold: INFO
##
## -------------------- HDF5 settings ----------------------
## HDF5 Directory: /tmp/RtmpSSdsH8/HDF5Array_dump
## HDF5 Compression Level (0-9): 6
##
## -------------------- Data settings ----------------------
## Region size: 4000
##
## Basepairs per Knot: 20
##
## -------------------- Optimization parameters ------------
## Optimization method: L-BFGS-B
## Optimization control:
## maxit: 100
## fnscale: -1
## trace: 1
## Parameter estimation control:
## eps: 1e-10
## maxiter: 500
Probably the most important parameter to set through GenoGAMSettings
is the HDF5 folder. Additionally, the compression level can also be changed as an integer between 0 (Lowest) and 9 (Highest). Note, with specifying the folder, the folder gets created!
## set HDF5 folder to 'myFolder'
tmp <- tempdir()
settings <- GenoGAMSettings(hdf5Control = list(dir = file.path(tmp, "myFolder"), level = 0))
HDF5Array::getHDF5DumpDir()
## [1] "/tmp/RtmpSSdsH8/myFolder"
HDF5Array::getHDF5DumpCompressionLevel()
## [1] 0
## Another way to set this would be through the HDF5Array package
HDF5Array::setHDF5DumpDir(file.path(tmp, "myOtherFolder"))
settings
## -------------------- Read-in parameters -----------------
## Center: TRUE
## Chromosomes:
## Custom process function: disabled
##
## -------------------- BAM parameters ---------------------
## class: ScanBamParam
## bamFlag (NA unless specified):
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:
## bamTagFilter:
## bamWhich: 0 ranges
## bamWhat: pos, qwidth
## bamMapqFilter: NA
##
## -------------------- Parallel backend -------------------
## class: SnowParam
## bpisup: FALSE; bpnworkers: 3; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
## bpexportglobals: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: SOCK
##
## -------------------- Logger settings --------------------
## Logger threshold: INFO
##
## -------------------- HDF5 settings ----------------------
## HDF5 Directory: /tmp/RtmpSSdsH8/myFolder
## HDF5 Compression Level (0-9): 0
##
## -------------------- Data settings ----------------------
## Region size: 4000
##
## Basepairs per Knot: 20
##
## -------------------- Optimization parameters ------------
## Optimization method: Nelder-Mead
## Optimization control:
## maxit: 60
## fnscale: -1
## trace: 1
## Parameter estimation control:
## eps: 1e-06
## maxiter: 1000
As it was mentioned before the fitted splines depend on the placement of knots. The more knots are placed the wigglier the fit and the more the spline has to be penalized to avoid overfitting. This is done automatically. However too little knots will lead to underfitting with no automatic resolution. The default setting is at one knot at every 20bp. Given our experience this is a conservative setting, that works for all applications. For example narrow peak calling might need more knots than broad peak calling as high resolution is required to be able to identify two very close peaks as two peaks instead of one. Because the computational fitting problem grows quadaratically with the number of knots, less knots might speed up computation. However, this might also lead to worse results given the downstream applicaton. Thus, this parameter should be handled with care.
Additionally, computation is speed up during cross-validation by using smaller tiles. Since \(\lambda\) and \(\theta\) are independent of length, smaller tiles can be used to speed up parameter estimation. The parameter is set as chunk size and defaults to 4kb. Generally one shouldn’t go below 2kb as the computation it might become unstable in combination with sparse data.
## For example, we choose a twice as long region but also two times less knots (double the knot spacing),
## which results in the same number of knots per tile as before.
settings <- GenoGAMSettings(dataControl = list(regionSize = 8000, bpknots = 40))
settings
## -------------------- Read-in parameters -----------------
## Center: TRUE
## Chromosomes:
## Custom process function: disabled
##
## -------------------- BAM parameters ---------------------
## class: ScanBamParam
## bamFlag (NA unless specified):
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:
## bamTagFilter:
## bamWhich: 0 ranges
## bamWhat: pos, qwidth
## bamMapqFilter: NA
##
## -------------------- Parallel backend -------------------
## class: SnowParam
## bpisup: FALSE; bpnworkers: 3; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
## bpexportglobals: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: SOCK
##
## -------------------- Logger settings --------------------
## Logger threshold: INFO
##
## -------------------- HDF5 settings ----------------------
## HDF5 Directory: /tmp/RtmpSSdsH8/myOtherFolder
## HDF5 Compression Level (0-9): 0
##
## -------------------- Data settings ----------------------
## Region size: 8000
##
## Basepairs per Knot: 40
##
## -------------------- Optimization parameters ------------
## Optimization method: Nelder-Mead
## Optimization control:
## maxit: 60
## fnscale: -1
## trace: 1
## Parameter estimation control:
## eps: 1e-06
## maxiter: 1000
Note: It is generally not advised to change the settings parameters, unless it is the specification of regions for data read-in or the HDF5 folder.
Sequencing libraries typically vary in sequencing depth. Such variations is controlled for in GenoGAM by adding a sample-specific constant to the right term of Equation (1). The estimation of these constants is performed by the function computeSizeFactor()
as follows:
## read in data again, since the last read-in was an empty GenoGAM
ggd <- GenoGAMDataSet(
expDesign, directory = wd_folder,
design = ~ s(x) + s(x, by = genotype)
)
## INFO [2020-04-27 22:19:24] Creating GenoGAMDataSet
## INFO [2020-04-27 22:19:25] Reading in data
## INFO [2020-04-27 22:19:25] Reading in wt_1
## INFO [2020-04-27 22:19:25] Reading in wt_2
## INFO [2020-04-27 22:19:25] Reading in mutant_1
## INFO [2020-04-27 22:19:26] Reading in mutant_2
## INFO [2020-04-27 22:19:26] Finished reading in data
## INFO [2020-04-27 22:19:26] GenoGAMDataSet created
ggd <- computeSizeFactors(ggd)
## INFO [2020-04-27 22:19:26] Computing size factors
## INFO [2020-04-27 22:19:27] DONE
sizeFactors(ggd)
## wt_1 wt_2 mutant_1 mutant_2
## -0.04156987 0.04210819 -0.25589212 0.14328182
Note: The size factors in GenoGAM are in the natural logarithm scale. Also factor groups are identified automatically from the design. Given more complex design, this might fail. Or, maybe different factor groups are desired. In this case the groups can be provided separatelly through the factorGroups argument.
A GenoGAM model requires two further parameters to be fitted: the regularization parameter, \(\lambda\), and the dispersion parameter \(\theta\). The regularization parameters \(\lambda\) controls the amount of smoothing. The larger \(\lambda\) is, the smoother the smooth functions are. The dispersion parameter \(\theta\) controls how much the observed counts deviate from their expected value modeled by Equation (1). The dispersion captures biological and technical variation which one typically sees across replicate samples, but also errors of the model. In GenoGAM, the dispersion is modeled by assuming the counts to follow a negative binomial distribution with mean \(\mu=\operatorname{E}(y)\) whose logarithm is modeled by Equation (1) and with variance \(v = \mu + \mu^2/\theta\).
If not provided, the parameters \(\lambda\) and \(\theta\) are obtained by cross-validation. This step is too time consuming for a vignette. For sake of going through this example quickly, we provide the values manually:
## fit model without parameters estimation
fit <- genogam(ggd, lambda = 4601, theta = 4.51)
## INFO [2020-04-27 22:19:27] Initializing the model
## INFO [2020-04-27 22:19:27] Done
## INFO [2020-04-27 22:19:27] Fitting model
## INFO [2020-04-27 22:19:35] Done
## INFO [2020-04-27 22:19:35] Processing fits
## INFO [2020-04-27 22:19:35] Processing done
## INFO [2020-04-27 22:19:35] Finished
fit
## Family: nb
## Formula: ~ s(x) + s(x, by = genotype)
## Class: GenoGAM
## Dimension: 100000 4
## Samples(4): wt_1 wt_2 mutant_1 mutant_2
## Design variables(1): genotype
## Smooth functions(2): s(x) s(x):genotype
## Chromosomes: chrI
##
## Size factors:
## wt_1 wt_2 mutant_1 mutant_2
## -0.04156987 0.04210819 -0.25589212 0.14328182
##
## Cross Validation: Not performed
##
## Spline Parameters:
## Knot spacing: 20
## B-spline order: 2
## Penalization order: 2
##
## Tile settings:
## Chunk size: 98000
## Tile size: 1e+05
## Overhang size: 1000
## Number of tiles: 1
## Evaluated genome ranges:
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 1-100000 *
## -------
## seqinfo: 1 sequence from an unspecified genome
Remark on parameter estimation: To estimate the parameters \(\lambda\) and \(\theta\) by cross-validation, call genogam()
without setting their values. This will perform a k-fold cross-validation on each tile with initial parameter values and iterate until convergence, often for about 30 iterations. The number of folds can be set with parameter kfolds. We recommend to do it for 20 to 40 different regions (the number can be set with parameter regions). During this procedure the underlying optimization algorithm (Nelder-Mead by default) will explore the possible parameter space. In case of low-count regions this might result in unstable matrix operations, in particular in resolutions of linear systems. To stabilize the computation a small first-order regularization might be added through the parameter eps. This will change the penalization term from \(\beta^T \textbf{S} \beta\) to \(\beta^T \textbf{S} \beta + \epsilon \textbf{I}\), where identity matrix \(\textbf{I}\) is of the same dimension as penalty matrix \(\textbf{S}\). The parameter eps represents \(\epsilon\) in the equation above. Thus, a too large \(\epsilon\) will also have a significant effect on the general fit, while a small \(\epsilon\) will mostly affect small values close too zero. We recommend an \(\epsilon\) value in the range of 1e-4 - 1e-3.
An important difference of cross-validation in GenoGAM is that instead of leaving out data points, intervals are left out. This improves the parameter estimation by taking into account local correlation where leaving out single points might lead to overfitting. If replicates are present in the data the cross-validation can be further improved by using larger intervals (20bp by default) around the size of a ChIP-Seq fragment, i.e. 200bp. Thus, the large left out interval will be captured by the data in the replicate.
## Does not evaluate
fit_CV <- genogam(ggd, intervalSize = 200)
Remark on parallel computing:
If using the SnowParams
backend, the initiation of a worker and copying relevant data might take a few seconds. Especially if this is done sequentially. During cross-validation, where computation is done on small tiles, computation might therefore be only a fraction of the total overhead. Therefore, the number of workers is automatically set to an optimal lower number during cross-validation and reset afterwards.
Following the convention of the packages mgcv and gam the names of the fit for the smooth function defined by the by variable in the design formula (which is the same as the column name in the config) follow the pattern s(x):by-variable. Here, the smooth function of interest \(f_\text{mutant/wt}(x)\) is thus named s(x):genotype.
ranges = GRanges("chrI", IRanges(45000, 55000))
plotGenoGAM(fit, ranges = ranges)
Because the data is generally too big to plot in its totality. The plotting function requires a specifig range, which is best provided by a GRanges
object and supplied to the ranges argument. Additionally, it can be specified if the smooths should be plotted on the same scale (scale), the values should be displayed in log format (log) or if the raw data count data should be included (provided GenoGAMDataSet to ggd, unfortunatelly hard to show in the vignette).
plotGenoGAM(fit, ranges = ranges, scale = FALSE)
We test for each smooth and at each position \(x\) the null hypothesis that it values 0 by a call to computeSignificance(). This gives us pointwise p-values.
fit <- computeSignificance(fit)
## INFO [2020-04-27 22:19:37] Computing positionwise p-values
## INFO [2020-04-27 22:19:37] DONE
pvalue(fit)
## DataFrame with 100000 rows and 2 columns
## s(x) s(x):genotype
## <numeric> <numeric>
## 1 1.60811e-27 0.0940669
## 2 9.79473e-28 0.0929080
## 3 5.94354e-28 0.0917573
## 4 3.59313e-28 0.0906147
## 5 2.16408e-28 0.0894804
## ... ... ...
## 99996 1.16158e-32 0.600230
## 99997 1.64094e-32 0.601107
## 99998 2.31806e-32 0.601984
## 99999 3.27430e-32 0.602859
## 100000 4.62426e-32 0.603735
If region-wise significance is needed, as in the case of differential binding, then we call computeRegionSignificance(). This returns the provided GRanges
object updated by a p-value and FRD column. If smooth is not provided it is computed for all smooths.
gr <- GRanges("chrI", IRanges(c(1000, 7000), c(4000, 9000)))
db <- computeRegionSignificance(fit, regions = gr, smooth = "s(x):genotype")
## INFO [2020-04-27 22:19:39] Estimating region p-values and FDR
## INFO [2020-04-27 22:19:39] Computing positionwise log p-values
## INFO [2020-04-27 22:19:39] DONE
## INFO [2020-04-27 22:19:39] Performing multiple correction
## INFO [2020-04-27 22:19:39] DONE
db
## $`s(x):genotype`
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | pvalue FDR
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chrI 1000-4000 * | 0.249909 0.249909
## [2] chrI 7000-9000 * | 0.249536 0.249909
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The computed fit allows for easy peak calling of narrow and broad peaks via the callPeaks(). The resulting data.table
object provides a similar structure as the ENCODE narrowPeak and broadPeak format. Note, the score is given as the negative natural logarithm of the p-values. Also no peaks borders are provided (a function to compute them just as in GenoGAM 1.0 is being implemented). As this is not peak calling data, we use a high threshold do demonstrate the functionality.
peaks <- callPeaks(fit, threshold = 1, smooth = "s(x):genotype")
## INFO [2020-04-27 22:19:40] Calling narrow peaks
## INFO [2020-04-27 22:19:40] DONE
peaks
## $`s(x):genotype`
## seqnames pos zscore score fdr summit
## 1: chrI 77623 3.953526539 10.1647521 0.0000000 0.8218586
## 2: chrI 59362 2.299714606 4.5345070 0.5555556 1.4055236
## 3: chrI 62226 2.605805754 5.3854186 0.5714286 1.7176303
## 4: chrI 59883 2.632129470 5.4625846 0.6000000 1.3510637
## 5: chrI 81446 2.233273750 4.3610131 0.6000000 1.3675995
## 6: chrI 83493 2.307658693 4.5555172 0.6250000 1.5556027
## 7: chrI 90239 2.611568704 5.4022580 0.6666667 0.9923707
## 8: chrI 64237 2.983954682 6.5551675 0.7500000 1.3781992
## 9: chrI 19099 0.007317323 0.6990026 0.8333333 3.1174076
## 10: chrI 27398 0.123957859 0.7970103 0.8363636 1.0669989
## 11: chrI 22332 0.274332932 0.9367119 0.8367347 4.2479926
## 12: chrI 2357 0.010564995 0.7016124 0.8474576 3.0423462
## 13: chrI 57733 0.184091053 0.8510392 0.8490566 1.4130322
## 14: chrI 7751 0.306173644 0.9682784 0.8510638 1.2812820
## 15: chrI 44866 0.130774568 0.8030138 0.8518519 2.0274223
## 16: chrI 88055 0.284273605 0.9464913 0.8541667 3.4807577
## 17: chrI 18116 0.085875987 0.7640365 0.8571429 1.0741642
## 18: chrI 42998 0.058164384 0.7406396 0.8596491 6.1783913
## 19: chrI 38415 0.381625912 1.0459171 0.8604651 1.5213175
## 20: chrI 23137 0.015640543 0.7057045 0.8620690 4.6110303
## 21: chrI 21262 0.200954529 0.8666270 0.8653846 1.7848703
## 22: chrI 14576 0.338250353 1.0007945 0.8695652 3.6128455
## 23: chrI 27850 0.257966317 0.9207596 0.8800000 1.9101958
## 24: chrI 15951 0.387984743 1.0526442 0.8809524 1.5720949
## 25: chrI 61001 0.204287531 0.8697307 0.8823529 1.0008833
## 26: chrI 7237 0.350087567 1.0129765 0.8863636 1.2718125
## 27: chrI 31286 0.343267025 1.0059453 0.8888889 3.1725112
## 28: chrI 13524 0.395600685 1.0607392 0.9024390 1.6710578
## 29: chrI 24877 0.399441628 1.0648374 0.9250000 1.5034923
## 30: chrI 23903 0.623616735 1.3226075 0.9354839 1.3912920
## 31: chrI 49699 1.245915796 2.2405717 0.9375000 1.3353167
## 32: chrI 4797 0.455874666 1.1262711 0.9459459 1.4687162
## 33: chrI 41399 0.422117971 1.0892479 0.9487179 2.0248993
## 34: chrI 76493 -0.246390343 0.5153191 0.9516129 1.4352471
## 35: chrI 39240 -1.212076272 0.1196190 0.9571429 2.2251183
## 36: chrI 57191 0.624069757 1.3231660 0.9666667 0.7640392
## 37: chrI 76594 -0.229948090 0.5260510 0.9672131 2.2555678
## 38: chrI 28556 -0.338869459 0.4578443 0.9682540 1.3313414
## 39: chrI 9026 -0.388672158 0.4288760 0.9692308 4.1543744
## 40: chrI 51485 0.557091881 1.2422552 0.9696970 2.0802139
## 41: chrI 936 -0.814219812 0.2328903 0.9705882 1.9832077
## 42: chrI 37176 -1.054367667 0.1576570 0.9710145 1.9079031
## 43: chrI 47888 0.469061169 1.1409575 0.9722222 1.2486934
## 44: chrI 73443 0.434946481 1.1032210 0.9736842 1.4434126
## 45: chrI 68581 -0.375111593 0.4366288 0.9843750 3.6005765
## 46: chrI 34811 -0.795611070 0.2396910 0.9850746 1.8982624
## 47: chrI 91386 2.993624852 6.5868149 1.0000000 1.4422657
## 48: chrI 86031 1.509538353 2.7244749 1.0000000 1.5380176
## 49: chrI 98729 1.479515619 2.6664101 1.0000000 1.2249742
## 50: chrI 75279 1.294625768 2.3256011 1.0000000 1.2876984
## 51: chrI 25457 0.654689506 1.3612746 1.0000000 0.8438616
## 52: chrI 12615 0.627263028 1.3271073 1.0000000 0.8576204
## 53: chrI 17336 0.604812635 1.2995599 1.0000000 0.7174821
## 54: chrI 46915 0.506105669 1.1828924 1.0000000 1.4172731
## 55: chrI 87700 0.470411916 1.1424690 1.0000000 1.2082224
## 56: chrI 35955 -0.772600348 0.2483068 1.0000000 1.5995945
## seqnames pos zscore score fdr summit
peaks <- callPeaks(fit, threshold = 1, peakType = "broad",
cutoff = 0.75, smooth = "s(x):genotype")
## INFO [2020-04-27 22:19:40] Calling broad peaks
## INFO [2020-04-27 22:19:40] DONE
peaks
## $`s(x):genotype`
## seqnames start end width strand score meanSignal fdr
## 1: chrI 1 578 578 * 0.2881752 1.3225292 0.8687306
## 2: chrI 1937 5520 3584 * 0.2880604 1.0858642 0.8687306
## 3: chrI 6557 11060 4504 * 0.2460828 1.1853691 0.8687306
## 4: chrI 11900 26025 14126 * 0.2880586 1.2344399 0.8687306
## 5: chrI 26974 29868 2895 * 0.2883320 1.1872678 0.8687306
## 6: chrI 30637 33619 2983 * 0.2880309 1.3303773 0.8687306
## 7: chrI 37978 38914 937 * 0.2883072 1.1866105 0.8687306
## 8: chrI 39710 41842 2133 * 0.2885133 1.5021450 0.8687306
## 9: chrI 42744 43261 518 * 0.2879385 1.0780682 0.8687306
## 10: chrI 43812 45263 1452 * 0.2877189 1.1058146 0.8687306
## 11: chrI 46322 48244 1923 * 0.2880785 1.2790290 0.8687306
## 12: chrI 49127 52097 2971 * 0.2884650 1.3329063 0.8687306
## 13: chrI 68366 68770 405 * 0.2883727 0.9572166 0.8687306
## 14: chrI 69998 71392 1395 * 0.2880892 1.4932166 0.8687306
## 15: chrI 72905 79495 6591 * 1.3726813 2.2360777 0.8687306
## 16: chrI 80251 84073 3823 * 0.2883924 2.0675725 0.8687306
## 17: chrI 84745 86766 2022 * 0.2884386 1.6236602 0.8687306
## 18: chrI 87351 92044 4694 * 0.2886375 2.3349403 0.8687306
## 19: chrI 52845 66840 13996 * 0.1385248 1.7610177 0.8706416
## 20: chrI 92840 100000 7161 * 0.1497634 1.5247567 0.8706416
The function writeToBEDFile() provides an easy way to write the peaks data.table to a narrowPeaks or broadPeaks file. The suffix will be determined automatically
## Not evaluated
writeToBEDFile(peaks, file = 'myPeaks')
Problem: An error occurs during model fitting along those lines:
error: matrix multiplication: incompatible matrix dimensions: 22333830147200x5360120267008000 and 4294972496x1
Solution: First, make sure you have all Armadillo dependencies installed correctly. See here
Second, the error is most likely related to the fact, that Armadillo is using 32bit matrices, thus causing problems for large matrices GenoGAM is using. The solution is to enable ARMA_64BIT_WORD
, which is not enabled in RcppArmadillo by default. This should have been done during compilation, but if it fails for some reason you can do it manually with #define ARMA_64BIT_WORD 1
in my_R_Directory/lib/R/library/RcppArmadillo/include/RcppArmadilloConfig.h
. See here.
We thank Alexander Engelhardt, Mathilde Galinier, Simon Wood, Herv'e Pag`es, and Martin Morgan for input in the development of GenoGAM
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GenoGAM_2.6.0 data.table_1.12.8
## [3] Matrix_1.2-18 HDF5Array_1.16.0
## [5] rhdf5_2.32.0 SummarizedExperiment_1.18.0
## [7] DelayedArray_0.14.0 matrixStats_0.56.0
## [9] Biobase_2.48.0 GenomicRanges_1.40.0
## [11] GenomeInfoDb_1.24.0 IRanges_2.22.0
## [13] S4Vectors_0.26.0 BiocGenerics_0.34.0
## [15] BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] bit64_0.9-7 splines_4.0.0 dotCall64_1.0-0
## [4] assertthat_0.2.1 BiocManager_1.30.10 latticeExtra_0.6-29
## [7] blob_1.2.1 Rsamtools_2.4.0 GenomeInfoDbData_1.2.3
## [10] yaml_2.2.1 pillar_1.4.3 RSQLite_2.2.0
## [13] lattice_0.20-41 glue_1.4.0 digest_0.6.25
## [16] RColorBrewer_1.1-2 XVector_0.28.0 colorspace_1.4-1
## [19] htmltools_0.4.0 chipseq_1.38.0 DESeq2_1.28.0
## [22] XML_3.99-0.3 pkgconfig_2.0.3 ShortRead_1.46.0
## [25] magick_2.3 genefilter_1.70.0 bookdown_0.18
## [28] zlibbioc_1.34.0 purrr_0.3.4 xtable_1.8-4
## [31] scales_1.1.0 jpeg_0.1-8.1 BiocParallel_1.22.0
## [34] tibble_3.0.1 annotate_1.66.0 ggplot2_3.3.0
## [37] ellipsis_0.3.0 survival_3.1-12 magrittr_1.5
## [40] crayon_1.3.4 memoise_1.1.0 evaluate_0.14
## [43] hwriter_1.3.2 tools_4.0.0 sparseinv_0.1.3
## [46] formatR_1.7 lifecycle_0.2.0 stringr_1.4.0
## [49] Rhdf5lib_1.10.0 munsell_0.5.0 locfit_1.5-9.4
## [52] lambda.r_1.2.4 AnnotationDbi_1.50.0 Biostrings_2.56.0
## [55] compiler_4.0.0 rlang_0.4.5 futile.logger_1.4.3
## [58] grid_4.0.0 RCurl_1.98-1.2 spam_2.5-1
## [61] bitops_1.0-6 rmarkdown_2.1 codetools_0.2-16
## [64] gtable_0.3.0 DBI_1.1.0 R6_2.4.1
## [67] GenomicAlignments_1.24.0 knitr_1.28 dplyr_0.8.5
## [70] bit_1.1-15.2 futile.options_1.0.1 stringi_1.4.6
## [73] Rcpp_1.0.4.6 png_0.1-7 vctrs_0.2.4
## [76] geneplotter_1.66.0 tidyselect_1.0.0 xfun_0.13
Thornton, J., G.H.. Westfield, Y. Takahashi, M. Cook, X. Gao, Woodfin A.R., Lee J., et al. 2014. “Context dependency of Set1/ COMPASS-mediated histone H3 Lys4 trimethylation.” Genes & Development 28 (2):115–20. https://doi.org/10.1101/gad.232215.113.