Modeling ChIP-Seq data with GenoGAM 2.0: A Genome-wide generalized additive model

Georg Stricker1* and Julien Gagneur1**

1Technical University Munich

*georg.stricker@protonmail.com
**gagneur@in.tum.de

04/27/2020

Abstract

Many genomic assays lead to noisy observations of a biological quantity of interest varying along the genome. This is the case for ChIP-Seq, for which read counts reflect local protein occupancy of the ChIP-ed protein. The GenoGAM package allows statistical analysis of genome-wide data with smooth functions using generalized additive models. It provides methods for the statistical analysis of ChIP-Seq data including inference of protein occupancy, and pointwise and region-wise differential analysis as well as peak calling with position-wise confidence bands. Estimation of dispersion and smoothing parameters is performed by cross-validation. Scaling of generalized additive model fitting to whole chromosomes is achieved by parallelization over overlapping genomic intervals. This vignette explains the use of the package for typical ChIP-Seq analysis tasks

Package

GenoGAM 2.6.0

Note: if you use GenoGAM in published research, please cite:

Stricker and Engelhardt, et al. (2017) GenoGAM: Genome-wide generalized additive models for ChIP-seq analysis Bioinformatics, 33:15. 10.1093/bioinformatics/btx150

and

Stricker, Galinier and Gagneur (2018) GenoGAM 2.0: scalable and efficient implementation of genome-wide generalized additive models for gigabase-scale genomes BMC Bioinformatics, 19:247. 10.1186/s12859-018-2238-7

1 Quick guide (for small organisms)

This is the brief version of the usual workflow of GenoGAM. It involves:

Reading in data through GenoGAMDataSet() to get a GenoGAMDataSet object. This is done with the HDF5 backend.
Computing size factors with computeSizeFactors()
Compute the model with genogam() to get the result GenoGAM object
compute position-wise log p-values with computeSignificance()

1.1 Getting started

The example dataset is restricted to the first 100kb of the first yeast chromosome.

We set some required meta variables

library(GenoGAM)

## A.
## specify folder and experiment design path
wd <- system.file("extdata/Set1", package='GenoGAM')
folder <- file.path(wd, "bam")
expDesign <- file.path(wd, "experimentDesign.txt")

## B.
## register parallel backend (default is "the number of cores" - 2)
BiocParallel::register(BiocParallel::MulticoreParam(worker = 2))

## C.
## specify chunk and overhang size.
chunkSize <- 50000
overhangSize <- 1000

## D.
## the experiment design reflecting the underlying GAM
## x = position
design <- ~ s(x) + s(x, by = genotype)

Read in the data to obtain a GenoGAMDataSet object.

## build the GenoGAMDataSet
ggd <- GenoGAMDataSet(
  expDesign, directory = folder,
  chunkSize = chunkSize, overhangSize = overhangSize,
  design = design)

## INFO [2020-04-27 22:18:58] Creating GenoGAMDataSet
## INFO [2020-04-27 22:19:00] Reading in data
## INFO [2020-04-27 22:19:00] Reading in wt_1
## INFO [2020-04-27 22:19:02] Reading in wt_2
## INFO [2020-04-27 22:19:03] Reading in mutant_1
## INFO [2020-04-27 22:19:03] Reading in mutant_2
## INFO [2020-04-27 22:19:03] Finished reading in data
## INFO [2020-04-27 22:19:04] GenoGAMDataSet created

ggd

## class: GenoGAMDataSet 
## dimension: 100000 4 
## samples(4): wt_1 wt_2 mutant_1 mutant_2 
## design variables(1): genotype 
## tiles size: 52kbp 
## number of tiles: 2 
## chromosomes: chrI 
## size factors:
##     wt_1     wt_2 mutant_1 mutant_2 
##        0        0        0        0 
## formula:
## ~ s(x) + s(x, by = genotype)

Compute Size factors

## compute size factors
ggd <- computeSizeFactors(ggd)

## INFO [2020-04-27 22:19:04] Computing size factors
## INFO [2020-04-27 22:19:04] DONE

ggd

## class: GenoGAMDataSet 
## dimension: 100000 4 
## samples(4): wt_1 wt_2 mutant_1 mutant_2 
## design variables(1): genotype 
## tiles size: 52kbp 
## number of tiles: 2 
## chromosomes: chrI 
## size factors:
##        wt_1        wt_2    mutant_1    mutant_2 
## -0.04156987  0.04210819 -0.25589212  0.14328182 
## formula:
## ~ s(x) + s(x, by = genotype)

## the data
assay(ggd)

## DataFrame with 100000 rows and 4 columns
##         wt_1  wt_2 mutant_1 mutant_2
##        <Rle> <Rle>    <Rle>    <Rle>
## 1          0     0        0        0
## 2          0     0        0        0
## 3          0     0        0        0
## 4          0     0        0        0
## 5          0     0        0        0
## ...      ...   ...      ...      ...
## 99996      0     0        0        0
## 99997      0     0        0        0
## 99998      0     0        0        0
## 99999      0     0        0        0
## 100000     1     1        0        0

1.2 Fitting the model

Compute model with fixed hyperparameters

## compute model without parameter estimation to save time in vignette
result <- genogam(ggd, lambda = 4601, theta = 4.51)

## INFO [2020-04-27 22:19:04] Initializing the model
## INFO [2020-04-27 22:19:04] Done
## INFO [2020-04-27 22:19:04] Fitting model
## INFO [2020-04-27 22:19:12] Done
## INFO [2020-04-27 22:19:12] Processing fits
## INFO [2020-04-27 22:19:12] Processing done
## INFO [2020-04-27 22:19:12] Finished

result

## Family:  nb 
## Formula:  ~ s(x) + s(x, by = genotype) 
## Class: GenoGAM 
## Dimension: 100000 4 
## Samples(4): wt_1 wt_2 mutant_1 mutant_2 
## Design variables(1): genotype 
## Smooth functions(2): s(x) s(x):genotype 
## Chromosomes: chrI 
## 
## Size factors:
##        wt_1        wt_2    mutant_1    mutant_2 
## -0.04156987  0.04210819 -0.25589212  0.14328182 
## 
## Cross Validation:  Not performed 
## 
## Spline Parameters:
##   Knot spacing:  20 
##   B-spline order:  2 
##   Penalization order:  2 
## 
## Tile settings:
##   Chunk size: 50000 
##   Tile size: 52000 
##   Overhang size: 1000 
##   Number of tiles: 2 
##   Evaluated genome ranges:
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI  1-100000      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome

## the fit and standard error
fits(result)

## DataFrame with 100000 rows and 2 columns
##             s(x) s(x):genotype
##        <numeric>     <numeric>
## 1       -2.81190      0.532046
## 2       -2.81129      0.531849
## 3       -2.81067      0.531652
## 4       -2.81006      0.531455
## 5       -2.80944      0.531258
## ...          ...           ...
## 99996   -1.86772     -0.110005
## 99997   -1.87528     -0.110351
## 99998   -1.88283     -0.110696
## 99999   -1.89039     -0.111042
## 100000  -1.89794     -0.111388

se(result)

## DataFrame with 100000 rows and 2 columns
##             s(x) s(x):genotype
##        <numeric>     <numeric>
## 1       0.258648      0.317642
## 2       0.257522      0.316405
## 3       0.256401      0.315174
## 4       0.255287      0.313947
## 5       0.254178      0.312727
## ...          ...           ...
## 99996   0.156907      0.209381
## 99997   0.157925      0.210545
## 99998   0.158948      0.211713
## 99999   0.159978      0.212887
## 100000  0.161013      0.214066

Compute log p-values

result <- computeSignificance(result)

## INFO [2020-04-27 22:19:13] Computing positionwise p-values
## INFO [2020-04-27 22:19:13] DONE

pvalue(result)

## DataFrame with 100000 rows and 2 columns
##               s(x) s(x):genotype
##          <numeric>     <numeric>
## 1      1.57485e-27     0.0939372
## 2      9.59125e-28     0.0927802
## 3      5.81947e-28     0.0916313
## 4      3.51775e-28     0.0904906
## 5      2.11845e-28     0.0893582
## ...            ...           ...
## 99996  1.13648e-32      0.599319
## 99997  1.60573e-32      0.600195
## 99998  2.26868e-32      0.601072
## 99999  3.20497e-32      0.601947
## 100000 4.52673e-32      0.602823

Plot results of the center 10kb, where both tiles are joined together.

ranges = GRanges("chrI", IRanges(45000, 55000))
plotGenoGAM(result, ranges = ranges)

2 GenoGAM 2.0 workflow overview

3 Standard ChIP-Seq analysis

Additional to the basic smoothing and point-wise significance computation this version of GenoGAM now also supports differential analysis and peak calling with position-wise confidence intervals on ChIP-Seq data.

3.1 Goal of the analysis

A small dataset is provided to illustrate the ChIP-Seq functionalities. This is a subset of the data published by Thornton et al (Thornton et al. 2014), who assayed histone H3 Lysine 4 trimethylation (H3K4me3) by ChIP-Seq comparing wild type yeast versus a mutant with a truncated form of Set1, the yeast H3 Lysine 4 methylase. The goal of this analysis is the identification of genomic positions that are significantly differentially methylated in the mutant compared to the wild type strain.

To this end, we will build a GenoGAM model that models the logarithm of the expected ChIP-seq fragment counts \(y\) as sums of smooth functions of the genomic position \(x\). Specifically, we write (with simplified notations) that:

\[\begin{equation} \log(\operatorname{E}(y)) = f(x) + \text{genotype} \times f_\text{mutant/wt}(x) \tag{1} \end{equation}\]

where genotype is 1 for data from the mutant samples, and 0 for the wild type. Here the function \(f(x)\) is the reference level, i.e. the log-rate in the wild type strain. The function \(f_\text{mutant/wt}(x)\) is the log-ratio of the mutant over wild-type. We will then statistically test the null hypothesis \(f_\text{mutant/wt}(x) = 0\) at each position \(x\). In the following we show how to build the dataset, perform the fitting of the model and perform the testing.

3.2 Registering a parallel backend

The parallel backend is registered using the BiocParallel package. See the documentation in the BiocParallel class for the correct use. Also note, that BiocParallel is just an interface to multiple parallel packages. For example in order to use GenoGAM on a cluster, the BatchJobs package might be required. The parallel backend can be registered at anytime as GenoGAM will just call the current one.

IMPORTANT: According to this and this posts on the Bioconductor support page and R-devel mailing list, the most important core feature of the multicore backend, shared memory, is compromised by Rs own garbage collector, resulting in a replication of the entire workspace across all cores. Given that the amount of data in memory is big (without the HDF5 backend) it might crash the entire system. We highly advice to register the SnowParam backend to avoid this if working on larger organisms. This way the overhead is a little bigger, but only necessary data is copied to the workers, keeping memory consumption relatively low. We never experienced a higher load than 4GB per core, usually it was around 2GB on human genome, which is even lower with the HDF5 backend.

library(GenoGAM)

## On multicore machines by default the number of available cores - 2 are registered on the default backend 'Multicore'
BiocParallel::registered()[1]

## $MulticoreParam
## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK

For this small example we would like to assign less workers. Check BiocParallel for other possible backends and more options for SnowParam

BiocParallel::register(BiocParallel::SnowParam(workers = 3))

If we check the current registered backend, we see that the number of workers and the backend has changed.

BiocParallel::registered()[1]

## $SnowParam
## class: SnowParam
##   bpisup: FALSE; bpnworkers: 3; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCK

3.3 Building a GenoGAMDataSet

BAM files restricted to the first 100kb of yeast chromosome I are provided in the inst/extdata folder of the GenoGAM package. This folder also contains a flat file describing the experimental design.

We start by loading the experimental design from the tab-separated text file experimentDesign.txt into a data frame:

folder <- system.file("extdata/Set1", package='GenoGAM')

expDesign <- read.delim(
  file.path(folder, "experimentDesign.txt")
)

expDesign

##         ID                                         file paired genotype
## 1     wt_1 H3K4ME3_Full_length_Set1_Rep_1_chr1_100k.bam  FALSE        0
## 2     wt_2 H3K4ME3_Full_length_Set1_Rep_2_chr1_100k.bam  FALSE        0
## 3 mutant_1  H3K4ME3_aa762-1080_Set1_Rep_1_chr1_100k.bam  FALSE        1
## 4 mutant_2  H3K4ME3_aa762-1080_Set1_Rep_2_chr1_100k.bam  FALSE        1

Each row of the experiment design corresponds to the alignment files in BAM format of one ChIP sample. In case of multiplexed sequencing, the BAM files must have been demultiplexed. The experiment design have named columns. Three column names have a fixed meaning for GenoGAMDataSet and must be provided: ID, file, paired. The field ID stores a unique identifier for each alignment file. It is recommended to use short and easy to understand identifiers because they are subsequently used for labelling data and plots. The field file stores the BAM file name without the directory. The field paired is boolean with values TRUE for paired-end sequencing data, and FALSE for single-end sequencing data. All other columns are taken as the experimental design and must be of binary type. The names of those columns must be identical to the names provided later in the design formula. It will allow us modeling the differential occupancy or call peaks later on. Here the important one is the genotype column.

We will now count sequencing fragment centers per genomic position and store these counts into a GenoGAMDataSet class. GenoGAM reduces ChIP-Seq data to fragment center counts rather than full base coverage so that each fragment is counted only once. This reduces artificial correlation between adjacent nucleotides. For single-end libraries, the fragment center is estimated by shifting the read end position by a constant. The estimation of this constant makes use of methods from the chipseq package (See help of the chipseq::estimate.mean.fraglen method).

The parameters needed to create a GenoGAMDataSetare:

the experiment design as either a path to the config file or a data.frame.
the directory of the BAM files
the design formula

## build the GenoGAMDataSet
wd_folder <- file.path(folder, "bam")
ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  design = ~ s(x) + s(x, by = genotype)
)

## INFO [2020-04-27 22:19:14] Creating GenoGAMDataSet
## INFO [2020-04-27 22:19:15] Reading in data
## INFO [2020-04-27 22:19:15] Reading in wt_1
## INFO [2020-04-27 22:19:15] Reading in wt_2
## INFO [2020-04-27 22:19:16] Reading in mutant_1
## INFO [2020-04-27 22:19:16] Reading in mutant_2
## INFO [2020-04-27 22:19:16] Finished reading in data
## INFO [2020-04-27 22:19:16] GenoGAMDataSet created

assay(ggd)

## DataFrame with 100000 rows and 4 columns
##         wt_1  wt_2 mutant_1 mutant_2
##        <Rle> <Rle>    <Rle>    <Rle>
## 1          0     0        0        0
## 2          0     0        0        0
## 3          0     0        0        0
## 4          0     0        0        0
## 5          0     0        0        0
## ...      ...   ...      ...      ...
## 99996      0     0        0        0
## 99997      0     0        0        0
## 99998      0     0        0        0
## 99999      0     0        0        0
## 100000     1     1        0        0

## overhang size
getOverhangSize(ggd)

## [1] 1000

## chunk size
getChunkSize(ggd)

## [1] 98000

The GenoGAMDataSet class stores this count data into a structure that index genomic positions over tiles, defined by chunkSize and overhangSize. A bit of background is required to understand these parameters. The smoothing in GenoGAM is based on splines (Figure 1), which are piecewise polynomials. The knots are the positions where the polynomials connect. In our experience, one knot every 20 to 50 bp is required for enough resolution of the smooth fits in typical applications. The fitting of generalized additive models involves steps demanding a number of operations proportional to the square of the number of knots, preventing fits along whole chromosomes. To make the fitting of GAMs genome-wide, GenoGAM performs fitting on overlapping intervals (tiles), and join the fit at the midpoint of the overlap of consecutive tiles. The parameters chunkSize and overhangSize defines the tiles, where the chunk is the core part of a tile that does not overlap other tiles, and the overhangs are the two overlapping parts.

Both variables influence the computation speed and memory usage as they define into how many (and how big) tiles the overall genome is being split for parallel computation. Additionally, a too small overhang size might cause jigged fits at junction points where the chunks are glued back together. Therefore it is advised to keep overhang size to be around 1kb (and is set like this by default). Chunk size is by default roughly estimated to fit the overall modelling problem without using too much memory. However, this estimation is a very simple heuristic, thus if problems are experienced it is advised to set the chunk size yourself. We had good exerience with numbers between 50 - 100kb.

Figure1: An example spline. Displayed are seven cubic B-spline basis functions which together make up the complete spline (dark blue above). The knots are depicted as dark-grey dots at the bottom-center of each basis function. All basis functions have been multiplied by their respective coefficient and thus differ.

3.4 More options with GenoGAMSettings and the HDF5 backend

For most genome sizes beyond the yeast genome, keeping all data in memory can be impractical or even impossible to fit. Therefore, the HDF5 backend is employed to store all relevant data on hard drive. This can be simply done by setting the parameter hdf5 to true. This will also trigger a split of the underlying data into a list of data frames by chromosome. Splitting data is theoretically not necessary. However, due to problems of representing long integers in the Bioconductor infrastructure data frames of size biger than 2^31 (around 2.14 gigabase pairs) can not be stored easily or only through workarounds that would require more storage. Moreover, because HDF5 data is stored on hard drive it can be easily moved to a different device but has to be read-in again. To makes this easy, the parameter fromHDF5 can be set to true, to trigger GenoGAMDataSet to read from already present HDF5 files.

ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  design = ~ s(x) + s(x, by = genotype), hdf5 = TRUE
)

## INFO [2020-04-27 22:19:16] Creating GenoGAMDataSet
## INFO [2020-04-27 22:19:17] Reading in data
## INFO [2020-04-27 22:19:17] Reading in wt_1
## INFO [2020-04-27 22:19:17] Reading in wt_2
## INFO [2020-04-27 22:19:18] Reading in mutant_1
## INFO [2020-04-27 22:19:18] Reading in mutant_2
## INFO [2020-04-27 22:19:19] Finished reading in data
## INFO [2020-04-27 22:19:19] Writing chrI to HDF5
## INFO [2020-04-27 22:19:20] chrI written
## INFO [2020-04-27 22:19:21] GenoGAMDataSet created

## Note the difference in data type compared to the DataFrame above
assay(ggd)

## $chrI
## <100000 x 4> matrix of class DelayedMatrix and type "integer":
##           wt_1 wt_2 mutant_1 mutant_2
##      [1,]    0    0        0        0
##      [2,]    0    0        0        0
##      [3,]    0    0        0        0
##      [4,]    0    0        0        0
##      [5,]    0    0        0        0
##       ...    .    .        .        .
##  [99996,]    0    0        0        0
##  [99997,]    0    0        0        0
##  [99998,]    0    0        0        0
##  [99999,]    0    0        0        0
## [100000,]    1    1        0        0

By default, the HDF5 files are stored in the temp folder, which is impractical if the data has to be kept and shared. Therefore the HDF5 folder can be specified explicitly in the GenoGAMSettings object. This object holds a lot of meta parmeters which are responsible for the functionality of GenoGAM. In particular it guards the following areas:

Reading-in and process data
Parameter estimation during model fitting and cross-validation
Data and storage control

All settings can be seen by just printing the GenoGAMSettings object. Additionally to the meta parameters it will also show the parallel backend information and the logger threshold, although both are not part of the GenoGAMSettings object.

GenoGAMSettings()

## -------------------- Read-in parameters -----------------
## Center: TRUE 
## Chromosomes:  
## Custom process function: disabled 
## 
## -------------------- BAM parameters ---------------------
## class: ScanBamParam
## bamFlag (NA unless specified):
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:  
## bamTagFilter:
## bamWhich: 0 ranges
## bamWhat: pos, qwidth
## bamMapqFilter: NA
## 
## -------------------- Parallel backend -------------------
## class: SnowParam
##   bpisup: FALSE; bpnworkers: 3; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCK
## 
## -------------------- Logger settings --------------------
## Logger threshold: INFO 
## 
## -------------------- HDF5 settings ----------------------
## HDF5 Directory: /tmp/RtmpSSdsH8/HDF5Array_dump 
## HDF5 Compression Level (0-9): 6 
## 
## -------------------- Data settings ----------------------
## Region size: 4000 
## 
## Basepairs per Knot: 20 
## 
## -------------------- Optimization parameters ------------
## Optimization method: Nelder-Mead 
## Optimization control:
##   maxit: 60
##   fnscale: -1
##   trace: 1
## Parameter estimation control:
##   eps: 1e-06
##   maxiter: 1000

3.4.1 Reading in data

Reading in data is guarded by the bamParams slot. It makes direct use of the class ScanBamParam from the Rsamtools package. Unless the data should be restricted to specific chromosomes, it is the easiest way to specify particular regions that should be read in. Chromosomes can be simply restricted through the chromosomeList slot. Finally, the ‘center’ slot just specifies if fragments should be centered or not.

range <- GRanges("chrI", IRanges(30000, 50000))
params <- Rsamtools::ScanBamParam(which = range)
settings <- GenoGAMSettings(center = FALSE,
                            bamParams = params)

ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  design = ~ s(x) + s(x, by = genotype),
  settings = settings)

## INFO [2020-04-27 22:19:21] Creating GenoGAMDataSet
## INFO [2020-04-27 22:19:22] Reading in data
## INFO [2020-04-27 22:19:22] Reading in wt_1
## INFO [2020-04-27 22:19:22] Reading in wt_2
## INFO [2020-04-27 22:19:22] Reading in mutant_1
## INFO [2020-04-27 22:19:23] Reading in mutant_2
## INFO [2020-04-27 22:19:23] Finished reading in data
## INFO [2020-04-27 22:19:23] GenoGAMDataSet created

## Note the higher counts
assay(ggd)

## DataFrame with 20001 rows and 4 columns
##        wt_1  wt_2 mutant_1 mutant_2
##       <Rle> <Rle>    <Rle>    <Rle>
## 1        20     6        3        5
## 2        20     6        3        5
## 3        20     6        3        5
## 4        20     6        3        5
## 5        20     6        3        5
## ...     ...   ...      ...      ...
## 19997    14     5        6        7
## 19998    14     5        6        7
## 19999    14     5        6        7
## 20000    13     5        6        7
## 20001    13     5        6        7

rowRanges(ggd)

## StitchedGPos object with 20001 positions and 0 metadata columns:
##           seqnames       pos strand
##              <Rle> <integer>  <Rle>
##       [1]     chrI     30000      *
##       [2]     chrI     30001      *
##       [3]     chrI     30002      *
##       [4]     chrI     30003      *
##       [5]     chrI     30004      *
##       ...      ...       ...    ...
##   [19997]     chrI     49996      *
##   [19998]     chrI     49997      *
##   [19999]     chrI     49998      *
##   [20000]     chrI     49999      *
##   [20001]     chrI     50000      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome

## Now with chromosome specification. As only chrI is present
## in the example file, the read in will log an error and generate
## an empty GenoGAMDataSet object
settings <- GenoGAMSettings(chromosomeList = c("chrII", "chrIII"))

ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  design = ~ s(x) + s(x, by = genotype),
  settings = settings)

## INFO [2020-04-27 22:19:23] Creating GenoGAMDataSet
## ERROR [2020-04-27 22:19:23] No chromosomes to read in. Check either the specified settings or the header of BAM file
## INFO [2020-04-27 22:19:23] GenoGAMDataSet created

## Empty GenoGAMDataSet due to lack of data in the example BAM file
length(ggd)

## [1] 0

ggd

## class: GenoGAMDataSet 
## dimension: 0 0 
## samples(0): 
## design variables(0):  
## size factors:
## numeric(0)
## formula:
## ~ s(x)

3.4.2 Parameter estimation

Cross-validaton is performed to estimate the penalization parameter \(\lambda\) and the dispersion parameter \(\theta\). This is done iteratively maximising the likelihood, where each iteration involved performing one set of k-fold cross-validation. The default method maximising the likelihood function is Nelder-Mead, as the likelihood function can not be analytically specified and has no derivatives. The method can be changed to any method in Rs optim function, but the only reasonable candidate is L-BFGS-B. Moreover, the number of iterations can be changed. This is also true for the model fitting function. There, the number of maximum iterations and the threshold value can be changed.

## Note, optimControl and estimControl is a list of more parameters. However none of them besides the shown are important
settings <- GenoGAMSettings(optimMethod = "L-BFGS-B", optimControl = list(maxit = 100L),
                            estimControl = list(eps = 1e-10, maxiter = 500L)) 
settings

## -------------------- Read-in parameters -----------------
## Center: TRUE 
## Chromosomes:  
## Custom process function: disabled 
## 
## -------------------- BAM parameters ---------------------
## class: ScanBamParam
## bamFlag (NA unless specified):
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:  
## bamTagFilter:
## bamWhich: 0 ranges
## bamWhat: pos, qwidth
## bamMapqFilter: NA
## 
## -------------------- Parallel backend -------------------
## class: SnowParam
##   bpisup: FALSE; bpnworkers: 3; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCK
## 
## -------------------- Logger settings --------------------
## Logger threshold: INFO 
## 
## -------------------- HDF5 settings ----------------------
## HDF5 Directory: /tmp/RtmpSSdsH8/HDF5Array_dump 
## HDF5 Compression Level (0-9): 6 
## 
## -------------------- Data settings ----------------------
## Region size: 4000 
## 
## Basepairs per Knot: 20 
## 
## -------------------- Optimization parameters ------------
## Optimization method: L-BFGS-B 
## Optimization control:
##   maxit: 100
##   fnscale: -1
##   trace: 1
## Parameter estimation control:
##   eps: 1e-10
##   maxiter: 500

3.4.3 Data control

Probably the most important parameter to set through GenoGAMSettings is the HDF5 folder. Additionally, the compression level can also be changed as an integer between 0 (Lowest) and 9 (Highest). Note, with specifying the folder, the folder gets created!

## set HDF5 folder to 'myFolder'
tmp <- tempdir()
settings <- GenoGAMSettings(hdf5Control = list(dir = file.path(tmp, "myFolder"), level = 0))
HDF5Array::getHDF5DumpDir()

## [1] "/tmp/RtmpSSdsH8/myFolder"

HDF5Array::getHDF5DumpCompressionLevel()

## [1] 0

## Another way to set this would be through the HDF5Array package
HDF5Array::setHDF5DumpDir(file.path(tmp, "myOtherFolder"))
settings

## -------------------- Read-in parameters -----------------
## Center: TRUE 
## Chromosomes:  
## Custom process function: disabled 
## 
## -------------------- BAM parameters ---------------------
## class: ScanBamParam
## bamFlag (NA unless specified):
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:  
## bamTagFilter:
## bamWhich: 0 ranges
## bamWhat: pos, qwidth
## bamMapqFilter: NA
## 
## -------------------- Parallel backend -------------------
## class: SnowParam
##   bpisup: FALSE; bpnworkers: 3; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCK
## 
## -------------------- Logger settings --------------------
## Logger threshold: INFO 
## 
## -------------------- HDF5 settings ----------------------
## HDF5 Directory: /tmp/RtmpSSdsH8/myFolder 
## HDF5 Compression Level (0-9): 0 
## 
## -------------------- Data settings ----------------------
## Region size: 4000 
## 
## Basepairs per Knot: 20 
## 
## -------------------- Optimization parameters ------------
## Optimization method: Nelder-Mead 
## Optimization control:
##   maxit: 60
##   fnscale: -1
##   trace: 1
## Parameter estimation control:
##   eps: 1e-06
##   maxiter: 1000

As it was mentioned before the fitted splines depend on the placement of knots. The more knots are placed the wigglier the fit and the more the spline has to be penalized to avoid overfitting. This is done automatically. However too little knots will lead to underfitting with no automatic resolution. The default setting is at one knot at every 20bp. Given our experience this is a conservative setting, that works for all applications. For example narrow peak calling might need more knots than broad peak calling as high resolution is required to be able to identify two very close peaks as two peaks instead of one. Because the computational fitting problem grows quadaratically with the number of knots, less knots might speed up computation. However, this might also lead to worse results given the downstream applicaton. Thus, this parameter should be handled with care.

Additionally, computation is speed up during cross-validation by using smaller tiles. Since \(\lambda\) and \(\theta\) are independent of length, smaller tiles can be used to speed up parameter estimation. The parameter is set as chunk size and defaults to 4kb. Generally one shouldn’t go below 2kb as the computation it might become unstable in combination with sparse data.

## For example, we choose a twice as long region but also two times less knots (double the knot spacing),
## which results in the same number of knots per tile as before.
settings <- GenoGAMSettings(dataControl = list(regionSize = 8000, bpknots = 40))
settings

## -------------------- Read-in parameters -----------------
## Center: TRUE 
## Chromosomes:  
## Custom process function: disabled 
## 
## -------------------- BAM parameters ---------------------
## class: ScanBamParam
## bamFlag (NA unless specified):
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:  
## bamTagFilter:
## bamWhich: 0 ranges
## bamWhat: pos, qwidth
## bamMapqFilter: NA
## 
## -------------------- Parallel backend -------------------
## class: SnowParam
##   bpisup: FALSE; bpnworkers: 3; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCK
## 
## -------------------- Logger settings --------------------
## Logger threshold: INFO 
## 
## -------------------- HDF5 settings ----------------------
## HDF5 Directory: /tmp/RtmpSSdsH8/myOtherFolder 
## HDF5 Compression Level (0-9): 0 
## 
## -------------------- Data settings ----------------------
## Region size: 8000 
## 
## Basepairs per Knot: 40 
## 
## -------------------- Optimization parameters ------------
## Optimization method: Nelder-Mead 
## Optimization control:
##   maxit: 60
##   fnscale: -1
##   trace: 1
## Parameter estimation control:
##   eps: 1e-06
##   maxiter: 1000

Note: It is generally not advised to change the settings parameters, unless it is the specification of regions for data read-in or the HDF5 folder.

3.5 Size factor estimation

Sequencing libraries typically vary in sequencing depth. Such variations is controlled for in GenoGAM by adding a sample-specific constant to the right term of Equation (1). The estimation of these constants is performed by the function computeSizeFactor() as follows:

## read in data again, since the last read-in was an empty GenoGAM
ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  design = ~ s(x) + s(x, by = genotype)
)

## INFO [2020-04-27 22:19:24] Creating GenoGAMDataSet
## INFO [2020-04-27 22:19:25] Reading in data
## INFO [2020-04-27 22:19:25] Reading in wt_1
## INFO [2020-04-27 22:19:25] Reading in wt_2
## INFO [2020-04-27 22:19:25] Reading in mutant_1
## INFO [2020-04-27 22:19:26] Reading in mutant_2
## INFO [2020-04-27 22:19:26] Finished reading in data
## INFO [2020-04-27 22:19:26] GenoGAMDataSet created

ggd <- computeSizeFactors(ggd)

## INFO [2020-04-27 22:19:26] Computing size factors
## INFO [2020-04-27 22:19:27] DONE

sizeFactors(ggd)

##        wt_1        wt_2    mutant_1    mutant_2 
## -0.04156987  0.04210819 -0.25589212  0.14328182

Note: The size factors in GenoGAM are in the natural logarithm scale. Also factor groups are identified automatically from the design. Given more complex design, this might fail. Or, maybe different factor groups are desired. In this case the groups can be provided separatelly through the factorGroups argument.

3.6 Model fitting

A GenoGAM model requires two further parameters to be fitted: the regularization parameter, \(\lambda\), and the dispersion parameter \(\theta\). The regularization parameters \(\lambda\) controls the amount of smoothing. The larger \(\lambda\) is, the smoother the smooth functions are. The dispersion parameter \(\theta\) controls how much the observed counts deviate from their expected value modeled by Equation (1). The dispersion captures biological and technical variation which one typically sees across replicate samples, but also errors of the model. In GenoGAM, the dispersion is modeled by assuming the counts to follow a negative binomial distribution with mean \(\mu=\operatorname{E}(y)\) whose logarithm is modeled by Equation (1) and with variance \(v = \mu + \mu^2/\theta\).

If not provided, the parameters \(\lambda\) and \(\theta\) are obtained by cross-validation. This step is too time consuming for a vignette. For sake of going through this example quickly, we provide the values manually:

## fit model without parameters estimation
fit <- genogam(ggd, lambda = 4601, theta = 4.51)

## INFO [2020-04-27 22:19:27] Initializing the model
## INFO [2020-04-27 22:19:27] Done
## INFO [2020-04-27 22:19:27] Fitting model
## INFO [2020-04-27 22:19:35] Done
## INFO [2020-04-27 22:19:35] Processing fits
## INFO [2020-04-27 22:19:35] Processing done
## INFO [2020-04-27 22:19:35] Finished

fit

## Family:  nb 
## Formula:  ~ s(x) + s(x, by = genotype) 
## Class: GenoGAM 
## Dimension: 100000 4 
## Samples(4): wt_1 wt_2 mutant_1 mutant_2 
## Design variables(1): genotype 
## Smooth functions(2): s(x) s(x):genotype 
## Chromosomes: chrI 
## 
## Size factors:
##        wt_1        wt_2    mutant_1    mutant_2 
## -0.04156987  0.04210819 -0.25589212  0.14328182 
## 
## Cross Validation:  Not performed 
## 
## Spline Parameters:
##   Knot spacing:  20 
##   B-spline order:  2 
##   Penalization order:  2 
## 
## Tile settings:
##   Chunk size: 98000 
##   Tile size: 1e+05 
##   Overhang size: 1000 
##   Number of tiles: 1 
##   Evaluated genome ranges:
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI  1-100000      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome

Remark on parameter estimation: To estimate the parameters \(\lambda\) and \(\theta\) by cross-validation, call genogam() without setting their values. This will perform a k-fold cross-validation on each tile with initial parameter values and iterate until convergence, often for about 30 iterations. The number of folds can be set with parameter kfolds. We recommend to do it for 20 to 40 different regions (the number can be set with parameter regions). During this procedure the underlying optimization algorithm (Nelder-Mead by default) will explore the possible parameter space. In case of low-count regions this might result in unstable matrix operations, in particular in resolutions of linear systems. To stabilize the computation a small first-order regularization might be added through the parameter eps. This will change the penalization term from \(\beta^T \textbf{S} \beta\) to \(\beta^T \textbf{S} \beta + \epsilon \textbf{I}\), where identity matrix \(\textbf{I}\) is of the same dimension as penalty matrix \(\textbf{S}\). The parameter eps represents \(\epsilon\) in the equation above. Thus, a too large \(\epsilon\) will also have a significant effect on the general fit, while a small \(\epsilon\) will mostly affect small values close too zero. We recommend an \(\epsilon\) value in the range of 1e-4 - 1e-3.

An important difference of cross-validation in GenoGAM is that instead of leaving out data points, intervals are left out. This improves the parameter estimation by taking into account local correlation where leaving out single points might lead to overfitting. If replicates are present in the data the cross-validation can be further improved by using larger intervals (20bp by default) around the size of a ChIP-Seq fragment, i.e. 200bp. Thus, the large left out interval will be captured by the data in the replicate.

## Does not evaluate
fit_CV <- genogam(ggd, intervalSize = 200)

Remark on parallel computing: If using the SnowParams backend, the initiation of a worker and copying relevant data might take a few seconds. Especially if this is done sequentially. During cross-validation, where computation is done on small tiles, computation might therefore be only a fraction of the total overhead. Therefore, the number of workers is automatically set to an optimal lower number during cross-validation and reset afterwards.

3.7 Plotting results

Following the convention of the packages mgcv and gam the names of the fit for the smooth function defined by the by variable in the design formula (which is the same as the column name in the config) follow the pattern s(x):by-variable. Here, the smooth function of interest \(f_\text{mutant/wt}(x)\) is thus named s(x):genotype.

ranges = GRanges("chrI", IRanges(45000, 55000))
plotGenoGAM(fit, ranges = ranges)

Because the data is generally too big to plot in its totality. The plotting function requires a specifig range, which is best provided by a GRanges object and supplied to the ranges argument. Additionally, it can be specified if the smooths should be plotted on the same scale (scale), the values should be displayed in log format (log) or if the raw data count data should be included (provided GenoGAMDataSet to ggd, unfortunatelly hard to show in the vignette).

plotGenoGAM(fit, ranges = ranges, scale = FALSE)

3.8 Statistical testing

We test for each smooth and at each position \(x\) the null hypothesis that it values 0 by a call to computeSignificance(). This gives us pointwise p-values.

fit <- computeSignificance(fit)

## INFO [2020-04-27 22:19:37] Computing positionwise p-values
## INFO [2020-04-27 22:19:37] DONE

pvalue(fit)

## DataFrame with 100000 rows and 2 columns
##               s(x) s(x):genotype
##          <numeric>     <numeric>
## 1      1.60811e-27     0.0940669
## 2      9.79473e-28     0.0929080
## 3      5.94354e-28     0.0917573
## 4      3.59313e-28     0.0906147
## 5      2.16408e-28     0.0894804
## ...            ...           ...
## 99996  1.16158e-32      0.600230
## 99997  1.64094e-32      0.601107
## 99998  2.31806e-32      0.601984
## 99999  3.27430e-32      0.602859
## 100000 4.62426e-32      0.603735

3.9 Differential binding

If region-wise significance is needed, as in the case of differential binding, then we call computeRegionSignificance(). This returns the provided GRanges object updated by a p-value and FRD column. If smooth is not provided it is computed for all smooths.

gr <- GRanges("chrI", IRanges(c(1000, 7000), c(4000, 9000)))
db <- computeRegionSignificance(fit, regions = gr, smooth = "s(x):genotype")

## INFO [2020-04-27 22:19:39] Estimating region p-values and FDR
## INFO [2020-04-27 22:19:39] Computing positionwise log p-values
## INFO [2020-04-27 22:19:39] DONE
## INFO [2020-04-27 22:19:39] Performing multiple correction
## INFO [2020-04-27 22:19:39] DONE

db

## $`s(x):genotype`
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |    pvalue       FDR
##          <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   [1]     chrI 1000-4000      * |  0.249909  0.249909
##   [2]     chrI 7000-9000      * |  0.249536  0.249909
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

3.10 Peak calling

The computed fit allows for easy peak calling of narrow and broad peaks via the callPeaks(). The resulting data.table object provides a similar structure as the ENCODE narrowPeak and broadPeak format. Note, the score is given as the negative natural logarithm of the p-values. Also no peaks borders are provided (a function to compute them just as in GenoGAM 1.0 is being implemented). As this is not peak calling data, we use a high threshold do demonstrate the functionality.

peaks <- callPeaks(fit, threshold = 1, smooth = "s(x):genotype")

## INFO [2020-04-27 22:19:40] Calling narrow peaks
## INFO [2020-04-27 22:19:40] DONE

peaks

## $`s(x):genotype`
##     seqnames   pos       zscore      score       fdr    summit
##  1:     chrI 77623  3.953526539 10.1647521 0.0000000 0.8218586
##  2:     chrI 59362  2.299714606  4.5345070 0.5555556 1.4055236
##  3:     chrI 62226  2.605805754  5.3854186 0.5714286 1.7176303
##  4:     chrI 59883  2.632129470  5.4625846 0.6000000 1.3510637
##  5:     chrI 81446  2.233273750  4.3610131 0.6000000 1.3675995
##  6:     chrI 83493  2.307658693  4.5555172 0.6250000 1.5556027
##  7:     chrI 90239  2.611568704  5.4022580 0.6666667 0.9923707
##  8:     chrI 64237  2.983954682  6.5551675 0.7500000 1.3781992
##  9:     chrI 19099  0.007317323  0.6990026 0.8333333 3.1174076
## 10:     chrI 27398  0.123957859  0.7970103 0.8363636 1.0669989
## 11:     chrI 22332  0.274332932  0.9367119 0.8367347 4.2479926
## 12:     chrI  2357  0.010564995  0.7016124 0.8474576 3.0423462
## 13:     chrI 57733  0.184091053  0.8510392 0.8490566 1.4130322
## 14:     chrI  7751  0.306173644  0.9682784 0.8510638 1.2812820
## 15:     chrI 44866  0.130774568  0.8030138 0.8518519 2.0274223
## 16:     chrI 88055  0.284273605  0.9464913 0.8541667 3.4807577
## 17:     chrI 18116  0.085875987  0.7640365 0.8571429 1.0741642
## 18:     chrI 42998  0.058164384  0.7406396 0.8596491 6.1783913
## 19:     chrI 38415  0.381625912  1.0459171 0.8604651 1.5213175
## 20:     chrI 23137  0.015640543  0.7057045 0.8620690 4.6110303
## 21:     chrI 21262  0.200954529  0.8666270 0.8653846 1.7848703
## 22:     chrI 14576  0.338250353  1.0007945 0.8695652 3.6128455
## 23:     chrI 27850  0.257966317  0.9207596 0.8800000 1.9101958
## 24:     chrI 15951  0.387984743  1.0526442 0.8809524 1.5720949
## 25:     chrI 61001  0.204287531  0.8697307 0.8823529 1.0008833
## 26:     chrI  7237  0.350087567  1.0129765 0.8863636 1.2718125
## 27:     chrI 31286  0.343267025  1.0059453 0.8888889 3.1725112
## 28:     chrI 13524  0.395600685  1.0607392 0.9024390 1.6710578
## 29:     chrI 24877  0.399441628  1.0648374 0.9250000 1.5034923
## 30:     chrI 23903  0.623616735  1.3226075 0.9354839 1.3912920
## 31:     chrI 49699  1.245915796  2.2405717 0.9375000 1.3353167
## 32:     chrI  4797  0.455874666  1.1262711 0.9459459 1.4687162
## 33:     chrI 41399  0.422117971  1.0892479 0.9487179 2.0248993
## 34:     chrI 76493 -0.246390343  0.5153191 0.9516129 1.4352471
## 35:     chrI 39240 -1.212076272  0.1196190 0.9571429 2.2251183
## 36:     chrI 57191  0.624069757  1.3231660 0.9666667 0.7640392
## 37:     chrI 76594 -0.229948090  0.5260510 0.9672131 2.2555678
## 38:     chrI 28556 -0.338869459  0.4578443 0.9682540 1.3313414
## 39:     chrI  9026 -0.388672158  0.4288760 0.9692308 4.1543744
## 40:     chrI 51485  0.557091881  1.2422552 0.9696970 2.0802139
## 41:     chrI   936 -0.814219812  0.2328903 0.9705882 1.9832077
## 42:     chrI 37176 -1.054367667  0.1576570 0.9710145 1.9079031
## 43:     chrI 47888  0.469061169  1.1409575 0.9722222 1.2486934
## 44:     chrI 73443  0.434946481  1.1032210 0.9736842 1.4434126
## 45:     chrI 68581 -0.375111593  0.4366288 0.9843750 3.6005765
## 46:     chrI 34811 -0.795611070  0.2396910 0.9850746 1.8982624
## 47:     chrI 91386  2.993624852  6.5868149 1.0000000 1.4422657
## 48:     chrI 86031  1.509538353  2.7244749 1.0000000 1.5380176
## 49:     chrI 98729  1.479515619  2.6664101 1.0000000 1.2249742
## 50:     chrI 75279  1.294625768  2.3256011 1.0000000 1.2876984
## 51:     chrI 25457  0.654689506  1.3612746 1.0000000 0.8438616
## 52:     chrI 12615  0.627263028  1.3271073 1.0000000 0.8576204
## 53:     chrI 17336  0.604812635  1.2995599 1.0000000 0.7174821
## 54:     chrI 46915  0.506105669  1.1828924 1.0000000 1.4172731
## 55:     chrI 87700  0.470411916  1.1424690 1.0000000 1.2082224
## 56:     chrI 35955 -0.772600348  0.2483068 1.0000000 1.5995945
##     seqnames   pos       zscore      score       fdr    summit

peaks <- callPeaks(fit, threshold = 1, peakType = "broad",
                   cutoff = 0.75, smooth = "s(x):genotype")

## INFO [2020-04-27 22:19:40] Calling broad peaks
## INFO [2020-04-27 22:19:40] DONE

peaks

## $`s(x):genotype`
##     seqnames start    end width strand     score meanSignal       fdr
##  1:     chrI     1    578   578      * 0.2881752  1.3225292 0.8687306
##  2:     chrI  1937   5520  3584      * 0.2880604  1.0858642 0.8687306
##  3:     chrI  6557  11060  4504      * 0.2460828  1.1853691 0.8687306
##  4:     chrI 11900  26025 14126      * 0.2880586  1.2344399 0.8687306
##  5:     chrI 26974  29868  2895      * 0.2883320  1.1872678 0.8687306
##  6:     chrI 30637  33619  2983      * 0.2880309  1.3303773 0.8687306
##  7:     chrI 37978  38914   937      * 0.2883072  1.1866105 0.8687306
##  8:     chrI 39710  41842  2133      * 0.2885133  1.5021450 0.8687306
##  9:     chrI 42744  43261   518      * 0.2879385  1.0780682 0.8687306
## 10:     chrI 43812  45263  1452      * 0.2877189  1.1058146 0.8687306
## 11:     chrI 46322  48244  1923      * 0.2880785  1.2790290 0.8687306
## 12:     chrI 49127  52097  2971      * 0.2884650  1.3329063 0.8687306
## 13:     chrI 68366  68770   405      * 0.2883727  0.9572166 0.8687306
## 14:     chrI 69998  71392  1395      * 0.2880892  1.4932166 0.8687306
## 15:     chrI 72905  79495  6591      * 1.3726813  2.2360777 0.8687306
## 16:     chrI 80251  84073  3823      * 0.2883924  2.0675725 0.8687306
## 17:     chrI 84745  86766  2022      * 0.2884386  1.6236602 0.8687306
## 18:     chrI 87351  92044  4694      * 0.2886375  2.3349403 0.8687306
## 19:     chrI 52845  66840 13996      * 0.1385248  1.7610177 0.8706416
## 20:     chrI 92840 100000  7161      * 0.1497634  1.5247567 0.8706416

The function writeToBEDFile() provides an easy way to write the peaks data.table to a narrowPeaks or broadPeaks file. The suffix will be determined automatically

## Not evaluated
writeToBEDFile(peaks, file = 'myPeaks')

4 FAQ

4.1 Armadillo Error

Problem: An error occurs during model fitting along those lines:

error: matrix multiplication: incompatible matrix dimensions: 22333830147200x5360120267008000 and 4294972496x1

Solution: First, make sure you have all Armadillo dependencies installed correctly. See here

Second, the error is most likely related to the fact, that Armadillo is using 32bit matrices, thus causing problems for large matrices GenoGAM is using. The solution is to enable ARMA_64BIT_WORD, which is not enabled in RcppArmadillo by default. This should have been done during compilation, but if it fails for some reason you can do it manually with #define ARMA_64BIT_WORD 1 in my_R_Directory/lib/R/library/RcppArmadillo/include/RcppArmadilloConfig.h. See here.

5 Acknowledgments

We thank Alexander Engelhardt, Mathilde Galinier, Simon Wood, Herv'e Pag`es, and Martin Morgan for input in the development of GenoGAM

6 Session Info

sessionInfo()

## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GenoGAM_2.6.0               data.table_1.12.8          
##  [3] Matrix_1.2-18               HDF5Array_1.16.0           
##  [5] rhdf5_2.32.0                SummarizedExperiment_1.18.0
##  [7] DelayedArray_0.14.0         matrixStats_0.56.0         
##  [9] Biobase_2.48.0              GenomicRanges_1.40.0       
## [11] GenomeInfoDb_1.24.0         IRanges_2.22.0             
## [13] S4Vectors_0.26.0            BiocGenerics_0.34.0        
## [15] BiocStyle_2.16.0           
## 
## loaded via a namespace (and not attached):
##  [1] bit64_0.9-7              splines_4.0.0            dotCall64_1.0-0         
##  [4] assertthat_0.2.1         BiocManager_1.30.10      latticeExtra_0.6-29     
##  [7] blob_1.2.1               Rsamtools_2.4.0          GenomeInfoDbData_1.2.3  
## [10] yaml_2.2.1               pillar_1.4.3             RSQLite_2.2.0           
## [13] lattice_0.20-41          glue_1.4.0               digest_0.6.25           
## [16] RColorBrewer_1.1-2       XVector_0.28.0           colorspace_1.4-1        
## [19] htmltools_0.4.0          chipseq_1.38.0           DESeq2_1.28.0           
## [22] XML_3.99-0.3             pkgconfig_2.0.3          ShortRead_1.46.0        
## [25] magick_2.3               genefilter_1.70.0        bookdown_0.18           
## [28] zlibbioc_1.34.0          purrr_0.3.4              xtable_1.8-4            
## [31] scales_1.1.0             jpeg_0.1-8.1             BiocParallel_1.22.0     
## [34] tibble_3.0.1             annotate_1.66.0          ggplot2_3.3.0           
## [37] ellipsis_0.3.0           survival_3.1-12          magrittr_1.5            
## [40] crayon_1.3.4             memoise_1.1.0            evaluate_0.14           
## [43] hwriter_1.3.2            tools_4.0.0              sparseinv_0.1.3         
## [46] formatR_1.7              lifecycle_0.2.0          stringr_1.4.0           
## [49] Rhdf5lib_1.10.0          munsell_0.5.0            locfit_1.5-9.4          
## [52] lambda.r_1.2.4           AnnotationDbi_1.50.0     Biostrings_2.56.0       
## [55] compiler_4.0.0           rlang_0.4.5              futile.logger_1.4.3     
## [58] grid_4.0.0               RCurl_1.98-1.2           spam_2.5-1              
## [61] bitops_1.0-6             rmarkdown_2.1            codetools_0.2-16        
## [64] gtable_0.3.0             DBI_1.1.0                R6_2.4.1                
## [67] GenomicAlignments_1.24.0 knitr_1.28               dplyr_0.8.5             
## [70] bit_1.1-15.2             futile.options_1.0.1     stringi_1.4.6           
## [73] Rcpp_1.0.4.6             png_0.1-7                vctrs_0.2.4             
## [76] geneplotter_1.66.0       tidyselect_1.0.0         xfun_0.13

References

Thornton, J., G.H.. Westfield, Y. Takahashi, M. Cook, X. Gao, Woodfin A.R., Lee J., et al. 2014. “Context dependency of Set1/ COMPASS-mediated histone H3 Lys4 trimethylation.” Genes & Development 28 (2):115–20. https://doi.org/10.1101/gad.232215.113.