library(methylSig)

1 Introduction

The purpose of this vignette is to show users how to retrofit their methylSig < 0.99.0 code to work with the refactor in version 0.99.0 and later.

2 Reading Data

2.1 Old methylSig

In versions < 0.99.0 of methylSig, the methylSigReadData() function read Bismark coverage files, Bismark genome-wide CpG reports, or MethylDackel bedGraphs. Additionally, users could destrand the data, filter by coverage, and filter SNPs.

meth = methylSigReadData(
    fileList = files,
    pData = pData,
    assembly = 'hg19',
    destranded = TRUE,
    maxCount = 500,
    minCount = 10,
    filterSNPs = TRUE,
    num.cores = 1,
    fileType = 'cytosineReport')

2.2 New methylSig

In versions >= 0.99.0 of methylSig, the user should read data with bsseq::read.bismark() and then apply functions that were once bundled within methylSigReadData().

files = c(
    system.file('extdata', 'bis_cov1.cov', package='methylSig'),
    system.file('extdata', 'bis_cov2.cov', package='methylSig')
)

bsseq_stranded = bsseq::read.bismark(
    files = files,
    colData = data.frame(row.names = c('test1','test2')),
    rmZeroCov = FALSE,
    strandCollapse = FALSE
)
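In bsseq::read.bismark(), colData takes one row per file, and its row names become the sample names of the resulting object. As a minimal sketch, the sample names can be derived from the file names in base R. The file paths and the Type column below are hypothetical and only illustrate the shape of the input.

```r
# Hypothetical file paths; substitute your own Bismark coverage files.
example_files = c('data/sampleA.cov', 'data/sampleB.cov')

# Derive sample names from the file names (drop directory and extension).
sample_names = tools::file_path_sans_ext(basename(example_files))

# One row per file, in the same order as example_files.
# The Type column is a made-up phenotype used only for illustration.
col_data = data.frame(
    Type = c('case', 'control'),
    row.names = sample_names
)
```

A data.frame built this way could then be passed as the colData argument.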

After reading data, filter by coverage. Note that from this point on we switch to a different dataset, one that works with the downstream functions used in the rest of this vignette.

# Load data for use in the rest of the vignette
data(BS.cancer.ex, package = 'bsseqData')
bs = BS.cancer.ex[1:10000]

bs = filter_loci_by_coverage(bs, min_count = 5, max_count = 500)
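To make the effect of the coverage bounds concrete, here is a toy illustration in base R on plain matrices. It is not the package's implementation. It assumes that out-of-range entries are zeroed per sample (rather than whole loci being dropped) and that the bounds are inclusive; consult ?filter_loci_by_coverage for the actual behavior.

```r
# Toy coverage and methylated-read matrices: rows are loci, columns are samples.
cov_mat = matrix(
    c(3, 20, 600, 12,    # sample s1
      8,  0,  40, 501),  # sample s2
    ncol = 2,
    dimnames = list(paste0('locus', 1:4), c('s1', 's2'))
)
meth_mat = matrix(
    c(1, 15, 300, 6,
      4,  0,  10, 250),
    ncol = 2,
    dimnames = dimnames(cov_mat)
)

min_count = 5
max_count = 500

# Flag locus/sample entries whose coverage falls outside [min_count, max_count].
outside = cov_mat < min_count | cov_mat > max_count

# Zero the flagged entries in both matrices (assumed behavior, see above).
cov_filtered = cov_mat
meth_filtered = meth_mat
cov_filtered[outside] = 0
meth_filtered[outside] = 0
```

Under these assumptions, an entry with too little or too much coverage contributes no reads to later steps, while the locus itself stays in the object.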

If the locations of C-to-T and G-to-A SNPs are known, or some other set of locations should be removed:

# Construct GRanges object
remove_gr = GenomicRanges::GRanges(
    seqnames = c('chr21', 'chr21', 'chr21'),
    ranges = IRanges::IRanges(
        start = c(9411552, 9411784, 9412099),
        end = c(9411552, 9411784, 9412099)
    )
)

bs = filter_loci_by_location(bs = bs, gr = remove_gr)
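Conceptually, removing loci by location is an anti-join on chromosome and position. The base-R sketch below shows the idea on a toy table of single-base loci. It is only an illustration (the package works on GRanges overlaps), and the two positions that are not in remove_gr above are made up.

```r
# Toy loci table; 9411410 and 9412200 are hypothetical positions.
loci = data.frame(
    chr = 'chr21',
    pos = c(9411410, 9411552, 9411784, 9412099, 9412200)
)

# Positions to remove, matching the GRanges constructed above.
to_remove = data.frame(
    chr = 'chr21',
    pos = c(9411552, 9411784, 9412099)
)

# Keep loci whose (chr, pos) key does not appear in the removal set.
keep = !(paste(loci$chr, loci$pos) %in% paste(to_remove$chr, to_remove$pos))
loci_kept = loci[keep, ]
```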

3 Tiling Data

3.1 Old methylSig

In versions < 0.99.0 of methylSig, the methylSigTile() function handled both kinds of tiling: aggregating CpG data over pre-defined tiles and aggregating over genomic windows.

# For genomic windows, tiles = NULL
windowed_meth = methylSigTile(meth, tiles = NULL, win.size = 10000)

# For pre-defined tiles, tiles should be a GRanges object.

3.2 New methylSig

In versions >= 0.99.0 of methylSig, tiling is separated into two functions, tile_by_regions() and tile_by_windows(). Users should choose one or the other.

windowed_bs = tile_by_windows(bs = bs, win_size = 10000)
# Collapsed promoters on chr21 and chr22
data(promoters_gr, package = 'methylSig')

promoters_bs = tile_by_regions(bs = bs, gr = promoters_gr)
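Tiling amounts to summing the per-CpG counts that fall in the same window or region. The toy base-R sketch below shows this for fixed-width windows on one chromosome. It illustrates the idea and is not the package's code; in particular, starting the first window at position 1 is an assumption.

```r
# Toy CpG positions with coverage and methylated-read counts for one sample.
pos = c(150, 9800, 10001, 19999, 25000)
cpg_cov = c(10, 12, 8, 20, 5)
cpg_meth = c(5, 6, 2, 10, 5)

win_size = 10000

# Assign each CpG to a window; windows are [1, 10000], [10001, 20000], ...
window_start = ((pos - 1) %/% win_size) * win_size + 1

# Sum counts within each window.
tile_cov = tapply(cpg_cov, window_start, sum)
tile_meth = tapply(cpg_meth, window_start, sum)

# Methylation level per window is then tile_meth / tile_cov.
tile_level = tile_meth / tile_cov
```

Aggregating over regions works the same way, except that the grouping key is the region a CpG overlaps rather than a fixed-width window.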

4 Testing

4.1 MethylSig Test

4.1.1 Old methylSig

In versions < 0.99.0 of methylSig, the methylSigCalc() function had a min.per.group parameter to determine how many samples per group had to have coverage in order for a locus to be tested.

result = methylSigCalc(
    meth = meth,
    comparison = 'DR_vs_DS',
    dispersion = 'both',
    local.info = FALSE,
    local.winsize = 200,
    min.per.group = c(3,3),
    weightFunc = methylSig_weightFunc,
    T.approx = TRUE,
    num.cores = 1)

4.1.2 New methylSig

In versions >= 0.99.0 of methylSig, the min.per.group functionality is performed by a separate function, filter_loci_by_group_coverage(). Also note the changes in how the dispersion calculation and the use of local information are specified.

# Look at the phenotype data for bs
bsseq::pData(bs)
#> DataFrame with 6 rows and 2 columns
#>           Type        Pair
#>    <character> <character>
#> C1      cancer       pair1
#> C2      cancer       pair2
#> C3      cancer       pair3
#> N1      normal       pair1
#> N2      normal       pair2
#> N3      normal       pair3

# Require at least two samples from cancer and two samples from normal
bs = filter_loci_by_group_coverage(
    bs = bs,
    group_column = 'Type',
    c('cancer' = 2, 'normal' = 2))
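The group-coverage rule can be illustrated on a toy coverage matrix in base R. This sketch shows the logic only, not the package's implementation, and it assumes that a sample counts as covered at a locus when its coverage is greater than zero.

```r
# Toy coverage matrix: rows are loci, columns are samples.
cov_mat = matrix(
    c(10, 12, 0, 9, 0, 11,   # locus1: 2 cancer and 2 normal samples covered
       0,  0, 7, 8, 9, 10,   # locus2: only 1 cancer sample covered
       5,  6, 7, 0, 0,  4),  # locus3: only 1 normal sample covered
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0('locus', 1:3), c('C1', 'C2', 'C3', 'N1', 'N2', 'N3'))
)
group = c('cancer', 'cancer', 'cancer', 'normal', 'normal', 'normal')
min_samples = c('cancer' = 2, 'normal' = 2)

# A sample is treated as covered at a locus when coverage > 0 (assumption).
covered = cov_mat > 0

# Count covered samples per group and require the minimum in every group.
keep = rowSums(covered[, group == 'cancer']) >= min_samples[['cancer']] &
    rowSums(covered[, group == 'normal']) >= min_samples[['normal']]
```

Here only locus1 passes, since it is the only locus with at least two covered samples in both groups.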

After removing loci with insufficient information, we can now use the diff_methylsig() test.

# Test cancer versus normal with dispersion from both groups
diff_gr = diff_methylsig(
    bs = bs,
    group_column = 'Type',
    comparison_groups = c('case' = 'cancer', 'control' = 'normal'),
    disp_groups = c('case' = TRUE, 'control' = TRUE),
    local_window_size = 0,
    t_approx = TRUE,
    n_cores = 1)
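Comparing the two calls, dispersion = 'both' corresponds to disp_groups with both entries TRUE, and local.info = FALSE corresponds to local_window_size = 0. The helper below is a hypothetical convenience for translating old-style arguments, not part of methylSig. It assumes the old dispersion argument was either 'both' or the name of one group, and that a non-zero local_window_size plays the role of local.winsize when local information is used.

```r
# Hypothetical helper: translate old methylSigCalc() arguments into the
# corresponding diff_methylsig() arguments. Not part of the package.
translate_old_args = function(dispersion, local.info, local.winsize,
                              case, control) {
    list(
        comparison_groups = c('case' = case, 'control' = control),
        # 'both' turns on both groups; a group name turns on only that group.
        disp_groups = c(
            'case' = dispersion %in% c('both', case),
            'control' = dispersion %in% c('both', control)
        ),
        # Assumed mapping: no local information means a window size of 0.
        local_window_size = if (local.info) local.winsize else 0
    )
}

new_args = translate_old_args(
    dispersion = 'both',
    local.info = FALSE,
    local.winsize = 200,
    case = 'cancer',
    control = 'normal'
)
```

The returned list mirrors the comparison_groups, disp_groups, and local_window_size values used in the call above.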

4.2 DSS Test

4.2.1 Old methylSig

In versions < 0.99.0 of methylSig, the methylSigDSS() function also had a min.per.group parameter to determine how many samples per group had to have coverage. Users also couldn’t specify which methylation groups to recover. The forms of design, formula, and contrast remain the same in versions >= 0.99.0.

contrast = matrix(c(0,1), ncol = 1)
result_dss = methylSigDSS(
    meth = meth,
    design = design1,
    formula = '~ group',
    contrast = contrast,
    group.term = 'group',
    min.per.group=c(3,3))

4.2.2 New methylSig

In versions >= 0.99.0 of methylSig, the single methylSigDSS() function is replaced by a fit function, diff_dss_fit(), and a test function, diff_dss_test(). As with diff_methylsig(), users should ensure enough samples have sufficient coverage with the filter_loci_by_group_coverage() function. The design and formula are unchanged in their forms.

If a continuous covariate is to be tested, filter_loci_by_group_coverage() should be skipped, as there are no groups. In prior versions of methylSigDSS(), this was not possible, and the group constraints were incorrectly applied prior to testing on a continuous covariate.

# IF NOT DONE PREVIOUSLY
# Require at least two samples from cancer and two samples from normal
bs = filter_loci_by_group_coverage(
    bs = bs,
    group_column = 'Type',
    c('cancer' = 2, 'normal' = 2))
# Test the simplest model with an intercept and Type
diff_fit_simple = diff_dss_fit(
    bs = bs,
    design = bsseq::pData(bs),
    formula = as.formula('~ Type'))
#> Warning in if ((!is.matrix(Y0) | !is.matrix(N0)) & (class(Y0) != "DelayedMatrix"
#> | : the condition has length > 1 and only the first element will be used
#> Fitting DML model for CpG site:

The contrast parameter is also changed in its form. Note the additional parameters to specify how to recover group methylation. methylation_group_column and methylation_groups should be specified for group versus group comparisons. For continuous covariates, methylation_group_column is sufficient, and the samples will be grouped into the top and bottom 25 percent based on the continuous covariate column named in methylation_group_column.

# Test the simplest model for cancer vs normal
# Note, the 2 rows of the contrast correspond to the 2 columns in diff_fit_simple$X
simple_contrast = matrix(c(0,1), ncol = 1)

diff_simple_gr = diff_dss_test(
    bs = bs,
    diff_fit = diff_fit_simple,
    contrast = simple_contrast,
    methylation_group_column = 'Type',
    methylation_groups = c('case' = 'cancer', 'control' = 'normal'))
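The requirement that the contrast has one row per column of the design matrix can be checked in base R before fitting. The sketch below rebuilds the phenotype table shown earlier and assumes the fit expands the formula in the same way stats::model.matrix() does; it is a sanity check, not a description of the package internals.

```r
# Phenotype table matching bsseq::pData(bs) as printed above.
pheno = data.frame(
    Type = c('cancer', 'cancer', 'cancer', 'normal', 'normal', 'normal'),
    Pair = c('pair1', 'pair2', 'pair3', 'pair1', 'pair2', 'pair3'),
    row.names = c('C1', 'C2', 'C3', 'N1', 'N2', 'N3')
)

# Expand the formula into a design matrix: an intercept plus one Type column.
X = model.matrix(~ Type, data = pheno)

# The contrast selects the Type coefficient (second column of X).
simple_contrast = matrix(c(0, 1), ncol = 1)

# One contrast row per design-matrix column.
stopifnot(nrow(simple_contrast) == ncol(X))
```

A model with more terms, such as '~ Type + Pair', produces more columns in the design matrix, so the contrast would need correspondingly more rows.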

5 Session Info

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] methylSig_1.0.0  BiocStyle_2.16.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.18.0 gtools_3.8.2               
#>  [3] locfit_1.5-9.4              xfun_0.13                  
#>  [5] beachmat_2.4.0              HDF5Array_1.16.0           
#>  [7] splines_4.0.0               lattice_0.20-41            
#>  [9] rhdf5_2.32.0                colorspace_1.4-1           
#> [11] htmltools_0.4.0             stats4_4.0.0               
#> [13] rtracklayer_1.48.0          yaml_2.2.1                 
#> [15] DSS_2.36.0                  XML_3.99-0.3               
#> [17] rlang_0.4.5                 R.oo_1.23.0                
#> [19] R.utils_2.9.2               BiocParallel_1.22.0        
#> [21] BiocGenerics_0.34.0         matrixStats_0.56.0         
#> [23] GenomeInfoDbData_1.2.3      lifecycle_0.2.0            
#> [25] stringr_1.4.0               zlibbioc_1.34.0            
#> [27] Biostrings_2.56.0           munsell_0.5.0              
#> [29] R.methodsS3_1.8.0           evaluate_0.14              
#> [31] Biobase_2.48.0              knitr_1.28                 
#> [33] permute_0.9-5               IRanges_2.22.0             
#> [35] GenomeInfoDb_1.24.0         parallel_4.0.0             
#> [37] Rcpp_1.0.4.6                scales_1.1.0               
#> [39] BSgenome_1.56.0             BiocManager_1.30.10        
#> [41] limma_3.44.0                DelayedArray_0.14.0        
#> [43] S4Vectors_0.26.0            bsseq_1.24.0               
#> [45] XVector_0.28.0              Rsamtools_2.4.0            
#> [47] digest_0.6.25               stringi_1.4.6              
#> [49] bookdown_0.18               GenomicRanges_1.40.0       
#> [51] grid_4.0.0                  tools_4.0.0                
#> [53] bitops_1.0-6                magrittr_1.5               
#> [55] RCurl_1.98-1.2              crayon_1.3.4               
#> [57] Matrix_1.2-18               data.table_1.12.8          
#> [59] DelayedMatrixStats_1.10.0   rmarkdown_2.1              
#> [61] Rhdf5lib_1.10.0             R6_2.4.1                   
#> [63] GenomicAlignments_1.24.0    compiler_4.0.0