Introduction

Methrix provides set of function which allows easy importing of various flavors of bedgraphs generated by methylation callers, and many downstream analysis to be performed on large matrices.

This vignette describes basic usage of the package intended to process several large bedgraph files in R. In addition, a detailed exemplary complete data analysis with steps from reading in to annotation and differential methylation calling can be found in our WGBS best practices workflow

Overview and usage functions of the package

Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

#Installing stable version from BioConductor
BiocManager::install("methrix")

#Installing developmental version from GitHub
BiocManager::install("CompEpigen/methrix")

NOTE

Installation from BioConductor requires the BioC and R versions to be the newest. This arises from the restrictions imposed by BioConductor community which might cause package incompatibilities with the earlier versions of R (for e.g; R < 4.0). In that case installing from GitHub might be easier since it is much more merciful with regards to versions.

Reading bedgraph files

read_bedgraphs function is a versatile bedgraph reader intended to import bedgraph files generated virtually by any sort of methylation calling program. It requires user to provide indices for chromosome names, start position and other required fields. There are also presets available to import bedgraphs from most common programs such as Bismark, MethylDackel, and MethylcTools.

#Load library
library(methrix)
#Genome of your preference to work with
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

library(BiocManager)

if(!requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
}
library(BSgenome.Hsapiens.UCSC.hg19) 
#Example bedgraph files
bdg_files <- list.files(
  path = system.file('extdata', package = 'methrix'),
  pattern = "*bedGraph\\.gz$",
  full.names = TRUE
)

print(basename(bdg_files))
#> [1] "C1.bedGraph.gz" "C2.bedGraph.gz" "N1.bedGraph.gz" "N2.bedGraph.gz"

#Generate some sample annotation table
sample_anno <- data.frame(
  row.names = gsub(
    pattern = "\\.bedGraph\\.gz$",
    replacement = "",
    x = basename(bdg_files)
  ),
  Condition = c("cancer", 'cancer', "normal", "normal"),
  Pair = c("pair1", "pair2", "pair1", "pair2"),
  stringsAsFactors = FALSE
)

print(sample_anno)
#>    Condition  Pair
#> C1    cancer pair1
#> C2    cancer pair2
#> N1    normal pair1
#> N2    normal pair2

We can import bedgraph files with the function read_bedgraphs which reads in the bedgraphs, adds CpGs missing from the reference set, and creates a methylation/coverage matrices. Once the process is complete - it returns an object of class methrix which in turn inherits SummarizedExperiment class. methrix object contains ‘methylation’ and ‘coverage’ matrices (either in-memory or as on-disk HDF5 arrays) along with pheno-data and other basic info. This object can be passed to all downstream functions for various analysis.

#First extract genome wide CpGs from the desired reference genome
hg19_cpgs <- suppressWarnings(methrix::extract_CPGs(ref_genome = "BSgenome.Hsapiens.UCSC.hg19"))
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> 
#> Attaching package: 'rtracklayer'
#> The following object is masked from 'package:BiocIO':
#> 
#>     FileForFormat
#> -Extracting CpGs
#> -Done. Extracted 28,217,448 CpGs from 25 contigs.
#Read the files 
meth <- methrix::read_bedgraphs(
  files = bdg_files,
  ref_cpgs = hg19_cpgs,
  chr_idx = 1,
  start_idx = 2,
  M_idx = 3,
  U_idx = 4,
  stranded = FALSE,
  zero_based = FALSE, 
  collapse_strands = FALSE, 
  coldata = sample_anno
)
#> ----------------------------
#> -Preset:        Custom
#> --Missing beta and coverage info. Estimating them from M and U values
#> -CpGs raw:      28,217,448 (total reference CpGs)
#> -CpGs retained: 28,217,448(reference CpGs from contigs of interest)
#> ----------------------------
#> -Processing:    C1.bedGraph.gz
#> --CpGs missing: 28,216,771 (from known reference CpGs)
#> -Processing:    C2.bedGraph.gz
#> --CpGs missing: 28,216,759 (from known reference CpGs)
#> -Processing:    N1.bedGraph.gz
#> --CpGs missing: 28,216,746 (from known reference CpGs)
#> -Processing:    N2.bedGraph.gz
#> --CpGs missing: 28,216,747 (from known reference CpGs)
#> -Finished in:  00:01:18 elapsed (55.6s cpu)

Note: Use the argument pipeline if your bedgraphs are generated with “Bismark”, “MethylDeckal”, or “MethylcTools”. This will automatically figure out the file formats for you, and you dont have to use the arguments chr_idx start_idx and so..

#Typing meth shows basic summary.
meth
#> An object of class methrix
#>    n_CpGs: 28,217,448
#> n_samples: 4
#>     is_h5: FALSE
#> Reference: hg19

HTML QC report

Get basic summary statistics of the methrix object with methrix_report function which produces an interactive html report

methrix::methrix_report(meth = meth, output_dir = tempdir())

Click here for an example report.

Filtering

Remove uncovered loci

Usual task in analysis involves removing uncovered CpGs. i.e, those loci which are not covered across all sample (in other words covered only in subset of samples resulting NA for rest of the samples ).

meth = methrix::remove_uncovered(m = meth)
#> -Removed 28,216,705 [100%] uncovered loci of 28,217,448 sites
#> -Finished in:  2.909s elapsed (1.914s cpu)
meth
#> An object of class methrix
#>    n_CpGs: 743
#> n_samples: 4
#>     is_h5: FALSE
#> Reference: hg19

Remove SNPs

One can also remove CpG sites overlaping with common SNPs based on minor allele frequencies.

if(!require(MafDb.1Kgenomes.phase3.hs37d5)) {
BiocManager::install("MafDb.1Kgenomes.phase3.hs37d5")} 
if(!require(GenomicScores)) {
BiocManager::install("GenomicScores")} 
library(MafDb.1Kgenomes.phase3.hs37d5)
#> Loading required package: GenomicScores
#> 
#> Attaching package: 'GenomicScores'
#> The following object is masked from 'package:utils':
#> 
#>     citation
#> Warning: replacing previous import 'utils::findMatches' by
#> 'S4Vectors::findMatches' when loading 'MafDb.1Kgenomes.phase3.hs37d5'
library(GenomicScores)

meth_snps_filtered <- methrix::remove_snps(m = meth)
#> Used SNP database: MafDb.1Kgenomes.phase3.hs37d5.
#> Number of SNPs removed:
#>      chr  N
#> 1: chr21 42
#> 2: chr22 39
#> Sum:
#> [1] 81
#> -Finished in:  4.499s elapsed (3.244s cpu)

Basic operations

Extract methylation/coverage matrices

#Example data bundled, same as the previously generated meth 
data("methrix_data")

#Coverage matrix
coverage_mat <- methrix::get_matrix(m = methrix_data, type = "C")
head(coverage_mat)
#>      C1 C2 N1 N2
#> [1,] 13  7  9 10
#> [2,] NA  2  3 NA
#> [3,]  9 10  3  5
#> [4,] 11  8 12  8
#> [5,]  6  7 17  8
#> [6,] 13  6  6 14
#Methylation matrix
meth_mat <- methrix::get_matrix(m = methrix_data, type = "M")
head(meth_mat)
#>             C1        C2        N1        N2
#> [1,] 0.1538462 0.2857143 0.5555556 0.3000000
#> [2,]        NA 0.5000000 0.0000000        NA
#> [3,] 0.5555556 0.7000000 0.3333333 0.8000000
#> [4,] 0.1818182 0.2500000 0.5833333 0.2500000
#> [5,] 0.6666667 1.0000000 0.8823529 0.8750000
#> [6,] 0.8461538 1.0000000 0.8333333 0.9285714
#If you prefer you can attach loci info to the matrix and output in GRanges format
meth_mat_with_loci <- methrix::get_matrix(m = methrix_data, type = "M", add_loci = TRUE, in_granges = TRUE)
meth_mat_with_loci
#> GRanges object with 743 ranges and 4 metadata columns:
#>         seqnames            ranges strand |        C1        C2        N1
#>            <Rle>         <IRanges>  <Rle> | <numeric> <numeric> <numeric>
#>     [1]    chr21 27866423-27866424      * |  0.153846  0.285714  0.555556
#>     [2]    chr21 27866575-27866576      * |        NA  0.500000  0.000000
#>     [3]    chr21 27866921-27866922      * |  0.555556  0.700000  0.333333
#>     [4]    chr21 27867197-27867198      * |  0.181818  0.250000  0.583333
#>     [5]    chr21 27867248-27867249      * |  0.666667  1.000000  0.882353
#>     ...      ...               ...    ... .       ...       ...       ...
#>   [739]    chr22 49007313-49007314      * |  1.000000  0.714286  0.857143
#>   [740]    chr22 49007329-49007330      * |  1.000000  0.428571  1.000000
#>   [741]    chr22 49007347-49007348      * |  0.666667  0.166667  0.875000
#>   [742]    chr22 49007375-49007376      * |  0.333333  0.125000  1.000000
#>   [743]    chr22 49007398-49007399      * |  1.000000  0.600000  1.000000
#>                N2
#>         <numeric>
#>     [1]     0.300
#>     [2]        NA
#>     [3]     0.800
#>     [4]     0.250
#>     [5]     0.875
#>     ...       ...
#>   [739]       1.0
#>   [740]       1.0
#>   [741]       1.0
#>   [742]       0.6
#>   [743]       1.0
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Coverage filter

Furthermore if you prefer you can filter sites based on coverage conditions.

#e.g; Retain all loci which are covered at-least in two sample by 3 or more reads
methrix::coverage_filter(m = methrix_data, cov_thr = 3, min_samples = 2)
#> -Retained 600 of 743 sites
#> -Finished in:  1.234s elapsed (1.043s cpu)
#> An object of class methrix
#>    n_CpGs: 600
#> n_samples: 4
#>     is_h5: FALSE
#> Reference: hg19

Subset operations

Subset operations in methrix make use of data.tables fast binary search which is several orders faster than bsseq or other similar packages.

Subset by chromosome

#Retain sites only from chromosme chr21
methrix::subset_methrix(m = methrix_data, contigs = "chr21")
#> -Subsetting by contigs
#> An object of class methrix
#>    n_CpGs: 540
#> n_samples: 4
#>     is_h5: FALSE
#> Reference: hg19

Subset by genomic regions

Regions can be data.table or GRanges format.

#e.g; Retain sites only in TP53 loci 
target_loci <- GenomicRanges::GRanges("chr21:27867971-27868103")

print(target_loci)
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames            ranges strand
#>          <Rle>         <IRanges>  <Rle>
#>   [1]    chr21 27867971-27868103      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

methrix::subset_methrix(m = methrix_data, regions = target_loci)
#> -Subsetting by genomic regions
#> An object of class methrix
#>    n_CpGs: 4
#> n_samples: 4
#>     is_h5: FALSE
#> Reference: hg19

Subset by samples

methrix::subset_methrix(m = methrix_data, samples = "C1")
#> Subsetting by samples
#> An object of class methrix
#>    n_CpGs: 743
#> n_samples: 1
#>     is_h5: FALSE
#> Reference: hg19

#Or you could use [] operator to subset by index
methrix_data[,1]
#> An object of class methrix
#>    n_CpGs: 743
#> n_samples: 1
#>     is_h5: FALSE
#> Reference: hg19

Summary statsitcis

Basic summaries

meth_stats <- get_stats(m = methrix_data)
#> -Finished in:  1.205s elapsed (1.028s cpu)
print(meth_stats)
#>    Chromosome Sample_Name mean_meth median_meth   sd_meth mean_cov median_cov
#> 1:      chr21          C1  0.560004   0.6515152 0.4004019 4.745968          4
#> 2:      chr21          C2 0.4934995         0.5 0.3896208 5.047525          4
#> 3:      chr21          N1 0.5245418      0.6125 0.4205225 4.978682          4
#> 4:      chr21          N2 0.5333442   0.6666667 0.4227578 5.055662          5
#> 5:      chr22          C1 0.7392421   0.8571429 0.3115366 4.657459          4
#> 6:      chr22          C2 0.5778096   0.6515152 0.3678762      5.5          5
#> 7:      chr22          N1 0.8445557           1 0.2273115 5.505376          5
#> 8:      chr22          N2 0.8520698           1  0.220973 5.866667          5
#>      sd_cov
#> 1: 2.990217
#> 2: 3.294072
#> 3: 3.214882
#> 4: 3.148072
#> 5: 2.813423
#> 6: 3.046094
#> 7: 3.270255
#> 8: 3.166515
#Draw mean coverage per sample
plot_stats(plot_dat = meth_stats, what = "C", stat = "mean")

#Draw mean methylation per sample
plot_stats(plot_dat = meth_stats, what = "M", stat = "mean")

PCA

mpca <- methrix_pca(m = methrix_data, do_plot = FALSE)

#Plot PCA results
plot_pca(pca_res = mpca, show_labels = TRUE)


#Color code by an annotation
plot_pca(pca_res = mpca, m = methrix_data, col_anno = "Condition")

Plotting

Methylation

#Violin plots
methrix::plot_violin(m = methrix_data)
#> Randomly selecting 25000 sites
#> Warning: Removed 203 rows containing non-finite values (`stat_ydensity()`).

Coverage

methrix::plot_coverage(m = methrix_data, type = "dens")

Converting methrix to BSseq

If you prefer to work with bsseq object, you can generate bsseq object from methrix with the methrix2bsseq.

if(!require(bsseq)) {
BiocManager::install("bsseq")}
library(bsseq)
bs_seq <- methrix::methrix2bsseq(m = methrix_data)

bs_seq
#> An object of type 'BSseq' with
#>   743 methylation loci
#>   4 samples
#> has not been smoothed
#> All assays are in-memory

SessionInfo

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] bsseq_1.38.0                         MafDb.1Kgenomes.phase3.hs37d5_3.10.0
#>  [3] GenomicScores_2.14.0                 BSgenome.Hsapiens.UCSC.hg19_1.4.3   
#>  [5] BSgenome_1.70.0                      rtracklayer_1.62.0                  
#>  [7] BiocIO_1.12.0                        Biostrings_2.70.0                   
#>  [9] XVector_0.42.0                       methrix_1.16.0                      
#> [11] SummarizedExperiment_1.32.0          Biobase_2.62.0                      
#> [13] GenomicRanges_1.54.0                 GenomeInfoDb_1.38.0                 
#> [15] IRanges_2.36.0                       S4Vectors_0.40.0                    
#> [17] BiocGenerics_0.48.0                  MatrixGenerics_1.14.0               
#> [19] matrixStats_1.0.0                    data.table_1.14.8                   
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.1.3                     bitops_1.0-7                 
#>   [3] permute_0.9-7                 rlang_1.1.1                  
#>   [5] magrittr_2.0.3                compiler_4.3.1               
#>   [7] RSQLite_2.3.1                 DelayedMatrixStats_1.24.0    
#>   [9] png_0.1-8                     vctrs_0.6.4                  
#>  [11] pkgconfig_2.0.3               crayon_1.5.2                 
#>  [13] fastmap_1.1.1                 dbplyr_2.3.4                 
#>  [15] ellipsis_0.3.2                labeling_0.4.3               
#>  [17] utf8_1.2.4                    Rsamtools_2.18.0             
#>  [19] promises_1.2.1                rmarkdown_2.25               
#>  [21] bit_4.0.5                     xfun_0.40                    
#>  [23] zlibbioc_1.48.0               cachem_1.0.8                 
#>  [25] jsonlite_1.8.7                blob_1.2.4                   
#>  [27] later_1.3.1                   rhdf5filters_1.14.0          
#>  [29] DelayedArray_0.28.0           Rhdf5lib_1.24.0              
#>  [31] BiocParallel_1.36.0           interactiveDisplayBase_1.40.0
#>  [33] parallel_4.3.1                R6_2.5.1                     
#>  [35] RColorBrewer_1.1-3            bslib_0.5.1                  
#>  [37] limma_3.58.0                  jquerylib_0.1.4              
#>  [39] Rcpp_1.0.11                   knitr_1.44                   
#>  [41] R.utils_2.12.2                httpuv_1.6.12                
#>  [43] Matrix_1.6-1.1                tidyselect_1.2.0             
#>  [45] abind_1.4-5                   yaml_2.3.7                   
#>  [47] codetools_0.2-19              curl_5.1.0                   
#>  [49] lattice_0.22-5                tibble_3.2.1                 
#>  [51] withr_2.5.1                   shiny_1.7.5.1                
#>  [53] KEGGREST_1.42.0               evaluate_0.22                
#>  [55] BiocFileCache_2.10.0          pillar_1.9.0                 
#>  [57] BiocManager_1.30.22           filelock_1.0.2               
#>  [59] generics_0.1.3                RCurl_1.98-1.12              
#>  [61] BiocVersion_3.18.0            ggplot2_3.4.4                
#>  [63] sparseMatrixStats_1.14.0      munsell_0.5.0                
#>  [65] scales_1.2.1                  gtools_3.9.4                 
#>  [67] xtable_1.8-4                  glue_1.6.2                   
#>  [69] tools_4.3.1                   AnnotationHub_3.10.0         
#>  [71] locfit_1.5-9.8                GenomicAlignments_1.38.0     
#>  [73] XML_3.99-0.14                 rhdf5_2.46.0                 
#>  [75] grid_4.3.1                    AnnotationDbi_1.64.0         
#>  [77] colorspace_2.1-0              GenomeInfoDbData_1.2.11      
#>  [79] HDF5Array_1.30.0              restfulr_0.0.15              
#>  [81] cli_3.6.1                     rappdirs_0.3.3               
#>  [83] fansi_1.0.5                   S4Arrays_1.2.0               
#>  [85] dplyr_1.1.3                   gtable_0.3.4                 
#>  [87] R.methodsS3_1.8.2             sass_0.4.7                   
#>  [89] digest_0.6.33                 SparseArray_1.2.0            
#>  [91] farver_2.1.1                  rjson_0.2.21                 
#>  [93] memoise_2.0.1                 htmltools_0.5.6.1            
#>  [95] R.oo_1.25.0                   lifecycle_1.0.3              
#>  [97] httr_1.4.7                    statmod_1.5.0                
#>  [99] mime_0.12                     bit64_4.0.5