Methrix
provides set of function which allows easy
importing of various flavors of bedgraphs generated by methylation
callers, and many downstream analysis to be performed on large
matrices.
This vignette describes basic usage of the package intended to process several large bedgraph files in R. In addition, a detailed exemplary complete data analysis with steps from reading in to annotation and differential methylation calling can be found in our WGBS best practices workflow
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
#Installing stable version from BioConductor
::install("methrix")
BiocManager
#Installing developmental version from GitHub
::install("CompEpigen/methrix") BiocManager
NOTE
Installation from BioConductor requires the BioC and R versions to be the newest. This arises from the restrictions imposed by BioConductor community which might cause package incompatibilities with the earlier versions of R (for e.g; R < 4.0). In that case installing from GitHub might be easier since it is much more merciful with regards to versions.
read_bedgraphs
function is a versatile bedgraph reader
intended to import bedgraph files generated virtually by any sort of
methylation calling program. It requires user to provide indices for
chromosome names, start position and other required fields. There are
also presets available to import bedgraphs
from most common
programs such as Bismark
, MethylDackel
, and
MethylcTools
.
#Load library
library(methrix)
#Genome of your preference to work with
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
library(BiocManager)
if(!requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
::install("BSgenome.Hsapiens.UCSC.hg19")
BiocManager
}library(BSgenome.Hsapiens.UCSC.hg19)
#Example bedgraph files
<- list.files(
bdg_files path = system.file('extdata', package = 'methrix'),
pattern = "*bedGraph\\.gz$",
full.names = TRUE
)
print(basename(bdg_files))
#> [1] "C1.bedGraph.gz" "C2.bedGraph.gz" "N1.bedGraph.gz" "N2.bedGraph.gz"
#Generate some sample annotation table
<- data.frame(
sample_anno row.names = gsub(
pattern = "\\.bedGraph\\.gz$",
replacement = "",
x = basename(bdg_files)
),Condition = c("cancer", 'cancer', "normal", "normal"),
Pair = c("pair1", "pair2", "pair1", "pair2"),
stringsAsFactors = FALSE
)
print(sample_anno)
#> Condition Pair
#> C1 cancer pair1
#> C2 cancer pair2
#> N1 normal pair1
#> N2 normal pair2
We can import bedgraph files with the function
read_bedgraphs
which reads in the bedgraphs, adds CpGs
missing from the reference set, and creates a methylation/coverage
matrices. Once the process is complete - it returns an object of class
methrix
which in turn inherits SummarizedExperiment
class. methrix
object contains ‘methylation’ and ‘coverage’
matrices (either in-memory or as on-disk HDF5 arrays) along with
pheno-data and other basic info. This object can be passed to all
downstream functions for various analysis.
#First extract genome wide CpGs from the desired reference genome
<- suppressWarnings(methrix::extract_CPGs(ref_genome = "BSgenome.Hsapiens.UCSC.hg19"))
hg19_cpgs #>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
#> -Extracting CpGs
#> -Done. Extracted 28,217,448 CpGs from 25 contigs.
#Read the files
<- methrix::read_bedgraphs(
meth files = bdg_files,
ref_cpgs = hg19_cpgs,
chr_idx = 1,
start_idx = 2,
M_idx = 3,
U_idx = 4,
stranded = FALSE,
zero_based = FALSE,
collapse_strands = FALSE,
coldata = sample_anno
)#> ----------------------------
#> -Preset: Custom
#> --Missing beta and coverage info. Estimating them from M and U values
#> -CpGs raw: 28,217,448 (total reference CpGs)
#> -CpGs retained: 28,217,448(reference CpGs from contigs of interest)
#> ----------------------------
#> -Processing: C1.bedGraph.gz
#> --CpGs missing: 28,216,771 (from known reference CpGs)
#> -Processing: C2.bedGraph.gz
#> --CpGs missing: 28,216,759 (from known reference CpGs)
#> -Processing: N1.bedGraph.gz
#> --CpGs missing: 28,216,746 (from known reference CpGs)
#> -Processing: N2.bedGraph.gz
#> --CpGs missing: 28,216,747 (from known reference CpGs)
#> -Finished in: 00:03:05 elapsed (58.0s cpu)
Note: Use the argument pipeline
if your bedgraphs are
generated with “Bismark”, “MethylDeckal”, or “MethylcTools”. This will
automatically figure out the file formats for you, and you dont have to
use the arguments chr_idx
start_idx
and
so..
#Typing meth shows basic summary.
meth#> An object of class methrix
#> n_CpGs: 28,217,448
#> n_samples: 4
#> is_h5: FALSE
#> Reference: hg19
Get basic summary statistics of the methrix
object with
methrix_report
function which produces an interactive html
report
::methrix_report(meth = meth, output_dir = tempdir()) methrix
Click here for an example report.
Usual task in analysis involves removing uncovered CpGs. i.e, those
loci which are not covered across all sample (in other words covered
only in subset of samples resulting NA
for rest of the
samples ).
= methrix::remove_uncovered(m = meth)
meth #> -Removed 28,216,705 [100%] uncovered loci of 28,217,448 sites
#> -Finished in: 2.586s elapsed (1.543s cpu)
meth#> An object of class methrix
#> n_CpGs: 743
#> n_samples: 4
#> is_h5: FALSE
#> Reference: hg19
One can also remove CpG sites overlaping with common SNPs based on minor allele frequencies.
if(!require(MafDb.1Kgenomes.phase3.hs37d5)) {
::install("MafDb.1Kgenomes.phase3.hs37d5")}
BiocManagerif(!require(GenomicScores)) {
::install("GenomicScores")} BiocManager
library(MafDb.1Kgenomes.phase3.hs37d5)
#> Loading required package: GenomicScores
#>
#> Attaching package: 'GenomicScores'
#> The following object is masked from 'package:utils':
#>
#> citation
library(GenomicScores)
<- methrix::remove_snps(m = meth)
meth_snps_filtered #> Used SNP database: MafDb.1Kgenomes.phase3.hs37d5.
#> Number of SNPs removed:
#> chr N
#> 1: chr21 42
#> 2: chr22 39
#> Sum:
#> [1] 81
#> -Finished in: 5.162s elapsed (3.186s cpu)
#Example data bundled, same as the previously generated meth
data("methrix_data")
#Coverage matrix
<- methrix::get_matrix(m = methrix_data, type = "C")
coverage_mat head(coverage_mat)
#> C1 C2 N1 N2
#> [1,] 13 7 9 10
#> [2,] NA 2 3 NA
#> [3,] 9 10 3 5
#> [4,] 11 8 12 8
#> [5,] 6 7 17 8
#> [6,] 13 6 6 14
#Methylation matrix
<- methrix::get_matrix(m = methrix_data, type = "M")
meth_mat head(meth_mat)
#> C1 C2 N1 N2
#> [1,] 0.1538462 0.2857143 0.5555556 0.3000000
#> [2,] NA 0.5000000 0.0000000 NA
#> [3,] 0.5555556 0.7000000 0.3333333 0.8000000
#> [4,] 0.1818182 0.2500000 0.5833333 0.2500000
#> [5,] 0.6666667 1.0000000 0.8823529 0.8750000
#> [6,] 0.8461538 1.0000000 0.8333333 0.9285714
#If you prefer you can attach loci info to the matrix and output in GRanges format
<- methrix::get_matrix(m = methrix_data, type = "M", add_loci = TRUE, in_granges = TRUE)
meth_mat_with_loci
meth_mat_with_loci#> GRanges object with 743 ranges and 4 metadata columns:
#> seqnames ranges strand | C1 C2 N1
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
#> [1] chr21 27866423-27866424 * | 0.153846 0.285714 0.555556
#> [2] chr21 27866575-27866576 * | NA 0.500000 0.000000
#> [3] chr21 27866921-27866922 * | 0.555556 0.700000 0.333333
#> [4] chr21 27867197-27867198 * | 0.181818 0.250000 0.583333
#> [5] chr21 27867248-27867249 * | 0.666667 1.000000 0.882353
#> ... ... ... ... . ... ... ...
#> [739] chr22 49007313-49007314 * | 1.000000 0.714286 0.857143
#> [740] chr22 49007329-49007330 * | 1.000000 0.428571 1.000000
#> [741] chr22 49007347-49007348 * | 0.666667 0.166667 0.875000
#> [742] chr22 49007375-49007376 * | 0.333333 0.125000 1.000000
#> [743] chr22 49007398-49007399 * | 1.000000 0.600000 1.000000
#> N2
#> <numeric>
#> [1] 0.300
#> [2] NA
#> [3] 0.800
#> [4] 0.250
#> [5] 0.875
#> ... ...
#> [739] 1.0
#> [740] 1.0
#> [741] 1.0
#> [742] 0.6
#> [743] 1.0
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
Furthermore if you prefer you can filter sites based on coverage conditions.
#e.g; Retain all loci which are covered at-least in two sample by 3 or more reads
::coverage_filter(m = methrix_data, cov_thr = 3, min_samples = 2)
methrix#> -Retained 600 of 743 sites
#> -Finished in: 1.051s elapsed (0.820s cpu)
#> An object of class methrix
#> n_CpGs: 600
#> n_samples: 4
#> is_h5: FALSE
#> Reference: hg19
Subset operations in methrix
make use of
data.table
s fast
binary search which is several orders faster than bsseq
or other similar packages.
#Retain sites only from chromosme chr21
::subset_methrix(m = methrix_data, contigs = "chr21")
methrix#> -Subsetting by contigs
#> An object of class methrix
#> n_CpGs: 540
#> n_samples: 4
#> is_h5: FALSE
#> Reference: hg19
Regions can be data.table or GRanges format.
#e.g; Retain sites only in TP53 loci
<- GenomicRanges::GRanges("chr21:27867971-27868103")
target_loci
print(target_loci)
#> GRanges object with 1 range and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr21 27867971-27868103 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
::subset_methrix(m = methrix_data, regions = target_loci)
methrix#> -Subsetting by genomic regions
#> An object of class methrix
#> n_CpGs: 4
#> n_samples: 4
#> is_h5: FALSE
#> Reference: hg19
::subset_methrix(m = methrix_data, samples = "C1")
methrix#> Subsetting by samples
#> An object of class methrix
#> n_CpGs: 743
#> n_samples: 1
#> is_h5: FALSE
#> Reference: hg19
#Or you could use [] operator to subset by index
1]
methrix_data[,#> An object of class methrix
#> n_CpGs: 743
#> n_samples: 1
#> is_h5: FALSE
#> Reference: hg19
<- get_stats(m = methrix_data)
meth_stats #> -Finished in: 1.302s elapsed (1.013s cpu)
print(meth_stats)
#> Chromosome Sample_Name mean_meth median_meth sd_meth mean_cov median_cov
#> 1: chr21 C1 0.560004 0.6515152 0.4004019 4.745968 4
#> 2: chr21 C2 0.4934995 0.5 0.3896208 5.047525 4
#> 3: chr21 N1 0.5245418 0.6125 0.4205225 4.978682 4
#> 4: chr21 N2 0.5333442 0.6666667 0.4227578 5.055662 5
#> 5: chr22 C1 0.7392421 0.8571429 0.3115366 4.657459 4
#> 6: chr22 C2 0.5778096 0.6515152 0.3678762 5.5 5
#> 7: chr22 N1 0.8445557 1 0.2273115 5.505376 5
#> 8: chr22 N2 0.8520698 1 0.220973 5.866667 5
#> sd_cov
#> 1: 2.990217
#> 2: 3.294072
#> 3: 3.214882
#> 4: 3.148072
#> 5: 2.813423
#> 6: 3.046094
#> 7: 3.270255
#> 8: 3.166515
#Draw mean coverage per sample
plot_stats(plot_dat = meth_stats, what = "C", stat = "mean")
#Draw mean methylation per sample
plot_stats(plot_dat = meth_stats, what = "M", stat = "mean")
<- methrix_pca(m = methrix_data, do_plot = FALSE)
mpca
#Plot PCA results
plot_pca(pca_res = mpca, show_labels = TRUE)
#Color code by an annotation
plot_pca(pca_res = mpca, m = methrix_data, col_anno = "Condition")
#Violin plots
::plot_violin(m = methrix_data)
methrix#> Randomly selecting 25000 sites
#> Warning: Removed 203 rows containing non-finite values (stat_ydensity).
::plot_coverage(m = methrix_data, type = "dens") methrix
If you prefer to work with bsseq
object, you can generate bsseq
object from methrix with the
methrix2bsseq
.
if(!require(bsseq)) {
::install("bsseq")} BiocManager
library(bsseq)
<- methrix::methrix2bsseq(m = methrix_data)
bs_seq
bs_seq#> An object of type 'BSseq' with
#> 743 methylation loci
#> 4 samples
#> has not been smoothed
#> All assays are in-memory
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.0
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] bsseq_1.34.0 MafDb.1Kgenomes.phase3.hs37d5_3.10.0
#> [3] GenomicScores_2.10.0 BSgenome.Hsapiens.UCSC.hg19_1.4.3
#> [5] BSgenome_1.66.1 rtracklayer_1.58.0
#> [7] Biostrings_2.66.0 XVector_0.38.0
#> [9] methrix_1.12.0 SummarizedExperiment_1.28.0
#> [11] Biobase_2.58.0 GenomicRanges_1.50.1
#> [13] GenomeInfoDb_1.34.2 IRanges_2.32.0
#> [15] S4Vectors_0.36.0 BiocGenerics_0.44.0
#> [17] MatrixGenerics_1.10.0 matrixStats_0.62.0
#> [19] data.table_1.14.2
#>
#> loaded via a namespace (and not attached):
#> [1] colorspace_2.0-3 rjson_0.2.21
#> [3] ellipsis_0.3.2 farver_2.1.1
#> [5] bit64_4.0.5 interactiveDisplayBase_1.36.0
#> [7] AnnotationDbi_1.60.0 fansi_1.0.3
#> [9] codetools_0.2-18 R.methodsS3_1.8.2
#> [11] sparseMatrixStats_1.10.0 cachem_1.0.6
#> [13] knitr_1.39 jsonlite_1.8.0
#> [15] Rsamtools_2.14.0 dbplyr_2.2.1
#> [17] png_0.1-7 R.oo_1.25.0
#> [19] shiny_1.7.1 HDF5Array_1.26.0
#> [21] BiocManager_1.30.18 compiler_4.2.1
#> [23] httr_1.4.3 assertthat_0.2.1
#> [25] Matrix_1.4-1 fastmap_1.1.0
#> [27] limma_3.54.0 cli_3.3.0
#> [29] later_1.3.0 htmltools_0.5.2
#> [31] tools_4.2.1 gtable_0.3.0
#> [33] glue_1.6.2 GenomeInfoDbData_1.2.8
#> [35] dplyr_1.0.9 rappdirs_0.3.3
#> [37] Rcpp_1.0.9 jquerylib_0.1.4
#> [39] vctrs_0.4.1 rhdf5filters_1.10.0
#> [41] DelayedMatrixStats_1.19.0 xfun_0.31
#> [43] stringr_1.4.0 mime_0.12
#> [45] lifecycle_1.0.1 restfulr_0.0.15
#> [47] gtools_3.9.3 XML_3.99-0.10
#> [49] AnnotationHub_3.6.0 zlibbioc_1.44.0
#> [51] scales_1.2.0 promises_1.2.0.1
#> [53] parallel_4.2.1 rhdf5_2.42.0
#> [55] RColorBrewer_1.1-3 yaml_2.3.5
#> [57] curl_4.3.2 memoise_2.0.1
#> [59] ggplot2_3.3.6 sass_0.4.1
#> [61] stringi_1.7.8 RSQLite_2.2.14
#> [63] BiocVersion_3.16.0 highr_0.9
#> [65] BiocIO_1.8.0 permute_0.9-7
#> [67] filelock_1.0.2 BiocParallel_1.32.1
#> [69] rlang_1.0.4 pkgconfig_2.0.3
#> [71] bitops_1.0-7 evaluate_0.15
#> [73] lattice_0.20-45 purrr_0.3.4
#> [75] Rhdf5lib_1.20.0 GenomicAlignments_1.34.0
#> [77] labeling_0.4.2 bit_4.0.4
#> [79] tidyselect_1.1.2 magrittr_2.0.3
#> [81] R6_2.5.1 generics_0.1.3
#> [83] DelayedArray_0.24.0 DBI_1.1.3
#> [85] pillar_1.7.0 KEGGREST_1.38.0
#> [87] RCurl_1.98-1.7 tibble_3.1.7
#> [89] crayon_1.5.1 utf8_1.2.2
#> [91] BiocFileCache_2.6.0 rmarkdown_2.14
#> [93] locfit_1.5-9.6 grid_4.2.1
#> [95] blob_1.2.3 digest_0.6.29
#> [97] xtable_1.8-4 httpuv_1.6.5
#> [99] R.utils_2.12.0 munsell_0.5.0
#> [101] bslib_0.3.1