This vignette describes basic usage of the package intended to process several large bedgraph files in R. Methrix
provides set of function which allows easy importing of various flavors of bedgraphs generated by methylation callers, and many downstream analysis to be performed on large matrices.
read_bedgraphs
function is a versatile bedgraph reader intended to import bedgraph files generated virtually by any sort of methylation calling program. It requires user to provide indices for chromosome names, start position and other required fields. There are also presets available to import bedgraphs
from most common programs such as Bismark
, MethylDackel
, and MethylcTools
.
#Genome of your preference to work with
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
library(BiocManager)
if(!requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
}
library(BSgenome.Hsapiens.UCSC.hg19)
#Example bedgraph files
bdg_files <- list.files(
path = system.file('extdata', package = 'methrix'),
pattern = "*bedGraph\\.gz$",
full.names = TRUE
)
print(basename(bdg_files))
#> [1] "C1.bedGraph.gz" "C2.bedGraph.gz" "N1.bedGraph.gz" "N2.bedGraph.gz"
#Generate some sample annotation table
sample_anno <- data.frame(
row.names = gsub(
pattern = "\\.bedGraph\\.gz$",
replacement = "",
x = basename(bdg_files)
),
Condition = c("cancer", 'cancer', "normal", "normal"),
Pair = c("pair1", "pair2", "pair1", "pair2"),
stringsAsFactors = FALSE
)
print(sample_anno)
#> Condition Pair
#> C1 cancer pair1
#> C2 cancer pair2
#> N1 normal pair1
#> N2 normal pair2
We can import bedgraph files with the function read_bedgraphs
which reads in the bedgraphs, adds CpGs missing from the reference set, and creates a methylation/coverage matrices. Once the process is complete - it returns an object of class methrix
which in turn inherits SummarizedExperiment class. methrix
object contains ‘methylation’ and ‘coverage’ matrices (either in-memory or as on-disk HDF5 arrays) along with pheno-data and other basic info. This object can be passed to all downstream functions for various analysis.
#First extract genome wide CpGs from the desired reference genome
hg19_cpgs <- suppressWarnings(methrix::extract_CPGs(ref_genome = "BSgenome.Hsapiens.UCSC.hg19"))
#>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
#> -Extracting CpGs
#> -Done. Extracted 28,700,086 CpGs from 93 contigs.
#Read the files
meth <- methrix::read_bedgraphs(
files = bdg_files,
ref_cpgs = hg19_cpgs,
chr_idx = 1,
start_idx = 2,
M_idx = 3,
U_idx = 4,
stranded = FALSE,
zero_based = FALSE,
collapse_strands = FALSE,
coldata = sample_anno
)
#> ----------------------------
#> -Preset: Custom
#> --Missing beta and coverage info. Estimating them from M and U values
#> -CpGs raw: 28,700,086
#> -CpGs filtered: 28,217,448
#> ----------------------------
#> -Processing: C1.bedGraph.gz
#> --CpGs missing: 28,216,771
#> -Processing: C2.bedGraph.gz
#> --CpGs missing: 28,216,759
#> -Processing: N1.bedGraph.gz
#> --CpGs missing: 28,216,746
#> -Processing: N2.bedGraph.gz
#> --CpGs missing: 28,216,747
#> -Finished in: 00:01:33 elapsed (00:02:50 cpu)
Note: Use the argument pipeline
if your bedgraphs are generated with “Bismark”, “MethylDeckal”, or “MethylcTools”. This will automatically figure out the file formats for you, and you dont have to use the arguments chr_idx
start_idx
and so..
Get basic summary statistics of the methrix
object with methrix_report
function which produces an interactive html report
Usual task in analysis involves removing uncovered CpGs. i.e, those loci which are not covered across all sample (in other words covered only in subset of samples resulting NA
for rest of the samples ).
One can also remove CpG sites overlaping with common SNPs based on minor allele frequencies.
if(!require(MafDb.1Kgenomes.phase3.hs37d5)) {
BiocManager::install("MafDb.1Kgenomes.phase3.hs37d5")}
if(!require(GenomicScores)) {
BiocManager::install("GenomicScores")}
library(MafDb.1Kgenomes.phase3.hs37d5)
#> Loading required package: GenomicScores
#>
#> Attaching package: 'GenomicScores'
#> The following object is masked from 'package:utils':
#>
#> citation
library(GenomicScores)
meth_snps_filtered <- methrix::remove_snps(m = meth)
#> Used SNP database: MafDb.1Kgenomes.phase3.hs37d5.
#> Number of SNPs removed:
#> chr N
#> 1: chr21 42
#> 2: chr22 39
#> Sum:
#> [1] 81
#Example data bundled, same as the previously generated meth
data("methrix_data")
#Coverage matrix
coverage_mat <- methrix::get_matrix(m = methrix_data, type = "C")
head(coverage_mat)
#> C1 C2 N1 N2
#> [1,] 13 7 9 10
#> [2,] NA 2 3 NA
#> [3,] 9 10 3 5
#> [4,] 11 8 12 8
#> [5,] 6 7 17 8
#> [6,] 13 6 6 14
#Methylation matrix
meth_mat <- methrix::get_matrix(m = methrix_data, type = "M")
head(meth_mat)
#> C1 C2 N1 N2
#> [1,] 0.1538462 0.2857143 0.5555556 0.3000000
#> [2,] NA 0.5000000 0.0000000 NA
#> [3,] 0.5555556 0.7000000 0.3333333 0.8000000
#> [4,] 0.1818182 0.2500000 0.5833333 0.2500000
#> [5,] 0.6666667 1.0000000 0.8823529 0.8750000
#> [6,] 0.8461538 1.0000000 0.8333333 0.9285714
#If you prefer you can attach loci info to the matrix and output in GRanges format
meth_mat_with_loci <- methrix::get_matrix(m = methrix_data, type = "M", add_loci = TRUE, in_granges = TRUE)
meth_mat_with_loci
#> GRanges object with 743 ranges and 4 metadata columns:
#> seqnames ranges strand | C1 C2
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] chr21 27866423-27866424 * | 0.153846153846154 0.285714285714286
#> [2] chr21 27866575-27866576 * | <NA> 0.5
#> [3] chr21 27866921-27866922 * | 0.555555555555556 0.7
#> [4] chr21 27867197-27867198 * | 0.181818181818182 0.25
#> [5] chr21 27867248-27867249 * | 0.666666666666667 1
#> ... ... ... ... . ... ...
#> [739] chr22 49007313-49007314 * | 1 0.714285714285714
#> [740] chr22 49007329-49007330 * | 1 0.428571428571429
#> [741] chr22 49007347-49007348 * | 0.666666666666667 0.166666666666667
#> [742] chr22 49007375-49007376 * | 0.333333333333333 0.125
#> [743] chr22 49007398-49007399 * | 1 0.6
#> N1 N2
#> <numeric> <numeric>
#> [1] 0.555555555555556 0.3
#> [2] 0 <NA>
#> [3] 0.333333333333333 0.8
#> [4] 0.583333333333333 0.25
#> [5] 0.882352941176471 0.875
#> ... ... ...
#> [739] 0.857142857142857 1
#> [740] 1 1
#> [741] 0.875 1
#> [742] 1 0.6
#> [743] 1 1
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
Furthermore if you prefer you can filter sites based on coverage conditions.
#e.g; Retain all loci which are covered at-least in two sample by 3 or more reads
methrix::coverage_filter(m = methrix_data, cov_thr = 3, min_samples = 2)
#> -Retained 600 of 743 sites
#> -Finished in: 1.022s elapsed (1.272s cpu)
#> An object of class methrix
#> n_CpGs: 600
#> n_samples: 4
#> is_h5: FALSE
#> Reference: hg19
Subset operations in methrix
make use of data.table
s fast binary search which is several orders faster than bsseq
or other similar packages.
Regions can be data.table or GRanges format.
#e.g; Retain sites only in TP53 loci
target_loci <- GenomicRanges::GRanges("chr21:27867971-27868103")
print(target_loci)
#> GRanges object with 1 range and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr21 27867971-27868103 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
methrix::subset_methrix(m = methrix_data, regions = target_loci)
#> -Subsetting by genomic regions
#> An object of class methrix
#> n_CpGs: 4
#> n_samples: 4
#> is_h5: FALSE
#> Reference: hg19
methrix::subset_methrix(m = methrix_data, samples = "C1")
#> Subsetting by samples
#> An object of class methrix
#> n_CpGs: 743
#> n_samples: 1
#> is_h5: FALSE
#> Reference: hg19
#Or you could use [] operator to subset by index
methrix_data[,1]
#> An object of class methrix
#> n_CpGs: 743
#> n_samples: 1
#> is_h5: FALSE
#> Reference: hg19
meth_stats <- get_stats(m = methrix_data)
#> -Finished in: 1.550s elapsed (3.864s cpu)
print(meth_stats)
#> Chromosome Sample_Name mean_meth median_meth sd_meth mean_cov median_cov
#> 1: chr21 C1 0.560004 0.6515152 0.4004019 4.745968 4
#> 2: chr21 C2 0.4934995 0.5 0.3896208 5.047525 4
#> 3: chr21 N1 0.5245418 0.6125 0.4205225 4.978682 4
#> 4: chr21 N2 0.5333442 0.6666667 0.4227578 5.055662 5
#> 5: chr22 C1 0.7392421 0.8571429 0.3115366 4.657459 4
#> 6: chr22 C2 0.5778096 0.6515152 0.3678762 5.5 5
#> 7: chr22 N1 0.8445557 1 0.2273115 5.505376 5
#> 8: chr22 N2 0.8520698 1 0.220973 5.866667 5
#> sd_cov
#> 1: 2.990217
#> 2: 3.294072
#> 3: 3.214882
#> 4: 3.148072
#> 5: 2.813423
#> 6: 3.046094
#> 7: 3.270255
#> 8: 3.166515
If you prefer to work with bsseq object, you can generate bsseq
object from methrix with the methrix2bsseq
.
sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] bsseq_1.22.0 MafDb.1Kgenomes.phase3.hs37d5_3.10.0
#> [3] GenomicScores_1.10.0 BSgenome.Hsapiens.UCSC.hg19_1.4.0
#> [5] BSgenome_1.54.0 rtracklayer_1.46.0
#> [7] Biostrings_2.54.0 XVector_0.26.0
#> [9] methrix_1.0.05 SummarizedExperiment_1.16.1
#> [11] DelayedArray_0.12.2 BiocParallel_1.20.1
#> [13] matrixStats_0.55.0 Biobase_2.46.0
#> [15] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
#> [17] IRanges_2.20.1 S4Vectors_0.24.1
#> [19] BiocGenerics_0.32.0 data.table_1.12.8
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-6 bit64_0.9-7
#> [3] RColorBrewer_1.1-2 httr_1.4.1
#> [5] tools_3.6.2 backports_1.1.5
#> [7] R6_2.4.1 HDF5Array_1.14.1
#> [9] DBI_1.1.0 lazyeval_0.2.2
#> [11] colorspace_1.4-1 permute_0.9-5
#> [13] tidyselect_0.2.5 bit_1.1-14
#> [15] curl_4.3 compiler_3.6.2
#> [17] labeling_0.3 scales_1.1.0
#> [19] rappdirs_0.3.1 stringr_1.4.0
#> [21] digest_0.6.23 Rsamtools_2.2.1
#> [23] rmarkdown_2.0 R.utils_2.9.2
#> [25] pkgconfig_2.0.3 htmltools_0.4.0
#> [27] limma_3.42.0 dbplyr_1.4.2
#> [29] fastmap_1.0.1 rlang_0.4.2
#> [31] RSQLite_2.2.0 shiny_1.4.0
#> [33] DelayedMatrixStats_1.8.0 farver_2.0.1
#> [35] gtools_3.8.1 dplyr_0.8.3
#> [37] R.oo_1.23.0 RCurl_1.95-4.12
#> [39] magrittr_1.5 GenomeInfoDbData_1.2.2
#> [41] Matrix_1.2-18 Rcpp_1.0.3
#> [43] munsell_0.5.0 Rhdf5lib_1.8.0
#> [45] lifecycle_0.1.0 R.methodsS3_1.7.1
#> [47] stringi_1.4.3 yaml_2.2.0
#> [49] zlibbioc_1.32.0 rhdf5_2.30.1
#> [51] BiocFileCache_1.10.2 AnnotationHub_2.18.0
#> [53] grid_3.6.2 blob_1.2.0
#> [55] promises_1.1.0 crayon_1.3.4
#> [57] lattice_0.20-38 locfit_1.5-9.1
#> [59] zeallot_0.1.0 knitr_1.26
#> [61] pillar_1.4.3 rjson_0.2.20
#> [63] XML_3.98-1.20 glue_1.3.1
#> [65] BiocVersion_3.10.1 evaluate_0.14
#> [67] BiocManager_1.30.10 vctrs_0.2.1
#> [69] httpuv_1.5.2 gtable_0.3.0
#> [71] purrr_0.3.3 assertthat_0.2.1
#> [73] ggplot2_3.2.1 xfun_0.11
#> [75] mime_0.8 xtable_1.8-4
#> [77] later_1.0.0 tibble_2.1.3
#> [79] GenomicAlignments_1.22.1 AnnotationDbi_1.48.0
#> [81] memoise_1.1.0 interactiveDisplayBase_1.24.0