Abstract
A comprehensive guide to using the ramr package for detection of rare aberrantly methylated regions.
ramr is an R package for detection of low-frequency aberrant methylation events in large data sets obtained by methylation profiling using arrays or high-throughput bisulfite sequencing. In addition, the package provides functions to visualize the detected aberrantly methylated regions (AMRs), to generate sets of all possible regions to be used as reference sets for enrichment analysis, and to generate biologically relevant test data sets for performance evaluation of AMR/DMR search algorithms.
ramr methods operate on objects of the class GRanges. The input object for AMR search must additionally contain metadata columns with sample beta values. A typical input object looks like this:
GRanges object with 383788 ranges and 845 metadata columns:
seqnames ranges strand | GSM1235534 GSM1235535 GSM1235536 ...
<Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> ...
cg13869341 chr1 15865 * | 0.801634776091808 0.846486905008704 0.86732154737116 ...
cg24669183 chr1 534242 * | 0.834138820071765 0.861974610731835 0.832557979806823 ...
cg15560884 chr1 710097 * | 0.711275180750356 0.70461945838556 0.699487225634589 ...
cg01014490 chr1 714177 * | 0.0769098196182058 0.0569443780518647 0.0623154673389864 ...
cg17505339 chr1 720865 * | 0.876413362222415 0.885593263385521 0.877944732153869 ...
... ... ... ... . ... ... ... ...
cg05615487 chr22 51176407 * | 0.84904178467798 0.836538383875097 0.81568519870099 ...
cg22122449 chr22 51176711 * | 0.882444486059592 0.870804215405886 0.859269224277308 ...
cg08423507 chr22 51177982 * | 0.886406345093286 0.882430879852752 0.887241923657461 ...
cg19565306 chr22 51222011 * | 0.0719084295670266 0.0845209871264646 0.0689074604483659 ...
cg09226288 chr22 51225561 * | 0.724145303755024 0.696281176451351 0.711459675603635 ...
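An object of this shape can also be assembled by hand from a matrix of beta values and the positions of the corresponding CpGs. The following is a minimal sketch only: the three CpGs and their coordinates are taken from the listing above, while the sample names (sampleA, sampleB) and the rounded beta values are made up for illustration.

```r
library(GenomicRanges)

# toy beta-value matrix: rows are CpGs, columns are samples
# (sample names and values are hypothetical)
betas <- matrix(
  c(0.80, 0.83, 0.08,   # sampleA
    0.85, 0.86, 0.06),  # sampleB
  nrow=3,
  dimnames=list(c("cg13869341", "cg24669183", "cg01014490"),
                c("sampleA", "sampleB"))
)

# one single-base range per CpG, strand left unspecified
data.ranges <- GRanges(
  seqnames="chr1",
  ranges=IRanges(start=c(15865, 534242, 714177), width=1),
  strand="*"
)
names(data.ranges) <- rownames(betas)

# attach beta values as metadata columns, one column per sample
mcols(data.ranges) <- as.data.frame(betas)
sample.ids <- colnames(betas)

data.ranges
```

Keeping the sample names in a separate character vector is convenient because the same vector can later be used to select which metadata columns are analysed.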
The ramr package is supplied with sample data that were simulated using the GSE51032 data set as described in the ramr reference paper. The sample data set ramr.data contains beta values for 10000 CpGs and 100 samples (ramr.samples), and carries 6 unique (ramr.tp.unique) and 15 non-unique (ramr.tp.nonunique) true positive AMRs, each containing at least 10 CpGs with their beta values increased or decreased by 0.5.
library(ramr)
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append, as.data.frame,
#> basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, get, grep,
#> grepl, intersect, is.unsorted, lapply, mapply, match, mget, order, paste, pmax,
#> pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: parallel
#> Loading required package: doParallel
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: doRNG
#> Loading required package: rngtools
data(ramr)
head(ramr.samples)
#> [1] "sample1" "sample2" "sample3" "sample4" "sample5" "sample6"
ramr.data[1:10,ramr.samples[1:3]]
#> GRanges object with 10 ranges and 3 metadata columns:
#> seqnames ranges strand | sample1 sample2 sample3
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
#> cg13869341 chr1 15865 * | 0.833609 0.847747 0.879959
#> cg14008030 chr1 18827 * | 0.553312 0.547931 0.609759
#> cg12045430 chr1 29407 * | 0.186572 0.204322 0.222865
#> cg20826792 chr1 29425 * | 0.481280 0.439688 0.426783
#> cg00381604 chr1 29435 * | 0.134126 0.173629 0.167011
#> cg20253340 chr1 68849 * | 0.500267 0.615868 0.541587
#> cg21870274 chr1 69591 * | 0.777613 0.771680 0.775057
#> cg03130891 chr1 91550 * | 0.245783 0.204307 0.223610
#> cg24335620 chr1 135252 * | 0.777796 0.790842 0.786113
#> cg16162899 chr1 449076 * | 0.878150 0.860134 0.898071
#> -------
#> seqinfo: 24 sequences from hg19 genome; no seqlengths
plotAMR(ramr.data, ramr.samples, ramr.tp.unique[1])
#> [[1]]
The input (or template) object may be obtained using data from various sources. Here we provide two examples:
The following code pulls (NB: very large) raw files from the NCBI GEO database, performs normalization, and creates a GRanges object for further analysis using ramr (system requirements: 22GB of disk space, 64GB of RAM):
library(minfi)
library(GEOquery)
library(GenomicRanges)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
# destination for temporary files
dest.dir <- tempdir()
# downloading and unpacking raw IDAT files
suppl.files <- getGEOSuppFiles("GSE51032", baseDir=dest.dir, makeDirectory=FALSE, filter_regex="RAW")
untar(rownames(suppl.files), exdir=dest.dir, verbose=TRUE)
idat.files <- list.files(dest.dir, pattern="idat.gz$", full.names=TRUE)
sapply(idat.files, gunzip, overwrite=TRUE)
# reading IDAT files
geo.idat <- read.metharray.exp(dest.dir)
colnames(geo.idat) <- gsub("(GSM\\d+).*", "\\1", colnames(geo.idat))
# processing raw data
genomic.ratio.set <- preprocessQuantile(geo.idat, mergeManifest=TRUE, fixOutliers=TRUE)
# creating the GRanges object with beta values
data.ranges <- granges(genomic.ratio.set)
data.betas <- getBeta(genomic.ratio.set)
sample.ids <- colnames(geo.idat)
mcols(data.ranges) <- data.betas
# data.ranges and sample.ids objects are now ready for AMR search using ramr
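Before starting an AMR search it can be worth confirming that the assembled object has the expected shape. The sketch below is an illustration under stated assumptions: checkBetaInput is a hypothetical helper and not part of ramr, and the toy object stands in for the real data.ranges. It reports sample names without a matching metadata column, the number of missing values, and whether all values lie within [0, 1].

```r
library(GenomicRanges)

# hypothetical helper (not part of ramr): basic sanity checks of an input object
checkBetaInput <- function(data.ranges, sample.ids) {
  stopifnot(is(data.ranges, "GRanges"))
  available <- colnames(mcols(data.ranges))
  present <- intersect(sample.ids, available)
  betas <- as.matrix(mcols(data.ranges)[, present, drop=FALSE])
  list(
    missing.samples=setdiff(sample.ids, available),
    n.missing.values=sum(is.na(betas)),
    within.unit.interval=all(betas >= 0 & betas <= 1, na.rm=TRUE)
  )
}

# toy object with two samples, one missing value, and a requested sample "s3"
# that has no matching column
toy.ranges <- GRanges("chr1", IRanges(start=c(100, 200, 300), width=1),
                      s1=c(0.10, 0.90, NA), s2=c(0.50, 0.55, 0.60))
check <- checkBetaInput(toy.ranges, c("s1", "s2", "s3"))
check
```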
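The following code reads Bismark cytosine report files using methylKit and creates a GRanges object for further analysis using ramr:
library(methylKit)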
library(GenomicRanges)
# file.list is a user-defined character vector with full file names of Bismark cytosine report files
file.list
# sample.ids is a user-defined character vector holding sample names
sample.ids
# methylation context string, defines if the reads covering both strands will be merged
context <- "CpG"
# fitting beta distribution (filtering using ramr.method "beta" or "wbeta") requires
# that most of the beta values are not equal to 0 or 1
min.beta <- 0.001
max.beta <- 0.999
# reading and uniting methylation values
meth.data.raw <- methRead(as.list(file.list), as.list(sample.ids), assembly="hg19", header=TRUE,
context=context, resolution="base", treatment=rep(0,length(sample.ids)),
pipeline="bismarkCytosineReport")
meth.data.utd <- unite(meth.data.raw, destrand=isTRUE(context=="CpG"))
# creating the GRanges object with beta values
data.ranges <- GRanges(meth.data.utd)
data.betas <- percMethylation(meth.data.utd)/100
data.betas[data.betas<min.beta] <- min.beta
data.betas[data.betas>max.beta] <- max.beta
mcols(data.ranges) <- data.betas
# data.ranges and sample.ids objects are now ready for AMR search using ramr
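The clamping step above can be looked at in isolation. The sketch below uses a hypothetical helper, clampBetas (not part of ramr or methylKit), with the same bounds as min.beta and max.beta, and a made-up matrix. It also illustrates one reason why exact zeros and ones are awkward for beta-distribution fitting: the log-density is not finite at the boundaries. The shape parameters 2 and 5 are arbitrary and chosen only for the demonstration.

```r
# hypothetical helper mirroring the clamping step:
# bound beta values away from exactly 0 and exactly 1
clampBetas <- function(betas, min.beta=0.001, max.beta=0.999) {
  pmin(pmax(betas, min.beta), max.beta)
}

# toy matrix containing boundary values
toy.betas <- matrix(c(0, 0.25, 1, 0.9995), nrow=2,
                    dimnames=list(NULL, c("s1", "s2")))
clamped <- clampBetas(toy.betas)
clamped

# log-density of a beta distribution at the boundaries is not finite ...
boundary.logdens <- stats::dbeta(c(0, 1), shape1=2, shape2=5, log=TRUE)
# ... but it is finite for the clamped extremes
clamped.logdens <- stats::dbeta(range(clamped), shape1=2, shape2=5, log=TRUE)
```

Clamping changes the affected values only marginally, so it is a pragmatic choice when most values are already strictly between 0 and 1.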
ramr provides methods to create sets of random AMRs and to generate biologically relevant methylation beta values using real data sets as templates. The following code provides an example; however, it is recommended to use real experimental data (e.g. GSE51032) to create a test data set for assessing the performance of ramr or other AMR/DMR search engines. The results of parallel data generation are fully reproducible when the same seed has been set (thanks to doRNG::%dorng%).
# set the seed if reproducible results are required
set.seed(999)
# unique random AMRs
amrs.unique <-
simulateAMR(ramr.data, nsamples=25, regions.per.sample=2,
min.cpgs=5, merge.window=1000, dbeta=0.2)
# non-unique AMRs outside of regions with unique AMRs
amrs.nonunique <-
simulateAMR(ramr.data, nsamples=4, exclude.ranges=amrs.unique,
samples.per.region=2, min.cpgs=5, merge.window=1000)
# random noise outside of AMR regions
noise <-
simulateAMR(ramr.data, nsamples=25, regions.per.sample=20,
exclude.ranges=c(amrs.unique, amrs.nonunique),
min.cpgs=1, max.cpgs=1, merge.window=1, dbeta=0.5)
# "smooth" methylation data without AMRs (negative control)
smooth.data <-
simulateData(ramr.data, nsamples=25, cores=2)
#> Simulating data [1.007s]
# methylation data with AMRs and noise
noisy.data <-
simulateData(ramr.data, nsamples=25,
amr.ranges=c(amrs.unique, amrs.nonunique, noise), cores=2)
#> Simulating data [1.501s]
#> Introducing epimutations [0.071s]
# this is what the regions look like
library(gridExtra)
#>
#> Attaching package: 'gridExtra'
#>
#> The following object is masked from 'package:BiocGenerics':
#>
#> combine
do.call("grid.arrange", c(plotAMR(noisy.data, amr.ranges=amrs.unique[1:4]), ncol=2))
do.call("grid.arrange", c(plotAMR(noisy.data, amr.ranges=noise[1:4]), ncol=2))
#> geom_path: Each group consists of only one observation. Do you need to adjust the group
#> aesthetic?
# can we find them?
system.time(found <- getAMR(noisy.data, ramr.method="beta", min.cpgs=5,
merge.window=1000, qval.cutoff=1e-2, cores=2))
#> Identifying AMRs [2.570s]
#> user system elapsed
#> 3.846 0.968 2.742
# all possible regions
all.ranges <- getUniverse(noisy.data, min.cpgs=5, merge.window=1000)
# true positives
tp <- sum(found %over% c(amrs.unique, amrs.nonunique))
# false positives
fp <- sum(found %outside% c(amrs.unique, amrs.nonunique))
# true negatives
tn <- length(all.ranges %outside% c(amrs.unique, amrs.nonunique))
# false negatives
fn <- sum(c(amrs.unique, amrs.nonunique) %outside% found)
# accuracy, MCC
acc <- (tp+tn) / (tp+tn+fp+fn)
mcc <- (tp*tn - fp*fn) / (sqrt(tp+fp)*sqrt(tp+fn)*sqrt(tn+fp)*sqrt(tn+fn))
setNames(c(tp, fp, tn, fn), c("TP", "FP", "TN", "FN"))
#> TP FP TN FN
#> 57 0 206 1
setNames(c(acc, mcc), c("accuracy", "MCC"))
#> accuracy MCC
#> 0.9962121 0.9889444
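The counting logic behind such an evaluation can be illustrated on a tiny made-up example. The sketch below is not the evaluation used in the ramr paper: the coordinates are invented, classificationMetrics is a hypothetical helper, and a true negative is assumed to be a candidate region that is neither a true AMR nor a called one. Each cell of the confusion matrix is obtained with sum() over a logical overlap vector, which counts only the TRUE entries.

```r
library(GenomicRanges)

# toy region sets (coordinates are made up)
universe <- GRanges("chr1", IRanges(start=c(100, 300, 500, 700, 900), width=50))
truth    <- GRanges("chr1", IRanges(start=c(120, 900), width=50))
called   <- GRanges("chr1", IRanges(start=c(100, 500), width=50))

# each cell is the number of TRUE values in a logical overlap vector
tp <- sum(called %over% truth)
fp <- sum(called %outside% truth)
fn <- sum(truth %outside% called)
# assumption: a true negative is a candidate region that is neither true nor called
tn <- sum(universe %outside% c(truth, called))

# hypothetical helper computing accuracy and MCC from the four counts
classificationMetrics <- function(tp, fp, tn, fn) {
  acc <- (tp+tn) / (tp+tn+fp+fn)
  mcc <- (tp*tn - fp*fn) / (sqrt(tp+fp)*sqrt(tp+fn)*sqrt(tn+fp)*sqrt(tn+fn))
  c(accuracy=acc, MCC=mcc)
}

toy.metrics <- classificationMetrics(tp, fp, tn, fn)
toy.metrics

# the same formulas applied to the counts printed above
reported.metrics <- classificationMetrics(tp=57, fp=0, tn=206, fn=1)
reported.metrics
```

Taking square roots of the four marginal sums separately, as in the code above, avoids forming a very large intermediate product when the counts are big.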
This code shows how to do basic analysis with ramr using the provided data files:
# identify AMRs
amrs <- getAMR(ramr.data, ramr.samples, ramr.method="beta", min.cpgs=5,
merge.window=1000, qval.cutoff=1e-3, cores=2)
#> Identifying AMRs [7.171s]
# inspect
sort(amrs)
#> GRanges object with 22 ranges and 5 metadata columns:
#> seqnames ranges strand | revmap ncpg sample dbeta
#> <Rle> <IRanges> <Rle> | <list> <integer> <character> <numeric>
#> [1] chr1 566172-569687 * | 17,18,19,... 15 sample95 0.498337
#> [2] chr1 874697-877876 * | 165,166,167,... 13 sample44 0.451475
#> [3] chr1 874697-877876 * | 165,166,167,... 13 sample45 0.457115
#> [4] chr1 874697-877876 * | 165,166,167,... 13 sample46 0.458498
#> [5] chr1 1095607-1106175 * | 620,621,622,... 49 sample66 -0.503085
#> ... ... ... ... . ... ... ... ...
#> [18] chr1 2200890-2203648 * | 2263,2264,2265,... 10 sample58 -0.505059
#> [19] chr1 2200890-2203648 * | 2263,2264,2265,... 10 sample59 -0.498849
#> [20] chr1 2200890-2203648 * | 2263,2264,2265,... 10 sample60 -0.500389
#> [21] chr1 2269871-2271665 * | 2410,2411,2412,... 10 sample71 -0.496600
#> [22] chr1 2443577-2453006 * | 2722,2723,2724,... 30 sample25 -0.484617
#> pval
#> <numeric>
#> [1] 4.60616e-171
#> [2] 1.30458e-07
#> [3] 1.04111e-07
#> [4] 8.03068e-08
#> [5] 1.92980e-17
#> ... ...
#> [18] 2.98503e-08
#> [19] 4.05494e-08
#> [20] 3.68870e-08
#> [21] 1.54741e-17
#> [22] 1.26772e-19
#> -------
#> seqinfo: 24 sequences from hg19 genome; no seqlengths
do.call("grid.arrange", c(plotAMR(ramr.data, ramr.samples, amrs[1:10]), ncol=2))
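Because the result is an ordinary GRanges object, it can be filtered and summarised with standard GenomicRanges tools. The sketch below works on a toy object that copies four rows from the listing above (the revmap and pval columns are omitted), so it runs without ramr. It assumes that a positive dbeta denotes a gain and a negative dbeta a loss of methylation in the affected sample.

```r
library(GenomicRanges)

# toy result set with the same metadata columns as the getAMR output
# (four rows copied from the listing above, revmap and pval omitted)
amrs.toy <- GRanges(
  "chr1",
  IRanges(start=c(566172, 874697, 874697, 1095607),
          end=c(569687, 877876, 877876, 1106175)),
  ncpg=c(15L, 13L, 13L, 49L),
  sample=c("sample95", "sample44", "sample45", "sample66"),
  dbeta=c(0.498337, 0.451475, 0.457115, -0.503085)
)

# split by direction of the change (assumed meaning of the sign of dbeta)
hyper <- amrs.toy[amrs.toy$dbeta > 0]
hypo <- amrs.toy[amrs.toy$dbeta < 0]

# number of samples carrying an AMR at each distinct region
distinct.regions <- unique(granges(amrs.toy))
distinct.regions$nsamples <- countOverlaps(distinct.regions, amrs.toy, type="equal")
distinct.regions
```

Regions with nsamples equal to 1 correspond to AMRs seen in a single sample, while larger values indicate regions shared by several samples.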
Again, the results of parallel processing are fully reproducible if the same seed has been set.
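The mechanism behind this reproducibility can be demonstrated independently of ramr. The sketch below is a generic doRNG illustration and says nothing about how ramr organises its parallel loops internally: two parallel foreach loops driven by %dorng% produce the same random numbers when the same seed is set before each of them.

```r
library(doParallel)
library(doRNG)

# register a parallel backend with two workers
registerDoParallel(cores=2)

# same seed before each loop: %dorng% derives independent,
# reproducible random streams for the iterations
set.seed(999)
first <- foreach(i=1:4, .combine=c) %dorng% runif(1)
set.seed(999)
second <- foreach(i=1:4, .combine=c) %dorng% runif(1)

# compare the values only (attributes added by doRNG are dropped)
reproducible <- identical(as.numeric(first), as.numeric(second))
reproducible

# release the implicitly created workers, if any
stopImplicitCluster()
```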
If necessary, AMRs can be annotated to known genomic elements using the R library annotatr [1] or tested for potential enrichment in epigenetic or other marks using the R library LOLA [2]:
# annotating AMRs using R library annotatr
library(annotatr)
annotation.types <- c("hg19_cpg_inter", "hg19_cpg_islands", "hg19_cpg_shores",
"hg19_cpg_shelves", "hg19_genes_intergenic", "hg19_genes_promoters",
"hg19_genes_5UTRs", "hg19_genes_firstexons", "hg19_genes_3UTRs")
annotations <- build_annotations(genome='hg19', annotations=annotation.types)
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with 'browseVignettes()'. To cite
#> Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> 'select()' returned 1:1 mapping between keys and columns
#> Building promoters...
#> Building 1to5kb upstream of TSS...
#> Building intergenic...
#> Building 5UTRs...
#> Building 3UTRs...
#> Building exons...
#> Building first exons...
#> Building introns...
#> snapshotDate(): 2022-10-26
#> Building CpG islands...
#> loading from cache
#> Building CpG shores...
#> Building CpG shelves...
#> Building inter-CpG-islands...
amrs.annots <- annotate_regions(regions=amrs, annotations=annotations, ignore.strand=TRUE, quiet=FALSE)
#> Annotating...
summarize_annotations(annotated_regions=amrs.annots, quiet=FALSE)
#> Counting annotation types
#> # A tibble: 9 × 2
#> annot.type n
#> <chr> <int>
#> 1 hg19_cpg_inter 4
#> 2 hg19_cpg_islands 10
#> 3 hg19_cpg_shelves 4
#> 4 hg19_cpg_shores 10
#> 5 hg19_genes_3UTRs 2
#> 6 hg19_genes_5UTRs 5
#> 7 hg19_genes_firstexons 8
#> 8 hg19_genes_intergenic 1
#> 9 hg19_genes_promoters 8
# generate the set of all possible genomic regions using sample data set and
# the same parameters as for AMR search
universe <- getUniverse(ramr.data, min.cpgs=5, merge.window=1000)
# enrichment analysis of AMRs using R library LOLA
library(LOLA)
# prepare the core database as described in vignettes
vignette("usingLOLACore")
# load the core database and perform the enrichment analysis
hg19.coredb <- loadRegionDB(system.file("LOLACore", "hg19", package="LOLA"))
runLOLA(amrs, universe, hg19.coredb, cores=1, redefineUserSets=TRUE)
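The universe used for enrichment analysis is built with the same min.cpgs and merge.window values as the AMR search. One plausible reading of these two parameters is sketched below with plain GenomicRanges calls: neighbouring CpGs closer than merge.window are merged into one region, and regions with fewer than min.cpgs CpGs are dropped. This is an assumption made for illustration, not a description of how getUniverse is implemented, and the CpG positions are made up.

```r
library(GenomicRanges)

# toy CpG positions (made up)
cpgs <- GRanges("chr1",
                IRanges(start=c(100, 300, 700, 5000, 5200, 20000), width=1))

merge.window <- 1000
min.cpgs <- 2

# merge CpGs separated by a gap smaller than merge.window,
# remembering which CpGs went into each merged region
regions <- reduce(cpgs, min.gapwidth=merge.window, with.revmap=TRUE)
regions$ncpg <- lengths(regions$revmap)

# keep only regions with at least min.cpgs CpGs
candidate.regions <- regions[regions$ncpg >= min.cpgs]
candidate.regions
```

Under these settings the isolated CpG at position 20000 does not form a candidate region, so it would not contribute to the background set.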
ramr package: Oleksii Nikolaienko, Per Eystein Lønning, Stian Knappskog, ramr: an R/Bioconductor package for detection of rare aberrantly methylated regions, Bioinformatics, 2021, btab586, https://doi.org/10.1093/bioinformatics/btab586
ramr manuscript: Replication Data for: "ramr: an R package for detection of rare aberrantly methylated regions", https://doi.org/10.18710/ED8HSD
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB
#> [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] org.Hs.eg.db_3.16.0 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#> [3] GenomicFeatures_1.50.0 AnnotationDbi_1.60.0
#> [5] Biobase_2.58.0 annotatr_1.24.0
#> [7] gridExtra_2.3 ramr_1.6.0
#> [9] doRNG_1.8.2 rngtools_1.5.2
#> [11] doParallel_1.0.17 iterators_1.0.14
#> [13] foreach_1.5.2 GenomicRanges_1.50.0
#> [15] GenomeInfoDb_1.34.0 IRanges_2.32.0
#> [17] S4Vectors_0.36.0 BiocGenerics_0.44.0
#>
#> loaded via a namespace (and not attached):
#> [1] colorspace_2.0-3 rjson_0.2.21 ellipsis_0.3.2
#> [4] EnvStats_2.7.0 XVector_0.38.0 farver_2.1.1
#> [7] bit64_4.0.5 interactiveDisplayBase_1.36.0 fansi_1.0.3
#> [10] xml2_1.3.3 codetools_0.2-18 cachem_1.0.6
#> [13] knitr_1.40 jsonlite_1.8.3 Rsamtools_2.14.0
#> [16] dbplyr_2.2.1 png_0.1-7 shiny_1.7.3
#> [19] BiocManager_1.30.19 readr_2.1.3 compiler_4.2.1
#> [22] httr_1.4.4 assertthat_0.2.1 Matrix_1.5-1
#> [25] fastmap_1.1.0 cli_3.4.1 later_1.3.0
#> [28] htmltools_0.5.3 prettyunits_1.1.1 tools_4.2.1
#> [31] gtable_0.3.1 glue_1.6.2 GenomeInfoDbData_1.2.9
#> [34] reshape2_1.4.4 dplyr_1.0.10 rappdirs_0.3.3
#> [37] Rcpp_1.0.9 jquerylib_0.1.4 vctrs_0.5.0
#> [40] Biostrings_2.66.0 rtracklayer_1.58.0 xfun_0.34
#> [43] stringr_1.4.1 mime_0.12 lifecycle_1.0.3
#> [46] restfulr_0.0.15 XML_3.99-0.12 AnnotationHub_3.6.0
#> [49] zlibbioc_1.44.0 scales_1.2.1 BSgenome_1.66.0
#> [52] hms_1.1.2 promises_1.2.0.1 MatrixGenerics_1.10.0
#> [55] SummarizedExperiment_1.28.0 yaml_2.3.6 curl_4.3.3
#> [58] memoise_2.0.1 ggplot2_3.3.6 sass_0.4.2
#> [61] biomaRt_2.54.0 stringi_1.7.8 RSQLite_2.2.18
#> [64] BiocVersion_3.16.0 highr_0.9 BiocIO_1.8.0
#> [67] filelock_1.0.2 optimx_2022-4.30 BiocParallel_1.32.0
#> [70] rlang_1.0.6 pkgconfig_2.0.3 matrixStats_0.62.0
#> [73] bitops_1.0-7 evaluate_0.17 lattice_0.20-45
#> [76] purrr_0.3.5 GenomicAlignments_1.34.0 labeling_0.4.2
#> [79] bit_4.0.4 tidyselect_1.2.0 plyr_1.8.7
#> [82] magrittr_2.0.3 R6_2.5.1 generics_0.1.3
#> [85] DelayedArray_0.24.0 DBI_1.1.3 withr_2.5.0
#> [88] pillar_1.8.1 KEGGREST_1.38.0 RCurl_1.98-1.9
#> [91] tibble_3.1.8 crayon_1.5.2 utf8_1.2.2
#> [94] BiocFileCache_2.6.0 tzdb_0.3.0 rmarkdown_2.17
#> [97] progress_1.2.2 ExtDist_0.6-4 grid_4.2.1
#> [100] blob_1.2.3 digest_0.6.30 xtable_1.8-4
#> [103] httpuv_1.6.6 regioneR_1.30.0 numDeriv_2016.8-1.1
#> [106] munsell_0.5.0 bslib_0.4.0
[1] Raymond G Cavalcante, Maureen A Sartor, annotatr: genomic regions in context, Bioinformatics, Volume 33, Issue 15, 01 August 2017, Pages 2381–2383, https://doi.org/10.1093/bioinformatics/btx183
[2] Nathan C. Sheffield, Christoph Bock, LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor, Bioinformatics, Volume 32, Issue 4, 15 February 2016, Pages 587–589, https://doi.org/10.1093/bioinformatics/btv612