Abstract
A comprehensive guide to using the epialleleR package for calling the hypermethylated epiallele frequencies from next-generation sequencing dataThe cytosine DNA methylation is an important epigenetic mechanism for regulation of gene expression. Abnormal methylation is linked to several diseases, being for example the most common molecular lesion in cancer cell1. Multiple studies suggest that alterations in DNA methylation, despite occurring at a low mosaic level, may confer increased risk of cancer later in life2.
The cytosine methylation levels within relatively small regions of the human genome are thought to be often concordant, resulting in a limited number of distinct methylation patterns of short sequencing reads3. Due to the cell-to-cell variations in methylation, DNA purified from tissue samples contains a mix of hyper- and hypomethylated alleles with varying ratios that depend on the genomic region and tissue type.
Unsurprisingly, when the frequencies of hypermethylated epialleles are low (e.g. 1e-02 and lower) and cytosine methylation levels are averaged and reported using conventional algorithms, the identification of such hypermethylated epialleles becomes nearly impossible. In order to increase the sensitivity of DNA methylation analysis we have developed epialleleR
— an R package for calling hypermethylated variant epiallele frequencies (VEF).
epialleleR
is a fast and scalable solution for analysis of data obtained by next-generation sequencing of bisulfite-treated DNA samples. The minimum requirement for the input is a Binary Alignment Map (BAM) file containing sequencing reads. These reads can be obtained from either deep or ultra-deep sequencing, using either narrowly targeted gene panels (amplicon sequencing), larger methylation capture panels, or even whole-genome approaches.
epialleleR
can call variant epiallele frequencies (VEF) of hypermethylated alleles at the level of individual cytosines (generateCytosineReport
) or supplied genomic regions (generateBedReport
)generateBedEcdf
methodgenerateVcfReport
Currently epialleleR
runs in a single-thread mode only, however most of the processing time is spent on loading the BAM data (using Rsamtools
), which is difficult to improve by parallel processing. All the other operations are performed using optimised C++ functions, and usually take less time. Running time for complete task “BAM on disk -> CX report on disk” depends on the size of the BAM file, and the speed is usually within the range of 4-6 MB/s (or 25-50 thousand reads per second) for a single core of a relatively modern CPU (Intel(R) Core(TM) i7-7700).
Since BAM loading and preprocessing is the biggest bottleneck at the moment, major improvements are planned for the next upcoming release (v1.1): the dependency on Rsamtools
will be removed, and loading and preprocessing will be done within the same C++ function, hopefully aiding in much higher processing speed.
The epialleleR
package includes sample data, which was obtained using targeted sequencing. The description of assays and files is given below. All the genomic coordinates for external data files are according to GRCh38 reference assembly.
The samples of Human HCT116 DKO Non-Methylated (Zymo Research, cat # D5014-1), or Human HCT116 DKO Methylated (Zymo Research, cat # D5014-2) DNA4, or their mix were bisulfite-converted, and the BRCA1 gene promoter region was amplified using four pairs of primers. Amplicons were mixed, indexed and sequenced at Illumina MiSeq system. The related files are:
Name | Type | Description |
---|---|---|
amplicon000meth.bam | BAM | a subset of reads for non-methylated DNA sample |
amplicon010meth.bam | BAM | a subset of reads for a 1:9 mix of methylated and non-methylated DNA samples |
amplicon100meth.bam | BAM | a subset of reads for fully methylated DNA sample |
amplicon.bed | BED | genomic coordinates of four amplicons covering promoter area of BRCA1 gene |
amplicon.vcf.gz | VCF | a relevant subset of sequence variations |
amplicon.vcf.gz.tbi | tabix | tabix file for the amplicon.vcf.gz |
The tumour DNA was bisulfite-converted, fragmented and hybridized with custom-made probes covering promoter regions of 283 tumour suppressor genes (as described in 5). Libraries were sequenced using Illumina MiSeq system. The related files are:
Name | Type | Description |
---|---|---|
capture.bam | BAM | a subset of reads |
capture.bed | BED | genomic coordinates of capture target regions |
capture.vcf.gz | VCF | a relevant subset of sequence variations |
capture.vcf.gz.tbi | tabix | tabix file for the capture.vcf.gz |
As mentioned earlier, epialleleR
uses data stored in Binary Alignment Map (BAM) files as its input. It is a prerequisite that records in a BAM file contain an XM tag with the methylation call string — such files are produced by virtually any software tool for mapping and alignment of bisulfite sequencing reads (such as Bismark, BSMAP or Illumina DRAGEN Bio-IT Platform).
Please use the function help files for a full description of available parameters, as well as explanation of the function’s logic and output values.
All epialleleR
methods can load BAM data using the file path. However, if a file is very large and several reports need to be prepared, it is advised to use the preprocessBam
convenience function as shown below. This function is also used internally when a BAM file location string is supplied as an input for other epialleleR
methods.
preprocessBam
automatically determines if BAM is derived from single-end or paired-end sequencing. When the latter is the case, paired reads are merged so that the overlapping fragments of the second read are clipped (because the quality of the second read is usually lower than of the first). These merged reads are then processed as a single entity in all epialleleR
methods.
epialleleR
can generate conventional cytosine reports in a format, which is similar to the genome-wide cytosine report produced by the coverage2cytosine
Bismark module6.
Please note that generateCytosineReport
produces thresholded (VEF) report by default: methylated cytosines from reads that do not pass the threshold (hypomethylated reads) are counted as being unmethylated. In order to make a conventional cytosine report, use threshold.reads=FALSE
.
# data.table::data.table object for
# CpG VEF report
cg.vef.report <- generateCytosineReport(bam.data)
#> Already preprocessed BAM supplied as an input. Options 'min.mapq' and 'skip.duplicates' will have no effect.
#> Thresholding reads [0.003s]
#> Preparing cytosine report [0.023s]
head(cg.vef.report[order(meth+unmeth, decreasing=TRUE)])
#> rname strand pos context meth unmeth triad
#> 1: chr17 + 61864475 CG 7 9 NNN
#> 2: chr17 + 61864486 CG 10 6 NNN
#> 3: chr17 + 61864504 CG 9 7 NNN
#> 4: chr20 - 57267455 CG 13 1 NNN
#> 5: chr17 - 61863826 CG 0 13 NNN
#> 6: chr17 - 61863830 CG 0 13 NNN
# CpG cytosine report
cg.report <- generateCytosineReport(bam.data, threshold.reads=FALSE)
#> Already preprocessed BAM supplied as an input. Options 'min.mapq' and 'skip.duplicates' will have no effect.
#> Preparing cytosine report [0.024s]
head(cg.report[order(meth+unmeth, decreasing=TRUE)])
#> rname strand pos context meth unmeth triad
#> 1: chr17 + 61864475 CG 8 8 NNN
#> 2: chr17 + 61864486 CG 10 6 NNN
#> 3: chr17 + 61864504 CG 10 6 NNN
#> 4: chr20 - 57267455 CG 13 1 NNN
#> 5: chr17 - 61863826 CG 0 13 NNN
#> 6: chr17 - 61863830 CG 0 13 NNN
# CX cytosine report
cx.report <- generateCytosineReport(bam.data, threshold.reads=FALSE,
report.context="CX")
#> Already preprocessed BAM supplied as an input. Options 'min.mapq' and 'skip.duplicates' will have no effect.
#> Preparing cytosine report [0.041s]
head(cx.report[order(meth+unmeth, decreasing=TRUE)])
#> rname strand pos context meth unmeth triad
#> 1: chr17 + 61864338 CHG 1 25 NNN
#> 2: chr17 + 61864348 CHH 0 24 NNN
#> 3: chr17 + 61864364 CHH 0 24 NNN
#> 4: chr17 + 61864365 CHH 0 24 NNN
#> 5: chr17 + 61864373 CHH 0 24 NNN
#> 6: chr17 + 61864324 CHG 0 23 NNN
epialleleR
allows to make reports not only for individual cytosine bases, but also for a set of genomic regions. It is especially useful when the targeted methylation sequencing was used to produce reads (such as amplicon sequencing or hybridization capture using, e.g., Agilent SureSelect Target Enrichment Probes).
The amplicon sequencing principally differs from capture-based assays in that the coordinates of reads are known. Therefore, reads can be assigned to amplicons by their exact positions, while to the capture targets — by the overlap. For this, epialleleR
provides generic generateBedReport
function as well as two of its aliases, generateAmpliconReport
(for amplicon-based NGS) and generateCaptureReport
(for capture-based NGS).
# report for amplicon-based data
# matching is done by exact start or end positions plus/minus tolerance
amplicon.report <- generateAmpliconReport(
bam=system.file("extdata", "amplicon010meth.bam", package="epialleleR"),
bed=system.file("extdata", "amplicon.bed", package="epialleleR")
)
#> Reading BED file [0.049s]
#> Reading BAM file [0.168s]
#> Preprocessing BAM data:
#> Transforming sequences [0.008s]
#> Merging pairs [0.007s]
#> Thresholding reads [0.000s]
#> Preparing amplicon report [0.027s]
amplicon.report
#> seqnames start end width strand amplicon nreads+ nreads- VEF
#> 1: chr17 43125624 43126026 403 * CpG00-13 0 156 0.08333333
#> 2: chr17 43125270 43125640 371 * CpG14-31 0 61 0.11475410
#> 3: chr17 43125171 43125550 380 * CpG17-34 0 93 0.05376344
#> 4: chr17 43124861 43125249 389 * CpG33-49 0 84 0.10714286
#> 5: <NA> NA NA NA <NA> <NA> 60 46 0.13207547
# report for capture-based data
# matching is done by overlap
capture.report <- generateCaptureReport(
bam=system.file("extdata", "capture.bam", package="epialleleR"),
bed=system.file("extdata", "capture.bed", package="epialleleR")
)
#> Reading BED file [0.042s]
#> Reading BAM file [0.215s]
#> Preprocessing BAM data:
#> Transforming sequences [0.022s]
#> Merging pairs [0.017s]
#> Thresholding reads [0.003s]
#> Preparing capture report [1.190s]
head(capture.report)
#> seqnames start end width strand V4 nreads+ nreads- VEF
#> 1: chr1 3067647 3069703 2057 * PRDM16 2 1 1.0000000
#> 2: chr1 3651039 3653096 2058 * TP73 0 2 0.5000000
#> 3: chr1 3689153 3691202 2050 * TP73 0 2 1.0000000
#> 4: chr1 3696519 3698570 2052 * TP73 1 2 1.0000000
#> 5: chr1 6179609 6181670 2062 * CHD5 0 3 0.6666667
#> 6: chr1 13698869 13699064 196 * PRDM2 NA NA NA
# generateBedReport is a generic function for BED-guided reports
bed.report <- generateBedReport(
bam=system.file("extdata", "capture.bam", package="epialleleR"),
bed=system.file("extdata", "capture.bed", package="epialleleR"),
bed.type="capture"
)
#> Reading BED file [0.021s]
#> Reading BAM file [0.179s]
#> Preprocessing BAM data:
#> Transforming sequences [0.023s]
#> Merging pairs [0.020s]
#> Thresholding reads [0.002s]
#> Preparing capture report [0.018s]
identical(capture.report, bed.report)
#> [1] TRUE
As stated in the introduction, human genomic DNA regions often show bimodal methylation patterns. epialleleR
allows to visualize this information by plotting empirical cumulative distribution functions (eCDFs) for within- and out-of-context beta values.
The code below produces plots for the sequencing reads of control DNA samples. Note that within-the-context eCDF(0.5) values are very close to the expected 1-VEF values for the corresponding control DNA samples:
# First, let's visualise eCDFs for within- and out-of-context beta values
# for all four amplicons and unmatched reads. Note that within-the-context eCDF
# of 0.5 is very close to the expected 1-VEF value (0.1) for all amplicons
# produced from this 1:9 mix of methylated and non-methylated control DNA
# let's compute eCDF
amplicon.ecdfs <- generateBedEcdf(
bam=system.file("extdata", "amplicon010meth.bam", package="epialleleR"),
bed=system.file("extdata", "amplicon.bed", package="epialleleR"),
bed.rows=NULL
)
#> Reading BED file [0.017s]
#> Reading BAM file [0.148s]
#> Preprocessing BAM data:
#> Transforming sequences [0.006s]
#> Merging pairs [0.007s]
#> Computing ECDFs for within- and out-of-context per-read beta values [0.013s]
# there are 5 items in amplicon.ecdfs, let's plot all of them
par(mfrow=c(1,length(amplicon.ecdfs)))
# cycle through items
for (x in 1:length(amplicon.ecdfs)) {
# four of them have names corresponding to genomic regions of amplicon.bed
# fifth - NA for all the reads that don't match to any of those regions
main <- if (is.na(names(amplicon.ecdfs[x]))) "unmatched"
else names(amplicon.ecdfs[x])
# plotting eCDF for within-the-context per-read beta values (in red)
plot(amplicon.ecdfs[[x]]$context, col="red", verticals=TRUE, do.points=FALSE,
xlim=c(0,1), xlab="per-read beta value", ylab="cumulative density",
main=main)
# adding eCDF for out-of-context per-read beta values (in blue)
plot(amplicon.ecdfs[[x]]$out.of.context, add=TRUE, col="blue",
verticals=TRUE, do.points=FALSE)
}
# Second, let's compare eCDFs for within-the-context beta values for only one
# amplicon but all three sequenced samples: pure non-methylated DNA, 1:9 mix of
# methylated and non-methylated DNA, and fully methylated DNA
# our files
bam.files <- c("amplicon000meth.bam", "amplicon010meth.bam",
"amplicon100meth.bam")
# let's plot all of them
par(mfrow=c(1,length(bam.files)))
# cycle through items
for (f in bam.files) {
# let's compute eCDF
amplicon.ecdfs <- generateBedEcdf(
bam=system.file("extdata", f, package="epialleleR"),
bed=system.file("extdata", "amplicon.bed", package="epialleleR"),
# only the second amplicon
bed.rows=2, verbose=FALSE
)
# plotting eCDF for within-the-context per-read beta values (in red)
plot(amplicon.ecdfs[[1]]$context, col="red", verticals=TRUE, do.points=FALSE,
xlim=c(0,1), xlab="per-read beta value", ylab="cumulative density",
main=f)
# adding eCDF for out-of-context per-read beta values (in blue)
plot(amplicon.ecdfs[[1]]$out.of.context, add=TRUE, col="blue",
verticals=TRUE, do.points=FALSE)
}
It is known that sequence variants can affect the methylation status of a DNA7. The generateVcfReport
function calculates frequencies of single nucleotide variants (SNVs) within epialleles and tests for the association between SNV and epiallelic status using Fisher’s exact test. Base counts and the test’s p-values are included in the returned value.
In addition to BAM file location string or preprocessed BAM object, the function requires a location string for the Variant Call Format (VCF) file or the VCF object that was obtained using VariantAnnotation::readVcf
function. As VCF files can be extremely large, it is strongly advised to prefilter the VCF object by the relevant set of genomic regions, or specify such relevant set of regions as a bed
parameter when vcf
points to a VCF file location.
Please note, that the output report is currently limited to SNVs only.
# VCF report
vcf.report <- generateVcfReport(
bam=system.file("extdata", "amplicon010meth.bam", package="epialleleR"),
bed=system.file("extdata", "amplicon.bed", package="epialleleR"),
vcf=system.file("extdata", "amplicon.vcf.gz", package="epialleleR"),
# when VCF seqlevels are different from BED and BAM it is possible
# to convert them internally
vcf.style="NCBI"
)
#> Reading BED file [0.016s]
#> Reading VCF file [1.151s]
#> Reading BAM file [0.135s]
#> Preprocessing BAM data:
#> Transforming sequences [0.007s]
#> Merging pairs [0.007s]
#> Thresholding reads [0.000s]
#> Extracting base frequences [0.114s]
# NA values are shown for the C->T variants on the "+" and G->A on the "-"
# strands, because bisulfite conversion makes their counting impossible
head(vcf.report)
#> name seqnames range REF ALT M+Ref U+Ref M-Ref U-Ref M+Alt U+Alt M-Alt U-Alt SumRef
#> 1: rs546660277 chr17 43124874 A C 0 0 9 74 0 0 0 0 83
#> 2: rs574263814 chr17 43124891 G A 0 0 NA NA 0 0 NA NA 0
#> 3: rs8176076 chr17 43124935 G A 0 0 NA NA 0 0 NA NA 0
#> 4: rs535977743 chr17 43125016 C T NA NA 9 75 NA NA 0 0 84
#> 5: rs191784032 chr17 43125050 C A 0 0 9 74 0 0 0 1 83
#> 6: rs111956204 chr17 43125083 C A 0 0 9 75 0 0 0 0 84
#> SumAlt FEp+ FEp-
#> 1: 0 1 1
#> 2: 0 1 NA
#> 3: 0 1 NA
#> 4: 0 NA 1
#> 5: 1 1 1
#> 6: 0 1 1
# let's sort the report by increasing Fisher's exact test's p-values.
# the p-values are given separately for reads that map to the "+"
head(vcf.report[order(`FEp-`, na.last=TRUE)])
#> name seqnames range REF ALT M+Ref U+Ref M-Ref U-Ref M+Alt U+Alt M-Alt U-Alt SumRef
#> 1: rs187992882 chr17 43125347 C A 0 0 11 141 0 0 1 1 152
#> 2: rs539845430 chr17 43125475 T G 0 0 6 94 0 0 5 38 100
#> 3: rs112960339 chr17 43125524 C T NA NA 10 119 NA NA 0 1 129
#> 4: rs546660277 chr17 43124874 A C 0 0 9 74 0 0 0 0 83
#> 5: rs535977743 chr17 43125016 C T NA NA 9 75 NA NA 0 0 84
#> 6: rs191784032 chr17 43125050 C A 0 0 9 74 0 0 0 1 83
#> SumAlt FEp+ FEp-
#> 1: 2 1 0.1502419
#> 2: 43 1 0.3063393
#> 3: 1 NA 1.0000000
#> 4: 0 1 1.0000000
#> 5: 0 NA 1.0000000
#> 6: 1 1 1.0000000
# and to the "-" strand
head(vcf.report[order(`FEp+`, na.last=TRUE)])
#> name seqnames range REF ALT M+Ref U+Ref M-Ref U-Ref M+Alt U+Alt M-Alt U-Alt SumRef
#> 1: rs546660277 chr17 43124874 A C 0 0 9 74 0 0 0 0 83
#> 2: rs574263814 chr17 43124891 G A 0 0 NA NA 0 0 NA NA 0
#> 3: rs8176076 chr17 43124935 G A 0 0 NA NA 0 0 NA NA 0
#> 4: rs191784032 chr17 43125050 C A 0 0 9 74 0 0 0 1 83
#> 5: rs111956204 chr17 43125083 C A 0 0 9 75 0 0 0 0 84
#> 6: rs55680227 chr17 43125086 A C 0 0 9 74 0 0 0 0 83
#> SumAlt FEp+ FEp-
#> 1: 0 1 1
#> 2: 0 1 NA
#> 3: 0 1 NA
#> 4: 1 1 1
#> 5: 0 1 1
#> 6: 0 1 1
epialleleR
packageThe experimental data analysed using the package has not been published yet. The citation information will be updated in the nearest future.
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB
#> [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 parallel stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] GenomeInfoDb_1.28.0 IRanges_2.26.0 S4Vectors_0.30.0 BiocGenerics_0.38.0
#> [5] epialleleR_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] MatrixGenerics_1.4.0 Biobase_2.52.0 httr_1.4.2
#> [4] sass_0.4.0 bit64_4.0.5 jsonlite_1.7.2
#> [7] bslib_0.2.5.1 assertthat_0.2.1 highr_0.9
#> [10] BiocFileCache_2.0.0 blob_1.2.1 BSgenome_1.60.0
#> [13] GenomeInfoDbData_1.2.6 Rsamtools_2.8.0 yaml_2.2.1
#> [16] progress_1.2.2 pillar_1.6.1 RSQLite_2.2.7
#> [19] lattice_0.20-44 glue_1.4.2 digest_0.6.27
#> [22] GenomicRanges_1.44.0 XVector_0.32.0 htmltools_0.5.1.1
#> [25] Matrix_1.3-3 XML_3.99-0.6 pkgconfig_2.0.3
#> [28] biomaRt_2.48.0 zlibbioc_1.38.0 purrr_0.3.4
#> [31] BiocParallel_1.26.0 tibble_3.1.2 KEGGREST_1.32.0
#> [34] generics_0.1.0 ellipsis_0.3.2 cachem_1.0.5
#> [37] SummarizedExperiment_1.22.0 GenomicFeatures_1.44.0 magrittr_2.0.1
#> [40] crayon_1.4.1 memoise_2.0.0 evaluate_0.14
#> [43] fansi_0.4.2 data.table_1.14.0 tools_4.1.0
#> [46] prettyunits_1.1.1 hms_1.1.0 BiocIO_1.2.0
#> [49] lifecycle_1.0.0 matrixStats_0.58.0 stringr_1.4.0
#> [52] DelayedArray_0.18.0 AnnotationDbi_1.54.0 Biostrings_2.60.0
#> [55] compiler_4.1.0 jquerylib_0.1.4 rlang_0.4.11
#> [58] grid_4.1.0 RCurl_1.98-1.3 rjson_0.2.20
#> [61] rappdirs_0.3.3 VariantAnnotation_1.38.0 bitops_1.0-7
#> [64] rmarkdown_2.8 restfulr_0.0.13 DBI_1.1.1
#> [67] curl_4.3.1 R6_2.5.0 GenomicAlignments_1.28.0
#> [70] rtracklayer_1.52.0 knitr_1.33 dplyr_1.0.6
#> [73] fastmap_1.1.0 bit_4.0.4 utf8_1.2.1
#> [76] filelock_1.0.2 stringi_1.6.2 Rcpp_1.0.6
#> [79] vctrs_0.3.8 png_0.1-7 dbplyr_2.1.1
#> [82] tidyselect_1.1.1 xfun_0.23