generateBedReport {epialleleR} | R Documentation |
'generateBedReport', 'generateAmpliconReport', 'generateCaptureReport' – these functions match BAM reads to the set of genomic locations and return the fraction of reads with an average methylation level passing an arbitrary threshold.
generateAmpliconReport( bam, bed, report.file = NULL, zero.based.bed = FALSE, match.tolerance = 1, threshold.reads = TRUE, threshold.context = c("CG", "CHG", "CHH", "CxG", "CX"), min.context.sites = 2, min.context.beta = 0.5, max.outofcontext.beta = 0.1, min.mapq = 0, min.baseq = 0, skip.duplicates = FALSE, nthreads = 0, gzip = FALSE, verbose = TRUE ) generateCaptureReport( bam, bed, report.file = NULL, zero.based.bed = FALSE, match.min.overlap = 1, threshold.reads = TRUE, threshold.context = c("CG", "CHG", "CHH", "CxG", "CX"), min.context.sites = 2, min.context.beta = 0.5, max.outofcontext.beta = 0.1, min.mapq = 0, min.baseq = 0, skip.duplicates = FALSE, nthreads = 0, gzip = FALSE, verbose = TRUE ) generateBedReport( bam, bed, report.file = NULL, zero.based.bed = FALSE, bed.type = c("amplicon", "capture"), match.tolerance = 1, match.min.overlap = 1, threshold.reads = TRUE, threshold.context = c("CG", "CHG", "CHH", "CxG", "CX"), min.context.sites = 2, min.context.beta = 0.5, max.outofcontext.beta = 0.1, min.mapq = 0, min.baseq = 0, skip.duplicates = FALSE, nthreads = 1, gzip = FALSE, verbose = TRUE )
bam |
BAM file location string OR preprocessed output of
|
bed |
Browser Extensible Data (BED) file location string OR object of
class |
report.file |
file location string to write the BED report. If NULL
(the default) then report is returned as a
|
zero.based.bed |
boolean defining if BED coordinates are zero based (default: FALSE). |
match.tolerance |
integer for the largest difference between read's and
BED |
threshold.reads |
boolean defining if sequence reads should be thresholded before counting reads belonging to variant epialleles (default: TRUE). Disabling thresholding is possible but makes no sense in this context as all the reads will be assigned to the variant epiallele, which will result in VEF==1 (in such case 'NA' VEF values are returned in order to avoid confusion). |
threshold.context |
string defining cytosine methylation context used for thresholding the reads:
This option has no effect when read thresholding is disabled. |
min.context.sites |
non-negative integer for minimum number of cytosines within the 'threshold.context' (default: 2). Reads containing fewer within-the-context cytosines are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled. |
min.context.beta |
real number in the range [0;1] (default: 0.5). Reads with average beta value for within-the-context cytosines below this threshold are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled. |
max.outofcontext.beta |
real number in the range [0;1] (default: 0.1). Reads with average beta value for out-of-context cytosines above this threshold are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled. |
min.mapq |
non-negative integer threshold for minimum read mapping quality (default: 0). Option has no effect if preprocessed BAM data was supplied as an input. |
min.baseq |
non-negative integer threshold for minimum nucleotide base quality (default: 0). Option has no effect if preprocessed BAM data was supplied as an input. |
skip.duplicates |
boolean defining if duplicate aligned reads should be skipped (default: FALSE). Option has no effect if preprocessed BAM data was supplied as an input OR duplicate reads were not marked by alignment software. |
nthreads |
non-negative integer for the number of HTSlib threads to be used during BAM file decompression (default: 1). 2 threads make sense for the files larger than 100 MB. Option has no effect if preprocessed BAM data was supplied as an input. |
gzip |
boolean to compress the report (default: FALSE). |
verbose |
boolean to report progress and timings (default: TRUE). |
match.min.overlap |
integer for the smallest overlap between read's and
BED |
bed.type |
character string for the type of assay that was used to produce sequencing reads:
|
Functions report hypermethylated variant epiallele frequencies (VEF) per genomic region of interest using BAM and BED files as input. Reads (for paired-end sequencing alignment files - read pairs as a single entity) are matched to genomic locations by exact coordinates ('generateAmpliconReport' or 'generateBedReport' with an option bed.type="amplicon") or minimum overlap ('generateCaptureReport' or 'generateBedReport' with an option bed.type="capture") – the former to be used for amplicon-based NGS data, while the latter – for the capture-based NGS data. The function's logic is explained below.
Let's suppose we have a BAM file with four reads, all mapped to the "+" strand of chromosome 1, positions 1-16. The genomic range is supplied as a parameter 'bed = as("chr1:1-100", "GRanges")'. Assuming the default values for the thresholding parameters (threshold.reads = TRUE, threshold.context = "CG", min.context.sites = 2, min.context.beta = 0.5, max.outofcontext.beta = 0.1), the input and results will look as following:
methylation string | threshold | explained |
...Z..x+.h..x..h. | below | min.context.sites < 2 (only one zZ base) |
...Z..z.h..x..h. | above | pass all criteria |
...Z..z.h..X..h. | below | max.outofcontext.beta > 0.1 (1XH / 3xXhH = 0.33) |
...Z..z.h..z-.h. | below | min.context.beta < 0.5 (1Z / 3zZ = 0.33) |
Only the second read will satisfy all of the thresholding criteria, leading to the following BED report (given that all reads map to chr1:+:1-16):
seqnames | start | end | width | strand | nreads+ | nreads- | VEF |
chr1 | 1 | 100 | 100 | * | 4 | 0 | 0.25 |
data.table
object containing VEF report for
BED GRanges
or NULL if report.file was
specified. If BAM file contains reads that would not match to any of BED
GRanges
, the last row in the report will
contain information on such reads (with seqnames, start and end equal to NA).
The report columns are:
seqnames – reference sequence name
start – start of genomic region
end – end of genomic region
width – width of genomic region
strand – strand
... – other columns that were present in BED or metadata columns of
GRanges
object
nreads+ – number of reads (pairs) mapped to the forward ("+") strand
nreads- – number of reads (pairs) mapped to the reverse ("-") strand
VEF – frequency of reads passing the threshold
preprocessBam
for preloading BAM data,
generateCytosineReport
for methylation statistics at the level
of individual cytosines, generateVcfReport
for evaluating
epiallele-SNV associations, generateBedEcdf
for analysing the
distribution of per-read beta values, and 'epialleleR' vignettes for the
description of usage and sample data.
GRanges
class for working with genomic ranges,
seqlevelsStyle
function for getting or setting
the seqlevels style.
# amplicon data amplicon.bam <- system.file("extdata", "amplicon010meth.bam", package="epialleleR") amplicon.bed <- system.file("extdata", "amplicon.bed", package="epialleleR") amplicon.report <- generateAmpliconReport(bam=amplicon.bam, bed=amplicon.bed) # capture NGS capture.bam <- system.file("extdata", "capture.bam", package="epialleleR") capture.bed <- system.file("extdata", "capture.bed", package="epialleleR") capture.report <- generateCaptureReport(bam=capture.bam, bed=capture.bed) # generateAmpliconReport and generateCaptureReport are just aliases # of the generateBedReport bed.report <- generateBedReport(bam=capture.bam, bed=capture.bed, bed.type="capture") identical(capture.report, bed.report)