preprocessBam {epialleleR}R Documentation

preprocessBam

Description

This function reads and preprocesses BAM file.

Usage

preprocessBam(
  bam.file,
  min.mapq = 0,
  min.baseq = 0,
  skip.duplicates = FALSE,
  nthreads = 1,
  verbose = TRUE
)

Arguments

bam.file

BAM file location string.

min.mapq

non-negative integer threshold for minimum read mapping quality (default: 0).

min.baseq

non-negative integer threshold for minimum nucleotide base quality (default: 0).

skip.duplicates

boolean defining if duplicate aligned reads should be skipped (default: FALSE). Option has no effect if duplicate reads were not marked by alignment software.

nthreads

non-negative integer for the number of HTSlib threads to be used during BAM file decompression (default: 1). Two threads (and usually no more than two) make sense for the files larger than 100 MB.

verbose

boolean to report progress and timings (default: TRUE).

Details

The function loads and preprocesses BAM file, saving time when multiple analyses are to be performed on large input files. Currently, HTSlib is used to read the data, therefore it is possible to speed up the loading by means of HTSlib threads.

This function is also called internally when BAM file location is supplied as an input for other 'epialleleR' methods.

'preprocessBam' currently accepts only BAM files that are derived from paired-end sequencing (create an issue if you need to process single-end BAM files). During preprocessing, paired reads are merged according to their base quality: nucleotide base with the highest value in the QUAL string is taken, unless its quality is less than 'min.baseq', which results in no information for that particular position ("-"/"N"). These merged reads are then processed as a single entity in all 'epialleleR' methods. Due to merging, overlapping bases in read pairs are counted only once, and the base with the highest quality is taken.

It is also a requirement currently that BAM file is sorted by QNAME instead of genomic location (i.e., "unsorted") to perform merging of paired-end reads. Error message is shown if it is sorted by genomic location, in this case please sort it by QNAME using 'samtools sort -n -o out.bam in.bam'.

Please also note that for all its methods, 'epialleleR' requires genomic strand (XG tag) and a methylation call string (XM tag) to be present in a BAM file - i.e., methylation calling must be performed after read mapping/alignment by your software of choice.

Value

data.table object containing preprocessed BAM data.

See Also

generateCytosineReport for methylation statistics at the level of individual cytosines, generateBedReport for genomic region-based statistics, generateVcfReport for evaluating epiallele-SNV associations, generateBedEcdf for analysing the distribution of per-read beta values, and 'epialleleR' vignettes for the description of usage and sample data.

Sequence Alignment/Map format specifications, duplicate alignments marking by Samtools and Illumina DRAGEN Bio IT Platform.

Examples

  capture.bam <- system.file("extdata", "capture.bam", package="epialleleR")
  bam.data    <- preprocessBam(capture.bam)

[Package epialleleR version 1.1.12 Index]