Contents

1 Introduction to the bamsignals package

The goal of the bamsignals package is to load count data from bam files as easily and quickly as possible. A typical workflow without the bamsignals package requires to firstly load all reads in R (e.g. using the Rsamtools package), secondly process them and lastly convert them into counts. The bamsignals package optimizes this workflow by merging these steps into one using efficient C code, which makes the whole process easier and faster. Additionally, bamsignals comes with native support for paired end data.

1.1 Loading toy data

We will use the following libraries (which are all required for installing bamsignals).

library(GenomicRanges)
library(Rsamtools)
library(bamsignals)

In the following we will use a sorted and indexed bam file and a gene annotation.

bampath <- system.file("extdata", "randomBam.bam", package="bamsignals")
genes <- 
    get(load(system.file("extdata", "randomAnnot.Rdata", package="bamsignals")))
genes
## GRanges object with 49 ranges and 0 metadata columns:
##        seqnames         ranges strand
##           <Rle>      <IRanges>  <Rle>
##    [1]     chr1 [ 8294,  8861]      -
##    [2]     chr1 [ 6770,  7364]      +
##    [3]     chr1 [13121, 13727]      -
##    [4]     chr1 [ 1025,  1677]      +
##    [5]     chr1 [12022, 12623]      -
##    ...      ...            ...    ...
##   [45]     chr3   [ 542, 1156]      +
##   [46]     chr3   [1315, 1911]      -
##   [47]     chr3   [2129, 2701]      +
##   [48]     chr3   [3546, 4150]      -
##   [49]     chr3   [2100, 2669]      -
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

The chromosome names in the bam file and those in the GenomicRanges object need to match. Additionally, the bam file needs to be sorted and indexed. Note that bamsignals requires the bam index to be named like bam file with “.bai” suffix.

#sequence names of the GenomicRanges object
seqinfo(genes)
## Seqinfo object with 3 sequences from an unspecified genome; no seqlengths:
##   seqnames seqlengths isCircular genome
##   chr1             NA         NA   <NA>
##   chr2             NA         NA   <NA>
##   chr3             NA         NA   <NA>
#sequence names in the bam file
bf <- Rsamtools::BamFile(bampath)
seqinfo(bf)
## Seqinfo object with 3 sequences from an unspecified genome:
##   seqnames seqlengths isCircular genome
##   chr1          20086         NA   <NA>
##   chr2           5017         NA   <NA>
##   chr3           5043         NA   <NA>
#checking if there is an index
file.exists(gsub(".bam$", ".bam.bai", bampath))
## [1] TRUE

1.2 Counting reads in given ranges with bamCount()

1.2.1 Basic counting

Let’s count how many reads map to the promoter regions of our genes. Using the bamCount() function, this is straightforward.

proms <- GenomicRanges::promoters(genes, upstream=100, downstream=100)
counts <- bamCount(bampath, proms, verbose=FALSE)
str(counts)
##  int [1:49] 141 68 87 199 97 62 149 76 202 88 ...

The object counts is a vector of the same length as the number of ranges that we are analyzing, the i-th count corresponds to the i-th range.

1.2.2 Accounting for fragment length

With the bamCount() function a read is counted in a range if the 5’ end of the read falls in that range. This might be appropriate when analyzing DNase I hypersensitivity tags, however for ChIP-seq data the immunoprecipitated protein is normally located downstream with respect to the 5’ end of the sequenced reads. To correct for that, it is possible to count reads with a strand-specific shift, i.e. reads will be counted in a range if the shifted 5’ end falls in that range. Note that this shift will move reads mapped to the positive strand to the right and reads mapped to the negative strand to the left with respect to the reference orientation. The shift should correspond approximately to half of the average length of the fragments in the sequencing experiment.

counts <- bamCount(bampath, proms, verbose=FALSE, shift=75)
str(counts)
##  int [1:49] 126 81 91 192 105 74 143 79 161 107 ...

1.2.3 Counting on each strand separately

Sometimes it is necessary to consider the two genomic strands separately. This is achieved with the ss option (separate strands, or strand-specific), and depends also on the strand of the GenomicRanges object.

strand(proms)
## factor-Rle of length 49 with 33 runs
##   Lengths: 1 1 1 1 1 1 1 1 2 1 5 1 1 4 1 ... 2 1 1 1 3 1 1 1 2 3 1 1 1 1 2
##   Values : - + - + - + - + - + - + - + - ... - + - + - + - + - + - + - + -
## Levels(3): + - *
counts <- bamCount(bampath, proms, verbose=FALSE, ss=TRUE)
str(counts)
##  int [1:2, 1:49] 87 54 50 18 48 39 162 37 52 45 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2] "sense" "antisense"
##   ..$ : NULL

Now counts is a matrix with two rows, one for the sense strand, the other for the antisense strand. Note that the sense of a read is decided also by the region it falls into, so if both the region and the read are on the same strand the read is counted as a sense read, otherwise as an antisense read.

1.3 Read profiles for each region with bamProfile()

If you are interested in counting how many reads map to each base pair of your genes, the bamProfile() function might save you a day.

sigs <- bamProfile(bampath, genes, verbose=FALSE)
sigs
## CountSignals object with 49 signals
## [1] signal of width 568
## 0 0 0 0 0 0 0 0 0 1 ...
## [2] signal of width 595
## 0 1 1 0 0 0 0 0 1 0 ...
## [3] signal of width 607
## 0 0 1 0 0 0 0 0 1 1 ...
## [4] signal of width 653
## 2 0 1 1 5 1 0 5 2 2 ...
## [5] signal of width 602
## 0 0 1 0 2 0 0 2 0 0 ...
## ....

The CountSignals class is a read-only container for count vectors. Conceptually it is like a list of vectors, and in fact it can be immediately converted to that format.

#CountSignals is conceptually like a list
lsigs <- as.list(sigs)
stopifnot(length(lsigs[[1]])==length(sigs[1]))
#sapply and lapply can be used as if we were using a list
stopifnot(all(sapply(sigs, sum) == sapply(lsigs, sum)))

Similarly as for the count function, the CountSignals object has as many elements (called signals) as there are ranges, and the i-th signal corresponds to the i-th range.

stopifnot(all(width(sigs)==width(genes)))

1.3.1 Counting on each strand separately

As for the bamCount() function, also with bamProfile() the reads can be counted for each strand separately

sssigs <- bamProfile(bampath, genes, verbose=FALSE, ss=TRUE)
sssigs
## CountSignals object with 49 strand-specific signals
## [1] signal of width 568
## sense      0 0 0 0 0 0 0 0 0 1 ...
## antisense  0 0 0 0 0 0 0 0 0 0 ...
## [2] signal of width 595
## sense      0 0 0 0 0 0 0 0 0 0 ...
## antisense  0 1 1 0 0 0 0 0 1 0 ...
## [3] signal of width 607
## sense      0 0 1 0 0 0 0 0 1 0 ...
## antisense  0 0 0 0 0 0 0 0 0 1 ...
## [4] signal of width 653
## sense      1 0 1 1 4 1 0 3 1 1 ...
## antisense  1 0 0 0 1 0 0 2 1 1 ...
## [5] signal of width 602
## sense      0 0 0 0 1 0 0 2 0 0 ...
## antisense  0 0 1 0 1 0 0 0 0 0 ...
## ....

Now each signal is a matrix with two rows.

str(sssigs[1])
##  int [1:2, 1:568] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2] "sense" "antisense"
##   ..$ : NULL
#summing up the counts from the two strands is the same as using ss=FALSE
stopifnot(colSums(sssigs[1])==sigs[1])
#the width function takes into account that now the signals are strand-specific
stopifnot(width(sssigs)==width(sigs))
#the length function does not, a strand-specific signal is twice as long
stopifnot(length(sssigs[1])==2*length(sigs[1]))

Let’s summarize this with a plot

xlab <- "offset from start of the region"
ylab <- "counts per base pair (negative means antisense)"
main <- paste0("read profile of the region ", 
    seqnames(genes)[1], ":", start(genes)[1], "-", end(genes)[1])
plot(sigs[1], ylim=c(-max(sigs[1]), max(sigs[1])), ylab=ylab, xlab=xlab, 
    main=main, type="l")
lines(sssigs[1]["sense",], col="blue")
lines(-sssigs[1]["antisense",], col="red")
legend("topright", c("sense", "antisense", "both"), 
    col=c("blue", "red", "black"), lty=1)

1.3.2 Regions of the same width

In case our ranges have all the same width, a CountSignals object can be immediately converted into a matrix, or an array, with the alignSignals function

#The promoter regions have all the same width
sigs <- bamProfile(bampath, proms, ss=FALSE, verbose=FALSE)
sssigs <- bamProfile(bampath, proms, ss=TRUE, verbose=FALSE)

sigsMat <- alignSignals(sigs)
sigsArr <- alignSignals(sssigs)

The last dimension of the resulting array (or matrix) represents the different ranges, the second-last one represents the base pairs in each region, and in the strand-specific case, the first-one represents the strand of the signal. This can be changed by using the t() function (for matrices) or aperm() (for arrays).

#the dimensions are [base pair, region]
str(sigsMat)
##  int [1:200, 1:49] 0 0 1 0 0 0 0 0 0 1 ...
#the dimensions are [strand, base pair, region]
str(sigsArr)
##  int [1:2, 1:200, 1:49] 0 0 0 0 1 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:2] "sense" "antisense"
##   ..$ : NULL
##   ..$ : NULL
stopifnot(all(sigsMat == sigsArr["sense",,] + sigsArr["antisense",,]))

Computing the average read profile at promoters in proms is now straightforward

avgSig <- rowMeans(sigsMat)
avgSenseSig <- rowMeans(sigsArr["sense",,])
avgAntisenseSig <- rowMeans(sigsArr["antisense",,])
ylab <- "average counts per base pair"
xlab <- "distance from TSS"
main <- paste0("average profile of ", length(proms), " promoters")
xs <- -99:100
plot(xs, avgSig, ylim=c(0, max(avgSig)), xlab=xlab, ylab=ylab, main=main,
    type="l")
lines(xs, avgSenseSig, col="blue")
lines(xs, avgAntisenseSig, col="red")
legend("topright", c("sense", "antisense", "both"), 
    col=c("blue", "red", "black"), lty=1)

1.3.3 Binning counts

Very often it is better to count reads mapping to small regions instead of single base pairs. Bins are small non-overlapping regions of fixed size tiling a larger region. Instead of splitting your regions of interest into bins, it is easier and much more efficient to provide the binsize option to bamProfile().

binsize <- 20
binnedSigs <- bamProfile(bampath, proms, binsize=binsize)
## Processing /tmp/RtmpsHUJ3H/Rinst51f87c940648/bamsignals/extdata/randomBam.bam: I'M gOnNa cHaSe'em and PiLe'em aLl up!!!
stopifnot(all(width(binnedSigs)==ceiling(width(sigs)/binsize)))
binnedSigs
## CountSignals object with 49 signals
## [1] signal of width 10
## 4 13 22 8 16 3 10 18 21 26
## [2] signal of width 10
## 6 2 2 9 5 10 8 16 8 2
## [3] signal of width 10
## 9 14 12 6 6 6 3 7 12 12
## [4] signal of width 10
## 7 3 4 5 21 30 37 37 28 27
## [5] signal of width 10
## 7 9 16 8 13 5 7 10 9 13
## ....

In case the ranges’ widths are not multiples of the bin size, a warning will be issued and the last bin in those ranges will be smaller than the others (where “last” depends on the orientation of the region).

Binning means considering a signal at a lower resolution.

avgBinnedSig <- rowMeans(alignSignals(binnedSigs))
#the counts in the bin are the sum of the counts in each base pair
stopifnot(all.equal(colSums(matrix(avgSig, nrow=binsize)),avgBinnedSig))
#let's plot it
ylab <- "average counts per base pair"
plot(xs, avgSig, xlab=xlab, ylab=ylab, main=main, type="l")
lines(xs, rep(avgBinnedSig, each=binsize)/binsize, lty=2)
legend("topright", c("base pair count", "bin count"), lty=c(1, 2))

1.4 Read coverage with bamCoverage()

Instead of counting the 5’ end of each read, you may want to count how many reads overlap each base pair, you should check out the bamCoverage() function.

covSigs <- bamCoverage(bampath, genes, verbose=FALSE)
puSigs <- bamProfile(bampath, genes, verbose=FALSE)
xlab <- "offset from start of the region"
ylab <- "reads per base pair"
main <- paste0("read coverage and profile of the region ", seqnames(genes)[1],
    ":", start(genes)[1], "-", end(genes)[1])
plot(covSigs[1], ylim=c(0, max(covSigs[1])), ylab=ylab, xlab=xlab, main=main,
    type="l")
lines(puSigs[1], lty=2)
legend("topright", c("covering the base pair", "5' end maps to the base pair"), 
    lty=c(1,2))

2 Notes

2.1 Paired end data handling

All bamsignal count, profile and coverage methods discussed above support dealing with paired end sequencing data. With paired.end!="ignore", only the first read in a properly aligned pair is considered (this constraint translates into the bitmask 66 for the flag of a read). The strand of this read defines the strand of the pair. Considering only one read avoids counting both reads in read pair which may bias downstream analysis. For bamCoverage() and paired.end=="extend" the coverage computes how many fragments overlap each base pair. The actual length of a fragment is stored in the TLEN field of a bam file. For bamCount() and bamProfile() the option paired.end="midpoint" shifts the counting position from the 5’end to the midpoint of the fragment. For bamCoverage() the option paired.end="extend" considers only the first read in a properly aligned pair and comput

2.2 Exclude ambiguous reads with the mapq argument

Most mapping software (e.g. bwa, bowtie2) stores information about mapping confidence in the MAPQ field of a bam file. In general, it is recommended to exclude reads with bad mapping quality because their mapping location is ambiguous. In bowtie2, a mapping quality of 20 or less indicates that there is at least a 1 in 20 chance that the read truly originated elsewhere. In that case, the mapq argument is helpful.

counts <- bamCount(bampath, proms, mapq=20, verbose=FALSE)