Creating Tallies from within R

In this vignette we show how to create tally files form bam files within an interactive R session (note that this is not a good idea for large datasets).

Dear Windows users, please be advised that the code in this vignette is not guaranteed to run under Windows because of an issue in the HDF5 implementation for that platform. Evaluation is disabled to avoid problems and you wil not see the plots / outputs for this vignette.

Preparations

First we load necessary packages and check for the example .bam files shipped with h5vc.

suppressPackageStartupMessages(library(h5vc))  # h5vc is needed
suppressPackageStartupMessages(library(rhdf5))  # rhdf5 is needed
suppressPackageStartupMessages(library(deepSNV))  # we use deepSNV for tallying
files <- c("NRAS.AML.bam", "NRAS.Control.bam")
bamFiles <- file.path(system.file("extdata", package = "h5vcData"), files)

Those bam files contain reads from a matched pair of samples from an AML patient focusing on the region containing the gene NRAS (Chromosome 1: 115,247,090-115,259,515).

We will now create the tally file and create the groups that represent the study and chromosome we want to work on. Before we do this we have to find out how big our datasets have to be in their genomic-position dimension, to do this we will look into the header of the bam files and extract the sequence length information.

chromdim <- sapply(scanBamHeader(bamFiles), function(x) x$targets)
colnames(chromdim) <- files
head(chromdim)
##   NRAS.AML.bam NRAS.Control.bam
## 1    249250621        249250621
## 2    243199373        243199373
## 3    198022430        198022430
## 4    191154276        191154276
## 5    180915260        180915260
## 6    171115067        171115067

Both files have the same header information and are fully compatible, so we can just pick one file and take the chromosome lengths from there.

chrom <- "1"
chromlength <- chromdim[chrom, 1]
study <- "/NRAS"
tallyFile <- file.path(tempdir(), "NRAS.tally.hfs5")
if (file.exists(tallyFile)) {
    file.remove(tallyFile)
}
h5createFile(tallyFile)
## [1] TRUE
group <- paste(study, chrom, sep = "/")
h5createGroup(tallyFile, study)  # create the toplevel group first
## [1] TRUE
h5createGroup(tallyFile, group)
## [1] TRUE
h5createDataset(tallyFile, paste(group, "Counts", sep = "/"), dims = c(12, 2, 
    2, chromlength), storage.mode = "integer", level = 9)  #Creating the Counts group for chromosome 1 with 12 bases, 2 samples, 2 strands and 249250621 positions
## [1] TRUE
h5createDataset(tallyFile, paste(group, "Deletions", sep = "/"), dims = c(2, 
    2, chromlength), storage.mode = "integer", level = 9)  #Creating the Deletions group for chromosome 1 with 2 samples, 2 strands and 249250621 positions
## [1] TRUE
h5createDataset(tallyFile, paste(group, "Coverages", sep = "/"), dims = c(2, 
    2, chromlength), storage.mode = "integer", level = 9)  #Creating the Coverages group for chromosome 1 with 2 samples, 2 strands and 249250621 positions
## [1] TRUE
h5createDataset(tallyFile, paste(group, "Reference", sep = "/"), dims = c(chromlength), 
    storage.mode = "integer", level = 9)  #Creating the Reference group for chromosome 1 with 249250621 positions
## [1] TRUE
h5ls(tallyFile)
##     group      name       otype  dclass                    dim
## 0       /      NRAS   H5I_GROUP                               
## 1   /NRAS         1   H5I_GROUP                               
## 2 /NRAS/1    Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1 Coverages H5I_DATASET INTEGER      2 x 2 x 249250621
## 4 /NRAS/1 Deletions H5I_DATASET INTEGER      2 x 2 x 249250621
## 5 /NRAS/1 Reference H5I_DATASET INTEGER              249250621

Note that we use the default layout for tally files with 12 slots for bases (based on c("A","C","G","T") times the 3 bins for sequencing cycle, i.e. start, middle, end ). The function bam2R from the deepSNV package, which we will use to create the tally actually does not stratify like this so we will leave the first and last 4 slots for bases empty for now.

Now we will attach the sampleData to the group.

sampleData <- data.frame(Sample = c("AML", "Control"), Column = c(1, 2), Patient = c("Pt1", 
    "Pt1"), Type = c("Case", "Control"), stringsAsFactors = FALSE)
setSampleData(tallyFile, group, sampleData)
getSampleData(tallyFile, group)
##   Column Patient  Sample    Type
## 1      1     Pt1     AML    Case
## 2      2     Pt1 Control Control

Note that we set the columns to be 1 and 2 respectively while within the tally file the values 0 and 1 are stored. setSampleData and getSampleData automatically remove / add 1 from the value. This is implemented to adress the fact that R counts 1-based while HDF5 internally counts 0-based.

Extracting tallies from the bam files

Now it is time to extract tally information from the bam files.

startpos <- 115247090
endpos <- 115259515
Counts <- lapply(bamFiles, function(bamf) {
    bam2R(file = bamf, chr = chrom, start = startpos, stop = endpos, head.clip = 10)  #we tell it to ignore the first and last 10 sequencing cycles
})

When we now have a look at the resulting matrix we can see that some transformations will be needed to make the data compatible with the h5vc tally file format. The good news is that we can get counts for both strands at all positions and even get deletions and insertion counts (we will ignore the latter for now). Unfortunately bam2R choses not to store the coverage separately and instead stored it implicitly in the sum of counts (technically the deletions have to be added in also here). This means we will have counts in the slot corresponding to the reference base which would break some of the downstream functions of h5vc. We will calculate the coverages by summing up the counts and deletions for each position and strand.

Counts[[1]][5000:5010, ]
##        A  T  C  G - N INS DEL HEAD TAIL QUAL a t c g _ n ins del head tail
##  [1,] 11  0  0  0 0 5   0   0    0    0  440 0 0 0 0 0 0   0   0    0    0
##  [2,] 12  0  0  0 0 4   0   0    0    0  480 0 0 0 0 0 0   0   0    0    0
##  [3,] 12  0  0  0 0 4   0   0    0    0  480 0 0 0 0 0 0   0   0    0    0
##  [4,] 13  0  0  0 0 6   0   0    3    0  520 0 0 0 0 0 0   0   0    0    0
##  [5,] 13  0  0  0 0 8   0   0    2    0  520 0 0 0 0 0 0   0   0    0    0
##  [6,]  0 14  0  0 0 7   0   0    0    0  560 0 0 0 0 0 0   0   0    0    0
##  [7,]  0  0  0 13 0 9   0   0    1    0  560 0 0 0 0 0 0   0   0    0    0
##  [8,]  0  0 15  0 0 9   0   0    2    0  600 0 0 0 0 0 0   0   0    0    0
##  [9,] 15  0  0  0 0 9   0   0    0    0  640 0 0 0 0 0 0   0   0    0    0
## [10,]  0 15  0  0 0 9   0   0    0    0  640 0 0 0 0 0 0   0   0    0    0
## [11,] 16  0  0  0 0 8   0   0    0    0  640 0 0 0 0 0 0   0   0    0    0
##       qual
##  [1,]    0
##  [2,]    0
##  [3,]    0
##  [4,]    0
##  [5,]    0
##  [6,]    0
##  [7,]    0
##  [8,]    0
##  [9,]    0
## [10,]    0
## [11,]    0
Coverages <- lapply(Counts, function(count) matrix(c(rowSums(count[, c("A", 
    "C", "G", "T", "DEL")]), rowSums(count[, c("a", "c", "g", "t", "del")])), 
    ncol = 2, byrow = FALSE, dimnames = list(NULL, c("Fwd", "Rev"))))
Coverages[[1]][5000:5010, ]
##       Fwd Rev
##  [1,]  11   0
##  [2,]  12   0
##  [3,]  12   0
##  [4,]  13   0
##  [5,]  13   0
##  [6,]  14   0
##  [7,]  13   0
##  [8,]  15   0
##  [9,]  15   0
## [10,]  15   0
## [11,]  16   0
Deletions <- lapply(Counts, function(count) count[, c("DEL", "del")])

Normally one would at this point load the reference sequence and mask out the counts for the reference base at each position. For simplicity, speed and since we want to illustrate comparative analyses mostly we will estimate the reference base as the base with the most counts in both strands and samples (potentially losing e.g. homozygous germline variants in the process).

Counts <- lapply(Counts, function(count) count[, c("A", "C", "G", "T", "a", 
    "c", "g", "t")])  # kick out all unnecessary columns
ref <- apply(Counts[[1]][, 1:4] + Counts[[1]][5:8] + Counts[[2]][, 1:4] + Counts[[2]][5:8], 
    1, which.max)
for (i in seq(length(ref))) {
    Counts[[1]][i, ref[i]] <- 0  #Set the forward strand in the first sample to zero
    Counts[[1]][i, (ref[i] + 4)] <- 0
    Counts[[2]][i, ref[i]] <- 0
    Counts[[2]][i, (ref[i] + 4)] <- 0  #Set the reverse strand in the first sample to zero
}
Reference <- ref - 1  #the tally file encodes the reference as A=0,C=1,G=2,T=3

Writing to the HSF5 file

Now we will write the data to the tally file.

for (sample in 1:2) {
    h5write(t(Counts[[sample]][, 1:4]), tallyFile, paste(group, "Counts", sep = "/"), 
        index = list(5:8, sample, 1, startpos:endpos))  #Sample One Forward Strand
    h5write(t(Counts[[sample]][, 5:8]), tallyFile, paste(group, "Counts", sep = "/"), 
        index = list(5:8, sample, 2, startpos:endpos))  #Sample One Reverse Strand

    h5write(Coverages[[sample]][, "Fwd"], tallyFile, paste(group, "Coverages", 
        sep = "/"), index = list(sample, 1, startpos:endpos))  #Sample One Forward Strand
    h5write(Coverages[[sample]][, "Rev"], tallyFile, paste(group, "Coverages", 
        sep = "/"), index = list(sample, 2, startpos:endpos))  #Sample One Reverse Strand

    h5write(Deletions[[sample]][, "DEL"], tallyFile, paste(group, "Deletions", 
        sep = "/"), index = list(sample, 1, startpos:endpos))  #Sample Two Forward Strand
    h5write(Deletions[[sample]][, "del"], tallyFile, paste(group, "Deletions", 
        sep = "/"), index = list(sample, 2, startpos:endpos))  #Sample Two Reverse Strand
}
h5write(Reference, tallyFile, paste(group, "Reference", sep = "/"), index = list(startpos:endpos))  #The Reference
h5ls(tallyFile)
##     group      name       otype  dclass                    dim
## 0       /      NRAS   H5I_GROUP                               
## 1   /NRAS         1   H5I_GROUP                               
## 2 /NRAS/1    Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1 Coverages H5I_DATASET INTEGER      2 x 2 x 249250621
## 4 /NRAS/1 Deletions H5I_DATASET INTEGER      2 x 2 x 249250621
## 5 /NRAS/1 Reference H5I_DATASET INTEGER              249250621

This very explicit step-by-step implementation can be streamlined quite a bit to prevent much of the code-duplication.

Checking if everything worked

We will use the h5dapply function provided by h5vc to extract the data again and have a look at it.

data <- h5dapply(filename = tallyFile, group = group, blocksize = 1e+08, range = c(startpos, 
    endpos))[[1]]  # we use a blocksize larger than the range to get all data in one block (extracted by [[1]])
str(data)
## List of 5
##  $ Counts      : int [1:12, 1:2, 1:2, 1:12425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Coverages   : int [1:2, 1:2, 1:12425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Deletions   : int [1:2, 1:2, 1:12425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Reference   : int [1:12425(1d)] 0 0 0 0 0 0 0 0 0 0 ...
##  $ h5dapplyInfo:List of 4
##   ..$ Blockstart: num 1.15e+08
##   ..$ Blockend  : num 1.15e+08
##   ..$ Datasets  :'data.frame':   4 obs. of  7 variables:
##   .. ..$ group   : chr [1:4] "/NRAS/1" "/NRAS/1" "/NRAS/1" "/NRAS/1"
##   .. ..$ name    : chr [1:4] "Counts" "Coverages" "Deletions" "Reference"
##   .. ..$ otype   : Factor w/ 7 levels "H5I_FILE","H5I_GROUP",..: 5 5 5 5
##   .. ..$ dclass  : chr [1:4] "INTEGER" "INTEGER" "INTEGER" "INTEGER"
##   .. ..$ dim     : chr [1:4] "12 x 2 x 2 x 249250621" "2 x 2 x 249250621" "2 x 2 x 249250621" "249250621"
##   .. ..$ DimCount: int [1:4] 4 3 3 1
##   .. ..$ PosDim  : int [1:4] 4 3 3 1
##   ..$ Group     : chr "/NRAS/1"

We will call variants within this gene now:

vars <- h5dapply(filename = tallyFile, group = group, blocksize = 1000, FUN = callVariantsPaired, 
    sampledata = getSampleData(tallyFile, group), cl = vcConfParams(returnDataPoints = TRUE), 
    range = c(startpos, endpos))
vars <- do.call(rbind, vars)
vars
##                     Chrom     Start       End Sample altAllele refAllele
## 115258090:115259089     1 115258747 115258747    AML         G         C
##                     caseCountFwd caseCountRev caseCoverageFwd
## 115258090:115259089           12           22              27
##                     caseCoverageRev controlCountFwd controlCountRev
## 115258090:115259089              41               0               0
##                     controlCoverageFwd controlCoverageRev
## 115258090:115259089                 53                 37

By cleverly selecting the example data we have found exactly one variant that seems ot be interesting and will now plot the region in question to also check if the mismatchPlot function will work with the tally data we created.

position <- vars$End[1]
windowsize <- 50
data <- h5dapply(filename = tallyFile, group = group, blocksize = 1e+08, range = c(position - 
    windowsize, position + windowsize))[[1]]
p <- mismatchPlot(data, getSampleData(tallyFile, group), samples = c("AML", 
    "Control"), windowsize = windowsize, position = position)
print(p)

plot of chunk unnamed-chunk-1

It would seem that we have found a reliable variant here.

Example of how to set the Reference based on a FASTA file

Below you can see the code used to fill in the Reference dataset in the example tally file, we do not run it here since we do not have the FASTA file available. This code could easily be adapted to replace the majority-vote reference calling used in the above examples.

library(h5vc)
library(rhdf5)
tallyFile <- system.file("extdata", "example.tally.hfs5", package = "h5vcData")
library(ShortRead)
reference <- readDNAStringSet("GRCh37.72.fa")
names(reference) <- sapply(strsplit(names(reference), " "), function(x) x[1])

regions <- list(`16` = c(2.9e+07, 3e+07), `22` = c(3.8e+07, 4e+07))

for (chrom in names(regions)) {
    start <- regions[[chrom]][1]
    end <- regions[[chrom]][2]
    ref <- Views(reference[[chrom]], start = start, end = end)
    ds <- encodeDNAString(ref[[1]])
    h5write(ds, tallyFile, paste("/ExampleStudy", chrom, "Reference", sep = "/"), 
        index = list(start:end))
}

Extending the tally data structure by including Insertions

The bam2R function from the deepSNV package gives us the number of insertion calls at each position (more precisely, between each position and the next one).

First, we have to create the dataset Insertions to hold our data and we will use the same layout as we used for Deletions.

h5createDataset(tallyFile, paste(group, "Insertions", sep = "/"), dims = c(2, 
    2, chromlength), storage.mode = "integer", level = 9)  #Creating the Deletions group for chromosome 1 with 2 samples, 2 strands and 249250621 positions
## [1] TRUE
h5ls(tallyFile)
##     group       name       otype  dclass                    dim
## 0       /       NRAS   H5I_GROUP                               
## 1   /NRAS          1   H5I_GROUP                               
## 2 /NRAS/1     Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1  Coverages H5I_DATASET INTEGER      2 x 2 x 249250621
## 4 /NRAS/1  Deletions H5I_DATASET INTEGER      2 x 2 x 249250621
## 5 /NRAS/1 Insertions H5I_DATASET INTEGER      2 x 2 x 249250621
## 6 /NRAS/1  Reference H5I_DATASET INTEGER              249250621

Next we will collate the insertion information from the tally returned by bam2R.

Counts <- lapply(bamFiles, function(bamf) {
    bam2R(file = bamf, chr = chrom, start = startpos, stop = endpos, head.clip = 10)  #we tell it to ignore the first and last 10 sequencing cycles
})
Insertions <- lapply(Counts, function(count) count[, c("INS", "ins")])  # kick out all unnecessary columns

Now we use the same code as above to also write the Insertions into the tally File.

for (sample in 1:2) {
    h5write(Insertions[[sample]][, "INS"], tallyFile, paste(group, "Insertions", 
        sep = "/"), index = list(sample, 1, startpos:endpos))  #Sample Two Forward Strand
    h5write(Insertions[[sample]][, "ins"], tallyFile, paste(group, "Insertions", 
        sep = "/"), index = list(sample, 2, startpos:endpos))  #Sample Two Reverse Strand
}
h5write(Reference, tallyFile, paste(group, "Reference", sep = "/"), index = list(startpos:endpos))  #The Reference
h5ls(tallyFile)
##     group       name       otype  dclass                    dim
## 0       /       NRAS   H5I_GROUP                               
## 1   /NRAS          1   H5I_GROUP                               
## 2 /NRAS/1     Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1  Coverages H5I_DATASET INTEGER      2 x 2 x 249250621
## 4 /NRAS/1  Deletions H5I_DATASET INTEGER      2 x 2 x 249250621
## 5 /NRAS/1 Insertions H5I_DATASET INTEGER      2 x 2 x 249250621
## 6 /NRAS/1  Reference H5I_DATASET INTEGER              249250621

We will now plot the mismatch plot of the variant again, adding the Insertions as an additional layer in the plot by using the geom_h5vc function (we specify fill="orange" so insertions should show up as orange boxes).

data <- h5dapply(filename = tallyFile, group = group, blocksize = 1e+08, range = c(position - 
    windowsize, position + windowsize))[[1]]
p <- mismatchPlot(data, getSampleData(tallyFile, group), samples = c("AML", 
    "Control"), windowsize = windowsize, position = position)
p <- p + geom_h5vc(data, getSampleData(tallyFile, group), samples = c("AML", 
    "Control"), windowsize = windowsize, position = position, dataset = "Insertions", 
    fill = "orange")
print(p)

plot of chunk secondPlot

The above code examplifies the ease with which we can extend plots from ggplot2 with additional layers of information.

Note that the region we plot here actually doesn't contain any Insertions since we do not see any orange blocks in this plot. For quality control this is still an interesting result since it confirms that our variant does not fall next to a small insertion or deletion (which could cause artefacts).