h5vc – Scalabale nucleotide tallies using HDF5

In this document we will illustrate the use of the h5vc package for creating and analysing nucleotide tallies of next generation sequencing experiments.

Motivation

h5vc is a tool that is designed to provide researchers with a more intuitive and effective way of interacting with data from large cohorts of samples that have been sequenced with next generation sequencing technologies.

The challenges researchers face with the advent of massive sequencing efforts aimed at sequencing RNA and DNA of thousands of samples will need to be addressed now, before the flood of data becomes a real problem.

The effects of the infeasibility of handling the sequencing data of large cohorts with the current standards (BAM, VCF, BCF, GTF, etc.) have become apparent in recent publications that performed population level analyses of mutations in many cancer samples and work exclusively on the level of preprocessed variant calls stored in VCF/MAF files simply because there is no way to look at the data itself with reasonable resource usage (e.g. in Kandoth et. al 2013).

This challenge can be adressed by augmenting the available legacy formats typically used for sequencing analyses (SAM/BAM files) with an intermediate file format that stores only the most essential information and provides efficient access to the cohort level data whilst reducing the information loss relative to the raw alignments.

This file format will store nucleotide tallies rather than alignments and allow for easy and efficient real-time random access to the data of a whole cohort of samples. The details are described in the following section.

Nucleotide Tally Definition

The tally data structure proposed here consists of 4 datasets that are stored for each chromosome (or contig). Those datasets are:

Counts: A table that contains the number of observed mismatches at any combination of base, sample, strand and genomic position,
Coverages: A table that contains the number of reads overlapping at any combination of sample, strand and genomic position
Deletions: A Table that contains the number of observed deletions of bases at any combination of sample, strand and genomic position
Reference: A Table that contains the reference base against which the calls in the 'Deletions' and 'Counts' table were made.

We outline the basic layout of this set of tables here:

Name	Dimensions	Datatype
Counts	[ #bases, #samples, #strands, #positions ]	int
Coverages	[ #samples, #strands, #positions ]	int
Deletions	[ #samples, #strands, #positions ]	int
Reference	[ #positions ]	int

An HDF5 file has an internal structure that is similar to a file system, where groups are the directories and datasets are the files. The layout of the tally file is as follows:

The layout of a tally `HDF5` file.

A tally file can contain data from more than one study but each study will reside in a separte tree with a group named with the study-ID at its root and sub-groups for all the chromosomes / contigs that are present in the study. Attached to each of the chromosome groups are the 4 datasets described above.

Additionally each chromsome group stores sample data about the samples involved in the experiment (patientID, type, location in the sample dimension) as HDF5 attributes. Convenience functions for extracting the metadata are provided, see examples below.

A practical example

Before we get into the details of how to generate an HDF5 tally file for a set of sequencing experiments we will show some examples of the possible analyses one can perform on such a file. The tally file we will use is provided with the h5vcData package and if you have not installed this so far you should do so now.

source("http://bioconductor.org/biocLite.R")
biocLite("h5vcData")

The first thing we do is set up the session by loading the h5vc and rhdf5 packages and finding the location of the example tally file.

suppressPackageStartupMessages(library(h5vc))
suppressPackageStartupMessages(library(rhdf5))

tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )

We can inspect the data contained in this file with the h5ls function provided by rhdf5.

h5ls(tallyFile)

##               group         name       otype  dclass                   dim
## 0                 / ExampleStudy   H5I_GROUP                              
## 1     /ExampleStudy           16   H5I_GROUP                              
## 2  /ExampleStudy/16       Counts H5I_DATASET INTEGER 12 x 6 x 2 x 90354753
## 3  /ExampleStudy/16    Coverages H5I_DATASET INTEGER      6 x 2 x 90354753
## 4  /ExampleStudy/16    Deletions H5I_DATASET INTEGER      6 x 2 x 90354753
## 5  /ExampleStudy/16    Reference H5I_DATASET INTEGER              90354753
## 6     /ExampleStudy           22   H5I_GROUP                              
## 7  /ExampleStudy/22       Counts H5I_DATASET INTEGER 12 x 6 x 2 x 51304566
## 8  /ExampleStudy/22    Coverages H5I_DATASET INTEGER      6 x 2 x 51304566
## 9  /ExampleStudy/22    Deletions H5I_DATASET INTEGER      6 x 2 x 51304566
## 10 /ExampleStudy/22    Reference H5I_DATASET INTEGER              51304566

In the resulting output we can find the different groups and datasets present in the file and we can extract the relevant sample data attached to those groups in the following way.

sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
sampleData

##                      SampleFiles           Sample Column    Type  Patient
## 1     ../Input/PT8PrimaryDNA.bam    PT8PrimaryDNA      6    Case Patient8
## 2     ../Input/PT5PrimaryDNA.bam    PT5PrimaryDNA      2    Case Patient5
## 3     ../Input/PT5RelapseDNA.bam    PT5RelapseDNA      3    Case Patient5
## 4 ../Input/PT8PreLeukemiaDNA.bam PT8EarlyStageDNA      5    Case Patient8
## 5     ../Input/PT5ControlDNA.bam    PT5ControlDNA      1 Control Patient5
## 6     ../Input/PT8ControlDNA.bam    PT8ControlDNA      4 Control Patient8

The sampleData object is a data.frame that contains information about the samples whose nucleotide tallies are present in the file. We can modify this object (e.g. add new columns) and write it back to the file using the setSampleData function, but we must be aware that a certain set of columns have to be present (Sample, Patient, Column and Type).

sampleData$ClinicalVariable <- rnorm(nrow(sampleData))
setSampleData( tallyFile, "/ExampleStudy/16", sampleData )
sampleData

##                      SampleFiles           Sample Column    Type  Patient
## 1     ../Input/PT8PrimaryDNA.bam    PT8PrimaryDNA      6    Case Patient8
## 2     ../Input/PT5PrimaryDNA.bam    PT5PrimaryDNA      2    Case Patient5
## 3     ../Input/PT5RelapseDNA.bam    PT5RelapseDNA      3    Case Patient5
## 4 ../Input/PT8PreLeukemiaDNA.bam PT8EarlyStageDNA      5    Case Patient8
## 5     ../Input/PT5ControlDNA.bam    PT5ControlDNA      1 Control Patient5
## 6     ../Input/PT8ControlDNA.bam    PT8ControlDNA      4 Control Patient8
##   ClinicalVariable
## 1          -0.8263
## 2          -0.3814
## 3          -1.1798
## 4           0.7863
## 5          -0.8051
## 6          -0.9246

Now that we can find the sample metadata in the file it is time to extract some of the nuclotide tally data stored there. We can use two functions to achieve this, h5readBlock can be used to read a specified block of data along a given dimension (e.g. a region along the genomic position) and h5dapply can be used to apply a function in a blockwise fashion along a specified dimension (e.g. calculating coverage in bins of a certain size along the genomic position dimension).

We can read out a block of data in the following way:

data <- h5readBlock(
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c( "Coverages", "Counts" ),
  range = c(29000000,29001000)
  )
str(data)

## List of 3
##  $ Coverages   : int [1:6, 1:2, 1:1001] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Counts      : int [1:12, 1:6, 1:2, 1:1001] 0 0 0 0 0 0 0 0 0 0 ...
##  $ h5dapplyInfo:List of 4
##   ..$ Blockstart: num 2.9e+07
##   ..$ Blockend  : num 2.9e+07
##   ..$ Datasets  :'data.frame':   2 obs. of  7 variables:
##   .. ..$ group   : chr [1:2] "/ExampleStudy/16" "/ExampleStudy/16"
##   .. ..$ name    : chr [1:2] "Coverages" "Counts"
##   .. ..$ otype   : Factor w/ 7 levels "H5I_FILE","H5I_GROUP",..: 5 5
##   .. ..$ dclass  : chr [1:2] "INTEGER" "INTEGER"
##   .. ..$ dim     : chr [1:2] "6 x 2 x 90354753" "12 x 6 x 2 x 90354753"
##   .. ..$ DimCount: int [1:2] 3 4
##   .. ..$ PosDim  : int [1:2] 3 4
##   ..$ Group     : chr "/ExampleStudy/16"

Normally we will want to perform some kind of calculation with the data and store the results instead of the whole tally, for this purpose we can use the h5dapply function which has a very similar interface to h5readBlock but splits the range into blocks and applies a function provided by the user to those blocks, returning the result.

In the following example we calculate the coverage of the samples present in the example tally file by applying the binnedCoverage function in blocks along the genome.

coverages <- h5dapply(
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c( "Coverages" ),
  dims = c(3),
  range = c(29000000,29005000),
  blocksize = 500,
  FUN = binnedCoverage,
  sampleData
  )
options(scipen=10)
coverages <- do.call( rbind, coverages )
rownames(coverages) <- NULL #remove block-ids used as row-names
coverages

##    PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA
## 1          19420         21174          8646         18579
## 2          16693         18943          9512         17798
## 3          16119         21491          8456         18181
## 4          20103         27447          8627         19833
## 5          21262         29086         13926         23636
## 6          22655         29332          8267         26095
## 7          23034         27041         10821         22057
## 8          18551         28423         11903         26347
## 9          16753         29006         13082         18219
## 10         19368         21924         10383         22100
##    PT8EarlyStageDNA PT8PrimaryDNA    Start      End
## 1             17658         10470 29000000 29000499
## 2             14657          8528 29000500 29000999
## 3             17429         10226 29001000 29001499
## 4             20615          9684 29001500 29001999
## 5             25064         10976 29002000 29002499
## 6             18516         10461 29002500 29002999
## 7             17877         10854 29003000 29003499
## 8             22981         10478 29003500 29003999
## 9             17443         10455 29004000 29004499
## 10            25117         10849 29004500 29004999

Note that binnedCoverage takes an additional parameter sampleData which we provide to the function as well. Furthermore we specify the blocksize to be 500 bases and we specify the dims parameter to tell h5dapply along which dimension of the selected datasets (“Coverages” in this case) shall be processed (dimension number 3 is the genomic position in the “Coverages” dataset). The explicit specification of dims is only neccessary when we are not extracting the “Counts” dataset, otherwise it defaults to the genomic position.

In the same way we can perform variant calling by using h5dapply together with a variant calling function like callVariantsPaired or callVariantsSingle.

variants <- h5dapply(
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c( "Coverages", "Counts", "Deletions", "Reference" ),
  range = c(29950000,30000000),
  blocksize = 10000,
  FUN = callVariantsPaired,
  sampledata = sampleData,
  cl = vcConfParams(returnDataPoints = TRUE)
  )
variants <- do.call( rbind, variants )
variants$AF <- (variants$caseCountFwd + variants$caseCountRev) / (variants$caseCoverageFwd + variants$caseCoverageRev)
variants <- variants[variants$AF > 0.2,]
rownames(variants) <- NULL # remove rownames to save some space on output :D
variants

##   Chrom    Start      End           Sample altAllele refAllele
## 1    16 29983015 29983015    PT5PrimaryDNA         G         C
## 2    16 29983015 29983015    PT5RelapseDNA         G         C
## 3    16 29983015 29983015 PT8EarlyStageDNA         G         C
##   caseCountFwd caseCountRev caseCoverageFwd caseCoverageRev
## 1           12           13              29              27
## 2            3            4              10               9
## 3            8           14              19              30
##   controlCountFwd controlCountRev controlCoverageFwd controlCoverageRev
## 1               0               0                 11                 19
## 2               0               0                 11                 19
## 3               0               0                 13                 10
##       AF
## 1 0.4464
## 2 0.3684
## 3 0.4490

For details about the parameters and behaviour of callVariantsPaired have a look at the corresponding manual page ( i.e. ?callVariantsPaired ).

A function has to have a named parameter data as its first argument in order to be compatible with h5dapply, in this case data is a list of the same structure as the one returned by h5readBlock.

Once we have determined the location of an interesting variant, like 16:29983015-29983015:C/G in our case, we can create a mismatchPlot in the region around it to get a better feeling for the variant. To this end we use the mismatchPlot function on the tallies in the region in the following way:

windowsize <- 35
position <- variants$Start[1]
data <- h5readBlock(
    filename = tallyFile,
    group = paste( "/ExampleStudy", variants$Chrom[1], sep="/" ),
    names = c("Coverages","Counts","Deletions", "Reference"),
    range = c( position - windowsize, position + windowsize)
  )
patient <- sampleData$Patient[sampleData$Sample == variants$Sample[1]]
samples <- sampleData$Sample[sampleData$Patient == patient]
p <- mismatchPlot(
  data = data,
  sampledata = sampleData,
  samples = samples,
  windowsize = windowsize,
  position = position
  )
print(p)

plot of chunk mismatchPlotExample

This plot shows the region 35 bases up and downstream of the variant. It shows one panel for each sample associated with the patient that carries the variant (selected by the line sampleData$Sample[sampleData$Patient == patient]) and each panel is centered on the varian position in the x-axis and the y-axis encodes coverage and mismatches (negative values are on the reverse strand). The grey area is the coverage and the coloured boxes are mismatches. For more details on this plot see ?mismatchPlot.

The object returned by mismatchPlot is a ggplot object which can be manipulated in the same way as any other plot generated through a call to ggplot. We can for example apply a theme to the plot (see ?ggplot2::theme for a list of possible options).

print(p + theme(strip.text.y = element_text(family="serif", size=16, angle=0)))

plot of chunk mismatchPlotExamplesTheming

We end our practical example at this point and move on to sections detailing more involved analysis and, most importantly, the creation of tally files from bam files.

Creating tally files

Creating tally files is a time-consuming process and requires substantial compute power. It is a preprocessing step that is applied to the BAM files corresponding to the samples we want to tally and should be executed only once. In this way it represents an initial investment of time and resources that yields the HDF5 tally files which then allow for fast analysis and interactive exploration of the data in a much more intuitive way than raw BAM files.

We will demonstrate the creation of a HDF5 tally file by using two BAM files provided by the h5vcData package. We load some required packages and extract the locations of the two BAM files in question.

suppressPackageStartupMessages(library("h5vc"))
suppressPackageStartupMessages(library("rhdf5"))
suppressPackageStartupMessages(library("Rsamtools"))
files <- c("NRAS.AML.bam", "NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)

Now bamFiles contains the paths to our two BAM files (an AML and a control sample). These files contain reads overlapping the NRAS gene (Chromosome 1: 115,247,090-115,259,515). in samples from a human patients suffering from AML, we have one AML sample and a matching control.

We will now create the tally file and create the groups that represent the study and chromosome we want to work on. Before we do this, we need to find out how big our datasets have to be in their genomic-position dimension, to do this we will look into the header of the bam files and extract the sequence length information.

chromdim <- sapply( scanBamHeader(bamFiles), function(x) x$targets )
colnames(chromdim) <- files
head(chromdim)

##   NRAS.AML.bam NRAS.Control.bam
## 1    249250621        249250621
## 2    243199373        243199373
## 3    198022430        198022430
## 4    191154276        191154276
## 5    180915260        180915260
## 6    171115067        171115067

Both files have the same header information and are fully compatible, so we can just pick one file and take the chromosome lengths from there. Note that although we will only tally the NRAS gene we still instantiate the datasets in the tally file with the full chromosome length so that the index along the genomic position axis corresponds directly to the position in the genome (the internal compression of HDF5 will take care of the large blocks of zeros so that the effective filesize is similar to what it would be if we created the datasets to only hold the NRAS gene region).

chrom <- "1"
chromlength <- chromdim[chrom,1]
study <- "/NRAS"
tallyFile <- file.path( tempdir(), "NRAS.tally.hfs5" )
if( file.exists(tallyFile) ){
  file.remove(tallyFile)
}
if( prepareTallyFile( tallyFile, study, chrom, chromlength, nsamples = 2 ) ){
  h5ls(tallyFile)
}else{
  message( paste( "Preparation of:", tallyFile, "failed" ) )
}

##     group      name       otype  dclass                    dim
## 0       /      NRAS   H5I_GROUP                               
## 1   /NRAS         1   H5I_GROUP                               
## 2 /NRAS/1    Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1 Coverages H5I_DATASET INTEGER      2 x 2 x 249250621
## 4 /NRAS/1 Deletions H5I_DATASET INTEGER      2 x 2 x 249250621
## 5 /NRAS/1 Reference H5I_DATASET INTEGER              249250621

Since datasets are stored in HDF5 files as matrices without dimension names we need to create a separate object (a data.frame in this case) to hold sample metadata that tells us which sample corresponds to which slots in the matrix and also stores additional usefull information about the samples.

sampleData <- data.frame(
  Sample = c("AML","Control"),
  Column = c(1,2),
  Patient = c("Pt1","Pt1"),
  Type = c("Case","Control"),
  stringsAsFactors = FALSE
  )
group <- paste( study, chrom, sep = "/" )
setSampleData( tallyFile, group, sampleData )
getSampleData( tallyFile, group )

##   Column Patient  Sample    Type
## 1      1     Pt1     AML    Case
## 2      2     Pt1 Control Control

Note a little complication that derives from the fact that R indexes arrays in a 1-based manner, while HDF5 internally does it 0-based (like, e.g. C). We set the columns to be 1 and 2, respectively, while within the tally file the values 0 and 1 are stored. The functions setSampleData and getSampleData automatically remove / add 1 from the value.

Extracting tallies from the bam files

Now it is time to extract tally information from the bam file. We use the high-level function applyTallies to do this for us (have a look at the code of that function to see what the separate steps are). We have to provide the reference sequence separately and we can do this by passing a DNAString object into applyTallies which we can extract from BSgenome.Hsapiens.UCSC.hg19. If we do not provide a reference a simple majority vote among the tallies will be used.

suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg19))
startpos <- 115247090
endpos <- 115259515
theData <- applyTallies( bamFiles, chrom = chrom, start = startpos, stop = endpos, ncycles = 10, reference = BSgenome.Hsapiens.UCSC.hg19[["chr1"]][startpos:endpos] ) # the first and last 10 sequencing cycles are called unreliable
str(theData)

## List of 4
##  $ Counts   : num [1:12, 1:2, 1:2, 1:12426] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 4
##   .. ..$ : chr [1:12] "A.front" "C.front" "G.front" "T.front" ...
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "+" "-"
##   .. ..$ : NULL
##  $ Coverages: int [1:2, 1:2, 1:12426] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 3
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "+" "-"
##   .. ..$ : NULL
##  $ Deletions: int [1:2, 1:2, 1:12426] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 3
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "+" "-"
##   .. ..$ : NULL
##  $ Reference: num [1:12426] 3 3 0 3 2 0 1 3 0 0 ...

The resulting object is a list with one slot per dataset using the same layout that you will get from calls to h5readBlock or h5dapply.

We use the rhdf5::h5write function to write our data to the tally file. (See the function documentation for more information.)

h5write( theData$Counts, tallyFile, paste( group, "Counts", sep = "/" ), index = list( NULL, NULL, NULL, startpos:endpos ) )
h5write( theData$Coverages, tallyFile, paste( group, "Coverages", sep = "/" ), index = list( NULL, NULL, startpos:endpos ) )
h5write( theData$Deletions, tallyFile, paste( group, "Deletions", sep = "/" ), index = list( NULL, NULL, startpos:endpos ) )
h5write( theData$Reference, tallyFile, paste( group, "Reference", sep = "/" ), index = list( startpos:endpos ) )
h5ls(tallyFile)

##     group      name       otype  dclass                    dim
## 0       /      NRAS   H5I_GROUP                               
## 1   /NRAS         1   H5I_GROUP                               
## 2 /NRAS/1    Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1 Coverages H5I_DATASET INTEGER      2 x 2 x 249250621
## 4 /NRAS/1 Deletions H5I_DATASET INTEGER      2 x 2 x 249250621
## 5 /NRAS/1 Reference H5I_DATASET INTEGER              249250621

Checking if everything worked

We will use the h5readBlock function provided by h5vc to extract the data again and have a look at it.

data <- h5readBlock(
    filename = tallyFile,
    group = group,
    range = c(startpos, endpos)
  )
str(data)

## List of 5
##  $ Counts      : int [1:12, 1:2, 1:2, 1:12426] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Coverages   : int [1:2, 1:2, 1:12426] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Deletions   : int [1:2, 1:2, 1:12426] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Reference   : int [1:12426(1d)] 3 3 0 3 2 0 1 3 0 0 ...
##  $ h5dapplyInfo:List of 4
##   ..$ Blockstart: num 115247090
##   ..$ Blockend  : num 115259515
##   ..$ Datasets  :'data.frame':   4 obs. of  7 variables:
##   .. ..$ group   : chr [1:4] "/NRAS/1" "/NRAS/1" "/NRAS/1" "/NRAS/1"
##   .. ..$ name    : chr [1:4] "Counts" "Coverages" "Deletions" "Reference"
##   .. ..$ otype   : Factor w/ 7 levels "H5I_FILE","H5I_GROUP",..: 5 5 5 5
##   .. ..$ dclass  : chr [1:4] "INTEGER" "INTEGER" "INTEGER" "INTEGER"
##   .. ..$ dim     : chr [1:4] "12 x 2 x 2 x 249250621" "2 x 2 x 249250621" "2 x 2 x 249250621" "249250621"
##   .. ..$ DimCount: int [1:4] 4 3 3 1
##   .. ..$ PosDim  : int [1:4] 4 3 3 1
##   ..$ Group     : chr "/NRAS/1"

We will call variants within this gene now:

vars <- h5dapply(
    filename = tallyFile,
    group = group,
    blocksize = 1000,
    FUN = callVariantsPaired,
    sampledata = getSampleData(tallyFile,group),
    cl = vcConfParams(returnDataPoints=TRUE),
    range = c(startpos, endpos)
  )
vars <- do.call(rbind, vars)
vars

##                     Chrom     Start       End Sample altAllele refAllele
## 115258090:115259089     1 115258747 115258747    AML         G         C
##                     caseCountFwd caseCountRev caseCoverageFwd
## 115258090:115259089            9           22              29
##                     caseCoverageRev controlCountFwd controlCountRev
## 115258090:115259089              49               0               0
##                     controlCoverageFwd controlCoverageRev
## 115258090:115259089                 59                 47

By cleverly selecting the example data we have found exactly one variant that seems ot be interesting and will now plot the region in question to also check if the mismatchPlot function will work with the tally data we created.

position <- vars$End[1]
windowsize <- 50
data <- h5readBlock(
    filename = tallyFile,
    group = group,
    range = c(position - windowsize, position + windowsize)
  )
p <- mismatchPlot( data, getSampleData(tallyFile,group), samples = c("AML","Control"),windowsize=windowsize,position=position )

print(p)

plot of chunk plottingVariantEx

Mutation Spectrum Analysis

We can also easily perform mutation spectrum analysis by using the function mutationSpectrum which works on a set of variant calls in a data.frame form as it would be produced by a call to e.g. callVariantsPaired and a tallyFile parameter specifying hte location of a tally file as well as a context parameter. The context parameter specifies how much sequence context should be taken into account for the mutation spectrum. An example with context 1 (i.e. one base up- and one base downstream of the variant) is shown below.

tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
data( "example.variants", package = "h5vcData" ) #example variant calls
head(variantCalls)

##   Chrom    Start      End        Sample altAllele refAllele caseCountFwd
## 1    16 29008219 29008219 PT5PrimaryDNA         T         G            2
## 2    16 29020181 29020181 PT5PrimaryDNA         A         C            3
## 3    16 29037593 29037593 PT8PrimaryDNA         G         C            2
## 4    16 29040237 29040237 PT8PrimaryDNA         C         A            2
## 5    16 29069931 29069931 PT5PrimaryDNA         G         A            2
## 6    16 29102402 29102402 PT5PrimaryDNA         T         C            2
##   caseCountRev caseCoverageFwd caseCoverageRev controlCountFwd
## 1            2              26              20               0
## 2            2              18              20               0
## 3            2              22              17               0
## 4            2              20              17               0
## 5            2              42              39               0
## 6            2              25              34               0
##   controlCountRev controlCoverageFwd controlCoverageRev
## 1               0                 21                 22
## 2               0                 16                 17
## 3               0                 21                 12
## 4               0                 19                 11
## 5               0                 37                 34
## 6               0                 21                 35

ms = mutationSpectrum( variantCalls, tallyFile, "/ExampleStudy" )
head(ms)

##   refAllele altAllele        Sample Prefix Suffix MutationType Context
## 1         T         T PT5PrimaryDNA      A      A          C>A     TGT
## 2         C         A PT5PrimaryDNA      A      A          C>A     ACA
## 3         C         G PT8PrimaryDNA      C      C          C>G     CCC
## 4         C         A PT8PrimaryDNA      G      C          T>G     GAC
## 5         C         C PT5PrimaryDNA      C      C          T>C     GAG
## 6         C         T PT5PrimaryDNA      G      C          C>T     GCC

We can see the structure of the variantCalls object, which is simply a data.frame, this is the return value of a call to callVariantsPaired. The mutation spectrum is also a data.frame. You can find explanations of those data structures by looking at ?mutationSpectrum and ?callVariantsPaired.

We can plot the mutation spectrum with the plotMutationSpectrum function. This function also returns a ggplot object which can be manipulated by adding themes etc.

plotMutationSpectrum(ms) + theme(
  strip.text.y = element_text(angle=0, size=10),
  axis.text.x = element_text(size = 7),
  axis.text.y = element_text(size = 10)) + scale_y_continuous(breaks = c(0,5,10,15))

## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.

plot of chunk plottingMS

Parallelization

In this section we will cover some ofthe aspects of parallelization. Most notably we will talk about parallelizing the tallying step. Since this step is computationally intenisive there is much to gain from parallelizing it.

The simplest way to parallelize is by using multicore processing on the same machine and h5vc supports both parallel tallying and parallel reading from a tally file.

When we go back to the code used to generate the tally files in the NRAS example from above an obvious possibility for parallelization is per sample.

suppressPackageStartupMessages(library(BiocParallel))
serial.time <- system.time(theData <- applyTallies(bamFiles, chrom = chrom, 
    start = startpos, stop = endpos, ncycles = 10))
multicore.time <- system.time(theData <- applyTallies(bamFiles, chrom = chrom, 
    start = startpos, stop = endpos, ncycles = 10, BPPARAM = MulticoreParam()))
serial.time

##    user  system elapsed 
##   0.356   0.000   0.357

multicore.time

##    user  system elapsed 
##   0.428   0.248   0.342

The applyTallies function can be provided with the BPPARAM parameter which is handed down to the bplapply function used to tally the files. (See ?bplapply for examples).

The performance gains (or losses) of parallel tallying and also parallel reading form a tally file are dependent on your system and it makes sense to try some timing first before commiting to a parallel execution set-up. If you are on a cluster with a powerfull file server or raid cluster the gains can be big, whereas with a local single hard-disk you might actually lose time by trying parallel execution. (As is likely the case on the machine that the vignette was build :D)

Let's revisit the coverage example from before, and compare runtimes of the sequential and parallel versions. Note that we can parallelize all calls to h5dapply since by definition the results of the separate blocks can not depend on each other.

suppressPackageStartupMessages(library(BiocParallel))
system.time(
  coverages_serial <- h5dapply(
    filename = tallyFile,
    group = "/ExampleStudy/16",
    names = c( "Coverages" ),
    dims = c(3),
    range = c(29000000,29050000),
    blocksize = 1000,
    FUN = binnedCoverage,
    sampleData
  )
)

##    user  system elapsed 
##   0.768   0.000   0.772

system.time(
  coverages_parallel <- h5dapply(
    filename = tallyFile,
    group = "/ExampleStudy/16",
    names = c( "Coverages" ),
    dims = c(3),
    range = c(29000000,29050000),
    blocksize = 1000,
    FUN = binnedCoverage,
    sampleData,
    BPPARAM = MulticoreParam()
  )
)

##    user  system elapsed 
##   0.596   0.192   0.333

We can observer some speed-up here, but it is not extremely impressive.

Using Clusters

For large datasets it makes sense to do the tallying on a cluster and parallelize not only by sample but also by genomic position (usually in bins of some megabases). In order to achieve this h5vc provides the batchTallies function which has an instance of batchTallyParam as it's input. batchTallyParam is a list of configuration parameters that influence the behaviour of batchTallies.

#instantiating a batch tally param for tallying the NRAS example files
files <- c("NRAS.AML.bam", "NRAS.Control.bam")
startpos <- 115247090
endpos <- 115259515
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
btp <- batchTallyParam(bamFiles = bamFiles, destination = "tally.file.name.hfs5", group = "/NRAS/1", chrom="1", start=startpos, stop=endpos)
btp

## $bamFiles
## [1] "/home/biocbuild/bbs-2.14-bioc/R/library/h5vcData/extdata/NRAS.AML.bam"    
## [2] "/home/biocbuild/bbs-2.14-bioc/R/library/h5vcData/extdata/NRAS.Control.bam"
## 
## $destination
## [1] "tally.file.name.hfs5"
## 
## $group
## [1] "/NRAS/1"
## 
## $chrom
## [1] "1"
## 
## $start
## [1] 115247090
## 
## $stop
## [1] 115259515
## 
## $blocksize
## [1] 100000
## 
## $registryDir
## [1] "/tmp/RtmpRBRfaj"
## 
## $resources
## $resources$queue
## [1] "research-rh6"
## 
## $resources$memory
## [1] "4000"
## 
## $resources$ncpus
## [1] "4"
## 
## $resources$walltime
## [1] "90:00"
## 
## 
## $q
## [1] 25
## 
## $ncycles
## [1] 0
## 
## $max.depth
## [1] 1000000
## 
## $reference
## NULL
## 
## $sleep
## [1] 5

For an in-depth explanation of the different parameters check the help page at ?batchTallyParam.

Please also note that you will need a correctly configured installation of BatchJobs in order to use this functionality which, depending on the type of cluster you are on, might include a .BatchJobs.R file in your working directory and a template file defining cluster functions. I will paste my configuration files below but you will have to adapt them in orde to use the batchTallies function.

This is the example configuration I use.

cluster.functions <- makeClusterFunctionsLSF("/home/pyl/batchjobs/lsf.tmpl")
mail.start <- "first"
mail.done <- "last"
mail.error <- "all"
mail.from <- "<paul.theodor.pyl@embl.de>"
mail.to <- "<pyl@embl.de>"
mail.control <- list(smtpServer="smtp.embl.de")

For explanations of how to customize this have a look at the BatchJobs documentation here.

The important part is the first line in which we specify that LSF shall be used. The call to makeClusterFunctionsLSF has one parameter specifying a template file for the cluster calls. This template file has the following content.

## Default resources can be set in your .BatchJobs.R by defining the variable 
## 'default.resources' as a named list.

## remove everthing in [] if your cluster does not support arrayjobs
#BSUB-J <%= job.name %>[1-<%= arrayjobs %>]         # name of the job / array jobs
#BSUB-o <%= log.file %>                             # output is sent to logfile, stdout + stderr by default
#BSUB-n <%= resources$ncpus %>                      # Number of CPUs on the node
#BSUB-q <%= resources$queue %>                      # Job queue
#BSUB-W <%= resources$walltime %>                   # Walltime in minutes
#BSUB-M <%= resources$memory %>                     # Memory requirements in Kbytes

# we merge R output with stdout from LSF, which gets then logged via -o option
R CMD BATCH --no-save --no-restore "<%= rscript %>" /dev/stdout

Once this setup is functional we can test it with the following little script (you might have to change your resources, e.g. the queue name etc.).

library("BiocParallel")
library("BatchJobs")
cf <- makeClusterFunctionsLSF("/home/pyl/batchjobs/lsf.tmpl")
bjp <- BatchJobsParam( cluster.functions=cf, resources = list("queue" = "medium_priority", "memory"="4000", "ncpus"="4", walltime="00:30") )
bplapply(1:10, sqrt)
bplapply(1:10, sqrt, BPPARAM=bjp)

With the fully configured batch system you can then start tallying on the cluster by using the batchTallies function which you provide with the batchTallyParam object we created earlier.

batchTallies(confList = btp)