h5vc – Scalabale nucleotide tallies using HDF5

In this document we will illustrate the use of the h5vc package for creating and analysing nucleotide tallies of next generation sequencing experiments.

Motivation

h5vc is a tool that is designed to provide researchers with a more intuitive and effective way of interacting with data from large cohorts of samples that have been sequenced with next generation sequencing technologies.

The challenges researchers face with the advent of massive sequencing efforts aimed at sequencing RNA and DNA of thousands of samples will need to be addressed now, before the flood of data becomes a real problem.

The effects of the infeasibility of handling the sequencing data of large cohorts with the current standards (BAM, VCF, BCF, GTF, etc.) have become apparent in recent publications that performed population level analyses of mutations in many cancer samples and work exclusively on the level of preprocessed variant calls stored in VCF/MAF files simply because there is no way to look at the data itself with reasonable resource usage (e.g. in Kandoth et. al 2013).

This challenge can be adressed by augmenting the available legacy formats typically used for sequencing analyses (SAM/BAM files) with an intermediate file format that stores only the most essential information and provides efficient access to the cohort level data whilst reducing the information loss relative to the raw alignments.

This file format will store nucleotide tallies rather than alignments and allow for easy and efficient real-time random access to the data of a whole cohort of samples. The details are described in the following section.

Nucleotide Tally Definition

The tally data structure proposed here consists of 5 datasets that are stored for each chromosome (or contig). Those datasets are: * Counts: A table that contains the number of observed mismatches at any combination of base, sample, strand and genomic position, * Coverages: A table that contains the number of reads overlapping at any combination of sample, strand and genomic position * Deletions: A Table that contains the number of observed deletions of bases at any combination of sample, strand and genomic position * Insertions: A Table that contains the number of observed insertions of bases at any combination of sample, strand and genomic position (showing insertions following the position) * Reference: A Table that contains the reference base against which the calls in the ‘Deletions’ and ‘Counts’ table were made.

We outline the basic layout of this set of tables here:

Name Dimensions Datatype
Counts [ #bases, #samples, #strands, #positions ] int
Coverages [ #samples, #strands, #positions ] int
Deletions [ #samples, #strands, #positions ] int
Insertions [ #samples, #strands, #positions ] int
Reference [ #positions ] int

An HDF5 file has an internal structure that is similar to a file system, where groups are the directories and datasets are the files. The layout of the tally file is as follows:

The layout of a tally HDF5 file.

The layout of a tally HDF5 file.

A tally file can contain data from more than one study but each study will reside in a separte tree with a group named with the study-ID at its root and sub-groups for all the chromosomes / contigs that are present in the study. Attached to each of the chromosome groups are the 4 datasets described above.

Additionally each chromsome group stores sample data about the samples involved in the experiment (patientID, type, location in the sample dimension) as HDF5 attributes. Convenience functions for extracting the metadata are provided, see examples below.

A practical example

Before we get into the details of how to generate an HDF5 tally file for a set of sequencing experiments we will show some examples of the possible analyses one can perform on such a file. The tally file we will use is provided with the h5vcData package and if you have not installed this so far you should do so now.

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("h5vcData")

The first thing we do is set up the session by loading the h5vc and rhdf5 packages and finding the location of the example tally file.

suppressPackageStartupMessages(library(h5vc))
suppressPackageStartupMessages(library(rhdf5))

tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )

We can inspect the data contained in this file with the h5ls function provided by rhdf5.

h5ls(tallyFile)
##               group         name       otype  dclass                   dim
## 0                 / ExampleStudy   H5I_GROUP                              
## 1     /ExampleStudy           16   H5I_GROUP                              
## 2  /ExampleStudy/16       Counts H5I_DATASET INTEGER 12 x 6 x 2 x 90354753
## 3  /ExampleStudy/16    Coverages H5I_DATASET INTEGER      6 x 2 x 90354753
## 4  /ExampleStudy/16    Deletions H5I_DATASET INTEGER      6 x 2 x 90354753
## 5  /ExampleStudy/16    Reference H5I_DATASET INTEGER              90354753
## 6     /ExampleStudy           22   H5I_GROUP                              
## 7  /ExampleStudy/22       Counts H5I_DATASET INTEGER 12 x 6 x 2 x 51304566
## 8  /ExampleStudy/22    Coverages H5I_DATASET INTEGER      6 x 2 x 51304566
## 9  /ExampleStudy/22    Deletions H5I_DATASET INTEGER      6 x 2 x 51304566
## 10 /ExampleStudy/22    Reference H5I_DATASET INTEGER              51304566

In the resulting output we can find the different groups and datasets present in the file and we can extract the relevant sample data attached to those groups in the following way.

sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
sampleData
##                      SampleFiles           Sample Column    Type  Patient
## 1     ../Input/PT8PrimaryDNA.bam    PT8PrimaryDNA      6    Case Patient8
## 2     ../Input/PT5PrimaryDNA.bam    PT5PrimaryDNA      2    Case Patient5
## 3     ../Input/PT5RelapseDNA.bam    PT5RelapseDNA      3    Case Patient5
## 4 ../Input/PT8PreLeukemiaDNA.bam PT8EarlyStageDNA      5    Case Patient8
## 5     ../Input/PT5ControlDNA.bam    PT5ControlDNA      1 Control Patient5
## 6     ../Input/PT8ControlDNA.bam    PT8ControlDNA      4 Control Patient8

The sampleData object is a data.frame that contains information about the samples whose nucleotide tallies are present in the file. We can modify this object (e.g. add new columns) and write it back to the file using the setSampleData function, but we must be aware that a certain set of columns have to be present (Sample, Patient, Column and Type).

sampleData$ClinicalVariable <- rnorm(nrow(sampleData))
setSampleData( tallyFile, "/ExampleStudy/16", sampleData )
## Warning in h5writeAttribute.default(attr = sampleData[[column]], h5obj
## = g, : No function found to write attribute of class 'factor'. Attribute
## 'SampleFiles' is not written to hdf5-file.
## Warning in h5writeAttribute.default(attr = sampleData[[column]], h5obj
## = g, : No function found to write attribute of class 'factor'. Attribute
## 'Sample' is not written to hdf5-file.
## Warning in h5writeAttribute.default(attr = sampleData[[column]], h5obj
## = g, : No function found to write attribute of class 'factor'. Attribute
## 'Type' is not written to hdf5-file.
## Warning in h5writeAttribute.default(attr = sampleData[[column]], h5obj
## = g, : No function found to write attribute of class 'factor'. Attribute
## 'Patient' is not written to hdf5-file.
sampleData
##                      SampleFiles           Sample Column    Type  Patient
## 1     ../Input/PT8PrimaryDNA.bam    PT8PrimaryDNA      6    Case Patient8
## 2     ../Input/PT5PrimaryDNA.bam    PT5PrimaryDNA      2    Case Patient5
## 3     ../Input/PT5RelapseDNA.bam    PT5RelapseDNA      3    Case Patient5
## 4 ../Input/PT8PreLeukemiaDNA.bam PT8EarlyStageDNA      5    Case Patient8
## 5     ../Input/PT5ControlDNA.bam    PT5ControlDNA      1 Control Patient5
## 6     ../Input/PT8ControlDNA.bam    PT8ControlDNA      4 Control Patient8
##   ClinicalVariable
## 1       -0.4271788
## 2       -0.9260081
## 3       -1.4877744
## 4       -0.3414057
## 5        0.5470069
## 6       -0.1845614

Now that we can find the sample metadata in the file it is time to extract some of the nuclotide tally data stored there. We can use two functions to achieve this, h5readBlock can be used to read a specified block of data along a given dimension (e.g. a region along the genomic position) and h5dapply can be used to apply a function in a blockwise fashion along a specified dimension (e.g. calculating coverage in bins of a certain size along the genomic position dimension).

We can read out a block of data in the following way:

data <- h5readBlock(
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c( "Coverages", "Counts" ),
  range = c(29000000,29001000)
  )
str(data)
## List of 3
##  $ Coverages   : int [1:6, 1:2, 1:1001] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Counts      : int [1:12, 1:6, 1:2, 1:1001] 0 0 0 0 0 0 0 0 0 0 ...
##  $ h5dapplyInfo:List of 4
##   ..$ Blockstart: int 29000000
##   ..$ Blockend  : int 29001000
##   ..$ Datasets  :'data.frame':   2 obs. of  3 variables:
##   .. ..$ Name    : chr [1:2] "Coverages" "Counts"
##   .. ..$ DimCount: int [1:2] 3 4
##   .. ..$ PosDim  : int [1:2] 3 4
##   ..$ Group     : chr "/ExampleStudy/16"

When we want to read multiple blocks of data we can use the h5dapply function which supports the usage of IRanges and GRanges to define regions of interest, although a simpler system where the user specifies only a start, end and blocksize parameter exists.

suppressPackageStartupMessages(require(GenomicRanges))
data <- h5dapply(
  filename = tallyFile,
  group = "/ExampleStudy",
  names = c( "Coverages" ),
  dims = c(3),
  range = GRanges("16", ranges = IRanges(start = seq(29e6, 30e6, 5e6), width = 1000))
  )
str(data)
## List of 1
##  $ 16:List of 1
##   ..$ 29000000:29000999:List of 2
##   .. ..$ Coverages   : int [1:6, 1:2, 1:1000] 0 0 0 0 0 0 0 0 0 0 ...
##   .. ..$ h5dapplyInfo:List of 4
##   .. .. ..$ Blockstart: int 29000000
##   .. .. ..$ Blockend  : int 29000999
##   .. .. ..$ Datasets  :'data.frame': 1 obs. of  3 variables:
##   .. .. .. ..$ Name    : chr "Coverages"
##   .. .. .. ..$ DimCount: int 3
##   .. .. .. ..$ PosDim  : num 3
##   .. .. ..$ Group     : chr "/ExampleStudy/16"

Usually we do not want to load the data of all those blocks into memory (unless we need it for plotting). A typical workflow will involve some form of processing of the data and as long as this processing can be expressed as an R function that can be applied to each block separately, we can simply provide h5dapply with this function and only retrieve the results. In the following example we calculate the coverage of the samples present in the example tally file by applying the binnedCoverage function to blocks defined in a GRanges object.

#rangeA <- GRanges("16", ranges = IRanges(start = seq(29e6, 29.5e6, 1e5), width = 1000))
#rangeB <- GRanges("22", ranges = IRanges(start = seq(39e6, 39.5e6, 1e5), width = 1000))
range <- GRanges(
  rep(c("16", "22"), each = 6),
  ranges = IRanges(
    start = c(seq(29e6, 29.5e6, 1e5),seq(39e6, 39.5e6, 1e5)),
    width = 1000
  ))
coverages <- h5dapply(
  filename = tallyFile,
  group = "/ExampleStudy",
  names = c( "Coverages" ),
  dims = c(3),
  range = range,
  FUN = binnedCoverage,
  sampledata = sampleData
  )
#options(scipen=10)
coverages <- do.call( rbind, lapply( coverages, function(x) do.call(rbind, x) ))
#rownames(coverages) <- NULL #remove block-ids used as row-names
coverages
##                      PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA
## 16.29000000:29000999         36113         40117         18158
## 16.29100000:29100999         47081         58134         20670
## 16.29200000:29200999         51110         46735         22911
## 16.29300000:29300999         41432         50366         23395
## 16.29400000:29400999         22083         29795         11629
## 16.29500000:29500999           101           101             0
## 22.39000000:39000999         47677         50990         23573
## 22.39100000:39100999         42456         48328         19178
## 22.39200000:39200999         49749         64833         25383
## 22.39300000:39300999         55279         56861         26796
## 22.39400000:39400999         35512         55103         18155
## 22.39500000:39500999         41028         54033         20925
##                      PT8ControlDNA PT8EarlyStageDNA PT8PrimaryDNA Chrom
## 16.29000000:29000999         36377            32315         18998    16
## 16.29100000:29100999         56402            53987         17228    16
## 16.29200000:29200999         46598            50203         20391    16
## 16.29300000:29300999         42309            45285         18961    16
## 16.29400000:29400999         30813            24986          9293    16
## 16.29500000:29500999           202              101             0    16
## 22.39000000:39000999         46070            47232         20825    22
## 22.39100000:39100999         46106            45135         18827    22
## 22.39200000:39200999         52946            53028         24135    22
## 22.39300000:39300999         57834            51304         23554    22
## 22.39400000:39400999         48502            39690         21945    22
## 22.39500000:39500999         51626            37666         18943    22
##                         Start      End
## 16.29000000:29000999 29000000 29000999
## 16.29100000:29100999 29100000 29100999
## 16.29200000:29200999 29200000 29200999
## 16.29300000:29300999 29300000 29300999
## 16.29400000:29400999 29400000 29400999
## 16.29500000:29500999 29500000 29500999
## 22.39000000:39000999 39000000 39000999
## 22.39100000:39100999 39100000 39100999
## 22.39200000:39200999 39200000 39200999
## 22.39300000:39300999 39300000 39300999
## 22.39400000:39400999 39400000 39400999
## 22.39500000:39500999 39500000 39500999

Note that binnedCoverage takes an additional parameter sampleData which we provide to the function as well. Furthermore we specify the blocksize to be 500 bases and we specify the dims parameter to tell h5dapply along which dimension of the selected datasets (“Coverages” in this case) shall be processed (dimension number 3 is the genomic position in the “Coverages” dataset). The explicit specification of dims is only neccessary when we are not extracting the “Counts” dataset, otherwise it defaults to the genomic position.

In the same way we can perform variant calling by using h5dapply together with a variant calling function like callVariantsPaired or callVariantsSingle.

variants <- h5dapply(
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c( "Coverages", "Counts", "Deletions", "Reference" ),
  range = c(29950000,30000000),
  blocksize = 10000,
  FUN = callVariantsPaired,
  sampledata = sampleData,
  cl = vcConfParams(returnDataPoints = TRUE)
  )
variants <- do.call( rbind, variants )
variants$AF <- (variants$caseCountFwd + variants$caseCountRev) / (variants$caseCoverageFwd + variants$caseCoverageRev)
variants <- variants[variants$AF > 0.2,]
rownames(variants) <- NULL # remove rownames to save some space on output :D
variants
##   Chrom    Start      End           Sample altAllele refAllele
## 1    16 29950746 29950746    PT8PrimaryDNA         -         T
## 2    16 29983015 29983015    PT5PrimaryDNA         G         C
## 3    16 29983015 29983015    PT5RelapseDNA         G         C
## 4    16 29983015 29983015 PT8EarlyStageDNA         G         C
##   caseCountFwd caseCountRev caseCoverageFwd caseCoverageRev
## 1           10            3              34              29
## 2           12           13              29              27
## 3            3            4              10               9
## 4            8           14              19              30
##   controlCountFwd controlCountRev controlCoverageFwd controlCoverageRev
## 1               0               0                 10                 15
## 2               0               0                 11                 19
## 3               0               0                 11                 19
## 4               0               0                 13                 10
##   backgroundFrequencyFwd backgroundFrequencyRev   pValueFwd pValueRev
## 1                    0.1              0.0754717 0.001365664 0.3761485
## 2                    0.0              0.0000000 0.000000000 0.0000000
## 3                    0.0              0.0000000 0.000000000 0.0000000
## 4                    0.0              0.0000000 0.000000000 0.0000000
##   caseCount controlCount caseCoverage controlCoverage        AF
## 1        13            0           63              25 0.2063492
## 2        25            0           56              30 0.4464286
## 3         7            0           19              30 0.3684211
## 4        22            0           49              23 0.4489796

For details about the parameters and behaviour of callVariantsPaired have a look at the corresponding manual page ( i.e. ?callVariantsPaired ).

A function has to have a named parameter data as its first argument in order to be compatible with h5dapply, in this case data is a list of the same structure as the one returned by h5readBlock.

Once we have determined the location of an interesting variant, like 16:29983015-29983015:C/G in our case, we can create a mismatchPlot in the region around it to get a better feeling for the variant. To this end we use the mismatchPlot function on the tallies in the region in the following way:

windowsize <- 35
position <- variants$Start[2]
data <- h5readBlock(
    filename = tallyFile,
    group = paste( "/ExampleStudy", variants$Chrom[2], sep="/" ),
    names = c("Coverages","Counts","Deletions", "Reference"),
    range = c( position - windowsize, position + windowsize)
  )
patient <- sampleData$Patient[sampleData$Sample == variants$Sample[2]]
samples <- sampleData$Sample[sampleData$Patient == patient]
p <- mismatchPlot(
  data = data,
  sampledata = sampleData,
  samples = samples,
  windowsize = windowsize,
  position = position
  )
print(p)

This plot shows the region 35 bases up and downstream of the variant. It shows one panel for each sample associated with the patient that carries the variant (selected by the line sampleData$Sample[sampleData$Patient == patient]) and each panel is centered on the varian position in the x-axis and the y-axis encodes coverage and mismatches (negative values are on the reverse strand). The grey area is the coverage and the coloured boxes are mismatches. For more details on this plot see ?mismatchPlot.

The object returned by mismatchPlot is a ggplot object which can be manipulated in the same way as any other plot generated through a call to ggplot. We can for example apply a theme to the plot (see ?ggplot2::theme for a list of possible options).

print(p + theme(strip.text.y = element_text(family="serif", size=16, angle=0)))