Scalable nucleotide tallies with HDF5

Abstract

With the increased affordability of genomic sequencing technologies we have at our disposal a flood of whole genome/exome datasets from many patients. When using the .bam file format large scale comparative analyses become cumbersome to implement and a lot of time is spend parsing files and collating data. Given that the tally (the table of counts of nucleotides x sample x strand x genomic position) is at the center of many genomics analyses (e.g. variant calling, copy-number estimation, mutation spectrum analysis, etc.) it is crucial to collate this information in an easily accessible format. Here we propose the use of HDF5 as this format and give concrete examples of the internal layout of a tally-HDF5 file and possible applications. We introduce tools for the creation of a tally file from a set of .bam files and tools for interacting with the tally file and performing genomics analyses.

Motivation

Currently there is a large interest in analyses of cancer genome data across large cohorts, but the current standard file formats are ill-suited to the task. The BAM-file format is too low-level and does not scale well (large file size; slicing by genomic position is usually inefficient) while the VCF format is too high-level with an emphasis on (positive) calls with low FP rate and no control over negative calls and FN rate (just considering every position not mentioned in the VCF as 'no variant' would have have a high FN rate, esp. in the face of subclonality and coverage fluctuations). There is a need for an intermediate format that is scalable, compact and cross-platform accessible.

The 'tally' data structure

The tally data structure proposed here consists of 4 datasets that are stored for each chromosome (or contig). Those datasets are:

Counts: A table that contains the number of observed mismatches at any combination of base, sample, strand and genomic position,
Coverages: A table that contains the number of reads overlapping at any combination of sample, strand and genomic position
Deletions: A Table that contains the number of observed deletions of bases at any combination of sample, strand and genomic position
Reference: A Table that contains the reference base against which the calls in the 'Deletions' and 'Counts' table were made.

We outline the basic layout of this set of tables here:

Name	Dimensions	Datatype
Counts	[ #bases, #samples, #strands, #positions ]	int
Coverages	[ #samples, #strands, #positions ]	int
Deletions	[ #samples, #strands, #positions ]	int
Reference	[ #positions ]	int

An HDF5 file has an internal structure that is similar to a file system, where groups are the directories and datasets are the files. The layout of the tally file is as follows:

The layout of a tally `HDF5` file.

A tally file can contain data from more than one study but each study will reside in a separte tree with a group named with the study-ID at its root and sub-groups for all the chromosomes / contigs that are present in the study. Attached to each of the chromosome groups are the 4 datasets described above.

Additionally each chromsome group stores sample data about the samples involved in the experiment (patientID, type, location in the sample dimension) as HDF5 attributes. Convenience functions for extracting the metadata are provided, see examples below.

Interacting with tally files

In the first step we load the h5vc package and fetch an example tally file (unless it is already present).

library(h5vc)

## Warning: replacing previous import by 'GenomicRanges::%in%' when loading 'h5vc'
## Warning: replacing previous import by 'GenomicRanges::match' when loading 'h5vc'
## Warning: replacing previous import by 'GenomicRanges::order' when loading 'h5vc'
## Warning: replacing previous import by 'GenomicRanges::rank' when loading 'h5vc'

## *NOTE:* h5vc is a package under constant development and many improvements over this version have accumulated in the development branch.
## Please switch to the development version of h5vc (>=1.1.18) for production use.

tallyFile <- system.file("extdata", "example.tally.hfs5", package = "h5vcData")

Now we can have a look into the tally file. We use the h5ls function from the rhdf5 package.

library(rhdf5)
h5ls(tallyFile)

##               group         name       otype  dclass                   dim
## 0                 / ExampleStudy   H5I_GROUP                              
## 1     /ExampleStudy           16   H5I_GROUP                              
## 2  /ExampleStudy/16       Counts H5I_DATASET INTEGER 12 x 6 x 2 x 90354753
## 3  /ExampleStudy/16    Coverages H5I_DATASET INTEGER      6 x 2 x 90354753
## 4  /ExampleStudy/16    Deletions H5I_DATASET INTEGER      6 x 2 x 90354753
## 5  /ExampleStudy/16    Reference H5I_DATASET INTEGER              90354753
## 6     /ExampleStudy           22   H5I_GROUP                              
## 7  /ExampleStudy/22       Counts H5I_DATASET INTEGER 12 x 6 x 2 x 51304566
## 8  /ExampleStudy/22    Coverages H5I_DATASET INTEGER      6 x 2 x 51304566
## 9  /ExampleStudy/22    Deletions H5I_DATASET INTEGER      6 x 2 x 51304566
## 10 /ExampleStudy/22    Reference H5I_DATASET INTEGER              51304566

This returns a table of all the objects present within the tally file: /home/biocbuild/bbs-2.13-bioc/R/library/h5vcData/extdata/example.tally.hfs5. This table encodes the tree-like structure shown in Figure 1.

We can extract the separately stored sample data from the tally file by using the getSampleData function:

sampleData <- getSampleData(tallyFile, "/ExampleStudy/16")
print(sampleData)

##                      SampleFiles           Sample Column    Type  Patient
## 1     ../Input/PT8PrimaryDNA.bam    PT8PrimaryDNA      6    Case Patient8
## 2     ../Input/PT5PrimaryDNA.bam    PT5PrimaryDNA      2    Case Patient5
## 3     ../Input/PT5RelapseDNA.bam    PT5RelapseDNA      3    Case Patient5
## 4 ../Input/PT8PreLeukemiaDNA.bam PT8EarlyStageDNA      5    Case Patient8
## 5     ../Input/PT5ControlDNA.bam    PT5ControlDNA      1 Control Patient5
## 6     ../Input/PT8ControlDNA.bam    PT8ControlDNA      4 Control Patient8

Extracting data from a tally file

We use the h5dapply function as our general purpose accessor to data stored in HDF5 based tally files. We provide the following parameters:

file: The name of the tally file to interact with
group: The name of the group we want to access, this will always be a string of the form when dealing with tally files
names: A character vector listing the datasets to be extracted
blocksize: The size of the blocks in which we want to apply our function
range: The range of values within the dimension we apply over that we want to process
FUN: The function to apply to the blocks, always will be passed the parameter data which will contain the data for the current block. Defaults to returning the data as is

Using h5dapply blank to get the data directly

When not specifying the function to be applied (argument FUN) we get the data returned that would otherwise be supplied to the function as it's first argument (i.e. FUN = function(x) x).

data <- h5dapply(filename = tallyFile, group = "/ExampleStudy/16", names = c("Coverages"), 
    blocksize = 10000, range = c(2.9e+07, 29010000))

## Attaching package bit
## package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
## creators: bit bitwhich
## coercion: as.logical as.integer as.bit as.bitwhich which
## operator: ! & | xor != ==
## querying: print length any all min max range sum summary
## bit access: length<- [ [<- [[ [[<-
## for more help type ?bit
## 
## Attaching package: 'bit'
## 
## The following object is masked from 'package:base':
## 
##     xor
## 
## Attaching package bit64
## package:bit64 (c) 2011-2012 Jens Oehlschlaegel (GPL-2 with commercial restrictions)
## creators: integer64 seq :
## coercion: as.integer64 as.vector as.logical as.integer as.double as.character as.bin
## logical operator: ! & | xor != == < <= >= >
## arithmetic operator: + - * / %/% %% ^
## math: sign abs sqrt log log2 log10
## math: floor ceiling trunc round
## querying: is.integer64 is.vector [is.atomic} [length] is.na format print
## aggregation: any all min max range sum prod
## cumulation: diff cummin cummax cumsum cumprod
## access: length<- [ [<- [[ [[<-
## combine: c rep cbind rbind as.data.frame
## for more help type ?bit64
## 
## Attaching package: 'bit64'
## 
## The following objects are masked from 'package:base':
## 
##     %in%, :, is.double, match, order, rank

str(data)

## List of 1
##  $ 2.9e+07:29009999:List of 2
##   ..$ Coverages   : int [1:6, 1:2, 1:10000] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ h5dapplyInfo:List of 4
##   .. ..$ Blockstart: num 2.9e+07
##   .. ..$ Blockend  : num 2.9e+07
##   .. ..$ Datasets  :'data.frame':    1 obs. of  7 variables:
##   .. .. ..$ group   : chr "/ExampleStudy/16"
##   .. .. ..$ name    : chr "Coverages"
##   .. .. ..$ otype   : Factor w/ 7 levels "H5I_FILE","H5I_GROUP",..: 5
##   .. .. ..$ dclass  : chr "INTEGER"
##   .. .. ..$ dim     : chr "6 x 2 x 90354753"
##   .. .. ..$ DimCount: int 3
##   .. .. ..$ PosDim  : int 3
##   .. ..$ Group     : chr "/ExampleStudy/16"

As you can see the dataobject is a list with one element for each dataset (only “Coverages” in our example; note the dimensions of the “Coverage” array are [1:6, 1:2, 1:10000], i.e. 6 samples, 2 strands and 10000 position in the current block). Furthermore the slot “h5dapplyInfo” is being attached to the list data and is itself a list containing information about the current block (“Blockstart”,“Blockend”), the Datasets represented here (slot “Datasets”) and the group we are processing right now (“/ExampleStudy/16”). This information can be used within the call to FUN to, e.g. calculate the correct offset for genomic positions by adding the block start etc.

Example of using `h5dapply` to extract coverage data

data <- h5dapply(filename = tallyFile, group = "/ExampleStudy/16", names = c("Coverages"), 
    blocksize = 1000, range = c(2.9e+07, 29010000), FUN = function(x) rowSums(x$Coverages), 
    verbose = TRUE)

## Processing Block #1: 2.9e+07:29000999
## Processing Block #2: 29001000:29001999
## Processing Block #3: 29002000:29002999
## Processing Block #4: 29003000:29003999
## Processing Block #5: 29004000:29004999
## Processing Block #6: 29005000:29005999
## Processing Block #7: 29006000:29006999
## Processing Block #8: 29007000:29007999
## Processing Block #9: 29008000:29008999
## Processing Block #10: 29009000:29009999

coverages <- do.call(rbind, data)
colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
coverages

##                   PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA
## 2.9e+07:29000999          36377         40117         18158         36113
## 29001000:29001999         38014         48938         17083         36222
## 29002000:29002999         49731         58418         22193         43917
## 29003000:29003999         48404         55464         22724         41585
## 29004000:29004999         40319         50930         23465         36121
## 29005000:29005999         44560         60327         22169         43149
## 29006000:29006999         45475         60565         20120         40463
## 29007000:29007999         47064         49689         23940         44508
## 29008000:29008999         48728         52724         21879         39106
## 29009000:29009999         35987         47701         19167         34718
##                   PT8EarlyStageDNA PT8PrimaryDNA
## 2.9e+07:29000999             18998         32315
## 29001000:29001999            19910         38044
## 29002000:29002999            21437         43580
## 29003000:29003999            21332         40858
## 29004000:29004999            21304         42560
## 29005000:29005999            21701         44662
## 29006000:29006999            17943         43503
## 29007000:29007999            19213         41379
## 29008000:29008999            19402         41048
## 29009000:29009999            18208         37825

Here we want to apply the function function(x) rowSums(x$Coverages) to the genomic interval 16:29000000-29010000 in blocks of size 1000 and for each of the blocks we want to only extract the dataset “Coverages” since we only access this dataset in the applied function and extracting the others would be unneccessary additional work.

The return value of h5dapply is a list with one element per block containing the output of the function FUN that will be applied to each block. In the example above we compute the sum of coverages per sample within each block, then use do.call(rbind, data) to merge the results into a matrix and set the column names to the respective sample ids that are specified in the sampleData object we loaded from the tally file.

Example of more involved coverage analysis

Let's extract the coverage in 500 base bins from a small region on chromosome 22. We use the binnedCoverage function suplied by h5vc, which is almost identical to the function(x) rowSums(x$Coverages) used above plus some checks and reformatting of the output.

data <- h5dapply(filename = tallyFile, group = "/ExampleStudy/22", names = c("Coverages"), 
    blocksize = 500, range = c(38750000, 39250000), FUN = binnedCoverage, sampledata = sampleData)
data <- do.call(rbind, data)
rownames(data) = NULL
head(data)

##   PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA PT8EarlyStageDNA
## 1         25869         25603         13203         17276             7596
## 2         26917         25611         11227         19973            10032
## 3         23568         25876         11735         22130            10086
## 4         21408         22309         11056         26193            13041
## 5         22409         21349          9439         20538            10443
## 6         25372         26834         14427         23880             8864
##   PT8PrimaryDNA    Start
## 1         24315 38750000
## 2         22674 38750500
## 3         22064 38751000
## 4         19888 38751500
## 5         18602 38752000
## 6         23910 38752500

The library size can be estimated from the colSums of the data object.

librarySizes <- colSums(data[, 1:6])
sizeFactors <- librarySizes/mean(librarySizes)
sizeFactors

##    PT5ControlDNA    PT5PrimaryDNA    PT5RelapseDNA    PT8ControlDNA 
##           1.2359           1.3641           0.5777           1.1424 
## PT8EarlyStageDNA    PT8PrimaryDNA 
##           0.5359           1.1440

for (n in names(sizeFactors)) {
    data[[n]] <- data[[n]]/sizeFactors[n]
}

Next we plot the pairwise log2 ratios of the size factor adjusted coverage.

suppressPackageStartupMessages(library(reshape))
suppressPackageStartupMessages(library(ggplot2))
pairWiseRatio <- function(data, sampledata, trans = log2) {
    sampledata <- subset(sampledata, Sample %in% colnames(data))
    ret <- data.frame(Start = data$Start)
    for (patient in sampledata$Patient) {
        control <- subset(sampledata, Patient == patient & Type == "Control")$Sample
        if (length(control) != 1) {
            stop(paste("Exactly one control sample per patient is expected. Found", 
                length(control), "in patient", patient))
        }
        for (case in subset(sampledata, Patient == patient & Type == "Case")$Sample) {
            ret[[paste(case, control, sep = "/")]] <- trans(data[[case]]/data[[control]])
        }
    }
    return(ret)
}
plot_data <- melt(pairWiseRatio(data, sampleData), id.vars = c("Start"))
colnames(plot_data) = c("Position", "Samples", "Ratio")
ggplot(plot_data, aes(x = Position, y = Ratio, color = Samples)) + geom_line() + 
    facet_wrap(~Samples, ncol = 1)

plot of chunk plotLog2Ratios