With the increased affordability of genomic sequencing technologies
we have at our disposal a flood of whole genome/exome datasets from
many patients. When using the .bam
file format large scale comparative
analyses become cumbersome to implement and a lot of time is spend
parsing files and collating data. Given that the tally (the table of
counts of nucleotides x sample x strand x genomic position) is at the
center of many genomics analyses (e.g. variant calling, copy-number
estimation, mutation spectrum analysis, etc.) it is crucial to
collate this information in an easily accessible format.
Here we propose the use of HDF5
as this format and give concrete
examples of the internal layout of a tally-HDF5
file and possible
applications. We introduce tools for the creation of a tally file from
a set of .bam
files and tools for interacting with the tally file and
performing genomics analyses.
Currently there is a large interest in analyses of cancer genome data across large cohorts, but the current standard file formats are ill-suited to the task. The BAM-file format is too low-level and does not scale well (large file size; slicing by genomic position is usually inefficient) while the VCF format is too high-level with an emphasis on (positive) calls with low FP rate and no control over negative calls and FN rate (just considering every position not mentioned in the VCF as 'no variant' would have have a high FN rate, esp. in the face of subclonality and coverage fluctuations). There is a need for an intermediate format that is scalable, compact and cross-platform accessible.
The tally data structure proposed here consists of 4 datasets that are stored for each chromosome (or contig). Those datasets are:
We outline the basic layout of this set of tables here:
Name | Dimensions | Datatype |
---|---|---|
Counts | [ #bases, #samples, #strands, #positions ] | int |
Coverages | [ #samples, #strands, #positions ] | int |
Deletions | [ #samples, #strands, #positions ] | int |
Reference | [ #positions ] | int |
An HDF5
file has an internal structure that is similar to a file system, where groups are the directories and datasets are the files.
The layout of the tally file is as follows:
A tally file can contain data from more than one study but each study will reside in a separte tree with a group named with the study-ID at its root and sub-groups for all the chromosomes / contigs that are present in the study. Attached to each of the chromosome groups are the 4 datasets described above.
Additionally each chromsome group stores sample data about the samples involved in the experiment (patientID, type, location in the sample dimension) as HDF5
attributes. Convenience functions for extracting the metadata are provided, see examples below.
In the first step we load the h5vc package and fetch an example tally file (unless it is already present).
library(h5vc)
## Warning: replacing previous import by 'GenomicRanges::%in%' when loading 'h5vc'
## Warning: replacing previous import by 'GenomicRanges::match' when loading 'h5vc'
## Warning: replacing previous import by 'GenomicRanges::order' when loading 'h5vc'
## Warning: replacing previous import by 'GenomicRanges::rank' when loading 'h5vc'
## *NOTE:* h5vc is a package under constant development and many improvements over this version have accumulated in the development branch.
## Please switch to the development version of h5vc (>=1.1.18) for production use.
tallyFile <- system.file("extdata", "example.tally.hfs5", package = "h5vcData")
Now we can have a look into the tally file. We use the h5ls
function from the rhdf5 package.
library(rhdf5)
h5ls(tallyFile)
## group name otype dclass dim
## 0 / ExampleStudy H5I_GROUP
## 1 /ExampleStudy 16 H5I_GROUP
## 2 /ExampleStudy/16 Counts H5I_DATASET INTEGER 12 x 6 x 2 x 90354753
## 3 /ExampleStudy/16 Coverages H5I_DATASET INTEGER 6 x 2 x 90354753
## 4 /ExampleStudy/16 Deletions H5I_DATASET INTEGER 6 x 2 x 90354753
## 5 /ExampleStudy/16 Reference H5I_DATASET INTEGER 90354753
## 6 /ExampleStudy 22 H5I_GROUP
## 7 /ExampleStudy/22 Counts H5I_DATASET INTEGER 12 x 6 x 2 x 51304566
## 8 /ExampleStudy/22 Coverages H5I_DATASET INTEGER 6 x 2 x 51304566
## 9 /ExampleStudy/22 Deletions H5I_DATASET INTEGER 6 x 2 x 51304566
## 10 /ExampleStudy/22 Reference H5I_DATASET INTEGER 51304566
This returns a table of all the objects present within the tally file: /home/biocbuild/bbs-2.13-bioc/R/library/h5vcData/extdata/example.tally.hfs5. This table encodes the tree-like structure shown in Figure 1.
We can extract the separately stored sample data from the tally file by using the getSampleData
function:
sampleData <- getSampleData(tallyFile, "/ExampleStudy/16")
print(sampleData)
## SampleFiles Sample Column Type Patient
## 1 ../Input/PT8PrimaryDNA.bam PT8PrimaryDNA 6 Case Patient8
## 2 ../Input/PT5PrimaryDNA.bam PT5PrimaryDNA 2 Case Patient5
## 3 ../Input/PT5RelapseDNA.bam PT5RelapseDNA 3 Case Patient5
## 4 ../Input/PT8PreLeukemiaDNA.bam PT8EarlyStageDNA 5 Case Patient8
## 5 ../Input/PT5ControlDNA.bam PT5ControlDNA 1 Control Patient5
## 6 ../Input/PT8ControlDNA.bam PT8ControlDNA 4 Control Patient8
We use the h5dapply
function as our general purpose accessor to data stored in HDF5
based tally files.
We provide the following parameters:
data
which will contain the data for the current block. Defaults to returning the data as isWhen not specifying the function to be applied (argument FUN
) we get the data returned that would otherwise be supplied to the function as it's first argument (i.e. FUN = function(x) x
).
data <- h5dapply(filename = tallyFile, group = "/ExampleStudy/16", names = c("Coverages"),
blocksize = 10000, range = c(2.9e+07, 29010000))
## Attaching package bit
## package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
## creators: bit bitwhich
## coercion: as.logical as.integer as.bit as.bitwhich which
## operator: ! & | xor != ==
## querying: print length any all min max range sum summary
## bit access: length<- [ [<- [[ [[<-
## for more help type ?bit
##
## Attaching package: 'bit'
##
## The following object is masked from 'package:base':
##
## xor
##
## Attaching package bit64
## package:bit64 (c) 2011-2012 Jens Oehlschlaegel (GPL-2 with commercial restrictions)
## creators: integer64 seq :
## coercion: as.integer64 as.vector as.logical as.integer as.double as.character as.bin
## logical operator: ! & | xor != == < <= >= >
## arithmetic operator: + - * / %/% %% ^
## math: sign abs sqrt log log2 log10
## math: floor ceiling trunc round
## querying: is.integer64 is.vector [is.atomic} [length] is.na format print
## aggregation: any all min max range sum prod
## cumulation: diff cummin cummax cumsum cumprod
## access: length<- [ [<- [[ [[<-
## combine: c rep cbind rbind as.data.frame
## for more help type ?bit64
##
## Attaching package: 'bit64'
##
## The following objects are masked from 'package:base':
##
## %in%, :, is.double, match, order, rank
str(data)
## List of 1
## $ 2.9e+07:29009999:List of 2
## ..$ Coverages : int [1:6, 1:2, 1:10000] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ h5dapplyInfo:List of 4
## .. ..$ Blockstart: num 2.9e+07
## .. ..$ Blockend : num 2.9e+07
## .. ..$ Datasets :'data.frame': 1 obs. of 7 variables:
## .. .. ..$ group : chr "/ExampleStudy/16"
## .. .. ..$ name : chr "Coverages"
## .. .. ..$ otype : Factor w/ 7 levels "H5I_FILE","H5I_GROUP",..: 5
## .. .. ..$ dclass : chr "INTEGER"
## .. .. ..$ dim : chr "6 x 2 x 90354753"
## .. .. ..$ DimCount: int 3
## .. .. ..$ PosDim : int 3
## .. ..$ Group : chr "/ExampleStudy/16"
As you can see the data
object is a list with one element for each dataset (only “Coverages” in our example; note the dimensions of the “Coverage” array are [1:6, 1:2, 1:10000]
, i.e. 6 samples, 2 strands and 10000 position in the current block). Furthermore the slot “h5dapplyInfo” is being attached to the list data
and is itself a list containing information about the current block (“Blockstart”,“Blockend”), the Datasets represented here (slot “Datasets”) and the group we are processing right now (“/ExampleStudy/16”). This information can be used within the call to FUN
to, e.g. calculate the correct offset for genomic positions by adding the block start etc.
h5dapply
to extract coverage datadata <- h5dapply(filename = tallyFile, group = "/ExampleStudy/16", names = c("Coverages"),
blocksize = 1000, range = c(2.9e+07, 29010000), FUN = function(x) rowSums(x$Coverages),
verbose = TRUE)
## Processing Block #1: 2.9e+07:29000999
## Processing Block #2: 29001000:29001999
## Processing Block #3: 29002000:29002999
## Processing Block #4: 29003000:29003999
## Processing Block #5: 29004000:29004999
## Processing Block #6: 29005000:29005999
## Processing Block #7: 29006000:29006999
## Processing Block #8: 29007000:29007999
## Processing Block #9: 29008000:29008999
## Processing Block #10: 29009000:29009999
coverages <- do.call(rbind, data)
colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
coverages
## PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA
## 2.9e+07:29000999 36377 40117 18158 36113
## 29001000:29001999 38014 48938 17083 36222
## 29002000:29002999 49731 58418 22193 43917
## 29003000:29003999 48404 55464 22724 41585
## 29004000:29004999 40319 50930 23465 36121
## 29005000:29005999 44560 60327 22169 43149
## 29006000:29006999 45475 60565 20120 40463
## 29007000:29007999 47064 49689 23940 44508
## 29008000:29008999 48728 52724 21879 39106
## 29009000:29009999 35987 47701 19167 34718
## PT8EarlyStageDNA PT8PrimaryDNA
## 2.9e+07:29000999 18998 32315
## 29001000:29001999 19910 38044
## 29002000:29002999 21437 43580
## 29003000:29003999 21332 40858
## 29004000:29004999 21304 42560
## 29005000:29005999 21701 44662
## 29006000:29006999 17943 43503
## 29007000:29007999 19213 41379
## 29008000:29008999 19402 41048
## 29009000:29009999 18208 37825
Here we want to apply the function function(x) rowSums(x$Coverages)
to the genomic interval 16:29000000-29010000
in blocks of size 1000
and for each of the blocks we want to only extract the dataset “Coverages” since we only access this dataset in the applied function and extracting the others would be unneccessary additional work.
The return value of h5dapply
is a list with one element per block containing the output of the function FUN
that will be applied to each block. In the example above we compute the sum of coverages per sample within each block, then use do.call(rbind, data)
to merge the results into a matrix and set the column names to the respective sample ids that are specified in the sampleData
object we loaded from the tally file.
Let's extract the coverage in 500 base bins from a small region on chromosome 22. We use the binnedCoverage
function suplied by h5vc
, which is almost identical to the function(x) rowSums(x$Coverages)
used above plus some checks and reformatting of the output.
data <- h5dapply(filename = tallyFile, group = "/ExampleStudy/22", names = c("Coverages"),
blocksize = 500, range = c(38750000, 39250000), FUN = binnedCoverage, sampledata = sampleData)
data <- do.call(rbind, data)
rownames(data) = NULL
head(data)
## PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA PT8EarlyStageDNA
## 1 25869 25603 13203 17276 7596
## 2 26917 25611 11227 19973 10032
## 3 23568 25876 11735 22130 10086
## 4 21408 22309 11056 26193 13041
## 5 22409 21349 9439 20538 10443
## 6 25372 26834 14427 23880 8864
## PT8PrimaryDNA Start
## 1 24315 38750000
## 2 22674 38750500
## 3 22064 38751000
## 4 19888 38751500
## 5 18602 38752000
## 6 23910 38752500
The library size can be estimated from the colSums
of the data object.
librarySizes <- colSums(data[, 1:6])
sizeFactors <- librarySizes/mean(librarySizes)
sizeFactors
## PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA
## 1.2359 1.3641 0.5777 1.1424
## PT8EarlyStageDNA PT8PrimaryDNA
## 0.5359 1.1440
for (n in names(sizeFactors)) {
data[[n]] <- data[[n]]/sizeFactors[n]
}
Next we plot the pairwise log2 ratios of the size factor adjusted coverage.
suppressPackageStartupMessages(library(reshape))
suppressPackageStartupMessages(library(ggplot2))
pairWiseRatio <- function(data, sampledata, trans = log2) {
sampledata <- subset(sampledata, Sample %in% colnames(data))
ret <- data.frame(Start = data$Start)
for (patient in sampledata$Patient) {
control <- subset(sampledata, Patient == patient & Type == "Control")$Sample
if (length(control) != 1) {
stop(paste("Exactly one control sample per patient is expected. Found",
length(control), "in patient", patient))
}
for (case in subset(sampledata, Patient == patient & Type == "Case")$Sample) {
ret[[paste(case, control, sep = "/")]] <- trans(data[[case]]/data[[control]])
}
}
return(ret)
}
plot_data <- melt(pairWiseRatio(data, sampleData), id.vars = c("Start"))
colnames(plot_data) = c("Position", "Samples", "Ratio")
ggplot(plot_data, aes(x = Position, y = Ratio, color = Samples)) + geom_line() +
facet_wrap(~Samples, ncol = 1)