Scalable nucleotide tallies with HDF5
========================================================

Abstract
--------
With the increased affordability of genomic sequencing technologies we have at our disposal a flood of whole genome/exome datasets from many patients. When using the `.bam` file format, large-scale comparative analyses become cumbersome to implement and a lot of time is spent parsing files and collating data. Given that the tally (the table of counts of nucleotide x sample x strand x genomic position) is at the center of many genomics analyses (e.g. variant calling, copy-number estimation, mutation spectrum analysis, etc.), it is crucial to collate this information in an easily accessible format. Here we propose the use of `HDF5` as this format and give concrete examples of the internal layout of a tally-`HDF5` file and possible applications. We introduce tools for the creation of a tally file from a set of `.bam` files as well as tools for interacting with the tally file and performing genomics analyses.

Motivation
----------
Currently there is considerable interest in analyses of cancer genome data across large cohorts, but the current standard file formats are ill-suited to the task. The BAM file format is too low-level and does not scale well (large file size; slicing by genomic position is usually inefficient), while the VCF format is too high-level, with an emphasis on (positive) calls with a low FP rate and no control over negative calls and the FN rate (simply treating every position not mentioned in the VCF as 'no variant' would result in a high FN rate, especially in the face of subclonality and coverage fluctuations). There is a need for an intermediate format that is scalable, compact and cross-platform accessible.

The 'tally' data structure
--------------------------
The tally data structure proposed here consists of 4 datasets that are stored for each chromosome (or contig). Those datasets are:

* Counts: A table that contains the number of observed mismatches at any combination of base, sample, strand and genomic position
* Coverages: A table that contains the number of reads overlapping at any combination of sample, strand and genomic position
* Deletions: A table that contains the number of observed deletions of bases at any combination of sample, strand and genomic position
* Reference: A table that contains the reference base against which the calls in the 'Deletions' and 'Counts' tables were made

We outline the basic layout of this set of tables here:

Name      | Dimensions                                 | Datatype
----------|--------------------------------------------|----------
Counts    | [ #bases, #samples, #strands, #positions ] | int
Coverages | [ #samples, #strands, #positions ]         | int
Deletions | [ #samples, #strands, #positions ]         | int
Reference | [ #positions ]                             | int

An `HDF5` file has an internal structure that is similar to a file system, where groups are the directories and datasets are the files. The layout of the tally file is as follows:

![The layout of a tally `HDF5` file.](tallyLayout.svg)

A tally file can contain data from more than one study, but each study will reside in a separate tree with a group named with the study ID at its root and sub-groups for all the chromosomes / contigs that are present in the study. Attached to each of the chromosome groups are the 4 datasets described above. Additionally, each chromosome group stores sample data about the samples involved in the experiment (patient ID, type, location in the sample dimension) as `HDF5` attributes. Convenience functions for extracting the metadata are provided, see the examples below.
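To make this layout concrete, the following minimal sketch creates an empty file with the group and dataset structure described above using plain *rhdf5* calls (*h5vc* itself provides tools for building tally files from `.bam` files). The file name, chromosome length, sample count, number of base categories and chunk sizes are placeholder values chosen purely for illustration.

```{r layoutSketch, eval=FALSE}
# Sketch (not evaluated): build the group / dataset layout described above
# with plain rhdf5 calls. All names and sizes below are illustrative placeholders.
library(rhdf5)
fn          <- "my.tally.h5"  # placeholder file name
chromLength <- 90354753       # placeholder chromosome length
nSamples    <- 6              # placeholder number of samples
nBases      <- 4              # A, C, G, T (the real tally may use more categories)

h5createFile(fn)
h5createGroup(fn, "/ExampleStudy")
h5createGroup(fn, "/ExampleStudy/16")

h5createDataset(fn, "/ExampleStudy/16/Counts",
                dims = c(nBases, nSamples, 2, chromLength),
                storage.mode = "integer",
                chunk = c(nBases, nSamples, 2, 10000), level = 9)
h5createDataset(fn, "/ExampleStudy/16/Coverages",
                dims = c(nSamples, 2, chromLength),
                storage.mode = "integer",
                chunk = c(nSamples, 2, 10000), level = 9)
h5createDataset(fn, "/ExampleStudy/16/Deletions",
                dims = c(nSamples, 2, chromLength),
                storage.mode = "integer",
                chunk = c(nSamples, 2, 10000), level = 9)
h5createDataset(fn, "/ExampleStudy/16/Reference",
                dims = c(chromLength),
                storage.mode = "integer",
                chunk = c(10000), level = 9)
```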
Interacting with tally files
----------------------------
In the first step we load the *h5vc* package and fetch the example tally file that ships with the *h5vcData* package.

```{r loadtallyfile}
library(h5vc)
tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
```

Now we can have a look into the tally file. We use the `h5ls` function from the *rhdf5* package.

```{r checktallyfile}
library(rhdf5)
h5ls(tallyFile)
```

This returns a table of all the objects present within the tally file `r tallyFile`. This table encodes the tree-like structure shown in Figure 1.

We can extract the separately stored sample data from the tally file by using the `getSampleData` function:

```{r checksampledata}
sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
print(sampleData)
```

Extracting data from a tally file
---------------------------------
We use the `h5dapply` function as our general-purpose accessor to data stored in `HDF5`-based tally files. We provide the following parameters:

* filename: The name of the tally file to interact with
* group: The name of the group we want to access; when dealing with tally files this will always be a string of the form `/<Study>/<Chromosome>`
* names: A character vector listing the datasets to be extracted
* blocksize: The size of the blocks in which we want to apply our function
* range: The range of positions, within the dimension we apply over, that we want to process
* FUN: The function to apply to the blocks; it is always passed the data for the current block as its first argument and defaults to returning the data as-is

### Using `h5dapply` without `FUN` to get the data directly
When we do not specify the function to be applied (argument `FUN`), we get back the data that would otherwise be supplied to the function as its first argument (i.e. `FUN = function(x) x`).

```{r directdataextraction}
data <- h5dapply(
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c( "Coverages" ),
  blocksize = 10000,
  range = c(29000000,29010000)
)
str(data)
```

As you can see, the `data` object is a list with one element for each dataset (only "Coverages" in our example; note the dimensions of the "Coverages" array are `[1:6, 1:2, 1:10000]`, i.e. 6 samples, 2 strands and 10000 positions in the current block). Furthermore, the slot "h5dapplyInfo" is attached to the list `data` and is itself a list containing information about the current block ("Blockstart", "Blockend"), the datasets represented here (slot "Datasets") and the group we are processing right now ("/ExampleStudy/16"). This information can be used within the call to `FUN` to, for example, calculate the correct offset for genomic positions by adding the block start.

### Example of using `h5dapply` to extract coverage data

```{r coverageexample}
data <- h5dapply(
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c( "Coverages" ),
  blocksize = 1000,
  range = c(29000000,29010000),
  FUN = function(x) rowSums(x$Coverages),
  verbose = TRUE
)
coverages <- do.call( rbind, data )
colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
coverages
```

Here we apply the function `function(x) rowSums(x$Coverages)` to the genomic interval `16:29000000-29010000` in blocks of size `1000`. For each of the blocks we only extract the dataset "Coverages", since that is the only dataset the applied function accesses and extracting the others would be unnecessary additional work.

The return value of `h5dapply` is a list with one element per block, containing the output of the function `FUN` applied to that block. In the example above we compute the sum of coverages per sample within each block, then use `do.call(rbind, data)` to merge the results into a matrix and set the column names to the respective sample IDs specified in the `sampleData` object we loaded from the tally file.
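As mentioned above, the `h5dapplyInfo` slot can be used inside `FUN` to recover genomic coordinates for each block. The following sketch is a variation of the coverage example in which the applied function also reports the block boundaries; the function name `coverageWithPos` is ours, and the sketch assumes that each block passed to `FUN` carries the `h5dapplyInfo` list described above.

```{r coverageWithPositions, eval=FALSE}
# Sketch (not evaluated): a custom FUN that uses the per-block
# 'h5dapplyInfo' slot to attach genomic block boundaries to the
# per-sample coverage sums.
coverageWithPos <- function(x) {
  covs <- rowSums(x$Coverages)  # sum over strands and positions, one value per sample
  data.frame(
    Start = x$h5dapplyInfo$Blockstart,  # genomic start of the current block
    End   = x$h5dapplyInfo$Blockend,    # genomic end of the current block
    t(covs)
  )
}

blocks <- h5dapply(
  filename  = tallyFile,
  group     = "/ExampleStudy/16",
  names     = c( "Coverages" ),
  blocksize = 1000,
  range     = c(29000000, 29010000),
  FUN       = coverageWithPos
)
do.call(rbind, blocks)
```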
#### Example of more involved coverage analysis
Let's extract the coverage in 500-base bins from a small region on chromosome 22. We use the `binnedCoverage` function supplied by `h5vc`, which is almost identical to the `function(x) rowSums(x$Coverages)` used above, plus some checks and reformatting of the output.

```{r binnedCoverageExample}
data <- h5dapply(
  filename = tallyFile,
  group = "/ExampleStudy/22",
  names = c( "Coverages" ),
  blocksize = 500,
  range = c(38750000,39250000),
  FUN = binnedCoverage,
  sampledata = sampleData
)
data <- do.call(rbind, data)
rownames(data) = NULL
head(data)
```

The library sizes can be estimated from the `colSums` of the `data` object.

```{r sizeFactors}
librarySizes <- colSums(data[,1:6])
sizeFactors <- librarySizes / mean(librarySizes)
sizeFactors
```

We then adjust the coverages by dividing each sample's values by its size factor.

```{r adjustForSizeFactors}
for( n in names(sizeFactors) ){
  data[[n]] <- data[[n]] / sizeFactors[n]
}
```

Next we plot the pairwise log2 ratios of the size-factor-adjusted coverages.

```{r plotLog2Ratios, fig.width=18, fig.height=10}
suppressPackageStartupMessages(library(reshape))
suppressPackageStartupMessages(library(ggplot2))

pairWiseRatio <- function( data, sampledata, trans = log2 ){
  sampledata <- subset( sampledata, Sample %in% colnames(data))
  ret <- data.frame(Start = data$Start)
  for( patient in sampledata$Patient){
    control <- subset(sampledata, Patient == patient & Type == "Control")$Sample
    if( length(control) != 1 ){
      stop(paste("Exactly one control sample per patient is expected. Found", length(control), "in patient", patient) )
    }
    for( case in subset(sampledata, Patient == patient & Type == "Case")$Sample){
      ret[[ paste( case, control, sep="/") ]] <- trans( data[[case]] / data[[control]] )
    }
  }
  return(ret)
}

plot_data <- melt(pairWiseRatio( data, sampleData ), id.vars=c("Start"))
colnames(plot_data) = c("Position", "Samples", "Ratio")
ggplot( plot_data, aes( x = Position, y = Ratio, color = Samples ) ) +
  geom_line() +
  facet_wrap( ~Samples, ncol = 1)
```
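The ratio tracks above are fairly noisy at the single-bin level. Before interpreting them (e.g. for copy-number-like analyses) one might smooth the per-bin log2 ratios, for example with a running median. A minimal sketch of this, operating on the `plot_data` object from above with an arbitrary window width of 7 bins; bins with non-finite ratios (e.g. zero coverage in the control) are dropped first.

```{r smoothedLog2Ratios, eval=FALSE, fig.width=18, fig.height=10}
# Sketch (not evaluated): smooth the per-bin log2 ratios with a running
# median (stats::runmed) before plotting. The window width of 7 bins is
# an arbitrary choice for illustration.
plot_data_smoothed <- do.call(rbind, lapply(
  split(plot_data, plot_data$Samples),
  function(d) {
    d <- d[is.finite(d$Ratio), ]       # drop bins with zero coverage in case or control
    d <- d[order(d$Position), ]
    d$Ratio <- runmed(d$Ratio, k = 7)  # running median over 7 bins
    d
  }
))
ggplot( plot_data_smoothed, aes( x = Position, y = Ratio, color = Samples ) ) +
  geom_line() +
  facet_wrap( ~Samples, ncol = 1)
```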