---
title: "S.6 -- Annotation, Communication, and Performance"
author: Martin Morgan <martin.morgan@roswellpark.org>
date: "22 November 2016"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{S.6 -- Annotation, communication, and performance}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
options(width=100)
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```

```{r setup, echo=FALSE}
suppressPackageStartupMessages({
    library(AnnotationHub)
    library(airway)
    library(org.Hs.eg.db)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(RNAseqData.HNRNPC.bam.chr14)
    library(GenomicAlignments)
    library(shiny)
    library(BiocParallel)
    library(microbenchmark)
})
```

# Annotation

- _Bioconductor_ provides extensive access to 'annotation' resources
  (see the [AnnotationData][] biocViews hierarchy); some interesting
  examples to explore during this lab include:
- [biomaRt][], [PSICQUIC][], [KEGGREST][] and other packages for
  querying on-line resources; each of these have informative vignettes.
- [AnnotationDbi][] is a cornerstone of the
  [Annotation Data][AnnotationData] packages provided by Bioconductor.
    - **org** packages (e.g., [org.Hs.eg.db][]) contain maps between
      different gene identifiers, e.g., ENTREZ and SYMBOL. The basic
      interface to these packages is described on the help page `?select`
    - **TxDb** packages (e.g., [TxDb.Hsapiens.UCSC.hg19.knownGene][])
      contain gene models (exon coordinates, exon / transcript
      relationships, etc) derived from common sources such as the hg19
      knownGene track of the UCSC genome browser. These packages can be
      queried, e.g., as described on the `?exonsBy` page to retrieve all
      exons grouped by gene or transcript.
    - **BSgenome** packages (e.g., [BSgenome.Hsapiens.UCSC.hg19][])
      contain whole genomes of model organisms.
- [VariantAnnotation][] and [ensemblVEP][] provide access to sequence
  annotation facilities, e.g., to identify coding variants; see the
  [Introduction to VariantAnnotation][] vignette for a brief
  introduction.
- Take a quick look at the [annotation work
  flow](https://bioconductor.org/help/workflows/annotation/annotation/)
  on the Bioconductor web site.

Static packages

- _org.\*_: identifier mappings
  
    - `select()`, `columns()`, `keys()`
    - `mapIds()`

    ```{r}
    library(org.Hs.eg.db)
    org <- org.Hs.eg.db
    select(org, "BRCA1", c("ENSEMBL", "GENENAME"), "SYMBOL")
    ```

- _TxDb.\*_: gene models
    
    - `exons()`, `transcripts()`, `genes()`, `promoters()`, ...
    - `exonsBy()`, `transcriptsBy()`
    - `select()`, etc.
    
    ```{r}
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    promoters(txdb)
    ```

Web-based resources, e.g., [biomaRt][], [PSICQUIC][], [GEOquery][], ...

Genome-scale resources via [AnnotationHub][]

```{r}
library(AnnotationHub)
hub = AnnotationHub()
hub
query(hub, c("ensembl", "81.gtf"))
hub[["AH48004"]]
```

Load the org package for _Homo sapiens_.

```{r}
library(org.Hs.eg.db)
```

Use `select()` to annotate the HNRNPC gene symbol with its Entrez
identifier and less formal gene name. Create a map between SYMBOL and
ENTREZID using `mapIds()`.

```{r}
select(org.Hs.eg.db, "HNRNPC", c("ENTREZID", "GENENAME"), "SYMBOL")
sym2eg <- mapIds(org.Hs.eg.db, "HNRNPC", "ENTREZID", "SYMBOL")
```

Load the TxDb package for the UCSC hg19 knownGene track

```{r}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
```

Extract coordinates of genes, and of exons grouped by gene for the
HNRNPC gene.

```{r}
gns <- genes(txdb)
exonsBy(txdb, "gene")[sym2eg]
```

The [RNAseqData.HNRNPC.bam.chr14][] package is an example of an
experiment data package. It contains a subset of BAM files used in a
gene knock-down experiment, as described in
`?RNAseqData.HNRNPC.bam.chr14`. Load the package and get the path to
the BAM files.

```{r}
library(RNAseqData.HNRNPC.bam.chr14)
fls = RNAseqData.HNRNPC.bam.chr14_BAMFILES
basename(fls)
```

Create `BamFileList()`, basically telling R that these are paths to
BAM files rather than, say, text files from a spreadsheet.

```{r}
library(GenomicAlignments)
bfls = BamFileList(fls)
bfl = bfls[[1]]
```

Use the gene coordinates to query the BAM file for a specific genomic
region; see `?ScanBamParam()` for other ways of restricting data
input.

```{r}
library(Rsamtools)
param <- ScanBamParam(which=gns[sym2eg])
readGAlignments(bfl, param=param)
```

Load the [airway][] experiment data package

```{r}
library(airway)
data(airway)
airway
```

The row names are Ensembl gene identifiers. Use `mapIds()` to map from
these to gene symbols.

```{r}
symid <- mapIds(org.Hs.eg.db, rownames(airway), "SYMBOL", "ENSEMBL")
```

Add the gene symbols to the summarized experiment object.

```{r}
mcols(rowRanges(airway))$symid <- symid
```

## _AnnotationHub_

The [Roadmap Epigenomics Project][] generated genome-wide maps of
regulatory marks across a number of cell lines. 

Retrieve the Epigenome Roadmap table from [AnnotationHub][]...

```{r}
library(AnnotationHub)
hub <- AnnotationHub()
query(hub, c("epigenome", "metadata"))
meta <- hub[["AH41830"]]
```

Explore the metadata to identify a cell line of interest to you; see
also the [metadata][] spreadsheet version of the data made available
by the Epigenome roadmap project.

```{r}
table(meta$ANATOMY)
meta[meta$ANATOMY == "LIVER",]
```

Use the 'EID' to query for and retrieve the 'mnemonic' file
summarizing chromatin state

```{r}
query(hub, c("E118", "mnemonic"))
E118 <- hub[["AH46971"]]
E118
```

Explore the object, e.g., tabulating the different chromatin state
classifications (in the `name` column). Subset the object to return,
e.g., just those regions marked as 'Heterochromatin'

```{r}
table(E118$name)
E118[E118$name %in% "Heterochromatin"]
```

Can you, using a TxDb package and the `genes()` and
`subsetByOverlaps()` functions, determine how many genes overlap
heterochromatic states, or the genes `nearest()` each enhancer?

## _biomaRt_

Visit the [biomart website][] and figure out how to browse data to
retreive, e.g., genes on chromosmes 21 and 22. You'll need to browse
to the ensembl mart, _Homo spaiens_ data set, establish filters for
chromosomes 21 and 22, and then specify that you'd like the Ensembl
gene id attribute returned.

Now do the same process in [biomaRt][]:

```{r biomart, eval=FALSE}
library(biomaRt)
head(listMarts(), 3)                      ## list marts
head(listDatasets(useMart("ensembl")), 3) ## mart datasets
ensembl <-                                ## fully specified mart
    useMart("ensembl", dataset = "hsapiens_gene_ensembl")

head(listFilters(ensembl), 3)             ## filters
myFilter <- "chromosome_name"
substr(filterOptions(myFilter, ensembl), 1, 50) ## return values
myValues <- c("21", "22")
head(listAttributes(ensembl), 3)          ## attributes
myAttributes <- c("ensembl_gene_id","chromosome_name")

## assemble and query the mart
res <- getBM(attributes =  myAttributes, filters =  myFilter,
             values =  myValues, mart = ensembl)
```

# Communicating results

## Visualization

- The [Gviz][] package provides great tools for visualizing local
  genomic coordinates and associated data.
- [epivizr][] drives the [epiviz](http://epiviz.cbcb.umd.edu/) genome
  browser from within R; [rtracklayer][] provides easy ways to
  transfer data to and manipulate UCSC browser sessions.
- Additional packages include [ComplexHeatmap][], [ggbio][],
  [OmicCircos][], ...
  
## Markdown

## Interactive applications: Shiny

# Performance and big data

## Efficient _R_ code

The goal of this section is to highlight practices for writing correct, robust
and efficient R code.

### Priorities

1. Correct: consistent with hand-worked examples (`identical()`, `all.equal()`)
2. Robust: supports realistic inputs, e.g., 0-length vectors, `NA`
   values, ...
3. Simple: easy to understand next month; easy to describe what it
   does to a colleague; easy to spot logical errors; easy to enhance.
4. Fast, or at least reasonable given the speed of modern computers.

### Strategies

1. Profile
    - _Look_ at the script to understand in general terms what it is doing.
    - _Step_ through the code to see how it is executed, and to gain an
      understanding of the speed of each line.
    - _Time_ evaluation of select lines or simple chunks of code with
      `system.time()` or the `r CRANpkg("microbenchmark")` package.
    - _Profile_ the code with a tool that indicates how much time is
      spent in each function call or line -- the built-in `Rprof()`
      function, or packages such as `r CRANpkg("lineprof")` or
      `r CRANpkg("aprof")`
2. Vectorize -- operate on vectors, rather than explicit loops

    ```{r vectorize}
    x <- 1:10
    log(x)     ## NOT for (i in seq_along(x)) x[i] <- log(x[i])
    ```

3. Pre-allocate memory, then fill in the result

    ```{r pre-allocate}
    result <- numeric(10)
    result[1] <- runif(1)
    for (i in 2:length(result))
           result[i] <- runif(1) * result[i - 1]
    result
    ```

4. Hoist common sub-expressions outside of repeated calculations, so
   that the sub-expression is only calculated once
    - Simple, e.g., 'hoist' constant multiplications from a `for` loop
    - Higher-level, e.g., use `lm.fit()` rather than repeatedly fitting
      the same design matrix.
5. Re-use existing, tested code
    - Efficient implementations of common operations -- `tabulate()`,
      `rowSums()` and friends, `%in%`, ...
    - Efficient domain-specific implementations, e.g.,
      `r Biocpkg("snpStats")` for GWAS linear models; `r Biocpkg("limma")`
      for microarray linear models; `r Biocpkg("edgeR")`,
      `r Biocpkg("DESeq2")` for negative binomial GLMs relevant to
      RNASeq.
    - Reuse others' work -- `r Biocpkg("DESeq2")`,
      `r Biocpkg("GenomicRanges")`, `r Biocpkg("Biostrings")`, ...,
      `r CRANpkg("dplyr")`, `r CRANpkg("data.table")`, `r CRANpkg("Rcpp")`


### Case Study: Pre-allocate and vectorize

Here's an obviously inefficient function:

```{r inefficient}
f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- a * log(i)
    result
}
```

Use `system.time()` to investigate how this algorithm scales with `n`,
focusing on elapsed time.

```{r system-time}
system.time(f0(10000))
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")
```

Remember the current 'correct' value, and an approximate time

```{r correct-init}
n <- 10000
system.time(expected <- f0(n))
head(expected)
```

Revise the function to hoist the common multiplier, `a`, out of the
loop. Make sure the result of the 'optimization' and the original
calculation are the same. Use the `r CRANpkg("microbenchmark")`
package to compare the two versions

```{r hoist}
f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f1(n))

library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
```

Adopt a 'pre-allocate and fill' strategy

```{r preallocate-and-fill}
f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))
microbenchmark(f0(n), f2(n), times=5)
```

Use an `*apply()` function to avoid having to explicitly pre-allocate,
and make opportunities for vectorization more apparent.

```{r use-apply}
f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)

identical(expected, f3(n))
microbenchmark(f0(n), f2(n), f3(n), times=10)
```

Now that the code is presented in a single line, it is apparent that
it could be easily vectorized.  Seize the opportunity to vectorize it:

```{r use-vectorize}
f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))
microbenchmark(f0(n), f3(n), f4(n), times=10)
```

`f4()` definitely seems to be the winner. How does it scale with `n`?
(Repeat several times)

```{r vectorized-scale, eval=FALSE}
n <- 10 ^ (5:8)                         # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")
```

Any explanations for the different pattern of response?

Lessons learned:

1. Vectorizing offers a huge improvement over iteration
2. Pre-allocate-and-fill is very helpful when explicit iteration is
   required.
3. `*apply()` functions help avoid need for explicit pre-allocation
   and make opportunities for vectorization more apparent. This may
   come at a small performance cost, but  is generally worth it
4. Hoisting common sub-expressions can be helpful for improving
   performance when explicit iteration is required.

## Parallel evaluation

When data are too large to fit in memory, we can iterate through files in
chunks or subset the data by fields or genomic positions.

Iteration

- Chunk-wise
- `open()`, read chunk(s), `close()`.
- e.g., `yieldSize` argument to `Rsamtools::BamFile()`
- Framework: `GenomicFiles::reduceByYield()`

Restriction

- Limit to columns and / or rows of interest
- Exploit domain-specific formats
    - BAM files and `Rsamtools::ScanBamParam()`
    - BAM files and `Rsamtools::PileupParam()`
    - VCF files and `VariantAnnotation::ScanVcfParam()`
- Use a data base

Parallel evalutation

- _After_ writing efficient _R_ code, adopting iteration and restriction
- [BiocParallel][] for common, cross-platform operation

[BiocParallel][] provides a standardized interface for simple parallel
evaluation. The package builds provides access to the `snow` and
`multicore` functionality in the `parallel` package as well as
`BatchJobs` for running cluster jobs.

General ideas:

- Use `bplapply()` instead of `lapply()`
- Argument `BPPARAM` influences how parallel evaluation occurs

    - `MulticoreParam()`: threads on a single (non-Windows) machine
    - `SnowParam()`: processes on the same or different machines
    - `BatchJobsParam()`: resource scheduler on a cluster
    - `DoparParam()`: parallel execution with `foreach()`


### Exercise: Sleeping serially and in parallel

This small example motivates the use of parallel execution and demonstrates how
`bplapply()` can be a drop in for `lapply`.

Use `system.time()` to explore how long this takes to execute as `n`
increases from 1 to 10. Use `identical()` and
`r CRANpkg("microbenchmark")` to compare alternatives `f0()` and  `f1()`
for both correctness and performance.

`fun` sleeps for 1 second, then returns `i`.

```{r parallel-sleep}
library(BiocParallel)

fun <- function(i) {
    Sys.sleep(1)
    i
}

## serial
f0 <- function(n)
    lapply(seq_len(n), fun)

## parallel
f1 <- function(n)
    bplapply(seq_len(n), fun)
```

[B.1 Introduction to _Bioconductor_]: ./B1_Bioconductor_Intro.html
[Roadmap Epigenomics Project]: http://egg2.wustl.edu/roadmap/web_portal/
[metadata]: https://docs.google.com/spreadsheets/d/1yikGx4MsO9Ei36b64yOy9Vb6oPC5IBGlFbYEt-N6gOM/edit#gid=15
[biomart website]: http://biomart.org
[Introduction to VariantAnnotation]: https://bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf

[AnnotationDbi]: https://bioconductor.org/packages/AnnotationDbi
[AnnotationHub]: https://bioconductor.org/packages/AnnotationHub
[BSgenome.Hsapiens.UCSC.hg19]: https://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19
[BSgenome]: https://bioconductor.org/packages/BSgenome
[Biostrings]: https://bioconductor.org/packages/Biostrings
[GenomicAlignments]: https://bioconductor.org/packages/GenomicAlignments
[GenomicFeatures]: https://bioconductor.org/packages/GenomicFeatures
[GenomicRanges]: https://bioconductor.org/packages/GenomicRanges
[KEGGREST]: https://bioconductor.org/packages/KEGGREST
[PSICQUIC]: https://bioconductor.org/packages/PSICQUIC
[RNAseqData.HNRNPC.bam.chr14]: https://bioconductor.org/packages/RNAseqData.HNRNPC.bam.chr14
[Rsamtools]: https://bioconductor.org/packages/Rsamtools
[ShortRead]: https://bioconductor.org/packages/ShortRead
[TxDb.Hsapiens.UCSC.hg19.knownGene]: https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene
[VariantAnnotation]: https://bioconductor.org/packages/VariantAnnotation
[airway]: https://bioconductor.org/packages/airway
[biomaRt]: https://bioconductor.org/packages/biomaRt
[GEOquery]: https://bioconductor.org/packages/GEOquery
[ensemblVEP]: https://bioconductor.org/packages/ensemblVEP
[org.Hs.eg.db]: https://bioconductor.org/packages/org.Hs.eg.db
[Gviz]: https://bioconductor.org/packages/Gviz
[epivizr]: https://bioconductor.org/packages/epivizr
[ComplexHeatmap]: https://bioconductor.org/packages/ComplexHeatmap
[ggbio]: https://bioconductor.org/packages/ggbio
[OmicCircos]: https://bioconductor.org/packages/OmicCircos
[BiocParallel]: https://bioconductor.org/packages/BiocParallel
[rtracklayer]: https://bioconductor.org/packages/rtracklayer

[AnnotationData]: https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData