---
title: "B.4 -- Next Steps"
author: Martin Morgan <martin.morgan@roswellpark.org>
date: "16 - 17 May, 2016"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{B.4 -- Next Steps}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
options(width=100)
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```

```{r setup, echo=FALSE}
suppressPackageStartupMessages({
    library(BiocParallel)
    library(microbenchmark)
})
```

# Common _Bioconductor_ work flows

This very open-ended topic points to some of the most prominent
Bioconductor packages for sequence analysis. Use the opportunity in
this lab to explore the package vignettes and help pages highlighted
below; many of the material will be covered in greater detail in
subsequent labs and lectures.

Domain-specific analysis -- explore the landing pages, vignettes, and
reference manuals of two or three of the following packages.

- Important packages for analysis of differential expression include
  [edgeR][] and [DESeq2][]; both have excellent vignettes for
  exploration. Additional research methods embodied in Bioconductor
  packages can be discovered by visiting the [biocViews][] web page,
  searching for the 'DifferentialExpression' view term, and narrowing
  the selection by searching for 'RNA seq' and similar.
- Popular ChIP-seq packages include [csaw][] and [DiffBind][] for
  comparison of peaks across samples, [ChIPQC][] for quality
  assessment, and [ChIPseeker][] for annotating results (e.g.,
  discovering nearby genes). What other ChIP-seq packages are listed
  on the [biocViews][] page?
- Working with called variants (VCF files) is facilitated by packages
  such as [VariantAnnotation][], [VariantFiltering][], [ensemblVEP][],
  and [SomaticSignatures][]; packages for calling variants include,
  e.g., [h5vc][] and [VariantTools][].
- Several packages identify copy number variants from sequence data,
  including [cn.mops][]; from the [biocViews][] page, what other copy
  number packages are available? The [CNTools][] package provides some
  useful facilities for comparison of segments across samples.
- Microbiome and metagenomic analysis is facilitated by packages such
  as [phyloseq][] and [metagenomeSeq][].
- Metabolomics, chemoinformatics, image analysis, and many other
  high-throughput analysis domains are also represented in
  Bioconductor; explore these via biocViews and title searches.

Working with sequences, alignments, common web file formats, and raw
data; these packages rely very heavily on the [IRanges][] /
[GenomicRanges][] infrastructure.

- The [Biostrings][] package is used to represent DNA and other
  sequences, with many convenient sequence-related functions. Check
  out the functions documented on the help page `?consensusMatrix`,
  for instance. Also check out the [BSgenome][] package for working
  with whole genome sequences, e.g., `?"getSeq,BSgenome-method"`
- The [GenomicAlignments][] package is used to input reads aligned to
  a reference genome. See for instance the `?readGAlignments` help
  page and `vigentte(package="GenomicAlignments",
  "summarizeOverlaps")`
- [rtracklayer][]'s `import` and `export` functions can read in many
  common file types, e.g., BED, WIG, GTF, ..., in addition to querying
  and navigating the UCSC genome browser. Check out the `?import` page
  for basic usage.
- The [ShortRead][] and [Rsamtools][] packages can be used for
  lower-level access to FASTQ and BAM files, respectively. Explore the
  [ShortRead vignette](https://bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf)
  and Scalable Genomics labs to see approaches to effectively
  processing the large files.

Visualization

- The [Gviz][] package provides great tools for visualizing local
  genomic coordinates and associated data.
- [epivizr][] drives the [epiviz](http://epiviz.cbcb.umd.edu/) genome
  browser from within R; [rtracklayer][] provides easy ways to
  transfer data to and manipulate UCSC browser sessions.
- Additional packages include [ggbio][], [OmicCircos][], ...

![Alt Sequencing Ecosystem](our_figures/SequencingEcosystem.png)

# Computation: Big Data

## Efficient _R_ code

The goal of this section is to highlight practices for writing correct, robust
and efficient R code.

## Priorities

1. Correct: consistent with hand-worked examples (`identical()`, `all.equal()`)
2. Robust: supports realistic inputs, e.g., 0-length vectors, `NA`
   values, ...
3. Simple: easy to understand next month; easy to describe what it
   does to a colleague; easy to spot logical errors; easy to enhance.
4. Fast, or at least reasonable given the speed of modern computers.

## Strategies

1. Profile
    - _Look_ at the script to understand in general terms what it is doing.
    - _Step_ through the code to see how it is executed, and to gain an
      understanding of the speed of each line.
    - _Time_ evaluation of select lines or simple chunks of code with
      `system.time()` or the `r CRANpkg("microbenchmark")` package.
    - _Profile_ the code with a tool that indicates how much time is
      spent in each function call or line -- the built-in `Rprof()`
      function, or packages such as `r CRANpkg("lineprof")` or
      `r CRANpkg("aprof")`
2. Vectorize -- operate on vectors, rather than explicit loops

    ```{r vectorize}
    x <- 1:10
    log(x)     ## NOT for (i in seq_along(x)) x[i] <- log(x[i])
    ```

3. Pre-allocate memory, then fill in the result

    ```{r pre-allocate}
    result <- numeric(10)
    result[1] <- runif(1)
    for (i in 2:length(result))
           result[i] <- runif(1) * result[i - 1]
    result
    ```

4. Hoist common sub-expressions outside of repeated calculations, so
   that the sub-expression is only calculated once
    - Simple, e.g., 'hoist' constant multiplications from a `for` loop
    - Higher-level, e.g., use `lm.fit()` rather than repeatedly fitting
      the same design matrix.
5. Re-use existing, tested code
    - Efficient implementations of common operations -- `tabulate()`,
      `rowSums()` and friends, `%in%`, ...
    - Efficient domain-specific implementations, e.g.,
      `r Biocpkg("snpStats")` for GWAS linear models; `r Biocpkg("limma")`
      for microarray linear models; `r Biocpkg("edgeR")`,
      `r Biocpkg("DESeq2")` for negative binomial GLMs relevant to
      RNASeq.
    - Reuse others' work -- `r Biocpkg("DESeq2")`,
      `r Biocpkg("GenomicRanges")`, `r Biocpkg("Biostrings")`, ...,
      `r CRANpkg("dplyr")`, `r CRANpkg("data.table")`, `r CRANpkg("Rcpp")`


### Case Study: Pre-allocate and vectorize

Here's an obviously inefficient function:

```{r inefficient}
f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- a * log(i)
    result
}
```

Use `system.time()` to investigate how this algorithm scales with `n`,
focusing on elapsed time.

```{r system-time}
system.time(f0(10000))
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")
```

Remember the current 'correct' value, and an approximate time

```{r correct-init}
n <- 10000
system.time(expected <- f0(n))
head(expected)
```

Revise the function to hoist the common multiplier, `a`, out of the
loop. Make sure the result of the 'optimization' and the original
calculation are the same. Use the `r CRANpkg("microbenchmark")`
package to compare the two versions

```{r hoist}
f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f1(n))

library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
```

Adopt a 'pre-allocate and fill' strategy

```{r preallocate-and-fill}
f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))
microbenchmark(f0(n), f2(n), times=5)
```

Use an `*apply()` function to avoid having to explicitly pre-allocate,
and make opportunities for vectorization more apparent.

```{r use-apply}
f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)

identical(expected, f3(n))
microbenchmark(f0(n), f2(n), f3(n), times=10)
```

Now that the code is presented in a single line, it is apparent that
it could be easily vectorized.  Seize the opportunity to vectorize it:

```{r use-vectorize}
f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))
microbenchmark(f0(n), f3(n), f4(n), times=10)
```

`f4()` definitely seems to be the winner. How does it scale with `n`?
(Repeat several times)

```{r vectorized-scale}
n <- 10 ^ (5:8)                         # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")
```

Any explanations for the different pattern of response?

Lessons learned:

1. Vectorizing offers a huge improvement over iteration
2. Pre-allocate-and-fill is very helpful when explicit iteration is
   required.
3. `*apply()` functions help avoid need for explicit pre-allocation
   and make opportunities for vectorization more apparent. This may
   come at a small performance cost, but  is generally worth it
4. Hoisting common sub-expressions can be helpful for improving
   performance when explicit iteration is required.

## Parallel evaluation

When data are too large to fit in memory, we can iterate through files in
chunks or subset the data by fields or genomic positions.

Iteration

- Chunk-wise
- `open()`, read chunk(s), `close()`.
- e.g., `yieldSize` argument to `Rsamtools::BamFile()`
- Framework: `GenomicFiles::reduceByYield()`

Restriction

- Limit to columns and / or rows of interest
- Exploit domain-specific formats
    - BAM files and `Rsamtools::ScanBamParam()`
    - BAM files and `Rsamtools::PileupParam()`
    - VCF files and `VariantAnnotation::ScanVcfParam()`
- Use a data base

Parallel evalutation

- _After_ writing efficient _R_ code, adopting iteration and restriction
- [BiocParallel][] for common, cross-platform operation

[BiocParallel][] provides a standardized interface for simple parallel
evaluation. The package builds provides access to the `snow` and
`multicore` functionality in the `parallel` package as well as
`BatchJobs` for running cluster jobs.

General ideas:

- Use `bplapply()` instead of `lapply()`
- Argument `BPPARAM` influences how parallel evaluation occurs

    - `MulticoreParam()`: threads on a single (non-Windows) machine
    - `SnowParam()`: processes on the same or different machines
    - `BatchJobsParam()`: resource scheduler on a cluster
    - `DoparParam()`: parallel execution with `foreach()`


### Exercise: Sleeping serially and in parallel

This small example motivates the use of parallel execution and demonstrates how
`bplapply()` can be a drop in for `lapply`.

Use `system.time()` to explore how long this takes to execute as `n`
increases from 1 to 10. Use `identical()` and
`r CRANpkg("microbenchmark")` to compare alternatives `f0()` and  `f1()`
for both correctness and performance.

`fun` sleeps for 1 second, then returns `i`.

```{r parallel-sleep}
library(BiocParallel)

fun <- function(i) {
    Sys.sleep(1)
    i
}

## serial
f0 <- function(n)
    lapply(seq_len(n), fun)

## parallel
f1 <- function(n)
    bplapply(seq_len(n), fun)
```

[biocViews]: https://bioconductor.org/packages/

[AnnotationDbi]: https://bioconductor.org/packages/AnnotationDbi
[AnnotationHub]: https://bioconductor.org/packages/AnnotationHub
[BSgenome.Hsapiens.UCSC.hg19]: https://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19
[BSgenome]: https://bioconductor.org/packages/BSgenome
[BiocParallel]: https://bioconductor.org/packages/BiocParallel
[Biostrings]: https://bioconductor.org/packages/Biostrings
[CNTools]: https://bioconductor.org/packages/CNTools
[ChIPQC]: https://bioconductor.org/packages/ChIPQC
[ChIPseeker]: https://bioconductor.org/packages/ChIPseeker
[DESeq2]: https://bioconductor.org/packages/DESeq2
[DiffBind]: https://bioconductor.org/packages/DiffBind
[GenomicAlignments]: https://bioconductor.org/packages/GenomicAlignments
[GenomicFiles]: https://bioconductor.org/packages/GenomicFiles
[GenomicRanges]: https://bioconductor.org/packages/GenomicRanges
[Gviz]: https://bioconductor.org/packages/Gviz
[Homo.sapiens]: https://bioconductor.org/packages/Homo.sapiens
[IRanges]: https://bioconductor.org/packages/IRanges
[KEGGREST]: https://bioconductor.org/packages/KEGGREST
[OmicCircos]: https://bioconductor.org/packages/OmicCircos
[PSICQUIC]: https://bioconductor.org/packages/PSICQUIC
[Rsamtools]: https://bioconductor.org/packages/Rsamtools
[Rsubread]: https://bioconductor.org/packages/Rsubread
[ShortRead]: https://bioconductor.org/packages/ShortRead
[SomaticSignatures]: https://bioconductor.org/packages/SomaticSignatures
[SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment
[TxDb.Hsapiens.UCSC.hg19.knownGene]: https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene
[VariantAnnotation]: https://bioconductor.org/packages/VariantAnnotation
[VariantFiltering]: https://bioconductor.org/packages/VariantFiltering
[VariantTools]: https://bioconductor.org/packages/VariantTools
[airway]: https://bioconductor.org/packages/airway
[biomaRt]: https://bioconductor.org/packages/biomaRt
[cn.mops]: https://bioconductor.org/packages/cn.mops
[csaw]: https://bioconductor.org/packages/csaw
[edgeR]: https://bioconductor.org/packages/edgeR
[ensemblVEP]: https://bioconductor.org/packages/ensemblVEP
[epivizr]: https://bioconductor.org/packages/epivizr
[ggbio]: https://bioconductor.org/packages/ggbio
[h5vc]: https://bioconductor.org/packages/h5vc
[limma]: https://bioconductor.org/packages/limma
[metagenomeSeq]: https://bioconductor.org/packages/metagenomeSeq
[org.Hs.eg.db]: https://bioconductor.org/packages/org.Hs.eg.db
[org.Sc.sgd.db]: https://bioconductor.org/packages/org.Sc.sgd.db
[phyloseq]: https://bioconductor.org/packages/phyloseq
[rtracklayer]: https://bioconductor.org/packages/rtracklayer
[snpStats]: https://bioconductor.org/packages/snpStats

[dplyr]: https://cran.r-project.org/package=dplyr
[data.table]: https://cran.r-project.org/package=data.table
[Rcpp]: https://cran.r-project.org/package=Rcpp