--- title: "S.6 -- Annotation, Communication, and Performance" author: Martin Morgan date: "22 November 2016" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{S.6 -- Annotation, communication, and performance} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} options(width=100) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r setup, echo=FALSE} suppressPackageStartupMessages({ library(AnnotationHub) library(airway) library(org.Hs.eg.db) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(RNAseqData.HNRNPC.bam.chr14) library(GenomicAlignments) library(shiny) library(BiocParallel) library(microbenchmark) }) ``` # Annotation - _Bioconductor_ provides extensive access to 'annotation' resources (see the [AnnotationData][] biocViews hierarchy); some interesting examples to explore during this lab include: - [biomaRt][], [PSICQUIC][], [KEGGREST][] and other packages for querying on-line resources; each of these have informative vignettes. - [AnnotationDbi][] is a cornerstone of the [Annotation Data][AnnotationData] packages provided by Bioconductor. - **org** packages (e.g., [org.Hs.eg.db][]) contain maps between different gene identifiers, e.g., ENTREZ and SYMBOL. The basic interface to these packages is described on the help page `?select` - **TxDb** packages (e.g., [TxDb.Hsapiens.UCSC.hg19.knownGene][]) contain gene models (exon coordinates, exon / transcript relationships, etc) derived from common sources such as the hg19 knownGene track of the UCSC genome browser. These packages can be queried, e.g., as described on the `?exonsBy` page to retrieve all exons grouped by gene or transcript. - **BSgenome** packages (e.g., [BSgenome.Hsapiens.UCSC.hg19][]) contain whole genomes of model organisms. - [VariantAnnotation][] and [ensemblVEP][] provide access to sequence annotation facilities, e.g., to identify coding variants; see the [Introduction to VariantAnnotation][] vignette for a brief introduction. - Take a quick look at the [annotation work flow](https://bioconductor.org/help/workflows/annotation/annotation/) on the Bioconductor web site. Static packages - _org.\*_: identifier mappings - `select()`, `columns()`, `keys()` - `mapIds()` ```{r} library(org.Hs.eg.db) org <- org.Hs.eg.db select(org, "BRCA1", c("ENSEMBL", "GENENAME"), "SYMBOL") ``` - _TxDb.\*_: gene models - `exons()`, `transcripts()`, `genes()`, `promoters()`, ... - `exonsBy()`, `transcriptsBy()` - `select()`, etc. ```{r} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene promoters(txdb) ``` Web-based resources, e.g., [biomaRt][], [PSICQUIC][], [GEOquery][], ... Genome-scale resources via [AnnotationHub][] ```{r} library(AnnotationHub) hub = AnnotationHub() hub query(hub, c("ensembl", "81.gtf")) hub[["AH48004"]] ``` Load the org package for _Homo sapiens_. ```{r} library(org.Hs.eg.db) ``` Use `select()` to annotate the HNRNPC gene symbol with its Entrez identifier and less formal gene name. Create a map between SYMBOL and ENTREZID using `mapIds()`. ```{r} select(org.Hs.eg.db, "HNRNPC", c("ENTREZID", "GENENAME"), "SYMBOL") sym2eg <- mapIds(org.Hs.eg.db, "HNRNPC", "ENTREZID", "SYMBOL") ``` Load the TxDb package for the UCSC hg19 knownGene track ```{r} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ``` Extract coordinates of genes, and of exons grouped by gene for the HNRNPC gene. ```{r} gns <- genes(txdb) exonsBy(txdb, "gene")[sym2eg] ``` The [RNAseqData.HNRNPC.bam.chr14][] package is an example of an experiment data package. It contains a subset of BAM files used in a gene knock-down experiment, as described in `?RNAseqData.HNRNPC.bam.chr14`. Load the package and get the path to the BAM files. ```{r} library(RNAseqData.HNRNPC.bam.chr14) fls = RNAseqData.HNRNPC.bam.chr14_BAMFILES basename(fls) ``` Create `BamFileList()`, basically telling R that these are paths to BAM files rather than, say, text files from a spreadsheet. ```{r} library(GenomicAlignments) bfls = BamFileList(fls) bfl = bfls[[1]] ``` Use the gene coordinates to query the BAM file for a specific genomic region; see `?ScanBamParam()` for other ways of restricting data input. ```{r} library(Rsamtools) param <- ScanBamParam(which=gns[sym2eg]) readGAlignments(bfl, param=param) ``` Load the [airway][] experiment data package ```{r} library(airway) data(airway) airway ``` The row names are Ensembl gene identifiers. Use `mapIds()` to map from these to gene symbols. ```{r} symid <- mapIds(org.Hs.eg.db, rownames(airway), "SYMBOL", "ENSEMBL") ``` Add the gene symbols to the summarized experiment object. ```{r} mcols(rowRanges(airway))$symid <- symid ``` ## _AnnotationHub_ The [Roadmap Epigenomics Project][] generated genome-wide maps of regulatory marks across a number of cell lines. Retrieve the Epigenome Roadmap table from [AnnotationHub][]... ```{r} library(AnnotationHub) hub <- AnnotationHub() query(hub, c("epigenome", "metadata")) meta <- hub[["AH41830"]] ``` Explore the metadata to identify a cell line of interest to you; see also the [metadata][] spreadsheet version of the data made available by the Epigenome roadmap project. ```{r} table(meta$ANATOMY) meta[meta$ANATOMY == "LIVER",] ``` Use the 'EID' to query for and retrieve the 'mnemonic' file summarizing chromatin state ```{r} query(hub, c("E118", "mnemonic")) E118 <- hub[["AH46971"]] E118 ``` Explore the object, e.g., tabulating the different chromatin state classifications (in the `name` column). Subset the object to return, e.g., just those regions marked as 'Heterochromatin' ```{r} table(E118$name) E118[E118$name %in% "Heterochromatin"] ``` Can you, using a TxDb package and the `genes()` and `subsetByOverlaps()` functions, determine how many genes overlap heterochromatic states, or the genes `nearest()` each enhancer? ## _biomaRt_ Visit the [biomart website][] and figure out how to browse data to retreive, e.g., genes on chromosmes 21 and 22. You'll need to browse to the ensembl mart, _Homo spaiens_ data set, establish filters for chromosomes 21 and 22, and then specify that you'd like the Ensembl gene id attribute returned. Now do the same process in [biomaRt][]: ```{r biomart, eval=FALSE} library(biomaRt) head(listMarts(), 3) ## list marts head(listDatasets(useMart("ensembl")), 3) ## mart datasets ensembl <- ## fully specified mart useMart("ensembl", dataset = "hsapiens_gene_ensembl") head(listFilters(ensembl), 3) ## filters myFilter <- "chromosome_name" substr(filterOptions(myFilter, ensembl), 1, 50) ## return values myValues <- c("21", "22") head(listAttributes(ensembl), 3) ## attributes myAttributes <- c("ensembl_gene_id","chromosome_name") ## assemble and query the mart res <- getBM(attributes = myAttributes, filters = myFilter, values = myValues, mart = ensembl) ``` # Communicating results ## Visualization - The [Gviz][] package provides great tools for visualizing local genomic coordinates and associated data. - [epivizr][] drives the [epiviz](http://epiviz.cbcb.umd.edu/) genome browser from within R; [rtracklayer][] provides easy ways to transfer data to and manipulate UCSC browser sessions. - Additional packages include [ComplexHeatmap][], [ggbio][], [OmicCircos][], ... ## Markdown ## Interactive applications: Shiny # Performance and big data ## Efficient _R_ code The goal of this section is to highlight practices for writing correct, robust and efficient R code. ### Priorities 1. Correct: consistent with hand-worked examples (`identical()`, `all.equal()`) 2. Robust: supports realistic inputs, e.g., 0-length vectors, `NA` values, ... 3. Simple: easy to understand next month; easy to describe what it does to a colleague; easy to spot logical errors; easy to enhance. 4. Fast, or at least reasonable given the speed of modern computers. ### Strategies 1. Profile - _Look_ at the script to understand in general terms what it is doing. - _Step_ through the code to see how it is executed, and to gain an understanding of the speed of each line. - _Time_ evaluation of select lines or simple chunks of code with `system.time()` or the `r CRANpkg("microbenchmark")` package. - _Profile_ the code with a tool that indicates how much time is spent in each function call or line -- the built-in `Rprof()` function, or packages such as `r CRANpkg("lineprof")` or `r CRANpkg("aprof")` 2. Vectorize -- operate on vectors, rather than explicit loops ```{r vectorize} x <- 1:10 log(x) ## NOT for (i in seq_along(x)) x[i] <- log(x[i]) ``` 3. Pre-allocate memory, then fill in the result ```{r pre-allocate} result <- numeric(10) result[1] <- runif(1) for (i in 2:length(result)) result[i] <- runif(1) * result[i - 1] result ``` 4. Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once - Simple, e.g., 'hoist' constant multiplications from a `for` loop - Higher-level, e.g., use `lm.fit()` rather than repeatedly fitting the same design matrix. 5. Re-use existing, tested code - Efficient implementations of common operations -- `tabulate()`, `rowSums()` and friends, `%in%`, ... - Efficient domain-specific implementations, e.g., `r Biocpkg("snpStats")` for GWAS linear models; `r Biocpkg("limma")` for microarray linear models; `r Biocpkg("edgeR")`, `r Biocpkg("DESeq2")` for negative binomial GLMs relevant to RNASeq. - Reuse others' work -- `r Biocpkg("DESeq2")`, `r Biocpkg("GenomicRanges")`, `r Biocpkg("Biostrings")`, ..., `r CRANpkg("dplyr")`, `r CRANpkg("data.table")`, `r CRANpkg("Rcpp")` ### Case Study: Pre-allocate and vectorize Here's an obviously inefficient function: ```{r inefficient} f0 <- function(n, a=2) { ## stopifnot(is.integer(n) && (length(n) == 1) && ## !is.na(n) && (n > 0)) result <- numeric() for (i in seq_len(n)) result[[i]] <- a * log(i) result } ``` Use `system.time()` to investigate how this algorithm scales with `n`, focusing on elapsed time. ```{r system-time} system.time(f0(10000)) n <- 1000 * seq(1, 20, 2) t <- sapply(n, function(i) system.time(f0(i))[[3]]) plot(t ~ n, type="b") ``` Remember the current 'correct' value, and an approximate time ```{r correct-init} n <- 10000 system.time(expected <- f0(n)) head(expected) ``` Revise the function to hoist the common multiplier, `a`, out of the loop. Make sure the result of the 'optimization' and the original calculation are the same. Use the `r CRANpkg("microbenchmark")` package to compare the two versions ```{r hoist} f1 <- function(n, a=2) { result <- numeric() for (i in seq_len(n)) result[[i]] <- log(i) a * result } identical(expected, f1(n)) library(microbenchmark) microbenchmark(f0(n), f1(n), times=5) ``` Adopt a 'pre-allocate and fill' strategy ```{r preallocate-and-fill} f2 <- function(n, a=2) { result <- numeric(n) for (i in seq_len(n)) result[[i]] <- log(i) a * result } identical(expected, f2(n)) microbenchmark(f0(n), f2(n), times=5) ``` Use an `*apply()` function to avoid having to explicitly pre-allocate, and make opportunities for vectorization more apparent. ```{r use-apply} f3 <- function(n, a=2) a * sapply(seq_len(n), log) identical(expected, f3(n)) microbenchmark(f0(n), f2(n), f3(n), times=10) ``` Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it: ```{r use-vectorize} f4 <- function(n, a=2) a * log(seq_len(n)) identical(expected, f4(n)) microbenchmark(f0(n), f3(n), f4(n), times=10) ``` `f4()` definitely seems to be the winner. How does it scale with `n`? (Repeat several times) ```{r vectorized-scale, eval=FALSE} n <- 10 ^ (5:8) # 100x larger than f0 t <- sapply(n, function(i) system.time(f4(i))[[3]]) plot(t ~ n, log="xy", type="b") ``` Any explanations for the different pattern of response? Lessons learned: 1. Vectorizing offers a huge improvement over iteration 2. Pre-allocate-and-fill is very helpful when explicit iteration is required. 3. `*apply()` functions help avoid need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it 4. Hoisting common sub-expressions can be helpful for improving performance when explicit iteration is required. ## Parallel evaluation When data are too large to fit in memory, we can iterate through files in chunks or subset the data by fields or genomic positions. Iteration - Chunk-wise - `open()`, read chunk(s), `close()`. - e.g., `yieldSize` argument to `Rsamtools::BamFile()` - Framework: `GenomicFiles::reduceByYield()` Restriction - Limit to columns and / or rows of interest - Exploit domain-specific formats - BAM files and `Rsamtools::ScanBamParam()` - BAM files and `Rsamtools::PileupParam()` - VCF files and `VariantAnnotation::ScanVcfParam()` - Use a data base Parallel evalutation - _After_ writing efficient _R_ code, adopting iteration and restriction - [BiocParallel][] for common, cross-platform operation [BiocParallel][] provides a standardized interface for simple parallel evaluation. The package builds provides access to the `snow` and `multicore` functionality in the `parallel` package as well as `BatchJobs` for running cluster jobs. General ideas: - Use `bplapply()` instead of `lapply()` - Argument `BPPARAM` influences how parallel evaluation occurs - `MulticoreParam()`: threads on a single (non-Windows) machine - `SnowParam()`: processes on the same or different machines - `BatchJobsParam()`: resource scheduler on a cluster - `DoparParam()`: parallel execution with `foreach()` ### Exercise: Sleeping serially and in parallel This small example motivates the use of parallel execution and demonstrates how `bplapply()` can be a drop in for `lapply`. Use `system.time()` to explore how long this takes to execute as `n` increases from 1 to 10. Use `identical()` and `r CRANpkg("microbenchmark")` to compare alternatives `f0()` and `f1()` for both correctness and performance. `fun` sleeps for 1 second, then returns `i`. ```{r parallel-sleep} library(BiocParallel) fun <- function(i) { Sys.sleep(1) i } ## serial f0 <- function(n) lapply(seq_len(n), fun) ## parallel f1 <- function(n) bplapply(seq_len(n), fun) ``` [B.1 Introduction to _Bioconductor_]: ./B1_Bioconductor_Intro.html [Roadmap Epigenomics Project]: http://egg2.wustl.edu/roadmap/web_portal/ [metadata]: https://docs.google.com/spreadsheets/d/1yikGx4MsO9Ei36b64yOy9Vb6oPC5IBGlFbYEt-N6gOM/edit#gid=15 [biomart website]: http://biomart.org [Introduction to VariantAnnotation]: https://bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf [AnnotationDbi]: https://bioconductor.org/packages/AnnotationDbi [AnnotationHub]: https://bioconductor.org/packages/AnnotationHub [BSgenome.Hsapiens.UCSC.hg19]: https://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19 [BSgenome]: https://bioconductor.org/packages/BSgenome [Biostrings]: https://bioconductor.org/packages/Biostrings [GenomicAlignments]: https://bioconductor.org/packages/GenomicAlignments [GenomicFeatures]: https://bioconductor.org/packages/GenomicFeatures [GenomicRanges]: https://bioconductor.org/packages/GenomicRanges [KEGGREST]: https://bioconductor.org/packages/KEGGREST [PSICQUIC]: https://bioconductor.org/packages/PSICQUIC [RNAseqData.HNRNPC.bam.chr14]: https://bioconductor.org/packages/RNAseqData.HNRNPC.bam.chr14 [Rsamtools]: https://bioconductor.org/packages/Rsamtools [ShortRead]: https://bioconductor.org/packages/ShortRead [TxDb.Hsapiens.UCSC.hg19.knownGene]: https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene [VariantAnnotation]: https://bioconductor.org/packages/VariantAnnotation [airway]: https://bioconductor.org/packages/airway [biomaRt]: https://bioconductor.org/packages/biomaRt [GEOquery]: https://bioconductor.org/packages/GEOquery [ensemblVEP]: https://bioconductor.org/packages/ensemblVEP [org.Hs.eg.db]: https://bioconductor.org/packages/org.Hs.eg.db [Gviz]: https://bioconductor.org/packages/Gviz [epivizr]: https://bioconductor.org/packages/epivizr [ComplexHeatmap]: https://bioconductor.org/packages/ComplexHeatmap [ggbio]: https://bioconductor.org/packages/ggbio [OmicCircos]: https://bioconductor.org/packages/OmicCircos [BiocParallel]: https://bioconductor.org/packages/BiocParallel [rtracklayer]: https://bioconductor.org/packages/rtracklayer [AnnotationData]: https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData