---
title: "_R_ and _Bioconductor_ for Genomic Analysis"
author:
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, New York
  email: Martin.Morgan@RoswellPark.org
package: BiocIntro
output:
  BiocStyle::html_document:
    toc: true
vignette: |
  %\VignetteIndexEntry{R and Bioconductor for Genomic Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

Description: This workshop will introduce you to the _Bioconductor_
collection of R packages for statistical analysis and comprehension of
high-throughput genomic data. The emphasis is on data exploration,
using RNA-sequence gene expression experiments as a motivating
example. How can I access common sequence data formats from R? How can
I use information about gene models or gene annotations in my
analysis? How do the properties of my data influence the statistical
analyses I should perform? What common workflows can I perform with R
and _Bioconductor_? How do I deal with very large data sets in R?
These are the sorts of questions that will be tackled in this
workshop.

Requirements: You will need to bring your own laptop. The workshop
will use cloud-based resources, so your laptop will need a web browser
and WiFi capabilities. Participants should have used _R_ and _RStudio_ for
tasks such as those covered in introductory workshops earlier in the
week. Some knowledge of the biology of gene expression and of concepts
learned in a first course in statistics will be helpful.

Relevance: This workshop is relevant to anyone eager to explore
genomic data in _R_. The workshop will help connect ‘core’ _R_ concepts
for working with data (e.g., data management via `data.frame()`,
statistical modelling with `lm()` or `t.test()`, visualization using
`plot()` or `ggplot()`) to the special challenges of working with large
genomic data sets. It will be especially helpful to those who have or
will have their own genomic data, and are interested in more fully
understanding how to work with it in _R_.

```{r, echo=FALSE}
knitr::opts_chunk$set(
    cache = as.logical(Sys.getenv("KNITR_CACHE", TRUE)),
    collapse = TRUE
)
```

## Our goal

RNA-seq

- Designed experiment, e.g., 8 samples from four cell lines exposed to
  two treatments (based on Himes et al., PMID: [24926665][]; details
  in the [airway][] package vignette).

[24926665]: http://www.ncbi.nlm.nih.gov/pubmed/24926665
[airway]: https://bioconductor.org/packages/airway

```
              cell   dex
SRR1039508  N61311 untrt
SRR1039509  N61311   trt
SRR1039512 N052611 untrt
SRR1039513 N052611   trt
SRR1039516 N080611 untrt
SRR1039517 N080611   trt
SRR1039520 N061011 untrt
SRR1039521 N061011   trt
```

- Library preparation: mRNA to stable double-stranded DNA
- DNA sequencing of 'short' mRNA-derived fragments
- Alignment to a reference genome or transcriptome

```{r, echo=FALSE}
knitr::include_graphics("our_figures/RNASeq_workflow.jpeg")
```
source: http://bio.lundberg.gu.se/courses/vt13/rnaseq.html

- End result: a matrix of 'counts' -- reads aligning to genes across
  samples.

```
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003        679        448        873        408       1138
ENSG00000000005          0          0          0          0          0
ENSG00000000419        467        515        621        365        587
...                    ...        ...        ...        ...        ...
                SRR1039517 SRR1039520 SRR1039521
ENSG00000000003       1047        770        572
ENSG00000000005          0          0          0
ENSG00000000419        799        417        508
...                    ...        ...        ...
```

Research question

- Which gene counts are most different between dexamethasone `untrt`
  and `trt` experimental treatments?
- We'll try to understand _how_ to accomplish this, without going into
  statistical details.

Our goal

- Visualize 20 most differentially expressed genes as a heatmap.

```{r, echo=FALSE}
knitr::include_graphics("our_figures/top20_heatmap.png")
```

# Data gathering, input, representation, and cleaning

## Base _R_ data structures

Sample information

- Simple 'tab-separated value' text file, e.g., from Excel export.
- Input using base _R_ command `read.table()`
- 'Atomic' vectors, e.g., `integer()`
- `factor()` and `NA`
- `data.frame()`: coordinated management
  - Column access with `$`
  - Subset with `[ , ]`

```{r samples}
samples_file <-
    system.file(package="BiocIntro", "extdata", "samples.tsv")
samples <- read.table(samples_file)
samples

samples$dex <- relevel(samples$dex, "untrt")
```

Counts

- Another tsv file. Many rows, so use `head()` to view the first few.
- Row names: gene identifiers. Column names: sample identifiers.
- All columns are the same (numeric) type; represent as a `matrix()`
  rather than `data.frame()`

```{r counts}
counts_file <-
    system.file(package="BiocIntro", "extdata", "counts.tsv")
counts <- read.table(counts_file)
dim(counts)

head(counts)

counts <- as.matrix(counts)
```

## Genomic ranges (`GRanges`)

Row annotations.

- 'GTF' files contain information about gene models.
- The GTF file relevant to this experiment -- same organism (_Homo
  sapiens_), genome (GRCh37) and gene model annotations (Ensembl
  release-75) as used in the alignment and counting step -- is

```{r}
url <- "ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz"
```

- Use `BiocFileCache` to download the resource once to a location
  that persists across _R_ sessions.

```{r, message = FALSE}
library(BiocFileCache)
```

- About _Bioconductor_ packages

    - `BiocFileCache` is available from https://bioconductor.org
    - Discover packages at https://bioconductor.org/packages
    - Learn about `BiocFileCache` from the [BiocFileCache][] 'landing
      page'.
    - Explore the `BiocFileCache` package [vignette][] (access the
      vignette from within _R_: `browseVignettes("BiocFileCache")`).
    - Install the package with `BiocManager::install("BiocFileCache")`.
    - Find help on functions with, e.g., `?bfcrpath`.
    - Ask questions at https://support.bioconductor.org

[BiocFileCache]: https://bioconductor.org/packages/BiocFileCache
[vignette]: https://bioconductor.org/packages/release/bioc/vignettes/BiocFileCache/inst/doc/BiocFileCache.html

```{r bfcrpath}
gtf_file <- bfcrpath(rnames = url)
```

- GTF files are plain text files and _could_ be read using
  `read.table()` or similar, but contain structured information that
  we want to represent in _R_.
- Common sequence data formats

    - BED, GTF, bigWig: [rtracklayer][]
    - FASTA (DNA sequence): [Biostrings][]
    - FASTQ (short reads & quality scores): [ShortRead][]
    - BAM (aligned reads): [Rsamtools][], [GenomicAlignments][]
    - VCF (called variants): [VariantAnnotation][],
      [VariantFiltering][]. MAF: [maftools][]

- Use the [rtracklayer][] package to import the file.

[rtracklayer]: https://bioconductor.org/packages/rtracklayer

```{r import.gtf, message = FALSE}
library(rtracklayer)
gtf <- import(gtf_file)
```

- A `GRanges` object

    - Range-specific information
    - Annotations on each range
    - Use functions to access core elements: `seqnames()` (e.g.,
      chromosome), `start()` / `end()` / `width()`, `strand()`, etc.
    - Use `$` or `mcols()$` to access annotations on ranges.
    - _Bioconductor_ conventions: 1-based, closed intervals (like
      Ensembl) rather than 0-based, 1/2 open intervals (like UCSC).

[Biostrings]: https://bioconductor.org/packages/Biostrings
[ShortRead]: https://bioconductor.org/packages/ShortRead
[Rsamtools]: https://bioconductor.org/packages/Rsamtools
[GenomicAlignments]: https://bioconductor.org/packages/GenomicAlignments
[VariantAnnotation]: https://bioconductor.org/packages/VariantAnnotation
[VariantFiltering]: https://bioconductor.org/packages/VariantFiltering
[maftools]: https://bioconductor.org/packages/maftools

- Filter the information to gene-level annotations, keeping only some
  of the information about each genomic range. Use the `gene_id`
  column as `names()`.

```{r}
rowidx <- gtf$type == "gene"
colidx <- c("gene_id", "gene_name", "gene_biotype")
genes <- gtf[rowidx, colidx]
names(genes) <- genes$gene_id
genes$gene_id <- NULL

genes
```

## Coordinated management (`SummarizedExperiment`)

Three different data sets

- `counts`: results of the RNAseq workflow
- `samples`: sample and experimental design information
- `genes`: information about the genes that we've assayed.

Coordinate our manipulation

- Avoid 'bookkeeping' errors when, e.g., we subset one part of the
  data in a way different from another.
- Use the [SummarizedExperiment][] package and data representation.

    - Two-dimensional structure, so subset with `[,]`
    - Use functions to access components: `assay()`, `rowData()`,
      `rowRanges()`, `colData()`, etc.

```{r, echo=FALSE}
knitr::include_graphics("our_figures/SummarizedExperiment.svg")
```

[SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment

```{r, message = FALSE}
library(SummarizedExperiment)
```

- Make sure the order of the `samples` rows match the order of the
  samples in the columns of the `counts` matrix, and the order of the
  `genes` rows match the order of the rows of the `counts` matrix.
- Create a `SummarizedExperiment` to coordinate our data manipulation.


```{r SummarizedExperiment}
samples <- samples[colnames(counts),]
genes <- genes[rownames(counts),]
se <- SummarizedExperiment(
  assays = list(counts = counts),
  rowRanges = genes, colData = samples
)

se
```

# Analysis & visualization

## Differential expression analysis

Gestalt

- Perform a `t.test()` for each row of the count matrix, asking
  whether the `trt` samples have on average counts that differ from
  the `untrt` samples.
- _Many_ nuanced statistical issues

The [DESeq2][] pacakge

- Implements efficient, 'correct', robust algorithms for performing
  RNA-seq differential expression analysis of moderate-sized
  experiments.

[DESeq2]: https://bioconductor.org/packages/DESeq2

```{r, message = FALSE}
library(DESeq2)
```

- Specify our experimental design, perform the analysis taking account
  of the nuanced statistical issues, and get a summary of the
  results. The details of this step are beyond the scope of this
  workshop.

```{r DESeq2}
dds <- DESeqDataSet(se, ~ cell + dex)
fit <- DESeq(dds)
destats <- results(fit)

destats
```

- Add the results to our `SummarizedExperiment`, so that we can
  manipulate these in a coordianted fashion too.

```{r}
rowData(se) <- cbind(rowData(se), destats)
```

## Heatmap

- Use `order()` and `head()` to identify the row indexes of the top 20
  most differentially expressed (based on adjusted P-value) genes.
- Subset the our `SummarizedExperiment` to contain just these rows.
- Dispaly the `assay()` of our subset as a heatmap

```{r heatmap-1}
top20idx <- head( order(rowData(se)$padj), 20)
top20 <- se[top20idx,]
heatmap(assay(top20))
```

Update row labels and adding information about treatment group.

- Extract the top 20 matrix.
- Update the row names of the matrix with the corresponding gene names.
- Create a vector of colors, one for each sample, with the color
  determined by the dexamethasone treatment.

```{r heatmap-2}
m <- assay(top20)
rownames(m) <- rowData(top20)$gene_name
trtcolor <- hcl.colors(2, "Pastel 1")[ colData(top20)$dex ]
heatmap(m, ColSideColors = trtcolor)
```

## Volcano plot

- Plots 'statistical significance' on the Y-axis, 'biological
  significance' on the X-axis.
- Use `plot()` to create the points
- Use `text()` to add labels to the two most significant genes.

```{r volcanoplot}
plot(-log10(pvalue) ~ log2FoldChange, rowData(se))
label <- with(
    rowData(se),
    ifelse(-log10(pvalue) > 130, gene_name, "")
)
text(
    -log10(pvalue) ~ log2FoldChange, rowData(se),
    label = label, pos = 4, srt=-15
)
```

## Top table and tidy data

Goal

- Provide a concise summary of the 20 most differentially expressed
  genes.

[dplyr][] and 'tidy' data

- A convenient way to display and manipulate strictly tabular data.
    - 'long form' tables where each row represents an observation and
      each column an attribute measured on the observations.

- `tibble`: a `data.frame` with better display properties
- `%>%`, e.g., `mtcars %>% count(cyl)`: a pipe that takes a tibble (or
  data.frame) `tbl` on the left and uses it as an argument to a small
  number of functions like `count()` on the right.
- 'Tidy' functions usually return a `tibble()`, and hence can be
  chained together.

[dplyr]: https://cran.r-project.org/package=dplyr

```{r dplyr, message = FALSE}
library(dplyr)
```

- Steps below:

    - `as_tibble()`: create a tibble from `rowData(se)`
    - `select()`: select specific columns.
    - `arrange()`: arrange all rows from smallest to largest `padj`
    - `head()`: filter to the first 20 rows

```{r toptable}
rowData(se) %>%
    as_tibble(rownames = "ensembl_id") %>%
    select(ensembl_id, gene_name, baseMean, log2FoldChange, padj) %>%
    arrange(padj) %>%
    head(n = 20)
```

- Check out the [plyranges][] workshop!

[plyranges]: https://bioconductor.org/packages/plyranges

# Summary

## What we've learned

Packages

- Discover at https://bioconductor.org/packages
- Install with `BiocManager::install("BiocFileCache")`
- Use with `library(BiocFileCache)`.
- Get help with `?bfcrpath`
- Get more help at https://support.bioconductor.org
- Mature packages provide access to common sequence analysis data
  formats, e.g., BED, GTF, FASTQ, BAM, VCF.

Data structures

- Represent and coordinate 'complicated' data.
- Already prevalent in base _R_, e.g., `data.frame()`, `matrix()`.
- `GRanges` for representing genomic ranges

    - 'Accessor' functions `seqnames()`, `start()`, etc for core
      components
    - `$` or `mcols()$` for annotations

- `SummarizedExperiment` for coordinated manipulation of assay data
  with row and column annotation.

    - `[,]` to subset assay and annotaions in a coordinated fashion.
    - `assay()`, `rowRanges()`, `rowData()`, `colData()` to access
      components.

Analysis

- Mature packages like [DESeq2][] provide excellent vignettes,
  well-defined steps in analysis, integration with other workflow
  steps, and very robust support.
- Emerging areas are often represented by several packages
  implementing less complete or certain steps in analysis.

## Next steps

Single-cell RNA-seq: an amazing resource: [Orchestarting Single Cell
Analysis][osca] with _Bioconductor_, including the [scran][] and
[scater][] packages.

- Quality control
- Normalization
- Feature selection
- Dimensionality reduction
- Clustering
- Marker gene detection
- Cell type annotation ([SingleR][]) (this package has a great
  vignette!)
- Trajectory analysis ([destiny][])
- Gene set enrichment
- Etc.!

[osca]: https://osca.bioconductor.org
[scran]: https://bioconductor.org/packages/scran
[scater]: https://bioconductor.org/packages/scater
[SingleR]: https://bioconductor.org/packages/SingleR
[destiny]: https://bioconductor.org/packages/destiny

Other prominent domains of analysis (check out [biocViews][])

- Microarrays -- epigenomic, classical expression, copy number
- Annotated variants
- Gene set enrichment
- Flow cytometry
- Proteomics

[biocViews]: https://bioconductor.org/packages

Participate!

- Get help on the [support site][].
- [Join][join-slack] and [use][use-slack] the slack community.
- Participate in other conferences (e.g., [BioC2020][] in Boston, July
  29 - 31, 2020).

[support site]: https://support.bioconductor.org
[join-slack]: https://bioc-community.herokuapp.com/
[use-slack]: https://community-bioc.slack.com
[BioC2020]: https://bioc2020.bioconductor.org/

# Acknowledgements

Research reported in this presentation was supported by the NCI and
NHGRI of the National Institutes of Health under award numbers
U24CA232979, U41HG004059, and U24CA180996.  The content is solely the
responsibility of the authors and does not necessarily represent the
official views of the National Institutes of Health.

A portion of this work is supported by the Chan Zuckerberg Initiative
DAF, an advised fund of Silicon Valley Community Foundation.

```{r sessionInfo, echo=FALSE}
sessionInfo()
```