```{r setup, echo=FALSE}
library(LearnBioconductor)
stopifnot(BiocInstaller::biocVersion() == "3.0")
```
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
knitr::opts_chunk$set(tidy=FALSE)
```
# Introduction to Bioconductor
Martin Morgan
October 29, 2014
## Bioconductor
Analysis and comprehension of high-throughput genomic data
- Statistical analysis: large data, technological artifacts, designed
experiments; rigorous
- Comprehension: biological context, visualization, reproducibility
- High-throughput
- Sequencing: RNASeq, ChIPSeq, variants, copy number, ...
- Microarrays: expression, SNP, ...
- Flow cytometry, proteomics, images, ...
Packages, vignettes, work flows
- 934 packages
- Discover and navigate via [biocViews][]
- Package 'landing page'
- Title, author / maintainer, short description, citation,
installation instructions, ..., download statistics
- All user-visible functions have help pages, most with runnable
examples
- 'Vignettes' an important feature in Bioconductor -- narrative
documents illustrating how to use the package, with integrated code
- 'Release' (every six months) and 'devel' branches
- [Support site](https://support.bioconductor.org);
[videos](https://www.youtube.com/user/bioconductor), [recent
courses](http://bioconductor.org/help/course-materials/)
Objects
- Represent complicated data types
- Foster interoperability
- S4 object system
- Introspection: `getClass()`, `showMethods(..., where=search())`,
`selectMethod()`
- 'accessors' and other documented functions / methods for
manipulation, rather than direct access to the object structure
- Interactive help
- `method?"substr,"` to select help on methods, `class?D`
for help on classes
Example
```{r Biostrings, message=FALSE}
suppressPackageStartupMessages({
library(Biostrings)
})
data(phiX174Phage) # sample data, see ?phiX174Phage
phiX174Phage
m <- consensusMatrix(phiX174Phage)[1:4,] # nucl. x position counts
polymorphic <- which(colSums(m != 0) > 1)
m[, polymorphic]
```
```{r showMethods, eval=FALSE}
showMethods(class=class(phiX174Phage), where=search())
```
## Core concepts
### Genomic ranges
Genomic range
- chromosome (`seqnames`), start, end, and optionally strand
- Coordinates
- 1-based
- Closed -- start and end coordinates _included_ in range
- Left-most -- start is always to the left of end, regardless of
strand
Why genomic ranges?
- 'Annotation'
- Many genome annotations are range-based
- Simple ranges: exons, promoters, transcription factor binding
sites, CpG islands, ...
- Lists of ranges: gene models (exons-within-transcripts)
- 'Data'
- Reads themselves, or derived data
- Simple ranges: ChIP-seq peaks, SNPs, ungapped reads, ...
- List of ranges: gapped alignments, paired-end reads, ...
Data objects
- `r Biocpkg("GenomicRanges")`::_GRanges_
- `seqnames()`
- `start()`, `end()`, `width()`
- `strand()`
- `mcols()`: 'metadata' associated with each range, stored as a
`DataFrame`
- Many very useful operations defined on ranges (later)
- `r Biocpkg("GenomicRanges")`::_GRangesList_
- List-like (e.g., `length()`, `names()`, `[`, `[[`)
- Each list element a _GRanges_
- Metadata at list and element-list levels
- Very easy (fast) to `unlist()` and `relist()`.
- `r Biocpkg("GenomicAlignments")`::_GAlignments_, _GAlignmentsList_,
_GAlignemntPairs_; `r Biocpkg("VariantAnnotation")`::_VCF_, _VRanges_
- _GRanges_-like objects with more specialized roles
Example: _GRanges_
```{r eg-GRanges}
## 'Annotation' package; more later...
suppressPackageStartupMessages({
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
})
promoters <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene)
## 'GRanges' with 2 metadata columns
promoters
head(table(seqnames(promoters)))
table(strand(promoters))
seqinfo(promoters)
## vector-like access
promoters[ seqnames(promoters) %in% c("chr1", "chr2") ]
## metadata
mcols(promoters)
length(unique(promoters$tx_name))
```
```{r eg-GRangesList}
## exons, grouped by transcript
exByTx <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx", use.names=TRUE)
## list-like subsetting
exByTx[1:10] # also logical, character, ...
exByTx[["uc001aaa.3"]] # also numeric
## accessors return typed-List, e.g., IntegerList
width(exByTx)
log10(width(exByTx))
## 'easy' to ask basic questions, e.g., ...
hist(unlist(log10(width(exByTx)))) # widths of exons
exByTx[which.max(max(width(exByTx)))] # transcript with largest exon
exByTx[which.max(elementLengths(exByTx))] # transcript with most exons
```
There are many neat range-based operations (more later)!
![Range Operations](our_figures/RangeOperations.png)
Some detail
- _GRanges_ and friends use data structures defined in `r Biocpkg("S4Vectors")`,
`r Biocpkg("IRanges")`
- These data structures can handle relatively large data easily, e.g.,
1-10 million ranges
- Basic concepts are built on _R_'s vector and list; _List_ instances
are implemented to be efficient when there are long lists of a few
elements each.
- Takes a little getting used to, but very powerful
### Integrated containers
What is an experiment?
- 'Assays'
- Regions-of-interest x samples
- E.g., read counts, expression values
- Regions-of-interest
- Microarrays: probeset or gene identifiers
- Sequencing: genomic ranges
- Samples
- Experimental inforamtion, covariates
- Overall experimental description
Why integrate?
- Avoid errors when manipulating data
- Case study: [reproducible research]()
Data objects
- `r Biocpkg("Biobase")`::_ExpressionSet_
- Assays (`exprs()`): matrix of expression values
- Regions-of-interest (`featureData(); fData()`): probeset or gene
identifiers
- Samples (`phenoData(); pData()`: `data.frame` of relevant
information
- Experiment data (`exptData()`): Instance of class `MIAME`.
- `r Biocpkg("GenomicRanges")`::_SummarizedExperiment_
- Assays (`assay(), assays()`): arbitrary matrix-like object
- Regions-of-interest (`rowData()`): `GRanges` or `GRangesList`;
use `GRangesList` with names and 0-length elements to represent
assays without ranges.
- Samples (`colData()`): `DataFrame` of relevant information.
- Experiment data (`exptData()`): `List` of arbitrary information.
![SummarizedExperiment](our_figures/SummarizedExperiment.png)
Example: `ExpressionSet` (see vignettes in `r Biocpkg("Biobase")`).
```{r eg-ExpressionSet}
suppressPackageStartupMessages({
library(ALL)
})
data(ALL)
ALL
## 'Phenotype' (sample) and 'feature' data
head(pData(ALL))
head(featureNames(ALL))
## access to pData columns; matrix-like subsetting; exprs()
ALL[, ALL$sex %in% "M"]
range(exprs(ALL))
## 30% 'most variable' features (c.f., genefilter::varFilter)
iqr <- apply(exprs(ALL), 1, IQR)
ALL[iqr > quantile(iqr, 0.7), ]
```
Example: `SummarizedExperiment` (see vignettes in `r Biocpkg("GenomicRanges")`).
```{r eg-SummarizedExperiment}
suppressPackageStartupMessages({
library(airway)
})
data(airway)
airway
## column and row data
colData(airway)
rowData(airway)
## access colData; matrix-like subsetting; assay() / assays()
airway[, airway$dex %in% "trt"]
head(assay(airway))
assays(airway)
## library size
colSums(assay(airway))
hist(rowMeans(log10(assay(airway))))
```
## Lab
### GC content
1. Calculate the GC content of human chr1 in the hg19 build,
excluding regions where the sequence is "N". You will need to
1. Load the `r Biocannopkg("BSgenome.Hsapiens.UCSC.hg19")`
2. Extract, using `[[`, chromosome 1 ("chr1").
3. Use `alphabetFrequency()` to calculate the count or frequency
of the nucleotides in chr1
4. Use standard _R_ functions to calculate the GC content.
```{r gc-reference}
library(BSgenome.Hsapiens.UCSC.hg19)
chr1seq <- BSgenome.Hsapiens.UCSC.hg19[["chr1"]]
chr1alf <- alphabetFrequency(chr1seq)
chr1gc <- sum(chr1alf[c("G", "C")]) / sum(chr1alf[c("A", "C", "G", "T")])
```
2. Calculate the GC content of 'exome' (approximately, all genic
regions) on chr1. You will need to
1. Load the `r Biocannopkg("TxDb.Hsapiens.UCSC.hg19.knownGene")`
package.
2. Use `genes()` to extract genic regions of all genes, then
subsetting operations to restrict to chromosome 1.
3. Use `getSeq,BSgenome-method` to extract sequences from
chromosome 1 of the BSgenome object.
4. Use `alphabetFrequency()` (with the argument `collapse=TRUE` --
why?) and standard _R_ operations to extract the gc content of
the genes.
```{r gc-exons-1}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
genes1 <- genes[seqnames(genes) %in% "chr1"]
seq1 <- getSeq(BSgenome.Hsapiens.UCSC.hg19, genes1)
alf1 <- alphabetFrequency(seq1, collapse=TRUE)
gc1 <- sum(alf1[c("G", "C")]) / sum(alf1[c("A", "C", "G", "T")])
```
How does the GC content just calculated compare to the average of
the GC content of each exon? Answer this using
`alphabetFrequency()` but with `collapse=FALSE)`, and adjust the
calculation of GC content to act on a matrix, rather than
vector. Why are these numbers different?
```{r gc-exons-2}
alf2 <- alphabetFrequency(seq1, collapse=FALSE)
gc2 <- rowSums(alf2[, c("G", "C")]) / rowSums(alf2[,c("A", "C", "G", "T")])
```
3. Plot a histogram of per-gene GC content, annotating with
information about chromosome and exome GC content. Use base
graphics `hist()`, `abline()`, `plot(density(...))`,
`plot(ecdf(...))`, etc. (one example is below). If this is too
easy, prepare a short presentation for the class illustrating how
to visualize this type of information using another _R_ graphics
package, e.g., `r CRANpkg("ggplot2")`, `{r CRANpkg("ggvis")`, or
`{r CRANpkg("lattice")}.
```{r gc-denisty}
plot(density(gc2))
abline(v=c(chr1gc, gc1), col=c("red", "blue"), lwd=2)
```
### Integrated containers
This exercise illustrates how integrated containers can be used to
effectively manage data; it does _NOT_ represent a suitable way to
analyze RNASeq differential expression data.
1. Load the `r Biocpkg("airway")` package and `airway` data
set. Explore it a litte, e.g., determining its dimensions (number
of regions of interest and samples), the information describing
samples, and the range of values in the `count` assay. The data are
from an RNA-seq experiment. The `colData()` describe treatment
groups and other information. The `assay()` is the (raw) number of
short reads overlapping each region of interest, in each
sample. The solution to this exercise is summarized above.
2. Create a subset of the data set that contains only the 30% most
variable (using IQR as a metric) observations. Plot the
distribution of asinh-transformed (a log-like transformation,
except near 0) row mean counts
```{r airway-plot}
iqr <- apply(assay(airway), 1, IQR)
airway1 <- airway[iqr > quantile(iqr, 0.7),]
plot(density(rowMeans(asinh(assay(airway1)))))
```
3. Use the `r Biocpkg("genefilter")` package `rowttests` function
(consult it's help page!) to compare asinh-transformed read counts
between the two `dex` treatment groups for each row. Explore the
result in various ways, e.g., finding the 'most' differentially
expressed genes, the genes with largest (absolute) difference
between treatment groups, adding adjusted _P_ values (via
`p.adjust()`, in the _stats_ package), etc. Can you obtain the
read counts for each treatment group, for the most differentially
expressed gene?
```{r airway-rowttest}
suppressPackageStartupMessages({
library(genefilter)
})
ttest <- rowttests(asinh(assay(airway1)), airway1$dex)
ttest$p.adj <- p.adjust(ttest$p.value, method="BH")
ttest[head(order(ttest$p.adj)),]
split(assay(airway1)[order(ttest$p.adj)[1], ], airway1$dex)
```
4. Add the statistics of differential expression to the `airway1`
_SummarizedExperiment_. Confirm that the statistics have been
added.
```{r airway-merge}
mcols(rowData(airway1)) <- ttest
head(mcols(airway1))
```
# Resources
- [Web site][Bioconductor] -- install, learn, use, develop _R_ /
_Bioconductor_ packages
- [Support](http://support.bioconductor.org) -- seek help and
guidance; also
- [biocViews](http://bioconductor.org/packages/release/BiocViews.html)
-- discover packages
- Package landing pages, e.g.,
[GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html),
including title, description, authors, installation instructions,
vignettes (e.g., GenomicRanges '[How
To](http://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)'),
etc.
- [Course](http://bioconductor.org/help/course-materials/) and other
[help](http://bioconductor.org/help/) material (e.g., videos, EdX
course, community blogs, ...)
Publications (General _Bioconductor_)
- Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, et al. (2013)
Software for Computing and Annotating Genomic Ranges. PLoS Comput
Biol 9(8): e1003118. doi:
[10.1371/journal.pcbi.1003118][GRanges.bib]
Other
- Lawrence, M. 2014. Software for Enabling Genomic Data
Analysis. Bioc2014 conference [slides][Lawrence.bioc2014.bib].
[R]: http://r-project.org
[Bioconductor]: http://bioconductor.org
[GRanges.bib]: http://dx.doi.org/10.1371/journal.pcbi.1003118
[Scalable.bib]: http://arxiv.org/abs/1409.2864
[Lawrence.bioc2014.bib]:
http://bioconductor.org/help/course-materials/2014/BioC2014/Lawrence_Talk.pdf
[AnnotationData]: http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData
[AnnotationDbi]: http://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html
[AnnotationHub]: http://bioconductor.org/packages/release/bioc/html/AnnotationHub.html
[BSgenome.Hsapiens.UCSC.hg19]: http://bioconductor.org/packages/release/data/annotation/html/BSgenome.Hsapiens.UCSC.hg19.html
[BSgenome]: http://bioconductor.org/packages/release/bioc/html/BSgenome.html
[BiocParallel]: http://bioconductor.org/packages/release/bioc/html/BiocParallel.html
[Biostrings]: http://bioconductor.org/packages/release/bioc/html/Biostrings.html
[Bsgenome.Hsapiens.UCSC.hg19]: http://bioconductor.org/packages/release/data/annotation/html/Bsgenome.Hsapiens.UCSC.hg19.html
[CNTools]: http://bioconductor.org/packages/release/bioc/html/CNTools.html
[ChIPQC]: http://bioconductor.org/packages/release/bioc/html/ChIPQC.html
[ChIPpeakAnno]: http://bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html
[DESeq2]: http://bioconductor.org/packages/release/bioc/html/DESeq2.html
[DiffBind]: http://bioconductor.org/packages/release/bioc/html/DiffBind.html
[GenomicAlignments]: http://bioconductor.org/packages/release/bioc/html/GenomicAlignments.html
[GenomicFiles]: http://bioconductor.org/packages/release/bioc/html/GenomicFiles.html
[GenomicRanges]: http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html
[Homo.sapiens]: http://bioconductor.org/packages/release/data/annotation/html/Homo.sapiens.html
[IRanges]: http://bioconductor.org/packages/release/bioc/html/IRanges.html
[KEGGREST]: http://bioconductor.org/packages/release/bioc/html/KEGGREST.html
[PSICQUIC]: http://bioconductor.org/packages/release/bioc/html/PSICQUIC.html
[Rsamtools]: http://bioconductor.org/packages/release/bioc/html/Rsamtools.html
[Rsubread]: http://bioconductor.org/packages/release/bioc/html/Rsubread.html
[ShortRead]: http://bioconductor.org/packages/release/bioc/html/ShortRead.html
[SomaticSignatures]: http://bioconductor.org/packages/release/bioc/html/SomaticSignatures.html
[TxDb.Hsapiens.UCSC.hg19.knownGene]: http://bioconductor.org/packages/release/data/annotation/html/TxDb.Hsapiens.UCSC.hg19.knownGene.html
[VariantAnnotation]: http://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html
[VariantFiltering]: http://bioconductor.org/packages/release/bioc/html/VariantFiltering.html
[VariantTools]: http://bioconductor.org/packages/release/bioc/html/VariantTools.html
[biocViews]: http://bioconductor.org/packages/release/BiocViews.html#___Software
[biomaRt]: http://bioconductor.org/packages/release/bioc/html/biomaRt.html
[cn.mops]: http://bioconductor.org/packages/release/bioc/html/cn.mops.html
[edgeR]: http://bioconductor.org/packages/release/bioc/html/edgeR.html
[ensemblVEP]: http://bioconductor.org/packages/release/bioc/html/ensemblVEP.html
[h5vc]: http://bioconductor.org/packages/release/bioc/html/h5vc.html
[limma]: http://bioconductor.org/packages/release/bioc/html/limma.html
[metagenomeSeq]: http://bioconductor.org/packages/release/bioc/html/metagenomeSeq.html
[org.Hs.eg.db]: http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html
[org.Sc.sgd.db]: http://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html
[phyloseq]: http://bioconductor.org/packages/release/bioc/html/phyloseq.html
[rtracklayer]: http://bioconductor.org/packages/release/bioc/html/rtracklayer.html
[snpStats]: http://bioconductor.org/packages/release/bioc/html/snpStats.html
[Gviz]: http://bioconductor.org/packages/release/bioc/html/Gviz.html
[epivizr]: http://bioconductor.org/packages/release/bioc/html/epivizr.html
[ggbio]: http://bioconductor.org/packages/release/bioc/html/ggbio.html
[OmicCircos]: http://bioconductor.org/packages/release/bioc/html/OmicCircos.html