--- title: "Workshop: W2 - Sequences, Alignments, and Large Data" output: BiocStyle::html_document: toc: true vignette: > % \VignetteIndexEntry{Workshop: W3 - Sequences, Alignments, and Large Data} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({ library(Biostrings) library(GenomicRanges) library(GenomicFiles) library(BiocParallel) library(TxDb.Hsapiens.UCSC.hg19.knownGene) }) ``` Author: Martin Morgan (mtmorgan@fredhutch.org)
Date: 7 September, 2015
Back to [Workshop Outline](Developer-Meeting-Workshop.html)
The material in this document requires _R_ version 3.2 and _Bioconductor_ version 3.1 ```{r configure-test} stopifnot( getRversion() >= '3.2' && getRversion() < '3.3', BiocInstaller::biocVersion() >= "3.1" ) ``` # _Bioconductor_ 'infrastructure' for sequence analysis ## Classes, methods, and packages This section focuses on classes, methods, and packages, with the goal being to learn to navigate the help system and interactive discovery facilities. ## Motivation Sequence analysis is specialized - Large data needs to be processed in a memory- and time-efficient manner - Specific algorithms have been developed for the unique characteristics of sequence data Additional considerations - Re-use of existing, tested code is easier to do and less error-prone than re-inventing the wheel. - Interoperability between packages is easier when the packages share similar data structures. Solution: use well-defined _classes_ to represent complex data; _methods_ operate on the classes to perform useful functions. Classes and methods are placed together and distributed as _packages_ so that we can all benefit from the hard work and tested code of others. # Core packages
                   VariantAnnotation
                           |
                           v
                    GenomicFeatures
                           |
                           v
                       BSgenome
                           |
                           v
                      rtracklayer
                           |
                           v
                    GenomicAlignments
                      |           |
                      v           v
     SummarizedExperiment   Rsamtools  ShortRead
                  |         |      |      |
                  v         v      v      v
                GenomicRanges     Biostrings
                        |          |
                        v          v
               GenomeInfoDb   (XVector)
                        |     |
                        v     v
                        IRanges
                           |
                           v 
                      (S4Vectors)
# Core classes ## Case study: _IRanges_ and _GRanges_ The [IRanges][] package defines an important class for specifying integer ranges, e.g., ```{r iranges} library(IRanges) ir <- IRanges(start=c(10, 20, 30), width=5) ir ``` There are many interesting operations to be performed on ranges, e.g, `flank()` identifies adjacent ranges ```{r iranges-flank} flank(ir, 3) ``` The `IRanges` class is part of a class hierarchy. To see this, ask R for the class of `ir`, and for the class definition of the `IRanges` class ```{r iranges-class} class(ir) getClass(class(ir)) ``` Notice that `IRanges` extends the `Ranges` class. Now try entering `?flank` (`?"flank,"` if not using _RStudio, where `` means to press the tab key to ask for tab completion). You can see that there are help pages for `flank` operating on several different classes. Select the completion ```{r iranges-flank-method, eval=FALSE} ?"flank,Ranges-method" ``` and verify that you're at the page that describes the method relevant to an `IRanges` instance. Explore other range-based operations. The [GenomicRanges][] package extends the notion of ranges to include features relevant to application of ranges in sequence analysis, particularly the ability to associate a range with a sequence name (e.g., chromosome) and a strand. Create a `GRanges` instance based on our `IRanges` instance, as follows ```{r granges} library(GenomicRanges) gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+")) gr ``` The notion of flanking sequence has a more nuanced meaning in biology. In particular we might expect that flanking sequence on the `+` strand would precede the range, but on the minus strand would follow it. Verify that `flank` applied to a `GRanges` object has this behavior. ```{r granges-flank} flank(gr, 3) ``` Discover what classes `GRanges` extends, find the help page documenting the behavior of `flank` when applied to a `GRanges` object, and verify that the help page documents the behavior we just observed. ```{r granges-class} class(gr) getClass(class(gr)) ``` ```{r granges-flank-method, eval=FALSE} ?"flank,GenomicRanges-method" ``` Notice that the available `flank()` methods have been augmented by the methods defined in the _GenomicRanges_ package. It seems like there might be a number of helpful methods available for working with genomic ranges; we can discover some of these from the command line, indicating that the methods should be on the current `search()` path ```{r granges-methods} methods(class="GRanges") ``` Use `help()` to list the help pages in the `GenomicRanges` package, and `vignettes()` to view and access available vignettes; these are also available in the Rstudio 'Help' tab. ```{r granges-man-and-vignettes, eval=FALSE} help(package="GenomicRanges") vignette(package="GenomicRanges") vignette(package="GenomicRanges", "GenomicRangesHOWTOs") ``` ## _GenomicRanges_ ### The `GRanges` and `GRangesList` classes Aside: 'TxDb' packages provide an R representation of gene models ```{r txdb} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ``` `exons()`: _GRanges_ ```{r txdb-exons} exons(txdb) ``` ![Alt Genomic Ranges](our_figures/GRanges.png) `exonsBy()`: _GRangesList_ ```{r txdb-exonsby} exonsBy(txdb, "tx") ``` ![Alt Genomic Ranges List](our_figures/GRangesList.png) _GRanges_ / _GRangesList_ are incredibly useful - Represent **annotations** -- genes, variants, regulatory elements, copy number regions, ... - Represent **data** -- aligned reads, ChIP peaks, called variants, ... ### Algebra of genomic ranges Many biologically interesting questions represent operations on ranges - Count overlaps between aligned reads and known genes -- `GenomicRanges::summarizeOverlaps()` - Genes nearest to regulatory regions -- `GenomicRanges::nearest()`, [ChIPseeker][] - Called variants relevant to clinical phenotypes -- [VariantFiltering][] _GRanges_ Algebra - Intra-range methods - Independent of other ranges in the same object - GRanges variants strand-aware - `shift()`, `narrow()`, `flank()`, `promoters()`, `resize()`, `restrict()`, `trim()` - See `?"intra-range-methods"` - Inter-range methods - Depends on other ranges in the same object - `range()`, `reduce()`, `gaps()`, `disjoin()` - `coverage()` (!) - see `?"inter-range-methods"` - Between-range methods - Functions of two (or more) range objects - `findOverlaps()`, `countOverlaps()`, ..., `%over%`, `%within%`, `%outside%`; `union()`, `intersect()`, `setdiff()`, `punion()`, `pintersect()`, `psetdiff()` ![Alt Ranges Algebra](our_figures/RangeOperations.png) ## _Biostrings_ (DNA or amino acid sequences) Classes - XString, XStringSet, e.g., DNAString (genomes), DNAStringSet (reads) Methods -- - [Cheat sheat](http://bioconductor.org/packages/release/bioc/vignettes/Biostrings/inst/doc/BiostringsQuickOverview.pdf) - Manipulation, e.g., `reverseComplement()` - Summary, e.g., `letterFrequency()` - Matching, e.g., `matchPDict()`, `matchPWM()` Related packages - [BSgenome][] - Whole-genome representations - Model and custom - [ShortRead][] - FASTQ files Example - Whole-genome sequences are distrubuted by ENSEMBL, NCBI, and others as FASTA files; model organism whole genome sequences are packaged into more user-friendly `BSgenome` packages. The following calculates GC content across chr14. ```{r BSgenome-require, message=FALSE} library(BSgenome.Hsapiens.UCSC.hg19) chr14_range = GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"])) chr14_dna <- getSeq(Hsapiens, chr14_range) letterFrequency(chr14_dna, "GC", as.prob=TRUE) ``` ## _GenomicAlignments_ (Aligned reads) Classes -- GenomicRanges-like behaivor - GAlignments, GAlignmentPairs, GAlignmentsList Methods - `readGAlignments()`, `readGAlignmentsList()` - Easy to restrict input, iterate in chunks - `summarizeOverlaps()` Example - Find reads supporting the junction identified above, at position 19653707 + 66M = 19653773 of chromosome 14 ```{r bam-require} library(GenomicRanges) library(GenomicAlignments) library(Rsamtools) ## our 'region of interest' roi <- GRanges("chr14", IRanges(19653773, width=1)) ## sample data library('RNAseqData.HNRNPC.bam.chr14') bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE) ## alignments, junctions, overlapping our roi paln <- readGAlignmentsList(bf) j <- summarizeJunctions(paln, with.revmap=TRUE) j_overlap <- j[j %over% roi] ## supporting reads paln[j_overlap$revmap[[1]]] ``` ## _VariantAnnotation_ (Called variants) Classes -- GenomicRanges-like behavior - VCF -- 'wide' - VRanges -- 'tall' Functions and methods - I/O and filtering: `readVcf()`, `readGeno()`, `readInfo()`, `readGT()`, `writeVcf()`, `filterVcf()` - Annotation: `locateVariants()` (variants overlapping ranges), `predictCoding()`, `summarizeVariants()` - SNPs: `genotypeToSnpMatrix()`, `snpSummary()` Example - Read variants from a VCF file, and annotate with respect to a known gene model ```{r vcf, message=FALSE} ## input variants library(VariantAnnotation) fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") seqlevels(vcf) <- "chr22" ## known gene model library(TxDb.Hsapiens.UCSC.hg19.knownGene) coding <- locateVariants(rowRanges(vcf), TxDb.Hsapiens.UCSC.hg19.knownGene, CodingVariants()) head(coding) ``` Related packages - [ensemblVEP][] - Forward variants to Ensembl Variant Effect Predictor - [VariantTools][], [h5vc][] - Call variants Reference - Obenchain, V, Lawrence, M, Carey, V, Gogarten, S, Shannon, P, and Morgan, M. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics, first published online March 28, 2014 [doi:10.1093/bioinformatics/btu168](http://bioinformatics.oxfordjournals.org/content/early/2014/04/21/bioinformatics.btu168) ## _rtracklayer_ (Genome annotations) - `import()`: BED, GTF, WIG, 2bit, etc - `export()`: GRanges to BED, GTF, WIG, ... - Access UCSC genome browser ## _SummarizedExperiment_ - Integrate experimental data with sample, feature, and experiment-wide annotations - Matrix where rows are indexed by genomic ranges, columns by a DataFrame. ![Alt SummarizedExperiment](our_figures/SE_Description.png) Functions and methods - Accessors: `assay()` / `assays()`, `rowData()` / `rowRanges()`, `colData()`, `metadata()` - Range-based operations, especially `subsetByOverlaps()` # Input & representation of standard file formats ## BAM files of aligned reads -- `GenomicAlignments` Recall: overall workflow 1. Experimental design 2. Wet-lab preparation 3. High-throughput sequencing 4. Alignment - Whole genome, vs. transcriptome 5. Summary 6. Statistical analysis 7. Comprehension BAM files of aligned reads - Header @HD VN:1.0 SO:coordinate @SQ SN:chr1 LN:249250621 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 ... @SQ SN:chrY LN:59373566 @PG ID:TopHat VN:2.0.8b CL:/home/hpages/tophat-2.0.8b.Linux_x86_64/tophat --mate-inner-dist 150 --solexa-quals --max-multihits 5 --no-discordant --no-mixed --coverage-search --microexon-search --library-type fr-unstranded --num-threads 2 --output-dir tophat2_out/ERR127306 /home/hpages/bowtie2-2.1.0/indexes/hg19 fastq/ERR127306_1.fastq fastq/ERR127306_2.fastq - Alignments - ID, flag, alignment and mate ERR127306.7941162 403 chr14 19653689 3 72M = 19652348 -1413 ... ERR127306.22648137 145 chr14 19653692 1 72M = 19650044 -3720 ... - Sequence and quality ... GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%)) ... TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)**** - Tags ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:2 CC:Z:chr22 CP:i:16189276 HI:i:0 ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:3 CC:Z:= CP:i:19921600 HI:i:0 - Typically, sorted (by position) and indexed ('.bai' files) [GenomicAlignments][] - Use an example BAM file (`fl` could be the path to your own BAM file) ```{r genomicalignments} ## example BAM data library(RNAseqData.HNRNPC.bam.chr14) ## one BAM file fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1] ## Let R know that this is a BAM file, not just a character vector library(Rsamtools) bfl <- BamFile(fl) ``` - Input the data into R ```{r readgalignments} aln <- readGAlignments(bfl) aln ``` - `readGAlignmentPairs()` / `readGAlignmentsList()` if paired-end data - Lots of things to do, including all the _GRanges_ / _GRangesList_ operations ```{r galignments-methods} methods(class=class(aln)) ``` - **Caveat emptor**: BAM files are large. Normally you will _restrict_ the input to particular genomic ranges, or _iterate_ through the BAM file. Key _Bioconductor_ functions (e.g., `GenomicAlignments::summarizeOverlaps()` do this data management step for you. See next section! ## Other formats and packages ![Alt Files and the Bioconductor packages that input them](our_figures/FilesToPackages.png) ## Large data -- `BiocParallel`, `GenomicFiles` ### Restriction - Input only the data necessary, e.g., `ScanBamParam()` - `which`: genomic ranges of interest - `what`: 'columns' of BAM file, e.g., 'seq', 'flag' ### Iteration - Read entire file, but in chunks - Chunk size small enough to fit easily in memory, - Chunk size large enough to benefit from _R_'s vectorized operations -- 10k to 1M records at at time - e.g., `BamFile(..., yieldSize=100000)` Iterative programming model - _yield_ a chunk of data - _map_ input data to convenient representation, often summarizing input to simplified form - E.g., Aligned read coordinates to counts overlapping regions of interest - E.g., Aligned read sequenced to GC content - _reduce_ across mapped chunks - Use `GenomicFiles::reduceByYield()` ```{r iteration} library(GenomicFiles) yield <- function(bfl) { ## input a chunk of alignments library(GenomicAlignments) readGAlignments(bfl, param=ScanBamParam(what="seq")) } map <- function(aln) { ## Count G or C nucleotides per read library(Biostrings) gc <- letterFrequency(mcols(aln)$seq, "GC") ## Summarize number of reads with 0, 1, ... G or C nucleotides tabulate(1 + gc, 73) # max. read length: 72 } reduce <- `+` ``` - Example ```{r iteration-doit} library(RNAseqData.HNRNPC.bam.chr14) fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES bf <- BamFile(fls[1], yieldSize=100000) gc <- reduceByYield(bf, yield, map, reduce) plot(gc, type="h", xlab="GC Content per Aligned Read", ylab="Number of Reads") ``` ### Parallel evaluation - Cores, computers, clusters, clouds - Generally, requires memory management techniques like restriction or iteration -- parallel processes competing for shared memory - Many problems are _embarassingly parallel_ -- `lapply()`-like -- especially in bioinformatics where parallel evaluation is across files - Example: GC content in several BAM files ```{r parallel-doit} library(BiocParallel) gc <- bplapply(BamFileList(fls), reduceByYield, yield, map, reduce) library(ggplot2) df <- stack(as.data.frame(lapply(gc, cumsum))) df$GC <- 0:72 ggplot(df, aes(x=GC, y=values)) + geom_line(aes(colour=ind)) + xlab("Number of GC Nucleotides per Read") + ylab("Number of Reads") ``` [biocViews]: http://bioconductor.org/packages/BiocViews.html#___Software [AnnotationData]: http://bioconductor.org/packages/BiocViews.html#___AnnotationData [aprof]: http://cran.r-project.org/web/packages/aprof/index.html [hexbin]: http://cran.r-project.org/web/packages/hexbin/index.html [lineprof]: https://github.com/hadley/lineprof [microbenchmark]: http://cran.r-project.org/web/packages/microbenchmark/index.html [AnnotationDbi]: http://bioconductor.org/packages/AnnotationDbi [BSgenome]: http://bioconductor.org/packages/BSgenome [BiocParallel]: http://bioconductor.org/packages/BiocParallel [Biostrings]: http://bioconductor.org/packages/Biostrings [CNTools]: http://bioconductor.org/packages/CNTools [ChIPQC]: http://bioconductor.org/packages/ChIPQC [ChIPpeakAnno]: http://bioconductor.org/packages/ChIPpeakAnno [DESeq2]: http://bioconductor.org/packages/DESeq2 [DiffBind]: http://bioconductor.org/packages/DiffBind [GenomicAlignments]: http://bioconductor.org/packages/GenomicAlignments [GenomicRanges]: http://bioconductor.org/packages/GenomicRanges [IRanges]: http://bioconductor.org/packages/IRanges [KEGGREST]: http://bioconductor.org/packages/KEGGREST [PSICQUIC]: http://bioconductor.org/packages/PSICQUIC [rtracklayer]: http://bioconductor.org/packages/rtracklayer [Rsamtools]: http://bioconductor.org/packages/Rsamtools [ShortRead]: http://bioconductor.org/packages/ShortRead [VariantAnnotation]: http://bioconductor.org/packages/VariantAnnotation [VariantFiltering]: http://bioconductor.org/packages/VariantFiltering [VariantTools]: http://bioconductor.org/packages/VariantTools [biomaRt]: http://bioconductor.org/packages/biomaRt [cn.mops]: http://bioconductor.org/packages/cn.mops [h5vc]: http://bioconductor.org/packages/h5vc [edgeR]: http://bioconductor.org/packages/edgeR [ensemblVEP]: http://bioconductor.org/packages/ensemblVEP [limma]: http://bioconductor.org/packages/limma [metagenomeSeq]: http://bioconductor.org/packages/metagenomeSeq [phyloseq]: http://bioconductor.org/packages/phyloseq [snpStats]: http://bioconductor.org/packages/snpStats [org.Hs.eg.db]: http://bioconductor.org/packages/org.Hs.eg.db [TxDb.Hsapiens.UCSC.hg19.knownGene]: http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene [BSgenome.Hsapiens.UCSC.hg19]: http://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19