--- title: "Course Notes" author: "Martin Morgan" date: "19/10/2015" output: html_document vignette: > % \VignetteIndexEntry{Course Notes} % \VignetteEngine{knitr::rmarkdown} --- These notes were created during the course, and server as a transcript of topics covered. # Intro to sequencing Workflow 1. Experimental design 2. Wet lab sample prep, etc 3. Sequencing - FASTQ file of reads and their quality scores - Quality assessment (FASTQ program), trimming or removing contanimants, removing optical duplicates (FASTX, trimomatic) - Quality with respect to _your_ research question 4. Alignment / (assembly) - BAM file of aligned reads to a known reference genome - Aligners: vary from simple to use to hard to use, from 'good enough' alignments (for RNA-seq of known genes, ChIP-seq) to high-quality (e.g., DNA-seq calling variants) - Bowtie2 (easy, good enough), gmap (excellent, hard to use). - Purpose-built tools that align and reduce. E.g., RNA-seq known gene differential expression -- [kalisto][], sailfish 5. Reduction - BED of called peaks in a ChIP-seq experiment (e.g., MACS, FindPeaks) - VCF of called variants (GATK, bcftools) - Count table (e.g., tsv) in an RNA-seq experiment (python htseq2; `GenomicFeatures::summarizeOverlaps()`) 6. (Statistical) analysis - Why _statistical_ analysis? data is fundamentally huge; biological questions are framed in terms of classical statistics, e.g., designed experiments, hypothesis testing; technical and other artifacts, e.g., GC bias, mapability, [batch effects][] - Appropriate tools: able to cope with statistics; access to advanced statistical methods; analysis _has_ to be reproducible (some sort of scripting); processing large amounts of data is _not_ the primary criterion. - _R_ / _Bioconductor_ is the best most awesome tool. 7. Comprehension - .Rmd or similar documenting the work flow, including inputs, analysis steps, tables, figures, interpertation... ## FASTQ and BAM files View from the Linux command line... - `zcat *fastq.gz | less` - `samtools view -h *bam` ... or within _R_ / _Bioconductor_: fastq files ```{r ShortRead-fastq} library(ShortRead) strm = FastqStreamer("bigdata/SRR1039508_1.fastq.gz", 100000) fq = yield(strm) fq sread(fq) quality(fq) ``` # _R_ - Statistical programming language - _Vectorized_ (works efficiently on vectors; vector notation is very expressive and compact) - _Objects_ help to coordinate management of related data - _Introspection_ helps discover what can be done with objects. ```{r r-intro} x = rnorm(1000) y = x + rnorm(1000, sd=.5) df = data.frame(x=x, y=y) plot(y ~ x, df) fit = lm(y ~ x, df) class(fit) methods(class=class(fit)) methods("anova") ``` Help! ```{r help, eval=FALSE} ?log ?plot # generic 'plot' ?plot.lm # plot for objects of class 'lm' ``` # _Bioconductor_ - Main [web site][], including [biocViews][] - Package landing pages, e.g., [ChIPseeker][] - The [support forum][] - 1100+ packages for analysis and comprehension of high-throughput genomic data: sequencing (RNA, ChIP, variants, ...), microarray (expression, methylation, copy number, etc), flow cytometry, proteomics, imaging, ... Extensive use of 'S4' classes - `fit` (from `lm()`) is an example of an S3 class - `sread(fq)` returned a _DNAStringSet_, an example of an S4 class ```{r S4} library(ShortRead) strm = FastqStreamer("bigdata/SRR1039508_1.fastq.gz", 100000) fq = yield(strm) # 'ShortReadQ' S4 class class(fq) # introspection methods(class=class(fq)) reads = sread(fq) # accessor -- get the reads reads # 'DNAStringSet' S4 class methods(class=class(reads)) gc = letterFrequency(reads, "GC", as.prob=TRUE) hist(gc) ``` Help! ```{r S4-help, eval=FALSE} ?DNAStringSet # class, and often frequently used methods ?letterFrequency # generic methods("letterFrequency") ?"letterFrequency,XStringSet-method" ``` # And... Key software packages... - [ShortRead][] for FASTQ files - [GenomicAlignments][] for aligned reads - [VariantAnnotation][] for VCF files - [rtracklayer][] `import()` to import BED, WIG, GFF, GTF, ..., files - [Gviz][] for visualization of genomic data; [ReportTools][] for reports; [shiny][] for interactive visualizations ... and classes - _DNAStringSet_, _DNAString_ for sequence data - _GRanges_, _GRangesList_ for representing coordinates in genome space - _SummarizedExperiment_ (_ExpressionSet_): integrated data contains: rows x columns (features x samples) - `assays()` - `rowRanges()` for annotations on rows - `colData()` for column annotations Annotation - Pure 'data' packages - Identifier mapping `org.*` packages - Gene models with `TxDb.*` packages - Whole genome sequences `BSgenome.*` packages - [biomaRt][] for accessing ENSEMBL-based biomarts; [AnnotationHub][] for genome-scale annotation resources Strategies for working with big data - Write efficient _R_ code -- vectorized - Process data in chunks, e.g., `FastqStreamer()`, `Rsamtools::BamFile(..., yieldSize=1000000)`; `GenomicFiles::reduceByYield()` (see examples on `?reduceByYield`) - Process in parallel [BiocParallel][] All material on the [course materials][] page [AnnotationHub]: http://bioconductor.org/packages/AnnotationHub [BiocParallel]: http://bioconductor.org/packages/BiocParallel [ChIPseeker]: http://bioconductor.org/packages/ChIPseeker [GenomicAlignments]: http://bioconductor.org/packages/GenomicAlignments [Gviz]: http://bioconductor.org/packages/Gviz [ReportTools]: http://bioconductor.org/packages/ReportingTools [ShortRead]: http://bioconductor.org/packages/ShortRead [VariantAnnotation]: http://bioconductor.org/packages/VariantAnnotation [biomaRt]: http://bioconductor.org/packages/biomaRt [rtracklayer]: http://bioconductor.org/packages/rtracklayer [shiny]: https://cran.r-project.org/package=shiny [support forum]: https://support.bioconductor.org [batch effects]: https://github.com/Bioconductor/BiocUruguay2015/blob/master/vignettes/our_figures/nrg2825-f2.jpg [biocViews]: http://bioconductor.org/packages/release/ [course materials]: http://bioconductor.org/help/course-materials [kalisto]: http://pachterlab.github.io/kallisto/ [web site]: http://bioconductor.org