--- title: "_R_ and _Bioconductor_ for Genomic Analysis" author: - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, New York email: Martin.Morgan@RoswellPark.org package: BiocIntro output: BiocStyle::html_document: toc: true vignette: | %\VignetteIndexEntry{R and Bioconductor for Genomic Analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction Description: This workshop will introduce you to the _Bioconductor_ collection of R packages for statistical analysis and comprehension of high-throughput genomic data. The emphasis is on data exploration, using RNA-sequence gene expression experiments as a motivating example. How can I access common sequence data formats from R? How can I use information about gene models or gene annotations in my analysis? How do the properties of my data influence the statistical analyses I should perform? What common workflows can I perform with R and _Bioconductor_? How do I deal with very large data sets in R? These are the sorts of questions that will be tackled in this workshop. Requirements: You will need to bring your own laptop. The workshop will use cloud-based resources, so your laptop will need a web browser and WiFi capabilities. Participants should have used _R_ and _RStudio_ for tasks such as those covered in introductory workshops earlier in the week. Some knowledge of the biology of gene expression and of concepts learned in a first course in statistics will be helpful. Relevance: This workshop is relevant to anyone eager to explore genomic data in _R_. The workshop will help connect ‘core’ _R_ concepts for working with data (e.g., data management via `data.frame()`, statistical modelling with `lm()` or `t.test()`, visualization using `plot()` or `ggplot()`) to the special challenges of working with large genomic data sets. It will be especially helpful to those who have or will have their own genomic data, and are interested in more fully understanding how to work with it in _R_. ```{r, echo=FALSE} knitr::opts_chunk$set( cache = as.logical(Sys.getenv("KNITR_CACHE", TRUE)), collapse = TRUE ) ``` ## Our goal RNA-seq - Designed experiment, e.g., 8 samples from four cell lines exposed to two treatments (based on Himes et al., PMID: [24926665][]; details in the [airway][] package vignette). [24926665]: http://www.ncbi.nlm.nih.gov/pubmed/24926665 [airway]: https://bioconductor.org/packages/airway ``` cell dex SRR1039508 N61311 untrt SRR1039509 N61311 trt SRR1039512 N052611 untrt SRR1039513 N052611 trt SRR1039516 N080611 untrt SRR1039517 N080611 trt SRR1039520 N061011 untrt SRR1039521 N061011 trt ``` - Library preparation: mRNA to stable double-stranded DNA - DNA sequencing of 'short' mRNA-derived fragments - Alignment to a reference genome or transcriptome ```{r, echo=FALSE} knitr::include_graphics("our_figures/RNASeq_workflow.jpeg") ``` source: http://bio.lundberg.gu.se/courses/vt13/rnaseq.html - End result: a matrix of 'counts' -- reads aligning to genes across samples. ``` SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 ENSG00000000003 679 448 873 408 1138 ENSG00000000005 0 0 0 0 0 ENSG00000000419 467 515 621 365 587 ... ... ... ... ... ... SRR1039517 SRR1039520 SRR1039521 ENSG00000000003 1047 770 572 ENSG00000000005 0 0 0 ENSG00000000419 799 417 508 ... ... ... ... ``` Research question - Which gene counts are most different between dexamethasone `untrt` and `trt` experimental treatments? - We'll try to understand _how_ to accomplish this, without going into statistical details. Our goal - Visualize 20 most differentially expressed genes as a heatmap. ```{r, echo=FALSE} knitr::include_graphics("our_figures/top20_heatmap.png") ``` # Data gathering, input, representation, and cleaning ## Base _R_ data structures Sample information - Simple 'tab-separated value' text file, e.g., from Excel export. - Input using base _R_ command `read.table()` - 'Atomic' vectors, e.g., `integer()` - `factor()` and `NA` - `data.frame()`: coordinated management - Column access with `$` - Subset with `[ , ]` ```{r samples} samples_file <- system.file(package="BiocIntro", "extdata", "samples.tsv") samples <- read.table(samples_file) samples samples$dex <- relevel(samples$dex, "untrt") ``` Counts - Another tsv file. Many rows, so use `head()` to view the first few. - Row names: gene identifiers. Column names: sample identifiers. - All columns are the same (numeric) type; represent as a `matrix()` rather than `data.frame()` ```{r counts} counts_file <- system.file(package="BiocIntro", "extdata", "counts.tsv") counts <- read.table(counts_file) dim(counts) head(counts) counts <- as.matrix(counts) ``` ## Genomic ranges (`GRanges`) Row annotations. - 'GTF' files contain information about gene models. - The GTF file relevant to this experiment -- same organism (_Homo sapiens_), genome (GRCh37) and gene model annotations (Ensembl release-75) as used in the alignment and counting step -- is ```{r} url <- "ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz" ``` - Use `BiocFileCache` to download the resource once to a location that persists across _R_ sessions. ```{r, message = FALSE} library(BiocFileCache) ``` - About _Bioconductor_ packages - `BiocFileCache` is available from https://bioconductor.org - Discover packages at https://bioconductor.org/packages - Learn about `BiocFileCache` from the [BiocFileCache][] 'landing page'. - Explore the `BiocFileCache` package [vignette][] (access the vignette from within _R_: `browseVignettes("BiocFileCache")`). - Install the package with `BiocManager::install("BiocFileCache")`. - Find help on functions with, e.g., `?bfcrpath`. - Ask questions at https://support.bioconductor.org [BiocFileCache]: https://bioconductor.org/packages/BiocFileCache [vignette]: https://bioconductor.org/packages/release/bioc/vignettes/BiocFileCache/inst/doc/BiocFileCache.html ```{r bfcrpath} gtf_file <- bfcrpath(rnames = url) ``` - GTF files are plain text files and _could_ be read using `read.table()` or similar, but contain structured information that we want to represent in _R_. - Common sequence data formats - BED, GTF, bigWig: [rtracklayer][] - FASTA (DNA sequence): [Biostrings][] - FASTQ (short reads & quality scores): [ShortRead][] - BAM (aligned reads): [Rsamtools][], [GenomicAlignments][] - VCF (called variants): [VariantAnnotation][], [VariantFiltering][]. MAF: [maftools][] - Use the [rtracklayer][] package to import the file. [rtracklayer]: https://bioconductor.org/packages/rtracklayer ```{r import.gtf, message = FALSE} library(rtracklayer) gtf <- import(gtf_file) ``` - A `GRanges` object - Range-specific information - Annotations on each range - Use functions to access core elements: `seqnames()` (e.g., chromosome), `start()` / `end()` / `width()`, `strand()`, etc. - Use `$` or `mcols()$` to access annotations on ranges. - _Bioconductor_ conventions: 1-based, closed intervals (like Ensembl) rather than 0-based, 1/2 open intervals (like UCSC). [Biostrings]: https://bioconductor.org/packages/Biostrings [ShortRead]: https://bioconductor.org/packages/ShortRead [Rsamtools]: https://bioconductor.org/packages/Rsamtools [GenomicAlignments]: https://bioconductor.org/packages/GenomicAlignments [VariantAnnotation]: https://bioconductor.org/packages/VariantAnnotation [VariantFiltering]: https://bioconductor.org/packages/VariantFiltering [maftools]: https://bioconductor.org/packages/maftools - Filter the information to gene-level annotations, keeping only some of the information about each genomic range. Use the `gene_id` column as `names()`. ```{r} rowidx <- gtf$type == "gene" colidx <- c("gene_id", "gene_name", "gene_biotype") genes <- gtf[rowidx, colidx] names(genes) <- genes$gene_id genes$gene_id <- NULL genes ``` ## Coordinated management (`SummarizedExperiment`) Three different data sets - `counts`: results of the RNAseq workflow - `samples`: sample and experimental design information - `genes`: information about the genes that we've assayed. Coordinate our manipulation - Avoid 'bookkeeping' errors when, e.g., we subset one part of the data in a way different from another. - Use the [SummarizedExperiment][] package and data representation. - Two-dimensional structure, so subset with `[,]` - Use functions to access components: `assay()`, `rowData()`, `rowRanges()`, `colData()`, etc. ```{r, echo=FALSE} knitr::include_graphics("our_figures/SummarizedExperiment.svg") ``` [SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment ```{r, message = FALSE} library(SummarizedExperiment) ``` - Make sure the order of the `samples` rows match the order of the samples in the columns of the `counts` matrix, and the order of the `genes` rows match the order of the rows of the `counts` matrix. - Create a `SummarizedExperiment` to coordinate our data manipulation. ```{r SummarizedExperiment} samples <- samples[colnames(counts),] genes <- genes[rownames(counts),] se <- SummarizedExperiment( assays = list(counts = counts), rowRanges = genes, colData = samples ) se ``` # Analysis & visualization ## Differential expression analysis Gestalt - Perform a `t.test()` for each row of the count matrix, asking whether the `trt` samples have on average counts that differ from the `untrt` samples. - _Many_ nuanced statistical issues The [DESeq2][] pacakge - Implements efficient, 'correct', robust algorithms for performing RNA-seq differential expression analysis of moderate-sized experiments. [DESeq2]: https://bioconductor.org/packages/DESeq2 ```{r, message = FALSE} library(DESeq2) ``` - Specify our experimental design, perform the analysis taking account of the nuanced statistical issues, and get a summary of the results. The details of this step are beyond the scope of this workshop. ```{r DESeq2} dds <- DESeqDataSet(se, ~ cell + dex) fit <- DESeq(dds) destats <- results(fit) destats ``` - Add the results to our `SummarizedExperiment`, so that we can manipulate these in a coordianted fashion too. ```{r} rowData(se) <- cbind(rowData(se), destats) ``` ## Heatmap - Use `order()` and `head()` to identify the row indexes of the top 20 most differentially expressed (based on adjusted P-value) genes. - Subset the our `SummarizedExperiment` to contain just these rows. - Dispaly the `assay()` of our subset as a heatmap ```{r heatmap-1} top20idx <- head( order(rowData(se)$padj), 20) top20 <- se[top20idx,] heatmap(assay(top20)) ``` Update row labels and adding information about treatment group. - Extract the top 20 matrix. - Update the row names of the matrix with the corresponding gene names. - Create a vector of colors, one for each sample, with the color determined by the dexamethasone treatment. ```{r heatmap-2} m <- assay(top20) rownames(m) <- rowData(top20)$gene_name trtcolor <- hcl.colors(2, "Pastel 1")[ colData(top20)$dex ] heatmap(m, ColSideColors = trtcolor) ``` ## Volcano plot - Plots 'statistical significance' on the Y-axis, 'biological significance' on the X-axis. - Use `plot()` to create the points - Use `text()` to add labels to the two most significant genes. ```{r volcanoplot} plot(-log10(pvalue) ~ log2FoldChange, rowData(se)) label <- with( rowData(se), ifelse(-log10(pvalue) > 130, gene_name, "") ) text( -log10(pvalue) ~ log2FoldChange, rowData(se), label = label, pos = 4, srt=-15 ) ``` ## Top table and tidy data Goal - Provide a concise summary of the 20 most differentially expressed genes. [dplyr][] and 'tidy' data - A convenient way to display and manipulate strictly tabular data. - 'long form' tables where each row represents an observation and each column an attribute measured on the observations. - `tibble`: a `data.frame` with better display properties - `%>%`, e.g., `mtcars %>% count(cyl)`: a pipe that takes a tibble (or data.frame) `tbl` on the left and uses it as an argument to a small number of functions like `count()` on the right. - 'Tidy' functions usually return a `tibble()`, and hence can be chained together. [dplyr]: https://cran.r-project.org/package=dplyr ```{r dplyr, message = FALSE} library(dplyr) ``` - Steps below: - `as_tibble()`: create a tibble from `rowData(se)` - `select()`: select specific columns. - `arrange()`: arrange all rows from smallest to largest `padj` - `head()`: filter to the first 20 rows ```{r toptable} rowData(se) %>% as_tibble(rownames = "ensembl_id") %>% select(ensembl_id, gene_name, baseMean, log2FoldChange, padj) %>% arrange(padj) %>% head(n = 20) ``` - Check out the [plyranges][] workshop! [plyranges]: https://bioconductor.org/packages/plyranges # Summary ## What we've learned Packages - Discover at https://bioconductor.org/packages - Install with `BiocManager::install("BiocFileCache")` - Use with `library(BiocFileCache)`. - Get help with `?bfcrpath` - Get more help at https://support.bioconductor.org - Mature packages provide access to common sequence analysis data formats, e.g., BED, GTF, FASTQ, BAM, VCF. Data structures - Represent and coordinate 'complicated' data. - Already prevalent in base _R_, e.g., `data.frame()`, `matrix()`. - `GRanges` for representing genomic ranges - 'Accessor' functions `seqnames()`, `start()`, etc for core components - `$` or `mcols()$` for annotations - `SummarizedExperiment` for coordinated manipulation of assay data with row and column annotation. - `[,]` to subset assay and annotaions in a coordinated fashion. - `assay()`, `rowRanges()`, `rowData()`, `colData()` to access components. Analysis - Mature packages like [DESeq2][] provide excellent vignettes, well-defined steps in analysis, integration with other workflow steps, and very robust support. - Emerging areas are often represented by several packages implementing less complete or certain steps in analysis. ## Next steps Single-cell RNA-seq: an amazing resource: [Orchestarting Single Cell Analysis][osca] with _Bioconductor_, including the [scran][] and [scater][] packages. - Quality control - Normalization - Feature selection - Dimensionality reduction - Clustering - Marker gene detection - Cell type annotation ([SingleR][]) (this package has a great vignette!) - Trajectory analysis ([destiny][]) - Gene set enrichment - Etc.! [osca]: https://osca.bioconductor.org [scran]: https://bioconductor.org/packages/scran [scater]: https://bioconductor.org/packages/scater [SingleR]: https://bioconductor.org/packages/SingleR [destiny]: https://bioconductor.org/packages/destiny Other prominent domains of analysis (check out [biocViews][]) - Microarrays -- epigenomic, classical expression, copy number - Annotated variants - Gene set enrichment - Flow cytometry - Proteomics [biocViews]: https://bioconductor.org/packages Participate! - Get help on the [support site][]. - [Join][join-slack] and [use][use-slack] the slack community. - Participate in other conferences (e.g., [BioC2020][] in Boston, July 29 - 31, 2020). [support site]: https://support.bioconductor.org [join-slack]: https://bioc-community.herokuapp.com/ [use-slack]: https://community-bioc.slack.com [BioC2020]: https://bioc2020.bioconductor.org/ # Acknowledgements Research reported in this presentation was supported by the NCI and NHGRI of the National Institutes of Health under award numbers U24CA232979, U41HG004059, and U24CA180996. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. A portion of this work is supported by the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. ```{r sessionInfo, echo=FALSE} sessionInfo() ```