--- title: "_Bioconductor_ and Beyond: Workshop" author: - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, New York email: Martin.Morgan@RoswellPark.org date: 18 November 2019 package: BiocIntro output: BiocStyle::html_document vignette: | %\VignetteIndexEntry{How You Can Advance Science Using Bioconductor} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE} knitr::opts_chunk$set(cache = TRUE, collapse = TRUE) ``` # Objectives Abstract: This two-hour workshop is meant to empower Cancer Moonshot research labs to tackle their bioinformatic analysis challenges. For the first hour and a half, we’ll work through a hands-on workflow for single cell assay analysis. We’ll introduce data import, management, and interactive visualization using Bioconductor tools like iSEE. After seeing how to work with one assay, we’ll briefly explore approaches to integrating different assays. In the final ½ hour we’ll go beyond Bioconductor tools for single-cell analysis. We’ll assemble a panel to discuss possible strategies for data analysis challenges submitted (before the workshop) to the organizers. Goal: Empower Cancer Moonshot Research Labs to tackle their bioinformatic analysis challenges Objectives, this workshop: 1. Learn the basics of _R_ and _Bioconductor_ 2. Participate in the exploration of immuno-oncology relevant data using _R_ / _Bioconductor_ 3. Tour additional directions possible in _R_ / _Bioconductor_ 4. Discuss challenges and opportunities in immuno-oncology bioinformatics. # _R_ and _Bioconductor_ 101 ## _R_ **What we will learn** - Working with _R_ functions, variables, vectors and data structures - Using packages to extend base _R_ capabilities - Getting help _R_ - standard and advanced statistical analysis - high quality visualizations - interactivity Vectors, variables, and functions ```{r} x = rnorm(100) mean(x) var(x) hist(x) ``` Manageing data: classes and methods ```{r, fig.asp = 1} y = x + rnorm(100) df = data.frame(x, y) plot(y ~ x, df) ``` Visualization ```{r, fig.asp = 1} fit = lm(y ~ x, df) anova(fit) plot(y ~ x, df) abline(fit) ``` Extending base R: packages ```{r, message=FALSE, fig.asp = 1} library(ggplot2) ggplot(df, aes(x, y)) + geom_point() + geom_smooth(method="lm") ``` CRAN (Comprehensive _R_ Archive Network) - 15000+ R packages: https://cran.r-project.org/web/packages - Task views: https://cran.r-project.org/web/views Help! - `?lm` - `browseVignettes("ggplot2")` / `vignette(package="ggplot2")` **What we learned** - Vectors simplify _R_ expressions - Structures like `data.frame` help manage data - Packages provide many different extensions - Help is available through `?`, `browseVignettes()` and other means ## _Bioconductor_ **What we will learn** - Discover and use _Bioconductor_ packages - Work with `SummarizedExperiment` for data management - Use annotation packages to map between gene identifiers _Bioconductor_ - More than 1800 _R_ packages for statistical analysis and comprehension of high-throughput genomic data - Bulk and single-cell RNA-seq, epigenetic and other microarrays, called variants, flow cytometry, proteomics, ... - Widely used (>1/2 million unique IP downloads / year), highly cited (>33,000 PubMedCentral citations), well-respected - NIH funded -- NHGRI (core, cloud), ITCR (multi-assay, annotation- and experiment-hub), IOTN (immuno-oncology data coordinating center) Resources - Project: https://bioconductor.org - 1823 Packages: https://bioconductor.org/packages - Vignettes (e.g., [DESeq2][]) - Software, annotation, and experiment (exemplar) data packages - Support: https://support.bioconductor.org - [Workflows][], [training material][], [slack][] ([join][]), etc. [DESeq2]: http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html [Workflows]: https://bioconductor.org/packages/release/BiocViews.html#___Workflow [training material]: https://bioconductor.org/help/course-materials/ [slack]: https://community-bioc.slack.com/ [join]: https://bioc-community.herokuapp.com/ Data management - Extensive resources to import and operate on standard file formats, e.g. BED, VCF, BAM, ... Domain-specific work flows, e.g., bulk RNA-seq diffrential expression - Load example data, pre-formatted. ```{r, message = FALSE} library(airway) data(airway) # load example data airway ``` - A _Bioconductor_ _SummarizedExperiment_ object, providing coordinated data management ```{r, echo=FALSE} knitr::include_graphics("our_figures/SummarizedExperiment.svg") ``` - `colData()`: description of samples, especially four cell lines exposed to two dexamethasone treatments ```{r} colData(airway) airway$cell ``` - `assay()` data: RNA-seq reads overlapping genes (proxy for gene expression) ```{r} head(assay(airway)) ``` - Load package implementing domain-specific work flow (i.e., bulk RNA-seq differential expression) - Fit a model describing counts as a function of cell line (blocking variable) and dexamethasone treatment. - Obtain a 'top table' of differentially expressed genes ```{r, message=FALSE, collapse=TRUE} library(DESeq2) # 'Software' package -- bulk RNA-seq differential expression dds = DESeqDataSet(airway, ~ cell + dex) fit = DESeq(dds) toptable = results(fit) toptable ``` - Find resources to map between identifiers using AnnotationHub ```{r, message = FALSE} library(AnnotationHub) query(AnnotationHub(), c("EnsDb", "Homo sapiens", "98")) ensdb <- AnnotationHub()[["AH75011"]] ``` - Map from Ensembl identifiers to gene symbol; add the annotation to both the top table and airway data sets ```{r, message = FALSE, collapse = TRUE} egids <- rownames(toptable) toptable$SYMBOL <- mapIds(ensdb, egids, "SYMBOL", "GENEID") rowData(airway)$SYMBOL <- toptable$SYMBOL ``` - Most differentially expressed genes ```{r} idx <- head( order( toptable$padj), 20 ) toptable[idx, c("SYMBOL", "padj")] ``` - Heat map ```{r} heatmap( log1p( assay(airway)[idx,] )) ``` - More user-friendly heatmap, with Ensembl row namess replaced by symbols. ```{r} m <- assay(airway)[idx,] rownames(m) <- rowData(airway)$SYMBOL[idx] heatmap( log1p( m )) ``` **What we learned** - https://bioconductor.org is the entry point for high-throughput genomic analysis in _R_. - `SummarizedExperiment` allows easy management of complicated data. Use functions like `colData()`, `assay()`, and `[` for manipulation. - AnnotationHub provides resources for mapping between identifiers. Use `query()` to discover common resources. # Immuno-oncology analysis: single-cell RNAseq **What we will do** - Recreate figures from a recent paper, illustrating how some ideas in single-cell analysis translate into analysis. - Survey other options for single cell and multi-omic analysis available in _R_ / _Bioconductor_ An excellent resource -- osca -- - https://osca.bioconductor.org/ For today's workshop - Richard, A. C., A. T. L. Lun, W. W. Y. Lau, B. Gottgens, J. C. Marioni, and G. M. Griffiths. 2018. "T cell cytolytic capacity is independent of initial stimulation strength." [Nat. Immunol. 19 (8):849–58][PMC6300116]. [PMC6300116]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6300116/ Goal: reproduce parts of [Figure 1][]. [Figure 1]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6300116/figure/F1/?report=objectonly ```{r, echo=FALSE} knitr::include_graphics("our_figures/PMC6300116_fig_1.jpg") ``` Our part in the big picture (from Amezquita et al., Orchestrating Single-Cell Analysis with Bioconductor bioRxiv [590562][]) [590562]: https://doi.org/10.1101/590562 ```{r, echo=FALSE} knitr::include_graphics("our_figures/F3.large.jpg") ``` ## Filtering, normalization, and selecting subsets-of-interest Packages we'll use - [scRNAseq][]: Access to select scRNA-seq data sets - [SingleCellExperiment][]: Data representation - [scater][]: Single-cell analysis toolkit -- normalization - [AnnotationHub][], [ensembldb][]: Annotation - [destiny][]: Psuedotime. - [dplyr][]: 'tidy' data management. - [ggplot2][]: Visualization - Base R: filtering & data management [scRNAseq]: https://bioconductor.org/packages/scRNAseq [SingleCellExperiment]: https://bioconductor.org/packages/SingleCellExperiment [scater]: https://bioconductor.org/packages/scater [AnnotationHub]: https://bioconductor.org/packages/AnnotationHub [ensembldb]: https://bioconductor.org/packages/ensembldb [destiny]: https://bioconductor.org/packages/destiny [dplyr]: https://cran.r-project.org/package=dplyr [ggplot2]: https://cran.r-project.org/package=ggplot2 ```{r, message=FALSE} library(scRNAseq) library(SingleCellExperiment) library(scater) library(AnnotationHub) library(ensembldb) library(destiny) library(dplyr) library(ggplot2) ``` Example data set - Available via ExperimentHub and [scRNAseq][] package - Specialized version of SummarizedExperiment ```{r, message=FALSE} tcell = RichardTCellData() tcell ``` Filtering - 'informative' (often expressed) across cells ```{r} ## Expression in at least 5% of cells... drop_filter <- rowSums(assay(tcell) > 0) >= (ncol(tcell) * 0.05) tcell <- tcell[drop_filter,] ## ...and mean count at least 1 mean_filter <- rowMeans(assay(tcell)) >= 1 tcell <- tcell[mean_filter,] ``` Normalization ```{r, message=FALSE} tcell <- logNormCounts(tcell) ``` Cells of interest -- time course - What experimental treatments do we have? [dplyr]: https://cran.r-project.org/package=dplyr ```{r, message=FALSE} as_tibble(colData(tcell)) %>% dplyr::filter(post.analysis.well.quality == "pass") %>% dplyr::count(individual, stimulus, time) ``` - Subset `tcell` to contain only the samples of interest, and expressed genes with non-zero read counts across samples. ```{r} stimulus <- c("OT-I high affinity peptide N4 (SIINFEKL)", "unstimulated") cells_of_interest <- ( tcell$stimulus %in% stimulus ) & ( tcell$`post-analysis well quality` %in% "pass" ) expressed_genes <- rowSums(logcounts(tcell)[, cells_of_interest]) != 0 fig1 <- tcell[expressed_genes, cells_of_interest] ``` - Some basic properties and manipulations ```{r, collpase = TRUE} mean(logcounts(fig1) == 0) # proportion of 0's ## update colData -- treat `time` as factor (discrete) rather than continuous colData(fig1)$time <- factor(colData(fig1)$time) ## Save useful row-wise summaries rowData(fig1)$ave_count <- rowMeans(assay(fig1)) rowData(fig1)$n_cells <- rowSums(assay(fig1) > 0) ``` ## Visualization -- 'looking at' the data Dimensionality reduction - Add reduced dimension information to `fig1` ```{r} fig1 <- runPCA(fig1, ncomponents = 4) fig1 <- runTSNE(fig1) fig1 ``` - Visualize ```{r, collapse = TRUE, fig.asp = 1} plotPCA(fig1, colour_by = "time", point_size = 4) plotTSNE(fig1, colour_by = "time", point_size = 4) ``` Our favorite genes - Use an 'annotation resource' to map between unfriendly Ensembl gene identifiers and more familiar gene symbols - Discover and retrieve available resources -- an `EnsDb` (Ensembl data base) for _Mus musculus_, Ensembl release 98 ```{r, message=FALSE} query(AnnotationHub(), c("EnsDb", "Mus musculus", "98")) ensdb <- AnnotationHub()[["AH75036"]] ``` - Map all Ensembl gene identifiers to gene symbol, adding a row annotation to `fig1` ```{r} rowData(fig1)$SYMBOL <- mapIds(ensdb, rownames(fig1), "SYMBOL", "GENEID") ``` - Boxplot of expression in a favorite gene ```{r, fig.asp=1} rows <- rowData(fig1)$SYMBOL %in% "Nr4a1" tbl <- tibble( time = fig1$time, Nr4a1 = logcounts(fig1)[rows,] ) ggplot(tbl, aes(time, Nr4a1, color = time)) + geom_boxplot() + geom_jitter(width = .1) ``` **What we learned** - Analyses use several different packages from _R_ and _Bioconductor_. - Alot of effort is spent organizing data into appropriate formats for input to functions, and reshaping outputs of one function for the next step in the analysis. - _Bioconductor_ data structures like `SummarizedExperiment` help substantially with these data managment tasks. - Vignettes, resources like [osca][], and well-documented methods in published papers help orient us as we tackle new projects. [osca]: https://osca.bioconductor.org/ ## Further analysis https://osca.bioconductor.org - Quality control - Normalization - Feature selection - Dimensionality reduction - Clustering - Marker gene detection - Cell type annotation ([SingleR][]) (this package has a great vignette!) - Trajectory analysis ([destiny][]) - Gene set enrichment - Etc.! [SingleR]: https://bioconductor.org/packages/SingleR [destiny]: https://bioconductor.org/packages/destiny ## Interactive visualization - See the [iSEE vignette][] [iSEE vignette]: https://bioconductor.org/packages/release/bioc/vignettes/iSEE/inst/doc/basic.html ```{r iSEE, eval = FALSE} iSEE::iSEE(fig1) ``` # Next steps ## Multi-omic analysis - Coordinated management of multiple genetic data types - See the [MultiAssayExperiment vignette][] for deails [MultiAssayExperiment vignette]: https://bioconductor.org/packages/release/bioc/vignettes/MultiAssayExperiment/inst/doc/MultiAssayExperiment.html - Available data types ```{r, message=FALSE} library(MultiAssayExperiment) library(curatedTCGAData) curatedTCGAData() ``` - Particular data -- Breast cancer (BRCA) mutation, gene expression, and protein expression data. ```{r, message=FALSE} assays = c("Mutation", "RNASeqGene", "RPPAArray") brca = curatedTCGAData("BRCA", assays, dry.run = FALSE) brca ``` - Each assay is a 'familiar' object, e.g., a SummarizedExperiment ```{r} brca[["BRCA_RNASeqGene-20160128"]] ``` - Lots of annotations on each sample ```{r} length(colnames(colData(brca))) head(colnames(colData(brca))) ``` - Commands to aid management of overlapping samples across assays ## Computational clouds - E.g., the [AnVIL][] NIH / NHGRI cloud resource - Access to large scale data resources (1000 genomes, GTEx, CCDG, eMERGE, ...) - Hosted computation via notebooks, including R / Bioconductor - Coming soon: RStudio! [AnVIL]: https://anvil.terra.bio/ ## Learning more - https://bioconductor.org - https://bioconductor.org/help/course-materials - Bioconductor [Asia][] Conference, Sydney, December 5 - 6 - Bioconductor [Europe][] Conference, Brussels, December 9 - 11. - Bioconductor [North America][] Conference, Boston, July 29 - 31. [Asia]: https://bioconductor.github.io/BiocAsia/ [Europe]: https://eurobioc2019.bioconductor.org/ [North America]: https://bioc2020.bioconductor.org # Acknowledgements Research reported in this presentation was supported by the NCIA and NHGRI of the National Institutes of Health under award numbers U24CA232979, U41HG004059, and U24CA180996. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. ```{r sessionInfo} sessionInfo() ```