--- title: "MicrobiomeWorkshopII.Rmd" abstract: > The second part of the workshop demonstrates how to use `dada2` on raw reads, and analysis of these data using the `phyloseq`, `treeDA`, `adaptiveGPCA` packages for denoising, estimating differential abundance, ordinations. The `treelapse` and `metavizr` packages allow browsing and interactive visualization of microbiome profiles. Together, these packages provide easily linked components for data acquisition and flexible analysis of 16S rRNA and whole metagenome shotgun microbiome profiles. At the end of this workshop, users will be able to access publicly available metagenomic data and to perform common statistical analyses of these and other data in Bioconductor. author: "Susan Holmes and Joey McMurdie" date: "July 24, 2017" bibliography: F1000Work.bib output: BiocStyle::html_document: number_sections: no toc: yes toc_depth: 4 vignette: > %\VignetteIndexEntry{MicrobiomeWorkshopII} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} library(knitr) knitr::opts_chunk$set(eval = TRUE) knitr::opts_chunk$set(cache = TRUE) options(width = 98) opts_chunk$set(message = FALSE, error = FALSE, warning = FALSE, cache = TRUE, fig.width = 8, fig.height = 7) ``` ```{r NeededPackages, message=FALSE} cranpkgs=c("ggplot2","devtools","adaptiveGPCA","treelapse", "ade4","structSSI", "PMA","vegan", "ggrepel","gridExtra") BioCpkgs=c("dada2", "curatedMetagenomicData", "phyloseq", "DECIPHER","genefilter") ghpkgs= c("treeDA", "treelapse","metavizr") instp <- cranpkgs %in% installed.packages() if(any(!instp)) { install.packages(cranpkgs[!instp],repos="https://cloud.r-project.org") } instp <- BioCpkgs %in% installed.packages() if(any(!instp)) { source("http://bioconductor.org/biocLite.R") biocLite(BioCpkgs [!instp], ask = F) } if(!("treeDA" %in% installed.packages())) BiocInstaller::biocLite("jfukuyama/treeDA", dependencies = TRUE) if(!("treelapse" %in% installed.packages())) BiocInstaller::biocLite("krisrs1128/treelapse", dependencies = TRUE) suppressPackageStartupMessages(sapply(c(cranpkgs, BioCpkgs, ghpkgs), require, character.only = TRUE)) ``` # dada2 theory The **microbiome** is formed of the ecological communities of microorganisms that dominate the living world. Bacteria can now be identified through the use of next generation sequencing applied at several levels. Shotgun sequencing of all bacteria in a sample delivers knowledge of all the genes present. Here we will only be interested in the identification and quantification of individual taxa (or species) through a ‘fingerprint gene’ called 16s rRNA which is present in all bacteria. This gene presents several variable regions which can be used to identify the different taxa. Previous standard workflows depended on clustering all 16s rRNA sequences (generated by next generation amplicon sequencing) that occur within a 97% radius of similarity and then assigning these to ‘OTUs’ from reference trees [@caporaso2010qiime; @mothur]. These approaches do not incorporate all the data, in particular sequence quality information and statistical information available on the reads were not incorporated into the assignments. In contrast, the **de novo** read counts used here will be constructed through the incorporation of both the quality scores and sequence frequencies in a probabilistic noise model for nucleotide transitions. For more details on the algorithmic implementation of this step see [@dada2]. After filtering the sequences and removing the chimeræ, the data are compared to a standard database of bacteria and labeled. In this workflow, we have used the labeled sequences to build a de novo phylogenetic with the . The key step in the sequence analysis is the manner in which reads are denoised and assembled into groups we have chosen to call **ASV**s (Amplicon Sequence Variants)[@callahan2017exact] instead of the traditional OTUs(Operational Taxonomic Units). A published (but essentially similar) version of this workflow, including reviewer reports and comments is available [@callahan2016F1000], see [F1000Research From Raw reads](https://f1000research.com/articles/5-1492/v2). This is a workflow for denoising, filtering, performing data transformations, visualization, supervised learning analyses, community network tests, hierarchical testing and linear models. We provide all the code and give several examples of different types of analyses and use-cases. There are often many different objectives in experiments involving microbiome data and we will only give a flavor for what could be possible once the data has been imported into R. In addition, the code can be easily adapted to accommodate batch effects, covariates and multiple experimental factors. This workflow is based on software packages from the open-source Bioconductor project [@Huber:2015]. We provide all steps necessary from the denoising and identification of the reads input as raw sequences as [fastq](https://en.wikipedia.org/wiki/FASTQ_format) files to the comparative testing and multivariate analyses of the samples and analyses of the abundances according to multiple available covariates. # dada2 tutorial [See tutorial here](https://benjjneb.github.io/dada2/tutorial.html) We will select portions of [Complete Bioconductor Worflow output](http://web.stanford.edu/class/bios221/MicrobiomeWorkflowII.html) to cover in this tutorial # Using phyloseq {.unnumbered} `phyloseq`[@phyloseqplosone] is an R package to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs) or more appropriately denoised, and it is most useful when there is also associated sample data, phylogeny, and/or taxonomic assignment of each taxa. leverages and builds upon many of the tools available in R for ecology and phylogenetic analysis (ape, vegan, ade4), while also using advanced/flexible graphic systems (ggplot2) to easily produce publication-quality graphics of complex phylogenetic data. The `r Biocpkg("phyloseq")` package uses a specialized system of S4 data classes to store all related phylogenetic sequencing data as a single, self-consistent, self-describing experiment-level object, making it easier to share data and reproduce analyses. In general, phyloseq seeks to facilitate the use of R for efficient interactive and reproducible analysis of amplicon count data jointly with important sample covariates. This tutorial shows a useful example workflow, but many more analyses are available to you in phyloseq, and R in general, than can fit in a single workflow. The [phyloseq home page](http://joey711.github.io/phyloseq/) is a good place to begin browsing additional phyloseq documentation, as are the three vignettes included within the package, and linked directly at [the phyloseq release page on Bioconductor](http://bioconductor.org/packages/release/bioc/html/phyloseq.html). ## Loading the data {.unnumbered} Many use cases result in the need to import and combine different data into a phyloseq class object, this can be done using th `import_biom` function to read recent QIIME format files, older files can still be imported with `import_qiime`. More complete details can be found on the [phyloseq FAQ page](https://www.bioconductor.org/packages/release/bioc/vignettes/phyloseq/inst/doc/phyloseq-FAQ.html). In the previous section the results of `r Biocpkg("dada2")` sequence processing were organized into a phyloseq object. We have actually run dada2 on a larger set of samples from the same data source. This object was also saved in R-native serialized RDS format. We will re-load this here for completeness as the initial object `ps`. If you have not downloaded the whole repository you can access the `ps` file though github: ```{r non-local} ps_connect <-url("https://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/ps.rds") ps = readRDS(ps_connect) ps ``` ## Shiny-phyloseq {.unnumbered} It can be beneficial to start the data exploration process interactively, this often saves time in detecting outliers and specific features of the data. [Shiny-phyloseq](http://joey711.github.io/shiny-phyloseq/) [@mcmurdie2015] is an interactive web application that provides a graphical user interface to the `r Biocpkg("phyloseq")` package. The object just loaded into the R session in this workflow is suitable for graphical exploration with Shiny-phyloseq. Filtering {#filtering .unnumbered} --------- `r Biocpkg("phyloseq")` provides useful tools for filtering, subsetting, and agglomerating taxa – a task that is often appropriate or even necessary for effective analysis of microbiome count data. In this subsection, we graphically explore the prevalence of taxa in the example dataset, and demonstrate how this can be used as a filtering criteria. One of the reasons to filter in this way is to avoid spending much time analyzing taxa that were seen only rarely among samples. This also turns out to be a useful filter of noise (taxa that are actually just artifacts of the data collection process), a step that should probably be considered essential for datasets constructed via heuristic OTU-clustering methods, which are notoriously prone to generating spurious taxa. ### Taxonomic Filtering {#taxonomic-filtering .unnumbered} In many biological settings, the set of all organisms from all samples are well-represented in the available taxonomic reference database. When (and only when) this is the case, it is reasonable or even advisable to filter taxonomic features for which a high-rank taxonomy could not be assigned. Such ambiguous features in this setting are almost always sequence artifacts that don’t exist in nature. It should be obvious that such a filter is not appropriate for samples from poorly characterized or novel specimens, at least until the possibility of taxonomic novelty can be satisfactorily rejected. Phylum is a useful taxonomic rank to consider using for this purpose, but others may work effectively for your data. To begin, create a table of read counts for each Phylum present in the dataset. ```{r taxfilter0} # Show available ranks in the dataset rank_names(ps) # Create table, number of features for each phyla table(tax_table(ps)[, "Phylum"], exclude = NULL) ``` This shows a few phyla for which only one feature was observed. Those may be worth filtering, and we’ll check that next. First, notice that in this case, six features were annotated with a Phylum of NA. These features are probably artifacts in a dataset like this, and should be removed. The following ensures that features with ambiguous phylum annotation are also removed. Note the flexibility in defining strings that should be considered ambiguous annotation. ```{r removeNAphyla} ps <- subset_taxa(ps, !is.na(Phylum) & !Phylum %in% c("", "uncharacterized")) ``` A useful next step is to explore feature *prevalence* in the dataset, which we will define here as the number of samples in which a taxon appears at least once. ```{r prevfilter0} # Compute prevalence of each feature, store as data.frame prevdf = apply(X = otu_table(ps), MARGIN = ifelse(taxa_are_rows(ps), yes = 1, no = 2), FUN = function(x){sum(x > 0)}) # Add taxonomy and total read counts to this data.frame prevdf = data.frame(Prevalence = prevdf, TotalAbundance = taxa_sums(ps), tax_table(ps)) ``` Are there phyla that are comprised of mostly low-prevalence features? Compute the total and average prevalences of the features in each phylum. ```{r lowprev} plyr::ddply(prevdf, "Phylum", function(df1){cbind(mean(df1$Prevalence),sum(df1$Prevalence))}) ``` _Deinococcus-Thermus_ appeared in just over one percent of samples, and Fusobacteria appeared in just 2 samples total. In some cases it might be worthwhile to explore these two phyla in more detail despite this (though probably not Fusobacteria’s two samples). For the purposes of this example, though, they will be filtered from the dataset. ```{r taxfilter} # Define phyla to filter filterPhyla = c("Fusobacteria", "Deinococcus-Thermus") # Filter entries with unidentified Phylum. ps1 = subset_taxa(ps, !Phylum %in% filterPhyla) ps1 ``` ### Prevalence Filtering {#prevalence-filtering .unnumbered} The previous filtering steps are considered *supervised*, because they relied on prior information that is external to this experiment (a taxonomic reference database). This next filtering step is completely *unsupervised*, relying only on the data in this experiment, and a parameter that we will choose after exploring the data. Thus, this filtering step can be applied even in settings where taxonomic annotation is unavailable or unreliable. First, explore the relationship of prevalence and total read count for each feature. Sometimes this reveals outliers that should probably be removed, and also provides insight into the ranges of either feature that might be useful. This aspect depends quite a lot on the experimental design and goals of the downstream inference, so keep these in mind. It may even be the case that different types of downstream inference require different choices here. There is no reason to expect ahead of time that a single filtering workflow is appropriate for all analysis. ```{r plotprevalence, fig.width=9, fig.height=5, fig.cap="Taxa prevalence versus total counts."} # Subset to the remaining phyla prevdf1 = subset(prevdf, Phylum %in% get_taxa_unique(ps1, "Phylum")) ggplot(prevdf1, aes(TotalAbundance, Prevalence / nsamples(ps),color=Phylum)) + # Include a guess for parameter geom_hline(yintercept = 0.05, alpha = 0.5, linetype = 2) + geom_point(size = 2, alpha = 0.7) + scale_x_log10() + xlab("Total Abundance") + ylab("Prevalence [Frac. Samples]") + facet_wrap(~Phylum) + theme(legend.position="none") ``` Each point in Figure \@ref(fig:plotprevalence) is a different taxa. Exploration of the data in this way is often useful for selecting filtering parameters, like the minimum prevalence criteria we will used to filter the data above. Sometimes a natural separation in the dataset reveals itself, or at least, a conservative choice that is in a stable region for which small changes to the choice would have minor or no effect on the biological interpreation (stability). Here no natural separation is immediately evident, but it looks like we might reasonably define a prevalence threshold in a range of zero to ten percent or so. Take care that this choice does not introduce bias into a downstream analysis of association of differential abundance. The following uses five percent of all samples as the prevalence threshold. ```{r prevalencefilter} # Define prevalence threshold as 5% of total samples prevalenceThreshold = 0.05 * nsamples(ps) prevalenceThreshold # Execute prevalence filter, using `prune_taxa()` function keepTaxa = rownames(prevdf1)[(prevdf1$Prevalence >= prevalenceThreshold)] ps2 = prune_taxa(keepTaxa, ps) ``` Agglomerate taxa {#agglomerate-taxa .unnumbered} ---------------- When there is known to be a lot of species or sub-species functional redundancy in a microbial community, it might be useful to agglomerate the data features corresponding to closely related taxa. Ideally we would know the functional redundancies perfectly ahead of time, in which case we would agglomerate taxa using those defined relationships and the function in phyloseq. That kind of exquisite functional data is usually not available, and different pairs of microbes will have different sets of overlapping functions, complicating the matter of defining appropriate grouping criteria. While not necessarily the most useful or functionally-accurate criteria for grouping microbial features (sometimes far from accurate), taxonomic agglomeration has the advantage of being much easier to define ahead of time. This is because taxonomies are usually defined with a comparatively simple tree-like graph structure that has a fixed number of internal nodes, called “ranks”. This structure is simple enough for the phyloseq package to represent taxonomies as table of taxonomy labels. Taxonomic agglomeration groups all the “leaves” in the hierarchy that descend from the user-prescribed agglomerating rank, this is sometimes called ‘glomming’. The following example code shows how one would combine all features that descend from the same genus. ```{r taxglom} # How many genera would be present after filtering? length(get_taxa_unique(ps2, taxonomic.rank = "Genus")) ps3 = tax_glom(ps2, "Genus", NArm = TRUE) ``` If taxonomy is not available or not reliable, tree-based agglomeration is a “taxonomy-free” alternative to combine data features corresponding to closely-related taxa. In this case, rather than taxonomic rank, the user specifies a tree height corresponding to the phylogenetic distance between features that should define their grouping. This is very similar to “OTU Clustering”, except that in many OTU Clustering algorithms the sequence distance being used does not have the same (or any) evolutionary definition. ```{r tipglom} h1 = 0.4 ps4 = tip_glom(ps2, h = h1) ``` Here phyloseq's `plot_tree()` function compare the original unfiltered data, the tree after taxonoic agglomeration, and the tree after phylogenetic agglomeration. These are stored as separate plot objects, then rendered together in one combined graphic using `gridExtra::grid.arrange`. ```{r plotglomprep} multiPlotTitleTextSize = 15 p2tree = plot_tree(ps2, method = "treeonly", ladderize = "left", title = "Before Agglomeration") + theme(plot.title = element_text(size = multiPlotTitleTextSize)) p3tree = plot_tree(ps3, method = "treeonly", ladderize = "left", title = "By Genus") + theme(plot.title = element_text(size = multiPlotTitleTextSize)) p4tree = plot_tree(ps4, method = "treeonly", ladderize = "left", title = "By Height") + theme(plot.title = element_text(size = multiPlotTitleTextSize)) ``` ```{r plotglomtree, fig.width=14, fig.cap="Different types of agglomeration"} # group plots together grid.arrange(nrow = 1, p2tree, p3tree, p4tree) ``` Figure \@ref(fig:plotglomtree) shows the original tree on the left, taxonomic agglomeration at Genus rank in the middle and phylogenetic agglomeration at a fixed distance of 0.4 on the right. Abundance value transformation {#abundance-value-transformation .unnumbered} ------------------------------ It is usually necessary to transform microbiome count data to account for differences in library size, variance, scale, etc. The `phyloseq` package provides a flexible interface for defining new functions to accomplish these transformations of the abundance values via the function `transform_sample_counts()`. The first argument to this function is the phyloseq object you want to transform, and the second argument is an R function that defines the transformation. The R function is applied sample-wise, expecting that the first unnamed argument is a vector of taxa counts in the same order as the phyloseq object. Additional arguments are passed on to the function specified in the second argument, providing an explicit means to include pre-computed values, previously defined parameters/thresholds, or any other object that might be appropriate for computing the transformed values of interest. This example begins by defining a custom plot function, `plot_abundance()`, that uses phyloseq’s function to define a relative abundance graphic. We will use this to compare more easily differences in scale and distribution of the abundance values in our phyloseq object before and after transformation. ```{r abundancetransformation} plot_abundance = function(physeq,title = "", Facet = "Order", Color = "Phylum"){ # Arbitrary subset, based on Phylum, for plotting p1f = subset_taxa(physeq, Phylum %in% c("Firmicutes")) mphyseq = psmelt(p1f) mphyseq <- subset(mphyseq, Abundance > 0) ggplot(data = mphyseq, mapping = aes_string(x = "sex",y = "Abundance", color = Color, fill = Color)) + geom_violin(fill = NA) + geom_point(size = 1, alpha = 0.3, position = position_jitter(width = 0.3)) + facet_wrap(facets = Facet) + scale_y_log10()+ theme(legend.position="none") } ``` The transformation in this case converts the counts from each sample into their frequencies, often referred to as *proportions* or *relative abundances*. This function is so simple that it is easiest to define it within the function call to `transform_sample_counts()`. ```{r abundancetransformation2} # Transform to relative abundance. Save as new object. ps3ra = transform_sample_counts(ps3, function(x){x / sum(x)}) ``` Now we plot the abundance values before and after transformation. ```{r abundancetransformation3, fig.height=12, fig.width=10.5,fig.cap="Comparison of original abundances with transformed data"} plotBefore = plot_abundance(ps3,"") plotAfter = plot_abundance(ps3ra,"") # Combine each plot into one graphic. grid.arrange(nrow = 2, plotBefore, plotAfter) ``` Figure \@ref(fig:abundancetransformation3) shows the comparison of original abundances (top panel) and relative abundances (lower). Subset by taxonomy {#subset-by-taxonomy .unnumbered} ------------------ Notice on the previous plot that *Lactobacillales* appears to be a taxonomic Order with bimodal abundance profile in the data. We can check for a taxonomic explanation of this pattern by plotting just that taxonomic subset of the data. For this, we subset with the function, and then specify a more precise taxonomic rank to the argument of the function that we defined above. ```{r subsettaxa, fig.cap= "Violin plot of relative abundances of Lactobacillales"} psOrd = subset_taxa(ps3ra, Order == "Lactobacillales") plot_abundance(psOrd, Facet = "Genus", Color = NULL) ``` Figure \@ref(fig:subsettaxa) shows the relative abundances of Lactobacillales taxonomic Order, grouped by host sex and genera. Here it is clear that the apparent biomodal distribution of Lactobacillales on the previous plot was the result of a mixture of two different genera, with the typical *Lactobacillus* relative abundance much larger than *Streptococcus*. At this stage in the workflow, after converting raw reads to interpretable species abundances, and after filtering and transforming these abundances to focus attention on scientifically meaningful quantities, we are in a position to consider more careful statistical analysis. R is an ideal environment for performing these analyses, as it has an active community of package developers building simple interfaces to sophisticated techniques. As a variety of methods are available, there is no need to commit to any rigid analysis strategy a priori. Further, the ability to easily call packages without reimplementing methods frees researchers to iterate rapidly through alternative analysis ideas. The advantage of performing this full workflow in R is that this transition from bioinformatics to statistics is effortless. Let's start by installing a few packages that are available for these complementary analyses: ```{r init-analysis} .cran_packages <- c( "shiny","miniUI", "caret", "pls", "e1071", "ggplot2", "randomForest", "dplyr", "ggrepel", "nlme", "devtools", "reshape2", "PMA", "structSSI", "ade4", "ggnetwork", "intergraph", "scales") .github_packages <- c("jfukuyama/phyloseqGraphTest") .bioc_packages <- c("genefilter", "impute") # Install CRAN packages (if not already installed) .inst <- .cran_packages %in% installed.packages() if (any(!.inst)){ install.packages(.cran_packages[!.inst],repos = "http://cran.rstudio.com/") } .inst <- .github_packages %in% installed.packages() if (any(!.inst)){ devtools::install_github(.github_packages[!.inst]) } .inst <- .bioc_packages %in% installed.packages() if(any(!.inst)){ source("http://bioconductor.org/biocLite.R") biocLite(.bioc_packages[!.inst]) } ``` We back these claims by illustrating several analyses on the mouse data prepared above. We experiment with several flavors of exploratory ordination before shifting to more formal testing and modeling, explaining the settings in which the different points of view are most appropriate. Finally, we provide example analyses of multitable data, using a study in which both metabolomic and microbial abundance measurements were collected on the same samples, to demonstrate that the general workflow presented here can be adapted to the multitable setting. ### Preprocessing {#preprocessing .unnumbered} Before doing the multivariate projections, we will add a few columns to our sample data, which can then be used to annotate plots. From Figure \@ref(fig:preprocessing-setup), we see that the ages of the mice come in a couple of groups, and so we make a categorical variable corresponding to young, middle-aged, and old mice. We also record the total number of counts seen in each sample and log-transform the data as an approximate variance stabilizing transformation. ```{r preprocessing-setup, fig.cap="Histogram of age groupings",fig.show="hold"} qplot(sample_data(ps)$age, geom = "histogram",binwidth=20) + xlab("age") ``` Figure \@ref(fig:preprocessing-setup) shows that the age covariate belongs to three separate clusters. ```{r preprocessing2, fig.cap="Histograms comparing raw and log transformed read depths",fig.show="hold"} qplot(log10(rowSums(otu_table(ps))),binwidth=0.2) + xlab("Logged counts-per-sample") ``` These preliminary plots suggest certain preprocessing steps. The histogram in Figure \@ref(fig:preprocessing-setup) motivates the creation of a new categorical variable, binning age into one of the three peaks. The histogram in Figure \@ref(fig:preprocessing2) suggests that a $\log\left(1 + x\right)$ transformation might be sufficient for `normalizing` the abundance data for the exploratory analyses. In fact this transformation is not sufficient for testing purposes and when performing differential abundances we recommend the variance stabilizing transformations available in DESeq2 through the `phyloseq_to_deseq2` function, see the [phyloseq_to_deseq2 tutorial here](https://bioconductor.org/packages/devel/bioc/vignettes/phyloseq/inst/doc/phyloseq-mixture-models.html). As our first step, we look at principal coordinates analysis (PCoA) with either the Bray-Curtis dissimilarity on the weighted Unifrac distance. ```{r outlier-detect, fig.cap="Exploratory ordination analysis with log abundances.",fig.wide= TRUE} sample_data(ps)$age_binned <- cut(sample_data(ps)$age, breaks = c(0, 100, 200, 400)) levels(sample_data(ps)$age_binned) <- list(Young100="(0,100]", Mid100to200="(100,200]", Old200="(200,400]") sample_data(ps)$family_relationship=gsub(" ","",sample_data(ps)$family_relationship) pslog <- transform_sample_counts(ps, function(x) log(1 + x)) out.wuf.log <- ordinate(pslog, method = "MDS", distance = "wunifrac") evals <- out.wuf.log$values$Eigenvalues plot_ordination(pslog, out.wuf.log, color = "age_binned") + labs(col = "Binned Age") + coord_fixed(sqrt(evals[2] / evals[1])) ``` Figure \@ref(fig:outlier-detect) showing the ordination on the logged abundance data reveals a few outliers. These turn out to be the samples from females 5 and 6 on day 165 and the samples from males 3, 4, 5, and 6 on day 175. We will take them out, since we are mainly interested in the relationships between the non-outlier points. Before we continue, we should check the two female outliers -- they have been taken over by the same OTU/ASV, which has a relative abundance of over 90\% in each of them. This is the only time in the entire data set that this ASV has such a high relative abundance -- the rest of the time it is below 20\%. In particular, its diversity is by far the lowest of all the samples. ```{r outlier-analyze, fig.width=9, fig.height=5, fig.cap="The outlier samples are dominated by a single ASV."} rel_abund <- t(apply(otu_table(ps), 1, function(x) x / sum(x))) qplot(rel_abund[, 12], geom = "histogram",binwidth=0.05) + xlab("Relative abundance") ``` # Different Ordination Projections {#different-ordination-projections .unnumbered} As we have seen, an important first step in analyzing microbiome data is to do unsupervised, exploratory analysis. This is simple to do in `phyloseq`, which provides many distances and ordination methods. After documenting the outliers, we are going to compute ordinations with these outliers removed and more carefully study the output. ```{r remove-outliers} outliers <- c("F5D165", "F6D165", "M3D175", "M4D175", "M5D175", "M6D175") ps <- prune_samples(!(sample_names(ps) %in% outliers), ps) ``` We are also going to remove samples with fewer than 1000 reads: ```{r removelowreads} which(!rowSums(otu_table(ps)) > 1000) ps <- prune_samples(rowSums(otu_table(ps)) > 1000, ps) pslog <- transform_sample_counts(ps, function(x) log(1 + x)) ``` We’ll first perform a PCoA using Bray-Curtis dissimilarity. ```{r ordinations-bray,fig.cap="A PCoA plot using Bray-Curtis between samples."} out.pcoa.log <- ordinate(pslog, method = "MDS", distance = "bray") evals <- out.pcoa.log$values[,1] plot_ordination(pslog, out.pcoa.log, color = "age_binned", shape = "family_relationship") + labs(col = "Binned Age", shape = "Litter")+ coord_fixed(sqrt(evals[2] / evals[1])) ``` We see that there is a fairly substantial age effect that is consistent between all the mice, male and female, and from different litters. Next we look at double principal coordinates analysis (DPCoA) [@Pavoine:2004; @Purdom2010; @Fukuyama:2012], which is a phylogenetic ordination method and that provides a biplot representation of both samples and taxonomic categories. We see again that the second axis corresponds to young vs. old mice, and the biplot suggests an interpretation of the second axis: samples that have larger scores on the second axis have more taxa from Bacteroidetes and one subset of Firmicutes. ```{r ordinations-dpcoa, fig.wide=TRUE,fig.cap="A DPCoA plot incorporates phylogenetic information, but is dominated by the first axis."} out.dpcoa.log <- ordinate(pslog, method = "DPCoA") evals <- out.dpcoa.log$eig plot_ordination(pslog, out.dpcoa.log, color = "age_binned", label= "SampleID", shape = "family_relationship") + labs(col = "Binned Age", shape = "Litter")+ coord_fixed(sqrt(evals[2] / evals[1])) ``` In Figure \@ref(fig:ordinations-dpcoa) we have the first axis explains `r round(evals[1]/sum(evals)*100)` \% of the variability, about `r round(evals[1]/evals[2])` times that of the second axis; this translates into the elongated form of the ordination plot. ```{r dpcoabiplot,fig.wide=TRUE,fig.cap="Taxa responsible for Axis 1 and 2"} plot_ordination(pslog, out.dpcoa.log, type = "species", color = "Phylum") + coord_fixed(sqrt(evals[2] / evals[1])) ``` Finally, we can look at the results of PCoA with weighted Unifrac. As before, we find that the second axis is associated with an age effect, which is fairly similar to DPCoA. This is not surprising, because both are phylogenetic ordination methods taking abundance into account. However, when we compare biplots, we see that the DPCoA gave a much cleaner interpretation of the second axis, compared to weighted Unifrac. ```{r ordinations-wuf,fig.wide =TRUE, fig.cap="The sample positions produced by a PCoA using weighted Unifrac."} out.wuf.log <- ordinate(pslog, method = "PCoA", distance ="wunifrac") evals <- out.wuf.log$values$Eigenvalues plot_ordination(pslog, out.wuf.log, color = "age_binned", shape = "family_relationship") + coord_fixed(sqrt(evals[2] / evals[1])) + labs(col = "Binned Age", shape = "Litter") ``` ## Why are the ordination plots so far from square? {.unnumbered} ###Aspect ratio of ordination plots {.unnumbered} In the ordination plots in Figure 8–Figure 14, you may have noticed as did the reviewers of the first version of the paper, that the maps are not presented as square representations as is often the case in standard PCoA and PCA plots in the literature. The reason for this is that as we are trying to represent the distances between samples as faithfully as possible; we have to take into account that the second eigenvalue is always smaller than the first, sometimes considerably so, thus we normalize the axis norm ratios to the relevant eigenvalue ratios. This ensures that the variability represented in the plots is done so faithfully. ### PCA on ranks {#pca-on-ranks .unnumbered} Microbial abundance data is often heavy-tailed, and sometimes it can be hard to identify a transformation that brings the data to normality. In these cases, it can be safer to ignore the raw abundances altogether, and work instead with ranks. We demonstrate this idea using a rank-transformed version of the data to perform PCA. First, we create a new matrix, representing the abundances by their ranks, where the microbe with the smallest in a sample gets mapped to rank 1, second smallest rank 2, etc. ```{r rankab} abund <- otu_table(pslog) abund_ranks <- t(apply(abund, 1, rank)) ``` Naively using these ranks could make differences between pairs of low and high abundance microbes comparable. In the case where many bacteria are absent or present at trace amounts, an artificially large difference in rank could occur[@holmes2011] for minimally abundant taxa. To avoid this, all those microbes with rank below some threshold are set to be tied at 1. The ranks for the other microbes are shifted down, so there is no large gap between ranks. ```{r rankthreshold} abund_ranks <- abund_ranks - 329 abund_ranks[abund_ranks < 1] <- 1 ``` ```{r pca-rank-visualize-procedure, fig.cap="Rank threshold transformation"} library(dplyr) library(reshape2) abund_df <- melt(abund, value.name = "abund") %>% left_join(melt(abund_ranks, value.name = "rank")) colnames(abund_df) <- c("sample", "seq", "abund", "rank") abund_df <- melt(abund, value.name = "abund") %>% left_join(melt(abund_ranks, value.name = "rank")) colnames(abund_df) <- c("sample", "seq", "abund", "rank") sample_ix <- sample(1:nrow(abund_df), 8) ggplot(abund_df %>% filter(sample %in% abund_df$sample[sample_ix])) + geom_point(aes(x = abund, y = rank, col = sample), position = position_jitter(width = 0.2), size = 1.5) + labs(x = "Abundance", y = "Thresholded rank") + scale_color_brewer(palette = "Set2") ``` This transformation is illustrated in Figure \@ref(fig:pca-rank-visualize-procedure). The association between abundance and rank, for a few randomly selected samples. The numbers of the $y$-axis are those supplied to PCA. We can now perform PCA and study the resulting biplot, given in the Figure below. To produce annotation for this figure, we used the following block. ```{r pca-rank-pca-setup} library(ade4) ranks_pca <- dudi.pca(abund_ranks, scannf = F, nf = 3) row_scores <- data.frame(li = ranks_pca$li, SampleID = rownames(abund_ranks)) col_scores <- data.frame(co = ranks_pca$co, seq = colnames(abund_ranks)) tax <- tax_table(ps) %>% data.frame(stringsAsFactors = FALSE) tax$seq <- rownames(tax) main_orders <- c("Clostridiales", "Bacteroidales", "Lactobacillales", "Coriobacteriales") tax$Order[!(tax$Order %in% main_orders)] <- "Other" tax$Order <- factor(tax$Order, levels = c(main_orders, "Other")) tax$otu_id <- seq_len(ncol(otu_table(ps))) row_scores <- row_scores %>% left_join(sample_data(pslog)) col_scores <- col_scores %>% left_join(tax) ``` ```{r pca-rank-pca-plot, fig.wide=TRUE,fig.height=8,fig.cap="The biplot resulting from the PCA after the truncated-ranking transformation."} evals_prop <- 100 * (ranks_pca$eig / sum(ranks_pca$eig)) ggplot() + geom_point(data = row_scores, aes(x = li.Axis1, y = li.Axis2), shape = 2) + geom_point(data = col_scores, aes(x = 25 * co.Comp1, y = 25 * co.Comp2, col = Order), size = .3, alpha = 0.6) + scale_color_brewer(palette = "Set2") + facet_grid(~ age_binned) + guides(col = guide_legend(override.aes = list(size = 3))) + labs(x = sprintf("Axis1 [%s%% variance]", round(evals_prop[1], 2)), y = sprintf("Axis2 [%s%% variance]", round(evals_prop[2], 2))) + coord_fixed(sqrt(ranks_pca$eig[2] / ranks_pca$eig[1])) + theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0))) ``` The results are similar to the PCoA analyses computed without applying a truncated-ranking transformation, reinforcing our confidence in the analysis on the original data. ### Canonical correspondence {#canonical-correspondence .unnumbered} Canonical Correspondence Analysis (CCpnA) is an approach to ordination of a species by sample table that incorporates supplemental information about the samples. As before, the purpose of creating biplots is to determine which types of bacterial communities are most prominent in different mouse sample types. It can be easier to interpret these biplots when the ordering between samples reflects sample characteristics – variations in age or litter status in the mouse data, for example – and this central to the design of CCpnA. The function allows us to create biplots where the positions of samples are determined by similarity in both species signatures and environmental characteristics; in contrast, principal components analysis or correspondence analysis only look at species signatures. More formally, it ensures that the resulting CCpnA directions lie in the span of the environmental variables; thorough treatments are available in [@terBraak:1985; @greenacre2007correspondence]. Like PCoA and DPCoA, this method can be run using `ordinate` from the `phyloseq` package . In order to use supplemental sample data, it is necessary to provide an extra argument, specifying which of the features to consider – otherwise, defaults to using all measurements when producing the ordination. ```{r ccpna-correspondence-analysis} ps_ccpna <- ordinate(pslog, "CCA", formula = pslog ~ age_binned + family_relationship) ``` To access the positions for the biplot, we can use the function `ordinate` in `phyloseq`. Further, to facilitate figure annotation, we also join the site scores with the environmental data in the slot. Of the 23 total taxonomic orders, we only explicitly annotate the four most abundant – this makes the biplot easier to read. ```{r ccpna-join-data, fig.cap="The mouse and bacteria scores generated by CCpnA.", fig.wide=TRUE, fig.height=10} library(ggrepel) ps_scores <- vegan::scores(ps_ccpna) sites <- data.frame(ps_scores$sites) sites$SampleID <- rownames(sites) sites <- sites %>% left_join(sample_data(ps)) species <- data.frame(ps_scores$species) species$otu_id <- seq_along(colnames(otu_table(ps))) species <- species %>% left_join(tax) evals_prop <- 100 * ps_ccpna$CCA$eig[1:2] / sum(ps_ccpna$CA$eig) ggplot() + geom_point(data = sites, aes(x = CCA1, y = CCA2), shape = 2, alpha = 0.5) + geom_point(data = species, aes(x = CCA1, y = CCA2, col = Order), size = 0.5) + geom_text_repel(data = species %>% filter(CCA2 < -2), aes(x = CCA1, y = CCA2, label = otu_id), size = 1.5, segment.size = 0.1) + facet_grid(. ~ family_relationship) + guides(col = guide_legend(override.aes = list(size = 3))) + labs(x = sprintf("Axis1 [%s%% variance]", round(evals_prop[1], 2)), y = sprintf("Axis2 [%s%% variance]", round(evals_prop[2], 2))) + scale_color_brewer(palette = "Set2") + coord_fixed(sqrt(ps_ccpna$CCA$eig[2] / ps_ccpna$CCA$eig[1])*0.45 ) + theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0))) ``` # Supervised learning {#sec:supervised-learning .unnumbered} Here we illustrate some supervised learning methods that can be easily run in R. The package wraps many prediction algorithms available in R and performs parameter tuning automatically. Since we saw that microbiome signatures change with age, we’ll apply supervised techniques to try to predict age from microbiome composition. We’ll first look at Partial Least Squares (PLS)[@PLSwold]. The first step is to divide the data into training and test sets, with assignments done by mouse, rather than by sample, to ensure that the test set realistically simulates the collection of new data. Once we split the data, we can use the function `train` to fit the PLS model. ```{r caret-pls} library(caret) sample_data(pslog)$age2 <- cut(sample_data(pslog)$age, c(0, 100, 400)) dataMatrix <- data.frame(age = sample_data(pslog)$age2, otu_table(pslog)) # take 8 mice at random to be the training set, and the remaining 4 the test set trainingMice <- sample(unique(sample_data(pslog)$host_subject_id), size = 8) inTrain <- which(sample_data(pslog)$host_subject_id %in% trainingMice) training <- dataMatrix[inTrain,] testing <- dataMatrix[-inTrain,] plsFit <- train(age ~ ., data = training, method = "pls", preProc = "center") ``` Next we can predict class labels on the test set using the function `predict` and compare to the truth. We see that the method does an excellent job of predicting age. ```{r caret-pls-confusion} plsClasses <- predict(plsFit, newdata = testing) table(plsClasses, testing$age) ``` As another example, we can try out random forests. This is run in exactly the same way as PLS, by switching the argument from to . Random forests also perform well at the prediction task on this test set, though there are more old mice misclassified as young. ```{r caret-rf, eval =T} library(randomForest) rfFit <- train(age ~ ., data = training, method = "rf", preProc = "center", proximity = TRUE) rfClasses <- predict(rfFit, newdata = testing) table(rfClasses, testing$age) ``` To interpret these PLS and random forest results, it is standard to produce biplots and proximity plots, respectively. The code below extracts coordinates and supplies annotation for points to include on the PLS biplot. ```{r caret-pls-scores-loadings} pls_biplot <- list("loadings" = loadings(plsFit$finalModel), "scores" = scores(plsFit$finalModel)) class(pls_biplot$scores) <- "matrix" pls_biplot$scores <- data.frame(sample_data(pslog)[inTrain, ], pls_biplot$scores) tax <- tax_table(ps)@.Data %>% data.frame(stringsAsFactors = FALSE) main_orders <- c("Clostridiales", "Bacteroidales", "Lactobacillales", "Coriobacteriales") tax$Order[!(tax$Order %in% main_orders)] <- "Other" tax$Order <- factor(tax$Order, levels = c(main_orders, "Other")) class(pls_biplot$loadings) <- "matrix" pls_biplot$loadings <- data.frame(tax, pls_biplot$loadings) ``` ```{r caret-pls-scores-plot, fig.height = 7, fig.width=8, fig.wide=TRUE, fig.cap="PLS produces a biplot representation designed to separate samples by a response variable."} ggplot() + geom_point(data = pls_biplot$scores, aes(x = Comp.1, y = Comp.2), shape = 2) + geom_point(data = pls_biplot$loadings, aes(x = 25 * Comp.1, y = 25 * Comp.2, col = Order), size = 0.3, alpha = 0.6) + scale_color_brewer(palette = "Set2") + labs(x = "Axis1", y = "Axis2", col = "Binned Age") + guides(col = guide_legend(override.aes = list(size = 3))) + facet_grid( ~ age2) + theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0))) ``` The resulting biplot is displayed in Figure \@ref(fig:caret-pls-scores-plot); it can be interpreted similarly to earlier ordination diagrams, with the exception that the projection is chosen with an explicit reference to the binned age variable. Specifically, PLS identifies a subspace to maximize discrimination between classes, and the biplot displays sample projections and ASV coefficients with respect to this subspace. ```{r caret-proximity, fig.wide=TRUE, fig.cap="The random forest model determines a distance between samples, which can be input into PCoA to produce a proximity plot."} rf_prox <- cmdscale(1 - rfFit$finalModel$proximity) %>% data.frame(sample_data(pslog)[inTrain, ]) ggplot(rf_prox) + geom_point(aes(x = X1, y = X2, col = age_binned), size = 1, alpha = 0.7) + scale_color_manual(values = c("#A66EB8", "#238DB5", "#748B4F")) + guides(col = guide_legend(override.aes = list(size = 4))) + labs(col = "Binned Age", x = "Axis1", y = "Axis2") ``` A random forest proximity plot is displayed in Figure \@ref(fig:caret-proximity). To generate this representation, a distance is calculated between samples based on how frequently sample occur in the same tree partition in the random forest's bootstrapping procedure. If a pair of samples frequently occur in the same partition, the pair is assigned a low distance. The resulting distances are then input to PCoA, giving a glimpse into the random forests' otherwise complex classification mechanism. The separation between classes is clear, and manually inspecting points would reveal what types of samples are easier or harder to classify. ```{r caret-rf-importance, fig.height=6, fig.width=9, fig.cap="Random forest aids to interpretation: importance scores."} as.vector(tax_table(ps)[which.max(importance(rfFit$finalModel)), c("Family", "Genus")]) impOtu <- as.vector(otu_table(pslog)[,which.max(importance(rfFit$finalModel))]) maxImpDF <- data.frame(sample_data(pslog), abund = impOtu) ggplot(maxImpDF) + geom_histogram(aes(x = abund)) + facet_grid(age2 ~ .) + labs(x = "Abundance of discriminative bacteria", y = "Number of samples") ``` To further understand the fitted random forest model, we identify the microbe with the most influence in the random forest prediction. This turns out to be a microbe in family _Lachnospiraceae_ and genus _Roseburia_. Figure \@ref(fig:caret-rf-importance) plots its abundance across samples; we see that it is uniformly very low from age 0 to 100 days and much higher from age 100 to 400 days. # Multitable techniques {#multitable-techniques .unnumbered} Many microbiome studies attempt to quantify variation in the microbial, genomic, and metabolic measurements across different experimental conditions. As a result, it is common to perform multiple assays on the same biological samples and ask what features – bacteria, genes, or metabolites, for example – are associated with different sample conditions. There are many ways to approach these questions, which to apply depends on the study’s focus. Here, we will focus on one specific workflow that uses sparse Canonical Correlation Analysis (sparse CCA), a method well-suited to both exploratory comparisons between samples and the identification of features with interesting variation. We will use an implementation from the `r CRANpkg("PMA")` [@witten2009pma]. Since the mouse data used above included only a single table, we use a new data set, collected by the study [@kashyap2013genetically]. There are two tables here, one for bacteria and another with metabolites. 12 samples were obtained, each with measurements at 637 m/z values and 20,609 OTUs; however, about 96% of the entries of the microbial abundance table are exactly zero. The code below retrieves and filters these data. ```{r multitable-setup} metab <- read.csv("https://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/metabolites.csv",row.names = 1) microbe_connect <-url("https://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/microbe.rda") load(microbe_connect) microbe ``` We see that `microbe` is a `phyloseq` object. Our preprocessing mirrors that done for the mouse data. We first filter down to microbes and metabolites of interest, removing those that are zero across many samples. Then, we transform them to weaken the heavy tails. We also take the log of the metabolites. ```{r multitable-filtering} library("genefilter") keep_ix <- rowSums(metab == 0) <= 3 metab <- metab[keep_ix, ] microbe <- prune_taxa(taxa_sums(microbe) > 4, microbe) microbe <- filter_taxa(microbe, filterfun(kOverA(3, 2)), TRUE) metab <- log(1 + metab, base = 10) X <- otu_table(microbe) X[X > 50] <- 50 dim(X) dim(metab) ``` We see that both X and metab have 12 columns, these are actually the samples and we will transpose them. We can now apply sparse CCA. This method compares sets of features across high-dimensional data tables, where there may be more measured features than samples. In the process, it chooses a subset of available features that capture the most covariance – these are the features that reflect signals present across multiple tables. We then apply PCA to this selected subset of features. In this sense, we use sparse CCA as a screening procedure, rather than as an ordination method. Our implementation is below. The parameters `penaltyx` and `penaltyz` are sparsity penalties. `penaltyx` will modulate selected microbes, similarly `penaltyz` modulates the number of selected metabolites. We tune them manually to facilitate subsequent interpretation – we gene rally prefer more sparsity than the default parameters would provide. ```{r multitable-sparse-cca} library(PMA) cca_res <- CCA(t(X), t(metab), penaltyx = .15, penaltyz = .15) cca_res ``` With these parameters, 5 microbes and 15 metabolites have been selected, based on their ability to explain covariation between tables. Further, these 20 features result in a correlation of 0.974 between the two tables. We interpret this to mean that the microbial and metabolomic data reflect similar underlying signals, and that these signals can be approximated well by the 20 selected features. Be wary of the correlation value, however, since the scores are far from the usual bivariate normal cloud. Further, note that it is possible that other subsets of features could explain the data just as well – sparse CCA has minimized redundancy across features, but makes no guarantee that these are the “true” features in any sense. Nonetheless, we can still use these 20 features to compress information from the two tables without much loss. To relate the recovered metabolites and OTUs to characteristics of the samples on which they were measured, we use them as input to an ordinary PCA. ```{r loadade4,echo=FALSE} library(ade4) ``` ```{r multitable-plug-in-pca } combined <- cbind(t(X[cca_res$u != 0, ]), t(metab[cca_res$v != 0, ])) pca_res <- dudi.pca(combined, scannf = F, nf = 3) ``` ```{r annotation} genotype <- substr(rownames(pca_res$li), 1, 2) sample_type <- substr(rownames(pca_res$l1), 3, 4) feature_type <- grepl("\\.", colnames(combined)) feature_type <- ifelse(feature_type, "Metabolite", "OTU") sample_info <- data.frame(pca_res$li, genotype, sample_type) feature_info <- data.frame(pca_res$c1, feature = substr(colnames(combined), 1, 6)) ``` ```{r multitable-interpret-pca, fig.wide=TRUE, fig.cap="A PCA triplot produced from the CCA selected features in from muliple data types;metabolites and OTUs."} ggplot() + geom_point(data = sample_info, aes(x = Axis1, y = Axis2, col = sample_type, shape = genotype), size = 3) + geom_label_repel(data = feature_info, aes(x = 5.5 * CS1, y = 5.5 * CS2, label = feature, fill = feature_type), size = 2, segment.size = 0.3, label.padding = unit(0.1, "lines"), label.size = 0) + geom_point(data = feature_info, aes(x = 5.5 * CS1, y = 5.5 * CS2, fill = feature_type), size = 1, shape = 23, col = "#383838") + scale_color_brewer(palette = "Set2") + scale_fill_manual(values = c("#a6d854", "#e78ac3")) + guides(fill = guide_legend(override.aes = list(shape = 32, size = 0))) + coord_fixed(sqrt(pca_res$eig[2] / pca_res$eig[2])) + labs(x = sprintf("Axis1 [%s%% Variance]", 100 * round(pca_res$eig[1] / sum(pca_res$eig), 2)), y = sprintf("Axis2 [%s%% Variance]", 100 * round(pca_res$eig[2] / sum(pca_res$eig), 2)), fill = "Feature Type", col = "Sample Type") ``` Note that we have departed from our convention of fixing the aspect ratio here as the second axis represents very little of the variability and the plot would actually become unreadable. Figure \@ref(fig:multitable-interpret-pca) displays a PCA *triplot*, where we show different types of samples and the multidomain features (Metabolites and OTUs). This allows comparison across the measured samples – triangles for Knockout and circles for wild type – and characterizes the influence the different features – diamonds with text labels. For example, we see that the main variation in the data is across PD and ST samples, which correspond to the different diets. Further, large values of 15 of the features are associated with ST status, while small values for 5 of them indicate PD status. The advantage of the sparse CCA screening is now clear – we can display most of the variation across samples using a relatively simple plot, and can avoid plotting the hundreds of additional points that would be needed to display all of the features. # DPCOA and adaptive GPCA analyses It is important to be able to integrate the phylogenetic tree information into the analysis. We start with a small example from the Antibiotic data study (Relman and Dethfelsen, 2011) available in the DPCoA packages. This can be done through either the use of the ordinate function with the method="DPCoA" or better still by using the new adaptiveGPCA package. Interpolation between complete tree based DPCOA analyses and PCA analysis, this requires interactive use and shiny, so not evaluated in vignette here: ```{r, eval =FALSE} #not run data(AntibioticPhyloseq) theme_set(theme_bw()) pp = processPhyloseq(AntibioticPhyloseq) out.ff = gpcaFullFamily(pp$X, pp$Q, k = 2) out.agpca = visualizeFullFamily(out.ff, sample_data = sample_data(AntibioticPhyloseq), sample_mapping = aes(x = Axis1, y = Axis2, color = condition), var_data = tax_table(AntibioticPhyloseq), var_mapping = aes(x = Axis1, y = Axis2, color = Phylum)) ``` # Sparse discriminant analyses using a phylogenetic tree The package itself needs to downloaded from github: ```{r} ##devtools::install.github("jfukuyama/treeDA") library(treeDA) library(adaptiveGPCA) data(AntibioticPhyloseq) theme_set(theme_bw()) ``` ```{r} out.treeda = treeda(response = sample_data(AntibioticPhyloseq)$type, predictors = otu_table(AntibioticPhyloseq), tree = phy_tree(AntibioticPhyloseq), p = 15) ``` # Looking at longitudinal data in the presence of a tree using treelapse. [Tutorial to follow](http://statweb.stanford.edu/~kriss1/antibiotic.html) #References