---
title: "Codon usage (CU) analysis in R"
author: "Anamaria Elek"
package: coRdon
date: "`r Sys.Date()`"
abstract: >
    `coRdon` provides tools for analysis of codone usage (CU)
    in various unannotated or KEGG/COG annotated DNA sequences.
    Funcionalities include: calculation of different CU bias statistics
    and CU-based predictors of gene expression, gene set enrichment analysis
    for annotated sequences, and several methods for visualization 
    of CU and enrichment analysis results. 
output: 
    BiocStyle::html_document:
    toc_float: TRUE
vignette: >
    %\VignetteIndexEntry{coRdon}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteDepends{ggplot2}
    %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r setup, echo = FALSE, message = FALSE, warning = FALSE}
library(knitr)
opts_chunk$set(comment = NA,
                fig.align = "center",
                warning = FALSE,
                cache = FALSE)
library(coRdon)
library(ggplot2)
library(Biostrings)
library(Biobase)
```

# Introduction

Not all synonymous codons are used equally often in prokaryotic genomes -
this selective preference is termed **codon usage (CU) bias**, and is 
an independent determinant of gene expression regulation at the translational
level. Those synonymous codons corresponding to the most abundant tRNA
species are considered optimal for translation. Codon usage bias can therfore
be used to predict the relative expression levels of genes, by comparing CU
bias of a gene to the CU bias of a set of genes known to be highly expressed.  

This approach can be efficiently used to predict highly expressed genes 
in a single genome, but is especially useful at the higher level of 
an entire metagenome. It has been shown that, as well as being present 
within the genome, CU bias is shared among the microbial spieces in the same
environment. By analysing CU bias of a metagenome, one can identify genes
with high predicted expression across the entire microbial community, 
and determine which are the enriched functions within the community, 
i.e. its 'functional fingerprint'.  

A typical workflow for analysing CU bias includes:

* calculating one of the CU statistics
* comparing CU of every gene to average CU of highly expressed genes
* functional enrichment analysis for the predicted highly expressed genes
(i.e. those that have CU similar to a reference set of highly expressed genes).

# Installation
Install coRdon from Bioconductor:
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("coRdon")
```
The developmental version can be installed directly from from GitHub: 
```{r eval=FALSE}
devtools::install_github("BioinfoHR/coRdon")
```


# Loading sequences

Sequences from human gut microbiome samples of healthy individuals and 
liver cirrhosis patients 
([Quin et al. 2014](https://go.nature.com/2G7QdpC)) 
were processed, assembled and used to predict ORFs, which were then 
annotated with a KO (KEGG orthology) function 
([Fabijanic and Vlahovicek 2016](https://bit.ly/2DRfiz6)).  

For each sequence, in each sample, we can calculate codon usage (CU)
bias. 
In order to do this, we need to count occurrences of each codon 
in each sequence. This can be done by storing sequences in each sample 
as a `codonTable` object. 
We can use `readSet()` to read (a directory containing) .fasta files and 
store sequences as a `DNAStringSet` object, which could then be 
converted to `codonTable` using the constructor method `codonTable()`.  

For the purpose of this vignette we will create two `codonTable` objects, 
HD59 and LD94, containing codon counts for metagenomic gut sample from 
healthy individual, and from liver chirosis patient, respectively.

```{r}
dnaLD94 <- readSet(
  file="https://raw.githubusercontent.com/BioinfoHR/coRdon-examples/master/LD94.fasta"
)
LD94 <- codonTable(dnaLD94)
dnaHD59 <- readSet(
  file="https://raw.githubusercontent.com/BioinfoHR/coRdon-examples/master/HD59.fasta"
)
HD59 <- codonTable(dnaHD59)
```

`codonTable` object stores codon counts for each sequence, as well as 
some additional metadata, namely sequence ID infered from 
a .fasta file or DNAStringSet object, sequence length in codons and 
KO / COG annotation. 
Note that not all of these need be present in every `codonTable` object 
- only the codon counts, and inferred lengths are mandatory. 
They can be accessed using `codonCounts()` and `getlen()` 
(not to be confused with `length()` which returns number of sequences 
in a `codonTable` object):

```{r}
cc <- codonCounts(HD59)
head(cc)
l <- getlen(HD59)
head(l)
length(HD59)
```

We can get IDs, KO or COG annotations in a similar way, if these 
are present:

```{r}
ko <- getKO(HD59)
head(ko)
```

# Calculate CU bias

Now we can calculate CU bias for every sequence. 
There are many statistics that measure codon usage, and several widely
used ones are implemented in coRdon. For example, Measure Independent 
of Length and Composition (MILC) can be calculated for sequences in 
the `codonTable` like so:

```{r}
milc <- MILC(HD59)
head(milc)
```

By default, MILC value for every sequence in a set is calculated with
respect to the average CU bias of the entire sample ("self"). This can 
be changed by setting `self = FALSE` and providing a subset of reference
genes, to which CU will be compared. This is commonly a subset of genes
known to be highly expressed, e.g. ribosomal genes. If the sequences in 
the set are annotated, we can choose to calculate CU bias with respect 
to codon usage of ribosomal genes (i.e. codon usage of the subset 
of sequences in the sample which are annotated as ribosomal genes) 
simply by setting `ribosomal = TRUE`.

```{r}
milc <- MILC(HD59, ribosomal = TRUE)
head(milc)
```

Alternatively, instead of using ribosomal genes, we could use 
a `subsets` argument to provide a named list of character vectors 
of annotations for different genes to be used as reference set(s). 
However, if no annotation for sequences is available, one can still
calculate CU bias with respect to some chosen subset(s) of genes, 
by setting `subsets` to be a named list of logical vectors of sequences 
to be included in this reference subset(s), or a named list with
an entirely new reference codon counts table(s).  

Also of note, `self`, `ribosomal` and `subsets` arguments 
can be combined, and MILC calculated for all sequences, with respect 
to each of thus specified reference sets of genes.  

Another important consideration is that MILC, like most of the other 
CU statistics, is still somewhat dependent on the length of sequence 
for which it is calculated, and in order to obtain meaningful values, 
it is recommended that sequences should be at least 80 codons long. 
To that end, either soft or hard filtering step can be included 
in calculation of CU statistics. Hard filtering will remove all of the
sequences from `codonTable` that are shorter than some specified 
threshold (length in codons, default is 80), whereas soft filtering 
will produce  a message if there are such sequences in the set, 
but will not remove them.

```{r warning=TRUE}
milc <- MILC(HD59, filtering = "soft")
```

We can visually inspect distribution of sequence lengths, extracted 
from `codonTable` object using `getlen()`:

```{r include=FALSE}
lengths <- getlen(HD59)
hist(lengths, breaks = 60)
abline(v = 80, col="red")
```
```{r}
lengths <- as.data.frame(getlen(HD59))
colnames(lengths) <- "length"
ggplot(lengths, aes(length)) + 
    geom_density() +
    geom_vline(xintercept = 80, colour = "red") +
    theme_light()
```

Now we calculate MILC values with respect to overall sample CU 
and ribosomal genes CU, removing beforehand sequences shorter than 80 codons
(use `len.threshold` argument to specify differnt value of threshold):

```{r}
milc <- MILC(HD59, ribosomal = TRUE, filtering = "hard")
```

Other CU statistics can be calculated in the same way as `MILC()`, 
using one of the functions: `B()`, `MCB()`, `ENCprime()`, `ENC()` or 
`SCUO()`. Note however, that when calculating ENC and SCUO, 
one doesn't need to provide a subset of reference genes, because these
statistics measure distance in codon usage to uniform usage of 
synonymous codons.  

Some additional arguments that can be set when calculating CU statistics 
are `id_or_name2`, for choosing a genetic code variant (see help for
`getGeneticCode()` function in Biostrings package), `alt.init` for 
alternative initiation codons, `stop.rm` for removal of STOP codons.

## Visualisation of CU bias

Now we can visualize CU bias for every gene on the B plot.  

Here, every gene is represented by a single point on the plot, its 
coordinates determined by distance of genes' CU bias to overall CU bias
("self", on y axis) and to CU bias of reference genes ("ribosomal", 
on x axis).  
`x` and `y` are names of columns in a matrix given as `data` argument, 
they can alternatevly be numeric vectors (of the same length), 
containing values of the CU statistic we wish to plot one against 
the other.  

```{r}
library(ggplot2)

xlab <- "MILC distance from sample centroid"
ylab <- "MILC distance from ribosomal genes"

milc_HD59 <- MILC(HD59, ribosomal = TRUE)
Bplot(x = "ribosomal", y = "self", data = milc_HD59) +
    labs(x = xlab, y = ylab)
```
```{r eval=FALSE}
milc_LD94 <- MILC(LD94, ribosomal = TRUE)
Bplot(x = "ribosomal", y = "self", data = milc_LD94) +
    labs(x = xlab, y = ylab)
```

The argument `annotations` needs to be specified when we wish to indicate
certain genes on the plot. For example, to indicate ribosomal genes, 
provide character vector of annotations corresponding to values in 
`x`, `y` and `data`, and set `ribosomal = TRUE`. 
Note that, if we filtered sequences when calculating MILC, we need to 
provide the annotations for those sequences only.

```{r}
genes <- getKO(HD59)[getlen(HD59) > 80]
Bplot(x = "ribosomal", y = "self", data = milc,
        annotations = genes, ribosomal = TRUE) +
    labs(x = xlab, y = ylab)
```

To indicate any other gene(s), use `reference` argument. `reference` 
takes a list of length 1, containing either a logical vector in which 
`TRUE` corresponds (by position) to genes that are to be indicated 
on the plot, and `FALSE` to all the other genes, or a character vector 
of annotations for the genes to be indicated on the plot.

Another interesting way to visualise codon usage is to plot the 
CU distances between the two samples on the B plot. This can be done 
using `intraBplot()` function which takes as arguments two `codonTable` 
objects and a character indicating which CU statistic to plot. 
We can also choose to indicate ribosomal genes on the plot by 
setting `ribosomal = TRUE`, these genes will now be shown as 
stronger points on the plot.

```{r}
intraBplot(HD59, LD94, names = c("HD59", "LD94"), 
            variable = "MILC", 
            ribosomal = TRUE)
```

# Predict genes' expressivity

Next, we predict expression levels of genes in each sample, with 
ribosomal genes once again used as a reference. There are several 
measures of CU-based gene expressivity implemented in coRdon. 
Here we calculate values of MILC-based Expression Level Predictor (MELP), 
with respect to ribosomal genes, and excluding sequencs shorter than 
80 codons (this is the default value of `len.threshold`):

```{r}
melp <- MELP(HD59, ribosomal = TRUE, filtering = "hard")
head(melp)
```

Other statistics that measure gene expresivity can be calculated 
analogously, using `E()`, `CAI()`, `GCB()`, and `Fop()` functions.  

Genes from a single sample with high expressivity values (e.g. MILC 
greater than 1) are considered to be optimized for translation in 
that sample. If annotation for genes is available, we can perform 
enrichment analysis in order to determine which functions are significantly
enriched in any given sample.  

# Functional annotation

We can identify most significantly enriched or depleted functions 
in the set of annotated genes predicted to have high expression level.  

For a single sample, we first create a contingency table summarising counts 
of genes annotated to each KO category among all the genes in sample, 
and among those predicted to be highly expressed. We do this using 
`crossTab()` function, giving it a character vector of genes' annotations, 
and a numeric vector of their respective MELP values, and specifying that 
a subset of highly expressed reference genes should contain those genes 
that have MELP value grater than 1 (this is default value of threshold). 

```{r}
ct <- crossTab(genes, as.numeric(melp), threshold = 1L)
ct
```

Output of `crossTab()` function is an object of `crossTab` class, 
containing gene anotations, respective values of the given variable, 
and a contingency table with counts for all genes, and for each defined 
subset of genes. They can all be accessed like so:

```{r}
contable(ct)
ann <- getSeqAnnot(ct)
head(ann)
var <- getVariable(ct)
head(var)
```

We could have also defined the subset of highly expressed genes by 
specifying a percent of genes with highest MELP values:

```{r}
crossTab(genes, as.numeric(melp), percentiles = 0.05)
```

Of note, both `threshold` and `percentiles` can be numerical vectors, 
in which case the functions used in subsequent analysis (`enrichment` 
and associated plotting functions, `enrichMAplot`, `enrichBarplot`) 
will produce output for each thus specified reference susbset.  

Having a contingency table, we can perform enrichment analysis. 
This implies scaling and transforming gene counts by MA transformation, 
and performing binomial test, optinally inncluding correction for 
multiple testing, as chosen by specifying the `pAdjustMethod` argument 
(see `p.adjust.methods` for possible options). Additional parameters
`pvalueCutoff` and `padjCutoff` can be passed to the `enrichment()` 
function in order to exclude from the results those sequences that 
have higher significance levels than specified.

```{r}
enr <- enrichment(ct)
enr
```

The output is an object of `AnnotatedDataFrame` class from Biobase package.  
The data stored in this object can be accesed by using `pData()` method 
defined in Biobase.

```{r}
require(Biobase)
enr_data <- pData(enr)
head(enr_data)
```

## Visualisation of enrichment

We can plot enriched and depleted KO categories on an MA plot, 
with significant enrichment or depletion defined as those having 
the p-value below the level of significance `siglev` 0.05 (one can also 
use adjusted p-value, by setting `pvalue` = "padj", and different values 
of `siglev`).  

```{r}
enrichMAplot(enr, pvalue = "pvals", siglev = 0.05) +
    theme_light()
```

In order to determine functions that are enriched among highly expressed 
genes, we need to map KO annotations to a broader onthology. This can be 
done using `reduceCrossTab()` method which takes a `crossTab` object as 
its first argument, and maps KO categories to either KEGG Pathway, 
KEGG Module, or COG functional categories, depending on the value of 
`target` argument.  

```{r}
ctpath <- reduceCrossTab(ct, target = "pathway")
ctpath
```

We can then perform enrichment analysis using the reduced contingency 
table.

```{r}
enrpath <- enrichment(ctpath)
enrpath_data <- pData(enrpath)
head(enrpath_data)
```

Functional enrichment results can also be plotted on a bar plot, 
using enrichBarplot() method. Here, bars are coloured by the level 
of significance, `padj`, whereas their height is the relative enrichment
`"enrich"` (can also be mean average of scaled counts, `"A"`, or scaled 
counts ratio, `"M"`). If we also specify `siglev`, only those categories 
for which associated `pvalue` ("pvals" or "padj") is below the given 
value will be ploted.

```{r fig.height=12}
enrichBarplot(enrpath, variable = "enrich", 
                pvalue = "padj", siglev = 0.05) +
    theme_light() +
    coord_flip() +
    labs(x = "category", y = "relative enrichment")
```

Prior to plotting, KEGGREST package can be used to match pathway 
identifiers to pathway descriptions, in order to produce a more 
informative plot.

```{r eval=FALSE}
require(KEGGREST)
paths <- names(keggList("pathway"))
paths <- regmatches(paths, regexpr("[[:alpha:]]{2,4}\\d{5}", paths))
pnames <- unname(keggList("pathway"))
ids <- match(pData(enrpath)$category, paths)
descriptions <- pnames[ids]
pData(enrpath)$category <- descriptions
enrpath_data <- pData(enrpath)
```

# Integration

In order to pipe the results from CU analysis in various downstream
applications, it is useful to have the data from different samples 
in a single matrix, with sequences (genes) in rows and values for 
different samples in columns. This can be achieved with 
`enrichMatrix()` function.  

For example, if we performed CU-based enrichment analysis on the two 
metagenomic samples described at the beginning of this vignette, and 
wanted to see how the results compare by plotting them on a heatmap 
(e.g. using the ComplexHeatmap package), we could do the following:

```{r}
# calculate MELP
melpHD59 <- MELP(HD59, ribosomal = TRUE, 
                filtering = "hard", len.threshold = 100)
genesHD59 <- getKO(HD59)[getlen(HD59) > 100]

melpLD94 <- MELP(LD94, ribosomal = TRUE, 
                filtering = "hard", len.threshold = 100)
genesLD94 <- getKO(LD94)[getlen(LD94) > 100]

# make cntingency table
ctHD59 <- crossTab(genesHD59, as.numeric(melpHD59))
ctLD94 <- crossTab(genesLD94, as.numeric(melpLD94))

ctHD59 <- reduceCrossTab(ctHD59, "pathway")
ctLD94 <- reduceCrossTab(ctLD94, "pathway")

# calculate enrichment
enrHD59 <- enrichment(ctHD59)
enrLD94 <- enrichment(ctLD94)

mat <- enrichMatrix(list(HD59 = enrHD59, LD94 = enrLD94), 
                    variable = "enrich")
head(mat)
```

Thus created matrix, with genes in rows and samples in columns, 
can be directly used as an argument to `Heatmap()` function for plotting. 
Also, for the purpose of neat visualization, we can again make use 
of the KEGGREST package to match pathway identifiers to pathway 
descriptions.

```{r fig.height=15, fig.width=7, eval=FALSE}
paths <- names(KEGGREST::keggList("pathway"))
paths <- regmatches(paths, regexpr("[[:alpha:]]{2,4}\\d{5}", paths))
pnames <- unname(KEGGREST::keggList("pathway"))
ids <- match(rownames(mat), paths)
descriptions <- pnames[ids]
rownames(mat) <- descriptions

mat <- mat[apply(mat, 1, function(x) all(x!=0)), ]
ComplexHeatmap::Heatmap(
    mat, 
    name = "relative \nenrichment",
    col = circlize::colorRamp2( c(-100, 0, 100), 
                                c("red", "white", "blue")),
    row_names_side = "left",
    row_names_gp = gpar(fontsize = 8),
    show_column_dend = FALSE, 
    show_row_dend = FALSE)
```

# Session info
```{r}
sessionInfo()
```