---
title: "Assigning bulk RNA-seq to pseudotime"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Assigning bulk RNA-seq to pseudotime}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

```{r setup, results='hide', message=FALSE, warning=FALSE}
library(blase)
library(SingleCellExperiment)
library(tradeSeq)
library(slingshot)
library(scran)
library(scater)
library(BiocParallel)
library(ami)
library(patchwork)
library(reshape2)
```

```{r randomseed}
RNGversion("3.5.0")
SEED <- 7
set.seed(SEED)
```

```{r concurrency}
N_CORES <- 4
if (ami::using_ci()) {
    N_CORES <- 2
}
```

## Introduction

BLASE is a tool for matching bulk RNA-seq samples to the pseudotime of a 
trajectory in a scRNA-seq dataset. BLASE integrates with the Bioconductor
ecosystem by integrating closely with SingleCellExperiment objects. (BLASE
can be used with Seurat objects that have been converted using 
`as.SingleCellExperiment`.)

BLASE can be used for the following use-cases:
* Identifying the progress of a bulk like RNA-seq sample through 
  scRNA-seq reference trajectory.
* Annotating scRNA-seq trajectories with reference bulk RNA-seq datasets
* Correcting development rate differences in comparative RNA-seq samples.

### Principle

We believe that in many trajectories that cells undergo, there will be a 
curve that shows the expression of some key genes. In the plot below, 
we can see that by looking at these three "genes" which have 2 peaks each 
over "pseudotime", we can identify 6 unique states, which, as long as 
the genes correspond to some aspect of the trajectory, can be used 
to infer which stage of the trajectory a cell is currently in. This 
is of course a constructed example and unlikely to be exactly repeated in 
real life, but hopefully demonstrates the principle BLASE works on.

```{r expressionOverPseudotime}
# Adapted from
# https://codepal.ai/code-generator/query/n4dEA6I9/plot-sine-wave-ggplot2-r
x <- seq(0, 2 * pi, length.out = 500)
gene1_expr <- 10 * (sin(x + 0.1) + 1)
gene2_expr <- 11 * (sin(x - pi) + 1)
gene3_expr <- 7 * (sin(x + pi / 2) + 1)

genes <- data.frame(gene1 = gene1_expr, gene2 = gene2_expr, gene3 = gene3_expr)
genes_melt <- melt(t(genes), varnames = c("gene", "x"))
genes_melt$x <- as.numeric(genes_melt$x)
```

```{r expressionOverPseudotimePlot}
#| fig.alt: >
#|   Plot showing gene expression varying over time for three genes, with lines
#|   marking different areas which can be uniquely identified by the rank-order
#|   of the expression of the genes.
ggplot(genes_melt, aes = aes()) +
    geom_line(aes(x = x, y = value, color = gene)) +
    geom_vline(xintercept = 0, linetype = "dashed") +
    geom_vline(xintercept = 24, linetype = "dashed") +
    geom_vline(xintercept = 181, linetype = "dashed") +
    geom_vline(xintercept = 243, linetype = "dashed") +
    geom_vline(xintercept = 315, linetype = "dashed") +
    geom_vline(xintercept = 479, linetype = "dashed") +
    geom_vline(xintercept = 500, linetype = "dashed") +
    labs(
        x = "pseudotime",
        y = "Gene Expression",
        title = "Variable Genes Over Pseudotime"
    )
```


In this vignette, we will use BLASE to try to assign bulk RNA-seq
samples to the pseudotime trajectory in a scRNA-seq data. We need to
have the following to use Blase in this case:

-   A Single Cell Experiment with pseudotime.
-   A list of genes that explain the trajectory well, in this case
    generated by tradeSeq's `fitGAM` and `associationTest`

## Installing BLASE

To install BLASE from Bioconductor:

```{r installBlase, eval=FALSE}       
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("blase")
```

## Setting up the Single Cell Experiment

First, let's load a Single Cell Experiment to use the tool with,
based on the [tradeSeq
vignette](https://bioconductor.org/packages/devel/bioc/vignettes/tradeSeq/inst/doc/tradeSeq.html). We have added pseudotime to the object, and used tradeSeq as shown in their vignette.

```{r loadSCE}
data(tradeSeq_BLASE_example_sce, package = "blase")
sce <- tradeSeq_BLASE_example_sce
```

```{r setupSCEPlot}
#| fig.alt: >
#|   Plots showing a UMAP of the single-cell data,
#|   coloured by celltype and pseudotime. The UMAP is shaped like a horizontal
#|   line, showing the differentiation of a set of immune cell types.
plotUMAP(sce, text_by = "celltype", colour_by = "celltype") +
    plotUMAP(sce, colour_by = "pseudotime") +
    patchwork::plot_layout(ncol = 1, axis_title = "collect")
```

## Finding the most descriptive genes with tradeSeq

Now we'll find the genes we want to use. We select the top 200 so that
we can do some parameter tuning below.

```{r TradeSeqGeneSelection}
# Use consecutive for genes that change over time
assoRes <- associationTest(
    sce,
    lineages = TRUE,
    global = FALSE,
    contrastType = "consecutive"
)
genelist <- blase::get_top_n_genes(assoRes, n_genes = 200, lineage = 1)
```

Please note that other methods of selecting genes are available. Please
see other vignettes for details.

## Parameter Tuning for BLASE

When using BLASE, it can be a good idea to tune the number of genes and
bins used.

-   Too few genes can lead to poor fitting
-   Too many genes can lead to slower execution, without any benefit, as
    the less informative genes still need to be checked.
-   Too few bins can oversimplify the trajectory
-   Too many bins can lead to too few cells being available for reliable
    values.

We can do this using the following commands, provided by BLASE:

```{r findBestParams}
res <- blase::find_best_params(
    sce,
    genelist,
    split_by = "pseudotime_range",
    pseudotime_slot = "pseudotime",
    bins_count_range = c(5, 10),
    gene_count_range = c(40, 80)
)
```

```{r findBestParamsPlot}
#| fig.alt: >
#|   Figures showing the minimum and mean convexity, as well as % confident
#|   mappings different numbers of genes and bins calculated.
blase::plot_find_best_params_results(res)
```

It looks like 80 genes and 10 bins will give us good specificity, but
let's double check. In a non trivial dataset, this may take some
repetition. We ignore the 5 bin result because this might not be enough
resolution - this depends on your dataset. In general, more bins will
reduce specificity because each bin will have a more similar cell
composition. In general, aim for about as many bins as you have
clusters.

### Inspect Bin Choice

Now we can check if this is a good fit by showing how other bins from
the SC dataset map using these genes. Ideally, each bin is very
specific, with a high "worst specificity".

```{r checkBinsChoice}
blase_data <- as.BlaseData(sce, pseudotime_slot = "pseudotime", n_bins = 10)
genes(blase_data) <- genelist[1:80]
```

```{r checkBinsChoicePlot}
#| fig.alt: >
#|   These plots show the correlations of each bin to itself and all other
#|   bins. This shows how distinct each bin is compared to others, based on
#|   the number of bins and genes selected.
evaluate_parameters(blase_data, make_plot = TRUE)
```

### Inspect Genes Choice

We can also see how the genes change over pseudotime, by plotting
expression of the top genes changing over each pseudotime bin.

```{r evaluateTopGenes}
#| fig.alt: >
#|   Plot showing a the "best" genes selected for use with BLASE, according
#|   to how related their expression is to the pseudotime of the cell.
evaluate_top_n_genes(blase_data)
```

## Mapping Bulk Samples to SC with BLASE

If we're happy, now we can try to map a bulk sample onto the single
cell. We'll do this by pseudobulking cell types in the
SingleCellExperiment but in reality you should use a real bulk dataset.
See some of our other articles for examples.

```{r doBLASEMapping}
bulks_df <- DataFrame(row.names = rownames(counts(sce)))
for (type in unique(sce$celltype)) {
    bulks_df <- cbind(
        bulks_df,
        rowSums(normcounts(subset(sce, , celltype == type)))
    )
}

colnames(bulks_df) <- unique(sce$celltype)

multipotent_progenitors_result <- map_best_bin(
    blase_data, "Multipotent progenitors", bulks_df
)
multipotent_progenitors_result
basophils_result <- map_best_bin(blase_data, "Basophils", bulks_df)
basophils_result
neutrophils_result <- map_best_bin(blase_data, "Neutrophils", bulks_df)
neutrophils_result
```

We can also generate this result with `map_all_best_bins`, which can
also allow for parallelisation of the process for large numbers of
bulks:

```{r fastMapping}
map_all_best_bins(blase_data, bulks_df)
```

### Plotting Heatmap of Mappings

To see how well we plotted the differences between these cell types, we
can plot a heatmap of all these results. The neutrophils and
Erythrocytes are confidently mapped, but the GMP population doesn't map
as well.

```{r plotMappingResults}
#| fig.alt: >
#|   A heatmap showing the correlations BLASE found for a subset
#|   of cell types in this data. Confident mappings are indicated
#|   with asterisks over the mapped tile. This information is also
#|   available as text from the results of the previous code block.
plot_mapping_result_heatmap(list(
    multipotent_progenitors_result,
    basophils_result,
    neutrophils_result
), annotate_correlation = TRUE, text_background = TRUE)
```

### Plotting Detailed Correlation Maps

To look into the Basophils mapping, we can plot the correlation over
each bin. Because the the lower boundary of the best selection (bin 3)
is higher than the higher bounds of the next-best bin (bin 4), BLASE
makes a confident call.

```{r plotMappingResultCorr}
#| fig.alt: >
#|   A plot of the correlation of each bin with the basophils pseudobulk
#|   sample, also showing upper and lower bounds as calculated by BLASE. The
#|   best mapping is for bin 3, and bin 4 is the next most similar.
plot_mapping_result_corr(basophils_result)
```

### Plotting Summary Plots of Mappings

Now we can do some more detailed plotting about where the bins mapped
onto the Single Cell data, and the cell type proportions we expect , but
we need to add the pseudotime bins to the metadata of the SCE. We can
see that the true proportions of cell types in the mapped bin do indeed
map to the cell type we expect.

```{r assignPseudotimeBins}
binned_sce <- assign_pseudotime_bins(
    sce,
    split_by = "pseudotime_range",
    n_bins = 10,
    pseudotime_slot = "pseudotime"
)
```

```{r assignPseudotimeBinsPlot}
#| fig.alt: >
#|   An overview of mapping for Multipotent progenitors, including UMAPs
#|   of all cells and just those in the mapped pseudotime bin
#|   coloured by pseudotime bin, and cell-type. Also includes a line plot
#|   of correlation of the mapping over every bin, and a bar plot of the
#|   different cell-type populations in that bin, showing that the bulk
#|   was correctly mapped to a bin consisting mostly of Multipotent
#|   progenitor cells.
plot_mapping_result(
    binned_sce,
    multipotent_progenitors_result,
    group_by_slot = "celltype"
)
```

To view the bin population chart in full, use:

```{r plotBinPopulation}
#| fig.alt: >
#|   A full size bar plot of the cell-type populations of bin 1.
plot_bin_population(binned_sce, 1, group_by_slot = "celltype")
```

## Session Info

```{r, sessionInfo}
sessionInfo()
```