Damsel-workflow

1. Introduction

This document gives an introduction to the R package Damsel, for use in DamID analysis, from BAM file input to gene ontology analysis.

Designed for use with DamID data, the Damsel methodology could be modified for use on any similar technology that seeks to identify enriched regions relative to a control sample.

Utilising the power of edgeR for differential analysis and goseq for gene ontology bias correction, Damsel provides a unique end-to-end analysis for DamID.

The DamID example data used in this vignette is available in the package and has been taken from Vissers et al. (2018), ‘The Scalloped and Nerfin-1 Transcription Factors Cooperate to Maintain Neuronal Cell Fate’. The fastq files were downloaded from SRA and aligned using Rsubread::buildindex and Rsubread::align; the resulting BAM files were then sorted and indexed (.bai files) with Samtools.

Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Damsel")
library(Damsel)

2. Processing the BAM files

As a DamID analysis tool, Damsel requires a GATC region file for analysis. These regions serve as a guide to extract counts from the BAM files.

2.1 Introducing the GATC region file

It can be generated with getGatcRegions() using a BSgenome object or a FASTA file.

It is a GRangesList with the consecutive GATC regions across the genome - representing the region (or the length) between GATC sites, as well as the positions of the sites themselves.
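To make the relationship between sites and regions concrete, here is a minimal base R sketch on a toy sequence. It is not how getGatcRegions() is implemented: the boundary rule used here (a region runs from the third base of one GATC site to the second base of the next) is an assumption inferred from the example tables in this vignette, and toy_seq is a made-up sequence.

```r
# Toy sequence with three GATC sites (hypothetical; a real run uses a genome).
toy_seq <- paste0("AA", "GATC", "TTTT", "GATC", "CCCCC", "GATC", "AA")

# Start position of every GATC site (1-based).
site_start <- as.integer(gregexpr("GATC", toy_seq, fixed = TRUE)[[1]])
sites <- data.frame(start = site_start, end = site_start + 3L)

# Assumed rule: each region spans from the middle of one site to the
# middle of the next site.
n_sites <- length(site_start)
regions_toy <- data.frame(
    start = site_start[-n_sites] + 2L,
    end = site_start[-1] + 1L
)
regions_toy$width <- regions_toy$end - regions_toy$start + 1L

print(sites)
print(regions_toy)
```

In the dm6 example output, the sites at chr2L:80-83 and chr2L:229-232 bound the region 82-230, which is consistent with this rule.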

If your DamID data come from another species, or you would prefer to make your own region file, you can use the following function, providing a BSgenome object or a FASTA file.

library(BSgenome.Dmelanogaster.UCSC.dm6)
regions_and_sites <- getGatcRegions(BSgenome.Dmelanogaster.UCSC.dm6)
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored

regions <- regions_and_sites$regions
knitr::kable(head(regions))
seqnames start end width strand Position
2L 82 230 149 * chr2L-82
2L 231 371 141 * chr2L-231
2L 372 539 168 * chr2L-372
2L 540 688 149 * chr2L-540
2L 689 829 141 * chr2L-689
2L 830 997 168 * chr2L-830
knitr::kable(head(regions_and_sites$sites))
seqnames start end width strand Position
chr2L 80 83 4 * chr2L-80
chr2L 229 232 4 * chr2L-229
chr2L 370 373 4 * chr2L-370
chr2L 538 541 4 * chr2L-538
chr2L 687 690 4 * chr2L-687
chr2L 828 831 4 * chr2L-828

If you already have your own GATC region file, ensure that it has the same format as the tables above, with these 6 columns: seqnames, start, end, width, strand and Position.

The GATC regions are not evenly distributed across the genome, and vary greatly in size. This can be explored in more detail via boxplots of the log region width for each chromosome.

regions.df <- data.frame(regions)
ggplot2::ggplot(
    regions.df,
    ggplot2::aes(x = seqnames, y = log(width))
) +
    ggplot2::geom_boxplot()

2.2 Extracting the counts within the GATC regions

Note: Damsel requires BAM files that have been mapped to the reference genome.

Provided the path to a folder of BAM files (and their .bai files) and the appropriate GATC region file, the function countBamInGATC() will extract the counts for each region from each available BAM file and add them as columns to a data frame. The columns are named after the BAM files, so rename the files before running the function if their names are not meaningful.

path_to_bams <- system.file("extdata", package = "Damsel")
counts.df <- countBamInGATC(path_to_bams,
    regions = regions
)
knitr::kable(head(counts.df))
Position seqnames start end width strand dam_1_SRR7948872.BAM sd_1_SRR7948874.BAM dam_2_SRR7948876.BAM sd_2_SRR7948877.BAM
chr2L-82 chr2L 82 230 149 * 1.0 0.33 0.0 0.0
chr2L-231 chr2L 231 371 141 * 1.5 5.67 87.0 57.5
chr2L-372 chr2L 372 539 168 * 2.5 6.17 88.0 58.5
chr2L-540 chr2L 540 688 149 * 2.0 4.83 0.0 0.0
chr2L-689 chr2L 689 829 141 * 0.0 0.00 0.5 0.5
chr2L-830 chr2L 830 997 168 * 0.0 1.33 4.5 3.5

The example counts are also available directly as a data set via data().

data("dros_counts")

2.3 Correlation analysis of samples

At this stage, the similarities and differences between the samples can be analysed via correlation. plotCorrHeatmap plots the correlation between all available samples (Dam and Fusion) to visualise how similar they are. The default for all Damsel correlation analysis is the non-parametric Spearman's correlation. The correlation between Dam_1 and Fusion_1 can be expected to reach ~0.7, whereas the correlation between Dam_1 & Dam_3 or Fusion_1 & Fusion_2 would be expected to be closer to ~0.9.

plotCorrHeatmap(df = counts.df, method = "spearman")

Two specific samples can also be compared using ggpubr::ggscatter(), which draws a scatterplot of the two samples overlaid with the correlation results.
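As a rough illustration of such a pairwise comparison, the sketch below computes a Spearman correlation for two samples and, if ggpubr is installed, builds the matching scatterplot. The toy_counts values are the first six rows of the example counts table above with shortened, hypothetical column names; on real data you would pass counts.df and its full column names instead.

```r
# First six regions of two samples from the example counts table
# (column names shortened here for readability).
toy_counts <- data.frame(
    dam_1 = c(1.0, 1.5, 2.5, 2.0, 0.0, 0.0),
    sd_1 = c(0.33, 5.67, 6.17, 4.83, 0.00, 1.33)
)

# Spearman correlation is a Pearson correlation computed on the ranks.
rho <- cor(toy_counts$dam_1, toy_counts$sd_1, method = "spearman")
print(rho)

# Optional: scatterplot with the correlation overlaid, only if ggpubr is available.
if (requireNamespace("ggpubr", quietly = TRUE)) {
    scatter <- ggpubr::ggscatter(
        toy_counts,
        x = "dam_1", y = "sd_1",
        add = "reg.line",
        cor.coef = TRUE, cor.method = "spearman"
    )
    # print(scatter) draws the plot
}
```

Six regions are far too few for a meaningful estimate; the point is only the shape of the call.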

2.4 Visualisation of coverage

A specific region can be selected to view the counts across samples.

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000,
    log2_scale = FALSE,
    layout = "stacked"
)

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000,
    layout = "stacked",
    log2_scale = TRUE
)

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000,
    layout = "spread"
)

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000,
    layout = "spread",
    log2_scale = TRUE
)

3. Differential methylation analysis

The goal of DamID analysis is to identify regions that are enriched in the fusion sample relative to the control. In Damsel, this step is referred to as differential methylation analysis, and makes use of edgeR.

For ease of use, Damsel has four main edgeR-based functions, which compile different steps and functions from within edgeR.

3.1 Setting up edgeR analysis

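makeDGE sets up the edgeR analysis for differential methylation testing. It takes the data frame of samples and regions as input; the steps it conducts are reflected in the output below: the filtered counts, the sample groups with their library sizes and normalisation factors, the design matrix, and the dispersion estimates.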

dge <- makeDGE(counts.df)
head(dge)
#> An object of class "DGEList"
#> $counts
#>            dam_1_SRR7948872.BAM sd_1_SRR7948874.BAM dam_2_SRR7948876.BAM
#> chr2L-231                  1.50                5.67                87.00
#> chr2L-372                  2.50                6.17                88.00
#> chr2L-5154                 6.50               10.83                15.50
#> chr2L-5303                43.17               53.67               143.00
#> chr2L-6025                68.17               66.75               283.33
#> chr2L-6881                19.67               14.67                27.33
#>            sd_2_SRR7948877.BAM
#> chr2L-231                57.50
#> chr2L-372                58.50
#> chr2L-5154               17.33
#> chr2L-5303              174.00
#> chr2L-6025              283.33
#> chr2L-6881               23.33
#> 
#> $samples
#>                       group lib.size norm.factors
#> dam_1_SRR7948872.BAM    Dam  3957170    1.6710958
#> sd_1_SRR7948874.BAM  Fusion  3519446    0.9037809
#> dam_2_SRR7948876.BAM    Dam 19908054    1.1742367
#> sd_2_SRR7948877.BAM  Fusion 19943269    0.5638711
#> 
#> $genes
#>            seqnames start  end width
#> chr2L-231     chr2L   231  371   141
#> chr2L-372     chr2L   372  539   168
#> chr2L-5154    chr2L  5154 5302   149
#> chr2L-5303    chr2L  5303 6024   722
#> chr2L-6025    chr2L  6025 6880   856
#> chr2L-6881    chr2L  6881 6922    42
#> 
#> $design
#>   (Intercept) group V2
#> 1           1     1  1
#> 2           1     2  1
#> 3           1     3  0
#> 4           1     4  0
#> attr(,"assign")
#> [1] 0 1 2
#> 
#> $common.dispersion
#> [1] 0.02216819
#> 
#> $trended.dispersion
#> [1] 1.313574e-02 1.349953e-02 9.765625e-05 1.927799e-02 2.142786e-02
#> [6] 6.743923e-03
#> 
#> $tagwise.dispersion
#> [1] 1.699904e-02 1.631270e-02 9.765625e-05 1.731309e-02 1.975349e-02
#> [6] 2.330183e-03
#> 
#> $AveLogCPM
#> [1] 1.7441776 1.7814451 0.4911189 3.4284264 4.0926559 1.2169209
#> 
#> $trend.method
#> [1] "locfit"
#> 
#> $prior.df
#> [1] 24.51953 24.51953 24.93649 24.99455 24.99455 24.99455
#> 
#> $prior.n
#> [1] 24.51953 24.51953 24.93649 24.99455 24.99455 24.99455
#> 
#> $span
#> [1] 0.2719139

The output from this step is a DGEList containing all of the information from the steps.
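For readers unfamiliar with edgeR, the sketch below shows what a conventional edgeR setup looks like on simulated counts. It is an assumption about the general shape of such a pipeline, not a description of what makeDGE does internally: the simulated data, the simple ~group design and the default filtering settings are all illustrative choices.

```r
set.seed(42)

# Simulated counts: 1000 regions x 4 samples (hypothetical data).
n_sim <- 1000
sim_counts <- matrix(
    rnbinom(n_sim * 4, mu = 50, size = 10),
    ncol = 4,
    dimnames = list(
        paste0("region_", seq_len(n_sim)),
        c("dam_1", "sd_1", "dam_2", "sd_2")
    )
)

# Group samples as Dam (control) or Fusion based on their names.
sim_group <- factor(
    ifelse(grepl("^dam", colnames(sim_counts)), "Dam", "Fusion"),
    levels = c("Dam", "Fusion")
)
sim_design <- model.matrix(~sim_group)

# Standard edgeR steps, run only if edgeR is installed.
if (requireNamespace("edgeR", quietly = TRUE)) {
    sim_dge <- edgeR::DGEList(counts = sim_counts, group = sim_group)
    keep <- edgeR::filterByExpr(sim_dge, design = sim_design) # drop low-count regions
    sim_dge <- sim_dge[keep, , keep.lib.sizes = FALSE]
    sim_dge <- edgeR::calcNormFactors(sim_dge) # TMM normalisation factors
    sim_dge <- edgeR::estimateDisp(sim_dge, design = sim_design) # dispersions
}
```

The DGEList printed above has the same kinds of components ($samples, $design, the dispersion slots), which is why this generic pipeline is a useful mental model.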

3.2 Examining the data - multidimensional scaling plot

It’s important to visualise the differences between the samples.

You would expect the Dam samples to cluster together, and the Fusion samples to cluster together. You would expect the majority of the variation to be within the 1st dimension (the x axis), and less variation in the 2nd dimension (the y axis).

group <- as.character(dge$samples$group)
limma::plotMDS(dge, col = as.numeric(factor(group)))

3.3 Identifying differentially methylated regions

After exploring the data visually, it’s time to identify the enriched regions. testDmRegions compiles the edgeR functions for differential testing with one key modification - it outputs the results with the adjusted p values as well as the raw p values.

testDmRegions runs the edgeR differential test and then incorporates the results with the region data, providing a result for every region. The regions excluded from the edgeR analysis are given logFC = 0 and adjust.p = 1. Setting plot = TRUE will also draw an edgeR::plotSmear() plot alongside the results.

dm_results <- testDmRegions(dge, p.value = 0.01, lfc = 1, regions = regions, plot = TRUE)
#> Warning in plot.xy(xy.coords(x, y), type = type, ...): "panel.first" is not a
#> graphical parameter

dm_results |>
    dplyr::group_by(meth_status) |>
    dplyr::summarise(n = dplyr::n())
#> # A tibble: 3 × 2
#>   meth_status       n
#>   <chr>         <int>
#> 1 No_sig        45450
#> 2 Not_included 325087
#> 3 Upreg         13117

knitr::kable(head(dm_results), digits = 32)
seqnames start end width strand Position number dm logFC PValue adjust.p meth_status
chr2L 82 230 149 * chr2L-82 1 NA 0.0000000 1.00000000 1.00000000 Not_included
chr2L 231 371 141 * chr2L-231 2 0 0.7040439 0.04755650 0.07490229 No_sig
chr2L 372 539 168 * chr2L-372 3 0 0.6935078 0.04838744 0.07604180 No_sig
chr2L 540 688 149 * chr2L-540 4 NA 0.0000000 1.00000000 1.00000000 Not_included
chr2L 689 829 141 * chr2L-689 5 NA 0.0000000 1.00000000 1.00000000 Not_included
chr2L 830 997 168 * chr2L-830 6 NA 0.0000000 1.00000000 1.00000000 Not_included
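The way tested regions are folded back into the full region table can be sketched in base R. This is an illustration only: the Benjamini-Hochberg adjustment and the classification rule (adjust.p below the p.value threshold and logFC above the lfc threshold gives Upreg) are assumptions, and the toy p-values are made up.

```r
# All regions, including ones that were filtered out before testing.
all_regions <- data.frame(
    Position = paste0("chr2L-", c(82, 231, 372, 540))
)

# Hypothetical test results for the two regions that were tested.
tested <- data.frame(
    Position = c("chr2L-231", "chr2L-372"),
    logFC = c(0.70, 2.50),
    PValue = c(0.04, 0.0001)
)
tested$adjust.p <- p.adjust(tested$PValue, method = "BH") # assumed method

# Match tested regions back to the full table, keeping the original order.
idx <- match(all_regions$Position, tested$Position)
results_toy <- all_regions
results_toy$logFC <- ifelse(is.na(idx), 0, tested$logFC[idx])
results_toy$adjust.p <- ifelse(is.na(idx), 1, tested$adjust.p[idx])

# Assumed classification rule using p.value = 0.01 and lfc = 1.
results_toy$meth_status <- ifelse(
    is.na(idx), "Not_included",
    ifelse(results_toy$adjust.p < 0.01 & results_toy$logFC > 1, "Upreg", "No_sig")
)
print(results_toy)
```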

The edgeR results can be plotted alongside the counts as well.

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000,
    log2_scale = FALSE
) +
    geom_dm(dm_results.df = dm_results)

Only regions that are fully contained within the provided boundaries will be plotted.

gatc_sites <- regions_and_sites$sites

knitr::kable(head(gatc_sites))
seqnames start end width strand Position
chr2L 80 83 4 * chr2L-80
chr2L 229 232 4 * chr2L-229
chr2L 370 373 4 * chr2L-370
chr2L 538 541 4 * chr2L-538
chr2L 687 690 4 * chr2L-687
chr2L 828 831 4 * chr2L-828
plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000,
    log2_scale = FALSE
) +
    geom_dm(dm_results) +
    geom_gatc(gatc_sites)

4. Identifying peaks (bridges)

As you could see from the plot of the differential methylation results, there are tens of thousands of enriched regions. To reduce the scale of this data to something more biologically meaningful, enriched regions can be compiled into peaks.

Aggregating the regions

Damsel identifies peaks by aggregating regions of enrichment. As DamID sequencing generally sequences the 75 bp from the GATC site, regions smaller than 150 bp are mostly non-significant in statistical testing. Because of this, enriched regions separated by a gap of 150 bp or less are combined into one longer peak.

The FDR and logFC for each peak are calculated following the theory of csaw::getBestTest(), where the ‘best’ (smallest) p-value among the regions that make up the peak is selected as representative of the peak. The logFC is therefore the corresponding logFC from the selected region.

peaks <- identifyPeaks(dm_results)
nrow(peaks)
#> [1] 1162

knitr::kable(head(peaks), format = "html", table.attr = "style='width:30%;'", digits = 32)
peak_id seqnames start end width strand rank_p logFC_match FDR multiple_peaks region_pos n_regions_dm n_regions_not_dm
PM_1380 chr2L 15737357 15747460 10104 * 1 5.070017 4.976049e-06 3 chr2L-15739698 30 2
PS_1191 chr2L 13204670 13207893 3224 * 2 5.043199 4.976049e-06 1 chr2L-13206327 8 0
PM_881 chr2L 9595401 9598257 2857 * 3 5.013857 4.976049e-06 3 chr2L-9595799 10 3
PM_765 chr2L 8402774 8404184 1411 * 4 4.954536 4.976049e-06 2 chr2L-8403504 6 2
PS_2452 chr2R 8605638 8609236 3599 * 5 4.903245 4.976049e-06 1 chr2R-8607814 8 0
PS_2280 chr2R 6748581 6752819 4239 * 6 4.889517 4.976049e-06 1 chr2R-6749017 7 0
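The aggregation idea described above (merge enriched regions separated by small gaps, then report the best region's statistics for the peak) can be sketched in a few lines of base R. This is a simplified illustration with made-up values, not the identifyPeaks() implementation; in particular it ignores chromosomes and the non-significant regions that can sit inside a peak.

```r
# Toy enriched regions on one chromosome (hypothetical values).
sig <- data.frame(
    start = c(100, 300, 900, 1000, 2000),
    end = c(250, 500, 950, 1400, 2300),
    logFC = c(2.1, 3.0, 1.5, 4.2, 2.8),
    FDR = c(1e-3, 1e-5, 5e-3, 1e-4, 1e-6)
)
sig <- sig[order(sig$start), ]
max_gap <- 150

# Gap in bp between each region and the previous one.
gap <- sig$start[-1] - sig$end[-nrow(sig)] - 1
# A new peak starts whenever the gap is larger than max_gap.
peak_index <- cumsum(c(TRUE, gap > max_gap))

# Per peak: span of the merged regions, plus FDR and logFC of the best region.
peaks_toy <- do.call(rbind, lapply(split(sig, peak_index), function(p) {
    best <- p[which.min(p$FDR), ]
    data.frame(
        start = min(p$start), end = max(p$end),
        n_regions = nrow(p),
        logFC_match = best$logFC, FDR = best$FDR
    )
}))
print(peaks_toy)
```

Here the five regions collapse into three peaks, and each peak inherits the logFC of its smallest-FDR region, mirroring the logFC_match and FDR columns in the table above.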

Plotting

A peak plot layer can be added to our graph, optionally with peak labels.

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000
) +
    geom_dm(dm_results) +
    geom_peak(peaks) +
    geom_gatc(gatc_sites)

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000
) +
    geom_dm(dm_results) +
    geom_peak(peaks, peak.label = TRUE) +
    geom_gatc(gatc_sites)

5. Identifying genes associated with peaks

While interesting, the peak information by itself has no biological meaning. As the peaks represent regions of DNA that the Fusion protein interacted with, likely as a transcription factor, we wish to identify the gene that is being affected. To do so, we need to associate the peaks with a potential “target” gene.

Note: any gene identified here is only a potential target that must be validated in laboratory procedures. There is no method available that is able to accurately predict the location and target genes of enhancers, so a key and potentially incorrect assumption in this part of the analysis is that all peaks represent binding to a local enhancer or promoter, that is, one close to or overlapping the target gene.

It must also be noted that the Drosophila melanogaster genome and its transcription factor interactions differ from those of mammals, so results obtained under the same assumptions must be interpreted cautiously. While mammalian genes are generally spread out with little overlap, there is a large amount of overlap between Drosophila genes, requiring some intuitive interpretation of which gene a peak is potentially targeting.

In the Damsel methodology, peaks are considered to associate with genes if they are within 5kb upstream or downstream. If multiple genes are within these criteria, they are all listed, with the closest gene given the primary position.
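The distance criterion can be illustrated with a small base R sketch. The peaks, genes and coordinates below are hypothetical, and the interval-gap calculation is one straightforward way to express "within 5 kb"; Damsel itself relies on plyranges for this step, so treat this as a conceptual picture only.

```r
# Hypothetical peaks and genes on the same chromosome.
peaks_toy <- data.frame(
    peak_id = c("P1", "P2"),
    peak_start = c(1000, 50000),
    peak_end = c(2000, 51000)
)
genes_toy <- data.frame(
    gene = c("geneA", "geneB", "geneC"),
    gene_start = c(2500, 9000, 50500),
    gene_end = c(4000, 12000, 52000)
)
max_distance <- 5000

# All peak-gene combinations (Cartesian product).
pairs <- merge(peaks_toy, genes_toy, by = NULL)

# Gap between the two intervals; 0 when they overlap.
pairs$distance <- pmax(
    0,
    pairs$gene_start - pairs$peak_end,
    pairs$peak_start - pairs$gene_end
)

# Keep associations within max_distance, closest gene first for each peak.
near <- pairs[pairs$distance <= max_distance, ]
near <- near[order(near$peak_id, near$distance), ]
print(near[, c("peak_id", "gene", "distance")])
```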

5.1 Extract genes

The function collateGenes() uses two different mechanisms to create a list of genes. It allows for the use of a TxDb object/annotation package, or can access biomaRt.

5.1.1 A TxDb object

The simplest approach is to use a TxDb annotation package. A TxDb package is available for most species and has information about the genes, exons, cds, promoters, etc - which can all be accessed using the GenomicFeatures package. This presentation has a lot of slides on how to use TxDb objects to access different data types: https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicFeatures_In_Bioconductor.html#10

However, TxDb libraries contain only the Ensembl gene IDs and not the gene symbol or name. Instead, we need to access an org.Db package to add them. org.Db packages contain genome annotation for model organisms and can be used to extract information such as the gene name. More information can be found here: https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicFeatures_In_Bioconductor.html#46

BiocManager::install("TxDb.Dmelanogaster.UCSC.dm6.ensGene")
BiocManager::install("org.Dm.eg.db")
library("TxDb.Dmelanogaster.UCSC.dm6.ensGene")
txdb <- TxDb.Dmelanogaster.UCSC.dm6.ensGene::TxDb.Dmelanogaster.UCSC.dm6.ensGene
library("org.Dm.eg.db")
#> 
genes <- collateGenes(genes = txdb, regions = regions, org.Db = org.Dm.eg.db)
#>   1 gene was dropped because it has exons located on both strands of the
#>   same reference sequence or on more than one reference sequence, so
#>   cannot be represented by a single genomic range.
#>   Use 'single.strand.genes.only=FALSE' to get all the genes in a
#>   GRangesList object, or use suppressMessages() to suppress this message.
#> 'select()' returned 1:many mapping between keys and columns
#> TSS taken as start of gene, taking strand into account
knitr::kable(head(genes))
seqnames start end width strand ensembl_gene_id gene_name TSS n_regions
chr2L 7529 9484 1956 + FBgn0031208 CR11023 7529 3
chr2L 9839 21376 11538 - FBgn0002121 l(2)gl 21376 33
chr2L 21823 25155 3333 - FBgn0031209 Ir21a 25155 8
chr2L 21952 24237 2286 + FBgn0263584 asRNA:CR43609 21952 6
chr2L 25402 65404 40003 - FBgn0051973 Cda5 65404 93
chr2L 54817 55767 951 + FBgn0267987 lncRNA:CR46254 54817 1

Using the TxDb package, Damsel assumes that the TSS (transcription start site) is the same as the start site of the gene, taking the strand into account.
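The strand-aware TSS rule is simple enough to show directly. The sketch below applies it to the first four genes from the table above (values copied from that output); the ifelse() expression is an illustration of the stated rule rather than Damsel's own code.

```r
# First four genes from the collateGenes() output above.
genes_toy <- data.frame(
    gene_name = c("CR11023", "l(2)gl", "Ir21a", "asRNA:CR43609"),
    start = c(7529, 9839, 21823, 21952),
    end = c(9484, 21376, 25155, 24237),
    strand = c("+", "-", "-", "+")
)

# On the + strand transcription begins at 'start'; on the - strand at 'end'.
genes_toy$TSS <- ifelse(genes_toy$strand == "+", genes_toy$start, genes_toy$end)
print(genes_toy)
```

The computed values agree with the TSS column shown in the table.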

5.1.2 Accessing the biomaRt resource

Alternatively, the name of a species listed in biomaRt can be provided, together with the version of the genome. The advantage of biomaRt is that more information can be retrieved, including the canonical transcript. A guide to understanding more about how biomaRt functions is here: https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html

BiocManager::install("biomaRt")
library(biomaRt)
collateGenes(genes = "dmelanogaster_gene_ensembl", regions = regions, version = 109)
#> GRanges object with 23844 ranges and 5 metadata columns:
#>           seqnames          ranges strand | ensembl_gene_id   gene_name
#>              <Rle>       <IRanges>  <Rle> |     <character> <character>
#>       [1]       2L       7529-9484      + |     FBgn0031208     CR11023
#>       [2]       2L       9726-9859      + |     FBti0060580            
#>       [3]       2L      9839-21376      - |     FBgn0002121      l(2)gl
#>       [4]       2L       9888-9949      - |     FBti0059810            
#>       [5]       2L     21823-25155      - |     FBgn0031209       Ir21a
#>       ...      ...             ...    ... .             ...         ...
#>   [23840]        Y 3370887-3371602      + |     FBgn0267426     CR45779
#>   [23841]        Y 3376224-3376984      + |     FBgn0267427     CR45780
#>   [23842]        Y 3539437-3540722      - |     FBgn0046698      Pp1-Y2
#>   [23843]        Y 3561745-3618339      + |     FBgn0046323         ORY
#>   [23844]        Y 3649996-3666928      + |     FBgn0267592         CCY
#>           ensembl_transcript_id       TSS n_regions
#>                     <character> <integer> <numeric>
#>       [1]           FBtr0475186      7529         3
#>       [2]        FBti0060580-RA      9726         1
#>       [3]           FBtr0306592     21376        33
#>       [4]        FBti0059810-RA      9949         0
#>       [5]           FBtr0113008     25155         8
#>       ...                   ...       ...       ...
#>   [23840]           FBtr0346754   3370887         0
#>   [23841]           FBtr0346755   3376224         0
#>   [23842]           FBtr0346673   3540722         6
#>   [23843]           FBtr0346720   3561745       114
#>   [23844]           FBtr0347413   3649996        28
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths

When used with biomaRt, collateGenes():
  • accesses biomaRt using the seqnames of the appropriate GATC region file as a guide
  • accesses biomaRt a second time to obtain only the Ensembl canonical sequence information for each gene
  • identifies the number of GATC regions that overlap with each gene

5.2 Annotating genes to peaks

As stated above, Damsel associates genes with peaks if they are within 5 kb upstream or downstream. This maximum distance is an adjustable parameter of the annotatePeaksGenes() function. If set to NULL, it will output all possible combinations as defined by plyranges::pair_nearest. The nature of this function means that the closest gene will be found for every peak, even if that distance is in the millions of base pairs. If max_distance = NULL is used, we recommend filtering the results to remove such distant associations.

To respect that some species have genes with more overlap than others, annotatePeaksGenes outputs a list of data frames. The first, closest, outputs information for every peak and its closest gene. The second, top_five, outputs a string of the top five genes (if available) and their distances from each peak. The final data frame, all, provides the raw results and all possible gene and peak associations, as well as all available statistical results.

annotation <- annotatePeaksGenes(peaks, genes, regions = regions, max_distance = 5000)

knitr::kable(head(annotation$closest), digits = 32)
seqnames start end width total_regions n_regions_dm peak_id rank_p gene_position ensembl_gene_id gene_name midpoint_is position
chr2L 5154 9484 4331 8 3 PS_1 1080 chr2L:+:7529-9484 FBgn0031208 CR11023 Upstream Peak_upstream
chr2L 9839 21376 11538 33 5 PS_3 25 chr2L:-:9839-21376 FBgn0002121 l(2)gl Upstream Peak_within_gene
chr2L 9839 21643 11805 34 5 PS_4 19 chr2L:-:9839-21376 FBgn0002121 l(2)gl Upstream Peak_overlap_downstream
chr2L 21823 25155 3333 8 3 PS_5 1036 chr2L:-:21823-25155 FBgn0031209 Ir21a Upstream Peak_within_gene
chr2L 25402 65404 40003 93 3 PM_6 1040 chr2L:-:25402-65404 FBgn0051973 Cda5 Upstream Peak_within_gene
chr2L 65609 68330 2722 7 7 PS_9 307 chr2L:+:65999-66242 FBgn0266878 lncRNA:CR45339 Downstream Peak_encompass_gene
knitr::kable(head(annotation$top_five), digits = 32)
seqnames start end peak_id rank_p n_genes list_ensembl list_gene position distance_TSS min_distance
chr2L 5154 9484 PS_1 1080 1 FBgn0031208 CR11023 Peak_upstream 1512 649
chr2L 9839 21376 PS_3 25 1 FBgn0002121 l(2)gl Peak_within_gene 3489 0
chr2L 9839 21643 PS_4 19 1 FBgn0002121 l(2)gl Peak_overlap_downstream 660 267
chr2L 21823 25155 PS_5 1036 2 FBgn0031209, FBgn0263584 Ir21a, asRNA:CR43609 Peak_within_gene, Peak_within_gene 2771, -432 0, 0
chr2L 25402 65404 PM_6 1040 1 FBgn0051973 Cda5 Peak_within_gene 30177 0
chr2L 65609 71390 PS_9 307 4 FBgn0266878, FBgn0266879, FBgn0067779 lncRNA:CR45339, lncRNA:CR45340, dbr Peak_encompass_gene, Peak_encompass_gene, Peak_overlap_upstream -970.5, -651.5, -487.5 0, 0, 873
str(annotation$all)
#> tibble [2,627 × 33] (S3: tbl_df/tbl/data.frame)
#>  $ seqnames        : Factor w/ 1870 levels "chr2L","chr2R",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ start           : int [1:2627] 5154 9839 9839 21823 21952 25402 65609 65609 65609 71493 ...
#>  $ end             : int [1:2627] 9484 21376 21643 25155 24237 65404 68330 68330 71390 76602 ...
#>  $ width           : num [1:2627] 4331 11538 11805 3333 2286 ...
#>  $ gene_seqnames   : Factor w/ 1870 levels "chr2L","chr2R",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ gene_start      : int [1:2627] 7529 9839 9839 21823 21952 25402 65999 66318 66482 71757 ...
#>  $ gene_end        : int [1:2627] 9484 21376 21376 25155 24237 65404 66242 66524 71390 76211 ...
#>  $ gene_width      : int [1:2627] 1956 11538 11538 3333 2286 40003 244 207 4909 4455 ...
#>  $ gene_strand     : Factor w/ 3 levels "+","-","*": 1 2 2 2 1 2 1 1 1 1 ...
#>  $ peak_seqnames   : Factor w/ 2 levels "chr2L","chr2R": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ peak_start      : int [1:2627] 5154 16957 19789 22097 22097 33758 65609 65609 65609 71493 ...
#>  $ peak_end        : int [1:2627] 6880 18817 21643 22671 22671 36696 68330 68330 68330 76602 ...
#>  $ peak_width      : int [1:2627] 1727 1861 1855 575 575 2939 2722 2722 2722 5110 ...
#>  $ peak_strand     : Factor w/ 3 levels "+","-","*": 3 3 3 3 3 3 3 3 3 3 ...
#>  $ ensembl_gene_id : chr [1:2627] "FBgn0031208" "FBgn0002121" "FBgn0002121" "FBgn0031209" ...
#>  $ gene_name       : chr [1:2627] "CR11023" "l(2)gl" "l(2)gl" "Ir21a" ...
#>  $ TSS             : int [1:2627] 7529 21376 21376 25155 21952 65404 65999 66318 66482 71757 ...
#>  $ n_regions       : num [1:2627] 3 33 33 8 6 93 0 0 15 4 ...
#>  $ peak_id         : chr [1:2627] "PS_1" "PS_3" "PS_4" "PS_5" ...
#>  $ rank_p          : int [1:2627] 1080 25 19 1036 1036 1040 307 307 307 357 ...
#>  $ logFC_match     : num [1:2627] 1.35 4.34 4.43 2.46 2.46 ...
#>  $ FDR             : num [1:2627] 4.88e-04 4.98e-06 4.98e-06 3.11e-04 3.11e-04 ...
#>  $ multiple_peaks  : num [1:2627] 1 1 1 1 1 2 1 1 1 1 ...
#>  $ region_pos      : chr [1:2627] "chr2L-5303" "chr2L-17586" "chr2L-20391" "chr2L-22353" ...
#>  $ n_regions_dm    : num [1:2627] 3 5 5 3 3 3 7 7 7 7 ...
#>  $ n_regions_not_dm: num [1:2627] 0 0 0 0 0 1 0 0 0 0 ...
#>  $ peak_midpoint   : num [1:2627] 6017 17887 20716 22384 22384 ...
#>  $ distance_TSS    : num [1:2627] 1512 3489 660 2771 -432 ...
#>  $ midpoint_is     : chr [1:2627] "Upstream" "Upstream" "Upstream" "Upstream" ...
#>  $ n_genes         : int [1:2627] 1 1 1 2 2 1 4 4 4 4 ...
#>  $ position        : chr [1:2627] "Peak_upstream" "Peak_within_gene" "Peak_overlap_downstream" "Peak_within_gene" ...
#>  $ min_distance    : num [1:2627] 649 0 267 0 0 0 0 0 873 0 ...
#>  $ total_regions   : int [1:2627] 8 33 34 8 6 93 7 7 18 7 ...
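The peak_midpoint and distance_TSS columns in this output appear to be related in a simple way, which the sketch below reproduces for the first five rows. The formula (TSS minus peak midpoint) is inferred from the printed values, not taken from the Damsel source, so treat it as an observation about this output.

```r
# First five rows of peak_start, peak_end and TSS from the str() output above.
pairs_toy <- data.frame(
    peak_start = c(5154, 16957, 19789, 22097, 22097),
    peak_end = c(6880, 18817, 21643, 22671, 22671),
    TSS = c(7529, 21376, 21376, 25155, 21952)
)

# Midpoint of each peak, and signed distance from the midpoint to the TSS.
pairs_toy$peak_midpoint <- (pairs_toy$peak_start + pairs_toy$peak_end) / 2
pairs_toy$distance_TSS <- pairs_toy$TSS - pairs_toy$peak_midpoint
print(pairs_toy)
```

These reproduce the peak_midpoint values (6017, 17887, 20716, 22384, 22384) and the distance_TSS values (1512, 3489, 660, 2771, -432) listed above.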

5.3 Interpreting results and plotting

Now that we have the genes from collateGenes(), they can be added as a layer to the previous plots. This layer uses the gene positions as a guide, together with a TxDb or EnsDb object, and builds on the autoplot generic from ggbio.

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000
) +
    geom_dm(dm_results) +
    geom_peak(peaks) +
    geom_gatc(gatc_sites) +
    geom_genes.tx(genes, txdb)
#> If gene is disproportional to the plot, use gene_limits = c(y1,y2). If gene is too large, recommend setting to c(0,2) and adjusting the plot.height accordingly.
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
#> Parsing transcripts...
#> Parsing exons...
#> Parsing cds...
#> Parsing utrs...
#> ------exons...
#> ------cdss...
#> ------introns...
#> ------utr...
#> aggregating...
#> Done
#> Constructing graphics...
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.

plotWrap(
    id = peaks[1, ]$peak_id,
    counts.df = counts.df,
    dm_results.df = dm_results,
    peaks.df = peaks,
    gatc_sites.df = gatc_sites,
    genes.df = genes, txdb = txdb
)

plotWrap(
    id = genes[1, ]$ensembl_gene_id,
    counts.df = counts.df,
    dm_results.df = dm_results,
    peaks.df = peaks,
    gatc_sites.df = gatc_sites,
    genes.df = genes, txdb = txdb
)
#> No data available for this region

6. Gene ontology

One of the last steps in a typical DamID analysis is gene ontology analysis. However, a key mistake made in this analysis with any common data type, including RNA-seq, is the lack of bias correction. We correct for this by utilising the goseq package. Without bias correction, the ontology results would be biased towards longer genes: the longer the gene, the more likely it is to have a peak associated with it, and therefore to be called as significant. This can be seen in the goodness of fit plot produced by testGeneOntology below.

6.1 GO analysis with goseq

testGeneOntology identifies the over-represented GO terms from the peak data, correcting for the number of GATC regions matching to each gene.

It returns 3 outputs:
  • a plot of the goodness of fit of the model
  • signif_results: data.frame of significant GO category results, ordered by p-value
  • prob_weights: data.frame of probability weights for each gene

ontology <- testGeneOntology(annotation$all, genes, regions = regions, extend_by = 2000)
#> Bias will be n_regions that are contained within the gene length
#> 
#> Warning in pcls(G): initial point very close to some inequality constraints
#> Fetching GO annotations...
#> For 4487 genes, we could not find any categories. These genes will be excluded.
#> To force their use, please run with use_genes_without_cat=TRUE (see documentation).
#> This was the default behavior for version 1.15.1 and earlier.
#> Calculating the p-values...
#> 'select()' returned 1:1 mapping between keys and columns

The goodness of fit plot above shows that there is a length-based bias in the data. The x axis shows the number of GATC regions in each gene. The y axis shows the proportion of genes with that number of GATC regions that have been identified as significant. As the number of GATC regions in a gene increases, so does the proportion of genes that are significant.
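Why such a bias arises can be seen in a small simulation. The sketch below assumes, purely for illustration, that every GATC region has the same small chance of falling in a peak regardless of biology; the gene counts, the per-region probability and the bin boundaries are all made-up values, and this is not the model goseq fits.

```r
set.seed(1)

# Hypothetical genes with between 1 and 100 GATC regions each.
n_genes <- 5000
n_gatc <- sample(1:100, n_genes, replace = TRUE)

# Assume each region independently has a 1% chance of being in a peak;
# a gene is "significant" if at least one of its regions is hit.
p_region <- 0.01
has_peak <- rbinom(n_genes, size = 1, prob = 1 - (1 - p_region)^n_gatc)

# Proportion of significant genes within bins of region count.
bins <- cut(n_gatc, breaks = c(0, 25, 50, 75, 100))
prop_sig <- tapply(has_peak, bins, mean)
print(round(prop_sig, 2))
```

Even with no biological signal at all, genes with more GATC regions are called significant more often, which is the trend that the probability weights in prob_weights are meant to correct for.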

knitr::kable(head(ontology$signif_results), digits = 32)
| category | over_represented_pvalue | under_represented_pvalue | numDEInCat | numInCat | term | ontology | FDR |
|:---|---:|---:|---:|---:|:---|:---|---:|
| GO:0043231 | 3.827532e-15 | 1 | 591 | 5063 | intracellular membrane-bounded organelle | CC | 4.599162e-11 |
| GO:0043227 | 1.903928e-14 | 1 | 608 | 5263 | membrane-bounded organelle | CC | 1.143880e-10 |
| GO:0043229 | 2.583977e-13 | 1 | 655 | 5848 | intracellular organelle | CC | 1.034969e-09 |
| GO:0043226 | 5.620468e-13 | 1 | 665 | 5969 | organelle | CC | 1.688389e-09 |
| GO:0019222 | 5.811012e-11 | 1 | 304 | 2284 | regulation of metabolic process | BP | 1.239978e-07 |
| GO:0060255 | 6.191635e-11 | 1 | 285 | 2112 | regulation of macromolecule metabolic process | BP | 1.239978e-07 |

knitr::kable(head(ontology$prob_weights), digits = 32)
| | DEgenes | bias.data | pwf |
|:---|---:|---:|---:|
| FBgn0031208 | 1 | 9 | 0.06763115 |
| FBgn0002121 | 1 | 46 | 0.11235589 |
| FBgn0031209 | 1 | 15 | 0.07541055 |
| FBgn0263584 | 1 | 13 | 0.07282939 |
| FBgn0051973 | 1 | 100 | 0.15152065 |
| FBgn0267987 | 0 | 10 | 0.06893460 |
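
To make the prob_weights columns concrete, the sketch below types in the six rows printed above by hand and sorts them by bias.data. It assumes, from the message printed by testGeneOntology, that bias.data is the number of GATC regions within each gene, and it reads pwf as the fitted probability weight for that gene. It is an illustration on six rows only, not a check of the full model.

```r
# The six rows printed by head(ontology$prob_weights), typed in by hand.
prob_weights <- data.frame(
    DEgenes = c(1, 1, 1, 1, 1, 0),
    bias.data = c(9, 46, 15, 13, 100, 10),
    pwf = c(0.06763115, 0.11235589, 0.07541055,
            0.07282939, 0.15152065, 0.06893460),
    row.names = c("FBgn0031208", "FBgn0002121", "FBgn0031209",
                  "FBgn0263584", "FBgn0051973", "FBgn0267987")
)

# Sort by the bias measure (assumed here to be the GATC region count).
by_regions <- prob_weights[order(prob_weights$bias.data), ]
print(by_regions)

# In these six rows the weight rises with the number of regions.
pwf_increases <- all(diff(by_regions$pwf) > 0)
print(pwf_increases)
```

In these rows, the gene with 100 regions has a weight more than twice that of the gene with 9 regions, which is the bias that the ontology test accounts for.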

As expected, the significant gene ontology terms surround developmental processes, as the fusion gene in the example data (Scalloped) is well known to be involved in development.

plotGeneOntology can be used to plot the top 10 results.

plotGeneOntology(ontology$signif_results)

In the plot above, the y-axis shows the GO category and the x-axis shows the FDR. The size of each dot represents the number of genes in the GO category, and its colour represents the ontology (Biological Process, Cellular Component, or Molecular Function).
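
For readers wondering how the FDR relates to the p-values, the sketch below recomputes it for the six rows of signif_results printed earlier. This section does not state which adjustment method is used, so two things here are assumptions: that the adjustment is Benjamini-Hochberg, and that 12016 categories were tested in total (a number inferred from the ratio of FDR to p-value in the first row, not taken from the package). The calculation is only valid because the six rows shown are the six smallest p-values.

```r
# p-values and FDR values copied from head(ontology$signif_results).
p_values <- c(3.827532e-15, 1.903928e-14, 2.583977e-13,
              5.620468e-13, 5.811012e-11, 6.191635e-11)
reported_fdr <- c(4.599162e-11, 1.143880e-10, 1.034969e-09,
                  1.688389e-09, 1.239978e-07, 1.239978e-07)

# Assumption: total number of tested GO categories, inferred from
# reported_fdr[1] / p_values[1]; it is not stated in the text.
n_tests <- 12016

# Benjamini-Hochberg adjustment, treating these as the 6 smallest of
# n_tests p-values.
fdr <- p.adjust(p_values, method = "BH", n = n_tests)

print(data.frame(p_values, fdr, reported_fdr))
```

Under these assumptions the recomputed values agree with the FDR column to the printed precision, including the tie in the last two rows, which arises because Benjamini-Hochberg never lets an adjusted value exceed that of a larger p-value.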

Appendix

sessionInfo()
#> R Under development (unstable) (2024-03-18 r86148)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] biomaRt_2.59.1                            
#>  [2] org.Dm.eg.db_3.18.0                       
#>  [3] TxDb.Dmelanogaster.UCSC.dm6.ensGene_3.12.0
#>  [4] GenomicFeatures_1.55.4                    
#>  [5] AnnotationDbi_1.65.2                      
#>  [6] Biobase_2.63.0                            
#>  [7] BSgenome.Dmelanogaster.UCSC.dm6_1.4.1     
#>  [8] BSgenome_1.71.2                           
#>  [9] rtracklayer_1.63.1                        
#> [10] BiocIO_1.13.0                             
#> [11] Biostrings_2.71.4                         
#> [12] XVector_0.43.1                            
#> [13] GenomicRanges_1.55.4                      
#> [14] GenomeInfoDb_1.39.9                       
#> [15] IRanges_2.37.1                            
#> [16] S4Vectors_0.41.5                          
#> [17] BiocGenerics_0.49.1                       
#> [18] Damsel_0.99.1                             
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          rstudioapi_0.15.0          
#>   [3] jsonlite_1.8.8              magrittr_2.0.3             
#>   [5] farver_2.1.1                rmarkdown_2.26             
#>   [7] zlibbioc_1.49.3             vctrs_0.6.5                
#>   [9] memoise_2.0.1               Rsamtools_2.19.4           
#>  [11] RCurl_1.98-1.14             base64enc_0.1-3            
#>  [13] htmltools_0.5.7             S4Arrays_1.3.6             
#>  [15] progress_1.2.3              curl_5.2.1                 
#>  [17] SparseArray_1.3.4           Formula_1.2-5              
#>  [19] sass_0.4.9                  bslib_0.6.1                
#>  [21] htmlwidgets_1.6.4           plyr_1.8.9                 
#>  [23] httr2_1.0.0                 cachem_1.0.8               
#>  [25] GenomicAlignments_1.39.4    lifecycle_1.0.4            
#>  [27] pkgconfig_2.0.3             Matrix_1.6-5               
#>  [29] R6_2.5.1                    fastmap_1.1.1              
#>  [31] GenomeInfoDbData_1.2.11     MatrixGenerics_1.15.0      
#>  [33] digest_0.6.35               colorspace_2.1-0           
#>  [35] GGally_2.2.1                OrganismDbi_1.45.1         
#>  [37] patchwork_1.2.0             goseq_1.55.0               
#>  [39] Hmisc_5.1-2                 RSQLite_2.3.5              
#>  [41] filelock_1.0.3              labeling_0.4.3             
#>  [43] fansi_1.0.6                 mgcv_1.9-1                 
#>  [45] httr_1.4.7                  abind_1.4-5                
#>  [47] compiler_4.4.0              bit64_4.0.5                
#>  [49] withr_3.0.0                 backports_1.4.1            
#>  [51] htmlTable_2.4.2             BiocParallel_1.37.1        
#>  [53] DBI_1.2.2                   ggstats_0.5.1              
#>  [55] BiasedUrn_2.0.11            highr_0.10                 
#>  [57] geneLenDataBase_1.39.0      rappdirs_0.3.3             
#>  [59] DelayedArray_0.29.9         rjson_0.2.21               
#>  [61] tools_4.4.0                 foreign_0.8-86             
#>  [63] nnet_7.3-19                 glue_1.7.0                 
#>  [65] restfulr_0.0.15             nlme_3.1-164               
#>  [67] grid_4.4.0                  checkmate_2.3.1            
#>  [69] cluster_2.1.6               reshape2_1.4.4             
#>  [71] generics_0.1.3              gtable_0.3.4               
#>  [73] ensembldb_2.27.1            tidyr_1.3.1                
#>  [75] data.table_1.15.2           hms_1.1.3                  
#>  [77] xml2_1.3.6                  utf8_1.2.4                 
#>  [79] pillar_1.9.0                stringr_1.5.1              
#>  [81] limma_3.59.6                splines_4.4.0              
#>  [83] dplyr_1.1.4                 BiocFileCache_2.11.1       
#>  [85] lattice_0.22-6              bit_4.0.5                  
#>  [87] biovizBase_1.51.0           RBGL_1.79.0                
#>  [89] tidyselect_1.2.1            GO.db_3.18.0               
#>  [91] locfit_1.5-9.9              knitr_1.45                 
#>  [93] gridExtra_2.3               ProtGenerics_1.35.4        
#>  [95] ggbio_1.51.0                edgeR_4.1.18               
#>  [97] SummarizedExperiment_1.33.3 xfun_0.42                  
#>  [99] statmod_1.5.0               matrixStats_1.2.0          
#> [101] stringi_1.8.3               lazyeval_0.2.2             
#> [103] yaml_2.3.8                  evaluate_0.23              
#> [105] codetools_0.2-19            tibble_3.2.1               
#> [107] graph_1.81.0                BiocManager_1.30.22        
#> [109] cli_3.6.2                   rpart_4.1.23               
#> [111] munsell_0.5.0               jquerylib_0.1.4            
#> [113] dichromat_2.0-0.1           Rcpp_1.0.12                
#> [115] dbplyr_2.5.0                png_0.1-8                  
#> [117] XML_3.99-0.16.1             parallel_4.4.0             
#> [119] ggplot2_3.5.0               blob_1.2.4                 
#> [121] prettyunits_1.2.0           AnnotationFilter_1.27.0    
#> [123] plyranges_1.23.0            bitops_1.0-7               
#> [125] txdbmaker_0.99.6            VariantAnnotation_1.49.6   
#> [127] scales_1.3.0                purrr_1.0.2                
#> [129] crayon_1.5.2                rlang_1.1.3                
#> [131] KEGGREST_1.43.0