--- title: "syntenet as a synteny detection tool" author: - name: Fabricio Almeida-Silva affiliation: - VIB-UGent Center for Plant Systems Biology, Ghent, Belgium - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium - name: Tao Zhao affiliation: - State Key Laboratory of Crop Stress Biology for Arid Areas/Shaanxi Key Laboratory of Apple, College of Horticulture, Northwest A&F University, Yangling, China - name: Kristian K Ullrich affiliation: - Department of Evolutionary Biology, Max Planck Institute For Evolutionary Biology, Ploen, Germany - name: Yves Van de Peer affiliation: - VIB-UGent Center for Plant Systems Biology, Ghent, Belgium - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium - College of Horticulture, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China - Center for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, South Africa output: BiocStyle::html_document: toc: true toc_float: true number_sections: yes bibliography: vignette_bibliography.bib vignette: > %\VignetteIndexEntry{syntenet as a synteny detection tool} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ) ``` # Introduction Although `r BiocStyle::Biocpkg("syntenet")` was designed for large-scale synteny analyses (i.e., with tens of genomes), users might also be interested in using `r BiocStyle::Biocpkg("syntenet")` as a simple synteny detection tool. For example, one can use `r BiocStyle::Biocpkg("syntenet")` to identify syntenic regions in a genome (intraspecies synteny), or to identify syntenic regions between the genomes of two species (interspecies synteny). To detect synteny, `r BiocStyle::Biocpkg("syntenet")` uses its own implementation of the MCScanX algorithm [@wang2012mcscanx], ported to R thanks to the `r BiocStyle::CRANpkg("Rcpp")` package for R and C++ integration. In this vignette, we will guide you on how to perform intra and interspecies synteny with `r BiocStyle::Biocpkg("syntenet")`. ```{r load_package, message=FALSE} library(syntenet) ``` # Detecting intragenome synteny To detect intragenome (or intraspecies) synteny, you will use the function `intraspecies_synteny`. As input, you will need: 1. A `GRangesList` object containing the processed annotation for your species of interest, as returned by `process_input()`. 2. A list of data frames with the output of similarity search programs (e.g., DIAMOND, BLAST, etc). **Only intragenome comparisons must be included.** This can be obtained with `run_diamond(seq, compare = "intraspecies")`. To demonstrate the usage of `intraspecies_synteny()`, let's identify syntenic regions in the genome of *Saccharomyces cerevisiae*. The processed annotation and DIAMOND output are stored in the example data sets `scerevisiae_annot` and `scerevisiae_diamond`. ```{r data_intra} # Load example data sets data(scerevisiae_annot) head(scerevisiae_annot) data(scerevisiae_diamond) names(scerevisiae_diamond) head(scerevisiae_diamond$Scerevisiae_Scerevisiae) ``` Once we have the data, we can detect synteny with `intraspecies_synteny()`. This function returns the path to the *.collinearity* files generated by MCScanX [@wang2012mcscanx], which can be read and parsed with the `parse_collinearity()` function. ```{r intra_syn} # Detect intragenome synteny intra_syn <- intraspecies_synteny( scerevisiae_diamond, scerevisiae_annot ) intra_syn # see where the .collinearity file is # Get anchor pairs from .collinearity file anchors <- parse_collinearity(intra_syn) head(anchors) ``` By default, the output of `parse_collinearity()` is a 2-column data frame with anchor pairs in syntenic regions. You can also extract data on each synteny block (or synteny region) by specifying `as = 'blocks'`. ```{r parse_blocks} # Get synteny block information with `parse_collinearity()` blocks <- parse_collinearity(intra_syn, as = "blocks") head(blocks) ``` You can even extract both anchor pairs and synteny block data at once in a single data frame. ```{r parse_all} # Get anchors and block data with `parse_collinearity()` intrasyn_all <- parse_collinearity(intra_syn, as = "all") head(intrasyn_all) ``` # Detecting intergenome synteny To detect intergenome (or interspecies) synteny, you will use the function `interspecies_synteny`, which works very similarly to `intraspecies_synteny()`. As input, you will need: 1. A `GRangesList` object containing the processed annotation for your species of interest (2+ species), as returned by `process_input()`. 2. A list of data frames with the output of similarity search programs (e.g., DIAMOND, BLAST, etc). **Only intergenome comparisons must be included.** This can be obtained with `run_diamond(seq, compare = compare_df)`. Importantly, to detect intergenome synteny, the MCScanX algorithm requires bidirectional BLAST hits. For instance, if you're trying to detect synteny between `spA` and `spB`, you need to perform DIAMOND/BLAST searches in both directions: `spA_spB` and `spB_spA`. You can create a data frame specifying these comparisons as follows: ```{r make_bidirectional} # Create a data frame with comparisons to be made comp <- data.frame( query = "spA", db = "spB" ) comp # Make comparisons bidirectional comp_bi <- make_bidirectional(comp) comp_bi ``` Then, you can peform similarity searches for these comparisons with: ```{r run_diamond_compare, eval = FALSE} # NOTE: Not executed because object `seq` doesn't exist; for demo only dmd_inter <- run_diamond(seq, compare = comp_bi) ``` The code above would generate a list of two data frames with DIAMOND tables: one named `spA_spB`, and another one named `spB_spA`. To demonstrate what this looks like, we will load the example data set we will use to detect intergenome synteny. Here, we will detect syntenic regions between the genomes of the algae *Ostreococcus lucimarinus* and *Ostreococcus sp RCC809*. The list of DIAMOND tables is stored in object `blast_list`, and we will create the processed annotation from objects `proteomes` and `annotation`. ```{r example_inter} # Load list of DIAMOND tables data(blast_list) names(blast_list) algae_inter <- blast_list[c(2,3)] # keep only intergenome comparisons names(algae_inter) # Get processed annotation data(proteomes) # A list of `AAStringSet` objects data(annotation) # A `GRangesList` object pdata <- process_input(proteomes, annotation) names(pdata$annotation) ``` Now that we have a list of DIAMOND tables with bidirectional comparisons between *Olucimarinus* and *OspRCC809*, as well as annotation for these two genomes, we can detect synteny with `interspecies_synteny()`. Like `intraspecies_synteny()`, this function will return the path to a *.collinearity* file generated by MCScanX, and we can parse this file with `parse_collinearity()`. ```{r inter_syn} # Detect interspecies synteny intersyn <- interspecies_synteny(algae_inter, pdata$annotation) intersyn # see where the .collinearity file is # Parse collinearity file ## 1) Get anchor pairs algae_anchors <- parse_collinearity(intersyn) head(algae_anchors) ## 2) Get synteny blocks algae_blocks <- parse_collinearity(intersyn, as = "blocks") head(algae_blocks) ## 3) Get all data combined (blocks + anchor pairs) algae_all <- parse_collinearity(intersyn, as = "all") head(algae_all) ``` # Session information {.unnumbered} This document was created under the following conditions: ```{r sessionInfo} sessionInfo() ``` # References {.unnumbered}