Although syntenet was designed for large-scale synteny analyses (i.e., with tens of genomes), users might also be interested in using syntenet as a simple synteny detection tool. For example, one can use syntenet to identify syntenic regions in a genome (intraspecies synteny), or to identify syntenic regions between the genomes of two species (interspecies synteny). To detect synteny, syntenet uses its own implementation of the MCScanX algorithm (Wang et al. 2012), ported to R thanks to the Rcpp package for R and C++ integration. In this vignette, we will guide you on how to perform intra and interspecies synteny with syntenet.
library(syntenet)
To detect intragenome (or intraspecies) synteny, you will use the function
intraspecies_synteny
. As input, you will need:
A GRangesList
object containing the processed annotation for your
species of interest, as returned by process_input()
.
A list of data frames with the output of similarity search programs
(e.g., DIAMOND, BLAST, etc). Only intragenome comparisons must be included.
This can be obtained with run_diamond(seq, compare = "intraspecies")
.
To demonstrate the usage of intraspecies_synteny()
, let’s identify syntenic
regions in the genome of Saccharomyces cerevisiae. The processed annotation
and DIAMOND output are stored in the example data sets scerevisiae_annot
and scerevisiae_diamond
.
# Load example data sets
data(scerevisiae_annot)
head(scerevisiae_annot)
#> $Scerevisiae
#> GRanges object with 6600 ranges and 1 metadata column:
#> seqnames ranges strand | gene
#> <Rle> <IRanges> <Rle> | <character>
#> [1] Sce_I 335-649 * | Sce_YAL069W
#> [2] Sce_I 538-792 * | Sce_YAL068W-A
#> [3] Sce_I 1807-2169 * | Sce_YAL068C
#> [4] Sce_I 2480-2707 * | Sce_YAL067W-A
#> [5] Sce_I 7235-9016 * | Sce_YAL067C
#> ... ... ... ... . ...
#> [6596] Sce_XVI 939922-941136 * | Sce_YPR201W
#> [6597] Sce_XVI 943032-943896 * | Sce_YPR202W
#> [6598] Sce_XVI 943880-944188 * | Sce_YPR203W
#> [6599] Sce_XVI 944603-947701 * | Sce_YPR204W
#> [6600] Sce_XVI 946856-947338 * | Sce_YPR204C-A
#> -------
#> seqinfo: 17 sequences from an unspecified genome; no seqlengths
data(scerevisiae_diamond)
names(scerevisiae_diamond)
#> [1] "Scerevisiae_Scerevisiae"
head(scerevisiae_diamond$Scerevisiae_Scerevisiae)
#> query db perc_identity length mismatches gap_open qstart qend
#> 1 Sce_YLR106C Sce_YLR106C 100.0 4910 0 0 1 4910
#> 2 Sce_YKR054C Sce_YKR054C 100.0 4092 0 0 1 4092
#> 3 Sce_YHR099W Sce_YHR099W 100.0 3744 0 0 1 3744
#> 4 Sce_YDR457W Sce_YDR457W 100.0 3268 0 0 1 3268
#> 5 Sce_YDR457W Sce_YER125W 44.1 354 195 3 2913 3266
#> 6 Sce_YDR457W Sce_YJR036C 30.7 378 228 12 2913 3266
#> tstart tend evalue bitscore
#> 1 1 4910 0.00e+00 9095
#> 2 1 4092 0.00e+00 7940
#> 3 1 3744 0.00e+00 7334
#> 4 1 3268 0.00e+00 6170
#> 5 457 807 2.18e-91 315
#> 6 523 890 7.99e-44 172
Once we have the data, we can detect synteny with intraspecies_synteny()
.
This function returns the path to the .collinearity files generated by
MCScanX (Wang et al. 2012), which can be read and parsed with the
parse_collinearity()
function.
# Detect intragenome synteny
intra_syn <- intraspecies_synteny(
scerevisiae_diamond, scerevisiae_annot
)
intra_syn # see where the .collinearity file is
#> [1] "/tmp/RtmpXYAqsL/intra/Scerevisiae.collinearity"
# Get anchor pairs from .collinearity file
anchors <- parse_collinearity(intra_syn)
head(anchors)
#> Anchor1 Anchor2
#> 1 Sce_YAR050W Sce_YHR211W
#> 2 Sce_YAR060C Sce_YHR212C
#> 3 Sce_YAR064W Sce_YHR213W-B
#> 4 Sce_YAR066W Sce_YHR214W
#> 5 Sce_YAR068W Sce_YHR214W-A
#> 6 Sce_YAR069C Sce_YHR214C-D
By default, the output of parse_collinearity()
is a 2-column data
frame with anchor pairs in syntenic regions. You can also extract data on each
synteny block (or synteny region) by specifying as = 'blocks'
.
# Get synteny block information with `parse_collinearity()`
blocks <- parse_collinearity(intra_syn, as = "blocks")
head(blocks)
#> Block Block_score Chr Orientation
#> 1 0 446 Sce_I&Sce_VIII plus
#> 2 1 528 Sce_I&Sce_XV minus
#> 3 2 572 Sce_II&Sce_IV plus
#> 4 3 643 Sce_II&Sce_V minus
#> 5 4 446 Sce_II&Sce_XVI plus
#> 6 5 422 Sce_III&Sce_IV minus
You can even extract both anchor pairs and synteny block data at once in a single data frame.
# Get anchors and block data with `parse_collinearity()`
intrasyn_all <- parse_collinearity(intra_syn, as = "all")
head(intrasyn_all)
#> Block Block_score Chr Orientation Anchor1 Anchor2
#> 1 0 446 Sce_I&Sce_VIII plus Sce_YAR050W Sce_YHR211W
#> 2 0 446 Sce_I&Sce_VIII plus Sce_YAR060C Sce_YHR212C
#> 3 0 446 Sce_I&Sce_VIII plus Sce_YAR064W Sce_YHR213W-B
#> 4 0 446 Sce_I&Sce_VIII plus Sce_YAR066W Sce_YHR214W
#> 5 0 446 Sce_I&Sce_VIII plus Sce_YAR068W Sce_YHR214W-A
#> 6 0 446 Sce_I&Sce_VIII plus Sce_YAR069C Sce_YHR214C-D
To detect intergenome (or interspecies) synteny, you will use the function
interspecies_synteny
, which works very similarly to intraspecies_synteny()
.
As input, you will need:
A GRangesList
object containing the processed annotation for your
species of interest (2+ species), as returned by process_input()
.
A list of data frames with the output of similarity search programs
(e.g., DIAMOND, BLAST, etc). Only intergenome comparisons must be included.
This can be obtained with run_diamond(seq, compare = compare_df)
.
Importantly, to detect intergenome synteny, the MCScanX algorithm requires
bidirectional BLAST hits. For instance, if you’re trying to detect synteny
between spA
and spB
, you need to perform DIAMOND/BLAST searches in both
directions: spA_spB
and spB_spA
. You can create a data frame specifying
these comparisons as follows:
# Create a data frame with comparisons to be made
comp <- data.frame(
query = "spA",
db = "spB"
)
comp
#> query db
#> 1 spA spB
# Make comparisons bidirectional
comp_bi <- make_bidirectional(comp)
comp_bi
#> query db
#> 1 spA spB
#> 2 spB spA
Then, you can peform similarity searches for these comparisons with:
# NOTE: Not executed because object `seq` doesn't exist; for demo only
dmd_inter <- run_diamond(seq, compare = comp_bi)
The code above would generate a list of two data frames with DIAMOND tables:
one named spA_spB
, and another one named spB_spA
.
To demonstrate what this looks like, we will load the example data set
we will use to detect intergenome synteny. Here, we will detect syntenic
regions between the genomes of the algae Ostreococcus lucimarinus and
Ostreococcus sp RCC809. The list of DIAMOND tables is stored in object
blast_list
, and we will create the processed annotation from objects
proteomes
and annotation
.
# Load list of DIAMOND tables
data(blast_list)
names(blast_list)
#> [1] "Olucimarinus_Olucimarinus" "Olucimarinus_OspRCC809"
#> [3] "OspRCC809_Olucimarinus" "OspRCC809_OspRCC809"
algae_inter <- blast_list[c(2,3)] # keep only intergenome comparisons
names(algae_inter)
#> [1] "Olucimarinus_OspRCC809" "OspRCC809_Olucimarinus"
# Get processed annotation
data(proteomes) # A list of `AAStringSet` objects
data(annotation) # A `GRangesList` object
pdata <- process_input(proteomes, annotation)
names(pdata$annotation)
#> [1] "Olucimarinus" "OspRCC809"
Now that we have a list of DIAMOND tables with bidirectional comparisons
between Olucimarinus and OspRCC809, as well as annotation for these two
genomes, we can detect synteny with interspecies_synteny()
. Like
intraspecies_synteny()
, this function will return the path to
a .collinearity file generated by MCScanX, and we can parse this file
with parse_collinearity()
.
# Detect interspecies synteny
intersyn <- interspecies_synteny(algae_inter, pdata$annotation)
intersyn # see where the .collinearity file is
#> [1] "/tmp/RtmpXYAqsL/inter/Olucimarinus_OspRCC809.collinearity"
# Parse collinearity file
## 1) Get anchor pairs
algae_anchors <- parse_collinearity(intersyn)
head(algae_anchors)
#> Anchor1 Anchor2
#> 1 Olu_OL01G00100 Osp_ORCC809_01G06480
#> 2 Olu_OL01G00130 Osp_ORCC809_01G06440
#> 3 Olu_OL01G00150 Osp_ORCC809_01G06420
#> 4 Olu_OL01G00160 Osp_ORCC809_01G06410
#> 5 Olu_OL01G00170 Osp_ORCC809_01G06400
#> 6 Olu_OL01G00180 Osp_ORCC809_01G06390
## 2) Get synteny blocks
algae_blocks <- parse_collinearity(intersyn, as = "blocks")
head(algae_blocks)
#> Block Block_score Chr Orientation
#> 1 0 27234 Olu_Chr_1&Osp_chr_1 minus
#> 2 1 299 Olu_Chr_2&Osp_chr_2 plus
#> 3 2 296 Olu_Chr_2&Osp_chr_2 plus
#> 4 3 287 Olu_Chr_2&Osp_chr_2 plus
#> 5 4 7548 Olu_Chr_2&Osp_chr_2 minus
#> 6 5 4578 Olu_Chr_2&Osp_chr_2 minus
## 3) Get all data combined (blocks + anchor pairs)
algae_all <- parse_collinearity(intersyn, as = "all")
head(algae_all)
#> Block Block_score Chr Orientation Anchor1
#> 1 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00100
#> 2 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00130
#> 3 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00150
#> 4 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00160
#> 5 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00170
#> 6 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00180
#> Anchor2
#> 1 Osp_ORCC809_01G06480
#> 2 Osp_ORCC809_01G06440
#> 3 Osp_ORCC809_01G06420
#> 4 Osp_ORCC809_01G06410
#> 5 Osp_ORCC809_01G06400
#> 6 Osp_ORCC809_01G06390
This document was created under the following conditions:
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggtree_3.17.1 syntenet_1.11.2 BiocStyle_2.37.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4
#> [3] farver_2.1.2 Biostrings_2.77.2
#> [5] bitops_1.0-9 lazyeval_0.2.2
#> [7] fastmap_1.2.0 RCurl_1.98-1.17
#> [9] GenomicAlignments_1.45.1 XML_3.99-0.18
#> [11] digest_0.6.37 lifecycle_1.0.4
#> [13] cluster_2.1.8.1 tidytree_0.4.6
#> [15] magrittr_2.0.3 compiler_4.5.1
#> [17] rlang_1.1.6 sass_0.4.10
#> [19] tools_4.5.1 igraph_2.1.4
#> [21] yaml_2.3.10 rtracklayer_1.69.1
#> [23] knitr_1.50 htmlwidgets_1.6.4
#> [25] S4Arrays_1.9.1 labeling_0.4.3
#> [27] curl_6.4.0 DelayedArray_0.35.2
#> [29] RColorBrewer_1.1-3 aplot_0.2.8
#> [31] abind_1.4-8 BiocParallel_1.43.4
#> [33] Rtsne_0.17 purrr_1.1.0
#> [35] withr_3.0.2 BiocGenerics_0.55.0
#> [37] grid_4.5.1 stats4_4.5.1
#> [39] data.tree_1.1.0 ggplot2_3.5.2
#> [41] scales_1.4.0 MASS_7.3-65
#> [43] dichromat_2.0-0.1 SummarizedExperiment_1.39.1
#> [45] cli_3.6.5 rmarkdown_2.29
#> [47] crayon_1.5.3 treeio_1.33.0
#> [49] generics_0.1.4 httr_1.4.7
#> [51] labdsv_2.1-0 rjson_0.2.23
#> [53] ape_5.8-1 cachem_1.1.0
#> [55] splines_4.5.1 network_1.19.0
#> [57] parallel_4.5.1 ggplotify_0.1.2
#> [59] BiocManager_1.30.26 XVector_0.49.0
#> [61] restfulr_0.0.16 yulab.utils_0.2.0
#> [63] matrixStats_1.5.0 vctrs_0.6.5
#> [65] Matrix_1.7-3 jsonlite_2.0.0
#> [67] bookdown_0.43 patchwork_1.3.1
#> [69] gridGraphics_0.5-1 IRanges_2.43.0
#> [71] S4Vectors_0.47.0 tidyr_1.3.1
#> [73] jquerylib_0.1.4 intergraph_2.0-4
#> [75] glue_1.8.0 statnet.common_4.12.0
#> [77] codetools_0.2-20 stringi_1.8.7
#> [79] gtable_0.3.6 BiocIO_1.19.0
#> [81] GenomicRanges_1.61.1 tibble_3.3.0
#> [83] pillar_1.11.0 ggnetwork_0.5.13
#> [85] htmltools_0.5.8.1 Seqinfo_0.99.2
#> [87] R6_2.6.1 networkD3_0.4.1
#> [89] evaluate_1.0.4 lattice_0.22-7
#> [91] Biobase_2.69.0 Rsamtools_2.25.1
#> [93] pheatmap_1.0.13 ggfun_0.2.0
#> [95] bslib_0.9.0 Rcpp_1.1.0
#> [97] coda_0.19-4.1 SparseArray_1.9.1
#> [99] nlme_3.1-168 mgcv_1.9-3
#> [101] xfun_0.52 fs_1.6.6
#> [103] MatrixGenerics_1.21.0 pkgconfig_2.0.3
Wang, Yupeng, Haibao Tang, Jeremy D DeBarry, Xu Tan, Jingping Li, Xiyin Wang, Tae-ho Lee, et al. 2012. “MCScanX: A Toolkit for Detection and Evolutionary Analysis of Gene Synteny and Collinearity.” Nucleic Acids Research 40 (7): e49–e49.