1 Introduction

Synteny analysis identifies conserved gene content and gene order (collinearity) in a genomic segment, and it is often used to study how genomic rearrangements have shaped genomes during the course of evolution. However, accurate detection of syntenic blocks depends heavily on parameters such as the number of top hits in similarity searches (e.g., with BLAST (Altschul et al. 1990) or DIAMOND (Buchfink, Xie, and Huson 2015)), the number of anchors, and the number of upstream and downstream genes searched when looking for syntenic blocks. Zhao and Schranz (2019) proposed a network-based synteny analysis that allows the identification of optimal parameters using the network’s average clustering coefficient and number of nodes. The algorithm for synteny network inference has been implemented in the Bioconductor package syntenet.

2 Installation

if(!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')
BiocManager::install("cogeqc")
# Load package after installation
library(cogeqc)
set.seed(123) # for reproducibility

3 Data description

Here, we will use a subset of the synteny network inferred in Zhao and Schranz (2019). The subset contains the anchor pairs for Brassica oleracea, B. napus, and B. rapa.

# Load the example synteny network
data(synnet)

head(synnet)
#>             anchor1        anchor2
#> 1 bnp_BnaA01g05780D bol_Bo1g011310
#> 2 bnp_BnaA01g05800D bol_Bo1g011320
#> 3 bnp_BnaA01g05810D bol_Bo1g011330
#> 4 bnp_BnaA01g05820D bol_Bo1g011340
#> 5 bnp_BnaA01g05830D bol_Bo1g011350
#> 6 bnp_BnaA01g05840D bol_Bo1g011360
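
Each row of the network is an edge between two anchor genes. As a minimal sketch, the chunk below shows one way to summarize which species pairs the edges connect. It assumes that the text before the first underscore in each gene name is a species identifier, which is how the rows printed above appear to be formatted; the small `edges` data frame is copied from those rows so the chunk runs on its own.

```r
# First three rows of the edge list shown above
edges <- data.frame(
  anchor1 = c("bnp_BnaA01g05780D", "bnp_BnaA01g05800D", "bnp_BnaA01g05810D"),
  anchor2 = c("bol_Bo1g011310", "bol_Bo1g011320", "bol_Bo1g011330")
)

# Assumption: the species ID is everything before the first underscore
species1 <- sub("_.*", "", edges$anchor1)
species2 <- sub("_.*", "", edges$anchor2)

# Number of edges per species pair
table(paste(species1, species2, sep = "-"))
```

The same two `sub()` calls can be applied to the full `synnet` object to check that all expected species are represented.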

4 Network-based assessment of synteny identification

To assess synteny detection, we calculate a synteny network score as follows:

\[ \begin{aligned} S &= C N, \text{ where:} \\ \\ C &= \text{Clustering coefficient} \\ N &= \text{Number of nodes} \end{aligned} \]

The network with the highest score is considered the most accurate. To score a network, you will use the function assess_synnet().

assess_synnet(synnet)
#>         CC Node_number    Score
#> 1 0.877912      149144 130935.3
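
To make the score concrete, the sketch below recomputes C, N, and S for a tiny made-up network using base R only. It is an illustration, not the package's implementation: the helper `score_network()` and the gene names are hypothetical, and the sketch assumes that C is the average local clustering coefficient taken over nodes with at least two neighbors. The variant used by assess_synnet() may differ.

```r
# Toy undirected edge list in the same two-column format as `synnet`.
# Gene names are made up for illustration.
toy_net <- data.frame(
  anchor1 = c("spA_g1", "spA_g1", "spB_g1", "spC_g1"),
  anchor2 = c("spB_g1", "spC_g1", "spC_g1", "spD_g1")
)

# Hypothetical helper: score an edge list as S = C * N
score_network <- function(edges) {
  a1 <- as.character(edges$anchor1)
  a2 <- as.character(edges$anchor2)
  nodes <- unique(c(a1, a2))
  n <- length(nodes)

  # Symmetric 0/1 adjacency matrix (only practical for small networks)
  adj <- matrix(0, n, n, dimnames = list(nodes, nodes))
  adj[cbind(a1, a2)] <- 1
  adj[cbind(a2, a1)] <- 1
  diag(adj) <- 0

  degree <- rowSums(adj)
  # diag(adj^3) counts closed walks of length 3, i.e. twice the triangles
  triangles <- diag(adj %*% adj %*% adj) / 2
  possible <- degree * (degree - 1) / 2

  # Local clustering coefficient; undefined for nodes with degree < 2,
  # so those nodes are left out of the average (an assumption)
  local_cc <- triangles / possible
  cc <- mean(local_cc[degree >= 2])

  data.frame(CC = cc, Node_number = n, Score = cc * n)
}

score_network(toy_net)

# Sanity check against the values printed by assess_synnet(synnet) above
0.877912 * 149144
```

In the toy network, three nodes form a triangle and a fourth node hangs off one corner, so C = (1 + 1 + 1/3) / 3 and N = 4. The last line reproduces the reported score of about 130935.3 from the reported CC and node number.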

Ideally, you should be able to run a synteny detection program (e.g., MCScanX (Wang et al. 2012) or i-ADHoRe (Proost et al. 2012)) with multiple combinations of parameters, infer a synteny network for each run, and assess each network to pick the best. To demonstrate this, let’s simulate different networks by randomly removing edges from the original network and calculate scores for each of them with the wrapper function assess_synnet_list().

net1 <- synnet
net2 <- synnet[-sample(1:10000, 500), ]
net3 <- synnet[-sample(1:10000, 1000), ]
synnet_list <- list(
  net1 = net1, 
  net2 = net2, 
  net3 = net3
)
synnet_assessment <- assess_synnet_list(synnet_list)
synnet_assessment
#>          CC Node_number    Score Network
#> 1 0.8779120      149144 130935.3    net1
#> 2 0.8769428      149133 130781.1    net2
#> 3 0.8758974      149114 130608.6    net3

# Determine the best network
synnet_assessment$Network[which.max(synnet_assessment$Score)]
#> [1] "net1"

As you can see, the first (original) network is the best one, as it has the highest score. This is expected, because net2 and net3 were created by removing edges from net1.
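
When comparing many networks, it can help to rank them and see how far each one is from the best. The sketch below does this in base R on a data frame copied from the assessment output above, so it runs without the package. The column `Pct_of_best` is a name introduced here for illustration.

```r
# Assessment table copied from the output of assess_synnet_list() above
assessment <- data.frame(
  CC = c(0.8779120, 0.8769428, 0.8758974),
  Node_number = c(149144, 149133, 149114),
  Score = c(130935.3, 130781.1, 130608.6),
  Network = c("net1", "net2", "net3")
)

# Sort from best to worst score
ranked <- assessment[order(assessment$Score, decreasing = TRUE), ]

# Express each score as a percentage of the best score
ranked$Pct_of_best <- round(100 * ranked$Score / max(ranked$Score), 2)
ranked[, c("Network", "Score", "Pct_of_best")]
```

In this simulated example the three scores differ by less than 0.3%, which reflects how few edges were removed. With real parameter sweeps the differences may be larger.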

Session information

This document was created under the following conditions:

sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] cogeqc_1.0.7     BiocStyle_2.24.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.9             ape_5.6-2              lattice_0.20-45       
#>  [4] tidyr_1.2.1            Biostrings_2.64.1      assertthat_0.2.1      
#>  [7] digest_0.6.29          utf8_1.2.2             plyr_1.8.7            
#> [10] R6_2.5.1               GenomeInfoDb_1.32.4    stats4_4.2.1          
#> [13] evaluate_0.17          highr_0.9              ggplot2_3.3.6         
#> [16] pillar_1.8.1           ggfun_0.0.7            yulab.utils_0.0.5     
#> [19] zlibbioc_1.42.0        rlang_1.0.6            lazyeval_0.2.2        
#> [22] jquerylib_0.1.4        S4Vectors_0.34.0       rmarkdown_2.17        
#> [25] labeling_0.4.2         stringr_1.4.1          igraph_1.3.5          
#> [28] RCurl_1.98-1.9         munsell_0.5.0          compiler_4.2.1        
#> [31] xfun_0.33              pkgconfig_2.0.3        BiocGenerics_0.42.0   
#> [34] gridGraphics_0.5-1     htmltools_0.5.3        tidyselect_1.2.0      
#> [37] tibble_3.1.8           GenomeInfoDbData_1.2.8 bookdown_0.29         
#> [40] IRanges_2.30.1         fansi_1.0.3            crayon_1.5.2          
#> [43] dplyr_1.0.10           bitops_1.0-7           grid_4.2.1            
#> [46] nlme_3.1-160           jsonlite_1.8.2         gtable_0.3.1          
#> [49] lifecycle_1.0.3        DBI_1.1.3              magrittr_2.0.3        
#> [52] scales_1.2.1           tidytree_0.4.1         cli_3.4.1             
#> [55] stringi_1.7.8          cachem_1.0.6           farver_2.1.1          
#> [58] reshape2_1.4.4         XVector_0.36.0         ggtree_3.4.4          
#> [61] bslib_0.4.0            generics_0.1.3         vctrs_0.4.2           
#> [64] tools_4.2.1            treeio_1.20.2          ggplotify_0.1.0       
#> [67] glue_1.6.2             purrr_0.3.5            parallel_4.2.1        
#> [70] fastmap_1.1.0          yaml_2.3.5             colorspace_2.0-3      
#> [73] BiocManager_1.30.18    aplot_0.1.8            knitr_1.40            
#> [76] patchwork_1.1.2        sass_0.4.2

References

Altschul, Stephen F, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. 1990. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215 (3): 403–10.

Buchfink, Benjamin, Chao Xie, and Daniel H Huson. 2015. “Fast and Sensitive Protein Alignment Using DIAMOND.” Nature Methods 12 (1): 59–60.

Proost, Sebastian, Jan Fostier, Dieter De Witte, Bart Dhoedt, Piet Demeester, Yves Van de Peer, and Klaas Vandepoele. 2012. “I-Adhore 3.0—Fast and Sensitive Detection of Genomic Homology in Extremely Large Data Sets.” Nucleic Acids Research 40 (2): e11–e11.

Wang, Yupeng, Haibao Tang, Jeremy D DeBarry, Xu Tan, Jingping Li, Xiyin Wang, Tae-ho Lee, et al. 2012. “MCScanX: A Toolkit for Detection and Evolutionary Analysis of Gene Synteny and Collinearity.” Nucleic Acids Research 40 (7): e49–e49.

Zhao, Tao, and M Eric Schranz. 2019. “Network-Based Microsynteny Analysis Identifies Major Differences and Genomic Outliers in Mammalian and Angiosperm Genomes.” Proceedings of the National Academy of Sciences 116 (6): 2165–74.