SynExtend
is a package of tools for working with objects of class Synteny
built from the package DECIPHER
’s FindSynteny()
function.
Synteny maps provide a powerful tool for quantifying and visualizing where pairs of genomes share order. Typically these maps are built from predictions of orthologous pairs, where groups of pairs that provide contiguous and sequential blocks in their respective genomes are deemed a ‘syntenic block’. That designation of synteny can then used to further interogate the predicted orthologs themselves, or query topics like genomic rearrangements or ancestor genome reconstruction.
FindSynteny
takes a different approach, finding exactly matched shared k-mers and determining where shared k-mers, or blocks of proximal shared k-mers are significant. Combining the information generated by FindSynteny
with locations of genomic features allows us to simply mark where features are linked by syntenic k-mers. These linked features represent potential orthologous pairs, and can be easily evaluated on the basis of the syntenic k-mers that they share, or alignment.
Currently SynExtend
contains only one set of functions, but will be expanded in the future.
SynExtend
in R by running the following commands:Using the FindSynteny
function in DECIPHER
build an object of class Synteny
. In this tutorial, a prebuilt DECIPHER
database is used. For database construction see ?Seqs2DB
in DECIPHER
. This example starts with a database containing three archaea genomes: Nitrosopumilus adriaticus, Nitrosopumilus piranensis, and a Candidatus Nitrosopumilus.
## Loading required package: DECIPHER
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## Loading required package: RSQLite
## Loading required package: igraph
##
## Attaching package: 'igraph'
## The following object is masked from 'package:Biostrings':
##
## union
## The following object is masked from 'package:XVector':
##
## path
## The following object is masked from 'package:IRanges':
##
## union
## The following object is masked from 'package:S4Vectors':
##
## union
## The following objects are masked from 'package:BiocGenerics':
##
## normalize, path, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
DBPATH <- system.file("extdata",
"VignetteSeqs.sqlite",
package = "SynExtend")
# Alternatively, to build this same database using DECIPHER:
# DBPATH <- tempfile()
# FNAs <- c("ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/740/685/GCA_006740685.1_ASM674068v1/GCA_006740685.1_ASM674068v1_genomic.fna.gz",
# "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/956/175/GCA_000956175.1_ASM95617v1/GCA_000956175.1_ASM95617v1_genomic.fna.gz",
# "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/875/775/GCA_000875775.1_ASM87577v1/GCA_000875775.1_ASM87577v1_genomic.fna.gz")
# for (m1 in seq_along(FNAs)) {
# X <- readDNAStringSet(filepath = FNAs[m1])
# X <- X[order(width(X),
# decreasing = TRUE)]
#
# Seqs2DB(seqs = X,
# type = "XStringSet",
# dbFile = DBPATH,
# identifier = as.character(m1),
# verbose = TRUE)
# }
Syn <- FindSynteny(dbFile = DBPATH)
## ================================================================================
##
## Time difference of 8.16 secs
Synteny maps represent where genomes share order. Simply printing a synteny object to the console displays a gross level view of the data inside. Objects of class Synteny
can also be plotted to clear visual representations of the data inside. The genomes used in this example are all from the same genus, and should be expected to be somewhat closely related.
## 1 2 3
## 1 1 seq 43% hits 57% hits
## 2 113 blocks 1 seq 38% hits
## 3 78 blocks 146 blocks 1 seq
Data present inside objects of class Synteny
can also be accessed relatively easily. The object itself is functionally a matrix of lists, with data describing exactly matched k-mers present in the upper triangle, and data describing blocks of chained k-mers in the lower triangle. For more information see ?FindSynteny
in the package DECIPHER
.
## index1 index2 strand width start1 start2 frame1 frame2
## [1,] 1 1 0 18 932191 197117 0 0
## [2,] 1 1 0 14 932212 197138 0 0
## [3,] 1 1 0 45 932268 197194 3 1
## [4,] 1 1 0 17 932350 197276 0 0
## [5,] 1 1 0 20 932377 197303 0 0
## [6,] 1 1 0 24 932416 197342 1 2
## index1 index2 strand score start1 start2 end1 end2 first_hit
## [1,] 1 1 0 40740 932191 197117 1091978 359925 1
## [2,] 1 1 0 34439 495594 1507344 623248 1649448 1975
## [3,] 1 1 0 31996 771868 13735 918407 154427 3561
## [4,] 1 1 0 16601 328609 1298395 463732 1437247 5212
## [5,] 1 1 0 14813 705863 1746189 760663 1799668 6849
## [6,] 1 1 0 14003 40177 943557 155993 1070853 7482
## last_hit
## [1,] 1974
## [2,] 3560
## [3,] 5211
## [4,] 6848
## [5,] 7481
## [6,] 8694
The above printed objects show the data for the comparison between the first and second genome in our database.
To take advantage of these synteny maps, we can then overlay the gene calls for each genome present on top of our map.
Next, GFF annotations for the associated genomes are parsed to provide gene calls in a use-able format. GFFs are not the only possible source of appropriate gene calls, but they are the source that was used during package construction and testing. Parsed GFFs can be constructed with gffToDataFrame
, for full functionality, or GFFs can be imported via rtracklater::import()
for limited functionality. GeneCalls for both the PairSummaries
and NucleotideOverlap
functions must be named list, and those names must match dimnames(Syn)[[1]]
.
GeneCalls <- vector(mode = "list",
length = ncol(Syn))
GeneCalls[[1L]] <- gffToDataFrame(GFF = system.file("extdata",
"GCA_006740685.1_ASM674068v1_genomic.gff.gz",
package = "SynExtend"),
Verbose = TRUE)
## ================================================================================
## Time difference of 27.26634 secs
GeneCalls[[2L]] <- gffToDataFrame(GFF = system.file("extdata",
"GCA_000956175.1_ASM95617v1_genomic.gff.gz",
package = "SynExtend"),
Verbose = TRUE)
## ================================================================================
## Time difference of 34.23424 secs
GeneCalls[[3L]] <- gffToDataFrame(GFF = system.file("extdata",
"GCA_000875775.1_ASM87577v1_genomic.gff.gz",
package = "SynExtend"),
Verbose = TRUE)
## ================================================================================
## Time difference of 32.98205 secs
SynExtend
’s gffToDataFrame
function will directly import gff files into a useable format, and includes other extracted information.
## DataFrame with 6 rows and 10 columns
## Index Strand Start Stop Type ID
## <integer> <integer> <integer> <integer> <character> <character>
## 1 1 1 307 621 gene gene-Nisw_00010
## 2 1 1 673 1182 gene gene-Nisw_00015
## 3 1 0 1271 1621 gene gene-Nisw_00020
## 4 1 1 1603 1914 gene gene-Nisw_00025
## 5 1 0 2013 2225 gene gene-Nisw_00030
## 6 1 1 2222 3313 gene gene-Nisw_00035
## Range Product Coding Contig
## <IRangesList> <character> <logical> <character>
## 1 307-621 DNA-binding protein TRUE CP035425.1
## 2 673-1182 DNA-directed RNA pol.. TRUE CP035425.1
## 3 1271-1621 hypothetical protein TRUE CP035425.1
## 4 1603-1914 MarR family transcri.. TRUE CP035425.1
## 5 2013-2225 hypothetical protein TRUE CP035425.1
## 6 2222-3313 deoxyhypusine synthase TRUE CP035425.1
Raw GFF imports are also acceptable, but prevent alignments in amino acid space with PairSummaries()
.
X01 <- rtracklayer::import(system.file("extdata",
"GCA_000875775.1_ASM87577v1_genomic.gff.gz",
package = "SynExtend"))
class(X01)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
## GRanges object with 4474 ranges and 28 metadata columns:
## seqnames ranges strand | source type score
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
## [1] CP010868.1 1-1713078 + | Genbank region NA
## [2] CP010868.1 1184-3562 - | Genbank gene NA
## [3] CP010868.1 1184-3562 - | Genbank CDS NA
## [4] CP010868.1 3758-4786 + | Genbank gene NA
## [5] CP010868.1 3758-4786 + | Genbank CDS NA
## ... ... ... ... . ... ... ...
## [4470] CP010868.1 1708820-1709782 - | Genbank CDS NA
## [4471] CP010868.1 1709824-1711992 - | Genbank gene NA
## [4472] CP010868.1 1709824-1711992 - | Genbank CDS NA
## [4473] CP010868.1 1712082-1712957 - | Genbank gene NA
## [4474] CP010868.1 1712082-1712957 - | Genbank CDS NA
## phase ID Dbxref Is_circular
## <integer> <character> <CharacterList> <character>
## [1] <NA> CP010868.1:1..1713078 taxon:1582439 true
## [2] <NA> gene-NPIRD3C_0001 <NA>
## [3] 0 cds-AJM91225.1 NCBI_GP:AJM91225.1 <NA>
## [4] <NA> gene-NPIRD3C_0002 <NA>
## [5] 0 cds-AJM91226.1 NCBI_GP:AJM91226.1 <NA>
## ... ... ... ... ...
## [4470] 0 cds-AJM93383.1 NCBI_GP:AJM93383.1 <NA>
## [4471] <NA> gene-NPIRD3C_2174 <NA>
## [4472] 0 cds-AJM93384.1 NCBI_GP:AJM93384.1 <NA>
## [4473] <NA> gene-NPIRD3C_2175 <NA>
## [4474] 0 cds-AJM93385.1 NCBI_GP:AJM93385.1 <NA>
## Name collection-date country gbkey genome
## <character> <character> <character> <character> <character>
## [1] ANONYMOUS 15-Dec-2011 Slovenia Src chromosome
## [2] NPIRD3C_0001 <NA> <NA> Gene <NA>
## [3] AJM91225.1 <NA> <NA> CDS <NA>
## [4] NPIRD3C_0002 <NA> <NA> Gene <NA>
## [5] AJM91226.1 <NA> <NA> CDS <NA>
## ... ... ... ... ... ...
## [4470] AJM93383.1 <NA> <NA> CDS <NA>
## [4471] NPIRD3C_2174 <NA> <NA> Gene <NA>
## [4472] AJM93384.1 <NA> <NA> CDS <NA>
## [4473] NPIRD3C_2175 <NA> <NA> Gene <NA>
## [4474] AJM93385.1 <NA> <NA> CDS <NA>
## isolation-source lat-lon mol_type
## <character> <character> <character>
## [1] coastal surface wate.. 45.51 N 13.56 E genomic DNA
## [2] <NA> <NA> <NA>
## [3] <NA> <NA> <NA>
## [4] <NA> <NA> <NA>
## [5] <NA> <NA> <NA>
## ... ... ... ...
## [4470] <NA> <NA> <NA>
## [4471] <NA> <NA> <NA>
## [4472] <NA> <NA> <NA>
## [4473] <NA> <NA> <NA>
## [4474] <NA> <NA> <NA>
## old-name strain type-material
## <character> <character> <character>
## [1] Candidatus Nitrosopu.. D3C type strain of Nitro..
## [2] <NA> <NA> <NA>
## [3] <NA> <NA> <NA>
## [4] <NA> <NA> <NA>
## [5] <NA> <NA> <NA>
## ... ... ... ...
## [4470] <NA> <NA> <NA>
## [4471] <NA> <NA> <NA>
## [4472] <NA> <NA> <NA>
## [4473] <NA> <NA> <NA>
## [4474] <NA> <NA> <NA>
## gene_biotype locus_tag Parent inference
## <character> <character> <CharacterList> <character>
## [1] <NA> <NA> <NA>
## [2] protein_coding NPIRD3C_0001 <NA>
## [3] <NA> NPIRD3C_0001 gene-NPIRD3C_0001 ab initio prediction..
## [4] protein_coding NPIRD3C_0002 <NA>
## [5] <NA> NPIRD3C_0002 gene-NPIRD3C_0002 ab initio prediction..
## ... ... ... ... ...
## [4470] <NA> NPIRD3C_2173 gene-NPIRD3C_2173 ab initio prediction..
## [4471] protein_coding NPIRD3C_2174 <NA>
## [4472] <NA> NPIRD3C_2174 gene-NPIRD3C_2174 ab initio prediction..
## [4473] protein_coding NPIRD3C_2175 <NA>
## [4474] <NA> NPIRD3C_2175 gene-NPIRD3C_2175 ab initio prediction..
## product protein_id transl_table Note
## <character> <character> <character> <CharacterList>
## [1] <NA> <NA> <NA>
## [2] <NA> <NA> <NA>
## [3] ATPase AJM91225.1 11
## [4] <NA> <NA> <NA>
## [5] tRNA-modifying enzyme AJM91226.1 11
## ... ... ... ... ...
## [4470] NifR3 family TIM-bar.. AJM93383.1 11
## [4471] <NA> <NA> <NA>
## [4472] Thrombospondin type .. AJM93384.1 11
## [4473] <NA> <NA> <NA>
## [4474] Thrombospondin type .. AJM93385.1 11
## gene pseudo
## <character> <character>
## [1] <NA> <NA>
## [2] <NA> <NA>
## [3] <NA> <NA>
## [4] <NA> <NA>
## [5] <NA> <NA>
## ... ... ...
## [4470] <NA> <NA>
## [4471] <NA> <NA>
## [4472] <NA> <NA>
## [4473] <NA> <NA>
## [4474] <NA> <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
SynExtend
’s primary functions provide a way to identify where pairs of genes are explicitly linked by syntenic hits, and then summarize those links. The first step is just identifying those links.
Links <- NucleotideOverlap(SyntenyObject = Syn,
GeneCalls = GeneCalls,
LimitIndex = FALSE,
Verbose = TRUE)
## ================================================================================
## Time difference of 2.075363 secs
The Links
object generated by NucleotideOverlap is a raw representation of positions on the synteny map where shared k-mers link genes between paired genomes. As such, it is analagous in shape to objects of class Synteny
. This raw object is unlikely to be useful to most users, but has been left exposed to ensure that this data remains accessible should a user desire to have access to it.
## [1] "LinkedPairs"
## 1 2 3
## 1 1683 Pairs 1760 Pairs
## 2 609223 Nucleotides 1781 Pairs
## 3 802650 Nucleotides 623239 Nucleotides
This raw data can be processed to provide a straightforward summary of predicted pairs.
LinkedPairs1 <- PairSummaries(SyntenyLinks = Links,
GeneCalls = GeneCalls,
DBPATH = DBPATH,
PIDs = FALSE,
Verbose = TRUE,
Model = "Global",
Correction = "none")
## ================================================================================
## Time difference of 3.512215 secs
The object LinkedPairs1
is a data.frame where each row is populated by information about a predicted orthologous pair. By default PairSummaries
uses a simple model to determine whether the k-mers that link a pair of genes are likely to provide an erroneous link. When set to Model = "Global"
, is is simply a prediction of whether the involved nucleotides are likely to describe a pair of genomic features whose alignment would result in a PID that falls within a random distribution. This model is effective if somewhat permissive, but is significantly faster than performing many pairwise alignments.
## TotalCoverage MaxCoverage MinCoverage NormDeltaStart
## 1_1_1 2_1_1080 0.00273224 0.002398082 0.003174603 0.135245902
## 1_1_1 2_1_1081 0.57051282 0.565079365 0.576051780 0.009615385
## 1_1_2 2_1_1082 0.45014245 0.436464088 0.464705882 0.031339031
## 1_1_3 2_1_1083 0.68292683 0.651162791 0.717948718 0.048780488
## 1_1_4 2_1_1083 0.02575107 0.023255814 0.028846154 0.121602289
## 1_1_4 2_1_1084 0.42948718 0.429487179 0.429487179 0.000000000
## NormDeltaStop NormGeneDiff ExactMatch ModelSelect
## 1_1_1 2_1_1080 0.004098361 0.139344262 1 TRUE
## 1_1_1 2_1_1081 0.000000000 0.009615385 178 TRUE
## 1_1_2 2_1_1082 0.000000000 0.031339031 237 TRUE
## 1_1_3 2_1_1083 0.000000000 0.048780488 252 TRUE
## 1_1_4 2_1_1083 0.014306152 0.107296137 9 TRUE
## 1_1_4 2_1_1084 0.000000000 0.000000000 134 TRUE
PairSummaries includes arguments that allow for aligning all pairs that are predicted, via PIDs = TRUE
, while IgnoreDefaultStringSet = FALSE
indicates that alignments should be performed in nucleotide or amino acid space as is appropriate for the linked sequences. Setting IgnoreDefaultStringSet = TRUE
will force all alignments into nucleotide space.
Session Info:
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SynExtend_1.2.0 igraph_1.2.6 DECIPHER_2.18.0
## [4] RSQLite_2.2.1 Biostrings_2.58.0 XVector_0.30.0
## [7] IRanges_2.24.0 S4Vectors_0.28.0 BiocGenerics_0.36.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 compiler_4.0.3
## [3] GenomeInfoDb_1.26.0 MatrixGenerics_1.2.0
## [5] bitops_1.0-6 tools_4.0.3
## [7] zlibbioc_1.36.0 digest_0.6.27
## [9] bit_4.0.4 lattice_0.20-41
## [11] evaluate_0.14 memoise_1.1.0
## [13] pkgconfig_2.0.3 rlang_0.4.8
## [15] Matrix_1.2-18 DelayedArray_0.16.0
## [17] DBI_1.1.0 yaml_2.2.1
## [19] xfun_0.18 GenomeInfoDbData_1.2.4
## [21] rtracklayer_1.50.0 stringr_1.4.0
## [23] knitr_1.30 vctrs_0.3.4
## [25] grid_4.0.3 bit64_4.0.5
## [27] Biobase_2.50.0 XML_3.99-0.5
## [29] BiocParallel_1.24.0 rmarkdown_2.5
## [31] blob_1.2.1 magrittr_1.5
## [33] matrixStats_0.57.0 Rsamtools_2.6.0
## [35] htmltools_0.5.0 GenomicRanges_1.42.0
## [37] GenomicAlignments_1.26.0 SummarizedExperiment_1.20.0
## [39] stringi_1.5.3 RCurl_1.98-1.2
## [41] crayon_1.3.4