if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ggmsa")
ggmsa is a package designed to plot multiple sequence alignments (MSAs).
It provides functions that make it simple to draw publication-quality multiple sequence alignments (protein/DNA/RNA) in R. It uses a modular design to annotate sequence alignments and accepts other data sets so that diagrams can be combined.
In this tutorial, we’ll work through the basics of using ggmsa.
library(ggmsa)
We’ll start by importing some example data to use throughout this
tutorial. Besides FASTA files, some R objects can also be used
as input. available_msa() lists the MSA inputs currently supported.
available_msa()
#> 1.files currently available:
#> .fasta
#> 2.XStringSet objects from 'Biostrings' package:
#> DNAStringSet RNAStringSet AAStringSet BStringSet DNAMultipleAlignment RNAMultipleAlignment AAMultipleAlignment
#> 3.bin objects:
#> DNAbin AAbin
protein_sequences <- system.file("extdata", "sample.fasta",
                                 package = "ggmsa")
miRNA_sequences <- system.file("extdata", "seedSample.fa",
                               package = "ggmsa")
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa",
                            package = "ggmsa")
The simplest way to use ggmsa:
ggmsa(protein_sequences, 300, 350, color = "Clustal",
      font = "DroidSansMono", char_width = 0.5, seq_name = TRUE)
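As noted above, FASTA files are not the only accepted input. The following is a minimal sketch of passing an in-memory object instead of a file path, assuming the Biostrings package is installed. It reads the same sample file with Biostrings::readAAStringSet() and hands the resulting AAStringSet to ggmsa(); the variable names aa_set and p are illustrative only.

```r
library(ggmsa)
library(Biostrings)

# Path to the sample protein alignment shipped with ggmsa
fasta_path <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Read the alignment into an AAStringSet, one of the object types
# listed by available_msa()
aa_set <- readAAStringSet(fasta_path)

# Plot columns 300 to 350 from the in-memory object rather than the file
p <- ggmsa(aa_set, start = 300, end = 350, color = "Clustal",
           font = "DroidSansMono", char_width = 0.5, seq_name = TRUE)
p
```

This can be handy when the alignment has already been filtered or reordered in R and writing it back to disk would be an unnecessary step.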
ggmsa ships with several predefined color schemes for rendering MSAs.
In the same way, use available_colors() to list the color schemes
currently available. Note that the color schemes for amino acids
(protein) and nucleotides (DNA/RNA) have different names.
available_colors()
#> 1.color schemes for nucleotide sequences currently available:
#> Chemistry_NT Shapely_NT Taylor_NT Zappo_NT
#> 2.color schemes for AA sequences currently available:
#> Clustal Chemistry_AA Shapely_AA Zappo_AA Taylor_AA LETTER CN6 Hydrophobicity
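To illustrate the difference between the two families of scheme names, here is a short sketch that applies an amino-acid scheme to the protein sample and a nucleotide scheme to the nucleotide sample. The choice of "Zappo_AA" and "Chemistry_NT" is arbitrary, and font = NULL is assumed to draw only the colored tiles without letters.

```r
library(ggmsa)

protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa", package = "ggmsa")

# Amino-acid alignment: use a scheme from the AA list
p_aa <- ggmsa(protein_sequences, start = 300, end = 350,
              color = "Zappo_AA", char_width = 0.5, seq_name = TRUE)

# Nucleotide alignment: use a scheme whose name ends in _NT;
# font = NULL is assumed to skip the letters and keep only the tiles
p_nt <- ggmsa(nt_sequences, color = "Chemistry_NT", font = NULL)

p_aa
p_nt
```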
Several predefined fonts are shipped with ggmsa.
Use available_fonts() to list the fonts currently available.
available_fonts()
#> font families currently available:
#> helvetical mono TimesNewRoman DroidSansMono
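As a quick sketch, the font families listed above can be compared by changing only the font argument while keeping the region and color scheme fixed. The object names p_mono and p_times are illustrative.

```r
library(ggmsa)

protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Same region and color scheme, two different font families
p_mono <- ggmsa(protein_sequences, start = 300, end = 350,
                font = "mono", color = "Clustal", char_width = 0.5)
p_times <- ggmsa(protein_sequences, start = 300, end = 350,
                 font = "TimesNewRoman", color = "Clustal", char_width = 0.5)

p_mono
p_times
```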
ggmsa supports annotations for MSAs. Similar to ggplot2,
it implements annotations as geom layers, and users can add
them with +, like this: ggmsa() + geom_*().
Automatically generated annotations containing colored
labels and symbols are overlaid on MSAs to indicate
potentially conserved or divergent regions.
For example, to visualize a multiple sequence alignment together with a sequence logo and a bar chart:
ggmsa(protein_sequences, 221, 280, seq_name = TRUE, char_width = 0.5) +
    geom_seqlogo(color = "Chemistry_AA") + geom_msaBar()
The following table lists the annotation layers supported by ggmsa:
Annotation module | Type | Description |
---|---|---|
geom_seqlogo() | geometric layer | automatically generates a sequence logo for an MSA |
geom_GC() | annotation module | shows GC content as a bubble chart |
geom_seed() | annotation module | highlights the seed region on miRNA sequences |
geom_msaBar() | annotation module | shows sequence conservation as a bar chart |
geom_helix() | annotation module | depicts RNA secondary structure as arc diagrams (needs extra data) |
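Two of the modules in the table can be sketched on the sample data shipped with the package. The seed string "GAGGUAG" and the star = TRUE argument below are assumptions chosen to suit the seedSample.fa file; check ?geom_seed and ?geom_GC for the arguments supported by your installed version.

```r
library(ggmsa)

miRNA_sequences <- system.file("extdata", "seedSample.fa", package = "ggmsa")
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa", package = "ggmsa")

# Highlight the miRNA seed region; the seed string is an assumption
# about the sample data, and star = TRUE marks it with asterisks
p_seed <- ggmsa(miRNA_sequences, char_width = 0.5, color = "Chemistry_NT") +
    geom_seed(seed = "GAGGUAG", star = TRUE)

# Show per-sequence GC content as a bubble chart next to the alignment
p_gc <- ggmsa(nt_sequences, font = NULL, color = "Chemistry_NT") +
    geom_GC()

p_seed
p_gc
```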
Check out the ggmsa guides to learn about all the different features.
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] kableExtra_1.4.0 ggplot2_3.5.1 ggmsa_1.10.0 BiocStyle_2.32.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.2 dplyr_1.1.4
#> [4] farver_2.1.1 Biostrings_2.72.0 fastmap_1.1.1
#> [7] lazyeval_0.2.2 ash_1.0-15 tweenr_2.0.3
#> [10] digest_0.6.35 R4RNA_1.32.0 lifecycle_1.0.4
#> [13] tidytree_0.4.6 magrittr_2.0.3 compiler_4.4.0
#> [16] rlang_1.1.3 sass_0.4.9 tools_4.4.0
#> [19] utf8_1.2.4 yaml_2.3.8 knitr_1.46
#> [22] labeling_0.4.3 xml2_1.3.6 RColorBrewer_1.1-3
#> [25] aplot_0.2.2 KernSmooth_2.23-22 withr_3.0.0
#> [28] purrr_1.0.2 BiocGenerics_0.50.0 grid_4.4.0
#> [31] polyclip_1.10-6 proj4_1.0-14 stats4_4.4.0
#> [34] fansi_1.0.6 colorspace_2.1-0 extrafontdb_1.0
#> [37] scales_1.3.0 seqmagick_0.1.7 MASS_7.3-60.2
#> [40] tinytex_0.50 cli_3.6.2 rmarkdown_2.26
#> [43] crayon_1.5.2 treeio_1.28.0 generics_0.1.3
#> [46] rstudioapi_0.16.0 ggtree_3.12.0 httr_1.4.7
#> [49] ape_5.8 cachem_1.0.8 ggforce_0.4.2
#> [52] stringr_1.5.1 zlibbioc_1.50.0 maps_3.4.2
#> [55] ggalt_0.4.0 parallel_4.4.0 ggplotify_0.1.2
#> [58] BiocManager_1.30.22 XVector_0.44.0 yulab.utils_0.1.4
#> [61] vctrs_0.6.5 jsonlite_1.8.8 bookdown_0.39
#> [64] gridGraphics_0.5-1 IRanges_2.38.0 patchwork_1.2.0
#> [67] S4Vectors_0.42.0 systemfonts_1.0.6 magick_2.8.3
#> [70] jquerylib_0.1.4 tidyr_1.3.1 glue_1.7.0
#> [73] stringi_1.8.3 gtable_0.3.5 GenomeInfoDb_1.40.0
#> [76] UCSC.utils_1.0.0 extrafont_0.19 munsell_0.5.1
#> [79] tibble_3.2.1 pillar_1.9.0 htmltools_0.5.8.1
#> [82] GenomeInfoDbData_1.2.12 R6_2.5.1 evaluate_0.23
#> [85] lattice_0.22-6 highr_0.10 memoise_2.0.1
#> [88] ggfun_0.1.4 bslib_0.7.0 Rcpp_1.0.12
#> [91] svglite_2.1.3 nlme_3.1-164 Rttf2pt1_1.3.12
#> [94] xfun_0.43 fs_1.6.4 pkgconfig_2.0.3