1 Install package

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ggmsa")

2 Introduction

ggmsa is a package designed to plot multiple sequence alignments.

This package provides functions for producing publication-quality visualizations of multiple sequence alignments (protein, DNA, or RNA) in R in a simple yet powerful way. It uses a modular design to annotate sequence alignments and can accept other data sets to build combined diagrams.

In this tutorial, we’ll work through the basics of using ggmsa.

library(ggmsa)

3 Importing MSA data

We’ll start by importing some example data to use throughout this tutorial. Besides FASTA files, several R object classes are also accepted as input. Use available_msa() to list the MSA inputs currently supported.

available_msa()
#> 1.files currently available:
#> .fasta
#> 2.XStringSet objects from 'Biostrings' package:
#> DNAStringSet RNAStringSet AAStringSet BStringSet DNAMultipleAlignment RNAMultipleAlignment AAMultipleAlignment
#> 3.bin objects:
#> DNAbin AAbin

protein_sequences <- system.file("extdata", "sample.fasta",
                                 package = "ggmsa")
miRNA_sequences <- system.file("extdata", "seedSample.fa",
                               package = "ggmsa")
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa",
                            package = "ggmsa")
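According to the available_msa() listing, XStringSet objects from Biostrings are accepted in place of a file path. The following is a minimal sketch of that route, assuming the Biostrings package is installed; the variable names aa_set and p are illustrative only.

```r
library(ggmsa)

# Path to the protein FASTA file bundled with ggmsa
protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Read the alignment into an AAStringSet instead of passing the path directly
aa_set <- Biostrings::readAAStringSet(protein_sequences)

# Hand the in-memory object to ggmsa(); start and end select the columns to draw
p <- ggmsa(aa_set, start = 300, end = 350, color = "Clustal", seq_name = TRUE)
p
```

This can be convenient when the alignment has already been loaded or filtered in R before plotting.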

4 Basic use: MSA Visualization

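The simplest way to use ggmsa is to pass the alignment together with the start and end positions to display (here, positions 300 to 350):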

ggmsa(protein_sequences, 300, 350, color = "Clustal", 
      font = "DroidSansMono", char_width = 0.5, seq_name = TRUE )

4.1 Color Schemes

ggmsa ships with several predefined color schemes for rendering MSAs. Use available_colors() to list the color schemes currently available. Note that the schemes for amino acids (protein) and nucleotides (DNA/RNA) have different names.

available_colors()
#> 1.color schemes for nucleotide sequences currently available:
#> Chemistry_NT Shapely_NT Taylor_NT Zappo_NT
#> 2.color schemes for AA sequences currently available:
#> Clustal Chemistry_AA Shapely_AA Zappo_AA Taylor_AA LETTER CN6 Hydrophobicity
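Because protein and nucleotide schemes are named differently, the color argument has to match the type of alignment. A short sketch, using the sample files bundled with the package and scheme names taken from the listing above (the object names p_aa and p_nt are illustrative):

```r
library(ggmsa)

protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa", package = "ggmsa")

# Protein alignment: use one of the amino acid schemes
p_aa <- ggmsa(protein_sequences, 300, 350, color = "Zappo_AA",
              char_width = 0.5, seq_name = TRUE)

# Nucleotide alignment: use one of the *_NT schemes
p_nt <- ggmsa(nt_sequences, color = "Chemistry_NT", seq_name = TRUE)

p_aa
p_nt
```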

4.2 Font

Several predefined fonts are shipped with ggmsa. Use available_fonts() to list the fonts currently available.

available_fonts()
#> font families currently available:
#> helvetical mono TimesNewRoman DroidSansMono
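A font is selected through the font argument of ggmsa(). The sketch below redraws the region from the basic example with another family from the listing; the object name p_font is illustrative.

```r
library(ggmsa)

protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Same region as the basic example, rendered with a different bundled font
p_font <- ggmsa(protein_sequences, 300, 350, color = "Clustal",
                font = "TimesNewRoman", char_width = 0.5, seq_name = TRUE)
p_font
```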

5 MSA Annotation

ggmsa supports annotating MSAs. As in ggplot2, annotations are implemented as geoms and are added with +, like this: ggmsa() + geom_*(). Automatically generated annotations containing colored labels and symbols are overlaid on the MSA to indicate potentially conserved or divergent regions.

For example, to visualize a multiple sequence alignment together with a sequence logo and a bar chart:

ggmsa(protein_sequences, 221, 280, seq_name = TRUE, char_width = 0.5) + 
  geom_seqlogo(color = "Chemistry_AA") + geom_msaBar()

The following table lists the annotation layers supported by ggmsa:

Annotation modules	Type	Description
geom_seqlogo()	geometric layer	automatically generated sequence logos for an MSA
geom_GC()	annotation module	shows GC content as a bubble chart
geom_seed()	annotation module	highlights the seed region on miRNA sequences
geom_msaBar()	annotation module	shows sequence conservation as a bar chart
geom_helix()	annotation module	depicts RNA secondary structure as arc diagrams (needs extra data)
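Two of the nucleotide-oriented modules from the table can be sketched as follows. This is an illustration under assumptions rather than a definitive recipe: the seed string "GAGGUAG" is assumed to match the bundled miRNA sample, and the argument choices (none_bg, star, font = NULL to draw only the background tiles) follow the package's documented examples as best recalled.

```r
library(ggmsa)

miRNA_sequences <- system.file("extdata", "seedSample.fa", package = "ggmsa")
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa", package = "ggmsa")

# Mark the (assumed) seed region on the miRNA alignment with stars
p_seed <- ggmsa(miRNA_sequences, char_width = 0.5, color = "Chemistry_NT",
                none_bg = TRUE) +
  geom_seed(seed = "GAGGUAG", star = TRUE)

# Add a GC-content bubble chart next to a nucleotide alignment;
# font = NULL draws only the colored tiles, without letters
p_gc <- ggmsa(nt_sequences, font = NULL, color = "Chemistry_NT") +
  geom_GC()

p_seed
p_gc
```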

6 Learn more

Check out the ggmsa guides to learn everything there is to know about all the different features.

7 Session Info

#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] kableExtra_1.4.0 ggplot2_3.5.1    ggmsa_1.10.0     BiocStyle_2.32.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1        viridisLite_0.4.2       dplyr_1.1.4            
#>  [4] farver_2.1.1            Biostrings_2.72.0       fastmap_1.1.1          
#>  [7] lazyeval_0.2.2          ash_1.0-15              tweenr_2.0.3           
#> [10] digest_0.6.35           R4RNA_1.32.0            lifecycle_1.0.4        
#> [13] tidytree_0.4.6          magrittr_2.0.3          compiler_4.4.0         
#> [16] rlang_1.1.3             sass_0.4.9              tools_4.4.0            
#> [19] utf8_1.2.4              yaml_2.3.8              knitr_1.46             
#> [22] labeling_0.4.3          xml2_1.3.6              RColorBrewer_1.1-3     
#> [25] aplot_0.2.2             KernSmooth_2.23-22      withr_3.0.0            
#> [28] purrr_1.0.2             BiocGenerics_0.50.0     grid_4.4.0             
#> [31] polyclip_1.10-6         proj4_1.0-14            stats4_4.4.0           
#> [34] fansi_1.0.6             colorspace_2.1-0        extrafontdb_1.0        
#> [37] scales_1.3.0            seqmagick_0.1.7         MASS_7.3-60.2          
#> [40] tinytex_0.50            cli_3.6.2               rmarkdown_2.26         
#> [43] crayon_1.5.2            treeio_1.28.0           generics_0.1.3         
#> [46] rstudioapi_0.16.0       ggtree_3.12.0           httr_1.4.7             
#> [49] ape_5.8                 cachem_1.0.8            ggforce_0.4.2          
#> [52] stringr_1.5.1           zlibbioc_1.50.0         maps_3.4.2             
#> [55] ggalt_0.4.0             parallel_4.4.0          ggplotify_0.1.2        
#> [58] BiocManager_1.30.22     XVector_0.44.0          yulab.utils_0.1.4      
#> [61] vctrs_0.6.5             jsonlite_1.8.8          bookdown_0.39          
#> [64] gridGraphics_0.5-1      IRanges_2.38.0          patchwork_1.2.0        
#> [67] S4Vectors_0.42.0        systemfonts_1.0.6       magick_2.8.3           
#> [70] jquerylib_0.1.4         tidyr_1.3.1             glue_1.7.0             
#> [73] stringi_1.8.3           gtable_0.3.5            GenomeInfoDb_1.40.0    
#> [76] UCSC.utils_1.0.0        extrafont_0.19          munsell_0.5.1          
#> [79] tibble_3.2.1            pillar_1.9.0            htmltools_0.5.8.1      
#> [82] GenomeInfoDbData_1.2.12 R6_2.5.1                evaluate_0.23          
#> [85] lattice_0.22-6          highr_0.10              memoise_2.0.1          
#> [88] ggfun_0.1.4             bslib_0.7.0             Rcpp_1.0.12            
#> [91] svglite_2.1.3           nlme_3.1-164            Rttf2pt1_1.3.12        
#> [94] xfun_0.43               fs_1.6.4                pkgconfig_2.0.3

ggmsa-Getting Started

2024-04-30