Contents

1 Introduction

One of the most common metrics to assess the quality of genome assemblies is BUSCO (best universal single-copy orthologs) (Simão et al. 2015). cogeqc allows users to run BUSCO from an R session and visualize results graphically. BUSCO summary statistics will help you assess which assemblies have high quality based on the percentage of complete BUSCOs.

2 Installation

if(!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')
BiocManager::install("cogeqc")
# Load package after installation
library(cogeqc)

3 Running BUSCO

To run BUSCO from R, you will use the function run_busco()1. Here, we will use an example FASTA file containing the first 1,000 lines of the Herbaspirilllum seropedicae SmR1 genome (GCA_000143225), which was downloaded from Ensembl Bacteria. We will run BUSCO using burkholderiales_odb10 as the lineage dataset. To view all available datasets, run list_busco_datasets().

# Path to FASTA file
sequence <- system.file("extdata", "Hse_subset.fa", package = "cogeqc")

# Path to directory where BUSCO datasets will be stored
download_path <- paste0(tempdir(), "/datasets")

# Run BUSCO if it is installed
if(busco_is_installed()) {
  run_busco(sequence, outlabel = "Hse", mode = "genome",
            lineage = "burkholderiales_odb10",
            outpath = tempdir(), download_path = download_path)
}

The output will be stored in the directory specified in outpath. You can read and parse BUSCO’s output with the function read_busco(). For example, let’s read the output of a BUSCO run using the genome of the green algae Ostreococcus tauri. The output directory is /extdata.

# Path to output directory
output_dir <- system.file("extdata", package = "cogeqc")

busco_summary <- read_busco(output_dir)
busco_summary
#>                Class Frequency           Lineage
#> 1        Complete_SC      1412 chlorophyta_odb10
#> 2 Complete_duplicate         4 chlorophyta_odb10
#> 3         Fragmented        35 chlorophyta_odb10
#> 4            Missing        68 chlorophyta_odb10

This is an example output for a BUSCO run with a single FASTA file. You can also specify a directory containing multiple FASTA files in the sequence argument of run_busco(). This way, BUSCO will be run in batch mode. Let’s see what the output of BUSCO in batch mode looks like:

data(batch_summary)
batch_summary
#>                Class Frequency               Lineage   File
#> 1        Complete_SC      98.5 burkholderiales_odb10 Hse.fa
#> 2        Complete_SC      98.8 burkholderiales_odb10 Hru.fa
#> 3 Complete_duplicate       0.7 burkholderiales_odb10 Hse.fa
#> 4 Complete_duplicate       0.7 burkholderiales_odb10 Hru.fa
#> 5         Fragmented       0.4 burkholderiales_odb10 Hse.fa
#> 6         Fragmented       0.3 burkholderiales_odb10 Hru.fa
#> 7            Missing       0.4 burkholderiales_odb10 Hse.fa
#> 8            Missing       0.2 burkholderiales_odb10 Hru.fa

The only difference between this data frame and the previous one is the column File, which contains information on the FASTA file. The example dataset batch_summary contains the output of run_busco() using a directory containing two genomes (Herbaspirillum seropedicae SmR1 and Herbaspirillum rubrisubalbicans M1) as parameter to the sequence argument.

4 Visualizing summary statistics

After using run_busco() and parsing its output with read_busco(), users can visualize summary statistics with plot_busco().

# Single FASTA file - Ostreococcus tauri
plot_busco(busco_summary)


# Batch mode - Herbaspirillum seropedicae and H. rubrisubalbicans
plot_busco(batch_summary)

We usually consider genomes with >90% of complete BUSCOs as having high quality. Thus, we can conclude that the three genomes analyzed here are high-quality genomes.

Session information

This document was created under the following conditions:

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31)
#>  os       macOS Ventura 13.0
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2023-01-26
#>  pandoc   2.18 @ /opt/homebrew/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package          * version  date (UTC) lib source
#>  ape                5.6-2    2022-03-02 [2] CRAN (R 4.2.0)
#>  aplot              0.1.9    2022-11-24 [2] CRAN (R 4.2.0)
#>  assertthat         0.2.1    2019-03-21 [2] CRAN (R 4.2.0)
#>  BiocGenerics       0.44.0   2023-01-25 [2] Bioconductor
#>  BiocManager        1.30.19  2022-10-25 [2] CRAN (R 4.2.0)
#>  BiocStyle        * 2.26.0   2023-01-25 [2] Bioconductor
#>  Biostrings         2.66.0   2023-01-25 [2] Bioconductor
#>  bitops             1.0-7    2021-04-24 [2] CRAN (R 4.2.0)
#>  bookdown           0.32     2023-01-17 [2] CRAN (R 4.2.0)
#>  bslib              0.4.2    2022-12-16 [2] CRAN (R 4.2.0)
#>  cachem             1.0.6    2021-08-19 [2] CRAN (R 4.2.0)
#>  cli                3.6.0    2023-01-09 [2] CRAN (R 4.2.0)
#>  cogeqc           * 1.2.1    2023-01-25 [1] Bioconductor
#>  colorspace         2.0-3    2022-02-21 [2] CRAN (R 4.2.0)
#>  crayon             1.5.2    2022-09-29 [2] CRAN (R 4.2.0)
#>  DBI                1.1.3    2022-06-18 [2] CRAN (R 4.2.0)
#>  digest             0.6.31   2022-12-11 [2] CRAN (R 4.2.0)
#>  dplyr              1.0.10   2022-09-01 [2] CRAN (R 4.2.0)
#>  evaluate           0.20     2023-01-17 [2] CRAN (R 4.2.0)
#>  fansi              1.0.3    2022-03-24 [2] CRAN (R 4.2.0)
#>  farver             2.1.1    2022-07-06 [2] CRAN (R 4.2.0)
#>  fastmap            1.1.0    2021-01-25 [2] CRAN (R 4.2.0)
#>  generics           0.1.3    2022-07-05 [2] CRAN (R 4.2.0)
#>  GenomeInfoDb       1.34.7   2023-01-25 [2] Bioconductor
#>  GenomeInfoDbData   1.2.9    2023-01-21 [2] Bioconductor
#>  ggfun              0.0.9    2022-11-21 [2] CRAN (R 4.2.0)
#>  ggplot2            3.4.0    2022-11-04 [2] CRAN (R 4.2.0)
#>  ggplotify          0.1.0    2021-09-02 [2] CRAN (R 4.2.0)
#>  ggtree             3.6.2    2023-01-25 [2] Bioconductor
#>  glue               1.6.2    2022-02-24 [2] CRAN (R 4.2.0)
#>  gridGraphics       0.5-1    2020-12-13 [2] CRAN (R 4.2.0)
#>  gtable             0.3.1    2022-09-01 [2] CRAN (R 4.2.0)
#>  highr              0.10     2022-12-22 [2] CRAN (R 4.2.0)
#>  htmltools          0.5.4    2022-12-07 [2] CRAN (R 4.2.0)
#>  igraph             1.3.5    2022-09-22 [2] CRAN (R 4.2.0)
#>  IRanges            2.32.0   2023-01-25 [2] Bioconductor
#>  jquerylib          0.1.4    2021-04-26 [2] CRAN (R 4.2.0)
#>  jsonlite           1.8.4    2022-12-06 [2] CRAN (R 4.2.0)
#>  knitr              1.41     2022-11-18 [2] CRAN (R 4.2.0)
#>  labeling           0.4.2    2020-10-20 [2] CRAN (R 4.2.0)
#>  lattice            0.20-45  2021-09-22 [2] CRAN (R 4.2.2)
#>  lazyeval           0.2.2    2019-03-15 [2] CRAN (R 4.2.0)
#>  lifecycle          1.0.3    2022-10-07 [2] CRAN (R 4.2.0)
#>  magrittr           2.0.3    2022-03-30 [2] CRAN (R 4.2.0)
#>  munsell            0.5.0    2018-06-12 [2] CRAN (R 4.2.0)
#>  nlme               3.1-161  2022-12-15 [2] CRAN (R 4.2.0)
#>  patchwork          1.1.2    2022-08-19 [2] CRAN (R 4.2.0)
#>  pillar             1.8.1    2022-08-19 [2] CRAN (R 4.2.0)
#>  pkgconfig          2.0.3    2019-09-22 [2] CRAN (R 4.2.0)
#>  plyr               1.8.8    2022-11-11 [2] CRAN (R 4.2.0)
#>  purrr              1.0.1    2023-01-10 [2] CRAN (R 4.2.0)
#>  R6                 2.5.1    2021-08-19 [2] CRAN (R 4.2.0)
#>  Rcpp               1.0.9    2022-07-08 [2] CRAN (R 4.2.0)
#>  RCurl              1.98-1.9 2022-10-03 [2] CRAN (R 4.2.0)
#>  reshape2           1.4.4    2020-04-09 [2] CRAN (R 4.2.0)
#>  rlang              1.0.6    2022-09-24 [2] CRAN (R 4.2.0)
#>  rmarkdown          2.20     2023-01-19 [2] CRAN (R 4.2.0)
#>  S4Vectors          0.36.1   2023-01-25 [2] Bioconductor
#>  sass               0.4.4    2022-11-24 [2] CRAN (R 4.2.0)
#>  scales             1.2.1    2022-08-20 [2] CRAN (R 4.2.0)
#>  sessioninfo        1.2.2    2021-12-06 [2] CRAN (R 4.2.0)
#>  stringi            1.7.12   2023-01-11 [2] CRAN (R 4.2.0)
#>  stringr            1.5.0    2022-12-02 [2] CRAN (R 4.2.0)
#>  tibble             3.1.8    2022-07-22 [2] CRAN (R 4.2.0)
#>  tidyr              1.2.1    2022-09-08 [2] CRAN (R 4.2.0)
#>  tidyselect         1.2.0    2022-10-10 [2] CRAN (R 4.2.0)
#>  tidytree           0.4.2    2022-12-18 [2] CRAN (R 4.2.0)
#>  treeio             1.22.0   2023-01-25 [2] Bioconductor
#>  utf8               1.2.2    2021-07-24 [2] CRAN (R 4.2.0)
#>  vctrs              0.5.1    2022-11-16 [2] CRAN (R 4.2.0)
#>  withr              2.5.0    2022-03-03 [2] CRAN (R 4.2.0)
#>  xfun               0.36     2022-12-21 [2] CRAN (R 4.2.0)
#>  XVector            0.38.0   2023-01-25 [2] Bioconductor
#>  yaml               2.3.6    2022-10-18 [2] CRAN (R 4.2.0)
#>  yulab.utils        0.0.6    2022-12-20 [2] CRAN (R 4.2.0)
#>  zlibbioc           1.44.0   2023-01-25 [2] Bioconductor
#> 
#>  [1] /private/tmp/Rtmpvlske3/Rinst10a7850dab199
#>  [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

References

Paul, Matt, Thomas Carroll, and Doug Barrows. 2021. Herper: The Herper Package Is a Simple Toolset to Install and Manage Conda Packages and Environments from r. https://github.com/RockefellerUniversity/Herper.
Simão, Felipe A, Robert M Waterhouse, Panagiotis Ioannidis, Evgenia V Kriventseva, and Evgeny M Zdobnov. 2015. “BUSCO: Assessing Genome Assembly and Annotation Completeness with Single-Copy Orthologs.” Bioinformatics 31 (19): 3210–12.

  1. NOTE: You must have BUSCO installed and in your PATH to use run_busco(). You can check if BUSCO is installed by running busco_is_installed(). If you don’t have it already, you can manually install it or use a conda virtual environment with the Bioconductor package Herper (Paul, Carroll, and Barrows 2021).↩︎