One of the most common metrics to assess the quality of genome assemblies is BUSCO (best universal single-copy orthologs) (Simão et al. 2015). cogeqc
allows users to run BUSCO from an R session and visualize results graphically. BUSCO summary statistics will help you assess which assemblies have high quality based on the percentage of complete BUSCOs.
if(!requireNamespace('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install("cogeqc")
# Load package after installation
library(cogeqc)
To run BUSCO from R, you will use the function run_busco()
1. Here, we will use an example FASTA file containing the first 1,000 lines of the Herbaspirilllum seropedicae SmR1 genome (GCA_000143225), which was downloaded from Ensembl Bacteria. We will run BUSCO using burkholderiales_odb10 as the lineage dataset. To view all available datasets, run list_busco_datasets()
.
# Path to FASTA file
sequence <- system.file("extdata", "Hse_subset.fa", package = "cogeqc")
# Path to directory where BUSCO datasets will be stored
download_path <- paste0(tempdir(), "/datasets")
# Run BUSCO if it is installed
if(busco_is_installed()) {
run_busco(sequence, outlabel = "Hse", mode = "genome",
lineage = "burkholderiales_odb10",
outpath = tempdir(), download_path = download_path)
}
The output will be stored in the directory specified in outpath. You can read and parse BUSCO’s output with the function read_busco()
. For example, let’s read the output of a BUSCO run using the genome of the green algae Ostreococcus tauri. The output directory is /extdata
.
# Path to output directory
output_dir <- system.file("extdata", package = "cogeqc")
busco_summary <- read_busco(output_dir)
busco_summary
#> Class Frequency Lineage
#> 1 Complete_SC 1412 chlorophyta_odb10
#> 2 Complete_duplicate 4 chlorophyta_odb10
#> 3 Fragmented 35 chlorophyta_odb10
#> 4 Missing 68 chlorophyta_odb10
This is an example output for a BUSCO run with a single FASTA file. You can also specify a directory containing multiple FASTA files in the sequence argument of run_busco()
. This way, BUSCO will be run in batch mode. Let’s see what the output of BUSCO in batch mode looks like:
data(batch_summary)
batch_summary
#> Class Frequency Lineage File
#> 1 Complete_SC 98.5 burkholderiales_odb10 Hse.fa
#> 2 Complete_SC 98.8 burkholderiales_odb10 Hru.fa
#> 3 Complete_duplicate 0.7 burkholderiales_odb10 Hse.fa
#> 4 Complete_duplicate 0.7 burkholderiales_odb10 Hru.fa
#> 5 Fragmented 0.4 burkholderiales_odb10 Hse.fa
#> 6 Fragmented 0.3 burkholderiales_odb10 Hru.fa
#> 7 Missing 0.4 burkholderiales_odb10 Hse.fa
#> 8 Missing 0.2 burkholderiales_odb10 Hru.fa
The only difference between this data frame and the previous one is the column File, which contains information on the FASTA file. The example dataset batch_summary
contains the output of run_busco()
using a directory containing two genomes (Herbaspirillum seropedicae SmR1 and Herbaspirillum rubrisubalbicans M1) as parameter to the sequence argument.
After using run_busco()
and parsing its output with read_busco()
, users can visualize summary statistics with plot_busco()
.
# Single FASTA file - Ostreococcus tauri
plot_busco(busco_summary)
# Batch mode - Herbaspirillum seropedicae and H. rubrisubalbicans
plot_busco(batch_summary)
We usually consider genomes with >90% of complete BUSCOs as having high quality. Thus, we can conclude that the three genomes analyzed here are high-quality genomes.
This document was created under the following conditions:
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.2 (2022-10-31)
#> os macOS Ventura 13.0
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2023-01-26
#> pandoc 2.18 @ /opt/homebrew/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> ape 5.6-2 2022-03-02 [2] CRAN (R 4.2.0)
#> aplot 0.1.9 2022-11-24 [2] CRAN (R 4.2.0)
#> assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.2.0)
#> BiocGenerics 0.44.0 2023-01-25 [2] Bioconductor
#> BiocManager 1.30.19 2022-10-25 [2] CRAN (R 4.2.0)
#> BiocStyle * 2.26.0 2023-01-25 [2] Bioconductor
#> Biostrings 2.66.0 2023-01-25 [2] Bioconductor
#> bitops 1.0-7 2021-04-24 [2] CRAN (R 4.2.0)
#> bookdown 0.32 2023-01-17 [2] CRAN (R 4.2.0)
#> bslib 0.4.2 2022-12-16 [2] CRAN (R 4.2.0)
#> cachem 1.0.6 2021-08-19 [2] CRAN (R 4.2.0)
#> cli 3.6.0 2023-01-09 [2] CRAN (R 4.2.0)
#> cogeqc * 1.2.1 2023-01-25 [1] Bioconductor
#> colorspace 2.0-3 2022-02-21 [2] CRAN (R 4.2.0)
#> crayon 1.5.2 2022-09-29 [2] CRAN (R 4.2.0)
#> DBI 1.1.3 2022-06-18 [2] CRAN (R 4.2.0)
#> digest 0.6.31 2022-12-11 [2] CRAN (R 4.2.0)
#> dplyr 1.0.10 2022-09-01 [2] CRAN (R 4.2.0)
#> evaluate 0.20 2023-01-17 [2] CRAN (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [2] CRAN (R 4.2.0)
#> farver 2.1.1 2022-07-06 [2] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.2.0)
#> generics 0.1.3 2022-07-05 [2] CRAN (R 4.2.0)
#> GenomeInfoDb 1.34.7 2023-01-25 [2] Bioconductor
#> GenomeInfoDbData 1.2.9 2023-01-21 [2] Bioconductor
#> ggfun 0.0.9 2022-11-21 [2] CRAN (R 4.2.0)
#> ggplot2 3.4.0 2022-11-04 [2] CRAN (R 4.2.0)
#> ggplotify 0.1.0 2021-09-02 [2] CRAN (R 4.2.0)
#> ggtree 3.6.2 2023-01-25 [2] Bioconductor
#> glue 1.6.2 2022-02-24 [2] CRAN (R 4.2.0)
#> gridGraphics 0.5-1 2020-12-13 [2] CRAN (R 4.2.0)
#> gtable 0.3.1 2022-09-01 [2] CRAN (R 4.2.0)
#> highr 0.10 2022-12-22 [2] CRAN (R 4.2.0)
#> htmltools 0.5.4 2022-12-07 [2] CRAN (R 4.2.0)
#> igraph 1.3.5 2022-09-22 [2] CRAN (R 4.2.0)
#> IRanges 2.32.0 2023-01-25 [2] Bioconductor
#> jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.2.0)
#> jsonlite 1.8.4 2022-12-06 [2] CRAN (R 4.2.0)
#> knitr 1.41 2022-11-18 [2] CRAN (R 4.2.0)
#> labeling 0.4.2 2020-10-20 [2] CRAN (R 4.2.0)
#> lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.2)
#> lazyeval 0.2.2 2019-03-15 [2] CRAN (R 4.2.0)
#> lifecycle 1.0.3 2022-10-07 [2] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.2.0)
#> munsell 0.5.0 2018-06-12 [2] CRAN (R 4.2.0)
#> nlme 3.1-161 2022-12-15 [2] CRAN (R 4.2.0)
#> patchwork 1.1.2 2022-08-19 [2] CRAN (R 4.2.0)
#> pillar 1.8.1 2022-08-19 [2] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.0)
#> plyr 1.8.8 2022-11-11 [2] CRAN (R 4.2.0)
#> purrr 1.0.1 2023-01-10 [2] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.2.0)
#> Rcpp 1.0.9 2022-07-08 [2] CRAN (R 4.2.0)
#> RCurl 1.98-1.9 2022-10-03 [2] CRAN (R 4.2.0)
#> reshape2 1.4.4 2020-04-09 [2] CRAN (R 4.2.0)
#> rlang 1.0.6 2022-09-24 [2] CRAN (R 4.2.0)
#> rmarkdown 2.20 2023-01-19 [2] CRAN (R 4.2.0)
#> S4Vectors 0.36.1 2023-01-25 [2] Bioconductor
#> sass 0.4.4 2022-11-24 [2] CRAN (R 4.2.0)
#> scales 1.2.1 2022-08-20 [2] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.2.0)
#> stringi 1.7.12 2023-01-11 [2] CRAN (R 4.2.0)
#> stringr 1.5.0 2022-12-02 [2] CRAN (R 4.2.0)
#> tibble 3.1.8 2022-07-22 [2] CRAN (R 4.2.0)
#> tidyr 1.2.1 2022-09-08 [2] CRAN (R 4.2.0)
#> tidyselect 1.2.0 2022-10-10 [2] CRAN (R 4.2.0)
#> tidytree 0.4.2 2022-12-18 [2] CRAN (R 4.2.0)
#> treeio 1.22.0 2023-01-25 [2] Bioconductor
#> utf8 1.2.2 2021-07-24 [2] CRAN (R 4.2.0)
#> vctrs 0.5.1 2022-11-16 [2] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [2] CRAN (R 4.2.0)
#> xfun 0.36 2022-12-21 [2] CRAN (R 4.2.0)
#> XVector 0.38.0 2023-01-25 [2] Bioconductor
#> yaml 2.3.6 2022-10-18 [2] CRAN (R 4.2.0)
#> yulab.utils 0.0.6 2022-12-20 [2] CRAN (R 4.2.0)
#> zlibbioc 1.44.0 2023-01-25 [2] Bioconductor
#>
#> [1] /private/tmp/Rtmpvlske3/Rinst10a7850dab199
#> [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
NOTE: You must have BUSCO installed and in your PATH to use run_busco()
. You can check if BUSCO is installed by running busco_is_installed()
. If you don’t have it already, you can manually install it or use a conda virtual environment with the Bioconductor package Herper
(Paul, Carroll, and Barrows 2021).↩︎