When working on your own genome project or when using publicly available
genomes for comparative analyses, it is critical to assess the quality of your
data. Over the past years, several tools have been developed and several
metrics have been proposed to assess the quality of a genome assembly and
annotation. cogeqc
helps users interpret their genome assembly statistics
by comparing them with statistics on publicly available genomes on the NCBI.
Additionally, cogeqc
also provides an interface to BUSCO (Simão et al. 2015),
a popular tool to assess gene space completeness. Graphical functions are
available to make publication-ready plots that summarize the results of
quality control.
You can install cogeqc
from Bioconductor with the following code:
if(!requireNamespace('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install("cogeqc")
# Load package after installation
library(cogeqc)
When analyzing and interpreting genome assembly statistics, it is often
useful to place your stats in a context by comparing them with stats from genomes
of closely-related or even the same species. cogeqc
provides users with
an interface to the NCBI Datasets API, which can be used to retrieve summary
stats for genomes on NCBI. In this section, we will guide you on how to
retrieve such information and use it as a reference to interpret your data.
To obtain a data frame of summary statistics for NCBI genomes of a particular
taxon, you will use the function get_genome_stats()
. In the taxon parameter,
you must specify the taxon from which data will be extracted. This can be done
either by passing a character scalar with taxon name or by passing a numeric
scalar with NCBI Taxonomy ID. For example, the code below demonstrates two
ways of extracting stats on maize (Zea mays) genomes on NCBI:
# Example 1: get stats for all maize genomes using taxon name
maize_stats <- get_genome_stats(taxon = "Zea mays")
head(maize_stats)
#> accession source species_taxid species_name species_common_name
#> 1 GCA_902167145.1 GENBANK 4577 Zea mays maize
#> 2 GCF_902167145.1 REFSEQ 4577 Zea mays maize
#> 3 GCA_022117705.1 GENBANK 4577 Zea mays maize
#> 4 GCA_029775835.1 GENBANK 4577 Zea mays maize
#> 5 GCA_905067065.1 GENBANK 4577 Zea mays maize
#> 6 GCA_902714155.1 GENBANK 4577 Zea mays maize
#> species_ecotype species_strain species_isolate species_cultivar
#> 1 <NA> NA <NA> B73
#> 2 <NA> NA <NA> B73
#> 3 <NA> NA <NA> Mo17-2021
#> 4 <NA> NA <NA> LT2357
#> 5 <NA> NA <NA> <NA>
#> 6 <NA> NA <NA> B73 Ab10
#> assembly_level assembly_status assembly_name
#> 1 Chromosome current Zm-B73-REFERENCE-NAM-5.0
#> 2 Chromosome current Zm-B73-REFERENCE-NAM-5.0
#> 3 <NA> current Zm-Mo17-REFERENCE-CAU-T2T-assembly
#> 4 Chromosome current ASM2977583v1
#> 5 Chromosome current Zm-LH244-REFERENCE-BAYER-1.0
#> 6 Chromosome current Zm-B73_AB10-REFERENCE-NAM-1.0b
#> assembly_type submission_date submitter
#> 1 haploid NA MaizeGDB
#> 2 haploid NA MaizeGDB
#> 3 haploid NA China Agriculture University
#> 4 haploid NA Beijing Lantron Seed Co., LTD.
#> 5 haploid NA BAYER CROPSCIENCE
#> 6 haploid NA MaizeGDB
#> sequencing_technology atypical refseq_category
#> 1 <NA> FALSE <NA>
#> 2 <NA> FALSE representative genome
#> 3 Oxford Nanopore PromethION; PacBio FALSE <NA>
#> 4 PacBio Sequel FALSE <NA>
#> 5 <NA> FALSE <NA>
#> 6 <NA> FALSE <NA>
#> chromosome_count sequence_length ungapped_length contig_count contig_N50
#> 1 10 2182075994 2178268108 1393 47037903
#> 2 10 2182075994 2178268108 1393 47037903
#> 3 10 2178604320 2178604320 10 220303002
#> 4 10 2106865080 2106637080 460 15883073
#> 5 10 2147745480 2107651308 56173 84946
#> 6 10 2243621556 2241350720 1016 161994764
#> contig_L50 scaffold_count scaffold_N50 scaffold_L50 GC_percent
#> 1 16 685 226353449 5 47.0
#> 2 16 685 226353449 5 47.0
#> 3 5 10 220303002 5 47.0
#> 4 41 10 222005600 5 47.0
#> 5 7498 10 225452224 5 47.0
#> 6 6 936 225306452 5 46.5
#> annotation_provider annotation_release_date gene_count_total
#> 1 <NA> <NA> NA
#> 2 NCBI RefSeq 2020-08-09 49897
#> 3 <NA> <NA> NA
#> 4 <NA> <NA> NA
#> 5 <NA> <NA> NA
#> 6 <NA> <NA> NA
#> gene_count_coding gene_count_noncoding gene_count_pseudogene gene_count_other
#> 1 NA NA NA NA
#> 2 34337 10366 5194 NA
#> 3 NA NA NA NA
#> 4 NA NA NA NA
#> 5 NA NA NA NA
#> 6 NA NA NA NA
#> CC_ratio
#> 1 139.3
#> 2 139.3
#> 3 1.0
#> 4 46.0
#> 5 5617.3
#> 6 101.6
str(maize_stats)
#> 'data.frame': 110 obs. of 36 variables:
#> $ accession : chr "GCA_902167145.1" "GCF_902167145.1" "GCA_022117705.1" "GCA_029775835.1" ...
#> $ source : chr "GENBANK" "REFSEQ" "GENBANK" "GENBANK" ...
#> $ species_taxid : int 4577 4577 4577 4577 4577 4577 4577 4577 4577 4577 ...
#> $ species_name : chr "Zea mays" "Zea mays" "Zea mays" "Zea mays" ...
#> $ species_common_name : chr "maize" "maize" "maize" "maize" ...
#> $ species_ecotype : chr NA NA NA NA ...
#> $ species_strain : logi NA NA NA NA NA NA ...
#> $ species_isolate : chr NA NA NA NA ...
#> $ species_cultivar : chr "B73" "B73" "Mo17-2021" "LT2357" ...
#> $ assembly_level : Factor w/ 4 levels "Complete","Chromosome",..: 2 2 NA 2 2 2 2 2 2 2 ...
#> $ assembly_status : chr "current" "current" "current" "current" ...
#> $ assembly_name : chr "Zm-B73-REFERENCE-NAM-5.0" "Zm-B73-REFERENCE-NAM-5.0" "Zm-Mo17-REFERENCE-CAU-T2T-assembly" "ASM2977583v1" ...
#> $ assembly_type : chr "haploid" "haploid" "haploid" "haploid" ...
#> $ submission_date : logi NA NA NA NA NA NA ...
#> $ submitter : chr "MaizeGDB" "MaizeGDB" "China Agriculture University" "Beijing Lantron Seed Co., LTD." ...
#> $ sequencing_technology : chr NA NA "Oxford Nanopore PromethION; PacBio" "PacBio Sequel" ...
#> $ atypical : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ refseq_category : chr NA "representative genome" NA NA ...
#> $ chromosome_count : int 10 10 10 10 10 10 10 10 10 10 ...
#> $ sequence_length : num 2.18e+09 2.18e+09 2.18e+09 2.11e+09 2.15e+09 ...
#> $ ungapped_length : num 2.18e+09 2.18e+09 2.18e+09 2.11e+09 2.11e+09 ...
#> $ contig_count : int 1393 1393 10 460 56173 1016 1191 1747 2634 863 ...
#> $ contig_N50 : int 47037903 47037903 220303002 15883073 84946 161994764 49888411 49071148 38847693 35863069 ...
#> $ contig_L50 : int 16 16 5 41 7498 6 12 13 21 21 ...
#> $ scaffold_count : int 685 685 10 10 10 936 800 1379 2223 413 ...
#> $ scaffold_N50 : int 226353449 226353449 220303002 222005600 225452224 225306452 220287990 223168870 222765871 223950520 ...
#> $ scaffold_L50 : int 5 5 5 5 5 5 5 5 5 5 ...
#> $ GC_percent : num 47 47 47 47 47 46.5 47 46.5 47 47 ...
#> $ annotation_provider : chr NA "NCBI RefSeq" NA NA ...
#> $ annotation_release_date: chr NA "2020-08-09" NA NA ...
#> $ gene_count_total : int NA 49897 NA NA NA NA NA NA NA NA ...
#> $ gene_count_coding : int NA 34337 NA NA NA NA NA NA NA NA ...
#> $ gene_count_noncoding : int NA 10366 NA NA NA NA NA NA NA NA ...
#> $ gene_count_pseudogene : int NA 5194 NA NA NA NA NA NA NA NA ...
#> $ gene_count_other : int NA NA NA NA NA NA NA NA NA NA ...
#> $ CC_ratio : num 139 139 1 46 5617 ...
# Example 2: get stats for all maize genomes using NCBI Taxonomy ID
maize_stats2 <- get_genome_stats(taxon = 4577)
# Checking if objects are the same
identical(maize_stats, maize_stats2)
#> [1] TRUE
As you can see, there are 110 maize genomes on the NCBI. You can also include filters in your searches by passing a list of key-value pairs with keys in list names and values in elements. For instance, to obtain only chromosome-scale and annotated maize genomes, you would run:
# Get chromosome-scale maize genomes with annotation
## Create list of filters
filt <- list(
filters.has_annotation = "true",
filters.assembly_level = "chromosome"
)
filt
#> $filters.has_annotation
#> [1] "true"
#>
#> $filters.assembly_level
#> [1] "chromosome"
## Obtain data
filtered_maize_genomes <- get_genome_stats(taxon = "Zea mays", filters = filt)
dim(filtered_maize_genomes)
#> [1] 4 36
For a full list of filtering parameters and possible arguments, see the API documentation.
Now, suppose you sequenced a genome, obtained assembly and annotation stats, and want to compare them to NCBI genomes to identify potential issues. Examples of situations you may encounter include:
The genome you assembled is huge and you think there might be a problem with your assembly.
Your gene annotation pipeline predicted n genes, but you are not sure if this number is reasonable compared to other assemblies of the same species or closely-related species.
To compare user-defined summary stats with NCBI stats, you will use
the function compare_genome_stats()
. This function will include the values
you observed for each statistic into a distribution (based on NCBI stats) and
return the percentile and rank of your observed values in each distribution.
As an example, let’s go back to our maize stats we obtained in the previous section. Suppose you sequenced a new maize genome and observed the following values:
To compare your observed values with those for publicly available maize genomes,
you need to store them in a data frame. The column accession is mandatory,
and any other column will be matched against columns in the data frame obtained
with get_genome_stats()
. Thus, make sure column names in your data frame
match column names in the reference data frame. Then, you can compare both
data frames as below:
# Check column names in the data frame of stats for maize genomes on the NCBI
names(maize_stats)
#> [1] "accession" "source"
#> [3] "species_taxid" "species_name"
#> [5] "species_common_name" "species_ecotype"
#> [7] "species_strain" "species_isolate"
#> [9] "species_cultivar" "assembly_level"
#> [11] "assembly_status" "assembly_name"
#> [13] "assembly_type" "submission_date"
#> [15] "submitter" "sequencing_technology"
#> [17] "atypical" "refseq_category"
#> [19] "chromosome_count" "sequence_length"
#> [21] "ungapped_length" "contig_count"
#> [23] "contig_N50" "contig_L50"
#> [25] "scaffold_count" "scaffold_N50"
#> [27] "scaffold_L50" "GC_percent"
#> [29] "annotation_provider" "annotation_release_date"
#> [31] "gene_count_total" "gene_count_coding"
#> [33] "gene_count_noncoding" "gene_count_pseudogene"
#> [35] "gene_count_other" "CC_ratio"
# Create a simulated data frame of stats for a maize genome
my_stats <- data.frame(
accession = "my_lovely_maize",
sequence_length = 2.4 * 1e9,
gene_count_total = 50000,
CC_ratio = 2
)
# Compare stats
compare_genome_stats(ncbi_stats = maize_stats, user_stats = my_stats)
#> accession variable percentile rank
#> 1 my_lovely_maize sequence_length 0.98198198 3
#> 2 my_lovely_maize gene_count_total 1.00000000 1
#> 3 my_lovely_maize CC_ratio 0.02857143 2
To have a visual representation of the summary stats obtained with
get_genome_stats()
, you will use the function plot_genome_stats()
.
# Summarize genome stats in a plot
plot_genome_stats(ncbi_stats = maize_stats)
Finally, you can pass your data frame of observed stats to highlight your values (as red points) in the distributions.
plot_genome_stats(ncbi_stats = maize_stats, user_stats = my_stats)