The function variant_call() initiates a workflow of multiple functions from DADA2. Initially, DADA2 was developed for denoising metabarcoding reads from environmental samples where a high number of ASVs from different species are expected. Thus, default DADA2 parameters aim to maximize true positives while minimizing false positives (artifacts) and false negatives (true variants classified as artifacts) in metagenomic samples (Callahan et al., 2016; Rosen et al., 2012). However, tidyGenR will likely be used in many cases for determining amplicon variants in diploid species, as in vertebrates. For diploids, a maximum of two variants per locus are expected. This scenario justifies exploring how “child” variants are classified by DADA2.
birth_pval is the probability of variant being classified in its parent group given the error model. When birth_pval < OMEGA_A the child variant is promoted to its own cluster.
The pior knowledge of a maximum of 2 alleles per locus for diploids can be used to explore OMEGA_A and read frequency thresholds. These can be edited to maximize the number of “true” variants (true positives) while minimizing the risk of calling artifacts as real alleles.
library(tidyGenR)
The function explore_dada() permits to run DADA2 under testing parameters. It is recommended to run it with a high value of ‘OMEGA_A’. Under such a high threshold most variants will have a lower birth_pval and will form their own cluster. This allows the user to visually determine what thresholds will work best for their given dataset.
fq <-
list.files(system.file("extdata", "truncated",
package = "tidyGenR"),
pattern = "F_filt.fastq.gz",
full.names = TRUE)
ex_d <-
explore_dada(fq,
value_na = 10,
reduced = TRUE,
pool = FALSE,
vline = 2,
hline_fr = 0.1,
omega_a = 0.9,
band_size = 16
)
ex_d$p3
In this case, the minimum log(-log(birth_pval)) for the less frequent allele is around 4, meaning that a threshold of log(-log(birth_pval)) < 4 would be necessary to recover all variants in the sequence dataset. A log(-log(birth_pval)) of 4 implies a birth_pval of approximately \(\exp(-\exp(4)) \approx 1.9 \times 10^{-24}\), indicating an extremely low probability that such a variant arises as an error from its parent sequence under the DADA2 error model.
Variant calls with different parameters can be compared with compare_calls(). For this example, two runs with different thresholds of OMEGA_A are compared. From the explore_dada() results above, it can be concluded that a log(-log(birth_pval)) of 6 will leave out some variants for chrna9 and nfkbia.
# path to fastq
truncated <-
system.file("extdata", "truncated", package = "tidyGenR")
# variant calling
omega_a_test <- c(relaxed = 4, restrictive = 6)
variants_called <-
omega_a_test |>
lapply(function(x) {
variant_call(in_dir = truncated,
omega_a_f = exp(-exp(x)),
band_size = 16,
maf = 0.1
)
})
#> A total of 3 locus/loci have/has been detected in file(s) with the pattern '_F_filt.fastq.gz' in the path '/private/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T/RtmpNrSQoX/Rinste0cb5728fb12/tidyGenR/extdata/truncated':
#> chrna9 nfkbia rogdi
#> Trying to call variants in single end mode.
#>
#> Calling variants for chrna9
#> Identified 0 bimeras out of 2 input sequences.
#>
#> Calling variants for nfkbia
#> Identified 0 bimeras out of 2 input sequences.
#>
#> Calling variants for rogdi
#> Identified 0 bimeras out of 1 input sequences.
#> A total of 3 locus/loci have/has been detected in file(s) with the pattern '_F_filt.fastq.gz' in the path '/private/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T/RtmpNrSQoX/Rinste0cb5728fb12/tidyGenR/extdata/truncated':
#> chrna9 nfkbia rogdi
#> Trying to call variants in single end mode.
#>
#> Calling variants for chrna9
#> Identified 0 bimeras out of 2 input sequences.
#>
#> Calling variants for nfkbia
#> Identified 0 bimeras out of 1 input sequences.
#>
#> Calling variants for rogdi
#> Identified 0 bimeras out of 1 input sequences.
# compare calls
comp_call <-
compare_calls(variants_called)
#> 'td' is a named list with 2 or more dataframes that contain sample locus sequence.
#> MDS with 2 dimensions requires at least 3 elements to compare.
#> No output EXCEL file has been written. Change this behaviour by setting a path to 'dest_file'.
comp_call$plot2
It can be appreciated that in the restrictive call some variants are missing compared to the relaxed calling. Other outputs are produced. See str(comp_call) and ?compare_calls() for more information.