MapAssessmentData {AssessORF} | R Documentation |
Maps proteomics hits and evolutionarily conserved starts to a central genome
MapAssessmentData(genomes_DBFile, tblName = "Seqs", central_ID, related_IDs, protHits_Seqs, protHits_Scores = rep.int(1, length(protHits_Seqs)), strainID = "", speciesName = "", protHits_Threshold = 0, protHits_IsNTerm = FALSE, related_KMerLen = 8, related_MinDist = 0.01, related_MaxDistantN = 1000, startCodons = c("ATG", "GTG", "TTG"), useProt = TRUE, useCons = TRUE, verbose = TRUE)
genomes_DBFile |
A SQLite connection object or a character string specifying the path to the database file. |
tblName |
Character string specifying the table where the genome sequences are located. |
central_ID |
Character string specifying which identifier corresponds to the central genome, the genome to which the proteomics data and evolutionary conservation data will be mapped. |
related_IDs |
Character vector of strings specifying identifiers that correspond to related genomes, the genomes that will be used to determine which start codons (ATG, GTG, and TTG) are evolutionarily conserved. |
protHits_Seqs |
Character vector of amino acid strings that correspond to the sequences for the proteomics hits. |
protHits_Scores |
Numeric vector of (confidence) scores for the proteomics hits. Scores cannot be negative. The default option assigns a score of one to each proteomics hit. |
strainID |
Optional character string that specifies the strain identifier that the central genome corresponds to. |
speciesName |
Optional character string that specifies the name of the species that the central genome corresponds to. |
protHits_Threshold |
Optional number that specifies what percent of the lowest scoring proteomics hits should be dropped. Must be a non-negative integer less than 100. |
protHits_IsNTerm |
Logical describing whether or not the proteomics hits come from N-terminal proteomics. Default value is false. |
related_KMerLen |
The k-mer length to be used when measuring distances between the central genome and related genomes. Default value is 8. Recommended to use the default value. |
related_MinDist |
The minimum fractional distance required for a related genome to be used in finding evolutionary conservation. Used to prevent the inclusion of related genomes that are too similar to the central genome. Default value is 0.01. Recommended to use the default value. |
related_MaxDistantN |
The maximum number of related genomes to use in finding evolutionary conservation after the related genomes have been sorted from most distantly related to most closely related in relation to the central genome. Default value is 1000. |
startCodons |
A character vector consisting of three-letter DNA strings to use as the start codons when finding evolutionarily conserved starts. |
useProt |
Logical indicating whether or not proteomics evidence should be mapped to the genome. Default value is true. Cannot be false if useCons is also false. |
useCons |
Logical indicating whether or not evolutionary conservation evidence should be mapped to the genome. Default value is true. Cannot be false if useProt is also false. |
verbose |
Logical indicating whether or not to display progress and status messages. |
MapAssessmentData maps the given data (proteomics data, evolutionary conservation data, or both) to the given central genome and stores those mappings in the object it returns. That object can then be used to assess the quality of genes predicted for that same central genome.
All genomes used inside this function, including the central genome, must be inside the specified table of the specified database. If the central genome is not found, the function returns an error. Please see the Using AssessORF vignette for details on how to populate a database with genomic sequences.
Information on the proteomics hits is primarily given by protHits_Seqs and protHits_Scores. The sequences (protHits_Seqs) are mapped to the six-frame translations of the central genome, and the scores (protHits_Scores) are used in thresholding and plotting the proteomics hits.
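As a rough illustration of what mapping a peptide to the six-frame translations means, the base R sketch below translates a DNA string in all six reading frames and reports where a peptide occurs. It assumes the standard genetic code and unambiguous bases (A, C, G, T). The helpers translateFrame, revComp, sixFrames, and findPeptide are hypothetical names introduced here; they are not AssessORF functions and do not reflect how the package performs the mapping internally.

```r
# Standard genetic code, with codons ordered T, C, A, G at each position.
bases <- c("T", "C", "A", "G")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
aminoAcids <- strsplit(paste0("FFLLSSSSYY**CC*W", "LLLLPPPPHHQQRRRR",
                              "IIIMTTTTNNKKSSRR", "VVVVAAAADDEEGGGG"), "")[[1]]
codonTable <- setNames(aminoAcids, paste0(grid$first, grid$second, grid$third))

# Translate one reading frame, starting at base `offset` (1, 2, or 3).
translateFrame <- function(dna, offset) {
  n <- nchar(dna)
  if (n - offset < 2) return("")
  starts <- seq(offset, n - 2, by = 3)
  codons <- substring(dna, starts, starts + 2)
  paste(codonTable[codons], collapse = "")
}

# Reverse complement of a DNA string (unambiguous bases only).
revComp <- function(dna) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", dna), "")[[1]]), collapse = "")
}

# All six frames: three on the forward strand, three on the reverse strand.
sixFrames <- function(dna) {
  rc <- revComp(dna)
  frames <- c(vapply(1:3, function(k) translateFrame(dna, k), character(1)),
              vapply(1:3, function(k) translateFrame(rc, k), character(1)))
  setNames(frames, c("+1", "+2", "+3", "-1", "-2", "-3"))
}

# Report the frame and amino acid position of the first match in each frame.
findPeptide <- function(peptide, dna) {
  frames <- sixFrames(dna)
  pos <- as.integer(regexpr(peptide, frames, fixed = TRUE))
  data.frame(frame = names(frames)[pos > 0], aaStart = pos[pos > 0],
             stringsAsFactors = FALSE)
}

dna <- "ATGGCCAAGTAA"
frames <- sixFrames(dna)
hit <- findPeptide("AK", dna)
```

In this toy input the forward frame 1 translation is "MAK*", so the peptide "AK" is found in frame "+1" starting at amino acid 2.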
protHits_Scores can be a single number. In that case, that number is used as the score for all proteomics hits. Otherwise, protHits_Scores must be of the same length as protHits_Seqs.
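The rule for score lengths can be sketched in a few lines of base R. The helper normalizeScores is a hypothetical name used only for illustration, and its error messages are assumptions; the package's own argument checks may be worded and ordered differently.

```r
# Hypothetical helper (not part of AssessORF): expand a single score to one
# score per hit, or check that a vector of scores matches the hits.
normalizeScores <- function(seqs, scores = 1) {
  if (!is.numeric(scores) || anyNA(scores) || any(scores < 0)) {
    stop("scores must be non-negative numbers")
  }
  if (length(scores) == 1L) {
    scores <- rep(scores, length(seqs))
  } else if (length(scores) != length(seqs)) {
    stop("scores must be a single number or match the length of seqs")
  }
  scores
}

hits <- c("MKTAYIAK", "GLSDGEWQ", "VLSPADKT")
sameScore <- normalizeScores(hits, 0.9)           # one score reused for every hit
ownScores <- normalizeScores(hits, c(12, 40, 7))  # one score per hit
```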
Only proteomics hits with a score greater than the percentile given by protHits_Threshold are kept; the rest are dropped. If all the proteomics hits have the same score or if protHits_Threshold is zero, no thresholding occurs and no hits are dropped.
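A minimal sketch of percentile thresholding, assuming the cutoff is computed with R's default quantile() method; the package may compute the percentile differently. The function thresholdHits is a hypothetical name, not an AssessORF function.

```r
# Hypothetical helper: keep hits whose score is strictly above the percentile
# cutoff. A threshold of 0 or identical scores disables thresholding.
thresholdHits <- function(seqs, scores, threshold = 0) {
  stopifnot(length(seqs) == length(scores), threshold >= 0, threshold < 100)
  if (threshold == 0 || length(unique(scores)) == 1L) {
    return(seqs)
  }
  # Assumption: default quantile type; the cutoff is the score at that percentile.
  cutoff <- quantile(scores, probs = threshold / 100, names = FALSE)
  seqs[scores > cutoff]
}

hits <- c("PEPA", "PEPB", "PEPC", "PEPD", "PEPE")
scores <- c(10, 20, 30, 40, 50)
kept <- thresholdHits(hits, scores, threshold = 25)
```

With these toy scores the 25th percentile is 20, so the two hits scoring 10 and 20 are dropped. Because the comparison is strict, a hit whose score equals the cutoff is dropped as well.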
Please note that protHits_IsNTerm has no effect on how the proteomics evidence is mapped to the central genome, but it can be used to affect how genes are assessed and categorized in AssessGenes.
Evolutionarily conserved starts and conserved stops are found by first measuring how far the related genomes are from the central genome using k-mer frequencies. Next, the most distant related genomes are aligned to the central genome. This provides information on how often each position in the central genome is covered by syntenic matches to related genomes (coverage), how often those positions correspond to start codons in both genomes (start codon conservation), and how often those positions correspond to stop codons in related genomes (stop codon conservation). A ratio of conservation to coverage is used in downstream functions to measure the strength of both conserved starts and conserved stops.
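The conservation-to-coverage ratio can be illustrated with per-position counts. The vectors below are made-up toy values, and conservationRatio is a hypothetical helper; how downstream functions handle positions with zero coverage is an assumption here (they are returned as NA).

```r
# Hypothetical helper: ratio of conservation to coverage at each position.
# Positions with no coverage have no defined ratio and are returned as NA.
conservationRatio <- function(conservation, coverage) {
  stopifnot(length(conservation) == length(coverage),
            all(conservation <= coverage))
  ifelse(coverage > 0, conservation / coverage, NA_real_)
}

# Toy counts for five positions in the central genome:
coverage <- c(10, 10, 8, 0, 12)          # related genomes covering the position
startConservation <- c(9, 1, 4, 0, 12)   # of those, how many share a start codon

ratio <- conservationRatio(startConservation, coverage)
```

A ratio near 1 means that nearly every related genome covering the position also has a start codon there, while a low ratio on high coverage is weak evidence for a conserved start.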
Related genomes should be from species that are closely related to the given strain. related_IDs specifies the identifiers for the sequences of the related genomes inside the database. A related genome identifier (each element of related_IDs) is considered invalid and is not used when finding evolutionary conservation if it is not found in the database. Please note that the function will only throw an error when none of the related genomes are found.
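The handling of invalid identifiers can be sketched as follows. The function checkRelatedIDs and the vector dbIDs (standing in for the identifiers present in the database table) are hypothetical and only illustrate the rule described above.

```r
# Hypothetical helper: keep the related identifiers that exist in the
# database and stop only if none of them do.
checkRelatedIDs <- function(related_IDs, dbIDs) {
  valid <- related_IDs[related_IDs %in% dbIDs]
  if (length(valid) == 0L) {
    stop("none of the related genomes were found in the database")
  }
  valid
}

dbIDs <- as.character(1:13)                       # identifiers in the table
usable <- checkRelatedIDs(c("2", "3", "99"), dbIDs)  # "99" is dropped
```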
If there are fewer valid related genomes in the sequence database than the value of related_MaxDistantN, all valid related genomes will be used in finding evolutionary conservation.
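The interplay of related_MinDist and related_MaxDistantN can be sketched on a named vector of distances. The distances are made-up values and selectRelated is a hypothetical helper; whether the package compares against the minimum distance with >= or > is an assumption.

```r
# Hypothetical helper: drop genomes closer than minDist, sort the rest from
# most to least distant, and keep at most maxDistantN of them.
selectRelated <- function(dists, minDist = 0.01, maxDistantN = 1000) {
  eligible <- dists[dists >= minDist]        # assumption: inclusive comparison
  ranked <- sort(eligible, decreasing = TRUE)
  names(head(ranked, maxDistantN))
}

# Toy distances from the central genome, named by genome identifier:
dists <- c("2" = 0.004, "3" = 0.21, "4" = 0.35, "5" = 0.08)

topTwo <- selectRelated(dists, maxDistantN = 2)
allValid <- selectRelated(dists)
```

Genome "2" is excluded for being too similar to the central genome. With the default limit of 1000 all three remaining genomes are used, since fewer than 1000 are available.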
The logical flag useProt is used to indicate whether or not proteomics evidence has been provided and should be mapped to the genome. Error checking will not occur for any arguments that involve proteomics if useProt is false.
The logical flag useCons is used to indicate whether or not evolutionary conservation evidence has been provided and should be mapped to the genome. Error checking will not occur for any arguments that involve evolutionary conservation if useCons is false.
An object of class Assessment and subclass DataMap
## Example showing the minimum number of arguments that need to be specified
## to map both proteomics and evolutionary conservation data:

## Not run:
myMapObj <- MapAssessmentData(myDBFile,
                              central_ID = "1",
                              related_IDs = as.character(2:1001),
                              protHits_Seqs = myProtSeqs)
## End(Not run)

## Runnable example that uses evolutionary conservation data only:

## Human adenovirus 1 is the strain of interest, and the set of Adenoviridae
## genomes will serve as the set of genomes. The central genome, also known as
## the genome of human adenovirus 1, is at identifier 1. The related genomes
## are at identifiers 2 - 13.

myMapObj <- MapAssessmentData(system.file("extdata",
                                          "Adenoviridae.sqlite",
                                          package = "AssessORF"),
                              central_ID = "1",
                              related_IDs = as.character(2:13),
                              speciesName = "Human adenovirus 1",
                              useProt = FALSE)