AssessGenes {AssessORF} | R Documentation |
Assess and categorize a set of genes for a genome using proteomics hits, evolutionarily conserved starts, and evolutionarily conserved stops as evidence
AssessGenes(geneLeftPos, geneRightPos = NA_integer_, geneStrand = NA_character_, inputMapObj, geneSource = "", minCovNum = 10, minCovPct = 5, minConCovRatio_Best = 0.99, limConCovRatio_NotCon = 0.8, minConCovRatio_Stop = 0.5, noConStopsGeneFrac = 0.5, minNumStops = 2, minMissORFLen = 0, allowNestedORFs = FALSE, useNTermProt = FALSE, verbose = TRUE)
geneLeftPos |
An integer vector with the left positions of each gene, in terms of the forward strand.
Can also be a GRanges object. |
geneRightPos |
An integer vector with the right positions of each gene, in terms of the forward strand.
Should be left at the default value of NA_integer_ if geneLeftPos is a GRanges object. |
geneStrand |
A character vector consisting of "+" and "-", specifying which strand each gene is on.
Should be left at the default value of NA_character_ if geneLeftPos is a GRanges object. |
inputMapObj |
EITHER an object of class |
geneSource |
Optional character string that describes the source of the gene set, i.e. a database or gene prediction program. Used when viewing and identifying the object returned by the function. |
minCovNum |
Minimum number of related genomes required to have synteny to a position in the central genome. Recommended to use the default value. |
minCovPct |
Minimum percentage of related genomes required to have synteny to a position in the central genome. Must be an integer ranging from 0 to 100. Recommended to use the default value. |
minConCovRatio_Best |
Minimum value of the start codon conservation to coverage ratio needed to call a start conserved. Must range from 0 to 1. Lower values allow more conserved starts through. Recommended to use the default value. |
limConCovRatio_NotCon |
Maximum, non-inclusive value of the conservation to coverage ratio needed to call a possible conserved start not conserved. Used when making a decision on how to categorize the conserved start evidence. Must range from 0 to 1. Recommended to use the default value. |
minConCovRatio_Stop |
Minimum value of the stop codon conservation to coverage ratio needed to say a position in the central genome corresponds to a conserved stop across the related genomes. Must range from 0 to 1. Lower values allow more conserved stops through. Recommended to use the default value. |
noConStopsGeneFrac |
Value from 0 to 1 describing the fractional range of positions in a gene, starting from the start of the gene and moving towards the stop of the gene, to use in searching for conserved stops. For example, a value of 0.25 means that the first quarter of the gene is checked for conserved stops, a value of 0.5 corresponds to the first half of the gene, etc. Recommended to use the default value. |
minNumStops |
Minimum number of conserved stop positions required to be within a gene (with no mapped proteomics hits) in order to categorize that gene as an overprediction. Recommended to use the default value. |
minMissORFLen |
Minimum ORF length required to include an ORF with protein hits but no gene start in the final results. |
allowNestedORFs |
Logical indicating whether or not to include ORFs with protein hits but no gene starts that are completely nested within an ORF in another frame in the final results. |
useNTermProt |
Logical indicating whether or not to require the given gene starts to align with (or be one codon off from) the start of the first proteomics hit in the ORF. The mapping object must be built with N-terminal proteomics data. Default value is FALSE. |
verbose |
Logical indicating whether or not to display progress and status messages. |
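The effect of noConStopsGeneFrac can be illustrated with a small base R sketch. The helper name conStopSearchRange and the coordinates are hypothetical and not part of AssessORF, and rounding the search length down is an assumption; the package may compute the range differently.

```r
# Hypothetical helper (not part of AssessORF): returns the forward-strand
# positions covered by the first `frac` of a gene, measured from its start.
conStopSearchRange <- function(leftPos, rightPos, strand, frac = 0.5) {
  geneLen <- rightPos - leftPos + 1
  # Number of positions to search; rounding down is an assumption.
  searchLen <- floor(frac * geneLen)
  if (strand == "+") {
    # Forward strand: the gene starts at the left position.
    c(from = leftPos, to = leftPos + searchLen - 1)
  } else {
    # Reverse strand: the gene starts at the right position.
    c(from = rightPos - searchLen + 1, to = rightPos)
  }
}

# A made-up 300-nucleotide gene checked over its first quarter on each strand.
fwdRange <- conStopSearchRange(101, 400, "+", frac = 0.25)
revRange <- conStopSearchRange(101, 400, "-", frac = 0.25)
```

On the forward strand the searched range begins at the left position, while on the reverse strand it ends at the right position, because that is where the gene starts.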
For each of the given genes, AssessGenes assigns a category based on where conserved starts, conserved stops, and/or proteomics hits are located in relation to the start of the gene. The category assignments for the genes are stored in the CategoryAssignments vector in the Results object returned by the function. Please see Assessment-class for a list of all possible categories and their descriptions.
If geneLeftPos is a GRanges object, then the left and right positions of each gene, along with the strand of each gene, are extracted from the object. Any sequence names given for the genes within the GRanges object are ignored, and the CategoryAssignments vector in the returned Results object follows the same order in which the genes are listed within the GRanges object.
If gene positional information is instead given as three vectors, then the three vectors, geneLeftPos, geneRightPos, and geneStrand, must all be of the same length. Using the same index with each vector must provide information on the same gene (think of the vectors as columns of the same table). geneLeftPos and geneRightPos describe the upstream and downstream positions (respectively) for each gene in terms of the forward strand. For genes on the forward strand, geneLeftPos corresponds to the start positions and geneRightPos corresponds to the stop positions. For genes on the reverse strand, geneLeftPos corresponds to the stop positions and geneRightPos corresponds to the start positions. Gene positions on the reverse strand must be relative to the 5' to 3' direction of the forward strand (as opposed to being relative to the 5' to 3' direction of the reverse strand). This means that none of the elements of geneLeftPos can be greater than (or equal to) the corresponding element in geneRightPos. The CategoryAssignments vector in the returned Results object has the same length as, and aligns with the indexing of, the three given gene positional information vectors.
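The constraints on the three positional vectors can be checked before calling AssessGenes. The following is a minimal sketch in base R; the helper name checkGeneVectors and the example coordinates are hypothetical and are not part of AssessORF.

```r
# Hypothetical helper (not part of AssessORF): checks the constraints
# described above on the three gene positional vectors.
checkGeneVectors <- function(geneLeftPos, geneRightPos, geneStrand) {
  # All three vectors must describe the same genes, index by index.
  sameLength <- length(geneLeftPos) == length(geneRightPos) &&
    length(geneLeftPos) == length(geneStrand)
  if (!sameLength) {
    return(FALSE)
  }
  # Strands may only be "+" or "-".
  validStrand <- all(geneStrand %in% c("+", "-"))
  # Left positions must be strictly less than right positions,
  # regardless of strand.
  validOrder <- all(geneLeftPos < geneRightPos)
  validStrand && validOrder
}

# Made-up coordinates for three genes, given relative to the forward strand.
myGenesLeft   <- c(101L, 850L, 2000L)
myGenesRight  <- c(700L, 1449L, 2899L)
myGenesStrand <- c("+", "-", "+")

isValid <- checkGeneVectors(myGenesLeft, myGenesRight, myGenesStrand)

# Start positions follow from the strand: left for "+", right for "-".
geneStarts <- ifelse(myGenesStrand == "+", myGenesLeft, myGenesRight)
```

Here the second gene is on the reverse strand, so its start is taken from geneRightPos even though its coordinates are still given relative to the forward strand.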
Please ensure that the same genome used in the mapping function is also used to derive the set of genes for this assessment function. The function will only error if any gene positions are outside the bounds of the genome and does not make any other checks to make sure the genes are valid for the genome.
The minimum coverage required in determining conserved starts and stops is the larger of two values: minCovNum, or (minCovPct/100) multiplied by the number of related genomes.
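As a worked illustration of this threshold, the sketch below uses the default values of minCovNum and minCovPct together with made-up numbers of related genomes. The variable names other than the two arguments are hypothetical, and whether the function rounds the percentage-based value is not stated here.

```r
# Default argument values from the usage section.
minCovNum <- 10
minCovPct <- 5

# Hypothetical number of related genomes in the mapping object.
numRelatedGenomes <- 320

# Minimum coverage is the larger of the absolute and percentage-based thresholds.
# Here 5% of 320 genomes is 16, which exceeds 10, so the percentage applies.
minCoverage <- max(minCovNum, (minCovPct / 100) * numRelatedGenomes)

# With only 40 related genomes, 5% is 2, so the absolute threshold of 10 applies.
minCoverageSmall <- max(minCovNum, (minCovPct / 100) * 40)
```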
Additionally, open reading frames with proteomics evidence but no gene start are categorized based on whether or not there is a conserved start upstream of the proteomic evidence. The positions and lengths of these open reading frames are included in the N_CS-_PE+_ORFs and N_CS+_PE+_ORFs matrices within the final object that is returned.
An object of class Assessment and subclass Results.
## Example showing the minimum number of arguments that need to be specified:
## Not run:
myResObj <- AssessGenes(geneLeftPos = myGenesLeft,
                        geneRightPos = myGenesRight,
                        geneStrand = myGenesStrand,
                        inputMapObj = myMapObj)
## End(Not run)

## Example from vignette is shown below
currMapObj <- readRDS(system.file("extdata",
                                  "MGAS5005_PreSaved_DataMapObj.rds",
                                  package = "AssessORF"))

currProdigal <- readLines(system.file("extdata",
                                      "MGAS5005_Prodigal.sco",
                                      package = "AssessORF"))[-1:-2]

prodigalLeft <- as.numeric(sapply(strsplit(currProdigal, "_", fixed=TRUE), `[`, 2L))
prodigalRight <- as.numeric(sapply(strsplit(currProdigal, "_", fixed=TRUE), `[`, 3L))
prodigalStrand <- sapply(strsplit(currProdigal, "_", fixed=TRUE), `[`, 4L)

currResObj <- AssessGenes(geneLeftPos = prodigalLeft,
                          geneRightPos = prodigalRight,
                          geneStrand = prodigalStrand,
                          inputMapObj = currMapObj,
                          geneSource = "Prodigal")