AssessGenes {AssessORF}    R Documentation

Assess Genes

Description

Assess and categorize a set of genes for a genome using proteomics hits, evolutionarily conserved starts, and evolutionarily conserved stops as evidence.

Usage

AssessGenes(geneLeftPos,
            geneRightPos = NA_integer_,
            geneStrand = NA_character_,
            inputMapObj,
            geneSource = "",
            minCovNum = 10,
            minCovPct = 5,
            minConCovRatio_Best = 0.99,
            limConCovRatio_NotCon = 0.8,
            minConCovRatio_Stop = 0.5,
            noConStopsGeneFrac = 0.5,
            minNumStops = 2,
            minMissORFLen = 0,
            allowNestedORFs = FALSE,
            useNTermProt = FALSE,
            verbose = TRUE)

Arguments

geneLeftPos

An integer vector with the left positions of each gene, in terms of the forward strand. Can also be a GRanges object from the GenomicRanges package that holds all of the positional information (including strand) for the genes. In that case, the next two parameters should be left as NA.

geneRightPos

An integer vector with the right positions of each gene, in terms of the forward strand. Should be left at the default value of NA_integer_ if geneLeftPos is a GRanges object.

geneStrand

A character vector consisting of "+" and "-", specifying which strand each gene is on. Should be left at the default value of NA_character_ if geneLeftPos is a GRanges object.

inputMapObj

Either an object of class Assessment and subclass DataMap, or a character string giving the strain identifier for one of these objects in AssessORFData.

geneSource

Optional character string that describes the source of the gene set, e.g. a database or gene prediction program. Used when viewing and identifying the object returned by the function.

minCovNum

Minimum number of related genomes required to have synteny to a position in the central genome. Recommended to use the default value.

minCovPct

Minimum percentage of related genomes required to have synteny to a position in the central genome. Must be an integer ranging from 0 to 100. Recommended to use the default value.

minConCovRatio_Best

Minimum value of the start codon conservation to coverage ratio needed to call a start conserved. Must range from 0 to 1. Lower values allow more conserved starts through. Recommended to use the default value.

limConCovRatio_NotCon

Maximum, non-inclusive value of the conservation to coverage ratio needed to call a possible conserved start not conserved. Used when making a decision on how to categorize the conserved start evidence. Must range from 0 to 1. Recommended to use the default value.

minConCovRatio_Stop

Minimum value of the stop codon conservation to coverage ratio needed to say a position in the central genome corresponds to a conserved stop across the related genomes. Must range from 0 to 1. Lower values allow more conserved stops through. Recommended to use the default value.

noConStopsGeneFrac

Value from 0 to 1 describing the fractional range of positions in a gene, starting from the start of the gene and moving towards the stop of the gene, to use in searching for conserved stops. For example, a value of 0.25 means that the first quarter of the gene is checked for conserved stops, a value of 0.5 corresponds to the first half of the gene, etc. Recommended to use the default value.

minNumStops

Minimum number of conserved stop positions required to be within a gene (with no mapped proteomics hits) in order to categorize that gene as an overprediction. Recommended to use the default value.

minMissORFLen

Minimum ORF length required to include an ORF with protein hits but no gene start in the final results.

allowNestedORFs

Logical indicating whether or not to include ORFs with protein hits but no gene starts that are completely nested within an ORF in another frame in the final results.

useNTermProt

Logical indicating whether or not to require the given gene starts to align with (or be one codon off from) the start of the first proteomics hit in the ORF. The mapping object must be built with N-terminal proteomics data. Default value is FALSE.

verbose

Logical indicating whether or not to display progress and status messages.
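
As a rough illustration of noConStopsGeneFrac, the sketch below computes which forward-strand positions would fall in the searched portion of a gene. The helper conStopSearchRange is hypothetical and not part of AssessORF, and rounding the boundary down is an assumption; the package's internal handling may differ.

```r
## Hypothetical helper (not part of AssessORF): returns the forward-strand
## positions covered by the first `frac` of a gene, counted from its start.
## ASSUMPTION: the boundary is rounded down; AssessORF may round differently.
conStopSearchRange <- function(leftPos, rightPos, strand, frac = 0.5) {
  stopifnot(leftPos < rightPos, strand %in% c("+", "-"),
            frac >= 0, frac <= 1)
  geneLen <- rightPos - leftPos + 1L
  numPos <- floor(geneLen * frac)
  if (strand == "+") {
    ## Forward strand: the gene starts at the left position.
    c(from = leftPos, to = leftPos + numPos - 1L)
  } else {
    ## Reverse strand: the gene starts at the right position.
    c(from = rightPos - numPos + 1L, to = rightPos)
  }
}

## First half of a forward-strand gene, first quarter of a reverse-strand gene.
fwdRange <- conStopSearchRange(101L, 400L, "+", frac = 0.5)
revRange <- conStopSearchRange(101L, 400L, "-", frac = 0.25)
```

For a reverse-strand gene the searched range sits at the right end of the gene, because that is where the gene starts in forward-strand coordinates.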

Details

For each of the given genes, AssessGenes assigns a category based on where conserved starts, conserved stops, and/or proteomics hits are located in relation to the start of the gene. The category assignments for the genes are stored in the CategoryAssignments vector in the Results object returned by the function. Please see Assessment-class for a list of all possible categories and their descriptions.

If geneLeftPos is a GRanges object, then the left and right positions of each gene along with the strand of each gene are extracted from the object. Any sequence names given for the genes within the GRanges object are ignored, and the CategoryAssignments in the returned Results object follow the same order in which the genes are listed within the GRanges object.
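
A minimal sketch of the GRanges input form follows. The gene coordinates, the sequence name "chr1", and the object myMapObj are placeholders invented for illustration. The block only builds the GRanges object when the GenomicRanges and IRanges packages are installed.

```r
## Illustrative gene coordinates (made up for this sketch).
geneLeft <- c(100L, 900L, 2500L)
geneRight <- c(450L, 1700L, 3100L)
geneStrand <- c("+", "-", "+")

if (requireNamespace("GenomicRanges", quietly = TRUE) &&
    requireNamespace("IRanges", quietly = TRUE)) {
  ## One GRanges object carries left/right positions and strand together.
  ## The sequence name is arbitrary here, since AssessGenes ignores it.
  myGenes <- GenomicRanges::GRanges(
    seqnames = rep("chr1", length(geneLeft)),
    ranges = IRanges::IRanges(start = geneLeft, end = geneRight),
    strand = geneStrand)

  ## With a GRanges object, geneRightPos and geneStrand keep their NA defaults.
  ## Not run (myMapObj is a placeholder for an existing DataMap object):
  ## myResObj <- AssessGenes(geneLeftPos = myGenes, inputMapObj = myMapObj)
}
```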

If gene positional information is instead given as three vectors, then the three vectors, geneLeftPos, geneRightPos, and geneStrand, must all be of the same length. Using the same index with each vector must provide information on the same gene (think of the vectors as columns of the same table). geneLeftPos and geneRightPos describe the upstream and downstream positions (respectively) for each gene in terms of the forward strand. For genes on the forward strand, geneLeftPos corresponds to the start positions and geneRightPos corresponds to stop positions. For genes on the reverse strand, geneLeftPos corresponds to the stop positions and geneRightPos corresponds to the start positions. Gene positions on the reverse strand must be relative to the 5' to 3' direction of the forward strand (as opposed to being relative to the 5' to 3' direction of the reverse strand). This means that none of the elements of geneLeftPos can be greater than (or equal to) the corresponding element in geneRightPos. The CategoryAssignments in the returned Results object has the same length as and aligns with the indexing of the three given gene positional information vectors.
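
The requirements on the three vectors can be checked before calling AssessGenes. The sketch below uses a hypothetical helper, checkGeneVectors, which is not part of AssessORF, and made-up coordinates. It also shows how start and stop positions follow from the strand.

```r
## Hypothetical pre-check (not part of AssessORF) for the three-vector input.
checkGeneVectors <- function(geneLeftPos, geneRightPos, geneStrand) {
  stopifnot(length(geneLeftPos) == length(geneRightPos),
            length(geneLeftPos) == length(geneStrand),
            all(geneStrand %in% c("+", "-")),
            all(geneLeftPos < geneRightPos))
  ## Start and stop positions depend on the strand.
  isFwd <- geneStrand == "+"
  data.frame(left = geneLeftPos,
             right = geneRightPos,
             strand = geneStrand,
             start = ifelse(isFwd, geneLeftPos, geneRightPos),
             stop = ifelse(isFwd, geneRightPos, geneLeftPos),
             stringsAsFactors = FALSE)
}

## One forward-strand gene and one reverse-strand gene (illustrative values).
geneTable <- checkGeneVectors(geneLeftPos = c(100L, 900L),
                              geneRightPos = c(450L, 1700L),
                              geneStrand = c("+", "-"))
```

Each row of geneTable describes one gene, which mirrors the "columns of the same table" view of the three vectors.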

Please ensure that the same genome used in the mapping function is also used to derive the set of genes for this assessment function. The function will only error if any gene positions are outside the bounds of the genome and does not make any other checks to make sure the genes are valid for the genome.
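
Because out-of-bounds positions are the only condition that triggers an error, it can help to look for them beforehand. The sketch below assumes the genome length is known; genomeLength and the coordinates are made-up values, and the check itself is not part of AssessORF.

```r
## Hypothetical pre-check (not part of AssessORF): flag genes that fall
## outside a genome of known length. All values below are made up.
genomeLength <- 5000L
geneLeft <- c(100L, 900L, 4800L)
geneRight <- c(450L, 1700L, 5100L)

## A gene is out of bounds if it begins before position 1
## or extends past the last position of the genome.
outOfBounds <- geneLeft < 1L | geneRight > genomeLength
badGenes <- which(outOfBounds)
```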

The minimum coverage required in determining conserved starts and stops is the larger of two values: minCovNum, and (minCovPct / 100) multiplied by the number of related genomes.
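
The coverage rule can be written out as a short calculation. The function name minCoverage and the genome counts are illustrative only, and the sketch assumes no rounding is applied, which the package may handle differently.

```r
## Sketch of the minimum-coverage rule described above.
## ASSUMPTION: no rounding is applied; the package may round internally.
minCoverage <- function(numRelatedGenomes, minCovNum = 10, minCovPct = 5) {
  max(minCovNum, (minCovPct / 100) * numRelatedGenomes)
}

covSmall <- minCoverage(100)  # 5% of 100 is 5, so minCovNum (10) applies
covLarge <- minCoverage(500)  # 5% of 500 is 25, which exceeds minCovNum
```

With the default settings, the percentage only takes over once there are more than 200 related genomes.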

Additionally, open reading frames with proteomics evidence but no gene start are categorized based on whether or not there is a conserved start upstream of the proteomic evidence. The positions and lengths of these open reading frames are included in the N_CS-_PE+_ORFs and N_CS+_PE+_ORFs matrices within the final object that is returned.

Value

An object of class Assessment and subclass Results.

See Also

Assessment-class

Examples


## Example showing the minimum number of arguments that need to be specified:

## Not run: 
myResObj <- AssessGenes(geneLeftPos = myGenesLeft,
                        geneRightPos = myGenesRight,
                        geneStrand = myGenesStrand,
                        inputMapObj = myMapObj)

## End(Not run)



## Example from vignette is shown below

currMapObj <- readRDS(system.file("extdata",
                                  "MGAS5005_PreSaved_DataMapObj.rds",
                                  package = "AssessORF"))

currProdigal <- readLines(system.file("extdata",
                                      "MGAS5005_Prodigal.sco",
                                      package = "AssessORF"))[-1:-2]

## Split each line on "_" once, then pull out the left, right, and strand fields
prodigalFields <- strsplit(currProdigal, "_", fixed=TRUE)
prodigalLeft <- as.numeric(sapply(prodigalFields, `[`, 2L))
prodigalRight <- as.numeric(sapply(prodigalFields, `[`, 3L))
prodigalStrand <- sapply(prodigalFields, `[`, 4L)

currResObj <- AssessGenes(geneLeftPos = prodigalLeft,
                          geneRightPos = prodigalRight,
                          geneStrand = prodigalStrand,
                          inputMapObj = currMapObj,
                          geneSource = "Prodigal")


[Package AssessORF version 1.0.2 Index]