DetectRepeats {DECIPHER} | R Documentation |
Detects approximate copies of sequence patterns that likely arose from duplication events and therefore share a common ancestor.
DetectRepeats(myXStringSet, type = "tandem", minScore = 10, allScores = FALSE, maxPeriod = 10000, maxFailures = 2, maxShifts = 5, alphabet = AA_REDUCED[[125]], processors = 1, verbose = TRUE, ...)
myXStringSet |
An |
type |
Character string indicating the type of repeats to detect. This should be (an abbreviation of) one of "tandem", "interspersed", or "both". |
minScore |
Numeric giving the minimum log-odds score of repeats in |
allScores |
Logical specifying whether all repeats should be returned ( |
maxPeriod |
Numeric indicating the maximum periodicity of tandem repeats to consider. Interspersed repeats will only be detected that are at least |
maxFailures |
Numeric determining the maximum number of failing attempts to extend a repeat that are permitted. Numbers greater than zero may increase accuracy at the expense of speed, with decreasing marginal returns as |
maxShifts |
Numeric determining the maximum number of failing attempts to shift a repeat left or right that are permitted. Numbers greater than zero may increase accuracy at the expense of speed, with decreasing marginal returns as |
alphabet |
Character vector of amino acid groupings used to reduce the 20 standard amino acids into smaller groups. Alphabet reduction helps to find more distant homologies between sequences. A non-reduced amino acid alphabet can be used by setting |
processors |
The number of processors to use, or |
verbose |
Logical indicating whether to display progress. |
... |
Further arguments to be passed directly to |
Many sequences are composed of a substantial fraction of repetitive sequence. Two main forms of repetition are tandem repeats and interspersed repeats, which can be caused by duplication events followed by divergence. Detecting duplications is challenging because of variability in repeat length and composition due to evolution. The significance of a repeat can be quantified by its time since divergence from a common ancestor. DetectRepeats uses a seed-and-extend approach to identify candidate repeats, and tests whether a set of repeats is statistically distinct from a background distribution using a multinomial goodness-of-fit test. The background distribution is derived from the input sequences (myXStringSet) and represents the distribution of characters that would be expected if repeats had diverged infinitely long ago (see Schaper et al. (2012) for examples). The reported score is the product of statistical significance (i.e., log-odds) and average similarity among the repeats. Therefore, a high score implies the repeats diverged within a finite amount of time from a common ancestor.
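The idea of testing repeat columns against a background distribution can be illustrated with a small base R sketch. It is not the code DetectRepeats uses internally: it swaps in a G-test (a likelihood-ratio approximation to a multinomial goodness-of-fit test), and the helper name gof_statistic, the background frequencies, and the column counts are all made up for illustration.

```r
# Hypothetical helper (not part of DECIPHER): G statistic comparing observed
# character counts in one aligned column with a background distribution.
gof_statistic <- function(observed, background) {
  stopifnot(length(observed) == length(background))
  background <- background / sum(background)  # normalize to probabilities
  expected <- sum(observed) * background      # counts expected under background
  keep <- observed > 0                        # treat 0 * log(0) as 0
  2 * sum(observed[keep] * log(observed[keep] / expected[keep]))
}

# Made-up background frequencies, standing in for those of the input sequences
background <- c(A = 0.3, C = 0.2, G = 0.2, T = 0.3)

# Made-up column across 8 repeat copies that is almost entirely "A"
conserved <- c(A = 7, C = 0, G = 1, T = 0)
# Made-up column whose composition is close to the background
diverged <- c(A = 2, C = 2, G = 2, T = 2)

g_conserved <- gof_statistic(conserved, background)
g_diverged <- gof_statistic(diverged, background)

# Approximate p-values from the chi-squared distribution
df <- length(background) - 1
p_conserved <- pchisq(g_conserved, df = df, lower.tail = FALSE)
p_diverged <- pchisq(g_diverged, df = df, lower.tail = FALSE)
```

Under these assumptions the conserved column departs much more strongly from the background than the diverged one, which is the kind of signal the statistical test is described as capturing.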
Two possible types of repeats are detectable: (1) Tandem repeats are contiguous approximate copies of a nucleotide or amino acid sequence. Once a k-mer seed is identified, repeated attempts are made to optimize the beginning and ending positions, as well as attempting to extend the repeat to the left and right. (2) Interspersed repeats are dispersed approximate copies of a nucleotide sequence on the same strand or opposite strands. These are identified with FindSynteny, aligned with AlignSynteny, and then scored using the same statistical framework as tandem repeats. In both cases, the highest scoring repeat in each region is returned, unless allScores is TRUE, in which case overlapping repeats are permitted in the result.
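The effect of allScores can be explored with a short sketch such as the following. It assumes the DECIPHER package is installed and reuses the yeastSEQCHR1 data set from the examples; the variable names best and all_hits are arbitrary.

```r
library(DECIPHER)

data(yeastSEQCHR1)
dna <- DNAStringSet(yeastSEQCHR1)

# Default: only the highest scoring repeat in each region is returned
best <- DetectRepeats(dna, verbose = FALSE)

# allScores = TRUE: overlapping repeats are permitted in the result
all_hits <- DetectRepeats(dna, allScores = TRUE, verbose = FALSE)

# Compare how many repeats each call reports
nrow(best)
nrow(all_hits)
```

Comparing the two row counts gives a rough sense of how many overlapping candidates are discarded by default.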
If type is "tandem", a data.frame giving the "Index" of the sequence in myXStringSet, "Begin" and "End" positions of tandem repeats, "Left" and "Right" positions of each repeat, and its "Score".
If type is "interspersed", a data.frame similar to the matrix in the lower diagonal of Synteny objects (see Synteny-class).
If type is "both", a list with the above two elements.
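For the tandem case, the following base R sketch shows one way to work with a result of the shape described above. The data.frame is a mock with made-up values, built only to mirror the documented columns; it assumes "Left" and "Right" are list columns holding one position per repeat copy, as the usage in the examples suggests. Real values come from DetectRepeats.

```r
# Mock result with made-up values (NOT real DetectRepeats output)
x <- data.frame(Index = c(1L, 1L),
                Begin = c(101L, 5001L),
                End = c(130L, 5047L))
x$Left <- list(c(101L, 111L, 121L), c(5001L, 5025L))   # start of each copy
x$Right <- list(c(110L, 120L, 130L), c(5024L, 5047L))  # end of each copy
x$Score <- c(25.3, 12.1)

# Number of repeat copies within each tandem repeat
copies <- lengths(x$Left)

# Average length of a repeat copy (periodicity) per tandem repeat
period <- mapply(function(l, r) mean(r - l + 1), x$Left, x$Right)

# Keep only repeats above an arbitrary, illustrative score cutoff
strong <- x[x$Score >= 20, ]
```

The cutoff of 20 is arbitrary; a suitable threshold depends on the data and on the minScore used for detection.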
Erik Wright eswright@pitt.edu
Schaper, E., et al. (2012). Repeat or not repeat?-Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research, 40(20), 10005-17.
data(yeastSEQCHR1)
dna <- DNAStringSet(yeastSEQCHR1)

x <- DetectRepeats(dna)
x

# number of tandem repeats
lengths(x[, "Left"])

# average periodicity of tandem repeats
per <- mapply(function(a, b) b - a + 1,
	x[, "Left"],
	x[, "Right"])
sapply(per, mean)

# extract a tandem repeat
i <- 1
reps <- extractAt(dna[[x[i, "Index"]]],
	IRanges(x[[i, "Left"]], x[[i, "Right"]]))
reps
reps <- AlignSeqs(reps) # align the repeats
reps
BrowseSeqs(reps)

y <- DetectRepeats(dna, type="interspersed")
y