DetectRepeats {DECIPHER}    R Documentation

Detect Repeats in a Sequence

Description

Detects approximate copies of sequence patterns that likely arose from duplication events and therefore share a common ancestor.

Usage

DetectRepeats(myXStringSet,
              type = "tandem",
              minScore = 10,
              allScores = FALSE,
              maxPeriod = 10000,
              maxFailures = 2,
              maxShifts = 5,
              alphabet = AA_REDUCED[[125]],
              processors = 1,
              verbose = TRUE,
              ...)

Arguments

myXStringSet

An AAStringSet, DNAStringSet, or RNAStringSet object of unaligned sequences.

type

Character string indicating the type of repeats to detect. This should be (an abbreviation of) one of "tandem", "interspersed", or "both". (See details section below.)
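As a minimal sketch of what "an abbreviation of" can mean in practice, the base R function match.arg() accepts any unambiguous prefix of the allowed values. The helper resolve_type below is hypothetical and only illustrates that matching rule; the source does not state how DetectRepeats resolves the argument internally.

```r
# Hypothetical helper, not part of DECIPHER: resolve an abbreviated
# 'type' using base R's match.arg(), which accepts unambiguous prefixes.
resolve_type <- function(type = c("tandem", "interspersed", "both")) {
  match.arg(type)
}

resolve_type("tan")    # "tandem"
resolve_type("inter")  # "interspersed"
resolve_type("b")      # "both"
```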

minScore

Numeric giving the minimum log-odds score of repeats in myXStringSet to report.

allScores

Logical specifying whether all repeats should be returned (TRUE) or only the top-scoring repeat when multiple matches overlap in the same region (FALSE).
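The effect of keeping only the top-scoring repeat per region can be pictured with a greedy filter over scored intervals. This is a sketch under assumptions: the helper keep_top and the toy values are hypothetical, and the source does not describe how DetectRepeats itself resolves overlaps.

```r
# Toy candidate repeats (hypothetical values), two of which overlap.
hits <- data.frame(Begin = c(10, 15, 100),
                   End   = c(60, 80, 150),
                   Score = c(25, 40, 12))

# Hypothetical helper: visit candidates from highest to lowest score and
# keep one only if it does not overlap a candidate that was already kept.
keep_top <- function(hits) {
  hits <- hits[order(hits$Score, decreasing = TRUE), ]
  kept <- hits[0, ]
  for (i in seq_len(nrow(hits))) {
    overlaps <- hits$Begin[i] <= kept$End & hits$End[i] >= kept$Begin
    if (!any(overlaps)) kept <- rbind(kept, hits[i, ])
  }
  kept[order(kept$Begin), ]
}

keep_top(hits)  # keeps the intervals starting at 15 and 100
```

With allScores = TRUE the analogous result would retain all three candidates, including the overlapping pair.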

maxPeriod

Numeric indicating the maximum periodicity of tandem repeats to consider. Only interspersed repeats that are at least maxPeriod nucleotides apart will be detected.

maxFailures

Numeric determining the maximum number of failed attempts to extend a repeat that are permitted. Values greater than zero may increase accuracy at the expense of speed, with diminishing returns as maxFailures increases.

maxShifts

Numeric determining the maximum number of failed attempts to shift a repeat left or right that are permitted. Values greater than zero may increase accuracy at the expense of speed, with diminishing returns as maxShifts increases.

alphabet

Character vector of amino acid groupings used to reduce the 20 standard amino acids into smaller groups. Alphabet reduction helps to find more distant homologies between sequences. A non-reduced amino acid alphabet can be used by setting alphabet equal to AA_STANDARD.
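To illustrate why alphabet reduction helps with distant homologies, the sketch below maps each amino acid to a representative letter for its group using base R only. The grouping shown is an arbitrary illustration, not the contents of AA_REDUCED[[125]], and reduce_alphabet is a hypothetical helper rather than a DECIPHER function.

```r
# Illustrative grouping only (an assumption); the default used by
# DetectRepeats is AA_REDUCED[[125]] from DECIPHER.
groups <- c("AGPST", "ILMV", "FWY", "DENQ", "HKR", "C")

# Hypothetical helper: replace every amino acid with the first letter
# of the group that contains it.
reduce_alphabet <- function(seqs, groups) {
  old <- paste(groups, collapse = "")
  new <- paste(strrep(substr(groups, 1, 1), nchar(groups)), collapse = "")
  chartr(old, new, seqs)
}

# Two peptides that differ at most positions in the 20-letter alphabet
# become identical after reduction.
reduce_alphabet(c("MKVLAT", "LRILSS"), groups)
# "IHIIAA" "IHIIAA"
```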

processors

The number of processors to use, or NULL to automatically detect and use all available processors.

verbose

Logical indicating whether to display progress.

...

Further arguments to be passed directly to FindSynteny if type is "interspersed" or "both".

Details

Many sequences are composed of a substantial fraction of repetitive sequence. Two main forms of repetition are tandem repeats and interspersed repeats, which can be caused by duplication events followed by divergence. Detecting duplications is challenging because repeat length and composition vary as the copies evolve. The significance of a repeat can be quantified by its time since divergence from a common ancestor. DetectRepeats uses a seed-and-extend approach to identify candidate repeats, and tests whether a set of repeats is statistically distinct from a background distribution using a goodness-of-fit multinomial test. The background distribution is derived from the input sequences (myXStringSet) and represents the distribution of characters that would be expected if the repeats had diverged infinitely long ago (see Schaper et al. (2012) for examples). The reported score is the product of statistical significance (i.e., log-odds) and average similarity among the repeats. Therefore, a high score implies that the repeats diverged from a common ancestor within a finite amount of time.
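The idea of comparing repeat composition against a background can be sketched with a log-likelihood ratio for a single aligned column. This is an illustration under assumptions, not the scoring code of DetectRepeats: the toy sequence, the column, and the variable names are all hypothetical, and the statistic shown is the standard multinomial log-likelihood ratio (half of the G statistic).

```r
lv <- c("A", "C", "G", "T")

# Background frequencies estimated from a toy input sequence (assumed data).
s  <- strsplit("ACGTTGCAACGTACGTACGTTTGACCA", "")[[1]]
bg <- as.numeric(table(factor(s, levels = lv))) / length(s)

# One aligned column across five hypothetical repeat copies.
column <- c("A", "A", "A", "G", "A")
obs    <- as.numeric(table(factor(column, levels = lv)))

# Log-likelihood ratio of the column under its own composition versus
# the background; characters with zero count contribute nothing.
nz  <- obs > 0
llr <- sum(obs[nz] * log((obs[nz] / sum(obs)) / bg[nz]))
llr
```

A column whose composition matches the background gives a value near zero, while a column dominated by one character gives a larger value, which is the intuition behind a high log-odds for well-conserved repeats.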

Two types of repeats are detectable: (1) Tandem repeats are contiguous approximate copies of a nucleotide or amino acid sequence. Once a k-mer seed is identified, repeated attempts are made to optimize the beginning and ending positions and to extend the repeat to the left and right. (2) Interspersed repeats are dispersed approximate copies of a nucleotide sequence on the same strand or opposite strands. These are identified with FindSynteny, aligned with AlignSynteny, and then scored using the same statistical framework as tandem repeats. In both cases, the highest-scoring repeat in each region is returned, unless allScores is TRUE, in which case overlapping repeats are permitted in the result.
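The seeding step for tandem repeats can be pictured as a search for k-mers that recur at a fixed distance (the period). The sketch below uses base R on a plain character string and assumes the string is longer than k; find_seeds is a hypothetical helper, and the actual seed selection, extension, and shifting in DetectRepeats are considerably more elaborate.

```r
# Hypothetical helper: report every position where a k-mer is repeated
# exactly 'period' characters downstream, for periods up to maxPeriod.
find_seeds <- function(s, k = 3, maxPeriod = 10) {
  n <- nchar(s)
  starts <- seq_len(n - k + 1)
  kmers <- substring(s, starts, starts + k - 1)
  hits <- list()
  for (p in seq_len(min(maxPeriod, n - k))) {
    i <- starts[starts + p <= n - k + 1]  # positions with a k-mer p downstream
    same <- kmers[i] == kmers[i + p]
    if (any(same)) {
      hits[[length(hits) + 1]] <- data.frame(start = i[same], period = p)
    }
  }
  do.call(rbind, hits)
}

# "ACG" occurs three times in a row, so seeds appear at period 3
# (and one at period 6, a multiple of the true period).
find_seeds("TTACGACGACGTT", k = 3)
```

Runs of consecutive seed positions sharing one period, such as starts 3 to 6 at period 3 here, mark candidate regions whose boundaries would then be refined.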

Value

If type is "tandem", a data.frame giving the "Index" of the sequence in myXStringSet, "Begin" and "End" positions of tandem repeats, "Left" and "Right" positions of each repeat, and its "Score".

If type is "interspersed", a data.frame similar to the matrix in the lower diagonal of Synteny objects (see Synteny-class).

If type is "both", a list with the above two elements.
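For the tandem case, the column layout can be pictured with a toy data.frame in which "Left" and "Right" are list columns holding one position per repeat copy. The values below are made up for illustration, and the list-column layout is inferred from the way the Examples section indexes the result; a real object comes from calling DetectRepeats.

```r
# Toy values only (assumed); not output from DetectRepeats.
x <- data.frame(Index = c(1L, 1L),
                Begin = c(101L, 500L),
                End   = c(130L, 547L),
                Score = c(18.2, 31.5))
x$Left  <- list(c(101L, 111L, 121L), c(500L, 524L))
x$Right <- list(c(110L, 120L, 130L), c(523L, 547L))
x <- x[, c("Index", "Begin", "End", "Left", "Right", "Score")]

lengths(x$Left)                  # number of copies in each tandem repeat
x$Right[[1]] - x$Left[[1]] + 1   # length of each copy in the first repeat
```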

Author(s)

Erik Wright eswright@pitt.edu

References

Schaper, E., et al. (2012). Repeat or not repeat?-Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research, 40(20), 10005-10017.

Examples

library(DECIPHER)

data(yeastSEQCHR1)
dna <- DNAStringSet(yeastSEQCHR1)

x <- DetectRepeats(dna)
x

# number of tandem repeats
lengths(x[, "Left"])

# average periodicity of tandem repeats
# (SIMPLIFY = FALSE keeps one vector per repeat, even if all have equal length)
per <- mapply(function(a, b) b - a + 1,
              x[, "Left"],
              x[, "Right"],
              SIMPLIFY = FALSE)
sapply(per, mean)

# extract the copies of a tandem repeat (assumes at least one was found)
i <- 1
reps <- extractAt(dna[[x[i, "Index"]]],
                  IRanges(x[[i, "Left"]], x[[i, "Right"]]))
reps
reps <- AlignSeqs(reps) # align the repeats
reps
BrowseSeqs(reps) # view the alignment in a web browser

# detect interspersed repeats
y <- DetectRepeats(dna, type = "interspersed")
y

[Package DECIPHER version 2.21.0]