%\VignetteIndexEntry{Efficient genome searching with Biostrings and the BSgenome data packages}
%\VignetteKeywords{BSgenome, genome, DNA, RNA, Sequence, Biostrings, Sequence alignment, SNPs} 
%\VignettePackage{BSgenome}

%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\SweaveOpts{keep.source=TRUE}
\documentclass[10pt]{article}

%\usepackage{amsmath}
%\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\usepackage{underscore}

\textwidth=6.5in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\R}{\textsf{R}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\term}[1]{\emph{#1}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rmethod}[1]{\textit{#1}}
\newcommand{\Rfunarg}[1]{\textit{#1}}

\bibliographystyle{plainnat}
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Efficient genome searching with Biostrings and the BSgenome data packages}
\author{Herv\'e Pag\`es}
\maketitle

\tableofcontents


% ---------------------------------------------------------------------------

\section{The Biostrings-based genome data packages}

The Bioconductor project provides data packages that contain the full genome
sequences of a given organism. These packages are called
{\it Biostrings-based genome data packages} because the sequences they
contain are stored in some of the basic containers defined in the
\Rpackage{Biostrings} package, like the \Rclass{DNAString},
the \Rclass{DNAStringSet} or the \Rclass{MaskedDNAString} containers.
Regardless of the particular sequence data that they contain, all the
Biostrings-based genome data packages are very similar and can be
manipulated in a consistent and easy way.
They all require the \Rpackage{BSgenome} package in order to work properly.
This package, unlike the Biostrings-based genome data packages,
is a software package that provides the infrastructure needed to
support them (this is why the Biostrings-based genome data packages
are also called {\it BSgenome data packages}).
The \Rpackage{BSgenome} package itself requires the \Rpackage{Biostrings}
package.

See the man page for the \Rfunction{available.genomes} function
(\Rfunction{?available.genomes}) for more information about how to get
the list of all the BSgenome data packages currently available in your
version of Bioconductor (you need an internet connection so that
\Rfunction{available.genomes} can query the Bioconductor package
repositories).

More genomes can be added if necessary. Note that the process of making a
BSgenome data package is not yet documented but you are welcome to ask for
help on the bioc-devel mailing list
(\url{http://bioconductor.org/docs/mailList.html}) if you need a genome
that is not yet available.



% ---------------------------------------------------------------------------

\section{Finding an arbitrary nucleotide pattern in a chromosome}

In this section we show how to find (or just count) the occurences of some
arbitrary nucleotide pattern in a chromosome. The basic tool for this is
the \Rfunction{matchPattern} (or \Rfunction{countPattern}) function
from the \Rpackage{Biostrings} package.

First we need to install and load the BSgenome data package for the
organism that we want to look at. In our case, we want to search
chromosome I of {\it Caenorhabditis elegans}.

UCSC provides several versions of the C. elegans genome: ce1, ce2 and ce4.
These versions correspond to different {\it releases} from WormBase,
which are the WS100, WS120 and WS170 releases, respectively.
See \url{http://genome.ucsc.edu/FAQ/FAQreleases#release1} for the list
of all UCSC genome releases and for the correspondance between UCSC
versions and release names.

The BSgenome data package for the ce2 genome is
\Rpackage{BSgenome.Celegans.UCSC.ce2}. Note that ce1 and ce4 are not
available in Bioconductor but they could be added if there is demand for
them.

See \Rfunction{?available.genomes} for how to install
\Rpackage{BSgenome.Celegans.UCSC.ce2}.
Then load the package and display the single object defined in it:
<<b1>>=
library(BSgenome.Celegans.UCSC.ce2)
ls("package:BSgenome.Celegans.UCSC.ce2")
genome <- BSgenome.Celegans.UCSC.ce2
genome
@

\Robject{genome} is a \Rclass{BSgenome} object:
<<b2>>=
class(genome)
@

When displayed, some basic information about the origin of the
genome is shown (organism, provider, provider version, etc...)
followed by the index of {\it single} sequences and eventually
an additional index of {\it multiple} sequences.
Methods (adequately called {\it accessor methods}) are defined
for individual access to this information:
<<b3>>=
organism(genome)
provider(genome)
providerVersion(genome)
seqnames(genome)
mseqnames(genome)
@

See the man page for the \Rclass{BSgenome} class (\Rfunction{?BSgenome})
for a complete list of accessor methods and their descriptions.

Now we are ready to display chromosome I:
<<b4>>=
genome$chrI
@

Note that this chrI sequence corresponds to the {\it forward} strand
(aka {\it direct} or {\it sense} or {\it positive} or {\it plus} strand)
of chromosome I.
UCSC, and genome providers in general, don't provide files containing the
nucleotide sequence of the {\it reverse} strand (aka {\it indirect}
or {\it antisense} or {\it negative} or {\it minus} or {\it opposite} strand)
of the chromosomes because these sequences can be deduced from the {\it forward}
sequences by taking their reverse complements.
The BSgenome data packages are no exceptions: they only
provide the {\it forward} strand sequence of every chromosome.
See \Rfunction{?reverseComplement} for more details about the reverse
complement of a \Rclass{DNAString} object.
It is important to remember that, in practice, the {\it reverse} strand
sequence is almost never needed.
The reason is that, in fact, a {\it reverse} strand analysis
can (and should) always be transposed into a {\it forward} strand analysis.
Therefore trying to compute the {\it reverse} strand sequence of an entire
chromosome by applying \Rfunction{reverseComplement} to its {\it forward}
strand sequence is almost always a bad idea.
See the {\it Finding an arbitrary nucleotide pattern in an entire genome} section
of this document for how to find arbitrary patterns in the {\it reverse} strand
of a chromosome.

% It seems like this page http://www.medterms.com/script/main/art.asp?articlekey=20468
% is lying about the noncoding (or coding, they are in fact contradicting themselves)
% nature of the sense and antisense strands.

The number of bases in this sequence can be retrieved with:
<<b5>>=
chrI <- genome$chrI
length(chrI)
@

Some basic stats:
<<b6>>=
afI <- alphabetFrequency(chrI)
afI
sum(afI) == length(chrI)
@

Count all {\it exact} matches of pattern \Robject{"ACCCAGGGC"}:
<<b7>>=
p1 <- "ACCCAGGGC"
countPattern(p1, chrI)
@

Like most pattern matching functions in \Rpackage{Biostrings},
the \Rfunction{countPattern} and \Rfunction{matchPattern} functions
support {\it inexact} matching. One form of inexact matching is to
allow a few mismatching letters per match. Here we allow at most one:
<<b8>>=
countPattern(p1, chrI, max.mismatch=1)
@

With the \Rfunction{matchPattern} function, the locations of the matches are
stored in an \Rclass{XStringViews} object:
<<b9>>=
m1 <- matchPattern(p1, chrI, max.mismatch=1)
m1[4:6]
class(m1)
@

The \Rfunction{mismatch} function (new in \Rpackage{Biostrings}~2)
returns the positions of the mismatching letters for each match:
<<b10>>=
mismatch(p1, m1[4:6])
@

Note: The \Rfunction{mismatch} method is in fact a particular case
of a (vectorized) {\it alignment} function where only ``replacements''
are allowed. Current implementation is slow but this will be addressed.

It may happen that a match is {\it out of limits} like in this example:
<<b11>>=
p2 <- DNAString("AAGCCTAAGCCTAAGCCTAA")
m2 <- matchPattern(p2, chrI, max.mismatch=2)
m2[1:4]
p2 == m2[1:4]
mismatch(p2, m2[1:4])
@

The list of exact matches and the list of inexact matches
can both be obtained with:
<<b12,results=hide>>=
m2[p2 == m2]
m2[p2 != m2]
@

Note that the length of \Robject{m2[p2 == m2]} should be
equal to \Robject{countPattern(p2, chrI, max.mismatch=0)}.



% ---------------------------------------------------------------------------

\section{Finding an arbitrary nucleotide pattern in an entire genome}

Now we want to extend our analysis to the {\it forward} and {\it reverse}
strands of all the C. elegans chromosomes.
More precisely, here is the analysis we want to perform:

\begin{itemize}

  \item{The input dictionary: }
  Our input is a dictionary of 50 patterns. Each pattern is a short nucleotide
  sequence of 15 to 25 bases (As, Cs, Gs and Ts only, no Ns).
  It is stored in a FASTA file called \Robject{"ce2dict0.fa"}.
% ce2dict0.fa was generated with
%   > set.seed(23)
%   > ce2dict0 <- sapply(1:50, function(i) { x <- genome[[sample(seqnames(genome), 1)]]; start <- as.integer(runif(1, min=1, length(x)-20)); end <- start + as.integer(runif(1, min=14, max=25)); x <- subXString(x, start, end); if (sample(1:2, 1) == 2) x <- reverseComplement(x); as.character(x) })
%   > names(ce2dict0) <- paste("pattern", 1:50, sep="")
%   > write.XStringViews(XStringViews(ce2dict0, "DNAString"), file="ce2dict0.fa", format="fasta")
  See the {\it Finding all the patterns of a constant width dictionary
  in an entire genome} section of this document for a very efficient way
  to deal with the special case where all the patterns in the input
  dictionary have the same length.

  \item{The target: }
  Our target (or subject) is the {\it forward} and {\it reverse} strands of
  the seven C. elegans chromosomes (14 sequences in total).
  We want to find and report all occurences (or hits) of every pattern
  in the target. Note that a given pattern can have 0, 1 or several hits
  in 0, 1 or 2 strands of 0, 1 or several chromosomes.

  \item{Exact or inexact matching? }
  We are interested in exact matches only (for now).

  \item{The output: }
  We want to put the results of this analysis in a file so we can send
  it to our collaborators for some post analysis work.
  Our collaborators are not necessarily familiar with R or Bioconductor
  so dumping a high-level R object (like a list or a data frame) into an
  .rda file is not an option. For maximum portability (one of our
  collaborators wants to use Microsoft Excel for the post analysis) we
  choose to put our results in a tabulated file where one line describes
  one hit. The columns (or fields) of this file will be (in this order):
  \begin{itemize}
    \item{seqname: }
    the name of the chromosome where the hit occurs.
    \item{start: }
    an integer giving the starting position of the hit.
    \item{end: }
    an integer giving the ending position of the hit.
    \item{strand: }
    a plus (\Robject{+}) for a hit in the positive strand
    or a minus (\Robject{-}) for a hit in the negative strand.
    \item{patternID: }
    we use the unique ID provided for every pattern in the
    \Robject{"ce2dict0.fa"} file.
  \end{itemize}

\end{itemize}

Let's start by loading the input dictionary with:
<<c1>>=
ce2dict0_file <- system.file("extdata", "ce2dict0.fa", package="BSgenome")
ce2dict0 <- readDNAStringSet(ce2dict0_file, "fasta")
ce2dict0
@

Here is how we can write the functions that will perform our analysis:
<<c2>>=
writeHits <- function(seqname, matches, strand, file="", append=FALSE)
{
    if (file.exists(file) && !append)
        warning("existing file ", file, " will be overwritten with 'append=FALSE'")
    if (!file.exists(file) && append)
        warning("new file ", file, " will have no header with 'append=TRUE'")
    hits <- data.frame(seqname=rep.int(seqname, length(matches)),
                       start=start(matches),
                       end=end(matches),
                       strand=rep.int(strand, length(matches)),
                       patternID=names(matches),
                       check.names=FALSE)
    write.table(hits, file=file, append=append, quote=FALSE, sep="\t",
                row.names=FALSE, col.names=!append)
}

runAnalysis1 <- function(dict0, outfile="")
{
    library(BSgenome.Celegans.UCSC.ce2)
    genome <- BSgenome.Celegans.UCSC.ce2
    seqnames <- seqnames(genome)
    seqnames_in1string <- paste(seqnames, collapse=", ")
    cat("Target:", providerVersion(genome),
        "chromosomes", seqnames_in1string, "\n")
    append <- FALSE
    for (seqname in seqnames) {
        subject <- genome[[seqname]]
        cat(">>> Finding all hits in chromosome", seqname, "...\n")
        for (i in seq_len(length(dict0))) {
            patternID <- names(dict0)[i]
            pattern <- dict0[[i]]
            plus_matches <- matchPattern(pattern, subject)
            names(plus_matches) <- rep.int(patternID, length(plus_matches))
            writeHits(seqname, plus_matches, "+", file=outfile, append=append)
            append <- TRUE
            rcpattern <- reverseComplement(pattern)
            minus_matches <- matchPattern(rcpattern, subject)
            names(minus_matches) <- rep.int(patternID, length(minus_matches))
            writeHits(seqname, minus_matches, "-", file=outfile, append=append)
        }
        cat(">>> DONE\n")
    }
}
@

Some important notes about the implementation of the \Rfunction{runAnalysis1}
function:
\begin{itemize}
  \item{}
  \Robject{subject <- genome[[seqname]]} is the code that actually loads a
  chromosome sequence into memory.
  Using only one sequence at a time is a good practice to avoid memory
  allocation problems on a machine with a limited amount of memory.
  For example, loading all the human chromosome sequences in memory would
  require more than 3GB of memory!

  \item{}
  We have 2 nested \Robject{for} loops: the outer loop walks thru the
  target (7 chromosomes) and the inner loop walks thru the set of
  patterns. Doing the other way around would be very inefficient,
  especially with a bigger number of patterns because this would require
  to load each chromosome sequence into memory as many times as the
  number of patterns.
  \Rfunction{runAnalysis1} loads each sequence only once.

  \item{}
  We find the matches in the minus strand (\Robject{minus_matches}) by
  first taking the reverse complement of the current pattern (with
  \Robject{rcpattern <- reverseComplement(pattern)}) and NOT by
  taking the reverse complement of the current subject.
\end{itemize}

Now we are ready to run the analysis and put the results in the
\Robject{"ce2dict0_ana1.txt"} file:
<<c3>>=
runAnalysis1(ce2dict0, outfile="ce2dict0_ana1.txt")
@

Here is some very simple example of post analysis:
\begin{itemize}
  \item{}
Get the total number of hits:
<<c4>>=
hits1 <- read.table("ce2dict0_ana1.txt", header=TRUE)
nrow(hits1)
@
  \item{}
Get the number of hits per chromosome:
<<c5>>=
table(hits1$seqname)
@
  \item{}
Get the number of hits per pattern:
<<c6>>=
hits1_table <- table(hits1$patternID)
hits1_table
@
  \item{}
Get the pattern(s) with the higher number of hits:
<<c7>>=
hits1_table[hits1_table == max(hits1_table)] # pattern(s) with more hits
@
  \item{}
Get the pattern(s) with no hits:
<<c8>>=
setdiff(names(ce2dict0), hits1$patternID) # pattern(s) with no hits
@
  \item{}
And finally a function that can be used to plot the hits:
<<c9>>=
plotGenomeHits <- function(bsgenome, seqnames, hits)
{
    chrlengths <- seqlengths(bsgenome)[seqnames]
    XMAX <- max(chrlengths)
    YMAX <- length(seqnames)
    plot.new()
    plot.window(c(1, XMAX), c(0, YMAX))
    axis(1)
    axis(2, at=seq_len(length(seqnames)), labels=rev(seqnames), tick=FALSE, las=1)
    ## Plot the chromosomes
    for (i in seq_len(length(seqnames)))
        lines(c(1, chrlengths[i]), c(YMAX + 1 - i, YMAX + 1 - i), type="l")
    ## Plot the hits
    for (i in seq_len(nrow(hits))) {
        seqname <- hits$seqname[i]
        y0 <- YMAX + 1 - match(seqname, seqnames)
        if (hits$strand[i] == "+") {
            y <- y0 + 0.05
            col <- "red"
        } else {
            y <- y0 - 0.05
            col <- "blue"
        }
        lines(c(hits$start[i], hits$end[i]), c(y, y), type="l", col=col, lwd=3)
    }
}
@
Plot the hits found by \Rfunction{runAnalysis1} with:
<<c10,eval=false>>=
plotGenomeHits(genome, seqnames(genome), hits1)
@
\end{itemize}



% ---------------------------------------------------------------------------

\section{Some precautions when using \Rmethod{matchPattern}}

Improper use of \Rmethod{matchPattern} (or \Rmethod{countPattern}) can affect
performance.

If needed, the \Rmethod{matchPattern} and \Rmethod{countPattern} methods
convert their first argument (the pattern) to an object of the same class
than their second argument (the subject) before they pass it to the subroutine
that actually implements the fast search algorithm.

So if you need to reuse the same pattern a high number of times,
it's a good idea to convert it {\it before} to pass it to the
\Rmethod{matchPattern} or \Rmethod{countPattern} method.
This way the conversion is done only once:
<<d1>>=
library(hgu95av2probe)
tmpseq <- DNAStringSet(hgu95av2probe$sequence)
someStats <- function(v)
{
    GC <- DNAString("GC")
    CG <- DNAString("CG")
    sapply(seq_len(length(v)),
           function(i) {
               y <- v[[i]]
               c(alphabetFrequency(y)[1:4],
                 GC=countPattern(GC, y),
                 CG=countPattern(CG, y))
           }
    )
}
someStats(tmpseq[1:10])
@

% The above example is Raphael's use case discussed on BioC on Feb 2006.
% In Biostrings 1, the equivalent would be:
% src <- sapply(1:100,
%               function(i) {
%                 paste(sample(c("A","C","G","T"), 25, replace=TRUE),
%                 collapse="")
%               }
%        )
% tmpseq <- DNAString(src)
% someStats <- function(v)
% {
%     GC <- DNAString("GC")
%     CG <- DNAString("CG")
%     sapply(1:length(v),
%            function(i) {
%                y <- v[i]
%                c(alphabetFrequency(y)[2:5],
%                  GC=length(matchDNAPattern(GC, y)),
%                  CG=length(matchDNAPattern(CG, y)))
%            }
%     )
% }
% someStats(tmpseq[1:10])



% ---------------------------------------------------------------------------

\section{Masking the chromosome sequences}

Starting with Bioconductor 2.2, the chromosome sequences in a {\it BSgenome
data package} can have built-in masks.
Starting with Bioconductor 2.3, there can be up to 4 built-in masks per
sequence. These will always be (in this order): (1) the mask of assembly gaps,
(2) the mask of intra-contig ambiguities, (3) the mask of repeat regions that
were determined by the RepeatMasker software, and (4) the mask of repeat
regions that were determined by the Tandem Repeats Finder software (where only
repeats with period less than or equal to 12 were kept).

For a given package, all the sequences will always have the same number of
masks.

<<f1>>=
library(BSgenome.Hsapiens.UCSC.hg38.masked)
genome <- BSgenome.Hsapiens.UCSC.hg38.masked
chrY <- genome$chrY
chrY
chrM <- genome$chrM
chrM
@

The built-in masks are named consistenly across all the BSgenome data packages
available in Bioconductor:

\begin{table}[ht]
\begin{center}
\begin{tabular}{l|l|l|l}
\hline
Name  & Active by default & Short description & Long description \\
\hline
AGAPS & yes & assembly gaps
      & Masks the big N-blocks that have been placed
        between the contigs during the assembly.
        This mask is consistent with the Gap track from UCSC
        Genome Browser.\\
AMB   & yes & intra-contig ambiguities
      & Masks any IUPAC ambiguity letter that was found in the contig
        regions of the original sequence.
        Note that only As, Cs, Gs and Ts remain unmasked when the
        AGAPS and AMB masks are both active (before SNPs are eventually
        injected, see below). \\
RM    & no & RepeatMasker
      & Masks the repeat regions determined by the RepeatMasker software.
        This mask is consistent with the RepeatMasker track from
        UCSC Genome Browser. \\
TRF   & no & Tandem Repeats Finder
      & Masks the tandem repeat regions that were determined by the
        Tandem Repeats Finder software (with period of 12 or less). \\
\hline
\end{tabular}
\end{center}
\caption{The built-in masks provided by the BSgenome data packages.}
\end{table}

When displaying a masked sequence (here a \Rclass{MaskedDNAString}
object), the {\it masked width} and {\it masked ratio} are reported for
each individual mask, as well as for all the masks together, and for all
the active masks together.
The {\it masked width} is the total number of nucleotide positions
that are masked and the {\it masked ratio} is the {\it masked width}
divided by the length of the sequence.

To activate a mask, use the \Rmethod{active} replacement method
in conjonction with the \Rmethod{masks} method. For example, to
activate the RepeatMasker mask, do:
<<f2>>=
active(masks(chrY))["RM"] <- TRUE
chrY
@

As you can see, the {\it masked width} for all the active masks
together (i.e. the total number of nucleotide positions that are
masked by at least one active mask) is now the same as for the
first mask. This represents a {\it masked ratio} of about 83\%.

Now when we use a function that is {\it mask aware}, like
\Rfunction{alphabetFrequency}, the masked regions of the
input sequence are ignored:
<<f3>>=
active(masks(chrY)) <- FALSE
active(masks(chrY))["AGAPS"] <- TRUE
alphabetFrequency(unmasked(chrY))
alphabetFrequency(chrY)
@

This output indicates that, for this chromosome, the assembly gaps
correspond exactly to the regions in the sequence that were filled
with the letter N. Note that this is not always the case: sometimes
Ns, and other IUPAC ambiguity letters, can be found inside the contigs.

When coercing a \Rclass{MaskedXString} object to an \Rclass{XStringViews}
object, each non-masked region in the original sequence is converted into
a view on the sequence:
<<f4>>=
as(chrY, "XStringViews")
@

This can be used in conjonction with the \Rmethod{gaps} method to
see the gaps between the views i.e. the masked regions themselves:
<<f5,results=hide>>=
gaps(as(chrY, "XStringViews"))
@

To extract the sizes of the assembly gaps:
<<f6>>=
width(gaps(as(chrY, "XStringViews")))
@

Note that, if applied directly to \Robject{chrY}, \Rmethod{gaps}
returns a \Rclass{MaskedDNAString} object with a single mask masking
the regions that are not masked in the original object:
<<f7>>=
gaps(chrY)
alphabetFrequency(gaps(chrY))
@

In fact, for any \Rclass{MaskedDNAString} object, the following should
always be \Robject{TRUE}, whatever the masks are:
<<f8>>=
af0 <- alphabetFrequency(unmasked(chrY))
af1 <- alphabetFrequency(chrY)
af2 <- alphabetFrequency(gaps(chrY))
all(af0 == af1 + af2)
@

With all chrY masks active:
<<f9>>=
active(masks(chrY)) <- TRUE
af1 <- alphabetFrequency(chrY)
af1
gaps(chrY)
af2 <- alphabetFrequency(gaps(chrY))
af2
all(af0 == af1 + af2)
@

Now let's compare three different ways of finding all the occurences of
the \Robject{"CANNTG"} consensus sequence in chrY. The Ns in this
pattern need to be treated as wildcards i.e. they must match any
letter in the subject.

Without the mask feature, the first way to do it would be to use
the \Robject{fixed=FALSE} option in the call to \Rfunction{matchPattern}
(or \Rfunction{countPattern}):
<<f10>>=
Ebox <- "CANNTG"
active(masks(chrY)) <- FALSE
countPattern(Ebox, chrY, fixed=FALSE)
@

The problem with this method is that the Ns in the subject
are also treated as wildcards hence the abnormally high number of
matches.
A better method is to specify the {\it side} of the matching problem
(i.e. {\it pattern} or {\it subject}) where the Ns should be treated
as wildcards:
<<f11>>=
countPattern(Ebox, chrY, fixed=c(pattern=FALSE,subject=TRUE))
@

Finally, \Rfunction{countPattern} being {\it mask aware}, this can be
achieved more efficiently by just masking the assembly gaps and
ambiguities:
<<f12>>=
active(masks(chrY))[c("AGAPS", "AMB")] <- TRUE
alphabetFrequency(chrY, baseOnly=TRUE)  # no ambiguities
countPattern(Ebox, chrY, fixed=FALSE)
@

Note that some chromosomes can have Ns outside the assembly gaps:
<<f13>>=
chr2 <- genome$chr2
active(masks(chr2))[-2] <- FALSE
alphabetFrequency(gaps(chr2))
@
so it is recommended to always keep the AMB mask active (in addition to
the AGAPS mask) whatever the sequence is.

Note that not all functions that work with an \Rclass{XString}
input are {\it mask aware} but more will be added in the near future.
However, most of the times there is a alternate way to exclude some
arbitrary regions from an analysis without having to use {\it mask aware}
functions. This is described below in the {\it Hard masking} section.



% ---------------------------------------------------------------------------

\section{Hard masking}

coming soon...



% ---------------------------------------------------------------------------
\section{Injecting known SNPs in the chromosome sequences}

coming soon...



% ---------------------------------------------------------------------------

\section{Finding all the patterns of a constant width dictionary
         in an entire genome}

The \Rfunction{matchPDict} function can be used instead of \Rfunction{matchPattern}
for the kind of analysis described in the {\it Finding an arbitrary nucleotide
pattern in an entire genome} section but it will be much faster (between 100x and
10000x faster depending on the size of the input dictionary).
Note that a current limitation of \Rfunction{matchPDict} is that it only works with
a dictionary of DNA patterns where all the patterns have the same number of
nucleotides (constant width dictionary).
See \Rfunction{?matchPDict} for more information.

Here is how our \Rfunction{runAnalysis1} function can be modified in order to use
\Rfunction{matchPDict} instead of \Rfunction{matchPattern}:
<<e1>>=
runOneStrandAnalysis <- function(dict0, bsgenome, seqnames, strand,
                                 outfile="", append=FALSE)
{
    cat("\nTarget: strand", strand, "of", providerVersion(bsgenome),
        "chromosomes", paste(seqnames, collapse=", "), "\n")
    if (strand == "-")
        dict0 <- reverseComplement(dict0)
    pdict <- PDict(dict0)
    for (seqname in seqnames) {
        subject <- bsgenome[[seqname]]
        cat(">>> Finding all hits in strand", strand, "of chromosome", seqname, "...\n")
        mindex <- matchPDict(pdict, subject)
        matches <- extractAllMatches(subject, mindex)
        writeHits(seqname, matches, strand, file=outfile, append=append)
        append <- TRUE
        cat(">>> DONE\n")
    }
}

runAnalysis2 <- function(dict0, outfile="")
{
    library(BSgenome.Celegans.UCSC.ce2)
    genome <- BSgenome.Celegans.UCSC.ce2
    seqnames <- seqnames(genome)
    runOneStrandAnalysis(dict0, genome, seqnames, "+", outfile=outfile, append=FALSE)
    runOneStrandAnalysis(dict0, genome, seqnames, "-", outfile=outfile, append=TRUE)
}
@

Remember that \Rfunction{matchPDict} only works if all the patterns in the
input dictionary have the same length so for this 2nd analysis, we will truncate
the patterns in \Robject{ce2dict0} to 15 nucleotides:
<<e2>>=
ce2dict0cw15 <- DNAStringSet(ce2dict0, end=15)
@

Now we can run this 2nd analysis and put the results in the
\Robject{"ce2dict0cw15_ana2.txt"} file:
<<e3>>=
runAnalysis2(ce2dict0cw15, outfile="ce2dict0cw15_ana2.txt")
@



% ---------------------------------------------------------------------------

\section{Session info}

<<g1>>=
sessionInfo()
@

\end{document}

