%\VignetteIndexEntry{R Investigation of NimbleGen Oligoarrays}
%\VignetteDepends{Ringo}
%\VignetteKeywords{microarray ChIP-chip NimbleGen nimblegen}
%\VignettePackage{Ringo} % name of package

%%%% HEAD SECTION: START EDITING BELOW %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\documentclass[11pt, a4paper, fleqn]{article}
\usepackage{geometry}\usepackage{color}
\definecolor{darkblue}{rgb}{0.0,0.0,0.75}
\usepackage[%
baseurl={http://www.bioconductor.org},%
pdftitle={Introduction to Ringo},%
pdfauthor={Joern Toedling},%
pdfsubject={Ringo Vignette},%
pdfkeywords={Bioconductor},%
pagebackref,bookmarks,colorlinks,linkcolor=darkblue,citecolor=darkblue,%
filecolor=darkblue,urlcolor=darkblue,pagecolor=darkblue,%
raiselinks,plainpages,pdftex]{hyperref}

\usepackage{verbatim} % for multi-line comments
\usepackage{amsmath,a4,t1enc, graphicx}
\usepackage{natbib}
\bibpunct{(}{)}{;}{a}{,}{,}

\parindent0mm
\parskip2ex plus0.5ex minus0.3ex

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}

\newcommand{\myincfig}[3]{%
  \begin{figure}[h!tb]
    \begin{center}
      \includegraphics[width=#2]{#1}
      \caption{\label{#1}\textit{#3}}
    \end{center}
  \end{figure}
}

\addtolength{\textwidth}{2cm}
\addtolength{\oddsidemargin}{-1cm}
\addtolength{\evensidemargin}{-1cm}
\addtolength{\textheight}{2cm}
\addtolength{\topmargin}{-1cm}
\addtolength{\skip\footins}{1cm}


%%%%%%% START EDITING HERE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}

\SweaveOpts{eps=false} % produce no 'eps' figures

\title{Ringo - R Investigation of NimbleGen Oligoarrays}
\author{Joern Toedling}
\date{}
\maketitle

<<prepare, echo=FALSE>>=
options(length=60)
set.seed(123)
@ 

\section{Introduction}

The package \Rpackage{Ringo} deals with the analysis of two-color
oligonucleotide microarrays used in ChIP-chip projects. 
The package was started to facilitate the analysis of
two-color microarrays from the company 
NimbleGen\footnote{for NimbleGen one-color microarrays,
we recommend the Bioconductor package \Rpackage{oligo}},
but the package has a modular design,
such that the platform-specific functionality is encapsulated
and analogous two-color tiling array platforms can also be
processed.
The package employs functions from other
packages of the Bioconductor project \citep{bioconductor}
and provides additional
ChIP-chip-specific and NimbleGen-specific functionalities.

<<loadpackage, results=hide>>=
library("Ringo")
@ 

If you use Ringo for analyzing your data, please cite: \nocite{Ringo2007}
\begin{itemize}
\item{Joern Toedling, Oleg Sklyar, Tammo Krueger, Jenny J Fischer, Silke Sperling, Wolfgang Huber (2007). Ringo - an R/Bioconductor package for analyzing ChIP-chip readouts. \textsl{BMC Bioinformatics}, 8:221.}
\end{itemize}

\subsection*{Getting help}

If possible, please send questions about \Rpackage{Ringo} to the 
Bioconductor mailing list.\\
See \url{http://www.bioconductor.org/docs/mailList.html} \\
Their archive of questions and responses may prove helpful, too.


\section{Reading in the raw data}

For each microarray, the scanning output consists of two files, one holding
the Cy3 intensities, the other one the Cy5 intensities. These files are 
tab-delimited text files. 

The package comes with (shortened) example scanner output files,
in NimbleGen's \emph{pair} format.
These files are excerpts of the ChIP-chip demo data that NimbleGen 
provide at their FTP site for free download.
Their biological context, identification of DNA binding sites
of complexes containing Suz12 in human cells, 
has been described before \citep{Squazzo2006}.

<<locateData>>=
exDir <- system.file("exData",package="Ringo")
list.files(exDir, pattern="pair.txt")
head(read.delim(file.path(exDir,"MOD_20551_PMT1_pair.txt"), skip=1))[,c(1,4:7,9)]
@

In addition, there is a text file that holds details on the samples, 
including which two \emph{pair} files belong to which 
sample\footnote{You may have to construct such a targets file for your own 
data. The \texttt{scripts} directory of this package contains a script
\texttt{convertSampleKeyTxt.R} as an inspiration how the file 
\texttt{SampleKey.txt} provided by NimbleGen could be used for this.}.

<<exampleFilesTxt>>=
read.delim(file.path(exDir,"example_targets.txt"), header=TRUE) 
@ 

The columns \texttt{FileNameCy3} and \texttt{FileNameCy5} hold which of the
raw data files belong to which sample. 
The immuno-precipitated extract 
was colored with the Cy5 dye in the experiment, so the column \texttt{Cy5} 
essentially holds which antibody has been used for the immuno-precipitation, 
in this case one against the protein \texttt{Suz12}.

Furthermore, there is a file describing the reporter categories
on the array (you might know these Spot Types files from \Rpackage{limma}
\citep{limma05}).

<<spottypes>>=
read.delim(file.path(exDir,"spottypes.txt"), header=TRUE) 
@ 

Reading all these files, we can read in the raw reporter intensities
and obtain an object of class \Rclass{RGList}, a class defined
in package \Rpackage{limma}.

<<readNimblegen, results=hide>>=
exRG <- readNimblegen("example_targets.txt","spottypes.txt",path=exDir)
@ 

This object is essentially a list and contains the raw intensities of the two 
hybridizations for the red and green channel plus information on the reporters on the array
and on the analyzed samples.

<<showRG>>=
head(exRG$R)
head(exRG$G)
head(exRG$genes)
exRG$targets
@ 

Users can alternatively supply raw two-color ChIP-chip readouts
from other platforms in \Rclass{RGList} format and consecutively use
\Rpackage{Ringo} to analyze that data.


\section{Mapping reporters to genomic coordinates}

By \emph{reporters}, we mean the oligo-nucleotides or PCR products
that have been fixated on the array for measuring the abundance of 
corresponding genomic fragments in the ChIP-chip experiment.

Each reporter has a unique identifier and (ideally) a unique sequence,
but can, and probably does, appear in multiple copies as \emph{features}
on the array surface.

A mapping of reporters to genomic coordinates is usually provided
by the array manufacturer, such as in NimbleGen's *.POS files. 
If the reporter sequences are provided as well, 
you may consider to perform a custom
mapping of these sequences to the genome of interest, using alignment
tools such as \emph{Exonerate} \citep{Slater2005}
or functions provided by the Bioconductor package
\Rpackage{Biostrings} \citep{Biostrings}.

Such a re-mapping of reporters to the genome can sometimes be
necessary, for example when the array has designed on
an outdated assembly of the genome.
Re-mapping also provides the advantage that you can allow non-perfect
matches of reporters to the genome, if desired. 

Once reporters have been mapped to the genome, this mapping needs to
be made available to the data analysis functions.
While a \Rclass{data.frame} may be an obvious way of representing such a
mapping, repeatedly extracting sub-sets of the data frame related to a
genomic region of interest turns out to be too slow for practical purposes.
\Rpackage{Ringo}, similar to the Bioconductor package \Rpackage{tilingArray},
employs an object of class \Rclass{probeAnno} to store the mapping between 
reporters on the microarray and genomic positions.
Per chromosome, the object holds four vectors of equal length and ordering
that specify at which genomic positions reporter matches start and end, 
what identifiers or indices these reporters have in the intensities data,
and whether these reporters match uniquely to the genomic positions.

<<loadProbeAnno>>=
load(file.path(exDir,"exampleProbeAnno.rda"))
ls(exProbeAnno)
show(exProbeAnno)
head(exProbeAnno["9.start"])
head(exProbeAnno["9.end"])
@ 

The function \Rfunction{posToProbeAnno} allows generation
of a valid \Rclass{probeAnno} object, 
either from a file that corresponds to a NimbleGen \texttt{POS} file
or from a \Rclass{data.frame} objects that holds the same information.
The package's \texttt{scripts} directory contains 
the script \texttt{mapReportersWithBiostrings.R}, which shows how to
use \Rpackage{Biostrings} for mapping the 
reporter sequences of the provided example data. and some
Perl scripts that allow the conversion of multiple output files from
common alignment tools such as Exonerate into one file that corresponds
to a POS file.
The function \Rfunction{validObject} can be used to perform a
quick check whether a generated \Rclass{probeAnno} object will probably
work with  other \Rpackage{Ringo} functions.


\section{Quality assessment}

The \Rfunction{image} function allows us to look at the spatial 
distribution of the intensities on a chip. This can be useful to
detect obvious artifacts on the array, such as scratches, bright
spots, finger prints etc. that might render parts or all of the
readouts useless. 

<<imageRG0, eval=FALSE>>=
par(mar=c(0.01,0.01,0.01,0.01), bg="black")
image(exRG, 1, channel="green", mycols=c("black","green4","springgreen"))
@ 

<<imageRG,eval=TRUE,results=hide,echo=FALSE>>=
jpeg("Ringo-imageRG.jpg", quality=100, height=400, width=360)
par(mar=c(0.01,0.01,0.01,0.01), bg="black")
image(exRG, 1, channel="green", mycols=c("black","green4","springgreen"))
dev.off()
@

\myincfig{Ringo-imageRG}{0.6\textwidth}{Spatial distribution of raw
reporter intensities laid out by the reporter position on the
microarray surface.}

See figure \ref{Ringo-imageRG} for the image.
Since the provided example data set only holds the intensities for reporters
mapped to the forward strand of chromosome 9,
the image only shows the few green dots of these reporters' positions.
We see, however, that these chromosome 9 reporters are well distributed
over the whole array surface rather than being clustered together in one
part of the array.

It may also be useful to look at the absolute distribution of 
the single-channel densities. \Rpackage{limma}'s function
\Rfunction{plotDensities} may be useful for this purpose.
%
<<plotDensities, fig=TRUE, include=TRUE, width=6, height=4>>=
plotDensities(exRG)
@
%
In addition, the data file loaded above also contains a 
\emph{GFF (General Feature Format)}
file of all transcripts on human chromosome 9 annotated in the 
\href{http://www.ensembl.org}{Ensembl} database 
(release 46, August 2007).
The script \texttt{retrieveGenomicFeatureAnnotation.R} in the 
package's scripts directory contains
example source code showing how the Bioconductor package
\Rpackage{biomaRt} can be used to generate such an annotated
genome features \Rclass{data.frame}.

<<showGFF>>=
head(exGFF[,c("name","symbol","chr","strand","start","end")])
@ 

To assess the impact of the small distance between reporters 
on the data, one can look at the autocorrelation plot.
For each base-pair lag $d$, it is assessed how strong the intensities
of reporters at genomic positions $x+d$ are correlated with the probe
intensities at positions $x$.

The computed correlation is plotted against the lag $d$.

<<autocorRG0, results=hide, fig=TRUE, include=TRUE, width=6, height=4>>=
exAc <- autocor(exRG, probeAnno=exProbeAnno, chrom="9", lag.max=1000)
plot(exAc)
@ 

We see some auto-correlation between probe position up to 800 base pairs
apart. Since the sonicated fragments that are hybridized to the
array have an average size in the range of up to 1000 bp, 
such a degree of auto-correlation up to this distance can be expected.


\section{Preprocessing}

Following quality assessment of the raw data,
we perform normalization of the probe intensities
and derive fold changes of reporters' intensities in the enriched sample
divided by their intensities in the non-enriched \emph{input} sample
and take the (generalized) logarithm of these ratios.

We use the variance-stabilizing normalization \citep{HuberVSN}
or probe intensities and generate an \texttt{ExpressionSet} object
of the normalized probe levels.

<<preprocess, eval=FALSE>>=
exampleX <- preprocess(exRG)
sampleNames(exampleX) <- 
 with(exRG$targets, paste(Cy5,"vs",Cy3,sep="_"))
print(exampleX)
@ 
<<loadExampleX, echo=FALSE>>=
load(file.path(exDir,"exampleX.rda"))
print(exampleX)
@ 

Among the provided alternative preprocessing options is also the
Tukey-biweight scaling procedure that NimbleGen have used to scale ChIP-chip
readouts so that the data is centered on zero.

<<preprocessNG, results=hide>>=
exampleX.NG <- preprocess(exRG, method="nimblegen")
sampleNames(exampleX.NG) <- sampleNames(exampleX)
@ 

The effects of different preprocessing procedures on the data, can be
assessed using the \Rfunction{corPlot} function.

<<comparePreprocessings, results=hide, fig=TRUE, include=TRUE, width=5, height=5>>=
corPlot(cbind(exprs(exampleX),exprs(exampleX.NG)),
        grouping=c("VSN normalized","Tukey-biweight scaled"))
@ 

The same function can also be used to assess the correlation between biological 
and technical replicates among the microarray samples.


\section{Visualize intensities along the chromosome}

<<chipAlongChrom0, eval=FALSE>>=
chipAlongChrom(exampleX, chrom="9", xlim=c(34318000,34321000), ylim=c(-2,4), probeAnno=exProbeAnno, gff=exGFF)
@ 
<<chipAlongChrom, echo=FALSE, fig=TRUE, include=FALSE, width=9.6, height=4.8, results=hide>>=
par(mar=c(2.5,4.2,4,1.5), font.lab=2)
chipAlongChrom(exampleX, chrom="9", xlim=c(34318000,34321000), ylim=c(-2,4), probeAnno=exProbeAnno, gff=exGFF)
@ 

\myincfig{Ringo-chipAlongChrom}{0.98\textwidth}{Normalized probe intensities around the TSS of the \texttt{Nudt2} gene.}

See the result in figure \ref{Ringo-chipAlongChrom}.


\section{Smoothing of probe intensities}

Since the response of reporters to the same amount of hybridized genome material
varies greatly, due to probe GC content, melting temperature, 
secondary structure etc., it is suggested to do a smoothing over individual 
probe intensities before looking for ChIP-enriched regions.

Here, we slide a window of 800 bp width along the chromosome and replace the
intensity at e genomic position $x_0$ by the median over the intensities
of those reporters inside the window  that is centered at $x_0$.

<<smoothing, results=hide>>=
smoothX <- computeRunningMedians(exampleX, probeAnno=exProbeAnno,
modColumn = "Cy5", allChr = "9", winHalfSize = 400)
sampleNames(smoothX) <- paste(sampleNames(exampleX),"smoothed")
@ 
<<smoothAlongChrom0, eval=FALSE>>=
chipAlongChrom(exampleX, chrom="9", xlim=c(34318000,34321000), ylim=c(-2,4), probeAnno=exProbeAnno, gff=exGFF)
chipAlongChrom(smoothX, chrom="9", xlim=c(34318000,34321000), probeAnno=exProbeAnno, itype="l", ilwd=4, paletteName="Spectral", add=TRUE)
@ 
%
<<smoothAlongChrom, echo=FALSE, fig=TRUE, include=FALSE, width=9.6, height=4.8, results=hide>>=
#jpeg("Ringo-smoothAlongChrom.jpg", quality=100, width=960, height=480)
par(mar=c(2.5,4.2,4,1.5), font.lab=2)
chipAlongChrom(exampleX, chrom="9", xlim=c(34318000,34321000), ylim=c(-2,4), probeAnno=exProbeAnno, gff=exGFF)
chipAlongChrom(smoothX, chrom="9", xlim=c(34318000,34321000), probeAnno=exProbeAnno, itype="l", ilwd=4, paletteName="Spectral", add=TRUE)
#dev.off()
@ 
%
\myincfig{Ringo-smoothAlongChrom}{0.98\textwidth}{Normalized and smoothed probe intensities around the TSS of the \texttt{Nudt2} gene.}

See the smoothed probe levels in figure \ref{Ringo-smoothAlongChrom}.


\section{Finding ChIP-enriched regions}

To identify antibody-enriched genomic regions, we require the following:

\begin{itemize}
\item smoothed intensities of reporters mapped to this region are exceed a certain threshold $y_0$
\item the region contains at least three probe match positions
\item each affected position is less than a defined maximum distance $d_{max}$ 
apart from another affected position in the region (we require a certain
probe spacing to have confidence in detected peaks\footnote{Note
that the term ''peak'', while commonly used in ChIP-chip context,
is slightly misleading and the term "ChIP-enriched region", 
or "cher" in shorthand,
is more appropriate. Within such regions the actual signal could show two
or more actual signal peaks or none at all (long plateau).})
\end{itemize}

For setting the threshold $y_0$, one has to assess the 
expected (smoothed) probe levels in non-enriched genomic regions,
i.e. the \emph{null distribution} of probe levels.
In a perfect world, we could use a log ratio of $0$ as definite cut-off.
In this case the ``enriched'' DNA and the input DNA sample would be
present in equal amounts, so no antibody-bound epitope, 
could be found at this genomic site.
In practice, there are some reasons why zero may be a too naive cut-off
for calling a probe-hit genomic site \emph{enriched} in our case.
See \citet{BourgonPhD} for an extensive discussion on problematic issues
with ChIP-chip experiments. We will just briefly mention a few issues
here.
For once, during the immuno-precipitation, some non-antibody-bound
regions may be pulled down in the assay and consequently enriched or
some enriched DNA may cross-hybridize to other reporters. 
Furthermore, since genomic fragments after sonication are mostly a lot
larger than the genomic distance between two probe-matched genomic
positions, auto-correlation between reporters certainly is existent. 
Importantly, different reporters measure the same DNA amount with a
different efficiency even after normalizing the probe levels, due to 
sequence properties of the probe, varying quality of the synthesis of
reporters on the array and other reasons.
To ameliorate this fact, we employ the sliding-window smoothing approach.

The aforementioned issues make it difficult to come up with a reasonable 
estimate for the null distribution of smoothed probe levels 
in non-enriched genomic regions. 
See Figure \ref{Ringo-histogramSmoothed} for the two histograms.
We present one way (out of many) for objectively choosing the cutoff $y_0$.
The histograms suggest the smoothed reporter levels follow a mixture of two 
distributions, one being the null distribution of non-affected reporters and
the other one the alternative one for the smoothed reporter values in
H3K4me3-ChIP enriched regions.
We assume the null distribution is symmetric and its mode is the one close to
zero in the histogram. By mirroring its part left of the mode over the
mode, we end up with an estimated null distribution.
For the alternative distribution, we only assume that it is stochastically
larger than the null distribution and that its mass to the left of the
estimated mode of the null distribution is negligible.
We estimate an upper bound $y_0$ for values arising from the null
distribution and conclude that smoothed probe levels $y > y_0$ 
are more likely to arise from the H3K4me3 enrichment distribution
than from the null distribution.
These estimates are indicated by red vertical lines in the histograms.

<<setY0>>=
(y0 <- apply(exprs(smoothX),2,upperBoundNull))
@ 
<<histogramSmoothed0, eval=FALSE, results=hide>>=
par(font.lab=2, mar=c(4,4,1,1))
hist(exprs(smoothX)[,1], n=25, xlab="Smoothed reporter intensities [log]", xlim=c(-2,2), main=NA, col=brewer.pal(8,"Set2")[1])
abline(v=y0[,1], col="red")
@ 

<<histogramSmoothed, echo=FALSE, fig=TRUE, include=FALSE, width=7, height=5, results=hide>>=
#pdf(file="Ringo-histogramSmoothed.pdf", width=7, height=5)
par(font.lab=2, mar=c(4,4,1,1))
h1 <- hist(exprs(smoothX)[,1], n=25, xlab="Smoothed reporter intensities [log]", xlim=c(-3,5), main=NA, col=brewer.pal(8,"Set2")[1])
abline(v=y0[1], col="red")
#dev.off()
@ 
%
\myincfig{Ringo-histogramSmoothed}{0.7\textwidth}{Histograms of reporter intensities after smoothing of reporter level. The red vertical line is the cutoff values suggested by the histogram.}
%
Since antibodies vary in their efficiency to bind to their target epitope,
we suggest to obtain a different threshold for each antibody. In the example
data, however, we have only one antibody against \texttt{Suz12}.

While this threshold worked well for us, 
we do not claim this way to be a gold standard for determining the
threshold. 
In particular, it does not take into account the auto-correlation between
near-by reporters. See \citet{BourgonPhD} for a more sophisticated algorithm
that does take it into account.

<<cherFinding, results=hide>>=
chersX <- findChersOnSmoothed(smoothX, probeAnno=exProbeAnno, thresholds=y0, allChr="9", distCutOff=600, cellType="human")
chersX <- relateChers(chersX, exGFF)
chersXD <- as.data.frame.cherList(chersX)
@ 
<<showChers>>=
chersXD[order(chersXD$maxLevel, decreasing=TRUE),]
@ 

Note that in \Rpackage{Ringo} functions, 
``ChIP-enriched region'' is abbreviated to ``cher''.
 
One characteristic of enriched regions that can be used for 
sorting them is the element \texttt{maxLevel}, that is the highest smoothed
probe level in the enriched region. 
Alternatively, one can sort by the \texttt{score},
that is the sum of smoothed probe levels minus the threshold. It is a 
discretized version of to the area under the curve with the baseline being the 
threshold. 

<<plotCher0, eval=FALSE, results=hide>>=
plot(chersX[[1]], smoothX, probeAnno=exProbeAnno, gff=exGFF, paletteName="Spectral")
@ 
<<plotCher, echo=FALSE, fig=TRUE, include=FALSE, width=9.6, height=4.8, results=hide>>=
#jpeg("Ringo-plotCher.jpg", quality=100, width=960, height=480)
par(mar=c(2.5,4.2,4,1.5), font.lab=2)
plot(chersX[[1]], smoothX, probeAnno=exProbeAnno, gff=exGFF, paletteName="Spectral")
#dev.off()
@ 
%
\myincfig{Ringo-plotCher}{0.98\textwidth}{One of the identified Suz12-antibody
enriched regions on chromosome 9.}

Figure \ref{Ringo-plotCher} displays an identified enriched region, 
which is located upstream of the \texttt{Nudt2} gene. 
This ChIP-enriched region was already obvious in plots of the 
normalized data (see Figure \ref{Ringo-smoothAlongChrom}).
While it is reassuring that our method recovers it as well, a number
of other approaches would undoubtedly have reported it as well.


\section{Concluding Remarks}

The package \Rpackage{Ringo} aims to facilitate the analysis 
ChIP-chip readouts.
We constructed it during the analysis of a ChIP-chip experiment for the 
genome-wide identification of modified histone sites on data
gained from NimbleGen two-color microarrays. 
Analogous two-color microarray platforms, however, can also be processed.
Key functionalities of \Rpackage{Ringo} are data read-in,
quality assessment, preprocessing of the raw data,
and visualization of the raw and preprocessed data.
The package contains a heuristic algorithm for the detection of
for ChIP-enriched genomic regions, too. 
While this algorithm worked quite well on our data, 
we do not claim it to be the definite algorithm for that task.

This vignette was generated using the following package versions:

<<sessionInfo, echo=FALSE, results=tex>>=
toLatex(sessionInfo())
@

\clearpage
\small
\section*{Acknowledgments}

I thank Wolfgang Huber Oleg Sklyar, Tammo Kr\"uger, Richard Bourgon,
and Matt Ritchie for source code contributions to
and lots of helpful suggestions on Ringo, 
Todd Richmond and NimbleGen Systems, Inc. for 
providing us with the example ChIP-chip data.\\
This work was supported by the
European Union (FP6 HeartRepair, LSHM-CT-2005-018630).


%%% BIBLIOGRAPHY STARTS HERE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{abbrvnat}
\bibliography{Ringo-Bibliography.bib}

\end{document}
