%\VignetteIndexEntry{Main vignette: End-to-end analysis of cell-based screens}
%\VignetteKeywords{Cell based assays}
%\VignettePackage{cellHTS2}

\documentclass[11pt]{article}
\usepackage{amsmath}
\usepackage{color}
\definecolor{darkblue}{rgb}{0.0,0.0,0.75}
\usepackage[%
baseurl={http://www.bioconductor.org},%
pdftitle={End-to-end analysis of cell-based screens},%
pdfauthor={Wolfgang Huber},%
pdfsubject={cellHTS2},%
pdfkeywords={Bioconductor},%
pagebackref,bookmarks,colorlinks,linkcolor=darkblue,citecolor=darkblue,%
pagecolor=darkblue,raiselinks,plainpages,pdftex]{hyperref}

\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} 

\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{\textit{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}

\newcommand{\myincfig}[3]{%
  \begin{figure}[tp]
    \begin{center}
      \includegraphics[width=#2]{#1}
      \caption{\label{#1}#3}
    \end{center}
  \end{figure}
}

\begin{document}

%------------------------------------------------------------
\title{End-to-end analysis of cell-based screens: from raw intensity readings
  to the annotated hit list (Short version)}

%------------------------------------------------------------
\author{Michael Boutros, L\'igia Br\'as and Wolfgang Huber}
\maketitle
\tableofcontents

\section{Introduction}\label{sec:Intro}
The package \Rpackage{cellHTS2} is a revised and improved version of 
the \Rpackage{cellHTS} package~\footnote{To convert a S3 class 
\Rclass{cellHTS} object that was made using the \Rpackage{cellHTS} package 
into an S4 class \Rclass{cellHTS} object suitable for \Rpackage{cellHTS2}, 
please see the function \Rfunction{convertOldCellHTS}.}.

This document is a short version of the technical report
\emph{End-to-end analysis of cell-based screens: from raw intensity
readings to the annotated hit list}, which accompanied the paper
\textit{Analysis of cell-based RNAi screens} by Michael Boutros,
L\'igia Br\'as and Wolfgang Huber~\cite{Boutros2006}.  The main reason
for having this short version is the computation time required by the
full version and the limited computational resources on the
Bioconductor repository for dynamically generated vignettes.  
The PDF file for the complete vignette is also proveded in the \textit{doc}
directory of the \Rpackage{cellHTS2} package, and can be manually compiled 
from the source file \texttt{cellhts2Complete.Rnw} in
the \texttt{scripts} directory.

%You can find the complete report as a zipped PDF file in the \textit{doc}
%directory of the \Rpackage{cellHTS2} package, or manually compile it
%from the source file \texttt{cellhts2Complete.Rnw} in
%the \texttt{scripts} directory.

The complete report includes further sections that exemplify how 
to add more annotation from public databases, how to perform an 
analysis for Gene Ontology categories, and it compares 
the obtained results with previously reported ones.
This document, the short version, focuses on the essential steps: 
how to get from raw intensity readings to an annotated hit list.

This document has been produced as a reproducible
document~\cite{Gentleman2004RepRes}. It contains the actual computer
instructions for the methods it describes, and these in turn produce
all results, including the figures and tables that are shown here. The
computer instructions are given in the language R, thus, in order to
reproduce the computations shown here, you will need an installation
of R together with a recent version of the package
\Rpackage{cellHTS2} and of some other add-on packages.

To reproduce the computations shown here, you do not need to type them
or copy-paste them from the PDF file; rather, you can take the file
\textit{cellhts2.Rnw} in the \textit{doc} directory of the package,
open it in a text editor, run it using the R command
\Rfunction{Sweave}, and modify it to your needs.

First, we load the package.
%
<<setup1, results=hide>>=
library("cellHTS2")
@ 
%
%------------------------------------------------------------
\section{Reading the intensity data}
\label{sec:read}
%------------------------------------------------------------
We consider a cell-based screen that was conducted in microtiter plate
format, where a library of double-stranded RNAs was used to target the
corresponding genes in cultured \textit{Drosophila} $Kc_{167}$ 
cells~\cite{Boutros2004}. Each of the wells in the plates contains either a
gene-specific probe, a control, or it can be empty. 
The experiments were done in duplicate, and the viability of the cells after
treatment was recorded by a plate reader measuring luciferase activity.
The promoter has been chosen such that the luciferase activity 
is indicative of ATP levels. Although this set of example data corresponds to a
single-channel screening assay, the \Rpackage{cellHTS2} package can also 
deal with cases where there are readings from more channels, corresponding 
to different reporters.

Usually, the measurements from each replicate and each channel 
come in individual result files. The set of available result files and 
the information about them (which plate,
which replicate, which channel) is contained in a spreadsheet, which
we call the \emph{plate list file}. This file should contain at least 
the following 
columns: \emph{Filename}, \emph{Plate} and \emph{Replicate}. 
The last two columns should be integer numbers, with values ranging from 1 to 
the maximum number of plates or replicates, respectively. 
The first few lines of an example
plate list file are shown in Table~\ref{tab:platelist}, 
while Table~\ref{tab:signalintensity} shows the first few lines from one 
of the plate result files listed in the plate list file. 

\input{cellhts2-platelist}
\input{cellhts2-signalintensity}

The first step of the analysis is to read the plate list file, to read
all the intensity files, and to assemble the data into a single R
object that is suitable for subsequent analyses.  The main component
of that object are arrays with the intensity readings of all
plates, channels, and replicates. We demonstrate the R instructions
for this step. First we define the path where the input files can be
found.
%
<<dataPath>>=
experimentName = "KcViab"
dataPath = system.file(experimentName, package="cellHTS2") 
@ 
%
In this example, the input files are in the
\Robject{\Sexpr{experimentName}} directory of the \Rpackage{cellHTS2}
package. To read your own data, modify \Robject{dataPath} to point to
the directory where they reside. We show the names of 12 files from
our example directory:
%
<<dirDataPath>>=
dataPath
rev(dir(dataPath))[1:12]
@ 
%
and read the data into the object \Robject{x}
%
<<readPlateList, results=hide>>=
x = readPlateList("Platelist.txt", 
        name=experimentName, 
        path=dataPath)
@ 
%
<<showX>>=
x
@ 
%
The plate format used in the screen (96-well or 384-well plate design) is 
automatically determined from the raw intensity files when calling the
\Rfunction{readPlateList} function.

%% Create the example table:
%% it would have been nice to use the "xtable" package but it insists on 
%%   adding row numbers, which we don't like.
<<plateFileTable, results=hide, echo=FALSE>>=
cellHTS2:::tableOutput(file.path(dataPath, "Platelist.txt"), "plate list")
cellHTS2:::tableOutput(file.path(dataPath, names(intensityFiles(x))[1]), 
              "signal intensity", header=FALSE)
@ 

%------------------------------------------------------------
\subsection{Importing intensity data files with other formats}
\label{sec:read2}
%------------------------------------------------------------

The function \Rfunction{readPlateList} has the argument 
\Robject{importFun} that can be used to provide a different import function to 
read plate result files with a format different from that shown in 
Table~\ref{tab:signalintensity}. 
For example, to import plate data files from an EnVision plate reader, set 
\Robject{importFun=getEnVisionRawData} or 
\Robject{importFun=getEnvisionCrosstalkCorrectedData} when 
calling \Rfunction{readPlateList}. Please see the help page of function 
\Rfunction{getEnVisionRawData} for an example.
Another import function (``importData.R'') is given together with 
the example data set for a enhancer-suppressor screen in the directory called 
\emph{TwoWayAssay} of this package.\\

While for the above data sets, the measurements from each replicate and channel
come in separate result files, this is not the case when measurement files 
are provided in the HTanalyst format. In this case, each output file 
contains meta-experimental data together with intensity readings 
(in a matrix-like layout) of a set of plates made for the same replicate 
or screen. Thus, there is no need to have a plate list file, and instead of 
using \Rfunction{readPlateList} we should call the function 
\Rfunction{readHTAnalystData}. Please see the help page for this function (\Robject{? readHTAnalystData}), 
where we illustrate how it can be applied to import HTAnalyst data files.


%------------------------------------------------------------
\section{The \Rclass{cellHTS} class and reports}
%------------------------------------------------------------
The basic data structure of the package is the class
\Rclass{cellHTS}, which is a container for cell-based high-throughput RNA 
interference assays (data and experimental meta-data) performed in 
multi-plate format.
This class extends the class \Rclass{NChannelSet} of the \Rpackage{Biobase} 
package~\cite{BioC2004}.

The data can be thought of as being organised in a two- or 
three-dimensional array as follows:
\begin{enumerate}
\item The first dimension corresponds to reagents (e.g. siRNAs,
  chemical compounds) that were used in the assays. For example, if
  the screen used 100 plates of 384 wells (24 columns, 16 rows), then
  the first dimension has size 38,400, and the \Rclass{cellHTS} object
  keeps track of plate ID, row, and column associated with each element.
  For historic reasons, and because we are using infrastructure that
  was developed for microarray experiments, the following terms are
  used synonymously for the elements of the first dimension: 
  \emph{reagents}, \emph{features}, \emph{probes}, \emph{genes}.
\item The second dimension corresponds to assays, including
  replicates and different experimental conditions (cell type,
  treatment, genetic background). 
  A potentially confusing terminology is that the data
  structure that annotates the second dimension is called \Robject{phenoData}
  (see below). This is because we are using infrastructure,
  namely the \Rclass{NChannelSet} class from the \Rpackage{Biobase}
  package, that uses this unfortunate term for this purpose.
  Sometimes, the elements of this second dimension are also 
  called \emph{samples}.
\item The (optional) third dimension corresponds to different
  channels (e.g. different luminescence reporters)    
\end{enumerate}

\noindent The main structures  contained in a \Rclass{cellHTS} object are: 

\begin{description}

\item[assayData] an object of class \Rclass{AssayData}, 
usually an environment containing a set of matrices of identical size. 
Each matrix represents a single channel.
In each matrix, the rows correspond to features or reporters 
(e.\,g.\ siRNAs, dsRNAs) and the colums to samples (different conditions
and/or replicates).

\item[phenoData] a dataframe (more precisely, an object of class
\Rclass{AnnotatedDataFrame}) containing information about the screens,
such as the replicate number and the type of biological assay.  It
must have the following columns in its data component:
\begin{description}
\item[replicate] a vector of integers giving the replicate number
\item[assay] a character vector giving the name of
the biological assay or condition
\end{description} 
Both of these columns have the same length as the number of 
samples in the \Robject{assayData} slot. 
The choice of name \emph{phenoData} for this structure is
unfortunate, it has nothing to do with phenotypes.

\item[featureData] a dataframe (more precisely, an object of class
\Rclass{AnnotatedDataFrame}) that contains information about the
reagents used in the experiment. There are three mandatory columns,
in addition there can be an arbitrary number of additional columns,
for example a target gene identifier. The mandatory columns are:
\begin{description}
\item[plate] integers specifying the plate number (e.\,g.\ $1, 2, \ldots$)
\item[well] alphanumeric character strings
giving the well ID within the plate (e.g. A01, B01, ..., P24).
\item[controlStatus] a factor specifying the annotation for each well
with possible levels: \emph{empty}, \emph{other}, \emph{neg},
\emph{sample}, \emph{pos}. Other levels besides \emph{pos} and
\emph{neg} may be employed for the positive and negative controls.
\end{description}

\item[experimentData] an object of class \Rclass{MIAME} 
containing descriptions of the experiment.

\end{description}

\noindent The \Rclass{cellHTS} class also includes additional slots
that are used to store the input files used to assemble the
\Rclass{cellHTS} object.  These are:

\begin{description}

\item[plateList] a dataframe containing names and metadata about
input measurement data files, plus a column named \emph{status},
of type \Rclass{character},
whose elements are the string "OK" if the data import appeared to have gone
well, and the respective error or warning message otherwise.

\item[intensityFiles] a list whose components are
copies of the imported input data files.  Its length corresponds to the
number of rows of \Robject{plateList}.

\item[plateConf] a data frame containing what was read from the
configuration file for the experiment (except the first two header
rows).

\item[screenLog] a data frame containing what was read from the
screen log file for the experiment (in case there was one).

\item[screenDesc] a character containing what was read from the 
description file of the experiment. 
\end{description}


Other slots are \Robject{batch}, 
\Robject{rowcol.effects}, \Robject{overall.effects} and \Robject{annotation}.  
For a detailed description of this class, please type 
\Robject{class ? cellHTS}.

In Section~\ref{sec:read}, we created the object \Robject{x}, which is
an instance of the \Rclass{cellHTS} class.  The measurements intensities
are stored in the slot \Robject{assayData} of \Robject{x}.  The slot
called \Robject{state} helps to keep track of the preprocessing state of our
\Rclass{cellHTS} object.  This slot can be accessed through the
\Rfunction{state}, as shown below:
%
<<see object state>>=
state(x)
@ 
%
It contains a logical vector  of length \Sexpr{length(state(x))}
representing the processing
status of the object. It should have the names "configured",
"normalized", "scored" and "annotated".  We can thus see that
\Robject{x} has not been configured, annotated, normalized or scored
yet.

These 4 main stages are explained along this vignette and can be
briefly described as follows:

\begin{description}
\item[Configuration] involves annotating the experiment with
information about the screen (e.\,g.\ title, when and how it was
performed, which organism, which library, type of assay, etc.),
annotating the measured values with information on controls and
flagging invalid measurements.  This step is covered in
Section~\ref{sec:screenconf} and is prerequisite for preprocessing the
experimental screening data (i.\,e.\ for normalization and scoring).

\item[Annotation] involves annotating the features (reagents) with
information about, for example, their target genes.  
This step, detailed in Section~\ref{sec:annotation}, is not essential for
preprocessing, but this
information can be used for the HTML quality reports generated by
\Rpackage{cellHTS2} and in further analyses (see
the complete report, as explained in Section~\ref{sec:Intro}).

\item[Normalization] involves removing systematic variations, making
measurements comparable across plates and within plates, enhancing the
biological signal and eventually transforming the data to a scale
suitable for subsequent analyses.

\item[Replicates scoring and summarization] involves standardizing the
values for each replicate and then summarizing replicate values in
order to obtain a single-value per probe (for example, a robust
$z$-score).

\end{description}

The steps of data normalization, replicate scoring and summarization
constitute the preprocessing work-flow of the screening data.  They
alter the contents of \Robject{assayData} slot and even the number of
channels of the \Rclass{cellHTS} object. 
The complete analysis project is contained in a set of
\Rclass{cellHTS} objects that reflect different preprocessing stages
and that can be shared with others and stored for subsequent
computational analyses.

The \Rpackage{cellHTS2} package offers export functions for generating
human-readable reports, which consist of linked HTML pages with tables
and plots. The final scored hit list is written as a tab-delimited
format suitable for reading by spreadsheet programs.

Returning to our example data set, we create a report using the
function \Rfunction{writeReport}. Since until now we only have
unnormalized experimental data, the \Rclass{cellHTS} object should be
given to the function as a list component named ``raw'':
 %
<<writeReport, results=hide>>=
out = writeReport(list("raw"=x), force = TRUE, outdir = "report-raw")
@ 
%
This will write the report into the directory given by the argument
\Robject{outdir}. The option \Robject{force=TRUE} tells the function
to delete and overwrite a possibly already existing directory of the
same name.  This option needs to be used with caution and is only
accepted if \Robject{outdir} is explicitely specified.  If the
argument \Robject{outdir} is not specified, the default is a
subdirectory of the current working directory with name given by
\Robject{name(x)}.

It can take a while to run this function, since it creates a large number
of graphics files. After this function has finished, the index page of
the report will be in the file indicated by the variable \Robject{out}, 
%
<<printout>>=
out
@ 
%
and you can view it by directing a web browser to that file.
%
<<browseReport1, eval=FALSE>>=
browseURL(out)
@ 

%------------------------------------------------------------
\section{Screen configuration: annotating the plate results}
\label{sec:screenconf}
%------------------------------------------------------------
\input{cellhts2-plateconfiguration} 
\input{cellhts2-screenlog} 

The next step of the analysis involves reading and processing three input files specific of the screening 
experiment:
\begin{itemize}
\item \emph{Screen description file} contains a general description of
the screen, its goal, the conditions under which it was performed,
references, and any other information that is pertinent to the
biological interpretation of the experiments. 
In \Rpackage{cellHTS2} package we provide a function that creates a template description file 
whose entries can be edited and completed by the user. Type \Robject{? templateDescriptionFile}.
This file contains the entries compliant with the \Rclass{MIAME} class and also additional fields specific for the 
\Rclass{cellHTS} class.

\item \emph{Plate configuration file} is used to annotate the measured data with
information on controls. 
The content of this file for the example data set analysed here 
is shown in Table~\ref{tab:plateconfiguration} and the expected format for this file is explained in 
Section~\ref{sec:plateconf}.

\item \emph{Screen log file} (optional) is used to flag individual measurements as invalid.
The first 5 lines of this file are shown in Table~\ref{tab:screenlog}, and the layout for the 
\emph{screen log file} is detailed in Section~\ref{sec:screenlog}.\\

\end{itemize}


To apply the information contained in these three files in our \Rclass{cellHTS} object, we call:
<<annotatePlateRes>>=
x = configure(x,
  descripFile="Description.txt", 
  confFile="Plateconf.txt", 
  logFile="Screenlog.txt", 
  path=dataPath)
@ 
%
Note that the function \Rfunction{configure}\footnote{More precisely,
\Rfunction{configure} is a method for the S4 class \Rclass{cellHTS}.}
takes \Robject{x}, the result from Section~\ref{sec:read}, as an
argument, and we then overwrite \Robject{x} with the result of
this function. If no screen log file is available for the experiment, 
the argument \Robject{logFile} of the function \Rfunction{configure} 
should be omitted.
%%
%% Create the example table for plateConf and screenLog
<<plateConfscreenLogTable, results=hide, echo=FALSE>>=
cellHTS2:::tableOutputWithHeaderRows(file.path(dataPath, "Plateconf.txt"), 
  "plate configuration", selRows=NULL)
cellHTS2:::tableOutput(file.path(dataPath, "Screenlog.txt"), 
  "screen log", selRows=1:3)
@ 


%%--------------------------------------------------
\subsection{Format of the plate configuration file}
\label{sec:plateconf}
%%--------------------------------------------------
The software expects this to be a rectangular table in a tabulator
delimited text file, with mandatory columns \emph{Plate}, 
\emph{Well}, \emph{Content}, plus two additional header lines that give the total number of wells and plates 
(see Table~\ref{tab:plateconfiguration} for an example).
The content of this file (except the two header lines) are stored in slot \Robject{plateConf} of \Robject{x}.\newline

As the name suggests, the \emph{Content} column provides the content
of each well in the plate (here referred to as the \textit{well
annotation}). Mainly, this annotation falls into four categories:
empty wells, wells targeting genes of interest, control wells, and
wells containing other things that do not fit in the previous
categories. The first two types of wells should be indicated in the
\textit{Content} column of the plate configuration file by
\textit{empty} and \textit{sample}, respectively, while the last type
of wells should be indicated by \textit{other}. The designation for
the control wells in the \textit{Content} column is more flexible. By
default, the software expects them to be indicated by \textit{pos}
(for positive controls), or \textit{neg} (for negative
controls). However, other names are allowed, given that they are
specified by the user whenever necessary (for example, when calling
the \Rfunction{writeReport} function). This versatility for the
control wells' annotation is justified by the fact that, sometimes,
multiple positive and/or negative controls can be employed in a given
screen, making it useful to give different names to the distinct
controls in the \textit{Content} column. Moreover, this versatility
might be required in multi-channel screens for which we frequently
have reporter-specific controls.

The \emph{Well} column contains the name of each well of the plate in alphanumeric format 
(in this case, \Robject{A01} to \Robject{P24}), while column \emph{Plate} gives the plate number (1, 2, ...).
These two columns are also allowed to contain regular expressions. 
In the plate configuration file, each well and plate should be covered by a rule, 
and in case of multiple definitions only the last one is considered. 
For example, in the file shown in Table~\ref{tab:plateconfiguration}, 
the rule specified by the first line after the column 
header indicates that all of the wells in each of the 57 assay plate contain ``sample''. 
However, a following rule indicate that the content of wells \Robject{A01}, \Robject{A02} and \Robject{B01} and 
\Robject{B02} differ from ``sample'', 
containing other material (in this case, ``other'' and controls).

Note that the well annotations mentioned above are used by the
software in the normalization, quality control, and gene selection
calculations. Data from wells that are annotated as \textit{empty} are
ignored, i.\,e.\ they are set to \Robject{NA}. 

Here we look at the frequency of
each well annotation in the example data:
%
<<>>=
table(wellAnno(x))
@ 
%
We can also use the function \Rfunction{configurationAsScreenPlot} to see 
the plate configuration of all of the plates as a screen plot:
%
<<configurationplot, fig=TRUE, width=9, height=9>>=
configurationAsScreenPlot(x)
@ 
The result is shown in Figure~\ref{fig:configurationplot}. This can be useful
to verify the correctness of the plate configuration when a complex set
of rules with regular expressions are used.
\begin{figure}[tp]
\begin{center}
\includegraphics[width=0.9\textwidth]{cellhts2-configurationplot}
\caption{\label{fig:configurationplot}Screen plot of the plate 
configuration.}
\end{center}
\end{figure}

A special case of well annotation is when different types of positive controls are used 
for the screening, that is \emph{enhancer} and \emph{suppressor} compounds. 
The vignette 
\textit{Analysis of screens with enhancer and suppressor controls} 
accompanying this package explains how such controls can be handled 
during the screen analysis using \Rpackage{cellHTS2} package.


%--------------------------------------------------
\subsubsection{Multiple plate configurations}
\label{sec:multPlateConfs}
%%--------------------------------------------------
Although it is good practice to use the same plate configuration for
the whole experiment, sometimes this does not work out, and there are
different parts of the experiment with different plate
configurations. 
The use of regular expressions in columns \emph{Plate} and \emph{Well} of the plate configuration file allow 
therefore to specify different configurations within and between assay plates.
The two header rows of this file ascertain that all of the plates and wells are covered in the plate configuration file.

%% Note about 'batches'
Note that in contrast to the \Rpackage{cellHTS} package, in
\Rpackage{cellHTS2} package the concept of \emph{batch} is separated
from the concept of having multiple plate configurations. So, for
example, different replicate of the same plate can be set as to belong
to different batches (even though they are required to have the same
plate configuration!) - since readouts were performed on different
days, or due to the use of different lots of reagents, etc.

Batch information (expressed in terms of integer values giving the
batch number: 1, 2, ...) can go into a particular slot called
\Robject{batch}. This is expected to be a 3D-array (\Robject{Features}
$\times$ \Robject{Samples} $\times$ \Robject{Channels}) of integer
values giving the batch number (1, 2, ...) for each plate, sample and
channel. Its first two dimensions correspond to the dimension of each
matrix stored in \Robject{assayData} slot and the third dimension
corresponds to the number of channels in \Robject{assayData} slot.
This slot should be filled in by the user in case s/he wants to take
into account this information in the analysis (for example, see
\Rfunction{normalizePlates} function, which allows to adjust the data
variance on a per-batch basis).


%%--------------------------------------------------
\subsection{Format of the screen log file}
\label{sec:screenlog}
%%--------------------------------------------------
The screen log file is a tabulator delimited file with mandatory
columns \emph{Plate}, \emph{Well}, \emph{Flag}. 
If there are multiple samples (replicates or conditions), a column 
called \emph{Sample} should also be present; a column named \emph{Channel} 
must also be provided when there are multiple channels.
In addition, it can 
contain arbitrary optional columns. 
Each row corresponds to one flagged measurement, 
identified by the plate number (and possible sample number and channel number) and 
the well identifier (alphanumeric identifier). 
The type of flag is specified in the column \emph{Flag}. Most commonly,
this will have the value ``NA'', indicating that the measurement
should be discarded and regarded as missing.\\


For those users that have been using \Rpackage{cellHTS} package and need to migrate their projects to \Rpackage{cellHTS2} package, 
we explain how this can be smoothly performed in Appendix~\ref{sec:migratingProjects}.


%------------------------------------------------------------
\section{Normalization, scoring and summarization of replicates}
\label{sec:norm}
%------------------------------------------------------------ 

The data normalization, replicates scoring and summarization functions
available in \Rpackage{cellHTS2} package work on the data stored in
the \Robject{assayData} slot of the \Rclass{cellHTS} object.
They create a copy of their input \Rclass{cellHTS} object,
with the data replaced by the transformed values. For a
list of the available functions, type \Robject{? cellHTS2}.

The preprocessing work-flow of a typical RNAi screen using
\Rpackage{cellHTS2} package is the following:

\begin{description}
\item[(a)] Per-plate normalization to remove plate and/or edge effects.
This can be done using the function \Rfunction{normalizePlates}.
\item[(b)] Scoring of measurements (for example, compute,
for each replicate, $z$-scores).
This can be done using the function \Rfunction{scoreReplicates}.
\item[(c)] Summarization of replicates (for example, take the median value).
This can be done using the \Rfunction{summarizeReplicates}.
\end{description}

Note that this work-flow is suitable for single-channel data. For
dual-channel data, further steps are required, as explained in the
vignette \textit{Analysis of multi-channel cell-based screens},
accompanying this package.

The function \Rfunction{normalizePlates} can be called to perform per-plate 
data transformation, normalization and variance adjustment. The normalization 
is performed in a plate-by-plate fashion and has three components: 
\begin{description}
\item[Data transformation] logarithmic transformation 
(optional, this can be advisable if the data are on multiplicative scale)
\item[Per-plate normalization] location adjustment for possible 
plate effects and/or possible spatial effects, using a choice of methods
that you need to adapt to your data
\item[Variance adjustment] plate-specific scale (variance) adjustment
(optional) 
\end{description}

For more details about this function and available normalization
methods, please type \Robject{? normalizePlates}.  To provide the
means to perform the above steps, \Rfunction{normalizePlates} has the
arguments \Robject{scale}, \Robject{log}, \Robject{method} and
\Robject{varianceAdjust}.

The argument \Robject{scale} allows to define the scale of the data
(``additive'' or ``multiplicative''), which will then control the
subsequent data transformation and normalization steps.  Namely, when
data are in multiplicative scale, the function allows to perform
$\log_2$ data transformation. For that we need to set the function's
argument \Robject{log} to \Robject{TRUE}. Log transformation will then
change the scale of the data to be ``additive''.

Per-plate median normalization is one of the methods available in
\Rfunction{normalizePlates} and can be chosen by setting the argument
\Robject{method="median"}.  In this case, plate effects are corrected
by dividing (if the current scale of the data is multiplicative) each
measurement by the median value across wells annotated as sample, for
each plate and replicate. If data are in additive scale, the
per-plate median values are subtracted from each plate measurement
instead.  All of the available normalization methods are described in
Appendix~\ref{sec:normalizationMethods}.

The variance of normalized intensities can be adjusted according to
argument \Robject{varianceAdjust}, as follows:

\begin{itemize}

\item \Robject{varianceAdjust="byPlate"}: per plate normalized intensities are 
divided by the per-plate median absolute deviations (MAD) in "sample" wells. 
This is done separately for each replicate and channel;

\item \Robject{varianceAdjust="byBatch"}: using the content of slot
\Robject{batch}, plates are split according to assay batches and the
individual normalized intensities in each group of plates (batch) are
divided by the per-"batch of plates" MAD values (calculated based on
"sample" wells).  This is done separately for each replicate and
channel;

\item \Robject{varianceAdjust="byExperiment"}: each normalized
measurement is divided by the overall MAD of normalized values in
wells containing "sample".  This is done separately for each replicate
and channel.

\end{itemize}

If \Robject{varianceAdjust="none"}, no variance adjustment is
performed (default).

As explained above, the parameter \Robject{method} of
\Rfunction{normalizePlates} allows to choose between different types
of per-plate normalization methods.  Returning to our example data
set, we choose to apply \emph{plate median scaling}:

\begin{eqnarray} 
x'_{ki} &=& \frac{x_{ki}}{M_i}\quad\quad\forall k,i \label{eq:normalizePlateMedian}\\
M_{i}&=&\mathop{\operatorname{median}}_{m\in\,\mbox{\scriptsize samples}} x_{mi} \label{eq:plateMedian}
\end{eqnarray}

where $x_{ki}$ is the raw intensity for the $k$-th well in the $i$-th
replicate file, and $x'_{ki}$ is the corresponding normalized intensity. 
The median is calculated across the wells annotated as \textit{sample} in 
the $i$-th result file. 
This is achieved by calling:
%
<<normalizePlateMedian>>=
xn = normalizePlates(x, 
   scale="multiplicative", 
   log=FALSE, 
   method="median", 
   varianceAdjust="none")
@ 
after which we obtain a \Rclass{cellHTS} object with the normalized 
intensities stored in the slot
\Robject{assayData}. 
In the previous call to \Rfunction{normalizePlates} function, 
we have chosen not to adjust the data variance (default behaviour of 
\Rfunction{normalizePlates}).
For example, we can use function \Rfunction{compare2cellHTS} provided with the package to check whether these 
two \Rclass{cellHTS} objects, \Robject{x} and \Robject{xn} belong to the same experiment:

<<compare cellHTs objects>>=
compare2cellHTS(x, xn)
@ 

After normalizing the data, we standardize the values for each 
replicate experiment using Equation~\eqref{eq:defz}. In this equation, 
$\hat{\mu}$ and $\hat{\sigma}$ are estimators of location and
scale of the distribution of $x'_{ki}$ taken across all plates and 
wells of a given replicate experiment. This function uses robust estimators,
namely, median and median absolute deviation (MAD). 
Moreover, it only considers the wells 
containing ``sample'' for estimating $\hat{\mu}$ and $\hat{\sigma}$.
%As the values $x'_{ki}$ were obtained using plate median
%normalization~\eqref{eq:normalizePlateMedian}, it holds that
%$\hat{\mu}=1$.  
The symbol $\pm$ indicates that we allow for either
plus or minus sign in Equation~\eqref{eq:defz}; the minus sign can be
useful in the application to an inhibitor assay, where an effect
results in a decrease of the signal and we may want to see this
represented by a large score.
This is done by calling the \Rfunction{scoreReplicates} function, 
where arguments \Robject{sign} and \Robject{method} define 
the sign and the scoring method to apply (robust $z$-scores, in this case), 
respectively:
%
<<score replicates>>=
xsc = scoreReplicates(xn, sign="-", method="zscore") 
@ 
%
After data standardization, we summarize the replicates, 
calculating a single score for each gene. 
This is performed using the \Rfunction{summarizeReplicates} function, 
where we use its argument \Robject{summary} to select the summary to apply.
One option is \emph{rms}, which corresponds to take the 
root mean square of the replicates values, and is shown in 
Equation~\eqref{eq:scoresSummary}. The chosen summary is 
taken over all the $n_{\mbox{\scriptsize rep}_k}$ replicates of probe $k$. 


\begin{eqnarray}
z_{ki} &=& \pm \frac{x'_{ki}-\hat{\mu}}{\hat{\sigma}} \label{eq:defz} \\
z_{k}  &=& \sqrt{\frac{1}{n_{\mbox{\scriptsize rep}_k}} \sum_{r=1}^{n_{\mbox{\scriptsize rep}_k}} z_{kr}^2} \label{eq:scoresSummary}.
\end{eqnarray}

Depending on the intended stringency of the analysis, other plausible
choices of summary function between replicates are the minimum, the
maximum, the mean or the median. In the first case, the analysis 
would be particularly conservative: all replicate values have to be 
high in order for $z_{k}$ to be high. 
For the cases where both sides of the distribution of $z$-score values are 
of interest, alternative summary options for the replicates are to select 
the value closest to zero (conservative approach) by setting \Robject{summary="closestToZero"} 
or the value furthest from zero 
(\Robject{summary="furthestFromZero"}).
In order to compare our results with those
obtained in the paper of Boutros \textit{et al.}~\cite{Boutros2004},
we choose to consider the mean as a summary:
%
<<summarize replicates>>=
xsc = summarizeReplicates(xsc, summary="mean") 
@ 
%
after which we obtain a \Rclass{cellHTS} object with the 
resulting single $z$-score value per probe
stored in \Robject{assayData} slot.

Boxplots of the $z$-scores for the different types of probes are shown
in Figure~\ref{cellhts2-boxplotzscore}.
%
<<boxplotzscore, fig=TRUE, include=FALSE, width=4.5, height=5.5>>=
scores = Data(xsc)
ylim = quantile(scores, c(0.001, 0.999), na.rm=TRUE)
boxplot(scores ~ wellAnno(x), 
  col="lightblue", outline=FALSE, ylim=ylim)
@ 
%
\myincfig{cellhts2-boxplotzscore}{0.5\textwidth}{Boxplots of $z$-scores 
for the different types of probes.}
%


In the \Rpackage{cellHTS2} package, we provide a further transformation of the $z$-score values to obtain the so-called \emph{calls}. This involves applying a 
sigmoidal transformation to the $z$-score values, with parameters $z_0$ and 
$\lambda$ ($>0$), given by:

\begin{equation}
y_k = \frac{1}{1+e^{-\lambda \,\left(z_k-z_0\right)}} \label{eq:sigtransf}
\end{equation}

This transformation maps the $z$-score values to the interval 
$\left[0,\,1\right]$ and is intended to expand the scale of $z$-scores with 
intermediate values and shrink the ones showing extreme values, therefore making the difference between intermediate phenotypes larger. 
The parameter $z_0$ defines the centre of the sigmoidal transformation, 
while $\lambda$ controls the smoothness of the transformation. 

This transformation can be done by calling the function
\Rfunction{scores2calls}, as shown in Figure~\ref{cellhts2-callvalues}.
%
<<callvalues, fig=TRUE, width=6, height=3>>=
y = scores2calls(xsc, z0=1.5, lambda=2)
plot(Data(xsc), Data(y), col="blue", pch=".",
   xlab="z-scores", ylab="calls", 
   main=expression(1/(1+e^{-lambda *(z-z[0])})))
@ 
\myincfig{cellhts2-callvalues}{0.5\textwidth}{A sigmoidal transformation that
can be used for obtaining call values.}
%
However, for the purpose of the present analysis, we will consider the
$z$-score values instead of the call values.

%------------------------------------------------------------
\section{Probe annotation}
\label{sec:annotation}
%------------------------------------------------------------
\input{cellhts2-geneID} 

Up to now, the assayed genes have been identified solely by the
identifiers of the plate and the well that contains the probe for
them. The \emph{annotation file} contains additional annotation, such
as the probe sequence, references to the probe sequence in public
databases, the gene name, gene ontology annotation, and so forth.
Mandatory columns of the annotation file are \textit{Plate},
\textit{Well}, and \textit{GeneID}, and it has one row for each
well. The content of the \textit{GeneID} column will be species- or
project-specific. The first 5 lines of the example file are shown in
Table~\ref{tab:geneID}, where we have associated each probe with
CG-identifiers for the genes of \textit{Drosophila melanogaster}.
%
<<geneIDs>>=
xsc = annotate(xsc, geneIDFile="GeneIDs_Dm_HFA_1.1.txt", 
                    path=dataPath)
@ 
%% Create the example table:
<<geneIDsTable, results=hide, echo=FALSE>>=
cellHTS2:::tableOutput(file.path(dataPath, "GeneIDs_Dm_HFA_1.1.txt"), 
     "gene ID", selRows = 3:6)
@ 
%
An optional column named \textit{GeneSymbol} can be included in the 
\emph{annotation file}, and its content will be displayed by the tooltips 
added to the plate plots and screen-wide 
plot in the HTML quality report (see Section~\ref{sec:report}).



%------------------------------------------------------------
\section{Report}
\label{sec:report}
%------------------------------------------------------------
We have now completed the analysis tasks: the dataset has been read, 
configured, normalized, scored, and annotated:
%
<<printxagain>>=
xsc
@
%
We can now save the scored data set to a file. 
%
<<savex>>=
save(xsc, file=paste(experimentName, ".rda", sep=""))
@ 
% 
The data set can be loaded again for subsequent analysis, or passed on
to others. To produce a comprehensive report, we can call the function
\Rfunction{writeReport} again, this time specifying in the functions
argument \Robject{cellHTSlist} the three \Rclass{cellHTS} objects:
``raw'', ``normalized'' and ``scored'' \Rclass{cellHTS} objects.
%
<<writeReport2, results=hide>>=
out = writeReport(cellHTSlist=
  list("raw"=x, "normalized"=xn, "scored"=xsc), 
  plotPlateArgs = TRUE, 
  imageScreenArgs = list(zrange=c(-4, 8), ar=1), map=TRUE,
  force = TRUE, outdir = "report-normalized") 
@ 
%
and use a web browser to view the resulting report.
%
<<browseReport2, eval=FALSE>>=
browseURL(out)
@ 
% 
The report contains a quality report for each plate, and also for the
whole screening assays.  The per-plate HTML reports display the
scatterplot between duplicated plate measurements, the histogram of
the normalized signal intensities for each replicate, and plate plots
representing, in a false color scale, the normalized values of each
replicate, and the standard deviation between replicate measurements
at each plate position.  It also reportes different measures of
agreement between replicate measurements, such as the repeatability
standard deviation between replicate plates and the Spearman
correlation coefficient between duplicates. The per-plate reports also
show the dynamic range, calculated as the ratio between the geometric
means of the positive and negative controls. These measures can also
be obtained independently from \Rfunction{writeReport} function, by
using the functions \Rfunction{getMeasureRepAgreement} and
\Rfunction{getDynamicRange} provided in \Rpackage{cellHTS2} package.
If different positive controls were specified at the configuration
step and when calling \Rfunction{writeReport}, the dynamic range is
calculated separately for the distinct positive controls, since
different positive controls might have different potencies.

The experiment-wide HTML report presents, for each replicate, the
boxplots with raw and normalized intensities for the different plates, 
and two plots for the controls: one showing the signal 
from positive and negative controls at each plate, 
and another plot displaying the distribution of the signal 
from positive and negative controls, obtained from kernel density estimates. 
The latter plot further gives the $Z'$-factor determined for each experiment 
(replicate) using the negative controls and each different type 
of positive controls~\cite{Oldenburg1999}, as a measure to quantify
the distance between their distributions. 
This measure can also be obtained by calling the function \Rfunction{getZfactor}.

The experiment-wide report also shows a screen-wide plot with the
$z$-scores in every well position of each plate. If the argument
\Robject{map} of \Rfunction{writeReport} function is set to
\Robject{TRUE}, this plot and the plate plots of the per-plate reports
contain tooltips (information pop-up boxes) dispaying the annotation
information at each position within the plates.  If the scored
\Rclass{cellHTS} object provided for \Rfunction{writeReport} is not
annotated with gene identifiers, the annotation information shown by
the tooltips is simply the well identifiers.  For an annotated
\Rclass{cellHTS} object, if an optional column called
\textit{GeneSymbol} was included in the \emph{annotation file} (see
Section~\ref{sec:annotation}), and therefore is present in
\Robject{featureData} slot of the annotated object, then its content
is used for the tooltips.  Otherwise, the content of column \emph{GeneID}
of the \Robject{featureData} slot (which can be accessed via
\Robject{geneAnno}) is considered.

The screen-wide image plot can also be produced separately using the
function \Rfunction{imageScreen} given in the \Rpackage{cellHTS2}
package. This might be useful if we want to select the best display
for our data, namely, the aspect ratio for the plot and/or the range
of $z$-score values to be mapped into the color scale. These can be
passed to the function's arguments \Robject{ar} and \Robject{zrange},
respectively. For example,

<<imageScreen, eval=FALSE, results=hide>>=
imageScreen(xsc, ar=1, zrange=c(-3,4))
@ 

It should be noted that the per-plate and per-experiment quality
reports are constructed based on the normalized \Rclass{cellHTS}
object, in case it is provided to \Rfunction{writeReport}
function. Otherwise, it uses the data of the raw \Rclass{cellHTS}
object.  The quality report produced by \Rfunction{writeReport}
function has also a link to a file called \emph{topTable.txt} that
contains the list of scored probes ordered by decreasing $z$-score
values, when the final scored \Rclass{cellHTS} object is
provided. This file has one row for each well and plate, and for the
present example data set, it has the following columns:

\begin{itemize}
\item \verb=plate= plate identifier for each feature;
\item \verb=position= gives the position of the well in the plate 
  (runs from 1 to the total number of wells in the plate); 
\item \verb=well= gives the alphanumeric well identifier for each feature;
\item \verb=score= corresponds to the summarized score calculated for each 
  probe (data stored in the scored and summarized object \Robject{xsc});
\item \verb=wellAnno= corresponds to the well annotation (as given by the 
  plate configuration file);
\item \verb=finalWellAnno= gives the final well annotation for the 
  scored values. It combines the information given in the plate 
  configuration file with the values in 
  \Robject{assayData} slot of the scored \Rclass{cellHTS} object, 
  in order to take into account the wells that have been flagged either 
  by the screen log file, or manually by the user during the analysis. 
  These flagged wells appear with the annotation \emph{flagged}.
\item \verb=raw_r1_ch1= and \verb=raw_r2_ch1= contain the raw intensities 
  for replicate 1 and replicate 2, respectively 
  (data stored in the unnormalized \Rclass{cellHTS} object \Robject{x}). 
  'ch' is an abbreviation for channel;
\item \verb=median_ch1= corresponds to the median of raw measurements 
  across replicates;
\item \verb=diff_ch1= gives the difference between replicated raw 
  measurements (only given if the number of replicates is equal to two); 
\item \verb=average_ch1= corresponds to the average between 
  replicate raw intensities (only given if the number of replicates is higher than two);
\item \verb=raw/PlateMedian_r1_ch1= and \verb=raw/PlateMedian_r2_ch1= give 
  the ratio between each raw measurement and the median intensity in 
  each plate for replicate 1 and replicate 2, respectively. The plate median 
  is determined for the raw intensities, using exclusively the wells annotated as ``sample''.
\item \verb=normalized_r1_ch1= and \verb=normalized_r2_ch1= give the normalized 
  intensities for replicate 1 and replicate 2, respectively. This corresponds 
  to the data stored in the normalized \Rclass{cellHTS} object \Robject{xn}. 
\end{itemize} 

Additionally, if any of the \Rclass{cellHTS} objects provided in the argument \Robject{cellHTSlist} to \Rfunction{writeReport} has been annotated (as in the present case), it also contains the data given in 
the content of \Robject{featureData} slot of the annotated object.
The above file with the list of scored probes can also be obtained without the need to run \Rfunction{writeReport} by using the function \Rfunction{getTopTable} provided in the package.


%------------------------------------------------------------
\subsection{Exporting data to a tab-delimited file}
%------------------------------------------------------------

The \Rpackage{cellHTS2} package contains a function called \Rfunction{writeTab} to save 
the data stored in \Robject{assayData} slot of a \Rclass{cellHTS} object to a tab-delimited file. 
The rows of the file are sorted by plate and 
well, and there is one row for each plate and well. 
When the \Rclass{cellHTS} object is annotated, the probe information (i.e. the probe identifiers stored in column 
``GeneID'' of the \Robject{featureData} slot) is also added.
%
<<exportData, results=hide, eval=FALSE>>=
writeTab(xsc, file="Scores.txt")
@ 
%
Since you might be interestered in saving other values to a tab delimited file,
below we demonstrate how you can create a matrix with the ratio between each 
raw measurement and the plate median, together with the gene and well 
annotation, and export it to a tab-delimited file using the 
function \Rfunction{write.tabdel}~\footnote{This function is a wrapper of the function 
\Rfunction{write.table}, whereby you just need to specify the name of the data object and 
the file} also provided in the \Rpackage{cellHTS2} package.
%
<<exportOtherData, eval=FALSE>>=
# determine the ratio between each well and the plate median
y = array(as.numeric(NA), dim=dim(Data(x)))
nrWell = prod(pdim(x))
nrPlate = max(plate(x))
for(p in 1:nrPlate) {
  j = (1:nrWell)+nrWell*(p-1)
  samples = wellAnno(x)[j]=="sample"
  y[j, , ] = apply(Data(x)[j, , , drop=FALSE], 2:3, 
     function(w) w/median(w[samples], na.rm=TRUE)) }

y=signif(y, 3)
out = y[,,1]
out = cbind(fData(xsc), out)
names(out) = c(names(fData(xsc)), 
sprintf("Well/Median_r%d_ch%d", rep(1:dim(y)[2], dim(y)[3]), 
rep(1:dim(y)[3], each=dim(y)[2])))
write.tabdel(out, file="WellMedianRatio.txt")
@ 
 
%
At this point we are finished with the basic analysis of the
screen. As one example for how one could continue to further mine the
screen results for biologically relevant patterns, we demonstrate an
application of category analysis in the complete vignette, which is 
given in the \textit{doc} directory of the package, 
or can otherwise be manually produced from the file 
\textit{cellhts2Complete.Rnw} that resides in the \textit{scripts} 
directory of the package.


%------------------------------------------------------------
\section{Appendix: How to convert \Rpackage{cellHTS} to \Rpackage{cellHTS2} configuration files}
\label{sec:migratingProjects}
%------------------------------------------------------------

We advise the users of \Rpackage{cellHTS} package to start using the improved package \Rpackage{cellHTS2} described herein, 
since the latter provides better functionality for working with multi-channel screens and multiple screens.

To facilitate this transition and help users to migrate their \Rpackage{cellHTS}-specific projects to \Rpackage{cellHTS2}, 
we provide in this package a function that converts the old S3 \Robject{cellHTS} object associated with \Rpackage{cellHTS} 
package into one or several S4 
\Robject{cellHTS} defined with the \Rpackage{cellHTS2} package. This function is called \Rfunction{convertOldCellHTS}.

However, you might want to migrate an existing project from start, i.\,e.\,, 
redo all the steps starting by reading the intensity files and configuring the screening data. In this case, you need 
to update the screen log file (if available), the screen description file and the screen configuration 
file~\footnote{The expected format of the other input files, namely the raw intensity data files (Section~\ref{sec:read}) 
and the annotation file (Section~\ref{sec:annotation}) remains unchanged between the two packages.}.\\

% Screen description file
Regarding the screen description file, as mentioned in Section~\ref{sec:screenconf}, 
we provide a function that creates a template screen description file that can be edited and modified by the user. Below we examplify how such file can be created:
%
<<example for description file>>=
out = templateDescriptionFile("template-Description.txt", force=TRUE)
out
readLines(out)
@
%

\input{cellhts2-cellHTSpackage-specificplateconfiguration} 
\input{cellhts2-cellHTSpackage-specificscreenlog} 

% Create the example table for the older formats of plate configuration and screen log files
<<old plateConfscreenLogTable, results=hide, echo=FALSE>>=
cellHTS2:::tableOutput(file.path(dataPath, "old-Plateconf.txt"), 
  "cellHTS package-specific plate configuration", selRows=1:28)
cellHTS2:::tableOutput(file.path(dataPath, "old-Screenlog.txt"), 
  "cellHTS package-specific screen log", selRows=1:3)
@ 
%

% Screen log file
The format of the screen log file compatible with \Rpackage{cellHTS2} package is shown in Table~\ref{tab:screenlog}.
Compared to the previous format required for \Rpackage{cellHTS} package (Table~\ref{tab:cellHTSpackage-specificscreenlog}), 
we note that column \textit{Filename} was 
replaced by two columns named \textit{Plate} and \textit{Sample}.\\


% Screen configuration file
In \Rpackage{cellHTS} package, the concept of \textit{batch} is intrinsically related with the plate configuration, since
a change in plate configuration along the experiment had to be handled by setting each distinct plate 
configuration as corresponding to a different batch.
Therefore, in \Rpackage{cellHTS} package, the plate configuration file had three mandatory columns named \textit{Batch}, \textit{Well},
\textit{Content}, where the \textit{Batch} column allowed for different plate configurations. The first 28 lines of such file for 
the example RNAi screen considered in this report is shown in Table~\ref{tab:cellHTSpackage-specificplateconfiguration}.
Thus, in the old format required for the plate configuration file, we had to have a number of rows equal to the product between the 
total number of batches and the total number of wells per plate.

In contrast to \Rpackage{cellHTS} package, in \Rpackage{cellHTS2} package the concept of \emph{batch} 
and multiple plate configurations were made independent (see Section~\ref{sec:multPlateConfs}).
For \Rpackage{cellHTS2} package, Table~\ref{tab:plateconfiguration} shows the required file format, which was discussed in more detail 
in Section~\ref{sec:plateconf}.

Due to the separation between batch and multiple plate configuration, the column \textit{Batch} was removed, and replaced by the column 
\emph{Plate}. The other two mandatory columns \emph{Well} and \emph{Content} were kept.
Additionally, we now require that this file contains two extra header lines giving the total number of wells and plates 
(Table~\ref{tab:plateconfiguration}). 
There is also an improvement in the file format related with the fact that \emph{Plate} and \emph{Well} columns now allow 
the use of regular expressions (see Section~\ref{sec:plateconf} for more specific information), which allows to cover the plate 
configuration used in the screen with just a few lines. Besides, it allows to specify different configurations within and between assay plates.





%------------------------------------------------------------
\section{Appendix: Normalization methods implemented in \Rpackage{cellHTS2} package}
\label{sec:normalizationMethods}
%------------------------------------------------------------


There are two main normalization methods available with \Rpackage{cellHTS2} package: methods based on the use of reference controls, and distribution-based methods.
These methods can be applied using the function \Rfunction{normalizePlates}.


%------------------------------------------------------------
\subsection{Controls-based normalization}
%------------------------------------------------------------

% POC
% NPI
% negatives

%------------------------------------------------------------
\subsubsection{Percent of control}
%------------------------------------------------------------

Percent of control (POC) is a preprocessing method 
that tries to correct for plate-to-plate variability by 
normalizing each $k$th compound raw measurements in the $i$th result file, 
$x_{ki}$, relative to the average of within-plate controls. 
In an antagonist (or inhibition) type assay, it is defined as:

\begin{equation}
x_{ki}^{\text{POC}} = \frac{x_{ki}}{\mu_i^{\text{pos}}} \times 100
\end{equation}
\noindent
where $\mu_i^{\text{pos}}$ is the average of the measurements on the positive controls in the $i$th result file (i.\,e.\,, for a given plate and replicate).

In \Rpackage{cellHTS2} package, this method can be applied by setting 
the argument \Robject{method="POC"} when calling \Rfunction{normalizePlates} function.

We also provide in the package a normalization method for 
\Rfunction{normalizePlates} (\Robject{method="negatives"})
that consists of scaling
the plate measurements by 
the per-plate median of the intensities on the negative controls~\footnote{If the scale of the data is defined as being additive 
(i.\,e.\,, argument \Robject{scale="additive"}, or arguments \Robject{scale="multiplicate"} and \Robject{log=TRUE}), measurements are subtracted by the median of per-plate negative controls instead.}.


%------------------------------------------------------------
\subsubsection{Normalized percent inhibition}
%------------------------------------------------------------

If \Rfunction{normalizePlates} is called with \Robject{method="NPI"}, the method known as normalized percent inhibition (NPI) is applied in a per-plate basis to correct for plate effects. 
For an antagonist assay, this method divides the 
difference between each measurement in a given result file $i$ ($x_{ki}$) and the average of the positive controls on that plate ($\mu_i^{\text{pos}}$) by the difference between the averages of the measurements on the positive ($\mu_i^{\text{pos}}$) and the negative controls ($\mu_i^{\text{neg}}$):

\begin{equation}
x_{ki}^{\text{NPI}} = \frac{\mu_i^{\text{pos}}-x_{ki}}{\mu_i^{\text{pos}}-\mu_i^{\text{neg}}}
\end{equation}

%------------------------------------------------------------
\subsection{Non-controls-based normalization}
% or distribution-based normalization
%------------------------------------------------------------

% Z score
% plate-median normalization 
% B score
% spatial normalization ("loess" or "locfit")

There are several normalization method implemented in \Rpackage{cellHTS2} 
package that make use of the overall distribution of values, 
instead of relying exclusively on controls.
These are described in the following sections.

%------------------------------------------------------------
\subsubsection{Z score method}
%------------------------------------------------------------

Z score is a simple and widely known normalizing method that is performed in a per-plate basis as follows:

\begin{equation}
x_{ki}^{\text{Z}} = \frac{x_{ki} - \mu_i}{\sigma_i},
\end{equation}
\noindent
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation, 
respectively, of all measurements within the $i$th result file 
(replicated plate).
In the Z score method, measurements are re-scaled relative to 
within-plate variation by subtracting the average of the plate values 
and dividing this difference by the standard deviation 
estimated from all measurements of the plate.

In \Rpackage{cellHTS2}, we consider a robust version of this method, 
where the mean and standard deviation are replaced by the median and the MAD, 
respectively, calculated at the sample wells. This robust Z score method is performed by calling 
\Rfunction{normalizePlates} as follows: 
%
<<Z score method, eval=FALSE>>=
 xZ = normalizePlates(x, scale="additive", log=FALSE, 
         method="median", varianceAdjust="byPlate") 
@ 

%
%------------------------------------------------------------
\subsubsection{Plate median normalization}
%------------------------------------------------------------

Plate median normalization involves calculating the relative signal of each
well compared to the median of the sample wells in the plate, as shown in Equation (\ref{eq:normalizePlateMedian}) 
and Equation (\ref{eq:plateMedian}). 
The median is
calculated among the $m$ wells containing \textit{sample} (i.\,e.\,, for wells that contain genes of interest) in result file $i$.
Plate median normalization can be chosen by setting \Robject{method="median"} in \Rfunction{normalizePlates}. When applied to data on additive scale, the plate median normalization involves subtracting the plate measuments by the per-plate median instead.

We also have two variants of the plate median scaling which consist of using as the per-plate scaling factor $M_i$ the per-plate average intensity on sample wells (\Robject{method="mean"}) or the midpoint of the shorth of the per-plate distribution of values on sample wells (\Robject{method="shorth"}).\\

The next methods are intended to explicitly correct for \emph{spatial effects} within plates, i.\,e.\,, the presence of 
intensity gradients within the plates. 
Such signal gradients can be caused by differences in temperature, 
incubation time or concentration, etc., in different wells across a plate.
Typically, these gradients produce repeatitive patterns, which make it possible to distinguish them from real actives that should be more or less randomly dispersed for quasi-randomized collections of compounds.


%------------------------------------------------------------
\subsubsection{B score method}
%------------------------------------------------------------

In the B score method, row and column biases within each plate are explicitly corrected for by fitting a two-way median polish to the raw data in a per-plate fashion~\cite{Brideau2003}: 

\begin{eqnarray}
e_{rci} = x_{rci} - \hat{x}_{rci} = x_{rci} - \left( \hat{\mu}_i + \hat{R}_{ri} + \hat{C}_{ci}\right) \\ \label{eq:BscoreResiduals}
x^{\text{B}}_{rci} = \frac{e_{rci}}{MAD_i}	\label{eq:BscoreVarAdj}
\end{eqnarray} 
\noindent
Here, $x_{rci}$ is the measurement value in the $r$th row and $c$th column of the plate corresponding to the $i$th result file, $\hat{x}_{rci}$ is the corresponding fitted value defined as the sum between the estimated average of the replicate plate ($\hat{\mu}_i$), the estimated systematic offset for row $r$ ($\hat{R}_{ri}$) and the systematic offset for column $c$ ($\hat{C}_{ci}$) in that replicated plate.
In a second step, each of the obtained residual values $e_{rci}$'s of the $i$th result file are divided by their 
median absolute deviation ($MAD_i$) 
giving the final B score value -- Equation~(\ref{eq:BscoreVarAdj}).


We implemented a method similar to the B score method described in 
Malo \textit{et al.}~\cite{Malo2006} and Brideau \textit{et al.}~\cite{Brideau2003} using the Tukey's median polish procedure~\cite{Tukey1977} (function \textit{medpolish} of the package \textit{stats}) which fits an 
additive model to the data according to Equation~(\ref{eq:BscoreResiduals}). 
The Tukey's median polish algorithm works by alternately removing the row and column medians and continues until the proportional reduction in the sum of absolute residuals is less than $\epsilon$ or until the maximum number of iterations has been reached.
The B score method can be applied by defining argument 
\Robject{method="Bscore"} in \Rfunction{normalizePlates}. Alternatively, 
the method can be applied by calling a separate function 
called \Rfunction{Bscore} provided in the \Rpackage{cellHTS2} package.\\


In \textit{cellHTS2} package, we provide two additional spatial normalization methods 
that fit a polynomial surface to the intensities within each assay plate using local regression and that can be performed via 
\Rfunction{normalizePlates} or \Rfunction{spatialNormalization} functions, although we advise to apply these methods using the 
former function. The fit can be performed either using the \textit{loess} procedure or the 
\textit{locfit.robust} function of package \textit{locfit}.
In \Rfunction{normalizePlates}, if \Robject{method="locfit"}, 
spatial effects are removed by fitting a bivariate local regression to each plate and replicate, 
while if \Robject{method="loess"}, a loess curve is fitted instead.






%------------------------------------------------------------
\section{Appendix: Data transformation}
%------------------------------------------------------------
\myincfig{cellhts2-transfplots}{0.95\textwidth}{Comparison between untransformed 
(left) and logarithmically (base 2) transformed (right), normalized data. 
Upper: histogram of intensity values of replicate 1. 
Middle: scatterplots of standard deviation versus mean of the two replicates. 
Bottom: Normal quantile-quantile plots.}

An obvious question is whether to do the statistical analyses on the
original intensity scale or on a transformed scale such as the
logarithmic one.  Many statistical analysis methods, as well as
visualizations work better if (to sufficient approximation)
\begin{itemize}
\item replicate values are normally distributed,
\item the data are evenly distributed along their dynamic range, 
\item the variance is homogeneous along the dynamic range~\cite{Huber2002ismb}.
\end{itemize}

Figure~\ref{cellhts2-transfplots} compares these properties for
untransformed and log-transformed normalized data, showing that the difference is small. 
Intuitively, this can be explained by the fact that for
small $x$,
\[
\log(1+x)\approx x
\]
and that indeed the range of the untransformed data is mostly not far
from 1.  Hence, for the data examined here, the choice between
original scale and logarithmic scale is one of taste, rather
than necessity.
%
<<transfplots, fig=TRUE, include=FALSE, width=6.5, height=9>>=
library("vsn")
par(mfcol=c(3,2))
myPlots=function(z,...) {
  hist(z[,1], 100, col="lightblue", xlab="",...)
  meanSdPlot(z, ylim=c(0, quantile(abs(z[,2]-z[,1]), 0.95, na.rm=TRUE)), ...)
  qqnorm(z[,1], pch='.', ...)
  qqline(z[,1], col='blue')
}
dv = Data(xn)[,,1]
myPlots(dv, main="untransformed")
xlog = normalizePlates(x, scale="multiplicative", log=TRUE, 
       method="median", varianceAdjust="byExperiment")
dvlog = Data(xlog)[,,1]
myPlots(dvlog, main="log2")
@ 
%







%---------------------------------------------------------
\section{Session info}
%---------------------------------------------------------

This document was produced using:

%\begin{table*}[h]%[tbp]
%\begin{minipage}{\textwidth}
<<sessionInfo, results=tex, print=TRUE>>=
toLatex(sessionInfo())
@ 
%\end{minipage}
%\caption{\label{tab:sessioninfo}%
%The output of \Rfunction{sessionInfo} on the build system 
%after running this vignette.}
%\end{table*}


%------------------------------------------------------------
%Bibliography
%------------------------------------------------------------
\bibliography{cellhts}
\bibliographystyle{plain}

\end{document}

