%\VignetteIndexEntry{Package maskBAD}
%\VignettePackage{maskBAD}

\documentclass{article}

\usepackage{Sweave}
\usepackage[a4paper]{geometry}
\usepackage{hyperref,graphicx}

\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} 
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{{\small\texttt{#1}}}

\title{\textsf{\textbf{Package ``maskBAD''}}}

\author{Michael Dannemann, Michael Lachmann, Anna Lorenc}

\SweaveOpts{echo=FALSE}
\usepackage{a4wide}

\begin{document}

\maketitle

\begin{abstract}
  Package ``maskBAD'' allows to users to identify, inspect and remove
  probes for which binding affinity differs between two groups of
  samples. It works for Affymetrix 3' IVT and whole transcript (exon
  and gene) arrays. When there are SNPs, transcript isoforms or cross
  hybrydizing transcripts specific to only one of compared groups,
  probes' binding affinities may differ between groups. Such probes
  bias estimates of gene expression, introduce false positives in
  differential gene expression analysis and reduce the power to detect
  real expression differences. Hence they should be identified and
  removed from the data at preprocessing stage.  For details about
  impact of BAD (binding affinity difference) probes on differential
  gene expression estimates, implemented detection method and its
  power to improve data quality, please consult our papers Dannemann,
  Lorenc et al.: ``The effects of probe binding affinity differences
  on gene expression measurements and how to deal with them'',
  Bioinformatics, 2009 (\cite{Dannemann09}) and Dannemann et al.:
  ``'maskBAD' - a package to detect and remove Affymetrix probes with
  binding affinity differences'', BMC Bioinformatics 2012
  (\cite{Dannemann12}).
\end{abstract}

\section{Introduction}
The ``maskBAD'' package implements functions to detect and remove probes
with different binding affinity (BAD) in Affymetrix array expression
data. Identification and removal of BAD probes removes spurious gene
expression differences and helps recover true ones. BAD probes are
prevalent in comparisons of genetically distinct samples, like
belonging to different strains or species, but systematic qualitative
differences in transcriptome might introduce them also when samples
differ by treatment, health status or tissue type. Masking therefore
should be a routine step in data preprocessing. In this package we
introduce functions that allow to identify, inspect and remove BAD
probes and show how it can be integrated in a standard gene expression
analysis pipeline.


\section{Identification of BAD probes}

Detection of BAD (binding affinity difference) probes is done on an
\Robject{AffyBatch} object, usually prepared with
\Rfunction{ReadAffy()} from package \Rpackage{affy}. We'll use 100
random probe sets from human and chimpanzee expression dataset
(\cite{Khaitovich04}), each with 10 individuals, available with the
package.

<<AffyBatch,echo=FALSE,results=hide>>=
library("maskBAD")
data(AffyBatch, package="maskBAD")
data(exmask, package="maskBAD")
## load(system.file("data/AffyBatch.rda", package="maskBAD"))
## load(system.file("data/exmask.rda", package="maskBAD"))
@

<<AffyBatch,echo=TRUE>>=
## data available by loading data(AffyBatch)
newAffyBatch
exmask <- mask(newAffyBatch,ind=rep(1:2,each=10),verbose=FALSE,useExpr=F)
@ 

The scoring of probes is performed with the function \Rfunction{mask}
and is based on the algorithm presented in Dannemann et. al
(\cite{Dannemann09}). Briefly, for a given probe, its signal is
proportional to the amount of RNA in the sample and its binding
affinity. When one target is measured with several probes (a probe
set), as on Affymetrix arrays, the probes' signals are correlated for
each sample. BAD probes correlation differs between groups, hence
comparison of probes' pairwise correlations between those groups
allows to identify of BAD probes. Function \Rfunction{mask()} detects
probes differing in binding affinity between groups of samples defined
by argument \texttt{ind}. The masking algorithm workes on a gene/probe
set basis as the expression signal for the same transcript should be
correlated among probes. Therefore, to identify BAD probes use probe
sets defined at transcript/gene level. \\
The method works properly only for probe sets with signal above
background in both groups. By default (\texttt{use.expr=TRUE} option)
only probe sets with \Rfunction{mas5calls()} ``P'' in at least 90\% of
samples from both groups are considered in BAD detection. Another
desired set of probe sets might be specified by the parameter
\texttt{exprlist}.

\subsection{Evaluation of the mask and choice of cutoff}

For human-chimpanzee, with assumption of sequence divergence 1\%,
we'll expect naively $\sim$22\% of probes comprise a SNP. To filter
out lower 22\% of quality scores, we have to put a cutoff at 0.029.

<<quantile,echo=TRUE>>=
quantile(exmask[[1]]$quality.score,0.22)
@ 

Manual inspection of borderline cases with \Rfunction{plotProbe()}
might help to decide about the cutoff. It plots signal intensity of a
probe against all other probes from the same probe set.

<<quantile,echo=TRUE>>=
## add rownames containing x and y coordinates of each probe
rownames(exmask[[1]]) <-apply(exmask[[1]][,c("x","y")],1,
                              function(x)paste(x[1],x[2],sep="."))
## filtering for probes around the choosen cutoff
probesToSee = rownames(exmask[[1]][exmask[[1]]$quality.score<0.03 
  & exmask[[1]]$quality.score >0.028,])
## select a random probe and plot it against 
## all other probes of its probe set
plotProbe(affy=newAffyBatch,probeset=as.character(exmask[[1]][ probesToSee[1],"probeset"]),
          probeXY=probesToSee[1],scan=TRUE,ind=rep(1:2,each=10),
          exmask=exmask$probes,names=FALSE)
@ 

\subsection{Including external data}

Mask might be calibrated with a subset of probes with known BAD status
(called later on external mask), for example probes with target
sequences known in both compared groups. External mask contains
\texttt{x} and \texttt{y} coordinates and probe set assignement for
the probes, along with their BAD status coded as 0/1 (BAD probe/not
BAD probe). For a given cutoff, compare BAD status according to
expression-based mask and  given by external mask. Here we use the
object sequenceMask - the information whether the probe sequence is 
affected by a SNP between the groups (0) or not (1) .

<<quantile,echo=TRUE>>=
head(exmask$probes)
head(sequenceMask)
rownames(sequenceMask) <- paste(sequenceMask$x,sequenceMask$y,sep="_")
rownames(exmask$probes)<- paste(exmask$probes$x,exmask$probes$y,sep="_") 

cutoff=0.029
table((exmask$probes[rownames(sequenceMask),"quality.score"]>cutoff)+0,
      sequenceMask$match) 
@

Here, 952 probes are concordant between masks, whereas 79 probes with
known SNP are not detected by expression-based mask.\\\\

With \Rfunction{overlapExprExtMasks()} expression mask may be
compared with any other designation of BAD probes along a range of
cutoffs.  Type 1 (fraction of external mask not BAD probes, identified
as BAD by expression mask) and type 2 (fraction of external mask BAD
probes, not detected as BAD in expression mask) errors are plotted
with their binomial confidence intervals.

<<quantile,echo=TRUE>>=
head(exmask$probes[,1:3])
head(sequenceMask[,c(1,2,4)])
@

The function \Rfunction{overlapExprExtMasks()} applied on our example
data, produces a plot like shown in Figure \ref{figErr}.

<<Err,echo=TRUE>>=
overlapExSeq <- overlapExprExtMasks(exmask$probes[,1:3],
                                    sequenceMask[,c(1,2,4)],verbose=FALSE,
                                    cutoffs=quantile(exmask$probes[,3],seq(0,1,0.05))) 
@

\begin{figure}
  \centering
  \includegraphics[width=.7\textwidth]{maskBAD-figErr}
  \caption{Overlap of expression-based mask and external mask (red
    line), with binomial confidence intervals (dashed lines). Several
    cutoff levels are marked on the plot.}
  \label{figErr}
\end{figure}


<<figErr,fig=TRUE,width=5,height=5>>=
overlapExSeq <- overlapExprExtMasks(exmask$probes[,1:3],
                                    sequenceMask[,c(1,2,4)],verbose=FALSE,
                                    cutoffs=quantile(exmask$probes[,3],seq(0,1,0.05)))
@

The cutoff depends on the goal of masking. SFPs detection needs
tighter control of type 2 than type 1 error - low cutoff
values. Removal of differential expression bias needs higher cutoff.
The distribution of quality scores for probes declared in external
mask as BAD and not BAD may be compared with Wilcoxon rank sum test
and Kolomogornov-Smirnow test for several cutoff levels. When those
distributions do not differ (high p-values on \ref{figWil}), cutoff is
too high.

<<Err,echo=TRUE>>=
overlapTests <- overlapExprExtMasks(exmask$probes[,1:3],verbose=FALSE,
                                    sequenceMask[,c(1,2,4)],
                                    wilcox.ks=T,sample=100)
@ 

<<figWil1,fig=TRUE,width=16,height=10>>=
plot(overlapTests$testCutoff[[1]],overlapTests$ksBoot,col="lightgrey",
     main="Kolmogorov-Smirnov Test",xlab="Quality score cutoff", 
     ylab="p value (Kolmogorov-Smirnov Test)",ylim=c(0,1),pch=16,xaxt="n",cex=1.5)
axis(1,at=seq(1,length(unique(overlapTests$testCutoff[[2]])),2),
     labels=signif(unique(overlapTests$testCutoff[[2]]),2)[seq(1,length(unique(overlapTests$testCutoff[[2]])),2)],
     las=3)
lines(which(unique(overlapTests$testCutoff[[2]]) %in% overlapTests$testCutoff[[2]]),
      overlapTests$ksP[!is.na(overlapTests$ksP)],type="p",pch=16,col="red")
@ 
<<figWil2,fig=TRUE,width=16,height=10>>=
plot(overlapTests$testCutoff[[1]],overlapTests$wilcoxonBoot,col="lightgrey",
     main="Wilcoxon Rank Sum Test",xlab="Quality score cutoff",
     ylab="p value (Wilcoxon Rank Sum Test)",ylim=c(0,1),pch=16,xaxt="n",cex=1.5)
axis(1,at=seq(1,length(unique(overlapTests$testCutoff[[2]])),2),
     labels=signif(unique(overlapTests$testCutoff[[2]]),2)[seq(1,length(unique(overlapTests$testCutoff[[2]])),2)],
     las=3)
lines(which(unique(overlapTests$testCutoff[[2]]) %in% overlapTests$testCutoff[[2]]),
      overlapTests$wilcoxonP[!is.na(overlapTests$wilcoxonP)],type="p",pch=16,col="red")
@ 

\begin{figure}
  \centering
  \includegraphics[width=\textwidth]{maskBAD-figWil1}
  \includegraphics[width=\textwidth]{maskBAD-figWil2}
  \caption{Kolmogorow-Smirnov-Test (upper panel) and
    Wilcoxon-Rank-Sum-Test (lower panel) comparing distributions of
    quality scores of BAD and not BAD probes. For low cutoffs, two
    distributions are different (red - actual tests; gray - tests on
    bootstraped data, to estimate variance).}
  \label{figWil}
\end{figure}

<<figHist,fig=TRUE,width=5,height=5>>=
hist(exmask[[1]]$quality.score,breaks=100,xlab="mask quality score",main="") 
@ 

\begin{figure}
  \centering
  \includegraphics[width=.7\textwidth]{maskBAD-figHist}
  \caption{Probes' quality scores.}
  \label{figHist}
\end{figure}

When no information about sequence divergence between the groups is
available, a cutoff might be suggested by the distribution of quality
scores for the mask. On Figure 3, an excess of probes with low quality
scores is obvious. Assuming a uniform distribution of quality scores
for probes not different between groups, a cutoff should be set so
that the remaining distribution is uniform - like.

\section{Removal of BAD probes and estimation of expression values}

\subsection{Removal of BAD probes}

To remove probes from \texttt{affybatch}, use
\Rfunction{prepareMaskedAffybatch()}. It produces an
\texttt{affybatch} object (without the masked probes) and an according
\texttt{CDF} environment.  When one probe appears in several probe sets
(as is a case for some custom annotations) and it is detected as BAD
in one probe set only, it will be nevertheless removed from all probe
sets it appears.  With \texttt{cdftable} argument,
\Rfunction{prepareMaskedAffybatch()} allows also to remove any list of
probes.

<<affbs,echo=TRUE>>=
affyBatchAfterMasking <-
  prepareMaskedAffybatch(affy=newAffyBatch,exmask=exmask$probes,
                         cutoff=quantile(exmask[[1]]$quality.score,0.22))

newAffyBatch
affyBatchAfterMasking
@ 

Note changed number of genes (probe sets) in masked \texttt{affybatch}. 

\subsection{Expression estimation with \Rpackage{rma}}

For the new \texttt{affybatch} expression values might be estimated as
usual. New \texttt{affybatch} object has a masked \texttt{cdf}
declared, instead of a standard one.

<<cdf,echo=TRUE>>=
cdfName(affyBatchAfterMasking[[1]])
@ 

Therefore it is important to save new \texttt{cdf} for further use and
make it available in the workspace (with a name matching
\texttt{cdfName} slot of masked \texttt{affybatch}) whenever a
function to estimate expression is called.

<<rmaEx,echo=TRUE>>=
new_cdf=affyBatchAfterMasking[[2]]
head(exprs(rma(affyBatchAfterMasking[[1]])))
@ 

\subsection{Expression estimation with \Rpackage{gcrma}}
Normalization and expression estimates with \Rfunction{gcrma()}
require probe affinities. Those should be computed with standard
\texttt{cdf}. Then both \texttt{affybatch} object and probe affinities
object should be filtered with the mask.

<<grmaEx1,echo=TRUE,results=hide>>=
library(gcrma)
library(hgu95av2probe)
affy.affinities=compute.affinities("hgu95av2",verbose=FALSE)
@ 
<<grmaEx2,echo=TRUE>>=
affy.affinities.m=prepareMaskedAffybatch(affy=affy.affinities,
  exmask=exmask$probes,cdfName="MaskedAffinitesCdf")
@ 

To calculate expression values, just use both masked objects (make
sure that a \texttt{cdf} with name matching \texttt{cdf} slot of
masked \texttt{affyBatch} is accessible!).

<<grmaEx2,echo=TRUE>>=
head(exprs(gcrma(affyBatchAfterMasking[[1]],affy.affinities.m)))
@ 

\subsection{Exon and gene arrays}

Exon and gene arrays require setting \texttt{useExpr=FALSE}, as those
arrays do not contain MM probes and \Rfunction{mas5calls()} - based
definition of expressed genes is not applicable.  As computing time
increases with number of probe sets, we recommend to use a subset of
probe sets (specified with \texttt{exprlist}) for exon arrays.

\bibliographystyle{plain}
\bibliography{maskBAD}

\end{document}


