\documentclass[11pt]{article}
%\copyrightyear{} \pubyear{}
% \VignetteIndexEntry{Linear models in GSEA}
%\VignettePackage{GSEAlm}


\usepackage{fullpage} %\usepackage[authoryear,round]{natbib}
\usepackage{subfigure}

\setlength{\parindent}{0.2in}
\setlength{\parskip}{0.1in}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\newcommand{\ID}[1]{{\texttt{#1}}}
%\newcommand{\texttt{BCR/ABL}}[1]{\texttt{BCR/ABL }}
% \newcommand{\negf}[1]{\texttt{NEG }}
    % \linespread{1.2}
\usepackage{natbib}
\bibliographystyle{natbib}


\begin{document}

%\firstpage{1}

\title{GSEAlm Package}
\author{Assaf P. Oron, Robert Gentleman (with contributions from S. Falcon and Z. Jiang) \\
  assaf.oron@gmail.com}
% \address{Computational Biology, 1100 Fairview Ave. N. M2-B876,
%         P.O. Box 19024, Seattle, Washington 98109-1024}


\maketitle

\begin{abstract}

  The \texttt{GSEAlm} package contains tools for linear-model
  inference and diagnostics, tailored for Gene Set Enrichment Analysis
  (GSEA). We demonstrate the package's main features, using the acute
  lymphoblastic leukemia (ALL) dataset, comparing the \texttt{BCR/ABL}
  and \texttt{NEG} phenotypes via GSEA based on chromosome-band
  mapping of genes. If you want to follow the text while running the
  code, please use the code from the \texttt{GSEAlm.R} file in the
  same directory where this document is found, because there are some
  code excerpts which were hidden from view in the tutorial below. A
  more up-to-date and academically oriented exposition of the tools
  coded in this package is available at \citet{OronEtAl08}.

\end{abstract}

\section{Background and Dataset} \label{sec:background}

Gene set enrichment analysis \citep{Subramanian2005} is an important
new approach to the analysis of gene expression data, and it has
already been extended and generalized in a number of ways
\citep{Tian2005, Kim2005, Jiang2007}.

%%% FIXME (Assaf): I added this semi-philosophical paragraph, thinking
%%% it can help understand what we're up to. What do you
%%% think? In the same vein I
%%% suggest to re-order the discussion so that we begin with
%%% diagnostics rather than with the potential for expanding the model.
%%
%% FIXME:RG - we do compute per gene test statistics, but not just
%% t-tests, and others that use GSEA do not, so you might want to say
%% one approach is, rather than characterize all of GSEA.

Expression analysis in general and GSEA in particular can be viewed as
a cascade of successive data reductions:
\begin{enumerate}
\item Biochemical hybridization information is reduced to a set of
  pixel images (one per sample);
\item Images are pre-processed to produce a $G\times n$ matrix of
  normalized average expression estimates ($G$ genes, $n$ samples);
  \item This matrix is then filtered out to remove irrelevant
  genes or samples (non-specific filtering);
\item Per-gene statistics are calculated;
  \item Finally, these statistics are used to calculate
  gene-set-level statistics, which help identify differentially
  expressed or otherwise interesting gene-sets.
\end{enumerate}

This process is necessary, in order to bring the amount of information
generated by the microrray experiment down to a scientifically
relevant and manageable level, while retaining core features. However,
given the immense extent of data reduction, it makes both theoretical
and practical sense to find ways to check the process. This is where
the linear-model toolset comes in handy.

A linear model assumes that the mean of the response variable has a
linear relationship with the explanatory variable(s). In
gene-expression terminology, the typical generic model could be written as:

\begin{equation}
 y_{gi} = \beta_{g0} + \sum_{j=1}^pX_{ij}\beta_{gj} + \epsilon_{gi},
 \label{eq:modelgeneric}
\end{equation}
where
\begin{itemize}
  \item $y_{gi}$ is the gene expression value of gene $g$ in sample $i$;
   \item $p$ is the number of explanatory variables in the model;
   \item $X_{ij}$ is the value of the $j$-th variable for the $i$-th
     sample. For dichotomous variables such as phenotype, one
     typically sets $X$ to zero or one (e.g., \texttt{NEG} will be zero and
     \texttt{BCR/ABL} one);
    \item $\beta_{gj}$ is the true value of the effect of variable $j$
     upon the expression of gene $g$; and
     \item $\epsilon_{gi}$ is a random error (``noise''), often
       assumed to follow a normal distribution with mean zero and
       variance $\sigma^{2}$.
\end{itemize}

The data and model are used to calculate a fitted value for each
observation, denoted as $\hat{y}_{gi}$. The residuals, $e_{gi}\equiv
y_{gi}-\hat{y}_{gi}$ can be naively seen as estimates for the random
errors, but they play a far greater role as will be seen below.

Note that applying linear models to gene expression in the way
outlined here, involves fitting the same model form to all genes
separately and simultaneously.

<<load libs, results=hide, echo=FALSE, cache=FALSE>>=
library("annotate")
library("GOstats")
library("graph")
#require("Rgraphviz", quietly=TRUE)
#library("Rgraphviz")
library("hgu95av2.db")
library("genefilter")
library("ALL")
library("lattice")
library("RColorBrewer")
HMcols = rev(brewer.pal(10,"RdBu"))
cols = brewer.pal(10, "BrBG")
@

We load the \Rpackage{GSEAlm} package. Then we load the data and perform some initial filtering.
The example dataset is from a clinical trial in acute lymphoblastic leukemia
(ALL). This dataset is available from Bioconductor, under the name \Rpackage{ALL}.
<<>>=
library("GSEAlm")
@

%%FIXME: RG says a few words about what and why are needed.

<<Dataset load and initial filtering,cache=FALSE,keep.source=TRUE>>=

data(ALL)

### Some filtering;
### see explanation below code
bcellIdx <- grep("^B", as.character(ALL$BT))
bcrOrNegIdx <- which(as.character(ALL$mol.biol)
                     %in% c("NEG", "BCR/ABL"))
esetA <- ALL[ , intersect(bcellIdx, bcrOrNegIdx)]
esetA$mol.biol = factor(esetA$mol.biol) # recode factor

### Non-specific filtering
esetASub <- nsFilter(esetA,var.cutoff=0.6,var.func=sd)$eset
@

The \Rpackage{ALL} dataset contains \Sexpr{dim(ALL)[1]} genes and
\Sexpr{dim(ALL)[2]} samples.  We are interested, among other things,
in comparing the \texttt{BCR/ABL} and \texttt{NEG} phenotypes,
appearing under the \texttt{mol.biol} variable (hereafter: ``the
phenotype effect'').  There are \Sexpr{dim(esetA)[2]} samples with one
of the two phenotypes; we remove all the rest.  Standard non-specific
filtering of genes was also performed in order to remove most of the
unexpressed genes. The common assumption is that only about $40\%$ of
genes are expressed in typical tissues, and therefore we remove the $60\%$ with
less-variable expression across samples, using standard deviation as
the threshold criterion. The filtered dataset contains
\Sexpr{ncol(esetASub)} samples and \Sexpr{nrow(esetASub)} unique
genes.

Next we identify gene-sets based on chromosomal location. Such an
analysis is of clinical interest, as ALL has been associated with
chromosomal irregularities. We mapped the chromosomal location of each
gene in the filtered dataset, using tools and methods described in
\citet{FalconGent2007} and available in the \Rpackage{Category}
package.  The resulting gene-set structure is hierarchical, and can be
represented as a tree graph. More details follow the code excerpt.

<<ChromosomeMapping, results=hide,cache=FALSE>>=

minBandSize = 5
haveMAP = sapply(mget(featureNames(esetASub), hgu95av2MAP),
                 function(x) !all(is.na(x)))
workingEset = esetASub[haveMAP, ]
entrezUniv = unlist(mget(featureNames(workingEset), hgu95av2ENTREZID))

### Creating incidence matrix and keeping the graph structure

AgraphChr=makeChrBandGraph("hgu95av2.db",univ=entrezUniv)
AmatChr = makeChrBandInciMat(AgraphChr)
AmatChr3 = AmatChr[rowSums(AmatChr)>=minBandSize,]

# Re-ordering incidence matrix columns

egIds = sapply(featureNames(workingEset), function(x) hgu95av2ENTREZID[[x]])
idx = match(egIds, colnames(AmatChr))
AmatChr3 = AmatChr3[, idx]
colnames(AmatChr3)=featureNames(workingEset)


# Updating our graph to include only the bands that actually
# appear in the matrix (doing it a bit carefully though...)

# AmatChr3 = AmatChr[!duplicated(AmatChr),]
AgraphChr3 = subGraph(c("ORGANISM:Homo sapiens",rownames(AmatChr3)),AgraphChr)
# AgraphChr3 = subGraph(rownames(AmatChr3),AgraphChr)

@

There are \Sexpr{nrow(AmatChr3)} bands containing at least \Sexpr{minBandSize} genes,
 and there are \Sexpr{ncol(AmatChr3)} genes mapped to such chromosome bands.
To utilize our \Rpackage{GSEAlm} toolset, we need a chromosome-band
{\bf incidence matrix}: a matrix with as many rows as gene-sets, and as many
columns as genes. Matrix elements are $1$ if the gene belongs to the
gene-set, and $0$ otherwise. Rather than use the convenient
single-step function \texttt{MAPAmat} to create this matrix, here we
first create a hierarchical chromosome tree using
\texttt{makeChrBandGraph}, and then generate the matrix from the tree
using \texttt{makeChrBandInciMat} (all 3 functions are found in the \Rpackage{Category} package).
This way we retain the tree in our working memory, which will prove useful down the road. Note that the
tree's trunk begins with the species-identifying node \texttt{\Sexpr{nodes(AgraphChr3)[1]}},
meaning that complete chromosomes are represented as the first-level branches.

\section{The $t$-Test as a Linear Model, and Basic Diagnostics}

We arrive at step 4 of the data-reduction cascade outlined in the
introduction: individual-gene statistics, which in our case are $t$-tests. A point often overlooked is
that $t$-tests are identical to a linear model with a single
dichotomous explanatory variable (hereafter: ``factor''). This means
that even for $t$-tests we can make use of linear-model tools. To
demonstrate that the two approaches yield equivalent effect estimates,
Fig. \ref{fig:lm1} plots the $t$-statistics produced by the
\texttt{rowttests} function from the \Rpackage{genefilter} package,
vs. the analogous statistics produced by a one-factor phenotype model
using the \texttt{lmPerGene} function from the \Rpackage{GSEAlm}
package. The values are identical except for a sign flip: the two
functions have opposing conventions for factor level ordering
(\texttt{rowttests} calculates "first in alphabetical order minus
second", while \texttt{lmPerGene} uses the opposite which is similar
to the \texttt{lm} default). In our particular application, the former convention
makes more sense since the absence of mutation (i.e., "\texttt{NEG}") is a more natural reference.
We therefore reverse factor level ordering, via the standard \texttt{relevel} function.

<<lmPhen,results=hide,fig=TRUE, include=FALSE, keep.source=TRUE, prefix=FALSE,cache=TRUE>>=
lmPhen <- lmPerGene(workingEset,~mol.biol)

##fit the t-tests model
tobsChr <- rowttests(workingEset,"mol.biol")
## fit it via the linear-model interface
lmEsts = lmPhen$tstat[2,]

plot (tobsChr$stat,lmEsts,main="The t-test as a Linear Model",
xlab="T-test t-statistic",ylab="One-Factor Linear Model t-statistic")


### Re-leveling the factor

workingEset$mol.biol<-relevel(workingEset$mol.biol,ref="NEG")
lmPhen <- lmPerGene(workingEset,~mol.biol)
lmEsts = lmPhen$tstat[2,]

@

\begin{figure}
  \centering
  \includegraphics[height=0.4\textwidth,width=0.4\textwidth]{lmPhen} \caption{\label{fig:lm1}T-statistics for gene-level phenotype effect,
  from the \texttt{rowttests} and \texttt{lmPerGene} functions, plotted against each other.}
\end{figure}

Using the linear-model function instead of just a $t$-test, enables us to directly calculate
residuals. We have as many residuals as raw data points -- in our
case, \Sexpr{nrow(workingEset)} by \Sexpr{ncol(workingEset)}.

For the residual analysis, it is better to use normalized residuals,
so that different genes will have similar magnitude. There are several
ways to achieve this, producing normalized, internally-Studentized or
externally-Studentized residuals \citep{CookWeisberg82}. The
\texttt{getResidPerGene} function allows for all these options (and also
the raw un-normalized residuals), with externally-Studentized
residuals being the default.

Arranging Studentized residuals by sample helps identify individual
samples that tend to exhibit high or low expression
levels. Similarly, we can look for samples with noticeably more or
less variable residuals. If a particular sample is
systematically different from the rest, one may suspect imperfect
normalization or other technical issues. Otherwise, this may indicate
some real underlying phenomena.

Fig. \ref{fig:resid1} summarizes all Studentized residuals by sample
(one box per sample), arranged by phenotype, using the
\texttt{resplot} function. Several samples catch the eye, especially
in the \texttt{NEG} phenotype (left) where the residuals from samples
\ID{68001}, \ID{28001} and \ID{16009} are predominantly negative. In
the \texttt{BCR/ABL} phenotype (right), residuals from sample
\ID{84004} appear to be systematically higher than all the rest. Note
that these features are {\bf not} evident from boxplots of raw
expression intensities.\footnote{To see this, try replacing variable
\texttt{lmPhenRes} by \texttt{workingEset} in the 'resplot' command
below; you'll also need to remove the constraint \texttt{lims=c(-5,5)}. }

Note that \Rfunction{lmPerGene} also allows for an intercept-only
model (by omitting the formula) -- so you can examine residuals for
patterns even without assuming any phenotype effect.

<<SimpleResBoxplot,echo=TRUE, results=hide,fig=TRUE, include=FALSE, prefix=FALSE,cache=TRUE>>=

lmPhenRes <- getResidPerGene(lmPhen)
resplot(resmat=exprs(lmPhenRes),fac=workingEset$mol,cex.main=.7,cex.axis=.6,
horiz=TRUE,lims=c(-5,5),xname="",col=5,cex=.3)

@

\begin{figure}
  \centering
  \includegraphics[height=0.8\textheight,width=0.8\textwidth]{SimpleResBoxplot} \caption{\label{fig:resid1}Raw
    residuals from a linear model of gene expression on phenotype for
    all genes of the filtered \Rpackage{ALL} dataset, grouped by sample and arranged by phenotype (\texttt{NEG}
    on left, \texttt{BCR/ABL} on right).}
\end{figure}


\section{Gene-Set Residuals}

We proceed to the next GSEA step: producing summary statistics of the
phenotype effect by chromosome band. There are several ways, all
roughly equivalent, to achieve this
\citep{Subramanian2005,Jiang2007,EfronTibshi08}. Here we chose the
statistic of \citet{Jiang2007} (hereafter, ``the J-G statistic''), for
having the simplest theoretical properties, and for other reasons that
will become apparent shortly.

The J-G statistic can be defined as
\begin{equation} \tau_{S} =
  \sum_{g \in S} \left(t_{g}-t_{ref}\right) / \sqrt{|S|}, \label{eq:gstau}
\end{equation}

where $t_{g}$ is the phenotype-effect $t$-statistic for gene $g$,
$t_{ref}$ is some estimate of the overall mean shift (e.g., the median
of all $t$-statistics in the experiment) and $|S|$ is the size of gene
set $S$. Since gene-gene correlations may induce a size effect on
$\tau_S$ (Oron and Gentleman, forthcoming), and since the sample size
is large enough, we recommend basing inference upon phenotype-label
(``column'') permutation $p$-values rather than upon comparing
$\tau_S$ to standard Normal or $t$ distributions. Function
\Rfunction{gsealmPerm} provides a reasonably fast implementation of
this test for all chromosome bands simultaneously. We skip this test
for now, and focus on diagnostics instead.

Similarly to the gene-set statistics, we can create gene-set aggregate residuals by sample,
using a formula analogous to (\ref{eq:gstau}):

\begin{equation}
  r_{Si} = \sum_{g \in S} r_{gi}  / \sqrt{|S|}, \label{eq:gser}
\end{equation}

where $r_{gi}$ is the Studentized residual from sample $i$ and gene
$g$. We will refer to the $r_{Si}$ as ``GS residuals''. Both GS main
effect statistics and GS residuals can be generated from
individual-gene information, using the \texttt{GSNormalize} function.

Figure \ref{fig:hm1} displays a heatmap of the $r_{Si}$, with
chromosome bands in rows and samples in columns. Red indicates
positive values and blue negative values. Only the lowest possible
hierarchical level in each chromosome band (subject to the constraint
of at least \Sexpr{minBandSize} genes) was used, in order to avoid
redundancies and overlap-related distortions. Since we retained the chromosome's tree
structure, this is easily achieved using the \texttt{leaves} function
from the \Rpackage{graph} package. Both rows and columns were
simultaneously re-ordered to identify clusters with highly correlated
residual patterns. Note that we prefer to measure similarity via
correlation, rather than Euclidean distance which is the default for
the \texttt{heatmap} function. This is because we are interested in
finding groups of samples (or chromosome bands) that exhibit the same
qualitative behavior with respect to the dataset's baseline.

<<GSEA Resids, results=hide,cache=TRUE>>=
## now we are going to aggregate residuals over chromosome bands

stdrAchr=GSNormalize(exprs(lmPhenRes),AmatChr3)
#rAchr =  AmatChr3 %*% exprs(lmPhenRes)
#rAsqrSums = sqrt(rowSums(AmatChr3))
#stdrAchr = sweep(rAchr, 1, rAsqrSums, FUN="/")
@

<<rAExChrHmap,results=hide, fig=TRUE, include=FALSE, prefix=FALSE,cache=TRUE>>=
# brColors <- ifelse(colnames(stdrAchr) %in% branch1,"brown", "grey")
# molColors <- ifelse(workingEset$mol=="BCR/ABL","brown", "grey")
kinetColors <- ifelse(workingEset$kinet=="hyperd.","brown", "grey")

 onecor=function(x) as.dist(1-cor(t(x))) # To get correlation-based heatmap

### In the heatmap we only use the lowest-level bands, or "leaves" of the graph

ChrLeaves=leaves(AgraphChr3,"out")
### for safety
ChrLeaves=ChrLeaves[ChrLeaves %in% rownames(AmatChr3)]

LeafGenes=which(colSums(AmatChr3[ChrLeaves,])>0)

bandHeatmap=heatmap(stdrAchr[ChrLeaves,],scale="row",col = HMcols,
          ColSideColors=kinetColors,keep.dendro=TRUE,distfun=onecor,
          labRow=FALSE,xlab="Sample",ylab="Chromosome band")
@

\begin{figure}
  \centering
  \includegraphics[height=0.7\textheight,width=0.7\textwidth]{rAExChrHmap}
  \caption{\label{fig:hm1}GS residuals from the linear model of gene
    expression on phenotype for each chromosome band (row) and sample
    (column). Red colors indicate positive residuals, and blues
    indicate negative residuals. Both rows and columns are clustered
    according to a correlation-based similarity measure. The colored
    band above the map indicates the value of the \texttt{kinet}
    variable -- grey for diploid, red for hyperdiploid and white for
    unknown.}
\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Fig. \ref{fig:hm1} exhibits some intriguing patterns.  First, one of
the samples identified above as having low residuals --
\ID{28001} -- catches the eye as a narrow predominantly-blue vertical strip,
again suggesting a technical normalization issue with this sample.

More interesting from a modeling perspective is the apparent
block or checkerboard pattern of the heatmap. This pattern indicates a
potential association between certain samples and the expression level of
certain chromosomal locations, an association not explained by
phenotype.


A skeptic might ask whether perhaps the block pattern is only an
artifact of the heatmap's graphical method, which allows for grouping
and reordering of both rows and columns. For this purpose,
Fig. \ref{fig:hm_fake} displays the heatmap of a similarly sized,
randomly-generated dataset. Since from the residual plot on
Fig. \ref{fig:resid1} we know that different samples are prone to some
normalization-related offset, the data were generated by superimposing
independent errors on sample-specific baselines that (randomly) differ
between columns. This error structure generates a heatmap with strong
vertical features. On the other hand, in the absence of any real correlation
between gene-sets and expression, horizontal features, if any, are mild
and localized -- and therefore there is no apparent block pattern.

<<FakeHmap,echo=FALSE,results=hide, fig=TRUE, include=FALSE, prefix=FALSE,cache=TRUE>>=
colbase=rexp(79,rate=3)*sample(c(-1,1),size=79,replace=T)
randres=sweep(matrix(rnorm(length(ChrLeaves)*79),ncol=79),2,colbase)
heatmap(randres, scale="row",col = HMcols,
          ColSideColors=kinetColors,keep.dendro=TRUE,distfun=onecor,
          ,labRow=FALSE,xlab="Sample",ylab="Chromosome band")
@

\begin{figure}
  \centering \includegraphics[height=0.7\textheight,width=0.7\textwidth]{FakeHmap} \caption{\label{fig:hm_fake}
    Randomly generated GS residuals with a different baseline for each
    sample (column). Sample baselines were generated from an
    exponential distribution, and individual-point noise from a normal
    distribution. Otherwise, layout and details are identical to those of Fig. \ref{fig:hm1}.}
\end{figure}

So it seems that there may be unaccounted-for factors associated with
groups of chromosome bands. Among the \Sexpr{ncol(pData(workingEset))}
descriptive variables available for each sample, we found the
``\texttt{kinet}'' variable to be strongly correlated with the
observed pattern. This variable indicates whether the samples exhibit
hyperdiploidy, i.e., having substantially more than $46$ chromosomes.
Clearly, this property could be associated with increased expression
of certain chromosomes -- and if there are specific hyperdiploidy
patterns common in ALL, this could help explain the block pattern of
Fig. \ref{fig:hm1}. The \texttt{kinet} variable is illustrated as a
colored band at the top of the heatmap, with red indicating
hyperdiploid samples, grey diploid samples, and white samples with
unknown status. Even though hyperdiploid subjects are less than a
quarter of the sample size, they form a strong majority in a cluster
of samples characterized by predominantly negative residuals over most
chromosome bands, and strongly positive residuals over the remaining
bands.

We conclude that adding ``\texttt{kinet}'' to the model may be an idea
worth pursuing. It should be noted, though, that while such a model
expansion might help account for most of the block pattern, it will
not be of any use for minority members -- especially the $6$ diploid
samples in ``sample cluster 1'', and the $8$ hyperdiploid samples not
belonging to either cluster. Hence, some remnant (or a milder
mirror-image) of the block pattern would probably still show after
model expansion.

Another factor of interest, which can be used as a benchmark because
it certainly plays a role in the expression of some genes, is
sex. We conclude that adding sex, too, as a covariate in the
model may be of use (a more detailed discussion of Y-chromosome expression and sex assignment in this dataset is found in \citet{OronEtAl08}).

%Genes on the Y chromosome can only be expressed among males. Y
%chromosome bands occupy only two adjacent rows in Fig. \ref{fig:hm1},
%and therefore the sex-associated pattern is not highly visible. Yet,
%GS residuals for the entire Y-chromosome does show a marked gender effect,
%as expected (Fig. \ref{fig:Yres}). Interestingly, the pattern
%is somewhat noisy, to the extent that one may suspect some gender
%mis-assignment problems in the dataset documentation, especially regarding
%several male samples whose Y-chromosome GS residuals are quite
%negative. A closer inspection of this issue can be provided by looking
%at all individual-gene Y chromosome residuals, and clustering them by sample
%(Fig. \ref{fig:YresidsDetail}); this can be achieved via the \texttt{resplot} function.
%The graphical inspection does not yield a clear-cut answer to
%the question whether such errors had occurred. There are a few outlying
%samples in each gender, but none of them show ``typical'' opposite-gender expression levels.
%Moreover, there is also a high degree of overall variability.
%
%In any case,
%
%
%<<YrABoxplot,results=hide,fig=TRUE,include=FALSE,prefix=FALSE,cache=TRUE>>=
%
%resY <- stdrAchr["Y",]
%boxplot(resY ~ workingEset$sex,range=1,names=c("Female","Male"),ylab="GS Residual",
%        cex=0.8,cex.axis=1.5,boxwex=.3,cex.lab=1.5)
%lines(c(-1,3),rep(0,2),col=2)
%@
%
%\begin{figure}[!h] \centering
%  \includegraphics[height=0.3\textwidth,width=0.3\textwidth]{YrABoxplot}
%\caption{\label{fig:Yres} Boxplot of GS residuals for the Y-chromosome, obtained from a linear
%model with only phenotype as predictor and grouped by sex.}
%\end{figure}
%
%<<YResDetailedBoxplot,fig=TRUE, include=FALSE, prefix=FALSE,cache=TRUE>>=
%resplot(GSname="Y",resmat=exprs(lmPhenRes),gnames=c("Female","Male"),incidence=AmatChr3,fac=workingEset$sex,cex.axis=0.7,cex=0.4)
%@
%\begin{figure} [!h]
%  \centering \includegraphics[height=0.7\textheight,width=0.8\textwidth]{YResDetailedBoxplot}
%\caption{\label{fig:YresidsDetail}Raw residuals of Y-chromosome genes from
% linear model of gene expression on phenotype for the \Rpackage{ALL} dataset,
% grouped by sample and arranged by sex (females on top, males on bottom).}
%\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%% START OF 3-FACTOR LINEAR MODEL PART
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\label{sec:lmWsex}GSEA and Diagnostics, Using a 3-Variable Model}

The previous section's findings suggest a $3$-variable model with
phenotype, hyperdiploidy and sex; we'd like to implement it using
\texttt{lmPerGene}. There are $4$ samples with missing data for this model; we
can either give them up and use the remaining $75$, or {\bf impute}
the data by artificially filling the missing values, and thus retain
all $79$ samples. Here we choose the former, mostly because it is very
unclear whether to assign sample \ID{25006} as diploid or
hyperdiploid. However, if one chooses to impute, the following code
excerpts creates a mirror \Robject{expressionSet} object named
\Robject{imputeEset}; substitute for \Robject{workingEset} in all
subsequent code to see the difference (there should be some difference
in residual patterns, but very little in bottom-line inference).

<<lm3FacImpute,results=hide, include=FALSE, prefix=FALSE,cache=TRUE, keep.source=TRUE>>=

sampleIDs=rownames(pData(workingEset))
imputeEset=workingEset
imputeEset$sex[is.na(workingEset$sex)] <- 'M'
imputeEset$kinet[is.na(workingEset$kinet)] <- 'dyploid'

### This is the questionable sample; try once as diploid
### (by skipping the following line), and once as hyperdiploid
imputeEset$kinet[which(sampleIDs=="25006")]<-'hyperd.'
@

We continue with the original (un-imputed) dataset;
\Rfunction{lmPerGene} as a default removes samples with missing
observations, just like \Rfunction{lm}. We also generate residuals, GS
$\tau$ statistics and GS residuals.

<<lm3FacRun,results=hide, include=FALSE, prefix=FALSE,cache=TRUE>>=

lmExpand <- lmPerGene(workingEset,~mol.biol+sex+kinet)
lmExpandRes <- getResidPerGene(lmExpand,type="extStudent")
lmExpandTees <- t(lmExpand$tstat[2:4,])
lmExpandBandTees<-GSNormalize(lmExpandTees,AmatChr3)

GSresidExpand=GSNormalize(exprs(lmExpandRes),AmatChr3)
@

The GS residual heatmap is shown on Fig. \ref{fig:hmExpanded}. The
patterns are more predominantly vertical than in Fig. \ref{fig:hm1},
but not as exclusively vertical as in Fig. \ref{fig:hm_fake}. Most of
the remnant patterns resembling Fig. \ref{fig:hm1}'s blocks seems to
be related to minority members, as explained earlier. Hyperdiploidy
(depicted on the top bar) does not appear to be associated with any
sample cluster anymore.\footnote{Note that we have to be a bit careful in
setting up the heatmap, since we have $4$ columns less than in the
single-factor one; we take advantage of the fact that residuals are an
\Robject{expressionSet} type object to redefine the color vector on the top bar.}

<<lm3FacHeatmap,fig=TRUE,results=hide, echo=TRUE,include=FALSE, prefix=FALSE,cache=TRUE>>=
kinetColors2 <- ifelse(lmExpandRes$kinet=="hyperd.","brown", "grey")
bandHeatmapExp=heatmap(GSresidExpand[ChrLeaves,], scale="row",col = HMcols,
          ColSideColors=kinetColors2,keep.dendro=TRUE,distfun=onecor,
          labRow=FALSE,xlab="Sample",ylab="Chromosome band")
@

\begin{figure}
  \centering
  \includegraphics[height=0.7\textheight,width=0.7\textwidth]{lm3FacHeatmap}
  \caption{\label{fig:hmExpanded}GS residuals from
  the $3$-factor linear model. Layout and details are similar to those of Fig. \ref{fig:hm1}.}
\end{figure}

<<lm3FacInferencePrep,fig=FALSE,results=hide,cache=TRUE,echo=FALSE>>=
nperm=125
flagp=0.01
@

We proceed to inference. As mentioned above, due to within-sample
correlations the safe approach to inference is via label permutation
tests. For a multiple regression model, we can infer about one factor
of interest only by permuting its labels {\bf within each subgroup defined
  by a level combination of the adjusting factors}. For example, the
phenotype effect will be evaluated by separately and simultaneously
permuting the phenotype labels of hyperdiploid females, hyperdiploid
males, diploid females and diploid males, without mixing labels between the
four groups. Using this approach the adjusting factors must be
categorical (the factor of interest itself can be continuous). Function
\texttt{gsealmPerm} performs such a permutation analysis. The
factors are given as the RHS of a standard \texttt{R} model formula, with the first factor
being the one of interest. If inference is needed for other factors,
the same function can be called with a different factor ordering in the formula.
In this document, we run only \Sexpr{nperm} permutations to save compilation time.

Below is a list of flagged chromosome bands, at the \Sexpr{flagp}
level on either side, for the phenotype effect. Theoretically we
approach the p-value as an alert flag, rather than an exact
probability value (which it is not, due to both the hierarchical
structure and multiple testing). In any case, to make the list less
redundant we look only at the chromosome-tree ``leaves'' as
before. The first list is of bands down-expressed on the
\texttt{BCR/ABL} phenotype, followed by a list of those up-expressed
on \texttt{BCR/ABL}. Note that we have set \texttt{removeShift=TRUE}
(this is not an argument of \Rfunction{gsealmPerm} itself, but rather passed
on to \Rfunction{GSNormalize}) -- meaning that differential expression
is estimated vs. the baseline gene-level expression gap between the
phenotypes (calculated by default as the median of gene-level $t$-statistics).

<<lm3FacInfPhenotype,fig=FALSE>>=
pvalsExpand=gsealmPerm(workingEset[LeafGenes,],~mol.biol+sex+kinet,AmatChr3[ChrLeaves,LeafGenes],nperm=nperm,removeShift=TRUE)

pvalsExpand[pvalsExpand[,1]<flagp,1]
pvalsExpand[pvalsExpand[,2]<flagp,2]
@

These lists are not very different from the ones obtained via a
$1$-factor model (not shown here; you can try and compare by running
\texttt{gsealmPerm} on the phenotype-only model). So adjusting with
two of the most obvious factors in the dataset does not substantially
change inference about the phenotype effect.

The sex effect permutation test is omitted here for brevity.
In any case, there are little surprises there - the only locations flagged as
differentially expressed at the \Sexpr{flagp} level sit on the Y chromosome.

%<<lm3FacInfSex,fig=FALSE>>=
%pvalsSex.Exp=gsealmPerm(workingEset,~sex+kinet+mol.biol,AmatChr3,nperm=nperm)
%pvalsSex.Exp[rownames(AmatChr3) %in% ChrLeaves & pvalsSex.Exp[,1]<flagp,1]
%pvalsSex.Exp[rownames(AmatChr3) %in% ChrLeaves & pvalsSex.Exp[,2]<flagp,2]
%@

Finally, when getting to the hyperdiploidy effect our hard work in
expanding the model bears fruit. Given the nature of the hyperdiploidy
phenomenon, here we focus on complete chromosomes. It turns out --
perhaps not so surprisingly -- that quite a few chromosomes are
flagged as overexpressed among the hyperdiploid samples. Here we leave
\texttt{removeShift} in its default \texttt{FALSE} value, because we'd
like to directly gauge the absolute chromosome-level expression
differences between the hyperdiploid and diploid groups.

<<lm3FacInfHyper,fig=FALSE>>=

chrNames=c(as.character(1:22),"X")
pvalsHyper=gsealmPerm(workingEset,~kinet+mol.biol+sex,AmatChr3[chrNames,],nperm=nperm)

pvalsHyper[pvalsHyper[,2]<flagp,2] ### over-expressed list for hyperdiploids

# browser()

# oldflags=c(table(pvals[,1]<flagp)[2],table(pvals[,2]<flagp)[2])
# newflags=c(table(pvals.Exp[,1]<flagp)[2],table(pvals.Exp[,2]<flagp)[2])

#downtable=table(pvals[,1]<flagp,pvals.Exp[,1]<flagp)
#uptable=table(pvals[,2]<flagp,pvals.Exp[,2]<flagp)
@

If we relax the alert level to $0.1$, the over-expressed list expands
to nearly ten chromosomes. Inspecting chromosome-level
hyperdiploidy patterns among hyperdiploid subjects more closely
(Fig. \ref{fig:hmHyperd}), it can be seen that chromosome X expression
levels are closely correlated with chromosomes 6, 14 and 21, while
chromosomes 19 and 22 follow a different pattern and are closely
correlated with each other.  Chromosome X in particular seems to
dichotomize samples into two groups with strongly higher-than-average
and lower-than-average responses. These observations may have
biological significance.

<<hyperdiploidHeatmap,fig=TRUE,results=hide, include=FALSE, prefix=FALSE,cache=TRUE>>=

GSChromeResid=GSresidExpand[chrNames,]
hyperHmap=heatmap(GSChromeResid[pvalsHyper[,2]<0.1,lmExpandRes$kinet=="hyperd."],
scale="row",col = HMcols,keep.dendro=TRUE,distfun=onecor,xlab="Sample ID",ylab="Chromosome")
@

\begin{figure}
  \centering
  \includegraphics[height=0.3\textheight,width=0.45\textwidth]{hyperdiploidHeatmap}
  \caption{\label{fig:hmHyperd} Complete-chromosome GS residuals from
  the $3$-factor linear model, for chromosomes flagged as over-expressed under hyperdiploidy ($p<0.1$),
   and for the hyperdiploid subjects only.}
\end{figure}


% <<Xresplot,echo=FALSE,fig=TRUE,results=hide, include=FALSE, prefix=FALSE,cache=TRUE>>=
% resplot(GSname="X",resmat=exprs(lmExpandRes),incidence=AmatChr3,fac=workingEset$kinet,gnames=c("Diploid","Hyperdiploid"),prefix="Chromosome",notch=T)
% @

% \begin{figure}
%   \centering
%   \includegraphics[height=0.25\textheight,width=0.5\textwidth]{Xresplot}
%   \caption{\label{fig:Xresplot}Standardized
%     individual-gene residuals from the $3$-factor linear model for X
%     chromosome genes, displayed by sample and split into the diploid
%     (top) and hyperdiploid (bottom) groups.}
% \end{figure}

\subsubsection{Influence Analysis}

Recall that some samples (notably \ID{28001}) were visually
flagged as potentially problematic outliers. What should we do about them? If
there is still access to earlier-stage data (physical samples or raw
images), it is always a good idea to inspect questionable samples
directly. In the absence of such access, one needs to answer the
practical question: how strongly does a single outlying
observation affect model inference?

Since our model includes only dichotomous factors, fitting it boils
down to calculating group averages -- for $2$ groups in the $1$-factor
case and for $8$ groups in the $3$-factor case. The smaller the group,
the more leverage each sample in the group exerts (sample leverage
values are obtained via the \texttt{Leverage} function). A measure
known as Cook's $D$ \citep{CookWeisberg82} incorporates a sample's
leverage together with the magnitude of its residual, to calculate the
square distance by which a single observation ``moves'' the model's
parameter estimates.\footnote{The distance is measured in
  $p$-dimensional parameter space and normalized by the standard
  error.} The \texttt{CooksDPerGene} function performs this
calculation for all samples and genes simultaneously.

In the GSEA framework, we would probably like to aggregate individual-gene $D$
values within each chromosome band, in an analogous manner to $\tau_{Si}$
and $r_{Si}$. However, normalizing by square-root makes little sense
since Cook's $D$ is positive, not pivotal around zero. Instead, we suggest using
\begin{equation}
  D_{Si} = \sqrt{\sum_{g \in S} D_{gi} / |S|} \label{eq:gsD},
\end{equation}
which is simply the root-mean $D$ value over the gene-set. This, too,
can be accomplished via the \texttt{GSNormalize} function.

$D_{Si}$ provides a measure of the typical amount by which the sample
in question affects $t$-statistics for genes in the band. We are
looking for samples whose overall $D_{Si}$ values, across different
gene-sets, are substantially larger than the overall baseline.

Since the $1$-factor model splits the dataset into two roughly
equal-sized groups, the leverage of all samples is roughly the same,
and we expect this model's $D_{Si}$ values to convey a similar message
to Fig. \ref{fig:hm1}. For that model, even the strongest outlier samples do
not appear to be substantially away from the pack
(Fig. \ref{fig:Cook}, top). The story is different for the $3$-factor
model (Fig. \ref{fig:Cook}, bottom). There is at least one highly
influential sample (relative to the baseline):
sample \ID{28024}. This sample has become influential for the expanded
model, because it belongs to a female subject with hyperdiploidy -- one of the smallest
among the $8$ groups (by contrast, sample
\ID{28001} belongs to the largest group, and hence its diminished
influence on the $3$-factor model). Additionally, \ID{28024}'s expression
levels were quite variable to begin with: its Studentized residuals
have the second-highest IQR (see Fig.  \ref{fig:resid1}). Again, if
there is access to raw data in some form, it is a good idea to inspect
them and see why this sample is so highly variable. Otherwise, a
sensitivity analysis (repeating the regression without \ID{28024} and
comparing the results) may be in order -- although in our particular case, the sample's overall
influence does not appear to warrant taking this step.

<<CooksDplot,fig=TRUE,results=hide, keep.source=TRUE, include=FALSE, prefix=FALSE,cache=TRUE>>=
cookie1=CooksDPerGene(lmPhen)
cookie3=CooksDPerGene(lmExpand)

BandCooks1=GSNormalize(cookie1,AmatChr3,fun2=identity)
BandCooks3=GSNormalize(cookie3,AmatChr3,fun2=identity)

BandCooks1Base=BandCooks1[ChrLeaves,]
BandCooks3Base=BandCooks3[ChrLeaves,]
layout(1:2)
boxplot(sqrt(BandCooks1Base)~col(BandCooks1Base),names=sampleIDs,pch='+',
        cex=.2,las=3,cex.axis=.4,
        main="Influence by sample and Chromosome Band: 1-Factor Model",
        xlab="Sample ID",ylab="Cook's D (Band Root-Mean)",boxwex=.5,
        cex.main=0.8,cex.lab=0.7,col=5)
boxplot(sqrt(BandCooks3Base)~col(BandCooks3Base),
        names=sampleIDs[!is.na(workingEset$kinet)],pch='+',
        cex=.2,las=3,cex.axis=.4,
        main="Influence by Sample and Chromosome Band: 3-Factor Model",
        xlab="Sample ID",ylab="Cook's D (Band Root-Mean)",boxwex=.5,
        cex.main=0.8,cex.lab=0.7,col=5,ylim=c(0,0.4))
@

\begin{figure}
  \centering
  \includegraphics[height=0.7\textheight,width=0.9\textwidth]{CooksDplot}

% \includegraphics{CooksDplot}
  \caption{\label{fig:Cook} Chromosome-band root-mean Cook's $D$ values,
    summarized by sample, for the 1-factor (top) and 3-factor (bottom) models.}
\end{figure}

The $D_{Si}$ measure can also be used to check whether there are any
problem gene-sets, i.e., chromosome bands influenced by an unduly
large number of outlying samples. In our case, there are no such bands.\footnote{Graph not shown here; it can be easily produced by constructing a
boxplot by rows rather than columns in the code above -- you'll also need
to set \texttt{names=rownames(AmatChr3[ChrLeaves,])}.}

The \Rpackage{GSEAlm} package also offers two other commonly
encountered diagnostics: \texttt{dffitsPerGene} and
\texttt{dfbetasPerGene}, each using a different measure of how
individual samples affect estimation. According to recent theory
\citep{LaMotte1999,Jensen2001}, all three influence diagnostics convey
probabilistically equivalent information from the strict perspective
of outlier identification. However, these functions are still useful
for visual problem identification.


\newpage
%\bibliographystyle{natbib}
\bibliography{cat}

\section*{Session Information}
\scriptsize{
<<sessionInfo,results=tex,echo=FALSE>>=
toLatex(sessionInfo())
@
}
\end{document}