%
% NOTE -- ONLY EDIT howtogenefilter.Rnw!!!
% howtogenefilter.tex file will get overwritten.
%
%\VignetteIndexEntry{HowTo: Genefilter}
%\VignetteDepends{Biobase, genefilter, class}
%\VignetteKeywords{Expression Analysis}
%\VignettePackage{genefilter}
\documentclass{article}

\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\classdef}[1]{%
  {\em #1}
}

\begin{document}
\title{How To Filter Genes}

\maketitle

\section*{Introduction}

The {\em genefilter} package can be used to filter (select) genes from
a microarray experiment according to a wide variety of different
filtering mechanisms.
To be concrete we will consider the artificial experiment contained
in the \verb+sample.ExpressionSet+ example from the {\em Biobase} package.
This experiment has 26 samples, there are 500 genes and 3
covariates. The covariates are named \verb+sex+, \verb+type+ and
\verb+score+. The first two have two levels and the last one is
continuous.

<<>>=
library("Biobase")
library("genefilter")
data(sample.ExpressionSet)
varLabels(sample.ExpressionSet)
table(sample.ExpressionSet$sex)
table(sample.ExpressionSet$type)
@
%$

One dichotomy that is of interest is to split filters into the
non--specific and the specific. Here, specific means that we are
filtering with reference to some particular variable. For example if
we want to select genes that are differentially expressed in the two
groups defined by \verb+cov1+ that is a specific filter.
If on the otherhand we want to select genes that are expressed in more
than 5 samples then that is an example of a non--specific filter.

First let's see how to perform a non--specific filter.
Suppose we want to select genes that are expressed above 200 in at
least 5 samples. To do that we use the function \verb+kOverA+.

There are three steps that must be performed.
\begin{enumerate}
\item Create the filters function(s).
\item Assemble them into a filtering function.
\item Apply the filtering function to the expression matrix.
\end{enumerate}

<<>>=
f1 <- kOverA(5, 200)
ffun <- filterfun(f1)
which <- genefilter(exprs(sample.ExpressionSet), ffun)
sum(which)
@

Here \verb+f1+ is a filter, the function \verb+ffun+ is a filtering
function and we finally apply it using \verb+genefilter+.
There were 154 genes that had a level of more than 200 in at least 5 samples.

To perform a specific filter lets select genes that are differentially
expressed in the groups defined by \verb+type+.

<<>>=
tf1 <- ttest(sample.ExpressionSet$type, .1)
ff2 <- filterfun(tf1)
wh2 <- genefilter(exprs(sample.ExpressionSet), ff2)
 sum(wh2)
@
%$
Now we see that there are 31 genes that pass this selection
procedure. In this example we chose a very large $p$--value because we
have a very small sample and the covariate was made up.

Suppose that we want to combine the two filters. We want those genes
for which at least 5 are over 200 and which also are differentially
expressed in the groups defined by \verb+type+.

<<>>=
ff3 <- filterfun(f1, tf1)
wh3 <- genefilter(exprs(sample.ExpressionSet), ff3)
sum(wh3)
@

Now we see that there are only 10 genes  that satisfy both conditions.

%%FIXME: need to replace this with something else
%Our last example is to select genes that are
%differentially expressed in at least one of the three groups defined
%by \verb+cov3+.
%To do that we use an Anova filter. This filter uses an analysis of
%variance appraoch (via the \verb+lm+) function to test the hypothesis
%that at least one of the three group means is different from the other
%%two. The test is applied, then the $p$--value computed. We select
%those genes that have a low $p$--value.
%
%<<>>=
%Afilter <- Anova(eset$cov3)
%aff <- filterfun(Afilter)
%wh4 <- genefilter(exprs(eset), aff)
%sum(wh4)
%
%@
%%$
%We see that there are 14 genes that pass this filter and that are
%candidates for further exploration.


\section{Aggregate Vignette}

The function given below performs k--nearest neighbour classification
using leave--one--out cross--validation.
At the same time it aggregates the genes that were selected. The
function returns the predicted classifications as its returned
value. However, there is an additional side effect. The number of
times that each gene was used (provided it was at least one) are
recorded and stored in the environment of the aggregator \verb+Agg+.
These can subsequently be retrieved and used for other purposes.

<<aggregate>>=

 knnCV <- function(EXPR, selectfun, cov, Agg, pselect = 0.01, Scale=FALSE) {
   nc <- ncol(EXPR)
   outvals <- rep(NA, nc)
   for(i in 1:nc) {
      v1 <- EXPR[,i]
      expr <- EXPR[,-i]
      glist <- selectfun(expr, cov[-i], p=pselect)
      expr <- expr[glist,]
      if( Scale ) {
        expr <- scale(expr)
        v1 <- as.vector(scale(v1[glist]))
      }
      else
         v1 <- v1[glist]
      out <- paste("iter ",i, " num genes= ", sum(glist), sep="")
      print(out)
      Aggregate(row.names(expr), Agg)
      if( length(v1) == 1)
         outvals[i] <- knn(expr, v1, cov[-i], k=5)
      else
          outvals[i] <- knn(t(expr), v1, cov[-i], k=5)
    }
    return(outvals)
  }
@
%$

<<aggregate>>=
 gfun <- function(expr, cov, p=0.05) {
    f2 <- ttest(cov, p=p)
    ffun <- filterfun(f2)
    which <- genefilter(expr, ffun)
  }

@

Next we show how to use this function on the system dataset,
\verb+geneData+.

<<aggregate>>=
  library("class")


  ##scale the genes
  ##genescale is a slightly more flexible "scale"
  ##work on a subset -- for speed only
  geneData <- genescale(exprs(sample.ExpressionSet)[1:75,], 1)

  Agg <- new("aggregator")

  testcase <- knnCV(geneData, gfun, sample.ExpressionSet$type, 
         Agg, pselect=0.05)

  genes.used <- as.list(aggenv(Agg))
  genes.counts <- as.numeric(genes.used)
  names(genes.counts) <- names(genes.used)
  sort(genes.counts)
@
%$
  The variable \verb+genes.counts+ now contains, for each gene,
  the number of times it was selected in the cross-validation.

  One may also compare testcase with geneCov to see how well the
  procedure worked at predicting.

\section{Session Information}

The version number of R and packages loaded for generating the vignette were:

<<echo=FALSE,results=tex>>=
toLatex(sessionInfo())
@

\end{document}

