% \VignetteDepends{stats}
% \VignetteIndexEntry{MantelCorrVignette}
% \VignetteKeywords{Cluster}
% \VignettePackage{MantelCorr}
\documentclass[11pt]{article}

\usepackage{epsfig}
\usepackage{latexsym}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsfonts}
\usepackage{amsxtra}
\usepackage{graphicx,subfigure}
\usepackage{vmargin}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}

\parindent 0in
\setpapersize{USletter}
\setmarginsrb{1truein}{0.5truein}{1truein}{0.5truein}{16pt}{30pt}{0pt}{20truept}
\setlength{\emergencystretch}{2em}
\begin{document}

\title{MantelCorr Package for Bioconductor}
\author{Brian Steinmeyer, MS, and William Shannon, PhD\\
Department of Internal Medicine \\
Division of General Medical Sciences \\
Washington University in St. Louis \\
School of Medicine \\
email: \texttt{steinmeb@ilya.wustl.edu, wshannon@wustl.edu}}

\maketitle
\bibliographystyle{plain}
\tableofcontents

% library(tools)
% testfile <- file.path("/home/steinmeb/RpackageCreation2/MantelCorrTest2/inst/doc", "mantelcorr_vignette.Rnw") 
% Sweave(testfile)

\section{Description}

The \Rpackage{MantelCorr} package is based on the methodology developed in Shannon et al. \cite{shannon05},
for which six functions are used to locate and identify important gene 
clusters from standard microarray expresssion data with \Robject{p} genes (\emph{rows}) and \Robject{n} samples (\emph{columns}).  
Mantel statistics have been applied with success to correlate gene expression levels with clinical covariates \cite{shannon02}.
We also include a real microarray dataset with the package to help illustrate its functionality.  Specifically, 
the package makes use of the \Robject{k-means()} function in \Robject{R} (\emph{with arbitrary \Robject{k}, 
say \Robject{k} $\in$ [5, \Robject{$\frac{p}{{2}}$}]})
to essentially over-partition the gene space into \Robject{k} non-overlapping clusters.  Next, two types 
of dissimilarity matrices are computed, one based on the original data \Robject{Dfull}, and one for each 
resultant cluster, \Robject{Dsubset(k)}.\\  


Mantel \cite{mantel67} cluster correlations are then found by correlating each \Robject{Dsubset(k)} with \Robject{Dfull}, 
resulting in \Robject{k} Mantel correlations.  In order to destroy the distance dependent nature of 
\Robject{Dfull} and to obtain an empirical null distribution of distance independence, a permutation test 
is done, where the number of permutations and $\alpha$ significance level parameters can be chosen by the user.
Specifically, the significance level provides the criterion value (\emph{p-value}) at which a given cluster is considered 
significant or non-significant.  Both significant and non-significant cluster lists can be viewed with the 
\Rfunction{ClusterList} function.  In addition, a summary list of genes within these clusters can
also be seen with the \Rfunction{ClusterGeneList} function.\\


We next introduce a simple application of the \Rfunction{MantelCorr} package with gene-expression training data taken from the 
Golub et al. \cite{golub99} leukemia study.\\



\section{Example \Robject{R} session with the Golub training data}

The Golub training data consists of gene-expression values measured for 38 samples from Affymetrix Hgu6800 chips on 
$7,129$ genes.  There are $27$ acute lymphoblastic leukemia (\Robject{ALL}) and $11$ acute myeloid leukemia (\Robject{AML}) samples. To load
the \Rpackage{MantelCorr} package, simply type \texttt{library(MantelCorr)}.
The data can be loaded by typing \texttt{data(GolubTrain)} and a description provided with \texttt{?GolubTrain}.\\

<<preliminaries>>=
library(MantelCorr)
data(GolubTrain)
dim(GolubTrain)
data <- GolubTrain
@


\section{Reminder}
Help on any of the following \Rpackage{MantelCorr} package functions can be viewed 
by \texttt{?FunctionName}, which provides a complete description and overview of the 
function's purpose and syntax.  In addition, all input 'data' values are \Robject{assumed}
to be interval-scale (e.g., numeric data), with gene and sample labels assigned from the 
\Robject{dimnames()} function.\\



\section{\Robject{GetClusters()} function}
The \Rfunction{GetClusters()} function over-partitions the gene-space as described in the package description.  We 
select \Robject{k = 500} clusters and store the result in an object called "kmeans.result".\\

<<GetClusters>>=
kmeans.result <- GetClusters(data, 500, 100)
@


\section{\Robject{DistMatrices()} function}

A function used to compute distance matrices \Robject{Dfull} and \Robject{Dsubset(k)} from the \Robject{k} 
non-overlapping clusters stored in "kmeans.result".  The result is assigned to "DistMatrices.result".\\

<<DistMatrices>>=
DistMatrices.result <- DistMatrices(data, kmeans.result$clusters)
@

\section{\Robject{MantelCorrs()} function}

The \Rfunction{MantelCorrs()} function uses \Robject{Dfull} and \Robject{Dsubset(k)} to compute a Mantel correlation 
for each \Robject{kth} cluster by correlating these two dissimilarity matrices.  The result is saved in 
"MantelCorrs.result".\\

<<MantelCorrs>>=
MantelCorrs.result <- MantelCorrs(DistMatrices.result$Dfull, DistMatrices.result$Dsubsets)
@


\section{\Robject{PermutationTest()} function}
\Rfunction{PermutationTest()} permutes \Robject{Dfull} to obtain an empirical null distribution for which
cluster significance is determined.  We have selected 100 permutations in order to conserve CPU time, and 
chosen an $\alpha$-value of 0.05 for the 38 Golub leukemia samples. The result is stored in an object called
"permuted.pval".  \Robject{NOTE:}  we recommend using at least 1000 permutations for a thorough analysis.\\

<<PermutationTest>>=
permuted.pval <- PermutationTest(DistMatrices.result$Dfull, DistMatrices.result$Dsubsets, 100, 38, 0.05)
@


\section{\Robject{ClusterList()} function}
A function used to generate a complete list of both significant and non-significant clusters
found by the permutation test and associated level of significance.  Cluster size and correlation 
are provided with each type of cluster.  We assign 
the result to the \Robject{R} object "ClusterLists" as follows:\\


<<ClusterList>>=
ClusterLists <- ClusterList(permuted.pval, kmeans.result$cluster.sizes, MantelCorrs.result)
@


\section{\Robject{ClusterGeneList()} function}
A final function that uses information from the "ClusterList" function, coupled with the \Robject{dimnames}
function to generate a composite list of the genes found in both cluster types (significant and non-significant).  
We store the result in \Robject{R} object "ClusterGenes".\\


<<ClusterGeneList>>=
ClusterGenes <- ClusterGeneList(kmeans.result$clusters, ClusterLists$SignificantClusters, data)
@

% <<fig=TRUE, include=FALSE>>=
% @



% \bibliography{qvalue} 

% \begin{figure}[ht]
%  \begin{center}
%    \includegraphics[width=5in,height=5in]{pHist}
%  \end{center}
%  \caption{Histogram of p-values.}
% \end{figure}

% \begin{figure}[ht]
%  \begin{center}
%    \includegraphics[width=5in,height=5in]{qHist}
%  \end{center}
%  \caption{Histogram of q-values.}
% \end{figure}


\begin{thebibliography}{9}
	
	\bibitem{shannon05}
          Shannon, Steinmeyer, Li, Culverhouse, Grefenstette, Thompson. (2005)
	  \emph{Variable Selection in Cluster Analysis Using k-means and Mantel Correlation}.
           Computing Science and Statistics (To Appear).


	\bibitem{mantel67}
	  Mantel, N.
	  \emph{The Detection of Disease and a Generalized Regression Approach}.
	  Cancer Research, 27, 209-220, 1967.


	\bibitem{shannon02}
          Shannon W, Watson M, Perry A, Rich K. 
          \emph{Mantel statistics to correlate gene expression levels from microarrays with clinical covariates}.
           Genetic Epidemiology 2002; 23:87-96.


	\bibitem{golub99}
          Golub, T. et al.
	  \emph{Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression
           Monitoring}.
          Science, 531-537, 1999.

\end{thebibliography}

\end{document}
