% \VignetteDepends{stats} % \VignetteIndexEntry{MantelCorrVignette} % \VignetteKeywords{Cluster} % \VignettePackage{MantelCorr} \documentclass[11pt]{article} \usepackage{epsfig} \usepackage{latexsym} \usepackage{amsmath} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsxtra} \usepackage{graphicx,subfigure} \usepackage{vmargin} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \parindent 0in \setpapersize{USletter} \setmarginsrb{1truein}{0.5truein}{1truein}{0.5truein}{16pt}{30pt}{0pt}{20truept} \setlength{\emergencystretch}{2em} \begin{document} \title{MantelCorr Package for Bioconductor} \author{Brian Steinmeyer, MS, and William Shannon, PhD\\ Department of Internal Medicine \\ Division of General Medical Sciences \\ Washington University in St. Louis \\ School of Medicine \\ email: \texttt{steinmeb@ilya.wustl.edu, wshannon@wustl.edu}} \maketitle \bibliographystyle{plain} \tableofcontents % library(tools) % testfile <- file.path("/home/steinmeb/RpackageCreation2/MantelCorrTest2/inst/doc", "mantelcorr_vignette.Rnw") % Sweave(testfile) \section{Description} The \Rpackage{MantelCorr} package is based on the methodology developed in Shannon et al. \cite{shannon05}, for which six functions are used to locate and identify important gene clusters from standard microarray expresssion data with \Robject{p} genes (\emph{rows}) and \Robject{n} samples (\emph{columns}). Mantel statistics have been applied with success to correlate gene expression levels with clinical covariates \cite{shannon02}. We also include a real microarray dataset with the package to help illustrate its functionality. Specifically, the package makes use of the \Robject{k-means()} function in \Robject{R} (\emph{with arbitrary \Robject{k}, say \Robject{k} $\in$ [5, \Robject{$\frac{p}{{2}}$}]}) to essentially over-partition the gene space into \Robject{k} non-overlapping clusters. Next, two types of dissimilarity matrices are computed, one based on the original data \Robject{Dfull}, and one for each resultant cluster, \Robject{Dsubset(k)}.\\ Mantel \cite{mantel67} cluster correlations are then found by correlating each \Robject{Dsubset(k)} with \Robject{Dfull}, resulting in \Robject{k} Mantel correlations. In order to destroy the distance dependent nature of \Robject{Dfull} and to obtain an empirical null distribution of distance independence, a permutation test is done, where the number of permutations and $\alpha$ significance level parameters can be chosen by the user. Specifically, the significance level provides the criterion value (\emph{p-value}) at which a given cluster is considered significant or non-significant. Both significant and non-significant cluster lists can be viewed with the \Rfunction{ClusterList} function. In addition, a summary list of genes within these clusters can also be seen with the \Rfunction{ClusterGeneList} function.\\ We next introduce a simple application of the \Rfunction{MantelCorr} package with gene-expression training data taken from the Golub et al. \cite{golub99} leukemia study.\\ \section{Example \Robject{R} session with the Golub training data} The Golub training data consists of gene-expression values measured for 38 samples from Affymetrix Hgu6800 chips on $7,129$ genes. There are $27$ acute lymphoblastic leukemia (\Robject{ALL}) and $11$ acute myeloid leukemia (\Robject{AML}) samples. To load the \Rpackage{MantelCorr} package, simply type \texttt{library(MantelCorr)}. The data can be loaded by typing \texttt{data(GolubTrain)} and a description provided with \texttt{?GolubTrain}.\\ <>= library(MantelCorr) data(GolubTrain) dim(GolubTrain) data <- GolubTrain @ \section{Reminder} Help on any of the following \Rpackage{MantelCorr} package functions can be viewed by \texttt{?FunctionName}, which provides a complete description and overview of the function's purpose and syntax. In addition, all input 'data' values are \Robject{assumed} to be interval-scale (e.g., numeric data), with gene and sample labels assigned from the \Robject{dimnames()} function.\\ \section{\Robject{GetClusters()} function} The \Rfunction{GetClusters()} function over-partitions the gene-space as described in the package description. We select \Robject{k = 500} clusters and store the result in an object called "kmeans.result".\\ <>= kmeans.result <- GetClusters(data, 500, 100) @ \section{\Robject{DistMatrices()} function} A function used to compute distance matrices \Robject{Dfull} and \Robject{Dsubset(k)} from the \Robject{k} non-overlapping clusters stored in "kmeans.result". The result is assigned to "DistMatrices.result".\\ <>= DistMatrices.result <- DistMatrices(data, kmeans.result$clusters) @ \section{\Robject{MantelCorrs()} function} The \Rfunction{MantelCorrs()} function uses \Robject{Dfull} and \Robject{Dsubset(k)} to compute a Mantel correlation for each \Robject{kth} cluster by correlating these two dissimilarity matrices. The result is saved in "MantelCorrs.result".\\ <>= MantelCorrs.result <- MantelCorrs(DistMatrices.result$Dfull, DistMatrices.result$Dsubsets) @ \section{\Robject{PermutationTest()} function} \Rfunction{PermutationTest()} permutes \Robject{Dfull} to obtain an empirical null distribution for which cluster significance is determined. We have selected 100 permutations in order to conserve CPU time, and chosen an $\alpha$-value of 0.05 for the 38 Golub leukemia samples. The result is stored in an object called "permuted.pval". \Robject{NOTE:} we recommend using at least 1000 permutations for a thorough analysis.\\ <>= permuted.pval <- PermutationTest(DistMatrices.result$Dfull, DistMatrices.result$Dsubsets, 100, 38, 0.05) @ \section{\Robject{ClusterList()} function} A function used to generate a complete list of both significant and non-significant clusters found by the permutation test and associated level of significance. Cluster size and correlation are provided with each type of cluster. We assign the result to the \Robject{R} object "ClusterLists" as follows:\\ <>= ClusterLists <- ClusterList(permuted.pval, kmeans.result$cluster.sizes, MantelCorrs.result) @ \section{\Robject{ClusterGeneList()} function} A final function that uses information from the "ClusterList" function, coupled with the \Robject{dimnames} function to generate a composite list of the genes found in both cluster types (significant and non-significant). We store the result in \Robject{R} object "ClusterGenes".\\ <>= ClusterGenes <- ClusterGeneList(kmeans.result$clusters, ClusterLists$SignificantClusters, data) @ % <>= % @ % \bibliography{qvalue} % \begin{figure}[ht] % \begin{center} % \includegraphics[width=5in,height=5in]{pHist} % \end{center} % \caption{Histogram of p-values.} % \end{figure} % \begin{figure}[ht] % \begin{center} % \includegraphics[width=5in,height=5in]{qHist} % \end{center} % \caption{Histogram of q-values.} % \end{figure} \begin{thebibliography}{9} \bibitem{shannon05} Shannon, Steinmeyer, Li, Culverhouse, Grefenstette, Thompson. (2005) \emph{Variable Selection in Cluster Analysis Using k-means and Mantel Correlation}. Computing Science and Statistics (To Appear). \bibitem{mantel67} Mantel, N. \emph{The Detection of Disease and a Generalized Regression Approach}. Cancer Research, 27, 209-220, 1967. \bibitem{shannon02} Shannon W, Watson M, Perry A, Rich K. \emph{Mantel statistics to correlate gene expression levels from microarrays with clinical covariates}. Genetic Epidemiology 2002; 23:87-96. \bibitem{golub99} Golub, T. et al. \emph{Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring}. Science, 531-537, 1999. \end{thebibliography} \end{document}