%\VignetteIndexEntry{Introduction to genome projects}

\documentclass[12pt]{article}

\usepackage{Sweave}
\usepackage{fullpage}
\usepackage{hyperref}

\newcommand{\R}{\textsf{R}}
\newcommand{\Rcmd}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{\texttt{#1}}

\title{ Genome project tables in the genomes package  }
\author{Chris Stubben}

\begin{document}
\maketitle

%% for cutting and pasting use continue =""
%% change margins on every chunk
<<setup, echo=FALSE>>=
library(genomes)
options(warn=-1, width=75, digits=2, scipen=3,  "prompt" = "R> ", "continue" = " ")
options(SweaveHooks=list(fig=function() par(mar=c(5,4.2,1,1))))
@

The \pkg{genomes} package collects genome project metadata from NCBI 
using E-utility scripts (esearch, esummary, efetch and elink) or the NCBI genomes FTP. 
The packages also includes tools to summarize, compare and plot the data in the \R~programming environment. Genome
tables are a defined class (\emph{genomes}) and each table is a data
frame where rows are genome projects and columns are the fields
describing the associated metadata.  A number of methods are available that operate on genome tables
including \Rcmd{print}, \Rcmd{summary}, \Rcmd{plot} and \Rcmd{update}.

Genome tables from the Genomes FTP at NCBI include
prokaryotic (\Rcmd{proks}), eukaryotic
(\Rcmd{euks}) and virus genomes (\Rcmd{virus}).
The \Rcmd{print} methods displays the first few rows and columns of
the table (either select less than seven rows or convert the object to
a \Rcmd{data.frame} to print all columns). The \Rcmd{summary} function
displays the download date, a count of projects by status, and a list
of recent submissions. The \Rcmd{plot} method displays a cumulative
plot of genomes by release date.

<<proks>>=
data(proks)
proks
summary(proks)
plot(proks, log='y', las=1)

@ 


Most importantly, the \Rcmd{update} method downloads the latest version of
the table from NCBI and displays a message listing the number of
project IDs added and removed (not run).

<<update, eval=FALSE>>=
update(proks)
@ 

A number of additional functions assist in selecting, sorting and
grouping genomes.  The \Rcmd{species} and \Rcmd{genus} functions can
be used to extract the species or genus from a scientific name.  The \Rcmd{month} 
and \Rcmd{year} functions can be used to extract the month or
year from the release date. The
\Rcmd{table2} function formats and sorts a contingency table by
counts.

<<table2>>=
spp<-species(proks$name)
table2(spp)
@ 


Because subsets of tables are often needed, the binary operator \Rcmd{like} allows
pattern matching using wildcards.   The \Rcmd{plotby} function can
then be used to plot the release dates by status using labeled points,
in this case to identify complete and draft sequences of \emph{Yersinia pestis} released before 2012 (Figure
\ref{yersinia}).  

<<yersinia, fig=TRUE, include=FALSE, height=5, width=5>>=
## Yersinia pestis
yp<-subset(proks, name %like% 'Yersinia pestis*' & year(released)<2012 )
plotby(yp, labels=TRUE, cex=.5, lbty='n', curdate=FALSE)

@ 

\begin{figure}[t]
\centering
\includegraphics[height=5in,width=5in]{genome-tables-yersinia.pdf}
 \caption{Cumulative plot of \emph{Yersinia pestis} genomes released before 2012.}
\label{yersinia}
\end{figure}


\end{document}
