%\VignetteIndexEntry{OrganismDbi: A meta framework for Annotation Packages}
%\VignetteDepends{Homo.sapiens}

\documentclass[11pt]{article}

\usepackage{Sweave}
\usepackage[usenames,dvipsnames]{color}
\usepackage{graphics}
\usepackage{latexsym, amsmath, amssymb}
\usepackage{authblk}
\usepackage[colorlinks=true, linkcolor=Blue, urlcolor=Blue,
  citecolor=Blue]{hyperref}

%% Simple macros

\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\file}[1]{{\texttt{#1}}}

\newcommand{\software}[1]{\textsl{#1}}
\newcommand\R{\textsl{R}}
\newcommand\Bioconductor{\textsl{Bioconductor}}
\newcommand\Rpackage[1]{{\textsl{#1}\index{#1 (package)}}}
\newcommand\Biocpkg[1]{%
  {\href{http://bioconductor.org/packages/devel/bioc/html/#1.html}%
    {\textsl{#1}}}%
  \index{#1 (package)}}
\newcommand\Rpkg[1]{%
  {\href{http://cran.fhcrc.org/web/devel/#1/index.html}%
    {\textsl{#1}}}%
  \index{#1 (package)}}
\newcommand\Biocdatapkg[1]{%
  {\href{http://bioconductor.org/packages/devel/data/experiment/html/#1.html}%
    {\textsl{#1}}}%
  \index{#1 (package)}}
\newcommand\Robject[1]{{\small\texttt{#1}}}
\newcommand\Rclass[1]{{\textit{#1}\index{#1 (class)}}}
\newcommand\Rfunction[1]{{{\small\texttt{#1}}\index{#1 (function)}}}
\newcommand\Rmethod[1]{{\texttt{#1}}}
\newcommand\Rfunarg[1]{{\small\texttt{#1}}}
\newcommand\Rcode[1]{{\small\texttt{#1}}}

%% Question, Exercise, Solution
\usepackage{theorem}
\theoremstyle{break}
\newtheorem{Ext}{Exercise}
\newtheorem{Question}{Question}


\newenvironment{Exercise}{
  \renewcommand{\labelenumi}{\alph{enumi}.}\begin{Ext}%
}{\end{Ext}}
\newenvironment{Solution}{%
  \noindent\textbf{Solution:}\renewcommand{\labelenumi}{\alph{enumi}.}%
}{\bigskip}




\title{OrganismDbi: A meta framework for Annotation Packages}
\author{Marc Carlson}

\SweaveOpts{keep.source=TRUE}
\begin{document}

\maketitle


OrganismDbi is a software package that helps tie together different
annotation resources.  It is expected that users may have previously
made seen packages like \Rpackage{org.Hs.eg.db} and
\Rpackage{TxDb.Hsapiens.UCSC.hg19.knownGene}.  Packages like these two
are very different and contain very different kinds of information,
but are still about the same organism: Homo sapiens.  The
\Rpackage{OrganismDbi} package allows us to combine resources like
these together into a single package resource, which can represent ALL
of these resources at the same time.  An example of this is the
\Rpackage{homo.sapiens} package, which combines access to the two
resources above along with others.

This is made possible because the packages that are represented by
\Rpackage{homo.sapiens} are related to each other via foreign keys.

\begin{figure}[ht]
\centering
\includegraphics[width=.6\textwidth]{databaseTypes.pdf}
\caption{Relationships between Annotation packages}
\label{fig:dbtypes}
\end{figure}



\section{Getting started with OrganismDbi}

Usage of a package like this has been deliberately kept very simple.
The methods supported are the same ones that work for all the packages
based on \Rclass{AnnotationDb} objects.  The methods that can be
applied to these new packages are \Rmethod{columns}, \Rmethod{keys},
\Rmethod{keytypes} and \Rmethod{select}.

So to learn which kinds of data can be retrieved from a package like
this we would simply load the package and then call the \Rmethod{columns}
method.

<<columns>>=
library(Homo.sapiens)
columns(Homo.sapiens)
@

To learn which of those kinds of data can be used as keys to extract
data, we use the \Rmethod{keytypes} method.

<<keys>>=
keytypes(Homo.sapiens)
@

To extract specific keys, we need to use the \Rmethod{keys} method,
and also provide it a legitimate keytype:

<<keys>>=
head(keys(Homo.sapiens, keytype="ENTREZID"))
@

And to extract data, we can use the \Rmethod{select} method.  The
select method depends on the values from the previous three methods to
specify what it will extract.  Here is an example that will extract,
UCSC transcript names, and gene symbols using Entrez Gene IDs as keys.

<<select>>=
k <- head(keys(Homo.sapiens, keytype="ENTREZID"),n=3)
select(Homo.sapiens, keys=k, columns=c("TXNAME","SYMBOL"), keytype="ENTREZID")
@

In Addition to \Rmethod{select}, some of the more popular range based
methods have also been updated to work with an \Rclass{AnnotationDb}
object.  So for example you could extract transcript information like this:

<<transcripts>>=
transcripts(Homo.sapiens, columns=c("TXNAME","SYMBOL"))
@ 

And the \Rclass{GRanges} object that would be returned would have the
information that you specified in the columns argument.  You could also
have used the \Rmethod{exons} or \Rmethod{cds} methods in this way.

The \Rmethod{transcriptsBy},\Rmethod{exonsBy} and \Rmethod{cdsBy}
methods are also supported.  For example:

<<transcriptsBy>>=
transcriptsBy(Homo.sapiens, by="gene", columns=c("TXNAME","SYMBOL"))
@ 


\section{Making your own OrganismDbi packages}

So in the preceding section you can see that using an
\Rpackage{OrganismDbi} package behaves very similarly to how you might
use a \Robject{TxDb} or an \Robject{OrgDb} package.  The same
methods are defined, and the behave similarly except that they now
have access to much more data than before.  But before you make your
own OrganismDbi package you need to understand that there are few
logical limitations for what can be included in this kind of package.

\begin{itemize}
  
\item The 1st limitation is that all the annotation resources in question
must have implemented the four methods described in the preceding
section (\Rmethod{columns}, \Rmethod{keys}, \Rmethod{keytypes} and
\Rmethod{select}).

\item The 2nd limitation is that you cannot have more than one example
of each field that can be retrieved from each type of package that is
included.  So basically, all values returned by \Rmethod{columns} must be
unique across ALL of the supporting packages.

\item The 3rd limitation is that you cannot have more than one example
of each object type represented.  So you cannot have two org packages
since that would introduce two \Robject{OrgDb} objects.

\item And the 4th limitation is that you cannot have cycles in the
graph.  What this means is that there will be a graph that represents
the relationships between the different object types in your package,
and this graph must not present more than one pathway between any two
nodes/objects.  This limitation means that you can choose one foreign
key relationship to connect any two packages in your graph.

\end{itemize}

With these limitations in mind, lets set up an example.  Lets show how
we could make Homo.sapiens, such that it allowed access to
\Rpackage{org.Hs.eg.db}, \Rpackage{TxDb.Hsapiens.UCSC.hg19.knownGene}
and \Rpackage{GO.db}.

The 1st thing that we need to do is set up a list that expresses the
way that these different packages relate to each other.  To do this,
we make a \Robject{list} that contains short two element long
character vectors.  Each character vector represents one relationship
between a pair of packages.  The names of the vectors are the package
names and the values are the foreign keys.  Please note that the
foreign key values in these vectors are the same strings that are
returned by the \Rmethod{columns} method for the individual packages.
Here is an example that shows how \Rpackage{GO.db},
\Rpackage{org.Hs.eg.db} and
\Rpackage{TxDb.Hsapiens.UCSC.hg19.knownGene} all relate to each other.

<<setupColData>>=
gd <- list(join1 = c(GO.db="GOID", org.Hs.eg.db="GO"),
           join2 = c(org.Hs.eg.db="ENTREZID",
                     TxDb.Hsapiens.UCSC.hg19.knownGene="GENEID"))
@ 

So this \Robject{data.frame} indicates both which packages are
connected to each other, and also what these connections are using for
foreign keys.

Once this is finished, we just have to call the
\Rfunction{makeOrganismPackage} function to finish the task.


<<makeOrganismPackage, eval=FALSE>>=
destination <- tempfile()
dir.create(destination)
makeOrganismPackage(pkgname = "Homo.sapiens",
  graphData = gd,
  organism = "Homo sapiens",
  version = "1.0.0",
  maintainer = "Package Maintainer<maintainer@somewhere.org>",
  author = "Some Body",
  destDir = destination,
  license = "Artistic-2.0")
@ 


\Rfunction{makeOrganismPackage} will then generate a lightweight
package that you can install.  This package will not contain all the
data that it refers to, but will instead depend on the packages that
were referred to in the \Robject{data.frame}.  Because the end result
will be a package that treats all the data mapped together as a single
source, the user is encouraged to take extra care to ensure that the
different packages used are from the same build etc.



\end{document}




