%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Accessin raw mass spectrometry and identification data}
%\VignetteKeywords{mzXML, mzData, netCDF, mzML, mzIdentML, mass spectrometry, proteomics, metabolomics}
%\VignettePackage{mzR}

\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\title{A parser for raw and identification mass-spectrometry data}

\author{Bernd Fischer\footnote{\email{bernd.fischer@embl.de}} \\
  Steffen Neumann\footnote{\email{sneumann@ipb-halle.de}} \\
  Laurent Gatto\footnote{\email{lg390@cam.ac.uk}}\\
  Qiang Kou\footnote{\email{qkou@umail.iu.edu}}}


\begin{document}


\maketitle

\tableofcontents

\section{Introduction}

The \Biocpkg{mzR} package aims at providing a common interface to
several mass spectrometry data formats, namely \texttt{mzData}
\cite{Orchard2007}, \texttt{mzXML} \cite{Pedrioli2004}, \texttt{mzML}
\cite{Martens2010} for raw data, and \texttt{mzIdentML}
\cite{Jones2012}, somewhat similar to the Bioconductor package affyio
for affymetrix raw data. No processing is done in \Biocpkg{mzR}, which
is left to packages such as \Biocpkg{XCMS} \cite{Smith:2006,
  Tautenhahn:2008} or \Biocpkg{MSnbase} \cite{Gatto:2012}.

\bigskip

Most importantly, access to the data should be fast and memory
efficient. This is made possible by allowing on-disk random file
access, i.e. retrieving specific data of interest without having to
sequentially browser the full content nor loading the entire data into
memory.

The actual work of reading and parsing the data files is handled by
the included C/C++ libraries or ``backends''. The \texttt{mzRramp}
RAMP parser, written at the Institute for Systems Biology (ISB) is a
fast and lightweight parser in pure C. Later, it gained support for
the \texttt{mzData} format. The C++ reference implementation for the
\texttt{mzML} is the proteowizard library \cite{Kessner08} (pwiz in
short), which in turn makes use of the boost C++
(\url{http://www.boost.org/}) library.  RAMP is able to access
\texttt{mzML} files by calling pwiz methods. More recently, the
proteowizard\footnote{\url{http://proteowizard.sourceforge.net/}}
\cite{Chambers2012} has been fully integrated using the
\texttt{mzRpwiz} backend for raw data. The \texttt{mzRnetCDF} backend
provides support to \texttt{CDF}-based formats. Finally, the
\texttt{mzRident} backend is available to access identification data
(\texttt{mzIdentML}) through pwiz.

\warning{It is anticipated to switch to the \texttt{mzRpwiz} backend
  in Bioconductor 3.1. We advise users and developers to test it and
  report any issues on the github issue tracker
  \url{https://github.com/sneumann/mzR/issues}.}

The \Biocpkg{mzR} package is in essence a collection of wrappers to
the C++ code, and benefits from the C++ interface provided through the
Rcpp package \cite{Rcpp11}.


\section{Mass spectrometry raw data}

All the mass spectrometry file formats are organized similarly, where
a set of metadata nodes about the run is followed by a list of spectra
with the actual masses and intensities. In addition, each of these
spectra has its own set of metadata, such as the retention time and
acquisition parameters.

\subsection{Spectral data access}

Access to the spectral data is done via the \Rfunction{peaks}
function. The return value is a list of two-column mass-to-charge and
intensity matrices or a single matrix if one spectrum is queried.

\subsection{Identification result access}

The main access to identification result is done via \Rfunction{psms},
\Rfunction{score} and \Rfunction{modifications}.  \Rfunction{psms} and
\Rfunction{score} will return the detailed information on each psm and
scores.  \Rfunction{modifications} will return the details on each
modification found in peptide.

\subsection{Metadata access}

\paragraph{Run metadata} is available via several functions such as
\Rfunction{instrumentInfo()} or \Rfunction{runInfo()}. The individual
fields can be accessed via e.g. \Rfunction{detector()} etc.

\paragraph{Spectrum metadata} is available via \Rfunction{header()},
which will return a list (for single scans) or a dataframe with
information such as the \Rfunction{basePeakMZ},
\Rfunction{peaksCount}, \ldots or, for higher-order MS the
\Rfunction{msLevel} and precursor information.

\paragraph{Identification metadata} is available via
\Rfunction{mzidInfo()}, which will return a list with information such
as the \Rfunction{software}, \Rfunction{ModificationSearched},
\Rfunction{enzymes}, \Rfunction{SpectraSource} and other information
for this identification result.

\bigskip

The availability of this metadata can not always be guaranteed, and
depends on the MS software which converted the data.

\section{Example}

\subsection{\texttt{mzXML}/\texttt{mzML}/\texttt{mzData} files}

A short example sequence to read data from a mass spectrometer. 
First open the file.

<<openraw>>=
library(mzR)
library(msdata)

mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML", 
                     package = "msdata")
aa <- openMSfile(mzxml) ## ramp, default backend
@ 

We can obtain different kind of header information.
<<get header information>>=
runInfo(aa)
instrumentInfo(aa)
header(aa,1)
@ 

Read a single spectrum from the file.
<<plotspectrum>>=
pl <- peaks(aa,10)
peaksCount(aa,10)
head(pl)
plot(pl[,1], pl[,2], type="h", lwd=1)
@ 

One should always close the file when not needed any more if you are using RAMP backend.
This will release the memory of cached content.

<<close the file>>=
close(aa)
@ 

\subsection{\texttt{mzIdentML} files}

You can use \Rfunction{openIDfile} to read a \texttt{mzIdentML} file
(version 1.1), which use the pwiz backend.

<<openid>>=
library(mzR)
library(msdata)

file <- system.file("mzid", "Tandem.mzid.gz", package="msdata")
x <- openIDfile(file)
@ 

\Rfunction{mzidInfo} function will return general information about
this identification result.

<<metadata>>=
mzidInfo(x)
@ 

\Rfunction{psms} will return the detailed information on each
peptide-spectrum-match, include \texttt{spectrumID},
\texttt{chargeState}, \texttt{sequence}. \texttt{modNum} and others.

<<psms0>>=
p <- psms(x)
colnames(p)
@

The modifications information can be accessed using
\Rfunction{modifications}, which will return the \texttt{spectrumID},
\texttt{sequence}, \texttt{name}, \texttt{mass} and \texttt{location}.

<<psms1>>=
m <- modifications(x)
head(m)
@

Since different software will use different scoring function, we
provide a \texttt{score} to extract the scores for each psm. It will
return a data.frame with different columns depending on software
generating this file.

<<psms2>>=
scr <- score(x)
colnames(scr)
@

\section{Future plans}

Other file formats provided by HUPO, such as \texttt{mzQuantML} for
quantitative data \cite{Walzer:2013} are also possible in the future.

\section{Session information}\label{sec:sessionInfo} 

<<label=sessioninfo, results='asis', echo=FALSE>>=
toLatex(sessionInfo())
@ 


\bibliography{mzR}

\end{document}



