%\VignetteIndexEntry{IVAS : Identification of genetic Variants affecting Alternative Splicing}
%\VignetteDepends{GenomicFeatures}
%\VignettePackage{IVAS}

\documentclass[a4paper,11pt]{article}
\usepackage{graphicx}
\usepackage{a4wide}
\graphicspath{{./result/}}

<<style, echo=FALSE, results=tex>>=
BiocStyle::latex(use.unsrturl=FALSE)
@

\title{\Rpackage{IVAS} : Identification of genetic Variants affecting Alternative Splicing}
\author{Seonggyun Han and Sangsoo Kim}

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

\tableofcontents

\section{Introduction}
Alternative splicing controls relative expression ratios of mature mRNA isoforms from a single gene. Mapping studies of Splicing Quantitative Trait Loci (SQTL), a genetic variant affecting the alternative splicing, are important steps to understand gene regulations and protein activity \cite{KeyanZhao}. We present an effective and user-friendly computational tool to detect SQTLs using transcript expression data from RNA-seq and genotype data, both measured on the same sample. As RNA sequencing (RNA-seq) provides insight into relatively precise measurments of expression level of transcript isoforms from a gene, it is a useful tool to analyze complicated biological phenomenons of RNA transcripts including the alternative splicing \cite{JosephK}. The mapping analysis uses two statistical models : Linear regression model \cite{Breslow} and/or Generalized linear mixed model \cite{Chambers}.

\section{The input data set}
The next subsection introduces the input data. To run this tool, two experimental data sets (an expression data frame from RNA-seq and a genotype data frame) are required. Moreover, we also need a data frame for positions of SNP markers and GTF file for transcript models. As any other genome-wide analyses, it is recommended to use as many samples as possible, usually of population scale, in order to guarantee a statistically significant result.

\subsection{The genotype data}
The genotype data should be prepared as a simple matrix data. Each column represents an individual and its name should match that of the expression matrix described below (2.2)

\begin{table}[h!]
\begin{tabular}{c c c c c}
&ind1&ind2&ind3&ind4\\
SNP1&AA&AA&AT&TT\\
SNP2&CG&CC&GG&CG\\
SNP3&TT&TT&AT&TT\\
\end{tabular}
\end{table}

\subsection{The expression data}
The expression matrix must comprise expresion values of transcripts from RNA-seq. We may obtain them by using alignment tools such as cufflinks. Each column represents an individual and its name should match that of the genotype matrix described above (2.1)

\begin{table}[h!]
\begin{tabular}{c c c c c}
&ind1&ind2&ind3&ind4\\
transcript1&10.5&15.4&6.7&12.4\\
transcript2&6.4&7.2&4.5&9.2\\
transcript3&15.4&14.5&13.2&17.8\\
\end{tabular}
\end{table}

\subsection{The SNP marker position data}
To search SNPs affecting alternative splicing, a data frame comprising genomic location of each SNP is required. It consists of following columns: SNP(SNP marker name), CHR(chromosome number), and locus(SNP position).

\begin{table}[h!]
\begin{tabular}{c c c}
SNP&CHR&locus\\
SNP1&1&4964005\\
SNP2&1&23513047\\
\end{tabular}
\end{table}

\subsection{The transcripts model data}
We need a reference GTF(General Feature Format) file including information about gene structures such as the positions of exons, introns, and transcripts of genes. The GTF file must be \Robject{TxDb} object from the \Biocpkg{GenomicFeatures} package \cite{Michael}.

\section{The example dataset : data from Geuvadis RNA sequencing project of 1000 Genome samples}

This example uses filtered data from an origin data generated by Geuvadis RNA sequencing project, available at \texttt{http://www.geuvadis.org/web/geuvadis/RNAseq-project} \cite{TuuliLappalainen}. The example expression data includes transcripts of 11 randomly selected genes. The genotype data comprises SNPs in those genes. 

\section{Loading data}
For this analysis, you need to load the  \Rpackage{IVAS} package, SNP data, expression data, SNP position data, and \Robject{TxDb} object from GTF.
\\
Loading  \Rpackage{IVAS} package :
<<loading IVAS package,results=hide>>=
library(IVAS)
@
Loading expression data : 
<<loading the expression data,results=hide>>=
data(sampleexp)
@
Loading SNP data : 
<<loading the SNP data,results=hide>>=
data(samplesnp)
@
Loading SNP position data : 
<<loading the SNP position data,results=hide>>=
data(samplesnplocus)
@
Loading \Robject{TxDb} object : 
<<loading the txdb data,results=hide>>=
sampleDB <- system.file("extdata", "sampleDB", package="IVAS")
sample.Txdb <- loadDb(sampleDB)
@

If you want to create the \Robject{TxDb} object from a GTF file, you need to use the \Rfunction{makeTxDbFromGFF} function in the \Biocpkg{GenomicFeatures} package.

\section{Running the  \Rpackage{IVAS} package for detecting SQTLs}
\subsection{Separating the \Robject{TxDb} object based on a single chromosome.}
The \Rfunction{chrseparate} function separates the \Robject{TxDb} object based on a single chromosome for mapping analysis of expression and genotype in a single chromosome.
\\
To use this function : 
<<chrseparate function>>=
filtered.txdb <- chrseparate(sample.Txdb,19)
filtered.txdb
@
In this example, We filter the \Robject{TxDb} object with only the chromosome 6.

\subsection{Finding alternative exons of a single gene}
The \Rfunction{findAlternative} function finds flanking introns of alternative exons from a single gene. To run this function, it requires trans.exon.range, trans.intron.range, and txTable. The trans.exon.range is a range of exons based on transcripts using the \Rfunction{exonsBy} function in \Biocpkg{GenomicFeatures} package. The trans.intron.range is a range of introns based on transcripts using the \Rfunction{intronsBy} function in \Biocpkg{GenomicFeatures} package. The txTable has information about trancripts(start site, chromosome number, etc).
<<findAlternative function>>=
trans.exon.range <- exonsBy(filtered.txdb,by="tx")
trans.intron.range <- intronsByTranscript(filtered.txdb)
txTable <- select(filtered.txdb, keys=names(trans.exon.range),
                  columns=c("TXID","TXNAME","GENEID","TXSTART","TXEND"),keytype="TXID")
Altvalue <- findAlternative("ENSG00000170889",txTable,trans.exon.range,
                            trans.intron.range,19)
Altvalue
@
In this example, we will search SQTLs in the ENSG00000170889.

\subsection{Finding SNPs which belongs to alternative exons and flanking introns}
The \Rfunction{overlapsnp} function searches SNPs in alternative exons and the flanking introns.
<<overlapsnp function>>=
ch.snp.locus <- as.matrix(samplesnplocus[samplesnplocus[,2] == 19,])
ch.snps <- matrix(ch.snp.locus[is.element(ch.snp.locus[,1],rownames(samplesnp)),],
                  ncol=3,byrow=FALSE)
ch.snps.range <- GRanges(seqnames=Rle(19),ranges=IRanges(start=as.integer(ch.snps[,3]),
                         end=as.integer(ch.snps[,3])),metadata=ch.snps[,1])
overlapsnp <- findOversnp(Altvalue,ch.snps.range)
overlapsnp
@

\subsection{Finding SQTLs}
Using the output data from the \Rfunction{Altvalue} function and the \Rfunction{overlapsnp} function, significant SNPs affecting the alternative splicing can be identified by the sqtlfinder function.
<<sqtlfinder function>>=
sqtl.result <- sqtlfinder(Altvalue,overlapsnp,sampleexp,samplesnp,"lm")
sqtl.result
@
In this example, We will run the function with the linear regression model.

\section{Identification of SQTLs in multiple genes}

The functions in the  \Rpackage{IVAS} package perform mapping analysis using a single gene. However, the \Rfunction{MsqtlFinder} function in the \Rpackage{IVAS} package enables one to analyze multiple genes using the multi-thread version of \Rfunction{foreach} function. Moreover, it joins the results on the genes from sqtlfinder function and calculates the FDR using P-values of the results.
<<MsqtlFinder function, eval=FALSE>>=
final.result <- MsqtlFinder(sampleexp,samplesnp,samplesnplocus,sample.Txdb,"lm",1)
@
\Rfunction{MsqtlFinder} shows chromosome numbers during mapping analysis. The last argument denotes the number of threads(a single processor in this example)

\section{Visualizing the result}

To visualize the results into boxplot, the  \Rpackage{IVAS} package provides the saveBplot function. Using the dataframe from the output of \Rfunction{sqtlfinder} or \Rfunction{MsqtlFinder} function, we can make the boxplot.
<<saveBplot function, results=hide>>=
saveBplot(sqtl.result,sampleexp,samplesnp,samplesnplocus,filtered.txdb,"./result")
@
\begin{center}
\includegraphics[width=0.6\textwidth]{ENSG00000170889_54704610-54704756_rs3810232(13)_lm.png}
\end{center}

The output png files are saved in "result" folder.

\section{Session Information}
<<sessionInfo, echo=FALSE>>=
sessionInfo()
@

\begin{thebibliography}{9}

\bibitem{KeyanZhao} Keyan Zhao, Zhi-xiang Lu, Juw Won Park, Qing Zhou, Yi Xing. 2013. GLiMMPS: robust statistical model for regulatory variation of alternative splicing using RNA-seq data. Genome Biol 14, R74.

\bibitem{JosephK} Joseph K. Pickrell, John C. Marioni, Athma A. Pai, Jacob F. Degner, Barbara E. Engelhardt, Everlyne Nkadori, Jean-Baptiste Veyrieras, Matthew Stephens, Yoav Gilad, Jonathan K. Pritchard. 2010. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768-722.

\bibitem{Breslow} N.E. Breslow and D.G. Clayton. 1993. Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association 88 421: 9-25.

\bibitem{Michael} Michael Lawrence, et al. 2013. Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol. 9(8): e1003118.

\bibitem{Chambers} Chambers, J. M. 1992. Linear models. Chapter 4 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth, and Brooks Cole.

\bibitem{TuuliLappalainen} Tuuli Lappalainen, et al. 2013. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506-511.

\end{thebibliography}

\end{document}
