# Introduction

This vignette provides the implementation of the procedure described in point 7 of our Guidelines for RNA-Seq data analysis1 protocol available from the Epigenesys website.

Briefly, it details the step necessary to: 1. create a non-redundant annotation 1. count reads per feature 1. pre-analyse the data, i.e. assess the pertinence of the samples’ charateristics in the light of their biological provenance; i.e. in other words perform a so called “biological QA” using assessment methods such as Principal Component Analysis, Multi-dimensional Scaling, Hierarcical Clustering, etc.

The aim of this vignette is to go through these steps using the easyRNASeq package, hence the rationale of the implementation will not be discussed, albeit relevant litterature will be pointed at when necessarry.

To perform the listed steps, we need to instantiate a number of objects to store the minimal set of parameters describing the conducted RNA-Seq experiment, e.g. the BAM files location, the annotation location and type, the sequencing parameters, etc.

To get started with this process, we load the package into our R session:

library(easyRNASeq)
## Warning: multiple methods tables found for 'as.vector'
## Warning: multiple methods tables found for 'relistToClass'
## Warning: multiple methods tables found for 'relistToClass'
## Warning: multiple methods tables found for 'relistToClass'
## Warning: multiple methods tables found for 'relistToClass'

before instantiating an AnnotParam object informing on the location and type of the annotation to be used.

====

# Annotation parameters - AnnotParam

The AnnotParam class is meant to store the minimal set of information necessary to retrieve the annotation

The minimal information to provide is:

1. a datasource: a path to a file provided as a character string or if the type is biomaRt, the datasource you want to connect to, as retrieved using the biomaRt datasource function.
2. a type: one of “gff3”,“gtf”,“rda”, and “biomaRt”. gff3 is the default. If rda is used, it expects the corresponding file to contain a GRanges object by the name of gAnnot.

In this tutorial, we will reproduce the analysis performed in Robinson, Delhomme et al. (Robinson et al. 2014). For that we will start by downloading the original annotation gff3 file for P. trichocarpa, a close related species of the trees used in the study into the current directory.

library(curl)
"Populus_trichocarpa/v3.0/v10.1/GFF3/",
"Ptrichocarpa_210_v3.0_gene_exons.gff3.gz"),
destfile=,"./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz")

Before instantiating an “AnnotParam” object.

    annotParam <- AnnotParam(
datasource="./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz")

This annotation file however contains multiple copy of the same exons, i.e. when exons are shared by several isoforms of a gene. This might result in so-called “multiple-counting” and as described in these guidelines2, we will to circumvent that issue create a set of synthetic transcripts.

====

# Synthetic transcripts - createSyntheticTranscripts

One major caveat of estimating gene expression using aligned RNA-Seq reads is that a single read, which originated from a single mRNA molecule, might sometimes align to several features (e.g. transcripts or genes) with alignments of equivalent quality.

This, for example, might happen as a result of gene duplication and the presence of repetitive or common domains. To avoid counting unique mRNA fragments multiple times, the stringent approach is to keep only uniquely mapping reads - being aware of potential consequences, see the note below.

Not only can ???multiple counting??? arise from a biological reason, but also from technical artifacts, introduced mostly by poorly formatted gff3/gtf annotation files. To avoid this, it is best practice to adopt a conservative approach by collapsing all existing transcripts of a single gene locus into a ???synthetic??? transcript containing every exon of that gene. In the case of overlapping exons, the longest genomic interval is kept, i.e. an artificial exon is created. This process results in a flattened transcript ??? a gene structure with a one to one relationship.

To create such a structure, we use the createSyntheticTranscripts function on the file we just downloaded, simply by passing our annotParam object as argument.

annotParam <- createSyntheticTranscripts(annotParam,verbose=FALSE)

This function returns an updated annotParam object that contains the newly created, flattened transcript annotation. This object can then be saved as an rda file for later re-use or for sharing with collaborators.

save(annotParam,
file="./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts_annotParam.rda")

## Alternatives

Instead of updating the annotParam object, we could have created an object of class Genome_Intervals from the genomeIntervals package, using the same function but using the actual datasource of the previous annotParam object as argument rather than the object itself.

gI <- createSyntheticTranscripts(
"./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz",
verbose=FALSE)

This gI object can then be exported as a gff3 file.

writeGff3(gI,
file="./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts.gff3.gz")

Note: Ignoring multi-mapping reads may introduce biases in the read counts of some genes (such as that of paralogs or of very conserved gene families), but in the context of a conservative first analysis we are of the current opinion that they are best ignored. One should of course assess how many reads are multi-mapping (check for example the STAR output) and possibly extract them from the alignment read file to visualize them using a genome browser so as to understand where they are located and how they may affect any analysis. Based on this, one may, at a later stage, decide to relax the counting parameters to accept multi-mapping reads.

====

# More to come

Work in progress…

=====

# Session Info

## R Under development (unstable) (2015-12-15 r69777)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.2 (El Capitan)
##
## locale:
## [1] C/UTF-8/C/C/C/C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] curl_0.9.4       easyRNASeq_2.7.2 BiocStyle_1.9.2
##
## loaded via a namespace (and not attached):
##  [1] GenomeInfoDb_1.7.3          BiocInstaller_1.21.2
##  [3] formatR_1.2.1               RColorBrewer_1.1-2
##  [5] futile.logger_1.4.1         XVector_0.11.1
##  [7] bitops_1.0-6                futile.options_1.0.0
##  [9] tools_3.3.0                 zlibbioc_1.17.0
## [11] biomaRt_2.27.2              digest_0.6.8
## [13] annotate_1.49.0             evaluate_0.8
## [15] memoise_0.2.1               RSQLite_1.0.0
## [17] lattice_0.20-33             DBI_0.3.1
## [19] yaml_2.1.13                 parallel_3.3.0
## [21] DESeq_1.23.0                hwriter_1.3.2
## [23] genefilter_1.53.0           stringr_1.0.0
## [25] knitr_1.11                  Biostrings_2.39.3
## [27] S4Vectors_0.9.15            IRanges_2.5.18
## [29] devtools_1.9.1              locfit_1.5-9.1
## [31] LSD_3.0                     stats4_3.3.0
## [33] grid_3.3.0                  Biobase_2.31.3
## [35] AnnotationDbi_1.33.3        XML_3.98-1.3
## [37] survival_2.38-3             BiocParallel_1.5.1
## [39] rmarkdown_0.8.1             limma_3.27.6
## [41] latticeExtra_0.6-26         edgeR_3.13.4
## [43] geneplotter_1.49.0          lambda.r_1.1.7
## [45] magrittr_1.5                intervals_0.15.1
## [47] genomeIntervals_1.27.0      Rsamtools_1.23.1
## [49] GenomicRanges_1.23.7        GenomicAlignments_1.7.3
## [51] htmltools_0.2.6             BiocGenerics_0.17.2
## [57] stringi_1.0-1               RCurl_1.95-4.7
## [1] TRUE TRUE TRUE