<!--
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{A package to produce Metafeature plots}
-->

metagene: A package to produce Metafeature plots
========================================================

Charles Joly Beauparlant, Fabien Claude Lamaze, Rawane Samb, 
Astrid Louise Deschenes and Arnaud Droit.

This package and the underlying metagene code are distributed under 
the Artistic license 2.0. You are free to use and redistribute this software. 

## Introduction

This package produces Metagene-like plots to compare the behavior of 
DNA-interacting proteins at selected groups of features. A typical analysis 
can be done in viscinity of transcription start sites (TSS) of genes or at any 
regions of interest (such as enhancers). Multiple combinations of group of 
features and/or group of bam files can be compared in a single analysis. 
Bootstraping analysis is used to compare the groups and locate regions with 
statistically different enrichment profiles. In order to increase the 
sensitivity of the analysis, alignment data is used instead of peaks produced 
with peak callers (i.e.: MACS2 or PICS). The metagene package uses bootstrap 
to obtain a better estimation of the mean enrichment and the confidence 
interval for every group of samples.

This vignette will introduce all the main features of the metagene package.

## Loading metagene package

```{r libraryLoad}
suppressMessages(library(metagene))
```
 
## Inputs

### Alignment files (BAM files)

There is no hard limit in the number of BAM files that can be included in an 
analysis (but with too many BAM files, memory may become an issue). BAM files 
must be indexed. For instance, if you use a file names `file.bam`, a file 
named `file.bam.bai` must be present in the same directory.

The path (relative or absolute) to the BAM files must be in a vector:
```{r bamFiles}
bam_files <- get_demo_bam_files()
bam_files
```

For this demo, we have 2 samples (each with 2 replicates).

### Genomic regions

#### BED files

To compare custom regions of interest, it is possible to use a list of one or 
more BED files.

```{r regionsArgument}
regions <- get_demo_regions()
regions
```

The name of the files (without the extension) will be used to name each groups.

#### GRanges or GRangesList objects - Regions

As an alternative to a list of bed files, GRanges or GRangesList objects can 
be used.

### Design groups (facultative)

A design group contains a set of BAM files that, when pull together, represent 
a logical analysis. Furthermore, a design group contains the relationship 
between every BAM files present. Samples (with or without replicates) and 
controls can be assigned to a same design group. There can be as many groups 
as necessary. A BAM file can be assigned to more than one group. 

To represent the relationship between every BAM files, design groups must have 
the following columns:

* The list of paths to every BAM files related to an analysis.
* One column per group of files (replicates and/or controls). 

There is two possible way to create design groups, by reading a file or by 
directly creating a design object in R.

#### Design groups from a file

Design groups can be loaded into the metagene package by using a text file. As 
the relationship between BAM files as to be specified, the following columns 
are mandatory:

* First column: The list of paths (absolute or relative) to every BAM files 
for all the design groups.
* Following columns: One column per design group (replicates and/or controls). 
The column can take only three values: 
	+ 0: ignore file
	+ 1: input
	+ 2: control

The file must also contain a header. It is recommanded to use Samples for the 
name of the first column, but the value is not checked. The other columns in 
the design file will be used for naming design groups, and must be unique.

```{r designFile}
fileDesign <- system.file("extdata/design.txt", package="metagene")
design <- read.table(fileDesign, header=TRUE, stringsAsFactors=FALSE)
design$Samples <- paste(system.file("extdata", package="metagene"), 
                        design$Samples, sep="/")
design
```

#### Design groups from R

It is not obligatory to use a design file, you can create the design 
`data.frame` using your prefered method (as long as the restrictions on the 
values mentioned previously are respected).

For instance, the previous design data.frame could have been create directly 
in R:
```{r alternateDesign}
design <- data.frame(Samples = c("align1_rep1.bam", "align1_rep2.bam", 
                            "align2_rep1.bam", "align2_rep2.bam", "ctrl.bam"),
		                    align1 = c(1,1,0,0,2), align2 = c(0,0,1,1,2))
design$Samples <- paste0(system.file("extdata", package="metagene"), "/", 
                            design$Samples)
design
```

## Analysis steps

A typical metagene analysis will consist steps:

* Extraction the read count of every BAM files in selected regions.
* Conversion in coverage.
* Normalization of the coverage values.
* Statistical analysis.
* Generation of the metagene plot.

To facilitate the analysis, all those steps managed by a metagene object. A 
typical analysis will require 2 steps:

* Initialization of the metagene object: during this step, the metagene object 
will produce normalized coverages for every regions specified and for every 
bam files.
* Generation of the plot: during this step, the metagene object will perform 
the statistical analysis necessary to obtain a robust estimate of the mean 
and its confidence interval for every group of regions and will show the 
results in a metagene plot.

### Generating coverages

To extract coverages, the user must initialize the metagene object. The minimal
requirements for this step is a list of bam files and a list of regions:
```{r initializeMetageneObj}
mg <- metagene$new(regions, bam_files)
```

### Plotting results

The statistical analysis and the production of the metagene plot are done 
in a single step with the function `plot` of the metagene object. In this
specific case, we know that all the regions have the same size and are all
centered on the same element. We can specify this information with the `range`
parameter.

```{r producePlotEx1}
results <- mg$plot(range = c(-1000,1000))
```

The plot function returns a list containing the `data.frame` used by 
ggplot2 to produce the results as well as the Friedman test results 
(or `NULL` if no test run).

```{r showDataFrame, collapse=TRUE}
class(results)
names(results)
head(results$DF)
```
