Note: the most recent version of this tutorial can be found here and a short overview slide show here.

Introduction

systemPipeR provides utilities for building analysis workflows with automated report generation for next-generation sequencing (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and many others (Girke 2014). An important feature is support for running command-line software, such as NGS aligners, on both single machines and compute clusters. This includes both interactive job submissions and batch submissions to the queuing systems of clusters. For instance, systemPipeR can be used with most command-line aligners such as BWA (Heng Li 2013; H Li and Durbin 2009), TopHat2 (Kim et al. 2013) and Bowtie2 (Langmead and Salzberg 2012), as well as the R-based NGS aligners Rsubread (Liao, Smyth, and Shi 2013) and gsnap (gmapR) (Wu and Nacu 2010). Efficient handling of complex sample sets and experimental designs is facilitated by a well-defined sample annotation infrastructure, which improves the reproducibility and user-friendliness of many typical analysis workflows in the NGS area (Lawrence et al. 2013).

Motivation and advantages of the systemPipeR environment:

- Facilitates design of complex NGS workflows involving multiple R/Bioconductor packages
- Makes NGS analysis with Bioconductor utilities more accessible to new users
- Simplifies usage of command-line software from within R
- Reduces complexity of using compute clusters for R and command-line software
- Accelerates runtime of workflows via parallelization on computer systems with multiple CPU cores and/or multiple compute nodes
- Automates generation of analysis reports to improve reproducibility

A central concept for designing workflows within the systemPipeR environment is the use of sample management containers called SYSargs (see Figure 1). Instances of this S4 object class are constructed by the systemArgs function from two simple tabular files: a targets file and a param file. The latter is optional for workflow steps lacking command-line software. Typically, a SYSargs instance stores all sample-level inputs as well as the paths to the corresponding outputs generated by command-line or R-based software that produces sample-level output files, such as read preprocessors (trimmed/filtered FASTQ files), aligners (SAM/BAM files), variant callers (VCF/BCF files) or peak callers (BED/WIG files). Each sample-level input/output file operation uses its own SYSargs instance. The outpaths of one SYSargs instance usually define the sample inputs for the next SYSargs instance. This connectivity is established by writing the outpaths with the writeTargetsout function to a new targets file that serves as input to the next systemArgs call. Typically, the user has to provide only the initial targets file; all downstream targets files are generated automatically. By chaining several SYSargs steps together, one can construct complex workflows involving many sample-level input/output file operations with any combination of command-line or R-based software.

Figure 1: Workflow design structure of systemPipeR
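
The chaining of targets files described above can be illustrated with a small base R sketch. It is a conceptual illustration only and does not use or reproduce systemPipeR's own implementation: in a real workflow, systemArgs and writeTargetsout take care of this bookkeeping. The helper `next_targets`, the column names `FileName` and `SampleName`, and all file paths are assumptions made for this example.

```r
# Conceptual sketch in base R only; not the systemPipeR implementation.

# Step 0: a minimal initial targets table (column names are assumptions).
targets <- data.frame(
  FileName   = c("./data/sampleA.fastq", "./data/sampleB.fastq"),
  SampleName = c("A", "B"),
  stringsAsFactors = FALSE
)
targets_file <- file.path(tempdir(), "targets.txt")
write.table(targets, targets_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Hypothetical helper: derive the output paths of one workflow step from its
# inputs and write them as the targets file consumed by the next step.
next_targets <- function(infile, outfile, outdir, extension) {
  tg <- read.delim(infile, stringsAsFactors = FALSE)
  base <- tools::file_path_sans_ext(basename(tg$FileName))
  tg$FileName <- file.path(outdir, paste0(base, extension))
  write.table(tg, outfile, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(tg)
}

# Step 1 (e.g. read trimming) feeds step 2 (e.g. alignment).
trim_file <- file.path(tempdir(), "targets_trim.txt")
bam_file  <- file.path(tempdir(), "targets_bam.txt")
trimmed <- next_targets(targets_file, trim_file, "./results", "_trim.fastq")
aligned <- next_targets(trim_file, bam_file, "./results", ".bam")
print(aligned$FileName)
```

Only the first targets file is written by hand here; each later file is derived from the output paths of the step before it, which mirrors the design idea of connecting consecutive SYSargs instances.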

The intended way of running systemPipeR workflows is via *.Rnw or *.Rmd files, which can be executed either line-wise in interactive mode or with a single command from R or from the command line using a Makefile. In this way, comprehensive and reproducible analysis reports in PDF or HTML format can be generated in a fully automated manner using the reporting utilities available for R. Templates for setting up custom project reports are provided as *.Rnw files in the vignettes subdirectory of this package. The corresponding PDFs of these report templates are linked here: systemPipeRNAseq, systemPipeChIPseq and systemPipeVARseq. To work with *.Rnw or *.Rmd files efficiently, basic knowledge of Sweave or knitr and LaTeX or R Markdown v2 is required.
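
As a rough sketch of what running a report with a single command from R could look like, the wrapper below dispatches on the file extension. The function name `render_report` and the file name `systemPipeRNAseq.Rmd` are placeholders chosen for this example; `rmarkdown::render` and `knitr::knit2pdf` are the standard entry points of those packages, which must be installed (along with LaTeX for PDF output) before a real report can be built.

```r
# Hypothetical wrapper; the report file name below is a placeholder.
render_report <- function(path) {
  if (!file.exists(path)) {
    message("Report source not found: ", path)
    return(invisible(NULL))
  }
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    rmd = rmarkdown::render(path),   # R Markdown -> HTML/PDF (needs rmarkdown)
    rnw = knitr::knit2pdf(path),     # Sweave/knitr -> PDF (needs knitr, LaTeX)
    stop("Unsupported report type: ", ext)
  )
}

# Assumption: a report source with this name sits in the working directory.
render_report("systemPipeRNAseq.Rmd")
```

If the file is missing, the function only prints a message, so the sketch can be tried safely before a project directory has been set up.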

Relevant workflow parameter files:

- targets.txt: initial file provided by the user; downstream targets_*.txt files are generated automatically
- *.param: defines parameters for input/output file operations, e.g. trim.param, bwa.param, vartools.param, …
- *_run.sh: optional bash script, e.g. gatk_run.sh
- Compute cluster environment (skip on a single machine):
  - .BatchJobs: defines the type of scheduler for BatchJobs
  - *.tmpl: specifies the parameters of the scheduler used by a system, e.g. Torque, SGE, StarCluster, Slurm, etc.

systemPipeR: NGS workflow and report generation environment

Author: Thomas Girke (thomas.girke@ucr.edu)

Last update: 11 September, 2015

Introduction

Getting Started

Installation

Loading package and documentation

Load sample data and workflow templates

Structure of targets file

Structure of targets file for single end (SE) samples

Structure of targets file for paired end (PE) samples

Sample comparisons

Structure of param file and SYSargs container

Workflow overview

Define environment settings and samples

Read Preprocessing

FASTQ quality report

Alignment with Tophat2

Read and alignment count stats

Create symbolic links for viewing BAM files in IGV

Alternative NGS Aligners

Alignment with Bowtie2 (e.g. for miRNA profiling)

Alignment with BWA-MEM (e.g. for VAR-Seq)

Alignment with Rsubread (e.g. for RNA-Seq)

Alignment with gsnap (e.g. for VAR-Seq and RNA-Seq)

Read counting for mRNA profiling experiments

Read counting for miRNA profiling experiments

Correlation analysis of samples

DEG analysis with edgeR

DEG analysis with DESeq2

Venn Diagrams

GO term enrichment analysis of DEGs

Obtain gene-to-GO mappings

Batch GO term enrichment analysis

Plot batch GO term results

Clustering and heat maps

Workflow templates

RNA-Seq sample

Run workflow

ChIP-Seq sample

Run workflow

VAR-Seq sample

VAR-Seq workflow for single machine

Run workflow

VAR-Seq workflow for computer cluster

Version information

References
