Note: if you use systemPipeR in published research, please cite: Backman, T.W.H and Girke, T. (2016). systemPipeR: NGS Workflow and Report Generation Environment. BMC Bioinformatics, 17: 388. 10.1186/s12859-016-1241-0.

1 Introduction

systemPipeR provides flexible utilities for building and running automated end-to-end analysis workflows for a wide range of research applications, including next-generation sequencing (NGS) experiments, such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq (H Backman and Girke 2016). Important features include a uniform workflow interface across different data analysis applications, automated report generation, and support for running both R and command-line software, such as NGS aligners or peak/variant callers, on local computers or compute clusters (Figure 1). The latter supports interactive job submissions and batch submissions to queuing systems of clusters. For instance, systemPipeR can be used with most command-line aligners such as BWA (Li 2013; Li and Durbin 2009), HISAT2 (Kim, Langmead, and Salzberg 2015), TopHat2 (Kim et al. 2013) and Bowtie2 (Langmead and Salzberg 2012), as well as the R-based NGS aligners Rsubread (Liao, Smyth, and Shi 2013) and gsnap (gmapR) (Wu and Nacu 2010). Efficient handling of complex sample sets (e.g. FASTQ/BAM files) and experimental designs are facilitated by a well-defined sample annotation infrastructure which improves reproducibility and user-friendliness of many typical analysis workflows in the NGS area (Lawrence et al. 2013).

The main motivation and advantages of using systemPipeR for complex data analysis tasks are:

  1. Facilitates the design of complex NGS workflows involving multiple R/Bioconductor packages
  2. Common workflow interface for different NGS applications
  3. Makes NGS analysis with Bioconductor utilities more accessible to new users
  4. Simplifies usage of command-line software from within R
  5. Reduces the complexity of using compute clusters for R and command-line software
  6. Accelerates runtime of workflows via parallelization on computer systems with multiple CPU cores and/or multiple compute nodes
  7. Improves reproducibility by automating analyses and generation of analysis reports