1 Introduction

Users want to provide here background information about the design of their VAR-Seq project.

This report describes the analysis of a VAR-Seq project studying the genetic differences among several strains … from organism ….

1.1 Experimental design

Typically, users want to specify here all information relevant for the analysis of their VAR-Seq study. This includes detailed descriptions of FASTQ files, experimental design, reference genome, gene annotations, etc.

2 Workflow environment

2.1 Generate workflow environment

systemPipeRdata package is a helper package to generate a fully populated systemPipeR workflow environment in the current working directory with a single command. All the instruction for generating the pre-configured workflow templates are provide in the systemPipeRdata vignette.

systemPipeRdata::genWorkenvir(workflow = "varseq", mydirname = "varseq")
setwd("varseq")

This step can be skipped if you already have the environment to run the analysis. If not, you can run it, and it will create the directory structure and populate all the necessary param and demo data files.

After building and loading the workflow environment generated by genWorkenvir from systemPipeRdata all data inputs are stored in a data/ directory and all analysis results will be written to a separate results/ directory, while the systemPipeVARseq.Rmd script and the targets file are expected to be located in the parent directory. The R session is expected to run from this parent directory. Additional parameter files are stored under param/.

To work with real data, users want to organize their own data similarly and substitute all test data for their own data. To rerun an established workflow on new data, the initial targets file along with the corresponding FASTQ files are usually the only inputs the user needs to provide.

For more details, please consult the documentation here. More information about the targets files from systemPipeR can be found here.

2.2 Build the Workflow with a single command

This template provides some common steps for a VARseq workflow. One can add, remove, modify workflow steps by operating on the SYSargsList workflow object. For more details of all the features and utilities, please consult the main vignette.

To initiate a VARseq workflow, this entire Rmarkdown file will be imported as a SYSargsList workflow object, by using the importWF("systemPipeVARseq.Rmd") command.

In this template, code chunks with the option spr = TRUE' will be added to the workflow. Other R code chunks without this option will be ignored. The option eval = FALSE can be ignored when imported and build the workflow object. Please be aware of this possibility.

The template can provide more than one alternative for each step, such as different mapping methods, that will receive the mandatory or optional flag. One can run just the mandatory steps, ALL, or optional steps when running the workflow.

Also, each one of the steps can be run on compute clusters (compute option) or on the current session, here called management session. For the demonstration of this template, a management session will be chosen.

2.3 Workflow initialization

The other alternative is to initialize the workflow and append each of the steps in the workflow object.

sal <- SPRproject()

2.4 Required packages and resources

systemPipeR workflows can be designed and built from start to finish with a single command, importing from an R Markdown file or stepwise in interactive mode from the R console. This tutorial will demonstrate how to build the workflow in an interactive mode, appending each step. The workflow is constructed by connecting each step via appendStep method. Each SYSargsList instance contains instructions needed for processing a set of input files with a specific command-line or R software and the paths to the corresponding outfiles generated by a particular tool/step.

The systemPipeR package needs to be loaded (H Backman and Girke 2016).

# Some samples in the test dataset do not work well in
# VARseq, and VARseq workflow takes long time to process
# each sample. To better test and speed up the test
# workflow, sample set is reduced to the first 13 samples.
# Please REMOVE the next two lines in your real analysis
cat(crayon::red$bold("Some samples in targets are removed for test workflow. Please change the template to disable this in your real analysis.\n"))
writeLines(readLines("targetsPE.txt")[1:13], "targetsPE.txt")

cat(crayon::blue$bold("To use this workflow, following R packages are expected:\n"))
cat(c("'GenomicFeatures", "VariantAnnotation", "GenomicFeatures",
    "ggbio", "ggplot2'\n"), sep = "', '")
### pre-end
appendStep(sal) <- LineWise(code = {
    library(systemPipeR)
}, step_name = "load_SPR")

2.5 FASTQ quality report

The following seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length, and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The results are written to a png file named fastqReport.png.

This is the pre-trimming fastq report. Another post-trimming fastq report step is not included in the default. It is recommended to run this step first to decide whether the trimming is needed.

Please note that initial targets files are being used here. In this case, it has been added to the first step, and later, we used the function getColumn to extract a named vector.

appendStep(sal) <- LineWise(code = {
    targets <- read.delim("targetsPE.txt", comment.char = "#")
    updateColumn(sal, step = "load_SPR", position = "targetsWF") <- targets
    fq_files <- getColumn(sal, "load_SPR", "targetsWF", column = 1)
    fqlist <- seeFastq(fastq = fq_files, batchsize = 10000, klength = 8)
    png("./results/fastqReport.png", height = 162, width = 288 *
        length(fqlist))
    seeFastqPlot(fqlist)
    dev.off()
}, step_name = "fastq_report_pre", dependency = "load_SPR")