1 Introduction

Users want to provide here background information about the design of their RNA-Seq project.

2 Samples and environment settings

2.1 Environment settings and input data

Typically, the user wants to record here the sources and versions of the reference genome sequence along with the corresponding annotations. In the provided sample data set all data inputs are stored in a data subdirectory and all results will be written to a separate results directory, while the systemPipeRNAseq.Rmd script and the targets file are expected to be located in the parent directory. The R session is expected to run from this parent directory.

To run this sample report, mini sample FASTQ and reference genome files can be downloaded from here. The chosen data set SRP010938 contains 18 paired-end (PE) read sets from Arabidposis thaliana (Howard et al. 2013). To minimize processing time during testing, each FASTQ file has been subsetted to 90,000-100,000 randomly sampled PE reads that map to the first 100,000 nucleotides of each chromosome of the A. thalina genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files (provided in the same download) have been truncated accordingly. This way the entire test sample data set is less than 200MB in storage space. A PE read set has been chosen for this test data set for flexibility, because it can be used for testing both types of analysis routines requiring either SE (single end) reads or PE reads.

The following loads one of the available NGS workflow templates (here RNA-Seq) into the user’s current working directory. At the moment, the package includes workflow templates for RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq. Templates for additional NGS applications will be provided in the future.

library(systemPipeRdata)
genWorkenvir(workflow = "rnaseq")
setwd("rnaseq")

Alternatively, this can be done from the command-line as follows:

Rscript -e "systemPipeRdata::genWorkenvir(workflow='rnaseq')"

Now open the R markdown script systemPipeRNAseq.Rmdin your R IDE (e.g. vim-r or RStudio) and run the workflow as outlined below. If you work under Vim-R-Tmux, the following command sequence will connect the user in an interactive session with a node on the cluster. The code of the Rmd script can then be sent from Vim on the login (head) node to an open R session running on the corresponding computer node. This is important since Tmux sessions should not be run on the computer nodes.

q("no")  # closes R session on head node
srun --x11 --partition=short --mem=2gb --cpus-per-task 4 --ntasks 1 --time 2:00:00 --pty bash -l
module load R/3.4.2
R

Now check whether your R session is running on a computer node of the cluster and not on a head node.

system("hostname")  # should return name of a compute node starting with i or c 
getwd()  # checks current working directory of R session
dir()  # returns content of current working directory

2.2 Required packages and resources

The systemPipeR package needs to be loaded to perform the analysis steps shown in this report (H Backman and Girke 2016).

library(systemPipeR)

If applicable load custom functions not provided by systemPipeR package.

source("systemPipeRNAseq_Fct.R")

To apply workflows to custom data, the user needs to modify the targets file and if necessary update the corresponding .cwl and .yml files. A collection of pre-generated .cwl and .yml files are provided in the param/cwl subdirectory of each workflow template. They are also viewable in the GitHub repository of systemPipeRdata (see here).

2.3 Experiment definition provided by targets file

The targets file defines all FASTQ files and sample comparisons of the analysis workflow.

targetspath <- system.file("extdata", "targets.txt", package = "systemPipeR")
targets <- read.delim(targetspath, comment.char = "#")[, 1:4]
targets
##                       FileName SampleName Factor SampleLong
## 1  ./data/SRR446027_1.fastq.gz        M1A     M1  Mock.1h.A
## 2  ./data/SRR446028_1.fastq.gz        M1B     M1  Mock.1h.B
## 3  ./data/SRR446029_1.fastq.gz        A1A     A1   Avr.1h.A
## 4  ./data/SRR446030_1.fastq.gz        A1B     A1   Avr.1h.B
## 5  ./data/SRR446031_1.fastq.gz        V1A     V1   Vir.1h.A
## 6  ./data/SRR446032_1.fastq.gz        V1B     V1   Vir.1h.B
## 7  ./data/SRR446033_1.fastq.gz        M6A     M6  Mock.6h.A
## 8  ./data/SRR446034_1.fastq.gz        M6B     M6  Mock.6h.B
## 9  ./data/SRR446035_1.fastq.gz        A6A     A6   Avr.6h.A
## 10 ./data/SRR446036_1.fastq.gz        A6B     A6   Avr.6h.B
## 11 ./data/SRR446037_1.fastq.gz        V6A     V6   Vir.6h.A
## 12 ./data/SRR446038_1.fastq.gz        V6B     V6   Vir.6h.B
## 13 ./data/SRR446039_1.fastq.gz       M12A    M12 Mock.12h.A
## 14 ./data/SRR446040_1.fastq.gz       M12B    M12 Mock.12h.B
## 15 ./data/SRR446041_1.fastq.gz       A12A    A12  Avr.12h.A
## 16 ./data/SRR446042_1.fastq.gz       A12B    A12  Avr.12h.B
## 17 ./data/SRR446043_1.fastq.gz       V12A    V12  Vir.12h.A
## 18 ./data/SRR446044_1.fastq.gz       V12B    V12  Vir.12h.B

3 Read preprocessing

3.1 Read quality filtering and trimming

The function preprocessReads allows to apply predefined or custom read preprocessing functions to all FASTQ files referenced in a SYSargs2 container, such as quality filtering or adapter trimming routines. The paths to the resulting output FASTQ files are stored in the output slot of the SYSargs2 object. The following example performs adapter trimming with the trimLRPatterns function from the Biostrings package. After the trimming step a new targets file is generated (here targets_trim.txt) containing the paths to the trimmed FASTQ files. The new targets file can be used for the next workflow step with an updated SYSargs2 instance, e.g. running the NGS alignments using the trimmed FASTQ files.

Construct SYSargs2 object from cwl and yml param and targets files.

dir_path <- system.file("extdata/cwl/preprocessReads/trim-se", 
    package = "systemPipeR")
trim <- loadWorkflow(targets = targetspath, wf_file = "trim-se.cwl", 
    input_file = "trim-se.yml", dir_path = dir_path)
trim <- renderWF(trim, inputvars = c(FileName = "_FASTQ_PATH1_", 
    SampleName = "_SampleName_"))
trim
output(trim)[1:2]
preprocessReads(args = trim, Fct = "trimLRPatterns(Rpattern='GCCCGGGTAA', 
                subject=fq)", 
    batchsize = 1e+05, overwrite = TRUE, compress = TRUE)
writeTargetsout(x = trim, file = "targets_trimPE.txt", step = 1, 
    new_col = "FileName1", new_col_output_index = 1, overwrite = TRUE)

3.2 FASTQ quality report

The following seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The results are written to a PDF file named fastqReport.pdf.

fqlist <- seeFastq(fastq = infile1(trim), batchsize = 10000, 
    klength = 8)
pdf("./results/fastqReport.pdf", height = 18, width = 4 * length(fqlist))
seeFastqPlot(fqlist)
dev.off()