systemPipeR 2.8.0
Users want to provide here background information about the design of their VAR-Seq project.
This report describes the analysis of a VAR-Seq project studying the genetic differences among several strains … from organism ….
Typically, users want to specify here all information relevant for the analysis of their VAR-Seq study. This includes detailed descriptions of FASTQ files, experimental design, reference genome, gene annotations, etc.
systemPipeRdata package is a helper package to generate a fully populated systemPipeR workflow environment in the current working directory with a single command. All the instruction for generating the pre-configured workflow templates are provide in the systemPipeRdata vignette.
systemPipeRdata::genWorkenvir(workflow = "varseq", mydirname = "varseq")
setwd("varseq")
This step can be skipped if you already have the environment to run the analysis. If not, you can run it, and it will create the directory structure and populate all the necessary param and demo data files.
After building and loading the workflow environment generated by genWorkenvir
from systemPipeRdata
all data inputs are stored in
a data/
directory and all analysis results will be written to a separate
results/
directory, while the systemPipeVARseq.Rmd
script and the targets
file are expected to be located in the parent directory. The R session is expected
to run from this parent directory. Additional parameter files are stored under param/
.
To work with real data, users want to organize their own data similarly
and substitute all test data for their own data. To rerun an established
workflow on new data, the initial targets
file along with the corresponding
FASTQ files are usually the only inputs the user needs to provide.
For more details, please consult the documentation
here.
More information about the targets
files from systemPipeR can be found here.
This template provides some common steps for a VARseq
workflow. One can add, remove,
modify workflow steps by operating on the SYSargsList
workflow object.
For more details of all the features and utilities, please consult the main vignette.
To initiate a VARseq workflow, this entire Rmarkdown file will be imported as a
SYSargsList
workflow object, by using the importWF("systemPipeVARseq.Rmd")
command.
In this template, code chunks with the option spr = TRUE'
will be added to the workflow.
Other R code chunks without this option will be ignored. The option eval = FALSE
can be ignored when imported and build the workflow object. Please be aware of this
possibility.
The template can provide more than one alternative for each step, such as
different mapping methods, that will receive the mandatory
or optional
flag.
One can run just the mandatory
steps, ALL
, or optional
steps when running
the workflow.
Also, each one of the steps can be run on compute clusters (compute
option)
or on the current session, here called management
session.
For the demonstration of this template, a management
session will be chosen.
The other alternative is to initialize the workflow and append each of the steps in the workflow object.
sal <- SPRproject()
systemPipeR
workflows can be designed and built from start to finish with a
single command, importing from an R Markdown file or stepwise in interactive
mode from the R console.
This tutorial will demonstrate how to build the workflow in an interactive mode,
appending each step. The workflow is constructed by connecting each step via
appendStep
method. Each SYSargsList
instance contains instructions needed
for processing a set of input files with a specific command-line or R software
and the paths to the corresponding outfiles generated by a particular tool/step.
The systemPipeR
package needs to be loaded (H Backman and Girke 2016).
# Some samples in the test dataset do not work well in
# VARseq, and VARseq workflow takes long time to process
# each sample. To better test and speed up the test
# workflow, sample set is reduced to the first 13 samples.
# Please REMOVE the next two lines in your real analysis
cat(crayon::red$bold("Some samples in targets are removed for test workflow. Please change the template to disable this in your real analysis.\n"))
writeLines(readLines("targetsPE.txt")[1:13], "targetsPE.txt")
cat(crayon::blue$bold("To use this workflow, following R packages are expected:\n"))
cat(c("'GenomicFeatures", "VariantAnnotation", "GenomicFeatures",
"ggbio", "ggplot2'\n"), sep = "', '")
### pre-end
appendStep(sal) <- LineWise(code = {
library(systemPipeR)
}, step_name = "load_SPR")
The following seeFastq
and seeFastqPlot
functions generate and plot a series of useful
quality statistics for a set of FASTQ files including per cycle quality box
plots, base proportions, base-level quality trends, relative k-mer
diversity, length, and occurrence distribution of reads, number of reads
above quality cutoffs and mean quality distribution. The results are
written to a png file named fastqReport.png
.
This is the pre-trimming fastq report. Another post-trimming fastq report step is not included in the default. It is recommended to run this step first to decide whether the trimming is needed.
Please note that initial targets files are being used here. In this case,
it has been added to the first step, and later, we used the function getColumn
to extract a named vector.
appendStep(sal) <- LineWise(code = {
targets <- read.delim("targetsPE.txt", comment.char = "#")
updateColumn(sal, step = "load_SPR", position = "targetsWF") <- targets
fq_files <- getColumn(sal, "load_SPR", "targetsWF", column = 1)
fqlist <- seeFastq(fastq = fq_files, batchsize = 10000, klength = 8)
png("./results/fastqReport.png", height = 162, width = 288 *
length(fqlist))
seeFastqPlot(fqlist)
dev.off()
}, step_name = "fastq_report_pre", dependency = "load_SPR")