systemPipeR 2.8.0
Users should provide background information about the design of their ChIP-Seq project here.
This report describes the analysis of several ChIP-Seq experiments studying the DNA binding patterns of the transcription factors … from organism ….
Typically, users should specify here all information relevant to the analysis of their NGS study, including detailed descriptions of the FASTQ files, experimental design, reference genome, and gene annotations.
systemPipeRdata is a helper package that generates a fully populated systemPipeR workflow environment in the current working directory with a single command. All instructions for generating the workflow are provided in the systemPipeRdata vignette.
systemPipeRdata::genWorkenvir(workflow = "chipseq", mydirname = "chipseq")
setwd("chipseq")
After building and loading the workflow environment generated by genWorkenvir from systemPipeRdata, all data inputs are stored in a data/ directory and all analysis results will be written to a separate results/ directory, while the systemPipeChIPseq.Rmd script and the targets file are expected to be located in the parent directory. The R session is expected to run from this parent directory. Additional parameter files are stored under param/.
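As a quick sanity check of the layout described above, the following base R sketch reports which of the expected subdirectories exist under a project root. The helper name check_workenv is hypothetical and is not part of systemPipeR or systemPipeRdata; only the directory names are taken from the description above.

```r
# Hypothetical helper (not part of systemPipeR): report which of the
# expected workflow subdirectories exist under a given project root.
check_workenv <- function(root = ".", subdirs = c("data", "results", "param")) {
  present <- dir.exists(file.path(root, subdirs))
  names(present) <- subdirs
  present
}

# Demonstration on a throwaway directory that contains data/ and param/
# but deliberately lacks results/
demo_root <- file.path(tempdir(), "chipseq_layout_demo")
dir.create(file.path(demo_root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "param"), showWarnings = FALSE)

status <- check_workenv(demo_root)
print(status)
```

Calling check_workenv() without arguments from the workflow's parent directory would apply the same check to a real project.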
The data set used in this report (SRP010938) contains 18 paired-end (PE) read sets from Arabidopsis thaliana (Howard et al. 2013). To minimize processing time during testing, each FASTQ file has been subset to 90,000-100,000 randomly sampled PE reads that map to the first 100,000 nucleotides of each chromosome of the A. thaliana genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files have been truncated accordingly. As a result, the entire test data set requires less than 200 MB of storage space. A PE read set was chosen for this test data set for flexibility, because it can be used to test analysis routines that require either single-end (SE) or PE reads.
To work with real data, users should organize their own data in the same way and substitute it for the test data. To rerun an established workflow on new data, the initial targets file and the corresponding FASTQ files are usually the only inputs the user needs to provide. For more details, including the structure of targets files, please consult the systemPipeR documentation.
targets file
The targets file defines all FASTQ files and sample comparisons of the analysis workflow.
targetspath <- system.file("extdata", "targetsPE_chip.txt", package = "systemPipeR")
targets <- read.delim(targetspath, comment.char = "#")
targets[1:4, -c(5, 6)]
## FileName1 FileName2
## 1 ./data/SRR446027_1.fastq.gz ./data/SRR446027_2.fastq.gz
## 2 ./data/SRR446028_1.fastq.gz ./data/SRR446028_2.fastq.gz
## 3 ./data/SRR446029_1.fastq.gz ./data/SRR446029_2.fastq.gz
## 4 ./data/SRR446030_1.fastq.gz ./data/SRR446030_2.fastq.gz
## SampleName Factor Date SampleReference
## 1 M1A M1 23-Mar-2012
## 2 M1B M1 23-Mar-2012
## 3 A1A A1 23-Mar-2012 M1A
## 4 A1B A1 23-Mar-2012 M1B
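To make the sample comparisons in the preview above easier to see, the following base R sketch lists each sample together with the entry in its SampleReference column. The rows are copied from the preview; reading SampleReference as the sample each row is compared against is an interpretation of that preview, and the name targets_demo is only used for illustration.

```r
# Rows copied from the targets preview above (file name columns omitted)
targets_demo <- data.frame(
  SampleName = c("M1A", "M1B", "A1A", "A1B"),
  Factor = c("M1", "M1", "A1", "A1"),
  SampleReference = c("", "", "M1A", "M1B"),
  stringsAsFactors = FALSE
)

# Keep only rows that name a reference sample (neither empty nor NA)
has_ref <- !is.na(targets_demo$SampleReference) &
  nzchar(targets_demo$SampleReference)
pairs <- targets_demo[has_ref, c("SampleName", "SampleReference")]
print(pairs)
```

In this preview only the A1 samples carry a reference, so two sample/reference pairs remain.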
To work with custom data, users need to generate a targets file containing the paths to their own FASTQ files.
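A custom targets file can be assembled as a tab-delimited table with base R, as in the minimal sketch below. The file paths and sample names are hypothetical placeholders, and only columns visible in the preview above are included; a real workflow may expect further columns, since the preview hides columns 5 and 6.

```r
# Hypothetical sample sheet: replace paths and names with your own
my_targets <- data.frame(
  FileName1 = c("./data/sampleA_1.fastq.gz", "./data/sampleB_1.fastq.gz"),
  FileName2 = c("./data/sampleA_2.fastq.gz", "./data/sampleB_2.fastq.gz"),
  SampleName = c("S1A", "S1B"),
  Factor = c("S1", "S1"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file (to a temporary location for this demo)
targets_file <- file.path(tempdir(), "my_targets.txt")
write.table(my_targets, targets_file, sep = "\t", quote = FALSE,
            row.names = FALSE)

# Read it back the same way the workflow reads its targets file
roundtrip <- read.delim(targets_file, comment.char = "#")
print(roundtrip)
```

Reading the file back with read.delim and comment.char = "#" mirrors the call used earlier, so any lines starting with # can hold free-form notes.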
systemPipeR workflows can be designed and built from start to finish with a single command, importing from an R Markdown file or stepwise in interactive mode from the R console.
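The import route can be sketched as follows, using the systemPipeR functions SPRproject, importWF and runWF. The wrapper build_from_rmd is a hypothetical convenience function, not part of the package, and its guards make it a no-op when systemPipeR is not installed or the R Markdown file is missing. It also assumes a fresh project directory without an existing workflow project.

```r
# Hypothetical wrapper (not part of systemPipeR): build a workflow from an
# R Markdown file and optionally run it. Returns NULL when the package or
# the file is unavailable.
build_from_rmd <- function(rmd = "systemPipeChIPseq.Rmd", run = FALSE) {
  if (!requireNamespace("systemPipeR", quietly = TRUE) || !file.exists(rmd)) {
    message("systemPipeR or ", rmd, " not available; nothing done")
    return(invisible(NULL))
  }
  library(systemPipeR)
  sal <- SPRproject()                      # empty workflow container
  sal <- importWF(sal, file_path = rmd)    # import the steps defined in the Rmd
  if (run) sal <- runWF(sal)               # execute the imported steps
  sal
}

# With a missing file the wrapper does nothing and returns NULL
result <- build_from_rmd("no_such_workflow.Rmd")
```

The remainder of this tutorial follows the interactive route instead, which makes each step visible as it is appended.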
This tutorial demonstrates how to build the workflow in interactive mode, appending one step at a time. The workflow is constructed by connecting the steps with the appendStep method. Each SYSargsList instance contains the instructions needed to process a set of input files with a specific command-line or R software, as well as the paths to the corresponding output files generated by a particular tool or step.
To create a workflow within systemPipeR, we can start by defining an empty container and checking the directory structure:
library(systemPipeR)
sal <- SPRproject()
sal
The systemPipeR package needs to be loaded (H Backman and Girke 2016).
cat(crayon::blue$bold("To use this workflow, following R packages are expected:\n"))
cat(c("'ggbio", "ChIPseeker", "GenomicFeatures", "GenomicRanges",
"Biostrings", "seqLogo", "BCRANK", "readr'\n"), sep = "', '")
targetspath <- system.file("extdata", "targetsPE_chip.txt", package = "systemPipeR")
### pre-end
appendStep(sal) <- LineWise(code = {
library(systemPipeR)
}, step_name = "load_SPR")
The following seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files, including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length, and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The results are written to a png file named fastqReport.png.
This step produces the pre-trimming FASTQ report. A post-trimming FASTQ report step is not included in the default workflow. It is recommended to run this step first to decide whether trimming is needed.
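To illustrate what a per cycle quality statistic measures, the base R sketch below decodes a few made-up quality strings and averages the scores at each read position. It assumes the common Phred+33 encoding and is not how seeFastq is implemented; the function name phred_scores and the example strings are hypothetical.

```r
# Decode one FASTQ quality string into integer Phred scores,
# assuming Phred+33 encoding (ASCII code minus 33)
phred_scores <- function(qual) utf8ToInt(qual) - 33L

# Made-up quality strings for three reads of equal length
quals <- c("IIII#", "II5I#", "I5II5")

# One row per read, one column per sequencing cycle
score_matrix <- do.call(rbind, lapply(quals, phred_scores))

# Mean quality at each cycle across the reads
per_cycle_mean <- colMeans(score_matrix)
print(round(per_cycle_mean, 2))
```

A drop in the mean toward the last cycles, as in this toy example, is the kind of pattern that would suggest trimming read ends.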
Note that the initial targets file is used here. It is attached to the first step (load_SPR) with the updateColumn function, and the getColumn function is then used to extract the first column, which holds FASTQ file paths, as a named vector.
appendStep(sal) <- LineWise(code = {
    # Attach the targets table to the first step, then pull the FASTQ paths
    targets <- read.delim(targetspath, comment.char = "#")
    updateColumn(sal, step = "load_SPR", position = "targetsWF") <- targets
    fq_files <- getColumn(sal, "load_SPR", "targetsWF", column = 1)
    # Compute quality statistics and plot them to a png file
    fqlist <- seeFastq(fastq = fq_files, batchsize = 10000, klength = 8)
    png("./results/fastqReport.png", height = 162, width = 288 *
        length(fqlist))
    seeFastqPlot(fqlist)
    dev.off()
}, step_name = "fastq_report", dependency = "load_SPR")
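The shape of the named vector and the width passed to png() can be illustrated without systemPipeR. The base R sketch below uses the first four rows of the targets preview. It assumes that the vector is keyed by sample name and that fqlist holds one entry per FASTQ file; both are assumptions rather than documented behavior, and the names ending in _demo are hypothetical.

```r
# First four rows of the targets preview shown earlier
targets_demo <- data.frame(
  FileName1 = c("./data/SRR446027_1.fastq.gz", "./data/SRR446028_1.fastq.gz",
                "./data/SRR446029_1.fastq.gz", "./data/SRR446030_1.fastq.gz"),
  SampleName = c("M1A", "M1B", "A1A", "A1B"),
  stringsAsFactors = FALSE
)

# Assumption: file paths named by their sample IDs
fq_files_demo <- setNames(targets_demo$FileName1, targets_demo$SampleName)
print(fq_files_demo)

# The png() call above allots 288 pixels of width per entry in fqlist
plot_width <- 288 * length(fq_files_demo)
print(plot_width)
```

Scaling the width with the number of files keeps each file's column of plots at a constant size as samples are added.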