systemPipeR 2.8.0
Ribo-Seq and polyRibo-Seq are a specific form of RNA-Seq gene expression experiments utilizing mRNA subpopulations directly bound to ribosomes. Compared to standard RNA-Seq, their readout of gene expression provides a better approximation of downstream protein abundance profiles due to their close association with translational processes. The most important difference among the two is that polyRibo-Seq utilizes polyribosomal RNA for sequencing, whereas Ribo-Seq is a footprinting approach restricted to sequencing RNA fragments protected by ribosomes (Ingolia et al. 2009; Aspden et al. 2014; Juntawong et al. 2015).
The workflow presented in this vignette contains most of the data analysis steps described by (Juntawong et al. 2014) including functionalities useful for processing both polyRibo-Seq and Ribo-Seq experiments. To improve re-usability and adapt to recent changes of software versions (e.g. R, Bioconductor and short read aligners), the code has been optimized accordingly. Thus, the results obtained with the updated workflow are expected to be similar but not necessarily identical with the published results described in the original paper.
Relevant analysis steps of this workflow include read preprocessing, read
alignments against a reference genome, counting of reads overlapping with a
wide range of genomic features (e.g. CDSs, UTRs, uORFs, rRNAs, etc.),
differential gene expression and differential ribosome binding analyses, as
well as a variety of genome-wide summary plots for visualizing RNA expression
trends. Functions are provided for evaluating the quality of Ribo-seq data,
for identifying novel expressed regions in the genomes, and for gaining
insights into gene regulation at the post-transcriptional and translational
levels. For example, the functions genFeatures
and
featuretypeCounts
can be used to quantify the expression output for
all feature types included in a genome annotation (e.g.
genes,
introns, exons, miRNAs, intergenic regions, etc.). To determine the approximate
read length of ribosome footprints in Ribo-Seq experiments, these feature type
counts can be obtained and plotted for specific read lengths separately.
Typically, the most abundant read length obtained for translated features
corresponds to the approximate footprint length occupied by the ribosomes of a
given organism group. Based on the results from several Ribo-Seq studies, these
ribosome footprints are typically ~30 nucleotides long
(Ingolia, Lareau, and Weissman 2011; Ingolia et al. 2009; Juntawong et al. 2014). However, their
length can vary by several nucleotides depending upon the optimization of the
RNA digestion step and various factors associated with translational
regulation. For quality control purposes of Ribo-Seq experiments it is also
useful to monitor the abundance of reads mapping to rRNA genes due to the high
rRNA content of ribosomes. This information can be generated with the
featuretypeCounts
function described above.
Coverage trends along transcripts summarized for any number of transcripts can
be obtained and plotted with the functions featureCoverage
and
plotfeatureCoverage
, respectively. Their results allow monitoring
of the phasing of ribosome movements along triplets of coding sequences.
Commonly, high quality data will display here for the first nucleotide of each
codon the highest depth of coverage computed for the 5’ ends of the aligned
reads.
Ribo-seq data can also be used to evaluate various aspects of translational
control due to ribosome occupancy in upstream open reading frames (uORFs). The
latter are frequently present in (or near) 5’ UTRs of transcripts. For this,
the function predORFs
can be used to identify ORFs in the
nucleotide sequences of transcripts or their subcomponents such as UTR regions.
After scaling the resulting ORF coordinates back to the corresponding genome
locations using scaleRanges
, one can use these novel features
(e.g. uORFs) for expression analysis routines similar to those
employed for pre-existing annotations, such as the exonic regions of genes. For
instance, in Ribo-Seq experiments one can use this approach to systematically identify all transcripts occupied by ribosomes in their uORF regions. The binding of
ribosomes to uORF regions may indicate a regulatory role in the translation of
the downstream main ORFs and/or translation of the uORFs into functionally
relevant peptides.
Typically, users want to specify here all information relevant for the analysis of their NGS study. This includes detailed descriptions of FASTQ files, experimental design, reference genome, gene annotations, etc.
The systemPipeR
package needs to be loaded to perform the analysis
steps shown in this report (H Backman and Girke 2016). The package allows users
to run the entire analysis workflow interactively or with a single command
while also generating the corresponding analysis report. For details
see systemPipeR's
main vignette.
library(systemPipeR)
systemPipeRdata package is a helper package to generate a fully populated systemPipeR workflow environment in the current working directory with a single command. All the instruction for generating the workflow template are provide in the systemPipeRdata vignette here.
systemPipeRdata::genWorkenvir(workflow = "riboseq", mydirname = "riboseq")
setwd("riboseq")
After building and loading the workflow environment generated by genWorkenvir
from systemPipeRdata all data inputs are stored in
a data/
directory and all analysis results will be written to a separate
results/
directory, while the systemPipeRIBOseq.Rmd
script and the targets
file are expected to be located in the parent directory. The R session is expected to run from this parent
directory. Additional parameter files are stored under param/
.
To work with real data, users want to organize their own data similarly
and substitute all test data for their own data. To rerun an established
workflow on new data, the initial targets
file along with the corresponding
FASTQ files are usually the only inputs the user needs to provide. Please check
the initial targets
file details below.
For more details, please consult the documentation
here. More information about the targets
files from systemPipeR can be found here.
targets
fileThe targets
file defines all FASTQ files and sample comparisons of the analysis workflow.
targetspath <- system.file("extdata", "targetsPE.txt", package = "systemPipeR")
targets <- read.delim(targetspath, comment.char = "#")[, 1:4]
targets
## FileName1 FileName2
## 1 ./data/SRR446027_1.fastq.gz ./data/SRR446027_2.fastq.gz
## 2 ./data/SRR446028_1.fastq.gz ./data/SRR446028_2.fastq.gz
## 3 ./data/SRR446029_1.fastq.gz ./data/SRR446029_2.fastq.gz
## 4 ./data/SRR446030_1.fastq.gz ./data/SRR446030_2.fastq.gz
## 5 ./data/SRR446031_1.fastq.gz ./data/SRR446031_2.fastq.gz
## 6 ./data/SRR446032_1.fastq.gz ./data/SRR446032_2.fastq.gz
## 7 ./data/SRR446033_1.fastq.gz ./data/SRR446033_2.fastq.gz
## 8 ./data/SRR446034_1.fastq.gz ./data/SRR446034_2.fastq.gz
## 9 ./data/SRR446035_1.fastq.gz ./data/SRR446035_2.fastq.gz
## 10 ./data/SRR446036_1.fastq.gz ./data/SRR446036_2.fastq.gz
## 11 ./data/SRR446037_1.fastq.gz ./data/SRR446037_2.fastq.gz
## 12 ./data/SRR446038_1.fastq.gz ./data/SRR446038_2.fastq.gz
## 13 ./data/SRR446039_1.fastq.gz ./data/SRR446039_2.fastq.gz
## 14 ./data/SRR446040_1.fastq.gz ./data/SRR446040_2.fastq.gz
## 15 ./data/SRR446041_1.fastq.gz ./data/SRR446041_2.fastq.gz
## 16 ./data/SRR446042_1.fastq.gz ./data/SRR446042_2.fastq.gz
## 17 ./data/SRR446043_1.fastq.gz ./data/SRR446043_2.fastq.gz
## 18 ./data/SRR446044_1.fastq.gz ./data/SRR446044_2.fastq.gz
## SampleName Factor
## 1 M1A M1
## 2 M1B M1
## 3 A1A A1
## 4 A1B A1
## 5 V1A V1
## 6 V1B V1
## 7 M6A M6
## 8 M6B M6
## 9 A6A A6
## 10 A6B A6
## 11 V6A V6
## 12 V6B V6
## 13 M12A M12
## 14 M12B M12
## 15 A12A A12
## 16 A12B A12
## 17 V12A V12
## 18 V12B V12
systemPipeR
workflows can be designed and built from start to finish with a
single command, importing from an R Markdown file or stepwise in interactive
mode from the R console.
This tutorial will demonstrate how to build the workflow in an interactive mode,
appending each step. The workflow is constructed by connecting each step via
appendStep
method. Each SYSargsList
instance contains instructions needed
for processing a set of input files with a specific command-line or R software
and the paths to the corresponding outfiles generated by a particular tool/step.
To create a Workflow within systemPipeR
, we can start by defining an empty
container and checking the directory structure:
library(systemPipeR)
sal <- SPRproject()
sal
The systemPipeR
package needs to be loaded (H Backman and Girke 2016).
cat(crayon::blue$bold("To use this workflow, following R packages are expected:\n"))
cat(c("'rtracklayer", "GenomicFeatures", "grid", "BiocParallel",
"DESeq2", "ape", "edgeR", "biomaRt", "BBmisc", "pheatmap",
"ggplot2'\n"), sep = "', '")
### pre-end
appendStep(sal) <- LineWise(code = {
library(systemPipeR)
library(rtracklayer)
library(GenomicFeatures)
library(ggplot2)
library(grid)
library(DESeq2, quietly = TRUE)
library(ape, warn.conflicts = FALSE)
library(edgeR)
library(biomaRt)
library(BBmisc) # Defines suppressAll()
library(pheatmap)
library(BiocParallel)
}, step_name = "load_SPR")
preprocessReads
functionThe function preprocessReads
allows to apply predefined or custom
read preprocessing functions to all FASTQ files referenced in a
SYSargsList
container, such as quality filtering or adapter trimming
routines. Internally, preprocessReads
uses the FastqStreamer
function from
the ShortRead
package to stream through large FASTQ files in a
memory-efficient manner.
Here, we are appending a new step to the SYSargsList
object created previously.
All the parameters are defined on the preprocessReads/preprocessReads-pe_riboseq.yml
file.
appendStep(sal) <- SYSargsList(step_name = "preprocessing", targets = "targetsPE.txt",
dir = TRUE, wf_file = "preprocessReads/preprocessReads-pe.cwl",
input_file = "preprocessReads/preprocessReads-pe_riboseq.yml",
dir_path = system.file("extdata/cwl", package = "systemPipeR"),
inputvars = c(FileName1 = "_FASTQ_PATH1_", FileName2 = "_FASTQ_PATH2_",
SampleName = "_SampleName_"), dependency = c("load_SPR"))
The function trimbatch
used trims adapters hierarchically from the longest to
the shortest match of the right end of the reads. If internalmatch=TRUE
then
internal matches will trigger the same behavior. The argument minpatternlength
defines the shortest adapter match to consider in this iterative process. In
addition, the function removes reads containing Ns or homopolymer regions.
More detailed information on read preprocessing is provided in systemPipeR's
main vignette.
yamlinput(sal, step = "preprocessing")$Fct
# [1] ''trimbatch(fq, pattern=\'ACACGTCT\',
# internalmatch=FALSE, minpatternlength=6, Nnumber=1,
# polyhomo=50, minreadlength=16, maxreadlength=101)''
cmdlist(sal, step = "preprocessing", targets = 1)
The following seeFastq
and seeFastqPlot
functions generate and plot a series of useful
quality statistics for a set of FASTQ files, including per cycle quality box
plots, base proportions, base-level quality trends, relative k-mer
diversity, length, and occurrence distribution of reads, number of reads
above quality cutoffs and mean quality distribution. The results are
written to a png file named fastqReport.png
.
This is the pre-trimming fastq report. Another post-trimming fastq report step is not included in the default. It is recommended to run this step first to decide whether the trimming is needed.
Please note that initial targets files are being used here. In this case,
we used the getColumn
function to extract a named vector.
appendStep(sal) <- LineWise(code = {
fq_files <- getColumn(sal, "preprocessing", "targetsWF",
column = 1)
fqlist <- seeFastq(fastq = fq_files, batchsize = 10000, klength = 8)
png("./results/fastqReport.png", height = 162, width = 288 *
length(fqlist))
seeFastqPlot(fqlist)
dev.off()
}, step_name = "fastq_report", dependency = "preprocessing")