ribosomeProfilingQC 1.8.0
#Introduction
Ribosome footprinting, developed by Jonathan Weissman and Nicholas Ingolia1, measures translation by direct quantification of the coding sequence currently bound by the 80S ribosome (ribosome-protected fragments, RPFs)2. In eukaryotes, the size of RPFs is around 28-nt, where the P-site of the ribosome is typically in position 13 from the 5’ end of reads. In bacteria, Allen et. al. were able to more accurately identify the P-site from 3’ end of reads3.
There are several packages available in Bioconductor already, including, riboSeqR4, RiboProfiling5 and ORFik6. These packages are powerful in analyzing the ribosome footprinting data. The ORFik package can also be used to find new transcription start sites using CageSeq data. RiboWaltz7 is another popular package which is based on R and Bioconductor.
To help researchers quickly assess the quality of their ribosome profiling data, we have developed the ribosomeProfilingQC package. The ribosomeProfilingQC package can be used to easily make diagnostic plots to check the mapping quality and frameshifts. In addition, it can preprocess ribosome profiling data for subsequent differential analysis. We have tried to make this package as user-friendly as possible and the only input needed is a bam file of your ribosome footprinting and RNAseq data mapped to the genome.
Please note that all following analyses are based on known Open Reading Frame (ORF) annotation. The sample data provided in the package is mapped to Zebrafish UCSC danRer10 assembly; all code related to this assembly will be highlighted in light yellow background for clarity.
Here is an example using ribosomeProfilingQC with a subset of ribo-seq data.
First install ribosomeProfilingQC and other packages required to run
the examples.
Please note that the example dataset used here is from zebrafish.
To run analysis with dataset from a different species or different assembly,
please install the corresponding Bsgenome and TxDb.
For example, to analyze mouse data aligned to mm10,
please install BSgenome.Mmusculus.UCSC.mm10,
and TxDb.Mmusculus.UCSC.mm10.knownGene.
You can also generate a TxDb object by
functions makeTxDbFromGFF
from a local gff file,
or makeTxDbFromUCSC
, makeTxDbFromBiomart
, and makeTxDbFromEnsembl
,
from online resources in GenomicFeatures package.
library(BiocManager)
BiocManager::install(c("ribosomeProfilingQC",
"AnnotationDbi", "Rsamtools",
"BSgenome.Drerio.UCSC.danRer10",
"TxDb.Drerio.UCSC.danRer10.refGene",
"motifStack"))
If you have trouble in install ribosomeProfilingQC, please check your R version first. The ribosomeProfilingQC package require R >= 4.0.
R.version
## _
## platform x86_64-pc-linux-gnu
## arch x86_64
## os linux-gnu
## system x86_64, linux-gnu
## status RC
## major 4
## minor 2.0
## year 2022
## month 04
## day 19
## svn rev 82224
## language R
## version.string R version 4.2.0 RC (2022-04-19 r82224)
## nickname Vigorous Calisthenics
## load library
library(ribosomeProfilingQC)
library(AnnotationDbi)
library(Rsamtools)
In this manual, we will use the fish genome.
library(BSgenome.Drerio.UCSC.danRer10)
## set genome, Drerio is a shortname for BSgenome.Drerio.UCSC.danRer10
genome <- Drerio
If your assembly is Human hg38 please load the human library,
library(BSgenome.Hsapiens.UCSC.hg38)
genome <- Hsapiens
If your assembly is Mouse mm10 please load the mouse library,
library(BSgenome.Mmusculus.UCSC.mm10)
genome <- Mmusculus
The function prepareCDS
is used to prepare the information for downstream
analysis from a TxDb
object.
## which is corresponding to BSgenome.Drerio.UCSC.danRer10
library(TxDb.Drerio.UCSC.danRer10.refGene)
txdb <- TxDb.Drerio.UCSC.danRer10.refGene ## give it a short name
CDS <- prepareCDS(txdb)
If your assembly is Human hg38 please try to load the library,
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene ## give it a short name
CDS <- prepareCDS(txdb)
If your assembly is Mouse mm10 please try to load the library,
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene ## give it a short name
CDS <- prepareCDS(txdb)
You can also create a TxDb object from a gtf file by GenomicFeatures::makeTxDbFromGFF function. To get GTF file, you can download it from ensembl or get the online file info via AnnotationHub. Here we use a prepared TxDb object for testing.
## Create a small TxDb object which only contain chr1 information.
library(GenomicFeatures)
txdb <- makeTxDbFromGFF(system.file("extdata",
"Danio_rerio.GRCz10.91.chr1.gtf.gz",
package="ribosomeProfilingQC"),
organism = "Danio rerio",
chrominfo = seqinfo(Drerio)["chr1"],
taxonomyId = 7955)
CDS <- prepareCDS(txdb)
The input of ribosomeProfilingQC is bam file. To prepare bam file, different from riboSeqR package which ask reads mapped to transcriptome, ribosomeProfilingQC use the bam file mapped to whole genome. To get correctly mapped reads, first try to map adaptor trimmed sequences to genome assembly by bowtie2 with following parameters: –local –ma 5 –mp 8,4 –rdg 7,7 –rfg 7,7 –fr –nofw and then fileter the reads mapped to rRNA, tRNA, snRNA, snoRNA and misc_RNA from Ensembl and Repeatmasker annotations. After that, map the clean reads to genome assembly by tophat2 with following parameters: –library-type fr-firststrand –transcriptome-index=Transcriptome_data/genome. Because the library type of ribo-seq is usally strand-specific, please make sure to map the reads with correct library type.
library(Rsamtools)
## input the bamFile from the ribosomeProfilingQC package
bamfilename <- system.file("extdata", "RPF.WT.1.bam",
package="ribosomeProfilingQC")
## For your own data, please set bamfilename as your file path.
## For example, your bam file is located at C:\mydata\a.bam
## you want to set bamfilename = "C:\\mydata\\a.bam"
## or you can change your working directory by
## setwd("C:\\mydata")
## and then set bamfilename = "a.bam"
yieldSize <- 10000000
bamfile <- BamFile(bamfilename, yieldSize = yieldSize)
As it shown in the above figure, P site of the ribosome is in position 13
(if using RNase I).
However, in different experiments, the P site may be shifted due to various
experimental conditions such as the choice of enzyme and the cell type.
The estimatePsite
function can be used to check the P site.
The estimatePsite
function will search start/stop codons that occur in the
reads.
The estimatePsite
will only use 12, 13 or 14 as best P site candidates
when searching from the 5’ end.
estimatePsite(bamfile, CDS, genome)
## [1] 13
It has been shown that for certain enzymes, such as MNase,
estimating the P site from the 3’ end works much better3.
The estimatePsite
will use 15, 16 or 17 as best P site candidates
when searching from the 3’ end.
estimatePsite(bamfile, CDS, genome, anchor = "3end")
## [1] -16
The readsEndPlot
function will plot the 5’ end or 3’ end reads shifted
from the start/stop position of CDS.
There is no difference when assign the reading frame for most of the reads
if you set best P site to 13 or 10 or 16 (from 5’ end).
The readsEndPlot
can help users to determine the real best Psite.
In the example below, the start codon is enriched in position -9 from
the 5’ end of reads and in position 19 from the 3’ end of reads.
This means there are a lot of ribosome that are docking at the translation
start position and most of the reads length are 28 nt.
readsEndPlot(bamfile, CDS, toStartCodon=TRUE)
readsEndPlot(bamfile, CDS, toStartCodon=TRUE, fiveEnd=FALSE)
If you see following distribution, that means lots of gene are in active expression.
If you see a warning or error message complaining about the disagreement of chromosome sequences, please verify you are using the TxDb object with correct genome assembly. If this warning message is for the patch chromosomes you are not interested, you can ignore the warning messages.
The getPsiteCoordinates
function is used to read all P site coordinates.
Ideally, the bestpsite should be 13.
To test the data quality, we set bestpsite = 13.
pc <- getPsiteCoordinates(bamfile, bestpsite = 13)
Ribosome-protected fragments should ideally be 27 to 29-nt long. To check the fragment size distribution, use the following function:
readsLen <- summaryReadsLength(pc)
To filter reads by their length for downstream analysis, use the following script:
## for this QC demo, we will only use reads length of 28-29 nt.
pc.sub <- pc[pc$qwidth %in% c(28, 29)]
Most of the reads should be mapped to sense strand because the ribo-seq library is strand-specific.
strandPlot(pc.sub, CDS)
For ribosome footprinting, most of the reads should map to the CDS region.
The readsDistribution
function will show the P site locations
in different genomic elements: CDS, 5’UTR, 3’UTR, other type exon,
intron, promoter, downstream or intergenic region.
A high downstream percentage indicates that there is a high percentage
of alternative polyAdenylation sites usage from annotation data.
A high percentage in intronic regions indicates the possibility of
intron-retaining transcripts.
pc.sub <- readsDistribution(pc.sub, txdb, las=2)
A metagene plot can indicate the reads distribution in 5’UTR, CDS and 3’UTR region.
cvgs.utr5 <- coverageDepth(RPFs = bamfilename, gtf = txdb, region="utr5")
cvgs.CDS <- coverageDepth(RPFs = bamfilename, gtf = txdb, region="cds")
cvgs.utr3 <- coverageDepth(RPFs = bamfilename, gtf = txdb, region="utr3")
metaPlot(cvgs.utr5, cvgs.CDS, cvgs.utr3, sample=1)
Function assignReadingFrame
is used to set the reading frame for the P sites
located within the known annotated CDS.
The plotDistance2Codon
function can be used to plot the reading frame usage
in transcription initiation or termination sites.
Function plotFrameDensity
can be used to collapse all the RPFs in each frame.
These plots can help you to double check if the p-site position is correct
or not.
If it is correct, most of the reads should be assigned to frame0.
pc.sub <- assignReadingFrame(pc.sub, CDS)
plotDistance2Codon(pc.sub)
plotFrameDensity(pc.sub)
To determine how many of raw reads are mapping with P sites in frame 0.
pc <- assignReadingFrame(pc, CDS)
plotFrameDensity(pc)
Function plotTranscript
can be used to view the reading frame distribution
for given transcripts.
plotTranscript(pc.sub, c("ENSDART00000161781", "ENSDART00000166968",
"ENSDART00000040204", "ENSDART00000124837"))