Contents

1 Introduction to dupRadar

RNA-Seq experiments are a common strategy nowadays to quantify the gene expression levels in cells. Due to its higher sensitivity compared to arrays and the ability to discover novel features, makes them the choice for most modern experiments.

In RNA-Seq - as in all other NGS applications - quality control is essential. Current NGS workflows usually involve PCR steps at some point, which involves the danger of PCR artefacts and over-amplification of reads. For common DNA-based assays PCR duplicates are often simply removed before further analysis and their overall fraction or the read multiplicity taken as quality metrics. In RNA-Seq however, apart from the technical PCR duplicates, there is a strong source for biological read duplicates: for a given gene length and sequencing depth there exists an expression level beyond which it is not possible to add place more reads inside a gene locus without placing a second read exactly on the position of an already existing read. For this reason the overall duplication rate is not a useful measure for RNA-Seq.

As in the NGS/RNA-Seq QC ecosystem there were no suitable tools to address this question, we set out to develop a new tool. The dupRadar package gives an insight into the duplication problem by graphically relating the gene expression level and the duplication rate present on it. Thus, failed experiments can be easily identified at a glance.

Note: by now RNA-Seq protocols have matured so that for bulk RNA protocols data rarely suffer from technical duplicates. With newer low-input or single cell RNA-Seq protocols technical duplicates possible problems are worth to be checked for by default, especially if protocols are pushed to or beyond their boundaries. Paired-end libraries make the distinction between duplicates due to highly expressed genes and PCR duplicates easier, but the problem itself is not completely solved, especially at higher sequencing depths.

2 Getting started using dupRadar

2.1 Preparing your data

Previous to running the duplication rate analysis, the BAM file with your mapped reads has to be duplicate marked with a like Picard, or the faster BamUtil dedup BioBamBam. The dupRadar package only works with duplicate marked BAM files.

If you do not have/want a duplication marking step in your default pipeline, the dupRadar package includes, for your convenience, wrappers to properly call some of these tools from your R session. Note that you still have to supply the path of the dupmarker installation though:

library(dupRadar)

Now, simply call the wrapper:

# call the duplicate marker and analyze the reads
bamDuprm <- markDuplicates(dupremover="bamutil",
               bam="test.bam",
                   path="/opt/bamUtil-master/bin",
               rminput=FALSE)

Simply specify which tool to use, the path where this tool is installed, and the input bam file to be analyzed. After marking duplicates, it’s safe to remove to original BAM file in order to save space.

The dupRadar package currently comes with support for:

After the BAM file is marked for duplicates, dupRadar is ready to analyze how the duplication rate is related with the estimated gene expression levels.

2.2 A GTF file

Unless there is any specific reason, dupRadar can use the same GTF file that will be later used to count the reads falling on features.

A valid GTF file can be obtained from UCSC, Ensembl, the iGenomes or other projects.

Note that the resulting duplication rate plots depend on the GTF annotation used. GTF files from the gencode projects result in a less clear picture of duplication rates, as there are many more features and feature types annotated, which overlap heavily as well. In some cases creating the plots only using subsets of gencode annotation files (e.g. just protein coding genes) serve the QC purpose of this tool better.

2.3 AnnotationHub as a source of GTF files

The Bioconductor AnnotationHub package provides an alternative approach to obtain annotation in GTF format from entities such as Ensembl, UCSC, ENCODE, and 1000 Genomes projects.

This is partly outlined in the AnnotationHub ‘HOWTO’ vignette section “Ensembl GTF and FASTA files for TxDb gene models and sequence queries”; for the Takifugu example, the downloaded GTF file is available from the cache.

if(suppressWarnings(require(AnnotationHub))) {
    ah = AnnotationHub()
    query(ah, c("ensembl", "80", "Takifugu", "gtf"))    # discovery
    cache(ah["AH47101"])                                # retrieve / file path
}
##                                AH47101 
## "/home/biocbuild/.AnnotationHub/52579"

3 dupRate demo data

In the package we include two precomputed duplication matrices for two RNASeq experiments used as examples of a good and a failed (in terms of high redundancy of reads) experiments. The experiments come from the ENCODE project, as a source of a wide variety of protocols, library types and sequencing facilities.

Load the example dataset with:

attach(dupRadar_examples)

4 The duplication rate analysis

The analysis requires some info about the sequencing run and library preparation protocol:

Due to its phenomenal performance, internally we use the featureCounts() function from the Bioconductor Rsubread package, which also supports multiple threads.

# The call parameters: 
bamDuprm <- "test_duprm.bam"    # the duplicate marked bam file
gtf <- "genes.gtf"  # the gene model
stranded <- 2       # '0' (unstranded), '1' (stranded) and '2' (reversely stranded)
paired   <- FALSE   # is paired end data
threads  <- 4       # number of threads to be used

# Duplication rate analysis
dm <- analyzeDuprates(bamDuprm,gtf,stranded,paired,threads)

The duplication matrix contains read counts in different scenarios and RPK and RPKM values for every gene.

ID geneLength allCountsMulti filteredCountsMulti dupRateMulti dupsPerIdMulti RPKMulti RPKMMulti allCounts filteredCounts dupRate dupsPerId RPK RPKM
Xkr4 3634 2 2 0.0000000 0 0.5503577 0.0117159 2 2 0.0000000 0 0.5503577 0.0117159
Rp1 9747 0 0 NaN 0 0.0000000 0.0000000 0 0 NaN 0 0.0000000 0.0000000
Sox17 3130 0 0 NaN 0 0.0000000 0.0000000 0 0 NaN 0 0.0000000 0.0000000
Mrpl15 4203 429 261 0.3916084 168 102.0699500 2.1728436 419 258 0.3842482 161 99.6906971 2.1221946
Lypla1 2433 1106 576 0.4792043 530 454.5828196 9.6770634 1069 562 0.4742750 507 439.3752569 9.3533280
Tcea1 2847 2890 1046 0.6380623 1844 1015.1036178 21.6093121 1822 696 0.6180022 1126 639.9719002 13.6235871

5 Plotting and interpretation

The number of reads per base assigned to a gene in an ideal RNA-Seq data set is expected to be proportional to the abundance of its transcripts in the sample. For lowly expressed genes we expect read duplication to happen rarely by chance, while for highly expressed genes - depending on the total sequencing depth - we expect read duplication to happen often.

A good way to learn if a dataset is following this trend is by relating the normalized number of counts per gene (RPK, as a quantification of the gene expression) and the fraction represented by duplicated reads.

The dupRadar offers several functions to assess this relationship. The most prominent may be the 2d density scatter plot:

## make a duprate plot (blue cloud)
par(mfrow=c(1,2))
duprateExpDensPlot(DupMat=dm)       # a good looking plot
title("good experiment")
duprateExpDensPlot(DupMat=dm.bad)   # a dataset with duplication problems
title("duplication problems")

## duprate boxplot
duprateExpBoxplot(DupMat=dm)        # a good looking plot
duprateExpBoxplot(DupMat=dm.bad)    # a dataset with duplication problems

The duprateExpDensPlot has helper lines at the threshold of 1 read/bp and at 0.5 RPKM. Moreover by default a generalized linear model is fit to the data and overplotted (see also below).

5.1 Fitting a model into the data

To summarize the relationship between duplication rates and gene expression, we propose fitting a logistic regression curve onto the data. With the coefficients of the fitted model, one can get an idea of the initial duplication rate at low read counts (Intercept) and the progression of the duplication rate along with the progression of the read counts (Slope).

duprateExpDensPlot(DupMat=dm)

# or, just to get the fitted model without plot
fit <- duprateExpFit(DupMat=dm)
cat("duprate at low read counts: ",fit$intercept,"\n",
    "progression of the duplication rate: ",fit$slope,fill=TRUE)
## duprate at low read counts:  0.04075061 
##  
## progression of the duplication rate:  3.186793

Our main use case for that function is the condensation of the plots into quality metrics that can be used for automatic flagging of possibly problematic samples in large experiments or aggregation with other quality metrics in large tables to analyse interdependencies.

The duprateExpBoxplot plot shows the range of the duplication rates at 5% bins (default) along the distribution of RPK gene counts. The x-axis displays the quantile of the RPK distribution, and the average RPK of the genes contained in this quantile.

Individual genes can be identified in the plot:

## INTERACTIVE: identify single points on screen (name="ID" column of dm)
duprateExpPlot(DupMat=dm)       # a good looking plot
duprateExpIdentify(DupMat=dm)

One can also call the function duprateExpPlot to get smooth color density representation of the same data:

par(mfrow=c(1,2))
cols <- colorRampPalette(c("black","blue","green","yellow","red"))
duprateExpPlot(DupMat=dm,addLegend=FALSE)
duprateExpPlot(DupMat=dm.bad,addLegend=FALSE,nrpoints=10,nbin=500,colramp=cols)

Any further parm sent to the duprateExpPlot is also sent to the smoothScatter function.

5.2 Comparing the fitted parameters to other datasets

The parameters of the fitted model may mean very little (or just nothing) for many, unless there’s other data to compare with. We provide the pre-computed duplication matrices for all the RNA-Seq experiments publicly available from the ENCODE project, for human and mouse.

alt text

With the experience from the ENCODE datasets, we expect from single read experiments little duplication at low RPKM (low intercept) rapidly rising because of natural duplication (high slope). In contrast, paired-end experiments have a more mild rising of the natural duplication (low slope) due to having higher diversity of reads pairs since pairs with same start may still have different end.

The common denominator for both designs is the importance of having a low intercept, suggesting that duplication rate at lowly expressed genes may serve as a quality measure.

All the pre-computed duplication matrices are available in the dupRadar sourceforge site.

5.3 Other plots

CAVEAT: Sometimes in discussions duplicate reads (i.e. two physically different reads are mapped to the exact same position) and multi-mapping reads (i.e. a single read is mapped to two or more locations in the genome) are mixed up. dupRadar’s main focus are PCR duplicates in RNA-Seq. However internally we keep track of unique mappers and multimappers, and we use both in some of the examples, to illustrate use cases of our package beyond the main aim. Multi-mapping reads are completely independent of PCR duplicates.

Apart from the plots relating RPK and duplication rate, the dupRadar package provides some other useful plots to extract information about the gene expression levels.

An interesting quality metric are the fraction of reads taken up by groups of genes binned by 5% expression levels.

readcountExpBoxplot(DupMat=dm)

In the example we see that the 5% of highest expressed genes in our sample data set take up around 60% of all reads.

The distribution of RPK values per genes can be plotted with:

expressionHist(DupMat=dm)

This would help in identifying skewed distributions with unusual amount of lowly expressed genes, or to detect no consensus between replicates.

6 Other information deduced from the data

The duplication rate matrix calculated by the function analyzeDuprates() contains some useful information about the sequencing experiment, that can be used to assess the quality of the data.

6.1 Fraction of multimappers per gene

Analogous to per gene duplication rate, the fraction of mutimappers can be easily calculated fom the duplication matrix.

Taking the counts from the column allCountsMulti, and substracting form it the counts from the column allCounts, one can get the total number of multihits. Thus, the fraction of multihits per gene can be calculating then dividing by allCountsMulti.

# calculate the fraction of multimappers per gene
dm$mhRate <- (dm$allCountsMulti - dm$allCounts) / dm$allCountsMulti
# how many genes are exclusively covered by multimappers
sum(dm$mhRate == 1, na.rm=TRUE)
## [1] 295
# and how many have an RPKM (including multimappers) > 5
sum(dm$mhRate==1 & dm$RPKMMulti > 5, na.rm=TRUE)
## [1] 8
# and which are they?
dm[dm$mhRate==1 & dm$RPKMMulti > 5, "ID"]
## [1] GPR89C       LOC728643    LOC606724    TBC1D3C      TBC1D3H     
## [6] MIR650       LOC100133050 HIST1H4J    
## 23228 Levels: 1/2-SBSRNA4 A1BG A1BG-AS1 A1CF A2LD1 A2M A2ML1 A2MP1 ... ZZZ3

We can also generate an overall picture about less extreme cases:

hist(dm$mhRate, 
     breaks=50, 
     col="red", 
     main="Frequency of multimapping rates", 
     xlab="Multimapping rate per gene", 
     ylab="Frequency")

Also the direct comparison of reads per kilobase between uniquely and multimappers is possible.

# comparison of multi-mapping RPK and uniquely-mapping RPK
plot(log2(dm$RPK), 
     log2(dm$RPKMulti), 
     xlab="Reads per kb (uniquely mapping reads only)",
     ylab="Reads per kb (all including multimappers, non-weighted)"
)

Use with identify() to annotate interesting cases interactively.

identify(log2(dm$RPK), 
         log2(dm$RPKMulti),
     labels=dm$ID)

6.2 Connection between possible PCR artefacts and GC content

In some cases we wondered about influence of GC content on PCR artefacts. An easy way to check using our dupRadar package in conjunction with biomaRt is demonstrated in the following. For simplicity we use our demo data here in which we do not see a big influence.

library(dupRadar)
library(biomaRt)

## for detailed explanations on biomaRt, please see the respective
## vignette

## set up biomart connection for mouse (needs internet connection)
ensm <- useMart("ensembl")
ensm <- useDataset("mmusculus_gene_ensembl", mart=ensm)

## get a table which has the gene GC content for the IDs that have been
## used to generate the table (depends on the GTF annotation that you
## use)
tr <- getBM(attributes=c("mgi_symbol", "percentage_gc_content"),
            values=TRUE, mart=ensm)

## create a GC vector with IDs as element names
mgi.gc <- tr$percentage_gc_content
names(mgi.gc) <- tr$mgi_symbol

## using dm demo duplication matrix that comes with the package
## add GC content to our demo data and keep only subset for which we can
## retrieve data
keep <- dm$ID %in% tr$mgi_symbol
dm.gc <- dm[keep,]
dm.gc$gc <- mgi.gc[dm.gc$ID]

## check distribution of annotated gene GC content (in %)
boxplot(dm.gc$gc, main="Gene GC content", ylab="% GC")