1 What is genomic DNA contamination in a RNA-seq experiment

RNA sequencing (RNA-seq) libraries may contain genomic DNA (gDNA) contamination due to an absent or inefficient gDNA digestion step (with DNase) during RNA extraction or library preparation. In fact, some protocols do not include a DNase treatment step, or they include it as optional.

While gDNA contamination is not a major issue in libraries built with poly(A) selected RNA molecules, it can remarkably affect gene expression quantification from libraries of total RNA. When present, gDNA contamination can lead to a misleading attribution of expression to unannotated regions of the genome. For this reason, it is important to check the levels of gDNA contamination during quality control before performing further analyses, specially when total RNA has been sequenced.

2 Diagnose gDNA contamination

Here we illustrate the use of the gDNAx package for producing different diagnostics and how do they reveal different gDNA contamination levels. We use a subset of the data in (Li et al. 2022), which consists of 9 paired-end samples of total RNA-seq with increasing levels of gDNA contamination: 0% (no contamination), 1% and 10%, with 3 replicates each. The data is available through the Bioconductor experiment data package gDNAinRNAseqData, which allows one to download 9 BAM files, containing about 100,000 alignments, sampled uniformly at random from the complete BAM files.


# Retrieve BAM files
bamfiles <- LiYu22subsetBAMfiles()
[1] "/tmp/RtmpivJEmG/s32gDNA0.bam"  "/tmp/RtmpivJEmG/s33gDNA0.bam" 
[3] "/tmp/RtmpivJEmG/s34gDNA0.bam"  "/tmp/RtmpivJEmG/s26gDNA1.bam" 
[5] "/tmp/RtmpivJEmG/s27gDNA1.bam"  "/tmp/RtmpivJEmG/s28gDNA1.bam" 
[7] "/tmp/RtmpivJEmG/s23gDNA10.bam" "/tmp/RtmpivJEmG/s24gDNA10.bam"
[9] "/tmp/RtmpivJEmG/s25gDNA10.bam"

# Retrieve information on the gDNA concentrations of each BAM file
pdat <- LiYu22phenoData(bamfiles)
s32gDNA0     0
s33gDNA0     0
s34gDNA0     0
s26gDNA1     1
s27gDNA1     1
s28gDNA1     1
s23gDNA10   10
s24gDNA10   10
s25gDNA10   10

Diagnosing the presence of gDNA contamination requires using an annotation of genes and transcripts. The gDNAx package expects that we provide such an annotation using a so-called TxDb package, either as a TxDb object, created once such a package is loaded into the R session, or by specifying the name of the package. The Bioconductor website provides a number of TxDb packages, but if the we do not find the one we are looking for, we can build a TxDb object using the function makeTxDbFromGFF() on a given GFF or GTF file, or any of the other makeTxDbFrom*() functions, available in the GenomicFeatures package. Here we load the TxDb package corresponding to the GENCODE annotation provided by the UCSC Genome Browser.


txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg38
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: knownGene
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 276905
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2023-09-20 17:25:17 +0000 (Wed, 20 Sep 2023)
# GenomicFeatures version at creation time: 1.53.2
# RSQLite version at creation time: 2.3.1

We can calculate diagnostics for gDNA contamination using the function gDNAdx() as follows.


gdnax <- gDNAdx(bamfiles, txdb)
[1] "gDNAx"
[1] "gDNAx"
gDNAx object
# BAM files (9): s32gDNA0.bam, ..., s25gDNA10.bam
# Library layout: paired-end, 9 (2x50nt)
# Library protocol: unstranded (9 out of 9)
# Sequences: only standard chromosomes
# Annotation pkg: TxDb.Hsapiens.UCSC.hg38.knownGene
# Alignments employed: first 100000

The previous call will show progress through its calculations unless we set the argument verbose=FALSE, and return an object of class gDNAx once it has finished. We have let the gDNAdx() function figure out the library layout and protocol, but if we already knew those parameters from the data, we could set them through the arguments singleEnd and strandMode and speed up calculations. Another way to speed up calculations, which may be advantageous specially when analysing a high number of BAM files, is to use the BPPARAM argument to set a number of parallel threads of execution; see the help page of gDNAdx() for full details on how to specify non-default values to all these parameters.

Calling the plot() function with the resulting object gDNAx object as the first argument will plot several diagnostics. Here below, we also use a parameter called group to automatically color samples, in this case, by the gDNA contamination levels included in the experimental design of the data; see (Li et al. 2022) for full details on it.

par(mar=c(4, 5, 2, 1))
plot(gdnax, group=pdat$gDNA, pch=19)
Diagnostics. Default diagnostics obtained with the function `plot()` on a `gDNAx` object.

Figure 1: Diagnostics
Default diagnostics obtained with the function plot() on a gDNAx object.

The previous figure contains three diagnostic plots, each one showing the following values as a function of the percentage of read alignments fully contained in intergenic regions (IGC):

  • Percentage of Splice-compatible junction (SCJ) alignments. These are alignments compatible with a transcript in the given annotation, for which the aligned read, or at least one of the two aligned reads in the case of a paired-end layout, spans one or more exon-exon junctions over two or more exons of that transcript.
  • Percentage of splice compatible exonic (SCE) alignments. These are alignments compatible with a transcript in the given annotation, but which differently to SCJ alignments, do not include an exon-exon junction in the alignment.
  • Percentage of intronic (INT) alignments. These are alignments fully contained in intronic regions.

These data appear to come from an unstranded library, but if they would be stranded, a fourth diagnostic plot would appear showing an estimated value of the strandedness of each sample as function of the percentage of intergenic alignments. In stranded RNA-seq data, we should expect strandedness values close to 1, which imply that most reads align to the same strand than the annotated transcripts. Lower strandedness values can be indicative of gDNA contamination because reads sequenced from DNA are expected to align in equal proportions to both strands.

Because IGC alignments mainly originate from gDNA contamination, we may expect a negative correlation between the percentage of SCJ or SCE alignments and the percentage of IGC alignments. On the other hand, INT alignments may originate either from primary unprocessed transcripts in the nucleus, or from gDNA contamination as well. Therefore, we may also expect some positive correlation between the percentages of INT and IGC alignments, as it happens in this data.

Using the function getDx() on the gDNAx object, we obtain all the values used in the diagnostics.

dx <- getDx(gdnax)
               IGC      INT      SCJ      SCE      JNC  IGCFLM  SCJFLM  SCEFLM
s32gDNA0  1.036524 11.75898 15.18563 40.04114 19.56452 173.598 158.009 158.487
s33gDNA0  1.125444 11.78607 15.21656 40.31236 19.55383 175.204 160.773 161.142
s34gDNA0  1.095440 12.26551 15.40036 40.19421 19.70788 167.112 155.121 153.470
s26gDNA1  1.402844 12.22650 14.78351 38.69202 18.73132 170.242 159.978 164.952
s27gDNA1  1.365428 12.45073 14.54513 38.21891 18.31765 171.897 162.424 173.831
s28gDNA1  1.503162 12.49083 14.09956 37.67457 17.96053 175.066 164.124 162.643
s23gDNA10 3.529043 13.16473 11.22648 30.99781 14.41110 188.658 162.788 167.499
s24gDNA10 3.781627 13.65285 10.85400 30.01465 13.91748 166.722 156.873 166.619
s25gDNA10 3.412804 13.51507 11.19751 30.65271 14.20691 166.260 159.966 157.337
           INTFLM STRAND
s32gDNA0  154.844     NA
s33gDNA0  155.845     NA
s34gDNA0  150.285     NA
s26gDNA1  156.073     NA
s27gDNA1  158.668     NA
s28gDNA1  160.928     NA
s23gDNA10 160.926     NA
s24gDNA10 155.909     NA
s25gDNA10 157.855     NA

The column JNC contains the percentage of alignments that include one or more junctions, irrespective of whether those aligments are compatible with an spliced transcript in the given annotation. The columns with the suffix FLM contain an estimation of the fragment length mean in the alignments originating in the corresponding region, and the column STRAND stores the strandedness values, which in this case are NA because this dataset is not strand-specific.

We can directly plot the estimated fragments length distributions with the function plotFrgLength().