1 What is genomic DNA contamination in a RNA-seq experiment

Genomic DNA (gDNA) contamination is an internal contaminant that can be present in gene quantification techniques, such as in an RNA-sequencing (RNA-seq) experiment. This contamination can be due to an absent or inefficient gDNA digestion step (with DNase) during the extraction of total RNA in the library preparation process. In fact, some protocols do not include a DNase treatment step, or they include it as optional.

While gDNA contamination is not a major issue for poly(A) RNA-seq, it can remarkably affect gene expression quantification of total RNA-seq experiments. Moreover, gDNA contamination can lead to a misleading attribution of expression to unannotated regions of the genome. For this reason, it is important to check the level of gDNA contamination in the quality control analysis before performing further analyses, specially when total RNA has been sequenced.

2 Diagnose the presence of gDNA contamination

Here we illustrate the use of the gDNAx package for calculating different diagnostics related to gDNA contamination levels.

To do so, a subset of the data in (Li et al. 2022) is used. This data consists in 9 paired-end samples of total RNA-seq with increasing levels of gDNA contamination: 0% (no contamination), 1% and 10%, with 3 replicates each. The data is available through the gDNAinRNAseqData package. BAM files contain about 100,000 alignments, sampled uniformly at random from complete BAM files.

library(gDNAinRNAseqData)

# Retrieving BAM files
bamfiles <- LiYu22subsetBAMfiles()
bamfiles
[1] "/tmp/Rtmpf6TXkK/s32gDNA0.bam"  "/tmp/Rtmpf6TXkK/s33gDNA0.bam" 
[3] "/tmp/Rtmpf6TXkK/s34gDNA0.bam"  "/tmp/Rtmpf6TXkK/s26gDNA1.bam" 
[5] "/tmp/Rtmpf6TXkK/s27gDNA1.bam"  "/tmp/Rtmpf6TXkK/s28gDNA1.bam" 
[7] "/tmp/Rtmpf6TXkK/s23gDNA10.bam" "/tmp/Rtmpf6TXkK/s24gDNA10.bam"
[9] "/tmp/Rtmpf6TXkK/s25gDNA10.bam"

# Getting information about the gDNA concentrations of each BAM file
pdat <- LiYu22phenoData(bamfiles)
pdat
          gDNA
s32gDNA0     0
s33gDNA0     0
s34gDNA0     0
s26gDNA1     1
s27gDNA1     1
s28gDNA1     1
s23gDNA10   10
s24gDNA10   10
s25gDNA10   10

2.1 Identifying strandMode

The strandMode of a sample depends on the library protocol used: it can be strand-specific (stranded) or non strand-specific (non-stranded). Stranded paired-end RNA-seq is, in turn, divided into: libraries were the pair strand is that of the first alignment and libraries were the pair strand is that of the second alignment.

Function identifyStrandMode() can be used to try to identify the library protocol used:

library(gDNAx)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

strandM <- identifyStrandMode(bamfiles, txdb, singleEnd=FALSE)
strandM$strandMode
[1] NA

The strandMode identified is NA, meaning that a non-stranded protocol was used. See the “Details” section in the help page of identifyStrandMode() for further details.

Let’s take a look at the data.frame with all strandedness values. The strandedness values are based on the proportion of reads aligning to the same or opposite strand as transcripts in the annotations. See help page of identifyStrandMode() for more information.

strandM$Strandedness
          strandMode1 strandMode2      ambig Nalignments
s32gDNA0    0.4661305   0.4646406 0.06922888       69133
s33gDNA0    0.4665387   0.4647487 0.06871265       69274
s34gDNA0    0.4706538   0.4619302 0.06741606       69123
s26gDNA1    0.4710407   0.4629427 0.06601662       66559
s27gDNA1    0.4727378   0.4595452 0.06771699       65567
s28gDNA1    0.4665966   0.4666584 0.06674493       64739
s23gDNA10   0.4674282   0.4642238 0.06834804       53052
s24gDNA10   0.4635957   0.4691156 0.06728863       51450
s25gDNA10   0.4666502   0.4655258 0.06782407       52474

As we can see, the proportion of alignments overlapping transcripts when strandMode = 1L is used (“strandMode1” column) is very similar to the one when strandMode = 2L is considered (“strandMode2” column), which is compatible with a non-stranded library. This contrasts with the reported stranded protocol used to obtain this data according to (Li et al. 2022). However, the results obtained by identifyStrandMode() were contrasted with results of the RSeQC infer_experiment.py tool ((Wang, Wang, and Li 2012)) and visual inspection of data in the Integrative Genomics Viewer (IGV) ((Robinson et al. 2011)), all of which point to an unstranded RNA-seq experiment.

identifyStrandMode() uses 200,000 alignments overlapping exonic regions to compute strandedness (recommended by (Signal and Kahlke 2022)), unless the number of these kind of alignments in the BAM file is lower. In this vignette, the number of alignments used is close to 60,000, which is the total number of exonic alignments present in the BAM files.

2.2 Calculating and plotting diagnostics

Once the strandMode is known, we can calculate gDNA contamination diagnostics using the gDNAdx() function. A subset of the alignments present in the BAM file are considered to obtain these diagnostics. The number of alignments used is set by the yieldSize parameter.

gdnax <- gDNAdx(bamfiles, txdb, singleEnd=FALSE, strandMode=strandM$strandMode)
gdnax
gDNAx object
# BAM files (9): s32gDNA0.bam, ..., s25gDNA10.bam
# Library layout: paired-end (2x50nt)
# Strand mode: NA
# Sequences: only standard chromosomes
# Annotation pkg: TxDb.Hsapiens.UCSC.hg38.knownGene
# Alignments employed: first 100000

We, then, can get statistics on the origin of alignments and strandedness with getDx():

dx <- getDx(gdnax)
dx
               IGC      INT      SCJ      SCE      SCC  IGCFLM  SCJFLM  SCEFLM
s32gDNA0  1.036524 11.75898 15.18563 40.04114 19.56452 178.939 157.224 158.402
s33gDNA0  1.125444 11.78607 15.21656 40.31236 19.55383 174.529 162.123 165.682
s34gDNA0  1.095440 12.26551 15.40036 40.19421 19.70788 171.118 156.341 155.538
s26gDNA1  1.402844 12.22650 14.78351 38.69202 18.73132 174.421 158.713 163.458
s27gDNA1  1.365428 12.45073 14.54513 38.21891 18.31765 173.409 166.383 164.063
s28gDNA1  1.503162 12.49083 14.09956 37.67457 17.96053 174.973 161.662 160.597
s23gDNA10 3.529043 13.16473 11.22648 30.99781 14.41110 174.421 161.371 164.748
s24gDNA10 3.781627 13.65285 10.85400 30.01465 13.91748 169.245 160.336 160.806
s25gDNA10 3.412804 13.51507 11.19751 30.65271 14.20691 174.168 160.922 158.964
           INTFLM STRAND
s32gDNA0  154.940     NA
s33gDNA0  157.002     NA
s34gDNA0  152.265     NA
s26gDNA1  157.834     NA
s27gDNA1  158.440     NA
s28gDNA1  159.784     NA
s23gDNA10 162.035     NA
s24gDNA10 163.551     NA
s25gDNA10 157.776     NA

Next, we can plot the previous gDNA diagnostic measures using the default plot() function. This creates four plots, each one representing a diagnostic measure as a function of the percentage of intergenic alignments, which can be considered the most informative measure regarding gDNA contamination levels. Here, strandedness values (STRAND column) are NA since the dataset is not strand-specific.

plot(gdnax, group=pdat$gDNA, pch=19)
Default diagnostics.

Figure 1: Default diagnostics

Splice-compatible junction (SCJ) alignments (spliced alignments overlapping a transcript in a “splice compatible” way) and splice compatible exonic (SCE) alignments (alignments without a splice site, but that overlap a transcript in a “splice compatible” way) are expected to come from RNA sequenced reads. Instead, intergenic alignments (IGC) mainly come from DNA sequenced reads. For this reason, we see a negative correlation between the percentage of SCJ or SCE alignments and the percentage of IGC alignments: higher gDNA contamination levels lead to more IGC alignments and less SCJ or SCE alignments.

On the contrary, intronic (INT) alignments are positively correlated with IGC alignments and, thus, with gDNA contamination levels.

The last plot shows the strandedness value. In stranded RNA-seq protocols, we expect a strandedness value close to 1, meaning that most reads align to the same strand than the annotated transcripts. Lower strandedness values can be indicative of gDNA contamination: reads sequenced from DNA are expected to align in equal proportions to both strands. Therefore, a low strandedness in a stranded RNA-seq experiment can be due to the presence of DNA reads (contamination) mapping to transcripts but in the opposite strand.

Another plot to represent diagnostic measures is the one representing the origin of alignments per sample. Fluctuations in this proportions evidence different levels of gDNA contamination in samples.

plotAlnOrigins(gdnax, group=pdat$gDNA)
Alignment origins.

Figure 2: Alignment origins

Finally, the estimated fragments length distributions can be plotted with plotFrgLength(). This plot can show any differences in fragment length distributions that may be present. This plot is only available for paired-end data.

plotFrgLength(gdnax)
Estimated fragments length distributions.

Figure 3: Estimated fragments length distributions

2.3 Accessing annotations

The annotations of intergenic and intronic regions used to compute these diagnostics can easily be obtain using two different functions: getIgc() and getInt, respectively. For instance, let’s retrieve intergenic annotations:

igc <- getIgc(gdnax)
head(igc, n=3)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]     chr1     1-10000      *
  [2]     chr1 31132-31292      *
  [3]     chr1 31755-32840      *
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome

3 Filter alignments according to an annotation

The package also provides functions to filter splice-compatible alignments and write them into new BAM files. To do so, first we set the type of alignments to be included in the BAM file using filterBAMtxFlag(), and then we call the filterBAMtx() function. For instance, to keep only reads expected to come from RNA, we can set isSpliceCompatibleJunction and isSpliceCompatibleExonic to TRUE. The resulting BAM files, which are located in the directory indicated in the path argument, are useful for performing downstream analyses, such as differential expression analysis, without the effect of gDNA contamination.

fbf <- filterBAMtxFlag(isSpliceCompatibleJunction=TRUE,
                        isSpliceCompatibleExonic=TRUE)
tmpdir <- tempdir()
fstats <- filterBAMtx(gdnax, path=tmpdir, txflag=fbf)
# list.files(tmpdir, pattern="*.bam$")
fstats
                NALN NIGC NINT  NSCJ  NSCE
s32gDNA0.bam  106978   NA   NA 15382 43198
s33gDNA0.bam  106951   NA   NA 15409 43396
s34gDNA0.bam  106991   NA   NA 15601 43379
s26gDNA1.bam  106842   NA   NA 15011 41527
s27gDNA1.bam  107173   NA   NA 14756 40937
s28gDNA1.bam  107239   NA   NA 14285 40427
s23gDNA10.bam 107566   NA   NA 11311 33155
s24gDNA10.bam 107732   NA   NA 10925 32310
s25gDNA10.bam 106633   NA   NA 11275 32580

We can see the number of alignments in each of the selected categories, and NA for those for which we did not retrieve any alignment.

4 Session information

sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.18.0
 [2] GenomicFeatures_1.54.4                  
 [3] AnnotationDbi_1.64.1                    
 [4] Biobase_2.62.0                          
 [5] gDNAx_1.0.2                             
 [6] Rsamtools_2.18.0                        
 [7] Biostrings_2.70.3                       
 [8] XVector_0.42.0                          
 [9] GenomicRanges_1.54.1                    
[10] GenomeInfoDb_1.38.8                     
[11] IRanges_2.36.0                          
[12] S4Vectors_0.40.2                        
[13] BiocGenerics_0.48.1                     
[14] gDNAinRNAseqData_1.2.0                  
[15] knitr_1.45                              
[16] BiocStyle_2.30.0                        

loaded via a namespace (and not attached):
 [1] DBI_1.2.2                     bitops_1.0-7                 
 [3] biomaRt_2.58.2                rlang_1.1.3                  
 [5] magrittr_2.0.3                matrixStats_1.2.0            
 [7] compiler_4.3.3                RSQLite_2.3.5                
 [9] png_0.1-8                     vctrs_0.6.5                  
[11] stringr_1.5.1                 pkgconfig_2.0.3              
[13] crayon_1.5.2                  fastmap_1.1.1                
[15] magick_2.8.3                  dbplyr_2.4.0                 
[17] ellipsis_0.3.2                utf8_1.2.4                   
[19] promises_1.2.1                rmarkdown_2.26               
[21] purrr_1.0.2                   bit_4.0.5                    
[23] xfun_0.42                     zlibbioc_1.48.2              
[25] cachem_1.0.8                  jsonlite_1.8.8               
[27] progress_1.2.3                blob_1.2.4                   
[29] highr_0.10                    later_1.3.2                  
[31] DelayedArray_0.28.0           BiocParallel_1.36.0          
[33] interactiveDisplayBase_1.40.0 parallel_4.3.3               
[35] prettyunits_1.2.0             VariantAnnotation_1.48.1     
[37] R6_2.5.1                      RColorBrewer_1.1-3           
[39] bslib_0.6.1                   stringi_1.8.3                
[41] rtracklayer_1.62.0            jquerylib_0.1.4              
[43] Rcpp_1.0.12                   bookdown_0.38                
[45] SummarizedExperiment_1.32.0   httpuv_1.6.14                
[47] Matrix_1.6-5                  tidyselect_1.2.1             
[49] abind_1.4-5                   yaml_2.3.8                   
[51] codetools_0.2-19              curl_5.2.1                   
[53] lattice_0.22-5                tibble_3.2.1                 
[55] shiny_1.8.0                   withr_3.0.0                  
[57] KEGGREST_1.42.0               evaluate_0.23                
[59] BiocFileCache_2.10.1          xml2_1.3.6                   
[61] ExperimentHub_2.10.0          pillar_1.9.0                 
[63] BiocManager_1.30.22           filelock_1.0.3               
[65] MatrixGenerics_1.14.0         generics_0.1.3               
[67] RCurl_1.98-1.14               BiocVersion_3.18.1           
[69] hms_1.1.3                     GenomicFiles_1.38.0          
[71] xtable_1.8-4                  glue_1.7.0                   
[73] tools_4.3.3                   AnnotationHub_3.10.0         
[75] BiocIO_1.12.0                 BSgenome_1.70.2              
[77] GenomicAlignments_1.38.2      XML_3.99-0.16.1              
[79] grid_4.3.3                    plotrix_3.8-4                
[81] GenomeInfoDbData_1.2.11       restfulr_0.0.15              
[83] cli_3.6.2                     rappdirs_0.3.3               
[85] fansi_1.0.6                   S4Arrays_1.2.1               
[87] dplyr_1.1.4                   sass_0.4.9                   
[89] digest_0.6.35                 SparseArray_1.2.4            
[91] rjson_0.2.21                  memoise_2.0.1                
[93] htmltools_0.5.7               lifecycle_1.0.4              
[95] httr_1.4.7                    mime_0.12                    
[97] bit64_4.0.5                  

References

Li, Xiangnan, Peipei Zhang, Haijian Wang, and Ying Yu. 2022. “Genes Expressed at Low Levels Raise False Discovery Rates in Rna Samples Contaminated with Genomic Dna.” BMC Genomics 23 (1): 554.

Robinson, James T, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S Lander, Gad Getz, and Jill P Mesirov. 2011. “Integrative Genomics Viewer.” Nature Biotechnology 29 (1): 24–26.

Signal, Brandon, and Tim Kahlke. 2022. “How_are_we_stranded_here: Quick Determination of Rna-Seq Strandedness.” BMC Bioinformatics 23 (1): 1–9.

Wang, Liguo, Shengqin Wang, and Wei Li. 2012. “RSeQC: Quality Control of Rna-Seq Experiments.” Bioinformatics 28 (16): 2184–5.