Maintainer: Ji-Ping Wang

Reference: Xiong, B., Yang, Y., Fineis, F., Wang, J.-P. DegNorm: normalization of generalized transcript degradation improves accuracy in RNA-seq analysis. Genome Biology, 2019, 20:75.

What is DegNorm?

DegNorm, short for Degradation Normalization, is a bioinformatics pipeline designed to correct for bias due to the heterogeneous patterns of transcript degradation in RNA-seq data. DegNorm helps improve the accuracy of the differential expression analysis by accounting for this degradation.

In practice, RNA samples are often degraded to some extent, and the severity of degradation is not only sample-specific but also gene-specific. It is known that longer genes tend to degrade faster than shorter ones. As a result, common global normalization approaches that impose a single normalization factor on all genes within a sample can be ineffective in correcting for RNA degradation bias.

DegNorm pipeline available formats

We’ve developed an R package and an independent Python package, both of which run the entire pipeline starting from the RNA-seq alignment (.bam) files. For the most up-to-date version, we recommend using the R package from Bioconductor.

DegNorm version 1.3.4 updates

In version 1.3.1, we made the following updates:
1. In the plot_coverage function, a samples option is provided so that users can select which samples to plot coverage curves for.
2. We fixed a bug in the DegNorm function. Earlier versions used genes with maximum rho < 0.1 from the initial SVD to determine the scaling factor before running the core algorithm. For some data sets with many samples and severe degradation, this could return NA and cause errors.

Install DegNorm R package
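Since the R package is distributed through Bioconductor, it can presumably be installed with the standard BiocManager workflow. The snippet below is a sketch of that general route, not a command reproduced from this page.

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install DegNorm from Bioconductor, then load it
BiocManager::install("DegNorm")
library(DegNorm)
```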

DegNorm main features

The DegNorm R package provides two major functions: (1) processing the RNA-seq alignment files (.bam) to calculate coverage; and (2) using a core algorithm written in RcppArmadillo to perform rank-one over-approximation on the coverage matrix of each gene to estimate the degradation index (DI) score for each gene within each sample.
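To give a feel for what a rank-one over-approximation and a DI score look like, here is a minimal base-R sketch on a toy matrix. It is not the package's RcppArmadillo algorithm: it takes the plain rank-one SVD fit and inflates each row until the fitted curve covers the observed one. The matrix orientation (rows are samples, columns are positions), the toy numbers, and the definition of DI as the missing fraction of the envelope area are all assumptions made for illustration.

```r
# Toy coverage matrix for one gene: rows = samples, columns = positions.
# Samples s1 and s2 are intact (s2 at twice the abundance);
# sample s3 loses coverage toward the right-hand end.
coverage <- rbind(
  s1 = c(10, 10, 10, 10),
  s2 = c(20, 20, 20, 20),
  s3 = c(10, 10,  5,  2)
)

# Best rank-one fit from the leading singular triplet
sv  <- svd(coverage, nu = 1, nv = 1)
fit <- sv$d[1] * sv$u %*% t(sv$v)
dimnames(fit) <- dimnames(coverage)

# Crude over-approximation (illustrative only): inflate each sample's row
# just enough that the fitted curve lies on or above the observed one
inflate  <- apply(coverage / fit, 1, max)
envelope <- fit * inflate

# DI taken here as the fraction of the envelope area that is missing
di <- 1 - rowSums(coverage) / rowSums(envelope)
print(round(di, 3))
```

In this toy case the two intact samples receive the same small DI regardless of abundance, while the truncated sample receives a clearly larger one, which is the behaviour a gene- and sample-specific degradation measure should show.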

DegNorm outputs DI scores together with degradation-normalized read counts (based on the DI scores). It also provides supplementary functions for visualizing degradation at both the gene and the sample level. The following diagram illustrates the flow of the DegNorm pipeline.

 

A diagram of DegNorm.


The following vignette provides example code for running the DegNorm R package. It presumes that you have successfully installed the DegNorm package. We illustrate below how to: 1) calculate the read coverage curves for all genes within all samples, and 2) perform degradation normalization on the coverage curves. Both steps are computationally intensive. Depending on the number of samples and the sequencing depth, the total computing time may be a few hours. DegNorm uses the parallel computing functionality of R and automatically detects the number of cores on your computer to run jobs in parallel. Because bam files are large and personal computers have limited computing power, we recommend running it on servers or computing clusters.

1. Compute coverage score based on alignment .bam files

Run main function to create read coverage matrix and read counts

The cores argument specifies the number of cores to use. Use as many cores as are available to maximize computing efficiency.
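One way to choose a value for cores is to ask R how many are available. The sketch below uses parallel::detectCores() and keeps one core free, which is only a cautious convention for shared machines and not a DegNorm requirement; the variable names are illustrative.

```r
library(parallel)

# detectCores() may return NA on some platforms, so fall back to 1
available <- detectCores()
if (is.na(available)) available <- 1L

# Leave one core free for the operating system on shared machines
n_cores <- max(1L, available - 1L)
n_cores
```

The resulting n_cores could then be supplied as the cores argument.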

The function read_coverage_batch returns the coverage matrices as a list, one per gene, and a data frame of read counts, with one row per gene and one column per sample.

2. DegNorm core algorithm

Run the degnorm core algorithm for degradation normalization. DegNorm is intended for differential expression analysis, so genes with extremely low read counts across samples are filtered out. The current criterion is that a gene is excluded from the degnorm algorithm if more than half of the samples have fewer than 5 reads. The following example uses downsampling (the default) to save time. Alternatively, you can set down_sampling = 0, which takes longer. If down_sampling = 1, read coverage scores are binned with a bin size of grid_size for baseline selection to achieve better efficiency. The default grid_size is 10 bp, and we recommend using a grid_size of less than 50 bp. The iteration argument specifies the number of iterations of the outer loop in the DegNorm algorithm; 5 is usually sufficient. The loop argument specifies the number of iterations in the matrix factorization over-approximation.

If down_sampling = 0, then the argument grid_size is ignored.
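The effect of grid_size can be pictured with a small base-R sketch that collapses a per-base coverage vector into bins. The helper bin_coverage is hypothetical, and summarizing each bin by its mean is an assumption; the package may summarize bins differently.

```r
# Hypothetical helper: average a per-base coverage vector over
# consecutive bins of grid_size bp (assumption: mean per bin)
bin_coverage <- function(cov, grid_size = 10) {
  bin_id <- ceiling(seq_along(cov) / grid_size)  # 1,1,...,2,2,...
  as.numeric(tapply(cov, bin_id, mean))
}

# 25 bp of toy coverage: 10 bp at 8, 10 bp at 4, 5 bp at 2
cov <- c(rep(8, 10), rep(4, 10), rep(2, 5))
binned <- bin_coverage(cov, grid_size = 10)
binned
# three bins with means 8, 4 and 2 (the last bin covers only 5 bp)
```

A larger grid_size shortens the vectors the algorithm has to work on, at the cost of resolution along the transcript, which fits the recommendation to keep it below 50 bp.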

The function degnorm returns a list of multiple objects. counts_normed contains the degradation-normalized read counts, which you can pass to DESeq or edgeR for DE analysis.

The difference in the number of genes between res_DegNorm and coverage_res is 207 (339-132). These 207 genes were filtered out of the degnorm degradation normalization because fewer than half of the samples (3) had more than 5 reads.
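The low-count filter described in this section can be sketched in base R as follows. This applies the stated rule (drop a gene when more than half of the samples have fewer than 5 reads) to a made-up count table; it is not the package's internal code, and the gene and sample names are invented.

```r
# Toy read-count table: rows = genes, columns = samples
counts <- data.frame(
  sample1 = c(120, 3, 0, 40),
  sample2 = c( 95, 2, 7, 38),
  sample3 = c(110, 9, 1,  4),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# Number of samples in which each gene has fewer than 5 reads
n_low <- rowSums(counts < 5)

# Drop a gene when more than half of the samples fall below the threshold
keep <- n_low <= ncol(counts) / 2
filtered <- counts[keep, ]
rownames(filtered)
# geneA and geneD are kept; geneB and geneC are low in 2 of 3 samples
```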

3. Plot functions in DegNorm

DegNorm provides four plot functions for visualization of degradation and sample quality diagnosis.

  • plot_coverage
  • plot_corr
  • plot_heatmap
  • plot_boxplot

 

– Boxplot of the degradation index (DI) scores

 

– Heatmap plot of the degradation index (DI) scores

 

– Correlation matrix plot of the degradation index (DI) scores

 

Session info

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] knitr_1.41    DegNorm_1.8.2
#> 
#> loaded via a namespace (and not attached):
#>   [1] colorspace_2.0-3            rjson_0.2.21               
#>   [3] ellipsis_0.3.2              XVector_0.38.0             
#>   [5] GenomicRanges_1.50.2        farver_2.1.1               
#>   [7] bit64_4.0.5                 AnnotationDbi_1.60.0       
#>   [9] fansi_1.0.3                 xml2_1.3.3                 
#>  [11] codetools_0.2-18            doParallel_1.0.17          
#>  [13] cachem_1.0.6                jsonlite_1.8.4             
#>  [15] Rsamtools_2.14.0            dbplyr_2.3.0               
#>  [17] png_0.1-8                   compiler_4.2.2             
#>  [19] httr_1.4.4                  assertthat_0.2.1           
#>  [21] Matrix_1.5-3                fastmap_1.1.0              
#>  [23] lazyeval_0.2.2              cli_3.6.0                  
#>  [25] htmltools_0.5.4             prettyunits_1.1.1          
#>  [27] tools_4.2.2                 gtable_0.3.1               
#>  [29] glue_1.6.2                  GenomeInfoDbData_1.2.9     
#>  [31] reshape2_1.4.4              dplyr_1.0.10               
#>  [33] rappdirs_0.3.3              Rcpp_1.0.9                 
#>  [35] Biobase_2.58.0              jquerylib_0.1.4            
#>  [37] vctrs_0.5.1                 Biostrings_2.66.0          
#>  [39] rtracklayer_1.58.0          iterators_1.0.14           
#>  [41] crosstalk_1.2.0             xfun_0.36                  
#>  [43] stringr_1.5.0               lifecycle_1.0.3            
#>  [45] restfulr_0.0.15             XML_3.99-0.13              
#>  [47] dendextend_1.16.0           ca_0.71.1                  
#>  [49] zlibbioc_1.44.0             scales_1.2.1               
#>  [51] TSP_1.2-1                   hms_1.1.2                  
#>  [53] MatrixGenerics_1.10.0       parallel_4.2.2             
#>  [55] SummarizedExperiment_1.28.0 RColorBrewer_1.1-3         
#>  [57] yaml_2.3.6                  curl_5.0.0                 
#>  [59] memoise_2.0.1               heatmaply_1.4.2            
#>  [61] gridExtra_2.3               ggplot2_3.4.0              
#>  [63] sass_0.4.4                  biomaRt_2.54.0             
#>  [65] stringi_1.7.12              RSQLite_2.2.20             
#>  [67] highr_0.10                  S4Vectors_0.36.1           
#>  [69] BiocIO_1.8.0                foreach_1.5.2              
#>  [71] seriation_1.4.1             GenomicFeatures_1.50.3     
#>  [73] BiocGenerics_0.44.0         filelock_1.0.2             
#>  [75] BiocParallel_1.32.5         GenomeInfoDb_1.34.7        
#>  [77] rlang_1.0.6                 pkgconfig_2.0.3            
#>  [79] matrixStats_0.63.0          bitops_1.0-7               
#>  [81] evaluate_0.20               lattice_0.20-45            
#>  [83] purrr_1.0.1                 GenomicAlignments_1.34.0   
#>  [85] htmlwidgets_1.6.1           labeling_0.4.2             
#>  [87] bit_4.0.5                   tidyselect_1.2.0           
#>  [89] plyr_1.8.8                  magrittr_2.0.3             
#>  [91] R6_2.5.1                    IRanges_2.32.0             
#>  [93] generics_0.1.3              DelayedArray_0.24.0        
#>  [95] DBI_1.1.3                   pillar_1.8.1               
#>  [97] withr_2.5.0                 KEGGREST_1.38.0            
#>  [99] RCurl_1.98-1.9              tibble_3.1.8               
#> [101] crayon_1.5.2                utf8_1.2.2                 
#> [103] BiocFileCache_2.6.0         plotly_4.10.1              
#> [105] rmarkdown_2.19              viridis_0.6.2              
#> [107] progress_1.2.2              grid_4.2.2                 
#> [109] data.table_1.14.6           blob_1.2.3                 
#> [111] digest_0.6.31               webshot_0.5.4              
#> [113] tidyr_1.2.1                 stats4_4.2.2               
#> [115] munsell_0.5.0               registry_0.5-1             
#> [117] viridisLite_0.4.1           bslib_0.4.2