The sRNADiff package

Matthias Zytnicki

30 April 2018

sRNADiff aims at finding differentially expressed short RNAs with or without annotation. To do so, sRNADiff uses up to 4 different strategies:

Example

Raw data have been downloaded from the GEO data set GSE62830, provided in Viollet et al., 2015 (Viollet et al. 2015). Adapters were removed with fastx_clipper and mapped with bowtie2 (Salzberg and Langmead 2012) on the human genome version GRCh38.

This example provides this data, restricted to a small locus on chr14. It uses the whole genome annotation (with coding genes, etc.) and extracts miRNAs.

The file data.csv contains three columns for each BAM file:

dir         <- system.file("extdata", package="srnadiff", mustWork = TRUE)
data        <- read.csv(file.path(dir, "data.csv"))
gtfFile     <- file.path(dir, "Homo_sapiens.GRCh38.76.gtf.gz")
annotation  <- readWholeGenomeAnnotation(gtfFile)
bamFiles    <- file.path(dir, data$FileName)
replicates  <- data$SampleName
conditions  <- factor(data$Condition)
exp         <- sRNADiffExp(annotation, bamFiles, replicates, conditions)
diffRegions <- runAll(exp)

For your conveniance, the sample can be loaded with an only command exp <- sRNADiffExample(), so the script boils down to:

exp         <- sRNADiffExample()
diffRegions <- runAll(exp)

Rationale and scope of the package

There is no real method for finding differentially expressed short RNAs. The most used method focusses on miRNAs, and only uses a standard RNA-Seq pipe-line on these genes.

However, annotated tRF, siRNAs, piRNA, etc. are thus out of the scope of these analyses. Several ad hoc method have been used, and this package implements a unifying method, finding differentially expressed genes or regions of any kind.

Explanation of the different strategies

Each strategy segments the genome its own way and finds potential differentially expressed regions. Every strategy makes assumption of the profiles of the differentially expressed regions. These assumptions differ between strategies, and thus it is useful to combined them.

Annotation

This is the standard pipe-line, and the method simply provides the list of miRNA genes extracted from the given GTF file.

If you do not provide a GTF file, this step is skipped.

Unfortunately, there is no clear standard for an annotation format. We provide here four ways to parse the annotation file.

Whole genome file annotation

This GTF file can be found in the central repositories (NCBI, Ensembl) and contains all the annotation found in an organism (coding genes, tranposable element, etc.). The following function reads the annotation file and extracts the miRNAs. Annotation files may have different formats, but this command has been tested on several model organisms (including human) from Ensembl.

miRBase annotation

miRBase (Kozomara and Griffiths-Jones 2014) is the central repository for miRNAs. If your organism is available, you can download their miRNA annotation in GFF3 format (check the “Browse” tab). The following function parses a GFF3 miRBase file, and extracts the precursor miRNAs.

As a result, the reads will be counted per pre-miRNA, and the 5’ and 3’ arms, the miRNA and the miRNA* will be merged in the same feature. If you want to separate the two, use:

Other format

When the previous functions do not work, you can use your own parser with:

The source parameter keeps all the lines such that the second field matches the given parameter (e.g. miRNA). The feature parameter keeps all the lines such that the third field matches the given parameter (e.g. gene). The name of the feature will be given by the tag name (e.g. gene_name). source, feature and name can be NULL. In this case, no selection is performed on source or feature. If name is null, then a systematic name is given (annotation_N).

Naive

This strategy is the simplest, most commonly used, to find transcription units. It proceeds in two steps.

The first step merges all the BAM files. Regions with a minimum number coverage (summing all the reads in all input BAM files) are kept.

The second steps merges regions that are within a given distance (also provided by the user). The default distance can be change using this function:

This method is efficient when a gene is differentially expressed on the whole transcript.

HMM

This strategy first computes the coverage of every input BAM file of every nucleotide of the genome. If two genome nucleotides have exactly the same coverage profile, we merge them. We give these coverage profiles to DESeq2 (Love, Huber, and Anders 2014), and compute the p-values. We thus have a p-value for each position of the genome. We then use a very crude hidden Markov model (HMM) on each chromosome of the genome. The HMM has two states: “not differentially expressed”, and “differentially expressed”. Each state uses a binomial distribution, the probability to emit a p-value < \(t\), and the probability to emit a p-value >= \(t\) (where \(t\) is a threshold that can be tuned, initially set to 0.1). We did not try to optimize the parameters of the HMM, because it takes times, and do not improve the results. Instead, the HMM is used a crude (but effective) way to segment the genome.

The parameters of the HMM can be changed using these functions:

where:

This method is efficient when a region is highly differentially expressed, even if it is concealed in a longer transcript.

Slicing

This stretegy proceeds in three steps:

Obviously, many putative regions may have very similar start and end points. If two regions differ only by a few base pairs, the smallest regions will be discarded.

The default values can be changed using these functions:

This method is especially efficient when a differentially expressed region is large, even though the fold change is moderate.

General parameters

The three last strategies can be tuned by specifying:

The default values can be changed using these functions:

Combination of the strategies

Choice of the strategies

All the regions given by each strategies are then combined into a list of regions. You can choose not to use some strategies, use the function

where annotation, naive, hmm and slice are logical variables, set to FALSE if you want to skip that step.

Quantification of the features

The selected regions are then quantified using of the summarizeOverlaps function of the GenomicAlignments package. Notice that a read can overlap two different regions (e.g. extracted from the naive and the slicing methods), and thus can be counted twice for the quantification.

You can adjust the minimum number of overlapping nucleotides between a read and a region to declare a hit, using:

DESeq2 is then used to get the adjusted p-values of these regions.

Extracting regions

The regions, with corresponding information provided by DESeq2 (mean expression, fold-change, p-value, adjusted p-value, etc.), can be extracted with this command:

where pvalue is the (adjusted) p-value threshold. The output in a GenomicRanges object, and the information is accessible with the mcols() function.

Visualization

You can get the read coverage profile of your selected regions using the plotRegion method. For instance, if exp is your srnadiff object, you can type:

plotRegion(exp, regions[1])

Each line provides the read density on an input sample. The dark grey rectangle emphasizes the selected region.

Misc

Naming the conditions

The conditions can be sorted (WT before mutant, untreated before treated) this way:

Using several cores

The quantification and differential expression steps can be accelerated using several cores and the following command:

Troubleshooting

While installing the package, if the compiler complains and says

#error This file requires compiler and library support for the ISO C++ 2011 standard.
This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.

Add this line

Sys.setenv("PKG_CXXFLAGS"="-std=c++11")

before installing the package.

Session information

devtools::session_info()
##  setting  value                       
##  version  R version 3.5.0 (2018-04-23)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  C                           
##  tz       America/New_York            
##  date     2018-04-30                  
## 
##  package              * version   date       source        
##  acepack                1.4.1     2016-10-29 CRAN (R 3.5.0)
##  annotate               1.58.0    2018-04-30 Bioconductor  
##  AnnotationDbi          1.42.0    2018-04-30 Bioconductor  
##  assertthat             0.2.0     2017-04-11 CRAN (R 3.5.0)
##  backports              1.1.2     2017-12-13 CRAN (R 3.5.0)
##  base                 * 3.5.0     2018-04-27 local         
##  base64enc              0.1-3     2015-07-28 CRAN (R 3.5.0)
##  Biobase                2.40.0    2018-04-30 Bioconductor  
##  BiocGenerics           0.26.0    2018-04-30 Bioconductor  
##  BiocParallel           1.14.0    2018-04-30 Bioconductor  
##  BiocStyle            * 2.8.0     2018-04-30 Bioconductor  
##  biomaRt                2.36.0    2018-04-30 Bioconductor  
##  Biostrings             2.48.0    2018-04-30 Bioconductor  
##  bit                    1.1-12    2014-04-09 CRAN (R 3.5.0)
##  bit64                  0.9-7     2017-05-08 CRAN (R 3.5.0)
##  bitops                 1.0-6     2013-08-17 CRAN (R 3.5.0)
##  blob                   1.1.1     2018-03-25 CRAN (R 3.5.0)
##  checkmate              1.8.5     2017-10-24 CRAN (R 3.5.0)
##  cluster                2.0.7-1   2018-04-13 CRAN (R 3.5.0)
##  colorspace             1.3-2     2016-12-14 CRAN (R 3.5.0)
##  compiler               3.5.0     2018-04-27 local         
##  data.table             1.10.4-3  2017-10-27 CRAN (R 3.5.0)
##  datasets             * 3.5.0     2018-04-27 local         
##  DBI                    0.8       2018-03-02 CRAN (R 3.5.0)
##  DelayedArray           0.6.0     2018-04-30 Bioconductor  
##  DESeq2                 1.20.0    2018-04-30 Bioconductor  
##  devtools               1.13.5    2018-02-18 CRAN (R 3.5.0)
##  digest                 0.6.15    2018-01-28 CRAN (R 3.5.0)
##  evaluate               0.10.1    2017-06-24 CRAN (R 3.5.0)
##  foreign                0.8-70    2017-11-28 CRAN (R 3.5.0)
##  Formula                1.2-2     2017-07-10 CRAN (R 3.5.0)
##  genefilter             1.62.0    2018-04-30 Bioconductor  
##  geneplotter            1.58.0    2018-04-30 Bioconductor  
##  GenomeInfoDb           1.16.0    2018-04-30 Bioconductor  
##  GenomeInfoDbData       1.1.0     2018-04-27 Bioconductor  
##  GenomicAlignments      1.16.0    2018-04-30 Bioconductor  
##  GenomicFeatures        1.32.0    2018-04-30 Bioconductor  
##  GenomicRanges          1.32.0    2018-04-30 Bioconductor  
##  ggplot2                2.2.1     2016-12-30 CRAN (R 3.5.0)
##  graphics             * 3.5.0     2018-04-27 local         
##  grDevices            * 3.5.0     2018-04-27 local         
##  grid                   3.5.0     2018-04-27 local         
##  gridExtra              2.3       2017-09-09 CRAN (R 3.5.0)
##  gtable                 0.2.0     2016-02-26 CRAN (R 3.5.0)
##  Hmisc                  4.1-1     2018-01-03 CRAN (R 3.5.0)
##  htmlTable              1.11.2    2018-01-20 CRAN (R 3.5.0)
##  htmltools              0.3.6     2017-04-28 CRAN (R 3.5.0)
##  htmlwidgets            1.2       2018-04-19 CRAN (R 3.5.0)
##  httr                   1.3.1     2017-08-20 CRAN (R 3.5.0)
##  IRanges                2.14.0    2018-04-30 Bioconductor  
##  knitr                * 1.20      2018-02-20 CRAN (R 3.5.0)
##  labeling               0.3       2014-08-23 CRAN (R 3.5.0)
##  lattice                0.20-35   2017-03-25 CRAN (R 3.5.0)
##  latticeExtra           0.6-28    2016-02-09 CRAN (R 3.5.0)
##  lazyeval               0.2.1     2017-10-29 CRAN (R 3.5.0)
##  locfit                 1.5-9.1   2013-04-20 CRAN (R 3.5.0)
##  magrittr               1.5       2014-11-22 CRAN (R 3.5.0)
##  Matrix                 1.2-14    2018-04-13 CRAN (R 3.5.0)
##  matrixStats            0.53.1    2018-02-11 CRAN (R 3.5.0)
##  memoise                1.1.0     2017-04-21 CRAN (R 3.5.0)
##  methods              * 3.5.0     2018-04-27 local         
##  munsell                0.4.3     2016-02-13 CRAN (R 3.5.0)
##  nnet                   7.3-12    2016-02-02 CRAN (R 3.5.0)
##  parallel               3.5.0     2018-04-27 local         
##  pillar                 1.2.2     2018-04-26 CRAN (R 3.5.0)
##  plyr                   1.8.4     2016-06-08 CRAN (R 3.5.0)
##  prettyunits            1.0.2     2015-07-13 CRAN (R 3.5.0)
##  progress               1.1.2     2016-12-14 CRAN (R 3.5.0)
##  R6                     2.2.2     2017-06-17 CRAN (R 3.5.0)
##  RColorBrewer           1.1-2     2014-12-07 CRAN (R 3.5.0)
##  Rcpp                   0.12.16   2018-03-13 CRAN (R 3.5.0)
##  RCurl                  1.95-4.10 2018-01-04 CRAN (R 3.5.0)
##  reshape2               1.4.3     2017-12-11 CRAN (R 3.5.0)
##  rlang                  0.2.0     2018-02-20 CRAN (R 3.5.0)
##  rmarkdown            * 1.9       2018-03-01 CRAN (R 3.5.0)
##  rpart                  4.1-13    2018-02-23 CRAN (R 3.5.0)
##  rprojroot              1.3-2     2018-01-03 CRAN (R 3.5.0)
##  Rsamtools              1.32.0    2018-04-30 Bioconductor  
##  RSQLite                2.1.0     2018-03-29 CRAN (R 3.5.0)
##  rstudioapi             0.7       2017-09-07 CRAN (R 3.5.0)
##  rtracklayer            1.40.0    2018-04-30 Bioconductor  
##  S4Vectors              0.18.0    2018-04-30 Bioconductor  
##  scales                 0.5.0     2017-08-24 CRAN (R 3.5.0)
##  splines                3.5.0     2018-04-27 local         
##  srnadiff             * 1.0.0     2018-05-01 Bioconductor  
##  stats                * 3.5.0     2018-04-27 local         
##  stats4                 3.5.0     2018-04-27 local         
##  stringi                1.1.7     2018-03-12 CRAN (R 3.5.0)
##  stringr                1.3.0     2018-02-19 CRAN (R 3.5.0)
##  SummarizedExperiment   1.10.0    2018-04-30 Bioconductor  
##  survival               2.42-3    2018-04-16 CRAN (R 3.5.0)
##  tibble                 1.4.2     2018-01-22 CRAN (R 3.5.0)
##  tools                  3.5.0     2018-04-27 local         
##  utils                * 3.5.0     2018-04-27 local         
##  withr                  2.1.2     2018-03-15 CRAN (R 3.5.0)
##  XML                    3.98-1.11 2018-04-16 CRAN (R 3.5.0)
##  xtable                 1.8-2     2016-02-05 CRAN (R 3.5.0)
##  XVector                0.20.0    2018-04-30 Bioconductor  
##  yaml                   2.1.18    2018-03-08 CRAN (R 3.5.0)
##  zlibbioc               1.26.0    2018-04-30 Bioconductor

References

Kozomara, Ana, and Sam Griffiths-Jones. 2014. “Base: annotating high confidence microRNAs using deep sequencing data.” NAR 42:D68–D73.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12):550.

Salzberg, Steven, and Ben Langmead. 2012. “Fast gapped-read alignment with Bowtie 2.” Nature Methods 9:357–59.

Viollet, Coralie, David A. Davis, Martin Reczko, Joseph M. Ziegelbauer, Francesco Pezzella, Jiannis Ragoussis, and Robert Yarchoan. 2015. “Next-Generation Sequencing Analysis Reveals Differential Expression Profiles of Mirna-mRNA Target Pairs in Kshv-Infected Cells.” PLOS ONE 10:1–23.