sRNADiff aims at finding differentially expressed short RNAs with or without annotation. To do so, sRNADiff uses up to 4 different strategies:
Raw data have been downloaded from the GEO data set GSE62830, provided in Viollet et al., 2015 (Viollet et al. 2015). Adapters were removed with fastx_clipper and mapped with bowtie2 (Salzberg and Langmead 2012) on the human genome version GRCh38.
This example provides this data, restricted to a small locus on chr14. It uses the whole genome annotation (with coding genes, etc.) and extracts miRNAs.
The file data.csv
contains three columns for each BAM file:
FileName
)SampleName
)WT
(Condition
)dir <- system.file("extdata", package="srnadiff", mustWork = TRUE)
data <- read.csv(file.path(dir, "data.csv"))
gtfFile <- file.path(dir, "Homo_sapiens.GRCh38.76.gtf.gz")
annotation <- readWholeGenomeAnnotation(gtfFile)
bamFiles <- file.path(dir, data$FileName)
replicates <- data$SampleName
conditions <- factor(data$Condition)
exp <- sRNADiffExp(annotation, bamFiles, replicates, conditions)
diffRegions <- runAll(exp)
For your conveniance, the sample can be loaded with an only command exp <- sRNADiffExample()
, so the script boils down to:
There is no real method for finding differentially expressed short RNAs. The most used method focusses on miRNAs, and only uses a standard RNA-Seq pipe-line on these genes.
However, annotated tRF, siRNAs, piRNA, etc. are thus out of the scope of these analyses. Several ad hoc method have been used, and this package implements a unifying method, finding differentially expressed genes or regions of any kind.
Each strategy segments the genome its own way and finds potential differentially expressed regions. Every strategy makes assumption of the profiles of the differentially expressed regions. These assumptions differ between strategies, and thus it is useful to combined them.
This is the standard pipe-line, and the method simply provides the list of miRNA genes extracted from the given GTF file.
If you do not provide a GTF file, this step is skipped.
Unfortunately, there is no clear standard for an annotation format. We provide here four ways to parse the annotation file.
This GTF file can be found in the central repositories (NCBI, Ensembl) and contains all the annotation found in an organism (coding genes, tranposable element, etc.). The following function reads the annotation file and extracts the miRNAs. Annotation files may have different formats, but this command has been tested on several model organisms (including human) from Ensembl.
miRBase (Kozomara and Griffiths-Jones 2014) is the central repository for miRNAs. If your organism is available, you can download their miRNA annotation in GFF3 format (check the “Browse” tab). The following function parses a GFF3 miRBase file, and extracts the precursor miRNAs.
As a result, the reads will be counted per pre-miRNA, and the 5’ and 3’ arms, the miRNA and the miRNA* will be merged in the same feature. If you want to separate the two, use:
When the previous functions do not work, you can use your own parser with:
The source
parameter keeps all the lines such that the second field matches the given parameter (e.g. miRNA
). The feature
parameter keeps all the lines such that the third field matches the given parameter (e.g. gene
). The name of the feature will be given by the tag name
(e.g. gene_name
). source
, feature
and name
can be NULL
. In this case, no selection is performed on source
or feature
. If name
is null, then a systematic name is given (annotation_N
).
This strategy is the simplest, most commonly used, to find transcription units. It proceeds in two steps.
The first step merges all the BAM files. Regions with a minimum number coverage (summing all the reads in all input BAM files) are kept.
The second steps merges regions that are within a given distance (also provided by the user). The default distance can be change using this function:
This method is efficient when a gene is differentially expressed on the whole transcript.
This strategy first computes the coverage of every input BAM file of every nucleotide of the genome. If two genome nucleotides have exactly the same coverage profile, we merge them. We give these coverage profiles to DESeq2 (Love, Huber, and Anders 2014), and compute the p-values. We thus have a p-value for each position of the genome. We then use a very crude hidden Markov model (HMM) on each chromosome of the genome. The HMM has two states: “not differentially expressed”, and “differentially expressed”. Each state uses a binomial distribution, the probability to emit a p-value < \(t\), and the probability to emit a p-value >= \(t\) (where \(t\) is a threshold that can be tuned, initially set to 0.1). We did not try to optimize the parameters of the HMM, because it takes times, and do not improve the results. Instead, the HMM is used a crude (but effective) way to segment the genome.
The parameters of the HMM can be changed using these functions:
exp <- setTransitionProbabilities(exp, noDiffToDiff = 0.0001, diffToNoDiff = 0.01)
exp <- setEmissionProbabilities(exp, probability = 0.9)
exp <- setEmissionThreshold(exp, threshold = 0.1)
where:
noDiffToDiff
is the probability to move from the “not differentially expressed” state to the “differentially expressed” state,diffToNoDiff
is the probability to move from the “differentially expressed” state to the “not differentially expressed” state,probability
is the probability to emit a p-value < \(t\) in the “differentially expressed” state, and a p-value >= \(t\) in the “not differentially expressed” state, where \(t\) is the next threshold,threshold
is the threshold that limits each state.This method is efficient when a region is highly differentially expressed, even if it is concealed in a longer transcript.
This stretegy proceeds in three steps:
Obviously, many putative regions may have very similar start and end points. If two regions differ only by a few base pairs, the smallest regions will be discarded.
The default values can be changed using these functions:
This method is especially efficient when a differentially expressed region is large, even though the fold change is moderate.
The three last strategies can be tuned by specifying:
The default values can be changed using these functions:
All the regions given by each strategies are then combined into a list of regions. You can choose not to use some strategies, use the function
where annotation
, naive
, hmm
and slice
are logical variables, set to FALSE
if you want to skip that step.
The selected regions are then quantified using of the summarizeOverlaps
function of the GenomicAlignments package. Notice that a read can overlap two different regions (e.g. extracted from the naive and the slicing methods), and thus can be counted twice for the quantification.
You can adjust the minimum number of overlapping nucleotides between a read and a region to declare a hit, using:
DESeq2 is then used to get the adjusted p-values of these regions.
The regions, with corresponding information provided by DESeq2 (mean expression, fold-change, p-value, adjusted p-value, etc.), can be extracted with this command:
where pvalue
is the (adjusted) p-value threshold. The output in a GenomicRanges object, and the information is accessible with the mcols()
function.
You can get the read coverage profile of your selected regions using the plotRegion
method. For instance, if exp
is your srnadiff
object, you can type:
Each line provides the read density on an input sample. The dark grey rectangle emphasizes the selected region.
The conditions can be sorted (WT
before mutant
, untreated
before treated
) this way:
The quantification and differential expression steps can be accelerated using several cores and the following command:
While installing the package, if the compiler complains and says
#error This file requires compiler and library support for the ISO C++ 2011 standard.
This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
Add this line
Sys.setenv("PKG_CXXFLAGS"="-std=c++11")
before installing the package.
## setting value
## version R version 3.5.0 (2018-04-23)
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate C
## tz America/New_York
## date 2018-04-30
##
## package * version date source
## acepack 1.4.1 2016-10-29 CRAN (R 3.5.0)
## annotate 1.58.0 2018-04-30 Bioconductor
## AnnotationDbi 1.42.0 2018-04-30 Bioconductor
## assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
## backports 1.1.2 2017-12-13 CRAN (R 3.5.0)
## base * 3.5.0 2018-04-27 local
## base64enc 0.1-3 2015-07-28 CRAN (R 3.5.0)
## Biobase 2.40.0 2018-04-30 Bioconductor
## BiocGenerics 0.26.0 2018-04-30 Bioconductor
## BiocParallel 1.14.0 2018-04-30 Bioconductor
## BiocStyle * 2.8.0 2018-04-30 Bioconductor
## biomaRt 2.36.0 2018-04-30 Bioconductor
## Biostrings 2.48.0 2018-04-30 Bioconductor
## bit 1.1-12 2014-04-09 CRAN (R 3.5.0)
## bit64 0.9-7 2017-05-08 CRAN (R 3.5.0)
## bitops 1.0-6 2013-08-17 CRAN (R 3.5.0)
## blob 1.1.1 2018-03-25 CRAN (R 3.5.0)
## checkmate 1.8.5 2017-10-24 CRAN (R 3.5.0)
## cluster 2.0.7-1 2018-04-13 CRAN (R 3.5.0)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0)
## compiler 3.5.0 2018-04-27 local
## data.table 1.10.4-3 2017-10-27 CRAN (R 3.5.0)
## datasets * 3.5.0 2018-04-27 local
## DBI 0.8 2018-03-02 CRAN (R 3.5.0)
## DelayedArray 0.6.0 2018-04-30 Bioconductor
## DESeq2 1.20.0 2018-04-30 Bioconductor
## devtools 1.13.5 2018-02-18 CRAN (R 3.5.0)
## digest 0.6.15 2018-01-28 CRAN (R 3.5.0)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.5.0)
## foreign 0.8-70 2017-11-28 CRAN (R 3.5.0)
## Formula 1.2-2 2017-07-10 CRAN (R 3.5.0)
## genefilter 1.62.0 2018-04-30 Bioconductor
## geneplotter 1.58.0 2018-04-30 Bioconductor
## GenomeInfoDb 1.16.0 2018-04-30 Bioconductor
## GenomeInfoDbData 1.1.0 2018-04-27 Bioconductor
## GenomicAlignments 1.16.0 2018-04-30 Bioconductor
## GenomicFeatures 1.32.0 2018-04-30 Bioconductor
## GenomicRanges 1.32.0 2018-04-30 Bioconductor
## ggplot2 2.2.1 2016-12-30 CRAN (R 3.5.0)
## graphics * 3.5.0 2018-04-27 local
## grDevices * 3.5.0 2018-04-27 local
## grid 3.5.0 2018-04-27 local
## gridExtra 2.3 2017-09-09 CRAN (R 3.5.0)
## gtable 0.2.0 2016-02-26 CRAN (R 3.5.0)
## Hmisc 4.1-1 2018-01-03 CRAN (R 3.5.0)
## htmlTable 1.11.2 2018-01-20 CRAN (R 3.5.0)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0)
## htmlwidgets 1.2 2018-04-19 CRAN (R 3.5.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
## IRanges 2.14.0 2018-04-30 Bioconductor
## knitr * 1.20 2018-02-20 CRAN (R 3.5.0)
## labeling 0.3 2014-08-23 CRAN (R 3.5.0)
## lattice 0.20-35 2017-03-25 CRAN (R 3.5.0)
## latticeExtra 0.6-28 2016-02-09 CRAN (R 3.5.0)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0)
## locfit 1.5-9.1 2013-04-20 CRAN (R 3.5.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
## Matrix 1.2-14 2018-04-13 CRAN (R 3.5.0)
## matrixStats 0.53.1 2018-02-11 CRAN (R 3.5.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
## methods * 3.5.0 2018-04-27 local
## munsell 0.4.3 2016-02-13 CRAN (R 3.5.0)
## nnet 7.3-12 2016-02-02 CRAN (R 3.5.0)
## parallel 3.5.0 2018-04-27 local
## pillar 1.2.2 2018-04-26 CRAN (R 3.5.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.5.0)
## prettyunits 1.0.2 2015-07-13 CRAN (R 3.5.0)
## progress 1.1.2 2016-12-14 CRAN (R 3.5.0)
## R6 2.2.2 2017-06-17 CRAN (R 3.5.0)
## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.5.0)
## Rcpp 0.12.16 2018-03-13 CRAN (R 3.5.0)
## RCurl 1.95-4.10 2018-01-04 CRAN (R 3.5.0)
## reshape2 1.4.3 2017-12-11 CRAN (R 3.5.0)
## rlang 0.2.0 2018-02-20 CRAN (R 3.5.0)
## rmarkdown * 1.9 2018-03-01 CRAN (R 3.5.0)
## rpart 4.1-13 2018-02-23 CRAN (R 3.5.0)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0)
## Rsamtools 1.32.0 2018-04-30 Bioconductor
## RSQLite 2.1.0 2018-03-29 CRAN (R 3.5.0)
## rstudioapi 0.7 2017-09-07 CRAN (R 3.5.0)
## rtracklayer 1.40.0 2018-04-30 Bioconductor
## S4Vectors 0.18.0 2018-04-30 Bioconductor
## scales 0.5.0 2017-08-24 CRAN (R 3.5.0)
## splines 3.5.0 2018-04-27 local
## srnadiff * 1.0.0 2018-05-01 Bioconductor
## stats * 3.5.0 2018-04-27 local
## stats4 3.5.0 2018-04-27 local
## stringi 1.1.7 2018-03-12 CRAN (R 3.5.0)
## stringr 1.3.0 2018-02-19 CRAN (R 3.5.0)
## SummarizedExperiment 1.10.0 2018-04-30 Bioconductor
## survival 2.42-3 2018-04-16 CRAN (R 3.5.0)
## tibble 1.4.2 2018-01-22 CRAN (R 3.5.0)
## tools 3.5.0 2018-04-27 local
## utils * 3.5.0 2018-04-27 local
## withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
## XML 3.98-1.11 2018-04-16 CRAN (R 3.5.0)
## xtable 1.8-2 2016-02-05 CRAN (R 3.5.0)
## XVector 0.20.0 2018-04-30 Bioconductor
## yaml 2.1.18 2018-03-08 CRAN (R 3.5.0)
## zlibbioc 1.26.0 2018-04-30 Bioconductor
Kozomara, Ana, and Sam Griffiths-Jones. 2014. “Base: annotating high confidence microRNAs using deep sequencing data.” NAR 42:D68–D73.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12):550.
Salzberg, Steven, and Ben Langmead. 2012. “Fast gapped-read alignment with Bowtie 2.” Nature Methods 9:357–59.
Viollet, Coralie, David A. Davis, Martin Reczko, Joseph M. Ziegelbauer, Francesco Pezzella, Jiannis Ragoussis, and Robert Yarchoan. 2015. “Next-Generation Sequencing Analysis Reveals Differential Expression Profiles of Mirna-mRNA Target Pairs in Kshv-Infected Cells.” PLOS ONE 10:1–23.