The first step in the HMM method is to quantify the evidence for differential
expression at base resolution. To do this, srnadiff uses
the common approach in comparative analysis of transcriptomics data: it tests
the null hypothesis that the logarithmic fold change in expression between
condition groups for a nucleotide is exactly zero.
The next step in the HMM approach enforces a smoothness assumption over the
state of nucleotides: differential expression does not switch randomly along
the chromosome; instead, contiguous regions of RNA are either “differentially
expressed” or “not”. This is captured with a hidden Markov model (HMM) with a
binary latent state corresponding to the true state of each nucleotide:
differentially expressed or not differentially expressed.
The observations of the HMM are then the empirical p-values arising from the
differential expression analysis corresponding to each nucleotide position.
Modelling p-values directly enables us to define the emission of each state
as follows: the differentially expressed state emits a p-value \(< t\) with
probability \(p\), and the not differentially expressed state emits a p-value
\(\geqslant t\) with the same probability \(p\) (so each state emits the
opposite outcome with probability \(1-p\)), where \(t\) is a real number
between 0 and 1.
The HMM approach requires emission, transition, and starting
probabilities. These can be tuned by the user according to the
overall p-values from the differential analysis. We then run the Viterbi
algorithm [ref] to find the most likely sequence of states
from the HMM. This essentially segments the genome into regions, where
a region is defined as a set of consecutive bases showing a common expression
signature. A region of bases in the differentially expressed state is referred
to as an expressed region and is given as output of the method.
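The decoding step described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the srnadiff implementation: the function name `viterbi_two_state` is hypothetical, the starting probabilities and all numeric values are illustrative, and the emission matrix assumes that each state emits its "own" outcome with probability `emission` and the opposite outcome with probability `1 - emission`.

```r
# Minimal two-state Viterbi sketch over thresholded p-values (illustrative only).
viterbi_two_state <- function(pvalues, t, emission, noDiffToDiff, diffToNoDiff,
                              start = c(0.5, 0.5)) {
  stopifnot(length(pvalues) > 0)
  # Observed symbol per base: 1 = p-value >= t, 2 = p-value < t
  obs <- ifelse(pvalues < t, 2L, 1L)
  n <- length(obs)
  # State 1 = not differentially expressed, state 2 = differentially expressed
  logTrans <- log(matrix(c(1 - noDiffToDiff, noDiffToDiff,
                           diffToNoDiff, 1 - diffToNoDiff),
                         nrow = 2, byrow = TRUE))
  # logEmis[state, symbol]; assumed symmetric emission model
  logEmis <- log(matrix(c(emission, 1 - emission,
                          1 - emission, emission),
                        nrow = 2, byrow = TRUE))
  score <- matrix(-Inf, nrow = 2, ncol = n)  # best log-probability per state
  back <- matrix(0L, nrow = 2, ncol = n)     # best predecessor per state
  score[, 1] <- log(start) + logEmis[, obs[1]]
  for (i in seq_len(n)[-1]) {
    for (s in 1:2) {
      cand <- score[, i - 1] + logTrans[, s]
      back[s, i] <- which.max(cand)
      score[s, i] <- max(cand) + logEmis[s, obs[i]]
    }
  }
  # Trace back the most likely state sequence from the last base
  path <- integer(n)
  path[n] <- which.max(score[, n])
  for (i in rev(seq_len(n))[-1]) {
    path[i] <- back[path[i + 1], i + 1]
  }
  path == 2L  # TRUE where the base is called differentially expressed
}

# Toy p-values: a run of small p-values flanked by large ones
pvals <- c(rep(0.8, 5), rep(0.01, 5), rep(0.8, 5))
calls <- viterbi_two_state(pvals, t = 0.1, emission = 0.9,
                           noDiffToDiff = 0.1, diffToNoDiff = 0.1)
```

In this toy setting the transition probabilities are deliberately large so that a run of five bases is enough to trigger a switch. With much smaller transition probabilities, short runs of small p-values would be absorbed into the surrounding state, which is the smoothing effect the HMM is meant to provide.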
To run the HMM approach, srnadiff first forms a large matrix, with
rows corresponding to bases, columns corresponding to samples, and entries
giving the coverage of a nucleotide in a particular sample. This count matrix
is then analyzed as feature-level counts using the feature-level RNA-seq
differential expression analysis from DESeq2. In practice, the
p-value is not computed for every nucleotide. Nucleotides for which the sum
of the coverage across all samples is less than a threshold are given a
p-value of 1, because these poorly expressed bases are unlikely to provide a
differentially expressed sRNA.
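The filtering logic above can be illustrated with a small base R sketch. The coverage values, the threshold name `minDepth`, and the sample grouping are all hypothetical. The per-base test used here is a plain Welch t-test on log2 coverage, chosen only so the example runs without extra packages; it is a stand-in and not the DESeq2 analysis that srnadiff uses.

```r
# Hypothetical coverage matrix: rows are bases, columns are samples.
coverage <- matrix(c( 0,  1,  0,  2,
                     25, 30, 80, 95,
                     40, 38, 41, 45,
                      1,  0,  0,  0),
                   ncol = 4, byrow = TRUE,
                   dimnames = list(paste0("base", 1:4),
                                   c("ctrl1", "ctrl2", "trt1", "trt2")))
group <- c("ctrl", "ctrl", "trt", "trt")
minDepth <- 10  # assumed threshold on the coverage summed across samples

# Only bases whose total coverage reaches the threshold are tested.
tested <- rowSums(coverage) >= minDepth

# Every base starts with a p-value of 1; tested bases are then overwritten.
pvalues <- rep(1, nrow(coverage))
names(pvalues) <- rownames(coverage)

# Stand-in test, NOT DESeq2: Welch t-test on log2 coverage between conditions.
pvalues[tested] <- vapply(which(tested), function(i) {
  x <- log2(coverage[i, ] + 1)
  t.test(x[group == "ctrl"], x[group == "trt"])$p.value
}, numeric(1))
```

Assigning 1 up front means the poorly covered bases never reach the test at all, which also keeps the number of tests small.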
The parameters for the HMM method are:
noDiffToDiff: initial transition probability from the
“not differentially expressed” state to the “differentially expressed” state.
diffToNoDiff: initial transition probability from the
“differentially expressed” state to the “not differentially expressed” state.
emission: probability of emitting a p-value \(<t\) in
the “differentially expressed” state, and a p-value \(\geq t\) in the
“not differentially expressed” state.
emissionThreshold: the threshold \(t\) on the p-value that delimits the emissions of each state.
These parameters can be changed using the assignment function
parameters(srnaExp) <- list(noDiffToDiff=0.01, emissionThreshold=0.2)
In the IR approach, for each base, the normalized coverage is averaged
across all samples within each condition. This generates a vector of
(normalized) mean coverage per condition. These two vectors are
then used to compute per-nucleotide log-ratios (in absolute value) across the
genome. On the computed log-ratios, the method uses a sliding
threshold h that runs across the log-ratio levels, identifying bases whose
log-ratio value is above h.
Regions of contiguous bases passing this threshold are then analyzed using an
adaptation of the Aumann and Lindell algorithm for the irreducibility property
(Aumann and Lindell 2003).
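The log-ratio and thresholding steps can be sketched as follows in base R. The coverage values, the pseudocount, the use of log base 2, and the list of threshold levels are assumptions made for illustration. The sketch stops at the thresholding step and does not attempt the irreducibility analysis.

```r
# Hypothetical normalized coverage: rows are bases, columns are samples.
normCov <- matrix(c(10, 12,  11,  9,
                    10,  8,  40, 44,
                    12, 12, 100, 92,
                     9, 11,  38, 42,
                    10, 10,  10, 10),
                  ncol = 4, byrow = TRUE)
condition <- c("A", "A", "B", "B")

# Mean normalized coverage per base within each condition
meanA <- rowMeans(normCov[, condition == "A", drop = FALSE])
meanB <- rowMeans(normCov[, condition == "B", drop = FALSE])

# Absolute log-ratio per base; the pseudocount (assumed) avoids log(0)
pseudo <- 1
absLogRatio <- abs(log2((meanB + pseudo) / (meanA + pseudo)))

# Slide the threshold h over a few levels and record which bases pass
levels <- c(1, 2, 2.5)
passing <- lapply(levels, function(h) which(absLogRatio > h))
names(passing) <- paste0("h=", levels)
```

As h increases, the set of passing bases shrinks toward the positions with the strongest change, so each level yields progressively narrower candidate regions.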
The minimum sliding threshold, minLogFC, used in the IR method can
be changed using the assignment function
parameters(srnaExp) <- list(minLogFC=1)
This method is the simplest: given a fixed threshold h, contiguous
bases whose log-ratio (in absolute value) passes this threshold
are considered candidate differentially expressed regions.
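Turning a fixed threshold into regions amounts to finding runs of consecutive passing bases. A minimal base R sketch using run-length encoding is shown here; the log-ratio values are made up, and "passing" is assumed to mean strictly above the cutoff.

```r
# Hypothetical absolute log-ratios along consecutive bases
absLogRatio <- c(0.2, 1.8, 2.1, 1.6, 0.4, 0.3, 1.7, 1.9, 0.1)
cutoff <- 1.5

# Run-length encode the pass/fail indicator, then keep the passing runs
runs <- rle(absLogRatio > cutoff)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
regions <- data.frame(start = starts[runs$values],
                      end = ends[runs$values])
```

Each row of `regions` gives the first and last base position of one candidate region.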
The fixed threshold, cutoff, used in this method can be changed using
the assignment function
parameters(srnaExp) <- list(cutoff=1.5)