RIVER 1.6.0

`RIVER`

is an `R`

package of a probabilistic modeling framework, called
*RIVER (RNA-Informed Variant Effect on Regulation)*, that jointly analyzes
personal genome (WGS) and transcriptome data (RNA-seq) to estimate the
probability that a variant has regulatory impact in that individual.
It is based on a generative model that assumes that genomic annotations,
such as the location of a variant with respect to regulatory elements,
determine the prior probability that variant is a functional regulatory
variant, which is an unobserved variable. The functional regulatory variant
status then influences whether nearby genes are likely to display outlier
levels of gene expression in that person.

*RIVER* is a hierarchical Bayesian model that predicts the regulatory
effects of rare variants by integrating gene expression with genomic
annotations. The *RIVER* consists of three layers: a set of nodes
\(G = G_{1}, ..., G_{P}\) in the topmost layer representing \(P\) observed
genomic annotations over all rare variants near a particular gene,
a latent binary variable \(FR\) in the middle layer representing the unobserved
funcitonal regulatory status of the rare variants, and one binary node \(E\)
in the final layer representing expression outlier status of the nearby gene.
We model each conditional probability distribution as follows:
\[ FR | G \sim Bernoulli(\psi), \psi = logit^{-1}(\beta^T, G) \]
\[E | FR \sim Categorical(\theta_{FR}) \]
\[ \beta_i \sim N(0, \frac{1}{\lambda})\]
\[\theta_{FR} \sim Beta(C,C)\]
with parameters \(\beta\) and \(\theta\) and
hyper-parameters \(\lambda\) and \(C\).

Because \(FR\) is unobserved, log-likelihood objective
of *RIVER* over instances \(n = 1, ..., N\),
\[
\sum_{n=1}^{N} log\sum_{FR_n= 0}^{1} P(E_n, G_n, FR_n | \beta, \theta),
\]
is non-convex. We therefore optimize model parameters via
Expectation-Maximization (EM) as follows:

In the E-step, we compute the posterior probabilities (\(\omega_n^{(i)}\)) of the latent variables \(FR_n\) given current parameters and observed data. For example, at the \(i\)-th iteration, the posterior probability of \(FR_n = 1\) for the \(n\)-th instance is \[ \omega_{1n}^{(i)} = P(FR_n = 1 | G_n, \beta^{(i)}, E_n, \theta^{(i)}) =\frac{P(FR_n = 1 | G_n, \beta^{(i)}) \cdotp P(E_n | FR_n = 1, \theta^{(i)})}{\sum_{FR_n = 0}^1 P(FR_n | G_n, \beta^{(i)}) \cdotp P(E_n | FR_n, \theta^{(i)})}, \] \[\omega_{0n}^{(i)} = 1 - \omega_{1n}^{(i)}.\]

In the M-step, at the \(i\)-th iteration, given the current estimates \(\omega^{(i)}\), the parameters (\(\beta^{(i+1)*}\)) are estimated as \[ \max_{\beta^{(i+1)}} \sum_{n = 1}^N \sum_{FR_n = 0}^1 log(P(FR_n | G_n, \beta^{(i+1)})) \cdotp \omega_{FR, n}^{(i)} - \frac{\lambda}{2}||\beta^{(i+1)}||_2, \] where \(\lambda\) is an L2 penalty hyper-parameter derived from the Gaussian prior on \(\beta\). The parameters \(\theta\) get updated as: \[ \theta_{s,t}^{(i+1)} = \sum_{n = 1}^{N} I(E_n = t) \cdotp \omega_{s,n}^{(i)} + C. \] where \(I\) is an indicator operator, \(t\) is the binary value of expression \(E_n\), \(s\) is the possible binary values of \(FR_n\) and \(C\) is a pseudo count derived from the Beta prior on . The E- and M-steps are applied iteratively until convergence.

The purpose of this section is to provide users a general sense
of our package, `RIVER`

, including components and their behaviors
and applications. We will briefly go over main functions,
observe basic operations and corresponding outcomes.
Throughout this section, users may have better ideas about which functions
are available, which values to be chosen for necessary parameters, and
where to seek help. More detailed descriptions are given in later sections.

First, we load `RIVER`

:

`library("RIVER")`

`RIVER`

consists of several functions mainly supporting two main functions
including `evaRIVER`

and `appRIVER`

, which we are about to show how to use
them here. We first load simulated data created beforehand for illustration.

```
filename <- system.file("extdata", "simulation_RIVER.gz",
package="RIVER")
dataInput <- getData(filename) # import experimental data
```

`getData`

combines different resources including genomic features,
outlier status of gene expression, and N2 pairs having same rare variants
into standardized data structures, called `ExpressionSet`

class.

`print(dataInput)`

```
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 18 features, 6122 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: indiv58:gene1614 indiv5:gene1331 ... indiv18:gene1111
## (6122 total)
## varLabels: Outlier N2pair
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
```

```
Feat <- t(Biobase::exprs(dataInput)) # genomic features (G)
Out <- as.numeric(unlist(dataInput$Outlier))-1 # outlier status (E)
```

In the simulated data, an input object `dataInput`

consists of
18 genomic features and expression outlier status of 6122 samples and
which samples belong to N2 pairs.

`head(Feat)`

```
## Feature1 Feature2 Feature3 Feature4 Feature5
## indiv58:gene1614 0.4890338 -1.2143183 -0.5015241 -0.8093881 -0.1916957
## indiv5:gene1331 -4.1665487 -0.6355490 -1.0914989 -2.8069442 -0.3921658
## indiv76:gene447 -0.4469724 -0.9440314 0.7386942 0.6786997 0.6043647
## indiv16:gene126 1.6252083 -2.2686506 1.4156013 -0.5662072 -0.1405224
## indiv45:gene1044 0.3086791 1.0044798 0.8493856 0.3515569 -0.3763434
## indiv6:gene258 0.9046627 1.7618399 -1.6258895 0.3961724 -0.2203089
## Feature6 Feature7 Feature8 Feature9 Feature10
## indiv58:gene1614 -0.06614698 0.1993153 -0.7544253 -1.3505036 -0.03477715
## indiv5:gene1331 -0.43831006 1.8175888 -0.8411557 2.1495237 -0.48691686
## indiv76:gene447 1.47627930 0.6521233 -0.6416004 1.0309900 -0.38262446
## indiv16:gene126 2.11676989 1.0670951 0.3404799 -1.5970916 -0.49910751
## indiv45:gene1044 1.45269690 -0.8368466 -1.0016646 0.1908291 -0.26598159
## indiv6:gene258 -0.76821018 -1.3436283 -0.7201516 0.5440035 0.31006097
## Feature11 Feature12 Feature13 Feature14 Feature15
## indiv58:gene1614 2.08675352 0.7283183 0.15074710 1.5183579 0.2226134
## indiv5:gene1331 2.57339374 0.4840111 0.58897093 -0.2069596 0.1000502
## indiv76:gene447 0.30621172 0.6690835 -1.39701213 1.4853201 1.6552545
## indiv16:gene126 0.58599041 -0.7052013 -0.07715282 -1.3326831 0.1719152
## indiv45:gene1044 1.45029299 -0.4530850 -0.09610983 1.4731617 -0.9372256
## indiv6:gene258 0.05174989 0.4026417 -0.50911462 -0.3623563 -2.5279770
## Feature16 Feature17 Feature18
## indiv58:gene1614 1.88846321 -0.4408236 -0.4558111
## indiv5:gene1331 0.22640429 1.2535793 -1.0163254
## indiv76:gene447 0.71720775 0.2758493 1.1378452
## indiv16:gene126 1.12919669 0.6166061 -1.6069772
## indiv45:gene1044 0.14794114 1.2241711 1.4916670
## indiv6:gene258 -0.01574568 -1.5835940 -1.2366762
```

`head(Out)`

`## [1] 1 1 1 0 1 1`

`Feat`

contains continuous values of genomic features (defined as \(G\)
in the objective function) while `Out`

contains binary values
representing outlier status of same samples as `Feat`

(defined as \(E\)
in the objective function).

For evaluation, we hold out pairs of individuals at genes
where only those two individuals shared the same rare variants.
Except for the list of instances, we train *RIVER* with the rest of instances,
compute the *RIVER* score (the posterior probability of having a functional
regulatory variant given both WGS and RNA-seq data) from one individual,
and assess the accuracy with respect to the second individualâ€™s held-out
expression levels. Since there is currently quite few gold standard set of
functional rare variants, using this labeled test data allow us
to evaluate predictive accuracy of *RIVER* compared with genomic annotation model,
*GAM*, that uses genomic annotations alone. We can observe
a significant improvement by incorporating expression data.

To do so, we simply use `evaRIVER`

:

`evaROC <- evaRIVER(dataInput)`

```
## ::: EM iteration is terminated since it converges within a
## predefined tolerance (0.001) :::
```

`evaROC`

is an S4 object of class `evaRIVER`

which contains
two AUC values from *RIVER* and *GAM*, specificity and sensitivity
measures from the two models, and p-value of comparing the two AUC values.

`cat('AUC (GAM-genomic annotation model) = ', round(evaROC$GAM_auc,3), '\n')`

`## AUC (GAM-genomic annotation model) = 0.58`

`cat('AUC (RIVER) = ', round(evaROC$RIVER_auc,3), '\n')`

`## AUC (RIVER) = 0.8`

`cat('P-value ', format.pval(evaROC$pvalue, digits=2, eps=0.001), '***\n')`

`## P-value <0.001 ***`

We can visualize the ROC curves with AUC values:

```
par(mar=c(6.1, 6.1, 4.1, 4.1))
plot(NULL, xlim=c(0,1), ylim=c(0,1),
xlab="False positive rate", ylab="True positive rate",
cex.axis=1.3, cex.lab=1.6)
abline(0, 1, col="gray")
lines(1-evaROC$RIVER_spec, evaROC$RIVER_sens,
type="s", col='dodgerblue', lwd=2)
lines(1-evaROC$GAM_spec, evaROC$GAM_sens,
type="s", col='mediumpurple', lwd=2)
legend(0.7,0.2,c("RIVER","GAM"), lty=c(1,1), lwd=c(2,2),
col=c("dodgerblue","mediumpurple"), cex=1.2,
pt.cex=1.2, bty="n")
title(main=paste("AUC: RIVER = ", round(evaROC$RIVER_auc,3),
", GAM = ", round(evaROC$GAM_auc,3),
", P = ", format.pval(evaROC$pvalue, digits=2,
eps=0.001),sep=""))
```

Each ROC curve from either *RIVER* or *GAM* is computed by
comparing the posterior probability given available data
for the 1st individual with the outlier status of the 2nd individual
in the list of held-out pairs and vice versa.

To extract posterior probabilities for prioritizing functional rare variants
in any downstream analysis such as finding pathogenic rare variants in disease,
you simply run `appRIVER`

to obtain the posterior probabilities:

`postprobs <- appRIVER(dataInput)`

```
## ::: EM iteration is terminated since it converges within a
## predefined tolerance (0.001) :::
```

`postprobs`

is an S4 object of class `appRIVER`

which contains
subject IDs, gene names, \(P(FR = 1 | G)\), \(P(FR = 1 | G, E)\), and `fitRIVER`

,
all the relevant information of the fitted *RIVER* including hyperparamters
for further use.

Probabilities of rare variants being functional from *RIVER* and *GAM*
for a few samples are shown below:

```
example_probs <- data.frame(Subject=postprobs$indiv_name,
Gene=postprobs$gene_name,
RIVERpp=postprobs$RIVER_posterior,
GAMpp=postprobs$GAM_posterior)
head(example_probs)
```

```
## Subject Gene RIVERpp GAMpp
## 1 indiv58 gene1614 0.16954585 0.2358697
## 2 indiv5 gene1331 0.03134497 0.1837973
## 3 indiv76 gene447 0.57821845 0.3479624
## 4 indiv16 gene126 0.03634984 0.2525892
## 5 indiv45 gene1044 0.29412286 0.2429425
## 6 indiv6 gene258 0.10924045 0.1505262
```

From left to right, it shows subject ID, gene name, posterior probabilities
from *RIVER*, posterior probabilities from *GAM*.

To observe how much we can obtain additional information on functional
rare variants by integrating the outlier status of gene expression
into *RIVER* in the following figure.

`plotPosteriors(postprobs, outliers=as.numeric(unlist(dataInput$Outlier))-1)`