true

MPRAnalyze aims to infer the transcription rate induced by each enhacer in a Massively Parallel Reporter Assay (MPRA). MPRAnalyze uses a parametric graphical model that enables direct modeling of the observed count data, without the need to use ratio-based summary statistics, and provides robust statistical methodology that addresses all major uses of MPRA.

An MPRA experiment is made up of two matching datasets: the RNA counts used to estimate transcription, and the DNA counts used to normalize by copy-number. MPRAnalyze assumes the input is unnormalized count data. Normalization is achieved by external factors (to correct library size) and regressing out confounding factors (by carefully designing the model).

Throughout this vignette, we’ll be using a subset of enhancers from Inoue et al., that examined a set of enhancers that were tranduced and remained episomal, and after being genomically integrated. We’ll be using a subset of the data, both in terms of number of enhancers and number of barcodes, for runtime purposes.

MPRanalyze expects the input to be provided as two matrices: one for the DNA and one for the RNA counts. The formatting is fairly straightforward: each row is a single enhancer, and each column is an observation. Annotations for the columns are provided for each matrix to identify batches, barcodes and conditions. When formatting the data, note that all enhancers must have the same number of columns. In the case of missing data, padding with 0s is recommended.

```
library(MPRAnalyze)
data("ChrEpi")
summary(ce.colAnnot)
```

```
## batch condition barcode
## 1:20 MT:20 1 : 4
## 2:20 WT:20 2 : 4
## 3 : 4
## 4 : 4
## 5 : 4
## 6 : 4
## (Other):16
```

`head(ce.colAnnot)`

```
## batch condition barcode
## MT:1:001 1 MT 1
## MT:1:002 1 MT 2
## MT:1:003 1 MT 3
## MT:1:004 1 MT 4
## MT:1:005 1 MT 5
## MT:1:006 1 MT 6
```

In the filtered dataset we have 110 enhancers, 40 observations each: 10 unique barcodes, 2 replicates and 2 conditions (MT stands for episomal, WT for chromosomally integrated). Note that while this datset is “paired”, and therefore the dimensionality of the two matrices and the annotations are identical, this is not always the case, and separate data frames can be used for the DNA and RNA annotations.

The MpraObject is the basic class that the package works with, and is very easy to initialize once the data is properly formatted. In addition to the data itself, the user can specify certain enhancers as “controls” (usaully scrambled, random sequences included in the experiment). These will be used by MPRAnalyze to establish the null behavior. Additionally, MPRAnalyze uses parallelization to reduce runtime, with the BiocParallel package. To utilize this, the user can create a BPPARAM object and pass it to the MpraObject, it will be used throughout the analysis.

```
obj <- MpraObject(dnaCounts = ce.dnaCounts, rnaCounts = ce.rnaCounts,
dnaAnnot = ce.colAnnot, rnaAnnot = ce.colAnnot,
controls = ce.control)
```

```
## Warning in MpraObject(dnaCounts = ce.dnaCounts, rnaCounts = ce.rnaCounts, : 1
## enhancers were removed from the analysis
```

Note that since we are using only a subset of the data, one of the enhancers included was not detected (all observations are 0). If this is the case, MPRAnalyze removes it from the analysis. In datasets with many zeros, it is possible to add pseudo-counts. With MPRA this should be done carefully, and we recommend only adding pseudocounts in cases where the RNA counts are positive but the DNA counts are 0 (these are the only cases we know this is a false 0).

MPRAnalyze allows for external factors to be used for normalization, espacially useful for library depth correction. These factors can be estimated automatically using upper quartile (default), total sum, or DESeq2-style harmonic mean normalization. If other factors are to be used, the user can provide them directly using the `setDepthFactors`

function, and providing correction factors for the RNA and DNA counts (length of the factors must be the same as the number of columns in the respective data matrix).
Note that unlike other genomic data, in which every column is a separate library, with MPRA a library is often multiple columns, since multiple barcodes can originate from a single library. For this reason, automatic estimation of library size requires the user to specify what columns belong to what library. This can be done easily by providing the names of factors from the annotations data.frame that identify library (this can be a single factor or multiple factors).

```
## If the library factors are different for the DNA and RNA data, separate
## estimation of these factors is needed. We can also change the estimation
## method (Upper quartile by default)
obj <- estimateDepthFactors(obj, lib.factor = c("batch", "condition"),
which.lib = "dna",
depth.estimator = "uq")
obj <- estimateDepthFactors(obj, lib.factor = c("condition"),
which.lib = "rna",
depth.estimator = "uq")
## In this case, the factors are the same - each combination of batch and
## condition is a single library, and we'll use the default estimation
obj <- estimateDepthFactors(obj, lib.factor = c("batch", "condition"),
which.lib = "both")
```

MPRAnalyze fits two nested generalized linear models. The two models have a conceptually different role: the DNA model is estimating plasmid copy numbers, and the RNA model is estimating transcription rate. Different factors should therefore be included or not included in either model, and the nested nature of the overall model requires careful thinking about the model design, and often using a different design for the DNA and the RNA model. Two common considerations are covered here:

In some MPRA experiments, the DNA counts originate from pre-transduction plasmid libraries. In these cases, multiple replicates may be available, but they are independant of the multiple RNA replicates. Therefore, while there may be batch effect present in the DNA data, this effect cannot be transferred into the RNA model, and should be discarded. By not including it, the DNA estimates will essentially be averaged over replicates, but the multiple replicates will still be used to estimate the dispersion. Essentially - any factor that cannot or should not be carried from the DNA to the RNA model, should not be included in the DNA model at all.

While replicate and condition factors are not always transferrable, barcode information - if available - is. Including barcode information in the DNA model allows MPRAnalyze to provide different estimated counts for each barcode, and dramatically increases the statistical power of the model. This means that barcodes should be modeled in the DNA model, but not necessarily in the RNA model. Modeling barcode effect in the RNA model essentially means different transcription rate estimates will be calculated for each different barcodes of the same enhancer, which is not usually desired. In quantification analyses, this would make comparing enhancers to eachother exceedingly complicated, since unlike batch- or condition- effects, barcodes are not comparable between enhancers. In compartive analyses, while modeling barcodes isn’t conceptually problematic, in practice this could result in overfitting the model and inlfating the results. Broadly, barcode factors should be included in the DNA model design and ignored in the RNA model design.

Quantification analysis is addressing the question of what is the transription rate for each enhancer in the dataset. These estimates can then be used to identify and classify active enhancers that induce a higher transcription rate.
Regarding model design - this data is from a paired experiment, so DNA factors are fully transferable to the RNA model. For the RNA, we will be interested in having a separate estimate of transcription rate for each condition (chromosomal and episomal), so this is the only factor included in the RNA model.
Finally, fitting the model is done by calling the `analyzeQuantification`

function:

```
obj <- analyzeQuantification(obj = obj,
dnaDesign = ~ barcode + batch + condition,
rnaDesign = ~ condition)
```

`## Fitting model...`

`## Analysis done!`

We can now extract the transcription rate estimates from the model, denoted ‘alpha values’ in the MPRAnalyze model, and use the testing functionality to test for activtiy.
extracting alpha values is done with the `getAlpha`

function, that will provide separate values per-factor if a factor is provided. In this case we want a separate alpha estimate by condition:

```
##extract alpha values from the fitted model
alpha <- getAlpha(obj, by.factor = "condition")
##visualize the estimates
boxplot(alpha)
```

We can also leverage negative controls included in the data to establish a baseline rate of transcription, and use that to test for activty among the candidate enhancers. The `testEmpirical`

function provides several statistics and p-values based on those statistics: empirical p-values, z-score based and MAD-score based p-values are currently supported.
In most cases, we recommend using the MAD-score pvalues, which are median-based variant of z-scores, which makes them more robust to outliers.

```
## test
res.epi <- testEmpirical(obj = obj,
statistic = alpha$MT)
summary(res.epi)
```

```
## statistic control zscore mad.score
## Min. : 2.947 Mode :logical Min. :-3.044 Min. :-2.5859
## 1st Qu.: 5.223 FALSE:99 1st Qu.: 0.285 1st Qu.: 0.3007
## Median : 5.962 TRUE :10 Median : 1.367 Median : 1.2388
## Mean : 6.143 Mean : 1.632 Mean : 1.4684
## 3rd Qu.: 6.757 3rd Qu.: 2.530 3rd Qu.: 2.2473
## Max. :14.018 Max. :13.153 Max. :11.4583
## pval.mad pval.zscore pval.empirical
## Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.01231 1st Qu.:0.005703 1st Qu.:0.0000
## Median :0.10770 Median :0.085820 Median :0.1000
## Mean :0.23924 Mean :0.240589 Mean :0.2339
## 3rd Qu.:0.38180 3rd Qu.:0.387815 3rd Qu.:0.4000
## Max. :0.99514 Max. :0.998834 Max. :1.0000
```

```
res.chr <- testEmpirical(obj = obj,
statistic = alpha$WT)
par(mfrow=c(2,2))
hist(res.epi$pval.mad, main="episomal, all")
hist(res.epi$pval.mad[ce.control], main="episomal, controls")
hist(res.chr$pval.mad, main="chromosomal, all")
hist(res.chr$pval.mad[ce.control], main="chromosomal, controls")
```

`par(mfrow=c(1,1))`

P-values seem well calibrated, getting a uniform distribution for inactive enhancers, and enrichment for low p-values with the active enhancers.

MPRAnalyze also supports comparative analyses, in this case: identifying enhancers that are differentially active between conditions. While we can do this indirectly by taking the quantification results and identify enhancers that are active in one condition but not the other, a direct compartive analysis is more sensitive, and allows identification of enhancers that are more or less active, avoiding the binarization of activity. MPRAnalyze also leverages negative controls to estbalish the null differential behavior, thereby correcting any systemic bias that may be present in the data. In terms of syntax, this analysis is done very similarly to quantification, with an additional reduced model that describes the null hypothesis. In this case, the null hypothesis is no differential activtiy between conditions, so the reduced model is an empty model (intercept only)

```
obj <- analyzeComparative(obj = obj,
dnaDesign = ~ barcode + batch + condition,
rnaDesign = ~ condition,
reducedDesign = ~ 1)
```

`## Fitting controls-based background model...`

`## iter:2 log-likelihood:-20076.1798347277`

`## iter:3 log-likelihood:-19664.0415669505`

`## iter:4 log-likelihood:-19299.2767542567`

`## iter:5 log-likelihood:-19025.9445481705`

`## iter:6 log-likelihood:-18871.5009668801`

`## iter:7 log-likelihood:-18808.6831629786`

`## iter:8 log-likelihood:-18788.2689244527`

`## iter:9 log-likelihood:-18782.0537783263`

`## iter:10 log-likelihood:-18780.251513757`

`## iter:11 log-likelihood:-18779.6489965293`

`## iter:12 log-likelihood:-18779.4839764466`

`## iter:13 log-likelihood:-18779.3791121402`

`## iter:14 log-likelihood:-18779.3487056302`

`## iter:15 log-likelihood:-18779.2944249404`

`## iter:16 log-likelihood:-18779.3425364527`

`## Fitting model...`

`## Fitting reduced model...`

`## Analysis Done!`

with the fitted model, we can now test for differential activity, by calling `testLrt`

`res <- testLrt(obj)`

`## Performing Likelihood Ratio Test...`

`head(res)`

```
## statistic
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013] 5.60981735
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784] 2.04112747
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092] 0.46927997
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497] 10.72607875
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525] 4.73566941
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716] 0.07637791
## pval
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013] 0.017860125
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784] 0.153096136
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092] 0.493318605
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497] 0.001056361
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525] 0.029543349
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716] 0.782267329
## fdr
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013] 0.18529365
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784] 0.47179202
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092] 0.79959406
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497] 0.02127219
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525] 0.24770962
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716] 0.92196683
## df.test df.dna
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013] 1 12
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784] 1 8
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092] 1 6
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497] 1 9
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525] 1 10
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716] 1 8
## df.rna.full
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013] 5
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784] 5
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092] 5
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497] 5
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525] 5
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716] 5
## df.rna.red
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013] 4
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784] 4
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092] 4
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497] 4
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525] 4
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716] 4
## logFC
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013] 0.24774232
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784] -0.43318950
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092] 0.24812480
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497] 0.48901596
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525] 0.31247839
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716] 0.06622427
```

`summary(res)`

```
## statistic pval fdr df.test
## Min. : 0.000441 Min. :0.0000002 Min. :0.0000172 Min. :1
## 1st Qu.: 0.146638 1st Qu.:0.0992155 1st Qu.:0.3862319 1st Qu.:1
## Median : 0.867634 Median :0.3516112 Median :0.6920815 Median :1
## Mean : 2.206074 Mean :0.4038146 Mean :0.6291653 Mean :1
## 3rd Qu.: 2.718108 3rd Qu.:0.7017694 3rd Qu.:0.9219668 3rd Qu.:1
## Max. :27.496768 Max. :0.9832551 Max. :0.9832551 Max. :1
## df.dna df.rna.full df.rna.red logFC
## Min. : 5.000 Min. :5 Min. :4 Min. :-0.49372
## 1st Qu.: 8.000 1st Qu.:5 1st Qu.:4 1st Qu.:-0.02296
## Median : 9.000 Median :5 Median :4 Median : 0.11256
## Mean : 9.349 Mean :5 Mean :4 Mean : 0.12336
## 3rd Qu.:11.000 3rd Qu.:5 3rd Qu.:4 3rd Qu.: 0.28139
## Max. :13.000 Max. :5 Max. :4 Max. : 0.83923
```

When the hypothesis teseting is simple (two-condition comparison), a fold-change estimate is also available:

```
## plot log Fold-Change
plot(density(res$logFC))
```

```
## plot volcano
plot(res$logFC, -log10(res$pval))
```