A comprehensive guide to using the mpra package for analyzing massively parallel reporter assays (MPRA).
mpra 1.0.0
The mpra package provides tools for the analysis of data from massively parallel reporter assays (MPRA). Specifically, it contains the functionality described in (Myint et al. 2017). The primary analysis purpose is to enable differential analysis of activity measures, but the package can also be used to generate precision weights useful in regression analyses of activity scores on sequence features. The main workhorse of the mpra package is the mpralm()
function which draws on the previously proposed voom framework for RNA-seq analysis (Law et al. 2014). In this document, we will be looking at MPRA data from a study comparing episomal and lentiviral versions of MPRA (Inoue et al. 2017).
If you are using this package, please cite (Myint et al. 2017). If you are using the mpralm()
function, it would be appropriate to also cite the voom framework (Law et al. 2014).
This document has the following dependencies
library(mpra)
In this package, MPRA data are contained in MPRASet
objects. Because MPRA data do not have a common prescribed format, these objects must be created manually. In this section, we demonstrate how to do this.
MPRASet
objects must contain DNA and RNA count information because this is the information used to quantify activity levels of the elements being assayed. DNA and RNA count information should be specified as \(K \times S\) integer count matrices where \(K\) is the total number of barcodes over all elements if barcode-level information is being supplied or the total number of putative regulatory elements (PREs) if element-level information is being supplied. \(S\) is the number of samples (typically, the number of independent transfections).
MPRASet
objects must also contain element identification information. This should be supplied as a character vector of length \(K\), the number of rows in the DNA and RNA count matrices. These are any strings used to describe/identify the unique PREs being assayed.
Optionally, the barcode sequences and PRE sequences can be specified as length \(K\) character vectors.
In the next sections we provide specific examples for how to specify this information for two common differential analysis settings: tissue and allele comparisons. Although we show simulated data, this information would typically be read from text files.
In tissue comparison studies, the same set of PREs is assayed in two or more cell types. In the following example, the experiment looks at four PREs with three barcodes each. Two tissues (liver and kidney) are studied, and each tissue has four replicates (four independent transfections each).
RNA and DNA count matrices would look as below:
E <- 4 # Number of elements
B <- 3 # Number of barcodes
s <- 4 # Samples per tissue
nt <- 2 # Number of tissues
set.seed(434)
rna <- matrix(rpois(E*B*s*nt, lambda = 30), nrow = E*B, ncol = s*nt)
dna <- matrix(rpois(E*B*s*nt, lambda = 30), nrow = E*B, ncol = s*nt)
rn <- as.character(outer(paste0("barcode_", seq_len(B), "_"), paste0("elem_", seq_len(E)), FUN = "paste0"))
cn <- c(paste0("liver_", seq_len(s)), paste0("kidney_", seq_len(s)))
rownames(rna) <- rn
rownames(dna) <- rn
colnames(rna) <- cn
colnames(dna) <- cn
rna
## liver_1 liver_2 liver_3 liver_4 kidney_1 kidney_2 kidney_3
## barcode_1_elem_1 31 26 36 18 27 33 28
## barcode_2_elem_1 33 28 29 32 28 33 34
## barcode_3_elem_1 22 43 25 41 28 34 30
## barcode_1_elem_2 28 27 31 20 37 33 24
## barcode_2_elem_2 32 31 32 21 38 22 27
## barcode_3_elem_2 35 36 28 38 26 23 30
## barcode_1_elem_3 30 34 26 34 40 26 26
## barcode_2_elem_3 37 27 37 29 25 29 38
## barcode_3_elem_3 34 30 20 26 30 35 33
## barcode_1_elem_4 28 27 34 26 29 30 25
## barcode_2_elem_4 31 32 28 36 31 24 29
## barcode_3_elem_4 24 28 34 31 37 29 34
## kidney_4
## barcode_1_elem_1 30
## barcode_2_elem_1 30
## barcode_3_elem_1 37
## barcode_1_elem_2 29
## barcode_2_elem_2 29
## barcode_3_elem_2 29
## barcode_1_elem_3 22
## barcode_2_elem_3 28
## barcode_3_elem_3 38
## barcode_1_elem_4 30
## barcode_2_elem_4 32
## barcode_3_elem_4 36
dna
## liver_1 liver_2 liver_3 liver_4 kidney_1 kidney_2 kidney_3
## barcode_1_elem_1 29 34 28 27 27 25 30
## barcode_2_elem_1 43 37 34 33 43 26 26
## barcode_3_elem_1 35 23 32 27 19 29 29
## barcode_1_elem_2 23 30 34 27 32 18 24
## barcode_2_elem_2 32 29 42 21 37 34 32
## barcode_3_elem_2 26 23 22 24 32 41 24
## barcode_1_elem_3 29 38 34 27 31 27 31
## barcode_2_elem_3 30 36 32 21 34 29 26
## barcode_3_elem_3 34 28 42 35 26 32 32
## barcode_1_elem_4 34 25 29 30 44 26 23
## barcode_2_elem_4 29 26 37 37 28 37 34
## barcode_3_elem_4 24 28 33 40 31 30 19
## kidney_4
## barcode_1_elem_1 31
## barcode_2_elem_1 33
## barcode_3_elem_1 25
## barcode_1_elem_2 25
## barcode_2_elem_2 31
## barcode_3_elem_2 23
## barcode_1_elem_3 33
## barcode_2_elem_3 39
## barcode_3_elem_3 26
## barcode_1_elem_4 33
## barcode_2_elem_4 31
## barcode_3_elem_4 26
PRE identification strings would look as below. When counts are provided at the barcode level, the eid
character vector will have repeated elements.
eid <- rep(paste0("elem_", seq_len(E)), each = B)
eid
## [1] "elem_1" "elem_1" "elem_1" "elem_2" "elem_2" "elem_2" "elem_3" "elem_3"
## [9] "elem_3" "elem_4" "elem_4" "elem_4"
We may also have PRE sequences as below. These sequences must be specified in a character vector of the same length as eid
and the same number of rows as rna
and dna
.
eseq <- replicate(E, paste(sample(c("A", "T", "C", "G"), 10, replace = TRUE), collapse = ""))
eseq <- rep(eseq, each = B)
eseq
## [1] "TGACCATACC" "TGACCATACC" "TGACCATACC" "ACGCCCAATT" "ACGCCCAATT"
## [6] "ACGCCCAATT" "CGGCGAGGGG" "CGGCGAGGGG" "CGGCGAGGGG" "CACTGTACTA"
## [11] "CACTGTACTA" "CACTGTACTA"
The above pieces (rna
, dna
, eid
, and eseq
) can be supplied as arguments to the MPRASet
constructor function as below. If barcode
(barcode sequences) or eseq
(PRE sequences) is not supplied, it must be specified as NULL
.
mpraset_example <- MPRASet(DNA = dna, RNA = rna, eid = eid, eseq = eseq, barcode = NULL)
mpraset_example
## class: MPRASet
## dim: 12 8
## metadata(0):
## assays(2): DNA RNA
## rownames(12): barcode_1_elem_1 barcode_2_elem_1 ... barcode_2_elem_4
## barcode_3_elem_4
## rowData names(2): eid eseq
## colnames(8): liver_1 liver_2 ... kidney_3 kidney_4
## colData names(0):
## No barcodes present
In allele comparison studies, PREs that exist with two or more alleles are assayed. All allelic-versions of the PREs are assayed in the same sample. Because activity comparisons between alleles is desired, these counts must be separated into different columns. In the following example, the experiment looks at four PREs with three barcodes each. There are two alleles per PRE, and there are four replicates (four independent transfections total).
RNA and DNA count matrices would look as below:
E <- 4 # Number of elements
B <- 3 # Number of barcodes
s <- 4 # Total number of samples
nalleles <- 2 # Number of alleles
set.seed(434)
rna <- matrix(rpois(E*B*s*nalleles, lambda = 30), nrow = E*B, ncol = s*nalleles)
dna <- matrix(rpois(E*B*s*nalleles, lambda = 30), nrow = E*B, ncol = s*nalleles)
rn <- as.character(outer(paste0("barcode_", seq_len(B), "_"), paste0("elem_", seq_len(E)), FUN = "paste0"))
cn <- c(paste0("allele1_sample", seq_len(s)), paste0("allele2_sample", seq_len(s)))
rownames(rna) <- rn
rownames(dna) <- rn
colnames(rna) <- cn
colnames(dna) <- cn
rna
## allele1_sample1 allele1_sample2 allele1_sample3
## barcode_1_elem_1 31 26 36
## barcode_2_elem_1 33 28 29
## barcode_3_elem_1 22 43 25
## barcode_1_elem_2 28 27 31
## barcode_2_elem_2 32 31 32
## barcode_3_elem_2 35 36 28
## barcode_1_elem_3 30 34 26
## barcode_2_elem_3 37 27 37
## barcode_3_elem_3 34 30 20
## barcode_1_elem_4 28 27 34
## barcode_2_elem_4 31 32 28
## barcode_3_elem_4 24 28 34
## allele1_sample4 allele2_sample1 allele2_sample2
## barcode_1_elem_1 18 27 33
## barcode_2_elem_1 32 28 33
## barcode_3_elem_1 41 28 34
## barcode_1_elem_2 20 37 33
## barcode_2_elem_2 21 38 22
## barcode_3_elem_2 38 26 23
## barcode_1_elem_3 34 40 26
## barcode_2_elem_3 29 25 29
## barcode_3_elem_3 26 30 35
## barcode_1_elem_4 26 29 30
## barcode_2_elem_4 36 31 24
## barcode_3_elem_4 31 37 29
## allele2_sample3 allele2_sample4
## barcode_1_elem_1 28 30
## barcode_2_elem_1 34 30
## barcode_3_elem_1 30 37
## barcode_1_elem_2 24 29
## barcode_2_elem_2 27 29
## barcode_3_elem_2 30 29
## barcode_1_elem_3 26 22
## barcode_2_elem_3 38 28
## barcode_3_elem_3 33 38
## barcode_1_elem_4 25 30
## barcode_2_elem_4 29 32
## barcode_3_elem_4 34 36
dna
## allele1_sample1 allele1_sample2 allele1_sample3
## barcode_1_elem_1 29 34 28
## barcode_2_elem_1 43 37 34
## barcode_3_elem_1 35 23 32
## barcode_1_elem_2 23 30 34
## barcode_2_elem_2 32 29 42
## barcode_3_elem_2 26 23 22
## barcode_1_elem_3 29 38 34
## barcode_2_elem_3 30 36 32
## barcode_3_elem_3 34 28 42
## barcode_1_elem_4 34 25 29
## barcode_2_elem_4 29 26 37
## barcode_3_elem_4 24 28 33
## allele1_sample4 allele2_sample1 allele2_sample2
## barcode_1_elem_1 27 27 25
## barcode_2_elem_1 33 43 26
## barcode_3_elem_1 27 19 29
## barcode_1_elem_2 27 32 18
## barcode_2_elem_2 21 37 34
## barcode_3_elem_2 24 32 41
## barcode_1_elem_3 27 31 27
## barcode_2_elem_3 21 34 29
## barcode_3_elem_3 35 26 32
## barcode_1_elem_4 30 44 26
## barcode_2_elem_4 37 28 37
## barcode_3_elem_4 40 31 30
## allele2_sample3 allele2_sample4
## barcode_1_elem_1 30 31
## barcode_2_elem_1 26 33
## barcode_3_elem_1 29 25
## barcode_1_elem_2 24 25
## barcode_2_elem_2 32 31
## barcode_3_elem_2 24 23
## barcode_1_elem_3 31 33
## barcode_2_elem_3 26 39
## barcode_3_elem_3 32 26
## barcode_1_elem_4 23 33
## barcode_2_elem_4 34 31
## barcode_3_elem_4 19 26
We could define PRE identification strings and sequences as above and use them in the MPRASet
constructor function similarly:
mpraset_example2 <- MPRASet(DNA = dna, RNA = rna, eid = eid, eseq = eseq, barcode = NULL)
mpraset_example2
## class: MPRASet
## dim: 12 8
## metadata(0):
## assays(2): DNA RNA
## rownames(12): barcode_1_elem_1 barcode_2_elem_1 ... barcode_2_elem_4
## barcode_3_elem_4
## rowData names(2): eid eseq
## colnames(8): allele1_sample1 allele1_sample2 ... allele2_sample3
## allele2_sample4
## colData names(0):
## No barcodes present
While the above section demonstrated how to create MPRASet
objects, we will use preconstructed objects containing data from a comparison of episomal and lentiviral versions of MPRA (Inoue et al. 2017).
data(mpraSetExample)
We create the design matrix with an indicator for the episomal (mutant integrase) samples and fit the precision-weighted linear model with mpralm
. In MPRA experiments, activity measures are quantified as the log2 ratio of RNA counts over DNA counts. When there is barcode level information (as in this experiment), there are various ways to summarize information over barcodes to compute the final element- and sample-specific log ratios that are used for subsequent statistical modeling.
We have specified aggregate = "mean"
to indicate that the element- and sample-specific log ratios will be computed by first computing the log ratio of RNA counts over DNA counts for each barcode, then taking the mean over barcodes in a particular element and sample.
We have specifed normalize = TRUE
to perform total count normalization on the RNA and DNA libraries. This scales all libraries to have a common size of 10 million reads.
Because this experiment looks at a set of PREs in two different cellular conditions, the different samples (columns of the MPRASet
object) are independent. Thus we specify model_type = "indep_groups"
to perform an unpaired analysis. In contrast, if we were performing an allele comparison, we would specify model_type = "corr_groups"
to performed a paired analysis (indicating that different columns of the MPRASet
object are linked).
Finally, we specify plot = TRUE
to plot the relationship between log ratio variability versus element copy number.
design <- data.frame(intcpt = 1, episomal = grepl("MT", colnames(mpraSetExample)))
mpralm_fit <- mpralm(object = mpraSetExample, design = design, aggregate = "mean", normalize = TRUE, model_type = "indep_groups", plot = TRUE)
The resulting fit object can be used with topTable
from the limma package.
toptab <- topTable(mpralm_fit, coef = 2, number = Inf)
toptab6 <- head(toptab)
Because the element codes are rather long for this dataset, we do some tricks to print the top differential elements:
rownames(toptab6)
## [1] "C:SLEA_hg18:chr2:210861483-210861650|5:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;26:V_HNF1_C:AGTTAATGATTAACCAA;45:V_HNF1_C:AGTTAATGATTAACCAA;64:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;85:V_HNF1_C:AGTTAATGATTAACCAA;104:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;125:V_HNF1_C:AGTTAATGATTAACCAA;144:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC"
## [2] "R:EP300-NoMod_chr9:12814543-12814714_[chr9:12814543-12814714]"
## [3] "C:SLEA_hg18:chr2:210861483-210861650|4:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;25:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;46:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;67:V_HNF1_C:AGTTAATGATTAACCAA;86:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;107:V_HNF1_C:AGTTAATGATTAACCAA;126:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;147:V_HNF1_C:AGTTAATGATTAACCAA"
## [4] "C:SLEA_hg18:chr2:210861483-210861650|6:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;27:V_HNF1_C:AGTTAATGATTAACCAA;46:V_HNF1_C:AGTTAATGATTAACCAA;65:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;86:V_HNF1_C:AGTTAATGATTAACCAA;105:V_HNF1_C:AGTTAATGATTAACCAA;124:V_HNF1_C:AGTTAATGATTAACCAA;143:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC"
## [5] "R:FOXA1_FOXA2-ChMod_chr1:38000211-38000353_[chr1:38000196-38000367]"
## [6] "R:EP300-NoMod_chr16:3070958-3071129_[chr16:3070958-3071129]"
rownames(toptab6) <- NULL
toptab6
## logFC AveExpr t P.Value adj.P.Val B
## 1 -1.2076627 1.284363 -26.99214 1.450176e-36 3.538429e-33 73.05825
## 2 -1.7316415 1.668562 -23.50738 4.357972e-33 5.316726e-30 64.87553
## 3 -1.1603766 1.497371 -23.11093 1.150733e-32 9.359292e-30 64.12459
## 4 -0.9560848 1.016586 -18.57491 2.136470e-27 1.303247e-24 52.09394
## 5 -1.1713718 1.730330 -18.25135 5.494116e-27 2.681128e-24 51.09648
## 6 -0.7639104 0.733524 -17.71599 2.689472e-26 1.093719e-23 49.54888
R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.3 LTS
Matrix products: default BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base
other attached packages: [1] mpra_1.0.0 limma_3.34.0
[3] SummarizedExperiment_1.8.0 DelayedArray_0.4.0
[5] matrixStats_0.52.2 Biobase_2.38.0
[7] GenomicRanges_1.30.0 GenomeInfoDb_1.14.0
[9] IRanges_2.12.0 S4Vectors_0.16.0
[11] BiocGenerics_0.24.0 BiocStyle_2.6.0
loaded via a namespace (and not attached): [1] Rcpp_0.12.13 knitr_1.17 XVector_0.18.0
[4] magrittr_1.5 zlibbioc_1.24.0 statmod_1.4.30
[7] munsell_0.4.3 colorspace_1.3-2 lattice_0.20-35
[10] plyr_1.8.4 stringr_1.2.0 tools_3.4.2
[13] grid_3.4.2 htmltools_0.3.6 yaml_2.1.14
[16] rprojroot_1.2 digest_0.6.12 bookdown_0.5
[19] Matrix_1.2-11 GenomeInfoDbData_0.99.1 bitops_1.0-6
[22] RCurl_1.95-4.8 evaluate_0.10.1 rmarkdown_1.6
[25] stringi_1.1.5 compiler_3.4.2 scales_0.5.0
[28] backports_1.1.1
Inoue, Fumitaka, Martin Kircher, Beth Martin, Gregory M Cooper, Daniela M Witten, Michael T McManus, Nadav Ahituv, and Jay Shendure. 2017. “A Systematic Comparison Reveals Substantial Differences in Chromosomal Versus Episomal Encoding of Enhancer Activity.” Genome Research 27: 38–52. doi:10.1101/gr.212092.116.
Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-seq Read Counts.” Genome Biology 15: R29. doi:10.1186/gb-2014-15-2-r29.
Myint, Leslie, Dimitrios G Avramopoulos, Loyal A. Goff, and Kasper D Hansen. 2017. “Linear Models Enable Powerful Differential Activity Analysis in Massively Parallel Reporter Assays.” BioRxiv, 196394. doi:10.1101/196394.