nfe2-regulation-old.Rmd
The regulation of target genes by their transcription factors is complex and incompletely understood. Multiple signals are involved: core promoters, distal enhancers, epigenetic controls on chromatin accessibility, and stochastic and cooperative binding on different time scales. Predictive modeling of these processes in fine detail and at scale is beyond our current capabilities.
We know, however, that gene regulation usually includes two gross features which on their own are poor predictors, but which together have predictive power:
trena combines these two predictors to create (at minimum) “broad brush” or “low-resolution” gene regulatory predictions. Inasmuch as trena may be used with a wide range of genomic, epigenetic and expression data, “high-resolution” predictions can be made as well. trena thus operates at many points along this continuum:
low-res mode: uses bulk mRNA data, generalized predictions of regulatory regions, computational matching of TF to DNA sequence in those regions
high-res mode: uses single cell RNA-seq, scATAC-seq or DNase regions, timecourse or well-discriminated environmental conditions, 3C, scChIP-seq, other recently emerging binding assays.
The low-resolution mode is useful when gene regulatory relationships are little known, or when a coarse-grained result is adequate. A prime example is the creation of genome-scale regulatory models by aggregating thousands of low-res single-gene models.
The high-resolution mode can predict relationships that approach mechanistic accuracy and that justify attempts at laboratory validation.
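The low-res idea of combining two weak predictors can be sketched in a few lines of base R. This is a toy illustration, not trena's implementation: it assumes the two predictors are (a) a motif match in a candidate regulatory region and (b) correlated expression between TF and target, and every gene name, value and threshold below is hypothetical.

```r
# Toy illustration only: all names, values and thresholds are hypothetical.
x <- seq_len(40)                       # 40 pretend samples

# Deterministic "expression" profiles, mutually uncorrelated by construction.
mtx <- rbind(TF.A = sin(2 * pi * x / 40),
             TF.B = cos(2 * pi * x / 40),
             TF.C = sin(4 * pi * x / 40),
             TF.D = cos(4 * pi * x / 40))

# The pretend target gene tracks TF.A and TF.C.
target <- mtx["TF.A", ] + mtx["TF.C", ]

# Assumed predictor 1: TFs with a motif match in a regulatory region.
tfs.with.motif.hit <- c("TF.A", "TF.B")

# Assumed predictor 2: correlation of each TF's expression with the target.
pearson <- apply(mtx, 1, function(tf.expression) cor(tf.expression, target))

tbl <- data.frame(gene = rownames(mtx),
                  motif.hit = rownames(mtx) %in% tfs.with.motif.hit,
                  pearson = round(unname(pearson), 3),
                  stringsAsFactors = FALSE)

# Keep only TFs supported by both predictors.
candidates <- subset(tbl, motif.hit & abs(pearson) > 0.5)
print(candidates)
```

In this toy, TF.B has a motif hit but no correlation, and TF.C is correlated but has no motif hit, so only TF.A survives the combined filter.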
In this paper we demonstrate trena at several points along this low-to-high resolution continuum by analyzing GATA2 regulation in erythropoiesis. We conclude by demonstrating the application of trena at genome scale in low-resolution mode in AD and PTB, arriving at better estimates of gene expression than is possible with existing methods, and using K-fold cross-validation to make that case.
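For readers unfamiliar with K-fold cross-validation, the general procedure can be sketched as follows. This is a generic base-R illustration on simulated data, using an ordinary linear model as a stand-in; it is not the data, the model or the evaluation code used in the paper, and the variable names are hypothetical.

```r
# Hypothetical toy data: a target gene driven by two pretend TFs.
set.seed(31)
n <- 100
tbl <- data.frame(tf1 = rnorm(n), tf2 = rnorm(n))
tbl$target <- 2 * tbl$tf1 - tbl$tf2 + rnorm(n, sd = 0.5)

k <- 5
# Assign each sample to one of k equally sized folds, in random order.
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k-1 folds, predict the held-out samples,
# and record the root-mean-square error of those predictions.
rmse <- sapply(seq_len(k), function(i) {
  train <- tbl[fold != i, ]
  test  <- tbl[fold == i, ]
  fit   <- lm(target ~ tf1 + tf2, data = train)
  sqrt(mean((test$target - predict(fit, newdata = test))^2))
})

mean(rmse)   # average held-out error across the k folds
```

Because every sample is held out exactly once, the averaged error estimates how well the model predicts expression in samples it has not seen.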
In this vignette we present several successive attempts to identify transcription factors which regulate the NFE2 (Nuclear Factor, Erythroid 2) gene. NFE2 is itself a transcription factor which plays a role in erythroid and megakaryocytic maturation and differentiation.
Erythropoiesis has been studied in depth for many years, and many TF/target gene relationships have been documented. Our goal in this vignette is to demonstrate that trena, using progressively more highly resolved data, can identify some of these relationships. Evidence of novel, previously unreported relationships, some of them worthy of attempts at experimental validation, will be suggested as well. We propose that trena’s ability to recapitulate known relationships in well-studied systems predicts that it will make useful predictions in less-studied, less-understood systems.
Bose, Francesca, et al. “Functional interaction of CP2 with GATA-1 in the regulation of erythroid promoters.” Molecular and cellular biology 26.10 (2006): 3942-3954.
Ding, Ya‐li, et al. “Over‐expression of EDAG in the myeloid cell line 32D: Induction of GATA‐1 expression and erythroid/megakaryocytic phenotype.” Journal of cellular biochemistry 110.4 (2010): 866-874.
many more to be added…
Here is a preview of the final context we use to predict transcription factor/target gene relationships, including:
phast7: conserved DNA sequence, in red, on a scale of 0-1, across 7 species: opossum, dog, rat, mouse, rhesus monkey, chimpanzee and human. Highly conserved sequence in non-coding regions is likely to be functional, that is, to play a role in regulating a nearby gene, and possibly a TF binding site.
atac combined: the union of 12 scATAC-seq datasets, collected in an erythropoiesis time course (Gillespie et al., https://www.biorxiv.org/content/10.1101/812123v1)
TAL1, KLF1, GATA1, etc: high-scoring motif matches for top regulating TFs predicted by trena using RNA-seq expression data from Gillespie et al.
In this vignette we show how the analysis progresses from naive to informed:
## Loading required package: TrenaProjectHG38
## Loading required package: TrenaProject
## See system.file("LICENSE", package="MotifDb") for use restrictions.
##
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
# All gene symbols annotated to GO:0003700 (DNA-binding transcription factor activity)
allTFsymbols <- suppressMessages(
    sort(unique(
        select(org.Hs.eg.db, keys="GO:0003700", keytype="GOALL", columns="SYMBOL")$SYMBOL)))
library(trena)
## Loading required package: glmnet
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
##
## expand
## Loaded glmnet 3.0
## Loading required package: MotifDb
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
candidate.tfs <- intersect(allTFsymbols, rownames(mtx.blood))
cat(sprintf("%d candidate.tfs\n", length(candidate.tfs)))
## 1498 candidate.tfs
es <- EnsembleSolver(mtx.blood, "NFE2", candidate.tfs)
tbl.model <- run(es)
## There are more than 500 variables and n<m;
## You may wish to restart and set use.Gram=FALSE
tbl.model <- tbl.model[order(abs(tbl.model$pearsonCoeff), decreasing=TRUE),]
rownames(tbl.model) <- NULL
grep("GATA1", tbl.model$gene) # rank of GATA1 when sorted by absolute Pearson correlation
## [1] 119
head(tbl.model, n=20)
## gene betaLasso lassoPValue pearsonCoeff rfScore betaRidge
## 1 ZNF467 0.07470464 3.807015e-137 0.8858483 68.03043734 0.012301092
## 2 SPI1 0.25964462 1.221741e-88 0.8631695 22.65562591 0.015977277
## 3 ZNF438 0.11069593 3.643661e-103 0.8622473 30.19208986 0.013328559
## 4 GAS7 0.00000000 1.895326e-91 0.8620084 39.36801699 0.010572346
## 5 MXD3 0.15208025 9.404063e-101 0.8607177 37.00341006 0.012717317
## 6 RFX2 0.00000000 2.055364e-10 0.8196630 1.77125342 0.007838097
## 7 SUPT4H1 0.00000000 4.284611e-04 0.8049881 4.09739462 0.007102236
## 8 ZBTB7B 0.00000000 1.000000e+00 0.8002630 27.99585639 0.007818212
## 9 HLX 0.10354211 1.776149e-37 0.7960579 2.95602951 0.026831432
## 10 STK16 0.00000000 1.000000e+00 0.7923523 0.34157393 0.005780643
## 11 ZNF787 0.00000000 1.000000e+00 0.7923305 2.29357090 0.017382343
## 12 ZNF213 0.00000000 1.000000e+00 0.7864081 1.27899987 0.006331342
## 13 ARID3A 0.00000000 1.000000e+00 0.7808684 2.08038622 0.016199439
## 14 TFEB 0.00000000 1.000000e+00 0.7805708 1.38913490 0.005631353
## 15 CEBPD 0.00000000 1.000000e+00 0.7803907 0.09038523 0.009090038
## 16 NR2E1 0.02476031 9.543671e-51 0.7782116 4.43853373 0.005075063
## 17 ATF6 0.00000000 1.000000e+00 0.7766754 0.15684835 0.013109342
## 18 MLLT1 0.00000000 1.000000e+00 0.7743500 0.61129144 0.012092161
## 19 ZNF746 0.00000000 1.000000e+00 0.7602409 0.05870189 0.013399203
## 20 BATF 0.00000000 6.546269e-10 0.7601506 1.61960687 0.013264925
## spearmanCoeff xgboost
## 1 0.8969412 2.661903e-01
## 2 0.8783204 9.899239e-02
## 3 0.8651491 4.195779e-02
## 4 0.8626947 1.677529e-01
## 5 0.8703823 1.169558e-01
## 6 0.8283829 1.041221e-05
## 7 0.8216015 8.308591e-03
## 8 0.8271060 0.000000e+00
## 9 0.8043606 2.650176e-02
## 10 0.7941753 9.107617e-04
## 11 0.8050329 0.000000e+00
## 12 0.7905818 0.000000e+00
## 13 0.8018518 1.174641e-04
## 14 0.7987542 2.219866e-04
## 15 0.7818135 1.801124e-06
## 16 0.7858691 1.985362e-02
## 17 0.7903082 1.431481e-05
## 18 0.7883546 0.000000e+00
## 19 0.7770288 3.489503e-06
## 20 0.7628710 2.357746e-02
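The ranking and lookup steps used above can be tried on a small stand-alone table. The sketch below copies four rows (gene, betaLasso, pearsonCoeff) from the output above, deliberately out of order; it is only an illustration of the sorting and lookup idiom, not a rerun of the model. It uses an exact match rather than `grep`, since `grep` matches substrings (a pattern such as "NFE2" would also match "NFE2L2").

```r
# Four rows taken from the table above, entered out of order.
tbl <- data.frame(
  gene         = c("RFX2", "SPI1", "GAS7", "ZNF467"),
  betaLasso    = c(0, 0.25964462, 0, 0.07470464),
  pearsonCoeff = c(0.8196630, 0.8631695, 0.8620084, 0.8858483),
  stringsAsFactors = FALSE)

# Sort by absolute Pearson correlation, strongest first, and renumber the rows.
tbl.sorted <- tbl[order(abs(tbl$pearsonCoeff), decreasing = TRUE), ]
rownames(tbl.sorted) <- NULL

# Exact-match rank lookup for one gene.
rank.spi1 <- which(tbl.sorted$gene == "SPI1")

# Number of TFs to which lasso assigned a non-zero coefficient.
n.lasso <- sum(tbl.sorted$betaLasso != 0)

print(tbl.sorted)
```

Counting non-zero `betaLasso` entries is one quick way to see how many candidates the sparse solver retained, independent of the correlation ranking.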
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] trena_1.7.14 MotifDb_1.26.0
## [3] Biostrings_2.52.0 XVector_0.24.0
## [5] glmnet_3.0 Matrix_1.2-17
## [7] org.Hs.eg.db_3.8.2 AnnotationDbi_1.46.1
## [9] IRanges_2.18.3 S4Vectors_0.22.1
## [11] Biobase_2.44.0 BiocGenerics_0.30.0
## [13] TrenaProjectErythropoiesis_1.0.2 TrenaProjectLymphocyte_1.0.2
## [15] TrenaProjectHG38_1.2.1 TrenaProject_1.2.3
## [17] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] RMySQL_0.10.17 bit64_0.9-7
## [3] foreach_1.4.7 assertthat_0.2.1
## [5] BiocManager_1.30.9 blob_1.2.0
## [7] BSgenome_1.52.0 GenomeInfoDbData_1.2.1
## [9] Rsamtools_2.0.3 yaml_2.2.0
## [11] pillar_1.4.2 RSQLite_2.1.2
## [13] backports_1.1.5 lattice_0.20-38
## [15] digest_0.6.22 GenomicRanges_1.36.1
## [17] randomForest_4.6-14 htmltools_0.4.0
## [19] XML_3.98-1.20 pkgconfig_2.0.3
## [21] bookdown_0.15 zlibbioc_1.30.0
## [23] BiocParallel_1.18.1 tibble_2.1.3
## [25] xgboost_0.90.0.2 flare_1.6.0.2
## [27] SummarizedExperiment_1.14.1 RPostgreSQL_0.6-2
## [29] splitstackshape_1.4.8 magrittr_1.5
## [31] crayon_1.3.4 memoise_1.1.0
## [33] evaluate_0.14 fs_1.3.1
## [35] MASS_7.3-51.4 tools_3.6.1
## [37] data.table_1.12.6 matrixStats_0.55.0
## [39] stringr_1.4.0 DelayedArray_0.10.0
## [41] compiler_3.6.1 pkgdown_1.4.1
## [43] GenomeInfoDb_1.20.0 rlang_0.4.1
## [45] grid_3.6.1 RCurl_1.95-4.12
## [47] iterators_1.0.12 igraph_1.2.4.1
## [49] bitops_1.0-6 rmarkdown_1.17
## [51] codetools_0.2-16 lars_1.2
## [53] DBI_1.0.0 vbsr_0.0.5
## [55] R6_2.4.1 GenomicAlignments_1.20.1
## [57] knitr_1.26 rtracklayer_1.44.4
## [59] lassopv_0.2.0 zeallot_0.1.0
## [61] bit_1.1-14 rprojroot_1.3-2
## [63] shape_1.4.4 desc_1.2.0
## [65] stringi_1.4.3 Rcpp_1.0.3
## [67] vctrs_0.2.0 png_0.1-7
## [69] xfun_0.11