Overview

The regulation of target genes by their transcription factors is complex and incompletely understood. Multiple signals are involved: core promoters, distal enhancers, epigenetic controls on chromatin accessibility, and stochastic and cooperative binding on different time scales. Predictive modeling of these processes in fine detail and at scale is beyond our current capabilities.

We know, however, that gene regulation usually includes two gross features which on their own are poor predictors, but which together have predictive power:

  1. somewhat correlated gene expression of TF and target gene
  2. actual or predicted DNA binding of the TF in regulatory DNA regions associated with the target gene

trena combines these two predictors to create (at minimum) “broad brush” or “low-resolution” gene regulatory predictions. Inasmuch as trena may be used with a wide range of genomic, epigenetic and expression data, “high-resolution” predictions can be made as well. trena thus operates at many points along this continuum:
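As a minimal sketch of what combining the two predictors means (not trena's actual implementation), the base R example below uses simulated expression values and hypothetical gene names. The list of TFs with binding sites and the 0.5 correlation cutoff are assumptions made only for illustration.

```r
# Toy illustration only: simulated data, hypothetical gene names.
set.seed(17)
n.samples <- 200
target <- rnorm(n.samples)

# Expression matrix: genes in rows, samples in columns.
mtx <- rbind(
  TARGET = target,
  TF.A   = target + rnorm(n.samples, sd = 0.3),  # co-expressed, has a binding site
  TF.B   = target + rnorm(n.samples, sd = 0.3),  # co-expressed, no binding site
  TF.C   = rnorm(n.samples),                     # binding site, not co-expressed
  TF.D   = rnorm(n.samples)                      # neither
)

# Predictor 1: correlation of each TF with the target gene.
tfs <- setdiff(rownames(mtx), "TARGET")
cors <- sapply(tfs, function(tf) cor(mtx[tf, ], mtx["TARGET", ]))
coexpressed <- names(cors)[abs(cors) >= 0.5]   # cutoff is an assumption

# Predictor 2: TFs with an actual or predicted binding site in the
# target's regulatory regions (simply asserted here).
tfs.with.binding <- c("TF.A", "TF.C")

# Keep only TFs supported by both lines of evidence.
candidates <- intersect(coexpressed, tfs.with.binding)
print(candidates)
```

In this toy setup TF.B is dropped despite strong co-expression and TF.C is dropped despite a binding site, which is the sense in which each feature alone is a poor predictor.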

low-res mode: uses bulk mRNA data, generalized predictions of regulatory regions, computational matching of TF to DNA sequence in those regions

high-res mode: uses single cell RNA-seq, scATAC-seq or DNase regions, timecourse or well-discriminated environmental conditions, 3C, scChIP-seq, other recently emerging binding assays.

The low-resolution mode is useful when gene regulatory relationships are little known, or when a coarse-grained result is adequate. A prime example is the creation of genome-scale regulatory models by aggregating thousands of low-res single-gene models.

The high-resolution mode can predict relationships that approach mechanistic accuracy and that justify attempts at laboratory validation.

In this paper we demonstrate trena at several points along this low-to-high resolution continuum by analyzing GATA2 regulation in erythropoiesis. We conclude by demonstrating the application of trena at genome scale in low-resolution mode in AD and PTB, arriving at better estimates of gene expression than is possible with existing methods, and using K-fold cross validation to make that case.

In this vignette we present several successive attempts to identify transcription factors which regulate the NFE2 (Nuclear Factor, Erythroid 2) gene. NFE2 is itself a transcription factor with a role in erythroid and megakaryocytic maturation and differentiation.

Erythropoiesis has been studied in depth for many years, and many TF/target gene relationships have been documented. Our goal in this vignette is to demonstrate that trena, using progressively more highly-resolved data, can identify some of these relationships. We also present evidence for novel, previously unreported relationships, some of which merit attempts at experimental validation. We propose that trena’s ability to recapitulate known relations in well-studied systems predicts that it will make useful predictions in less-studied, less understood systems.

Journal Articles Reporting Regulators of NFE2

  • Bose, Francesca, et al. “Functional interaction of CP2 with GATA-1 in the regulation of erythroid promoters.” Molecular and cellular biology 26.10 (2006): 3942-3954.

  • Ding, Ya‐li, et al. “Over‐expression of EDAG in the myeloid cell line 32D: Induction of GATA‐1 expression and erythroid/megakaryocytic phenotype.” Journal of cellular biochemistry 110.4 (2010): 866-874.

  • many more to be added…

Use the regulatory landscape

Here is a preview of the final context we use to predict transcription factor/target gene relationships, including:

  1. phast7: conserved DNA sequence, in red, on a scale of 0-1, across 7 species: opossum, dog, rat, mouse, rhesus monkey, chimpanzee and human. Highly conserved sequence in non-coding regions is likely to be functional, that is, to play a role in regulating a nearby gene, and possibly a TF binding site.

  2. atac combined: the union of 12 scATAC-seq datasets, collected in an erythropoiesis time course (Gillespie et al., https://www.biorxiv.org/content/10.1101/812123v1)

  3. TAL1, KLF1, GATA1, etc: high-scoring motif matches for top regulating TFs predicted by trena using RNA-seq expression data from Gillespie et al.
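To make "the union" of open-chromatin regions concrete, here is a minimal base R sketch that merges overlapping intervals from several samples on one chromosome. The coordinates, sample labels and the `mergeIntervals` helper are all hypothetical. In a Bioconductor workflow the same operation is usually done with `GenomicRanges::reduce`.

```r
# Toy illustration only: hypothetical coordinates on a single chromosome.
regions <- data.frame(
  sample = c("day0", "day0", "day2", "day2", "day4"),
  start  = c(100, 500, 150, 900, 480),
  end    = c(200, 600, 260, 950, 520)
)

# Merge overlapping intervals into their union (hypothetical helper).
mergeIntervals <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  merged.start <- start[1]
  merged.end <- end[1]
  for (i in seq_along(start)[-1]) {
    n <- length(merged.end)
    if (start[i] <= merged.end[n]) {
      # overlaps the current merged interval: extend it
      merged.end[n] <- max(merged.end[n], end[i])
    } else {
      # gap: begin a new merged interval
      merged.start <- c(merged.start, start[i])
      merged.end <- c(merged.end, end[i])
    }
  }
  data.frame(start = merged.start, end = merged.end)
}

tbl.union <- mergeIntervals(regions$start, regions$end)
print(tbl.union)
```

A region open in any one sample is kept in the union, so the combined track is permissive: it marks where chromatin is accessible at some point in the time course, not at every point.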

In this vignette we will show how the analysis progresses from naive to informed:

  1. correlated gene expression, using bulk tissue GTEx blood RNA-seq, all possible TFs
  2. restrict candidate TFs to only those with strong FIMO (or PWMmatch) scores
  3. further restrict TFs to those which fall in highly conserved (phast7 >= 0.5) regions
  4. as above, but with RNA collected from multiple stages of erythropoiesis
  5. use RNA-seq and scATAC-seq from Gillespie et al. The ATAC-seq track is the union of 20 samples across 12 erythropoiesis development timepoints.
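Steps 1 to 3 above amount to successive filters on the candidate TF list. The base R sketch below illustrates that narrowing with an invented table of motif hits. The TF names, scores, column names and the motif-score threshold are all assumptions. Only the phast7 >= 0.5 cutoff comes from the list above.

```r
# Toy illustration only: hypothetical TFs, motif scores and conservation values.
tbl.hits <- data.frame(
  tf          = c("TF.A", "TF.B", "TF.C", "TF.D", "TF.E"),
  motif.score = c(12.4, 9.8, 3.1, 11.0, 2.5),     # FIMO-style match score
  phast7      = c(0.92, 0.61, 0.88, 0.20, 0.75),  # conservation at the match, 0-1
  stringsAsFactors = FALSE
)

all.tfs <- c("TF.A", "TF.B", "TF.C", "TF.D", "TF.E", "TF.F")

# Step 1: every TF is a candidate.
step1 <- all.tfs

# Step 2: keep TFs with a strong motif match (threshold is an assumption).
min.score <- 8
tbl.strong <- subset(tbl.hits, motif.score >= min.score)
step2 <- intersect(step1, unique(tbl.strong$tf))

# Step 3: additionally require that the strong match lies in conserved sequence.
tbl.conserved <- subset(tbl.strong, phast7 >= 0.5)
step3 <- intersect(step1, unique(tbl.conserved$tf))

print(list(step1 = step1, step2 = step2, step3 = step3))
```

Filtering rows of the hit table, not TF names, ensures that the strong score and the conservation requirement apply to the same binding site.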

Analysis 1: GTEx blood RNA-seq, no genome information: simple co-expression

## Loading required package: TrenaProjectHG38
## Loading required package: TrenaProject
## See system.file("LICENSE", package="MotifDb") for use restrictions.
## 

Get a list of all TFs

library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
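# GO:0003700 is "DNA-binding transcription factor activity"; the GOALL keytype
# also returns genes annotated to its child terms.
allTFsymbols <- suppressMessages(
    sort(unique(
        select(org.Hs.eg.db, keys="GO:0003700", keytype="GOALL", columns="SYMBOL")$SYMBOL)))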

Build a “no-genome” model with GTEx blood

library(trena)
## Loading required package: glmnet
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## Loaded glmnet 3.0
## Loading required package: MotifDb
## Loading required package: Biostrings
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
# mtx.blood (GTEx blood expression, genes in rows) is assumed to be already loaded.
candidate.tfs <- intersect(allTFsymbols, rownames(mtx.blood))
cat(sprintf("%d candidate.tfs\n", length(candidate.tfs)))
## 1498 candidate.tfs
# Model NFE2 expression as a function of all candidate TFs using trena's ensemble of solvers.
es <- EnsembleSolver(mtx.blood, "NFE2", candidate.tfs)
tbl.model <- run(es)
## There are more than 500 variables and n<m;
## You may wish to restart and set use.Gram=FALSE
# Rank TFs by absolute Pearson correlation with NFE2.
tbl.model <- tbl.model[order(abs(tbl.model$pearsonCoeff), decreasing=TRUE),]
rownames(tbl.model) <- NULL
grep("GATA1", tbl.model$gene)  # row (rank) of GATA1 in the sorted table
## [1] 119
head(tbl.model, n=20)
##       gene  betaLasso   lassoPValue pearsonCoeff     rfScore   betaRidge
## 1   ZNF467 0.07470464 3.807015e-137    0.8858483 68.03043734 0.012301092
## 2     SPI1 0.25964462  1.221741e-88    0.8631695 22.65562591 0.015977277
## 3   ZNF438 0.11069593 3.643661e-103    0.8622473 30.19208986 0.013328559
## 4     GAS7 0.00000000  1.895326e-91    0.8620084 39.36801699 0.010572346
## 5     MXD3 0.15208025 9.404063e-101    0.8607177 37.00341006 0.012717317
## 6     RFX2 0.00000000  2.055364e-10    0.8196630  1.77125342 0.007838097
## 7  SUPT4H1 0.00000000  4.284611e-04    0.8049881  4.09739462 0.007102236
## 8   ZBTB7B 0.00000000  1.000000e+00    0.8002630 27.99585639 0.007818212
## 9      HLX 0.10354211  1.776149e-37    0.7960579  2.95602951 0.026831432
## 10   STK16 0.00000000  1.000000e+00    0.7923523  0.34157393 0.005780643
## 11  ZNF787 0.00000000  1.000000e+00    0.7923305  2.29357090 0.017382343
## 12  ZNF213 0.00000000  1.000000e+00    0.7864081  1.27899987 0.006331342
## 13  ARID3A 0.00000000  1.000000e+00    0.7808684  2.08038622 0.016199439
## 14    TFEB 0.00000000  1.000000e+00    0.7805708  1.38913490 0.005631353
## 15   CEBPD 0.00000000  1.000000e+00    0.7803907  0.09038523 0.009090038
## 16   NR2E1 0.02476031  9.543671e-51    0.7782116  4.43853373 0.005075063
## 17    ATF6 0.00000000  1.000000e+00    0.7766754  0.15684835 0.013109342
## 18   MLLT1 0.00000000  1.000000e+00    0.7743500  0.61129144 0.012092161
## 19  ZNF746 0.00000000  1.000000e+00    0.7602409  0.05870189 0.013399203
## 20    BATF 0.00000000  6.546269e-10    0.7601506  1.61960687 0.013264925
##    spearmanCoeff      xgboost
## 1      0.8969412 2.661903e-01
## 2      0.8783204 9.899239e-02
## 3      0.8651491 4.195779e-02
## 4      0.8626947 1.677529e-01
## 5      0.8703823 1.169558e-01
## 6      0.8283829 1.041221e-05
## 7      0.8216015 8.308591e-03
## 8      0.8271060 0.000000e+00
## 9      0.8043606 2.650176e-02
## 10     0.7941753 9.107617e-04
## 11     0.8050329 0.000000e+00
## 12     0.7905818 0.000000e+00
## 13     0.8018518 1.174641e-04
## 14     0.7987542 2.219866e-04
## 15     0.7818135 1.801124e-06
## 16     0.7858691 1.985362e-02
## 17     0.7903082 1.431481e-05
## 18     0.7883546 0.000000e+00
## 19     0.7770288 3.489503e-06
## 20     0.7628710 2.357746e-02
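One way to read this table is to look for TFs on which several solvers agree. The sketch below is an assumption about how one might shortlist them, not a step taken in the analysis above. It copies the first eight rows of the printed output (values rounded) into a small data frame, then keeps TFs with a non-zero lasso coefficient and a random forest score of at least 10. Both thresholds are arbitrary choices made for illustration.

```r
# First eight rows of tbl.model as printed above, values rounded.
tbl.top <- data.frame(
  gene         = c("ZNF467", "SPI1", "ZNF438", "GAS7", "MXD3", "RFX2", "SUPT4H1", "ZBTB7B"),
  betaLasso    = c(0.0747, 0.2596, 0.1107, 0, 0.1521, 0, 0, 0),
  pearsonCoeff = c(0.8858, 0.8632, 0.8622, 0.8620, 0.8607, 0.8197, 0.8050, 0.8003),
  rfScore      = c(68.03, 22.66, 30.19, 39.37, 37.00, 1.77, 4.10, 28.00),
  stringsAsFactors = FALSE
)

# Illustrative agreement rule (an assumption): selected by lasso AND
# ranked as important by random forest.
min.rfScore <- 10
agree <- tbl.top$betaLasso != 0 & tbl.top$rfScore >= min.rfScore
shortlist <- tbl.top$gene[agree]
print(shortlist)
```

GAS7 and ZBTB7B illustrate why agreement is worth checking: both have high random forest scores and strong correlation with NFE2, yet lasso assigns them a coefficient of zero, which can happen when correlated predictors compete for the same signal.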

Session Info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] trena_1.7.14                     MotifDb_1.26.0                  
##  [3] Biostrings_2.52.0                XVector_0.24.0                  
##  [5] glmnet_3.0                       Matrix_1.2-17                   
##  [7] org.Hs.eg.db_3.8.2               AnnotationDbi_1.46.1            
##  [9] IRanges_2.18.3                   S4Vectors_0.22.1                
## [11] Biobase_2.44.0                   BiocGenerics_0.30.0             
## [13] TrenaProjectErythropoiesis_1.0.2 TrenaProjectLymphocyte_1.0.2    
## [15] TrenaProjectHG38_1.2.1           TrenaProject_1.2.3              
## [17] BiocStyle_2.12.0                
## 
## loaded via a namespace (and not attached):
##  [1] RMySQL_0.10.17              bit64_0.9-7                
##  [3] foreach_1.4.7               assertthat_0.2.1           
##  [5] BiocManager_1.30.9          blob_1.2.0                 
##  [7] BSgenome_1.52.0             GenomeInfoDbData_1.2.1     
##  [9] Rsamtools_2.0.3             yaml_2.2.0                 
## [11] pillar_1.4.2                RSQLite_2.1.2              
## [13] backports_1.1.5             lattice_0.20-38            
## [15] digest_0.6.22               GenomicRanges_1.36.1       
## [17] randomForest_4.6-14         htmltools_0.4.0            
## [19] XML_3.98-1.20               pkgconfig_2.0.3            
## [21] bookdown_0.15               zlibbioc_1.30.0            
## [23] BiocParallel_1.18.1         tibble_2.1.3               
## [25] xgboost_0.90.0.2            flare_1.6.0.2              
## [27] SummarizedExperiment_1.14.1 RPostgreSQL_0.6-2          
## [29] splitstackshape_1.4.8       magrittr_1.5               
## [31] crayon_1.3.4                memoise_1.1.0              
## [33] evaluate_0.14               fs_1.3.1                   
## [35] MASS_7.3-51.4               tools_3.6.1                
## [37] data.table_1.12.6           matrixStats_0.55.0         
## [39] stringr_1.4.0               DelayedArray_0.10.0        
## [41] compiler_3.6.1              pkgdown_1.4.1              
## [43] GenomeInfoDb_1.20.0         rlang_0.4.1                
## [45] grid_3.6.1                  RCurl_1.95-4.12            
## [47] iterators_1.0.12            igraph_1.2.4.1             
## [49] bitops_1.0-6                rmarkdown_1.17             
## [51] codetools_0.2-16            lars_1.2                   
## [53] DBI_1.0.0                   vbsr_0.0.5                 
## [55] R6_2.4.1                    GenomicAlignments_1.20.1   
## [57] knitr_1.26                  rtracklayer_1.44.4         
## [59] lassopv_0.2.0               zeallot_0.1.0              
## [61] bit_1.1-14                  rprojroot_1.3-2            
## [63] shape_1.4.4                 desc_1.2.0                 
## [65] stringi_1.4.3               Rcpp_1.0.3                 
## [67] vctrs_0.2.0                 png_0.1-7                  
## [69] xfun_0.11