Introduction

The regulation of target genes by their transcription factors is complex and incompletely understood. Multiple signals involving core promoters, distal enhancers, epigenetic controls on chromatin accessibility, stochastic and cooperative binding on different times scales are all involved.

Many hundreds of human TF/target gene relationships are known, however. These relationships are primarily from well-studied systems and from research into human diseases. Hundreds of thousands of TF/target gene relationships exist in human cells. There is an urgent need for laboratory and computational methods which predict and validate the many that remain unknown.

trena is one such computational method. It's predictive success depends upon the judicious combination of two predictors, each of which in isolation has little predictive value:

1) correlated gene expression of TF and target gene 2) actual or predicted DNA binding of the TF in regulatory DNA regions associated with the target gene

We will show that these predictors, when applied in tandem to high quality gene expression, genomic and epigenomic data, can 1) recapitulate known TF/target gene relationships, and 2) predict new ones sufficiently plausible to warrant small-scale laboratory valdidation, and 3) specify useful genome-scale tissue-specific regulatory networks.

Methodological Overview

There are two stages in every trena analysis:

Select a relevant set of transcription factors
Score the gene expression association of each TF with the target gene using an ensemble of the independent feature selectors trena provides: lasso, ridge, xgboost, randomForest pearson & spearman correlation.

Step One: select candidate transcription factors

Step one can be accomplished in many ways. A naive approach uses all 1663 human genes currently annotated by the GeneOntology project to the molecular function, “DNA-binding transcription factor activity”. A more conservative choice is to use the 558 human genes for which Jaspar2018 and HOCOMOCO provide binding motifs. We will demonstrate these two annotation-based strategies below and then proceed to use more discriminating approaches:

Assume a traditional cis-regulatory promoter region for the target gene (say, +/- 5kb from the TSS).
Obtain cis-regulatory promoters and enhancers from public databaes or - best of all - by cell type specific experimental data for the system under study
Select transcription factors with motifs which match DNA sequence in these regulatory regions. Each match can be quantified in (at least) two ways: by motif/sequence match, and by the evolutionary conservation of that matching sequence.

Each of these strategies has both strengths weaknesses. The naive 1663 GO annotation gene strategy will identify TFs (or co-factors) which may be active and functional but for which a binding motif is not neccesarily known. The 558 motif TF strategy similarly makes no assumptions about regulatory regions, but probably is enriched for better studied TFs. In the absence of good data about the target genes's actual regulatory regions, and binding, either of these inclusive approaches can be useful.

However, there is an emerging consensus (Cusanovich 2014 & etc) that a great deal of actually bound TFs, as detected by ChIP-seq, in known regulatory regions, are not functional. Their knockdown or silencing has no effect on the expression of the target gene. Thus the primary challenge for the first stage of trena - TF selection - is how to avoid considering too many extraneous TF candidates.

We adopt these strategies for TF selection:

1) select only TFs with motifs with very high DNA matching scores 2) occuring in non-coding DNA sequence that is highly conserved over 100 million years 3) when avaiable, as with the increasingly available single-cell ATAC-seq, further filter the TFs by requring that condition-specific chromatin is open

The stringency of strategies 1 and 2 will, in any given study, exclude functional and relevant binding sites, and their cognate TFs from consideration in the trena regulatory model. This loss is more than offset, as we will show, by the credibility of the models which emerge from that stringency, based on the assumption that (again, following Cusanovich 2014) that since each human gene has a median TF count of 35, that high fidelity motif match in highly conserved regions of open chromatin identifies at least some TFs of interest - and, we speculate, that these will often include the oldest and most central TFs for each target gene.

Step Two: Run the ensemble of feature selectors

After appropriate normalization, the expression of the TFs selected in step one are used to predict the expression of the target gene, using a variety of algorithms:

lasso
lassoPValue
ridge regression
pearson correlation
spearman correlation
random forest
xgboost

The resulting “trena model” is an R data.frame containing the score using each algorithm of each TF as a predictor of the gene expression of the target gene. We as yet have no single settled heuristic method to aggregate these scores. Individucal analysts evolve their own informal aggregation scheme. For example: giving weight and preference to the random forest score while also heeding spearman correlation. Quite commonly, as will be seen in the case studies, the various methods produce compatible predictions.

The first criterion of value for a computational tool such as trena is its ability to reliably reproduce existing knowledge. Here we accomplish that with four case studies, first by demonstrating failure with bulk GTEx blood RNA-seq, then success with three applications of stage-specific erythropoiesis RNA-seq and ATAC-seq data.

Case Studies

Case Study 1 An informative failure: trena with GTEx bulk blood expression
Case Study 2 Successful recapitulation: trena + Corces et al, 2016, RNA-seq on FACS-separated erythropoiesis stages, GSE74246
Case Study 3 Successful recapituation: more recent Brand lab (similar to Corces et al), GSE118537
Case Study 4 Discover a novel regulator of GATA2?: trena + Brand lab + scATAC-seq