Introduction

A central concern of genome biology is improving understanding of gene transcription. In simple terms, transcription factors (TFs) are proteins that bind to DNA, typically near gene promoter regions. The role of TFs in gene expression variation is of great interest. Progress in deciphering genetic and epigenetic processes that affect TF abundance and function will be essential in clarifying and interpreting gene expression variation patterns and their effects on phenotype. Difficulties of identifying TFs, and opportunities for doing so in systems biology contexts, are reviewed in Weirauch et al. (2014).

This paper describes an R/Bioconductor package called TFutils, which assembles various resources intended to clarify and unify approaches to working with TF concepts in bioinformatic analysis. Computations described in this paper can be carried out with Bioconductor version 3.6. The package can be installed with

library(BiocInstaller) 
# use source("http://www.bioconductor.org/biocLite.R") if not available
biocLite("TFutils")

In the next section we describe the basic concepts of enumerating and classifying TFs, enumerating their targets, and representing genome-wide quantification of TF binding affinity. This is followed by a review of the key data structures and functions provided in the package, and an example in cancer infomatics.

Basic concepts of transcription factor bioinformatics

Enumerating transcription factors

Given the importance of the topic, it is not surprising that a number of bioinformatic research groups have published catalogs of transcription factors along with metadata about their features. Standard nomenclature for TFs has yet to be established. Gene symbols, motif sequences, and position-weight matrix catalog entries have all been used as TF identifiers.

In TFutils we have gathered information from four widely used resources: Gene Ontology (GO, Ashburner et al. (2000), in which GO:0003700 is the tag for the molecular function concept “DNA binding transcription factor activity”), CISBP (Weirauch et al. (2014)), HOCOMOCO (Kulakovskiy et al. (2018)), and MSigDb (Subramanian et al. (2005)). Figure @ref(fig:lkupset) depicts the sizes of these catalogs, measured using counts of unique HGNC gene symbols. The enumeration for GO uses Bioconductor’s org.Hs.eg.db package to find direct associations from GO:0003700 to HGNC symbols. The enumeration for MSigDb is heuristic and involves parsing the gene set identifiers used in MSigDb for exact or close matches to HGNC symbols. For CISBP and HOCOMOCO, the associated web servers provide easily parsed tabular catalogs.

Sizes of TF catalogs and of intersections based on HGNC symbols for TFs.

Sizes of TF catalogs and of intersections based on HGNC symbols for TFs.

Classification of transcription factors

As noted by Weirauch et al. (2014), interpretation of the “function and evolution of DNA sequences” is dependent on the analysis of sequence-specific DNA binding domains. These domains are dynamic and cell-type specific (Gertz et al. (2013)). Classifying TFs according to features of the binding domain is an ongoing process of increasing intricacy. Figure @ref(fig:TFclass) shows excerpts of hierarchies of terms related to TF type derived from GO (on the left) and TFclass (Wingender et al. (2018)). There is a disagreement between our enumeration of TFs based on GO in Figure @ref(fig:lkupset) and the 1919 shown in AmiGO, as the latter includes a broader collection of receptor activities.

Screenshots of AmiGO and TFClass hierarchy excerpts.

Screenshots of AmiGO and TFClass hierarchy excerpts.

Table @ref(tab:classtab) provides examples of frequently encountered TF classifications in the CISBP and HOCOMOCO catalogs. The numerical components of the HOCOMOCO classes correspond to TFClass subfamilies (Wingender et al. (2018)).

(#tab:classtab) Most frequently represented TF classes in CISBP and HOCOMOCO. Entries in columns Nc (Nh) are numbers of distinct TFs annotated to classes in columns CISBP (HOCOMOCO) respectively.
CISBP Nc HOCOMOCO Nh
C2H2 ZF 655 More than 3 adjacent zinc finger factors{2.3.3} 106
Homeodomain 199 HOX-related factors{3.1.1} 43
bHLH 104 NK-related factors{3.1.2} 36
bZIP 66 Paired-related HD factors{3.1.3} 35
Unknown 49 Factors with multiple dispersed zinc fingers{2.3.4} 30
Forkhead 48 Forkhead box (FOX) factors{3.3.1} 27
Sox 48 Ets-related factors{3.5.2} 25
Nuclear receptor 46 Three-zinc finger Krueppel-related factors{2.3.1} 20
Myb/SANT 30 POU domain factors{3.1.10} 18
Ets 27 Tal-related factors{1.2.3} 18

Enumerating TF targets

The Broad Institute MSigDb (Subramanian et al. (2005)) includes a gene set collection devoted to cataloging TF targets. We have used Bioconductor’s GSEABase package to import and serialize the gmt representation of this collection.

TFutils::tftColl
## GeneSetCollection
##   names: AAANWWTGC_UNKNOWN, AAAYRNCTG_UNKNOWN, ..., GCCATNTTG_YY1_Q6 (615 total)
##   unique identifiers: 4208, 481, ..., 56903 (12774 total)
##   types in collection:
##     geneIdType: EntrezIdentifier (1 total)
##     collectionType: NullCollection (1 total)

Names of TFs for which target sets are assembled are encoded in a systematic way, with underscores separating substrings describing motifs, genes, and versions. Some peculiarity in nomenclature in the MSigDb labels can be observed:

grep("NFK", names(TFutils::tftColl), value=TRUE)
## [1] "NFKAPPAB65_01"         "NFKAPPAB_01"           "NFKB_Q6"              
## [4] "NFKB_C"                "NFKB_Q6_01"            "GGGNNTTTCC_NFKB_Q6_01"

Manual curation will be needed to improve the precision with which MSigDb TF target sets can used.

Cataloging TF targets

The MSigDb collection is provided primarily for the purpose of defining gene sets in terms of TF targets. We use the GSEABase package GeneSetCollection class to manage these sets.

Quantitative predictions of TF binding affinities

The FIMO algorithm of the MEME suite (Grant, Bailey, and Noble (2011)) was used to score the human reference genome for TF binding affinity for 689 motif matrices to which genes are associated. Sixteen (16) tabix-indexed BED files are lodged in an AWS S3 bucket for illustration purposes.

library(GenomicFiles)
data(fimo16)
fimo16
## GenomicFiles object with 0 ranges and 16 files: 
## files: M0635_1.02sort.bed.gz, M3433_1.02sort.bed.gz, ..., M6159_1.02sort.bed.gz, M6497_1.02sort.bed.gz 
## detail: use files(), rowRanges(), colData(), ...
head(colData(fimo16))
## DataFrame with 6 rows and 2 columns
##          Mtag        HGNC
##   <character> <character>
## 1     M0635_1      DMRTC2
## 2     M3433_1       HOXA3
## 3     M3467_1        IRF1
## 4     M3675_1      POU2F1
## 5     M3698_1        TP53
## 6     M3966_1       STAT1

We harvest scores in a genomic interval of interest (bound to fimo16 in the rowRanges assignment below) using reduceByFile. This yields a list with one element per file. Each such element holds a list of scanTabix results, one per query range.

library(BiocParallel)
register(SerialParam()) # important for macosx?
rowRanges(fimo16) = GRanges("chr17", IRanges(38.077e6, 38.084e6))
rr = GenomicFiles::reduceByFile(fimo16, MAP=function(r,f)
  scanTabix(f, param=r))

scanTabix produces a list of vectors of text strings, which we parse with data.table::fread. The resulting tables are then reduced to a genomic location and -log10 of the p-value derived from the binding affinity statistic of FIMO in the vicinity of that location.

asdf = function(x) data.table::fread(paste0(x, collapse="\n"), header=FALSE)
gg = lapply(rr, function(x) {
       tmp = asdf(x[[1]][[1]]) 
       data.frame(loc=tmp$V2, score=-log10(tmp$V7))
     })
for (i in 1:length(gg))  gg[[i]]$tf = colData(fimo16)[i,2]

It turns out there are too many distinct TFs to display individually, so let’s also label the scores with the TF families as defined in CISBP.

matchcis = match(colData(fimo16)[,2], cisbpTFcat[,2])
famn = cisbpTFcat[matchcis,]$Family_Name
for (i in 1:length(gg))  gg[[i]]$tffam = famn[i]
nn = do.call(rbind, gg)

A simple display of predicted TF binding affinity in a genomic region is then

library(ggplot2)
ggplot(nn, aes(x=loc,y=score,group=tffam, colour=tffam)) + geom_point()

Summary

We have compared enumerations of human transcription factors by different projects, provided access to two forms of binding domain classification, and illustrated the use of cloud-resident genome-wide binding predictions. In the next section we review selected details of data structures and methods of the TFutils package.

Methods

Implementation

The TFCatalog class

A number of relatively small reference

data(tftColl)
data(tftCollMap)
data(cisbpTFcat)

TFs_MSIG = TFCatalog(name="MsigDb.TFT", nativeIds=names(tftColl),
 HGNCmap=data.frame(tftCollMap,stringsAsFactors=FALSE))

TFs_CISBP = TFCatalog(name="CISBP.info", nativeIds=cisbpTFcat[,1],
 HGNCmap = cisbpTFcat)

TFs_MSIG

TFs_CISBP

Operation

Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, et al. 2000. “Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium.” Nature Genetics 25 (1):25–29. https://doi.org/10.1038/75556.

Gertz, Jason, Daniel Savic, Katherine E. Varley, E. Christopher Partridge, Alexias Safi, Preti Jain, Gregory M. Cooper, Timothy E. Reddy, Gregory E. Crawford, and Richard M. Myers. 2013. “Distinct Properties of Cell-Type-Specific and Shared Transcription Factor Binding Sites.” Molecular Cell 52 (1):25–36. https://doi.org/10.1016/j.molcel.2013.08.037.

Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. 2011. “FIMO: Scanning for Occurrences of a Given Motif.” Bioinformatics (Oxford, England) 27 (7):1017–8. https://doi.org/10.1093/bioinformatics/btr064.

Kulakovskiy, Ivan V., Ilya E. Vorontsov, Ivan S. Yevshin, Ruslan N. Sharipov, Alla D. Fedorova, Eugene I. Rumynskiy, Yulia A. Medvedeva, et al. 2018. “HOCOMOCO: Towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis.” Nucleic Acids Research 46 (D1). Oxford University Press:D252–D259. https://doi.org/10.1093/nar/gkx1106.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43). National Academy of Sciences:15545–50. https://doi.org/10.1073/pnas.0506580102.

Weirauch, Matthew T., Ally Yang, Mihai Albu, Atina G. Cote, Alejandro Montenegro-Montero, Philipp Drewe, Hamed S. Najafabadi, et al. 2014. “Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity.” Cell 158 (6):1431–43. https://doi.org/10.1016/j.cell.2014.08.009.

Wingender, Edgar, Torsten Schoeps, Martin Haubrock, Mathias Krull, and Jürgen Dönitz. 2018. “TFClass: Expanding the classification of human transcription factors to their mammalian orthologs.” Nucleic Acids Research 46 (D1):D343–D347. https://doi.org/10.1093/nar/gkx987.