1 Version Info

## See system.file("LICENSE", package="MotifDb") for use restrictions.

R version: R version 4.3.1 (2023-06-16)
Bioconductor version: 3.18
Package version: 1.26.0

2 Introduction

Eukaryotic gene regulation can be very complex. Transcription factor binding to promoter DNA sequences is a stochastic process, and imperfect matches can be sufficient for binding. Chromatin remodeling, methylation, histone modification, chromosome interaction, distal enhancers, and the cooperative binding of transcription co-factors all play an important role. We avoid most of this complexity in this demonstration workflow in order to examine transcription factor binding sites in a small set of seven broadly co-expressed Saccharomyces cerevisiae genes of related function. These genes exhibit highly correlated mRNA expression across 200 experimental conditions, and are annotated to Nitrogen Catabolite Repression (NCR), the means by which yeast cells switch between using rich and poor nitrogen sources.

We will see, however, that even this small collection of co-regulated genes of similar function exhibits considerable regulatory complexity, with (among other things) activators and repressors competing to bind to the same DNA promoter sequence. Our case study sheds some light on this complexity, and demonstrates how several new Bioconductor packages and methods allow us to

3 Installation and Use

To install the necessary packages and all of their dependencies, evaluate the commands

## install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install(c("MotifDb",  "GenomicFeatures", 
           "TxDb.Scerevisiae.UCSC.sacCer3.sgdGene",
           "org.Sc.sgd.db", "BSgenome.Scerevisiae.UCSC.sacCer3",
           "motifStack", "seqLogo"))

Package installation is required only once per R installation. When working with an organism other than S. cerevisiae, substitute the three species-specific packages as needed.
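
Before installing, it can be useful to check which of these packages are already present. The following is a minimal sketch using only base R; the helper name `missing_packages` is hypothetical and is not part of BiocManager or any other package.

```r
# Hypothetical helper: return the subset of 'pkgs' whose namespace cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

workflow.pkgs <- c("MotifDb", "GenomicFeatures",
                   "TxDb.Scerevisiae.UCSC.sacCer3.sgdGene",
                   "org.Sc.sgd.db", "BSgenome.Scerevisiae.UCSC.sacCer3",
                   "motifStack", "seqLogo")
to.install <- missing_packages(workflow.pkgs)

# Install only what is missing (requires the BiocManager package):
# if (length(to.install) > 0) BiocManager::install(to.install)
```

Note that `requireNamespace` loads a package's namespace as a side effect when the package is available, which is harmless here because the workflow loads these packages anyway.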

To use these packages in an R session, evaluate these commands:

library(MotifDb)
library(S4Vectors)
library(seqLogo)
library(motifStack)
library(Biostrings)
library(GenomicFeatures)
library(org.Sc.sgd.db)
library(BSgenome.Scerevisiae.UCSC.sacCer3)
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)

These commands must be run once in each R session.

4 Biological Background

The x-y plot below displays expression levels of seven genes across 200 conditions, from a compendium of yeast expression data which accompanies Allocco et al, 2004, “Quantifying the relationship between co-expression, co-regulation and gene function”:

compendium.png


Allocco et al establish that

In S. cerevisiae, two genes have a 50% chance of having a common transcription factor binder if the correlation between their expression profiles is equal to 0.84.

These seven highly correlated (> 0.85) NCR genes form a connected subnetwork within the complete co-expression network derived from the compendium data (work not shown). Network edges indicate correlated expression of the two connected genes across all 200 conditions. The edges are colored as a function of that correlation: red for perfect correlation, white for a correlation of 0.85, and intermediate colors for intermediate values. DAL80 is rendered as an octagon to indicate its special status as a transcription factor. We presume, following Allocco et al, that such correlation among genes, including one transcription factor, is a plausible place to look for shared transcription factor binding sites.


dal80-subnet.png
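
The rule that defines the network edges (connect two genes when their expression profiles correlate above 0.85) can be sketched in base R. The expression matrix below is simulated purely for illustration; it is not the compendium data, and the gene labels `g1` to `g4` are placeholders.

```r
set.seed(1)
n.conditions <- 200
shared <- rnorm(n.conditions)                 # a common expression signal

# Three genes follow the shared signal; the fourth is unrelated.
expr <- cbind(g1 = shared + rnorm(n.conditions, sd = 0.2),
              g2 = shared + rnorm(n.conditions, sd = 0.2),
              g3 = shared + rnorm(n.conditions, sd = 0.2),
              g4 = rnorm(n.conditions))

cor.mat <- cor(expr)                          # Pearson correlation, gene by gene
threshold <- 0.85

# Keep each gene pair once (upper triangle) when it exceeds the threshold.
idx <- which(cor.mat > threshold & upper.tri(cor.mat), arr.ind = TRUE)
edges <- data.frame(from = rownames(cor.mat)[idx[, "row"]],
                    to   = colnames(cor.mat)[idx[, "col"]],
                    r    = cor.mat[idx])
```

In this simulated example the three genes driven by the shared signal end up mutually connected, while the unrelated gene has no edges. The `r` column could then drive an edge color scale of the kind described above.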


Some insight into the co-regulation of these seven genes is obtained from Georis et al, 2009, “The Yeast GATA Factor Gat1 Occupies a Central Position in Nitrogen Catabolite Repression-Sensitive Gene Activation”:

Saccharomyces cerevisiae cells are able to adapt their metabolism according to the quality of the nitrogen sources available in the environment. Nitrogen catabolite repression (NCR) restrains the yeast’s capacity to use poor nitrogen sources when rich ones are available. NCR-sensitive expression is modulated by the synchronized action of four DNA-binding GATA factors. Although the first identified GATA factor, Gln3, was considered the major activator of NCR-sensitive gene expression, our work positions Gat1 as a key factor for the integrated control of NCR in yeast for the following reasons: (i) Gat1 appeared to be the limiting factor for NCR gene expression, (ii) GAT1 expression was regulated by the four GATA factors in response to nitrogen availability, (iii) the two negative GATA factors Dal80 and Gzf3 interfered with Gat1 binding to DNA, and (iv) Gln3 binding to some NCR promoters required Gat1. Our study also provides mechanistic insights into the mode of action of the two negative GATA factors. Gzf3 interfered with Gat1 by nuclear sequestration and by competition at its own promoter. Dal80-dependent repression of NCR-sensitive gene expression occurred at three possible levels: Dal80 represses GAT1 expression, it competes with Gat1 for binding, and it directly represses NCR gene transcription. (emphasis added)

Thus DAL80 is but one of four interacting transcription factors which all bind the GATA motif. We will see below that DAL80 lacks the GATA sequence in its own promoter, but that the motif is well-represented in the promoters of the other six.

To demonstrate Bioconductor capabilities for finding binding sites for known transcription factors via sequence matching, we will use the shared GATA-binding motif as retrieved from MotifDb for one of those factors, DAL80.

6 Minimal Example

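Only eight lines of code (excluding library statements) are required to find two matches to a DAL80 motif in the promoter of DAL1. The motif used is the first entry that MotifDb returns for DAL80, which in the listing below is the ScerTF matrix.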

library(MotifDb)
library(seqLogo)
library(motifStack)
library(Biostrings)
library(GenomicFeatures)
library(org.Sc.sgd.db)
library(BSgenome.Scerevisiae.UCSC.sacCer3)
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)

query(MotifDb, "DAL80")   
## MotifDb object of length 6
## | Created from downloaded public sources, last update: 2022-Mar-04
## | 6 position frequency matrices from 6 sources:
## |        JASPAR_2014:    1
## |        JASPAR_CORE:    1
## |             ScerTF:    1
## |         jaspar2016:    1
## |         jaspar2018:    1
## |         jaspar2022:    1
## | 1 organism/s
## |        Scerevisiae:    6
## Scerevisiae-ScerTF-DAL80-harbison 
## Scerevisiae-JASPAR_CORE-DAL80-MA0289.1 
## Scerevisiae-JASPAR_2014-DAL80-MA0289.1 
## Scerevisiae-jaspar2016-DAL80-MA0289.1 
## Scerevisiae-jaspar2018-DAL80-MA0289.1 
## Scerevisiae-jaspar2022-DAL80-MA0289.1
## [[1]] is the first entry listed above, Scerevisiae-ScerTF-DAL80-harbison
pfm.dal80 <- query(MotifDb,"DAL80")[[1]]
seqLogo(pfm.dal80)

dal1 <- "YIR027C"
chromosomal.loc <- 
  transcriptsBy(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene, by="gene") [dal1]
promoter.dal1 <- 
  getPromoterSeq(chromosomal.loc, Scerevisiae, upstream=1000, downstream=0)
## scale the frequencies to integer counts before matching
pcm.dal80 <- round(100 * pfm.dal80)
matchPWM(pcm.dal80, unlist(promoter.dal1)[[1]], min.score="90%")
## Views on a 1000-letter DNAString subject
## subject: TTGAGGAGTTGTCCACATACACATTAGTGTTGAT...GCAAAAAAAAAGTGAAATACTGCGAAGAACAAAG
## views:
##       start end width
##   [1]   621 625     5 [GATAA]
##   [2]   638 642     5 [GATAA]
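
To make the matching step concrete, here is a minimal base-R sketch of what a matrix scan does: slide a window along the sequence, sum the matrix entries for the bases observed in the window, and report windows whose score reaches a fraction of the highest possible score (the Biostrings documentation describes a percentage `min.score` in those terms). This is an illustration only, not the Biostrings implementation. The function name `scan_pwm` is hypothetical, and the toy matrix is invented for the example; it is not the DAL80 matrix from MotifDb.

```r
# Toy position frequency matrix (each column sums to 1), favoring G-A-T-A-A.
pfm <- matrix(c(0.05, 0.85, 0.05, 0.85, 0.85,   # A
                0.05, 0.05, 0.05, 0.05, 0.05,   # C
                0.85, 0.05, 0.05, 0.05, 0.05,   # G
                0.05, 0.05, 0.85, 0.05, 0.05),  # T
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "C", "G", "T"), NULL))
pcm <- round(100 * pfm)   # same frequency-to-count scaling as in the example above

# Hypothetical helper: return start positions of windows scoring at least
# min.frac times the highest score the matrix can produce.
scan_pwm <- function(pwm, sequence, min.frac = 0.9) {
  bases <- strsplit(sequence, "")[[1]]
  width <- ncol(pwm)
  n.windows <- length(bases) - width + 1
  if (n.windows < 1) return(integer(0))
  max.score <- sum(apply(pwm, 2, max))
  scores <- vapply(seq_len(n.windows), function(start) {
    window <- bases[start:(start + width - 1)]
    # Pick the matrix entry for each observed base at each motif position.
    sum(pwm[cbind(match(window, rownames(pwm)), seq_len(width))])
  }, numeric(1))
  which(scores >= min.frac * max.score)
}

hits <- scan_pwm(pcm, "TTGATAACCGATAAG")
```

Here `hits` contains the start positions 3 and 10, the two occurrences of GATAA in the toy sequence. Like the call above, this sketch scans only the strand it is given; finding sites on the opposite strand would require scanning the reverse complement as well.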

7 Sample Workflow: an extended example

We begin by visualizing DAL80’s TF binding motif using either of two Bioconductor packages: seqLogo and motifStack. First, query MotifDb for the PFM (position frequency matrix):

query(MotifDb,"DAL80")
## MotifDb object of length 6
## | Created from downloaded public sources, last update: 2022-Mar-04
## | 6 position frequency matrices from 6 sources:
## |        JASPAR_2014:    1
## |        JASPAR_CORE:    1
## |             ScerTF:    1
## |         jaspar2016:    1
## |         jaspar2018:    1
## |         jaspar2022:    1
## | 1 organism/s
## |        Scerevisiae:    6
## Scerevisiae-ScerTF-DAL80-harbison 
## Scerevisiae-JASPAR_CORE-DAL80-MA0289.1 
## Scerevisiae-JASPAR_2014-DAL80-MA0289.1 
## Scerevisiae-jaspar2016-DAL80-MA0289.1 
## Scerevisiae-jaspar2018-DAL80-MA0289.1 
## Scerevisiae-jaspar2022-DAL80-MA0289.1

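The query returns six entries, but only two distinct motif identifiers: one from ScerTF, and the JASPAR matrix MA0289.1, which is listed under five JASPAR releases. How do the two compare? The seqLogo package has been the standard tool for viewing sequence logos, but can only portray one logo at a time.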

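dal80.all <- query(MotifDb,"DAL80")
## select by name rather than by position, since the order of entries can change
dal80.jaspar <- dal80.all[["Scerevisiae-JASPAR_CORE-DAL80-MA0289.1"]]
dal80.scertf <- dal80.all[["Scerevisiae-ScerTF-DAL80-harbison"]]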
seqLogo(dal80.jaspar)

seqLogo(dal80.scertf)
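
By default, seqLogo scales the height of each letter stack by the information content of that position (its `ic.scale` argument). As a rough guide to reading the logos, the sketch below computes information content in bits for each column of a probability matrix, using the standard formula of 2 bits minus the Shannon entropy for a four-letter alphabet. The function name `information_content` and the toy matrix are illustrative assumptions, not part of seqLogo.

```r
# Information content (bits) per motif position for a 4-row probability matrix.
information_content <- function(pfm) {
  apply(pfm, 2, function(p) {
    p <- p[p > 0]              # treat 0 * log2(0) as 0
    2 + sum(p * log2(p))
  })
}

toy <- cbind(c(A = 1,    C = 0,    G = 0,    T = 0),     # fully conserved
             c(A = 0.25, C = 0.25, G = 0.25, T = 0.25),  # uninformative
             c(A = 0.5,  C = 0,    G = 0.5,  T = 0))     # two equally likely bases
ic <- information_content(toy)
```

For the toy matrix, the three columns carry 2, 0, and 1 bits respectively. Applied to `dal80.jaspar` or `dal80.scertf`, the same function would indicate which positions of each motif are most strongly conserved.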

With a little preparation, the motifStack package can plot both motifs together. First, create instances of the pfm class:

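## select each matrix by name so that the labels match the sources
pfm.dal80.jaspar <- new("pfm", 
                        mat=query(MotifDb, "dal80")[["Scerevisiae-JASPAR_CORE-DAL80-MA0289.1"]], 
                        name="DAL80-JASPAR")
pfm.dal80.scertf <- new("pfm", 
                        mat=query(MotifDb, "dal80")[["Scerevisiae-ScerTF-DAL80-harbison"]], 
                        name="DAL80-ScerTF")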
plotMotifLogoStack(DNAmotifAlignment(c(pfm.dal80.scertf, pfm.dal80.jaspar)))
## Loading required namespace: Cairo