1 Introduction

ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a widely used technique for chromatin accessibility analysis. Detecting differential activity of transcription factors (TFs) between two experimental conditions makes it possible to identify the key factors behind a phenotype. Many tools have been developed to detect TFs with differential activity (DATFs) between groups of samples. These tools fall into two groups. One group detects DATFs from differential accessibility analysis, such as MEME [1], HOMER [2], Enrichr [3], and ChEA [4]. The other group finds DATFs by enrichment tests, such as BiFET [5], diffTF [6], and TFEA [7]. For single-cell ATAC-seq analysis, Signac and chromVAR are widely used tools.

2 Motivation

All of these tools detect DATFs by considering only the open status of chromatin; none of them takes the TF footprint into account. Open chromatin only indicates that a TF can bind at that position, whereas the TF footprint in ATAC-seq data reflects whether a TF is actually bound.

To help researchers quickly assess the differential activity of hundreds of TFs by detecting the difference in TF footprint via enrichment score [8], we have developed the ATACseqTFEA package. The ATACseqTFEA package is a robust and reliable computational tool to identify the key regulators responding to a phenotype.

[Figure: schematic diagram of ATACseqTFEA]

3 Quick start

Here is an example using ATACseqTFEA with a subset of ATAC-seq data.

3.1 Installation

First, install ATACseqTFEA and the other packages required to run the examples. Note that the example dataset used here is from zebrafish. To analyze a dataset from a different species or assembly, install the corresponding BSgenome and TxDb packages. For example, to analyze mouse data aligned to “mm10”, install “BSgenome.Mmusculus.UCSC.mm10” and “TxDb.Mmusculus.UCSC.mm10.knownGene”. You can also generate a TxDb object with the GenomicFeatures package, either from a local “gff” file with makeTxDbFromGFF or from online resources with makeTxDbFromUCSC, makeTxDbFromBiomart, and makeTxDbFromEnsembl.

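## install BiocManager first if it is not available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("ATACseqTFEA",
                       "ATACseqQC",
                       "Rsamtools",
                       "BSgenome.Drerio.UCSC.danRer10",
                       "TxDb.Drerio.UCSC.danRer10.refGene"))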

3.2 Load library

library(ATACseqTFEA)
library(BSgenome.Drerio.UCSC.danRer10) ## for binding sites search
library(ATACseqQC) ## for footprint

3.3 Prepare binding sites

TFEA requires two inputs: the binding sites and the change ranks. To get the binding sites, the ATACseqTFEA package provides the prepareBindingSites function. Users can also obtain the list of binding sites with other tools such as “fimo” [9].

The prepareBindingSites function requires a set of position weight matrices (PWMs) of TF motifs. ATACseqTFEA provides a merged PWMatrixList of 405 motifs. The PWMatrixList is a collection of jaspar2018, jolma2013, and cisbp_1.02 motifs from the MotifDb package (v 1.28.0), merged when the distance calculated by the MotIV::motifDistances function (v 1.42.0) is smaller than 1e-9. The merged motifs were exported by motifStack (v 1.30.0).

motifs <- readRDS(system.file("extdata", "PWMatrixList.rds",
                               package="ATACseqTFEA"))

The best_curated_Human data set is a list of TF motifs downloaded from the TFEA GitHub repository [7]. It contains 1279 human motifs.

motifs_human <- readRDS(system.file("extdata", "best_curated_Human.rds",
                                    package="ATACseqTFEA"))

Another list of non-redundant TF motifs is available, downloaded from DeepSTARR [10]. It contains 6502 motifs.

MotifsSTARR <- readRDS(system.file("extdata", "cluster_PWMs.rds",
                                      package="ATACseqTFEA"))

To scan the binding sites along a genome, a BSgenome object is required by the prepareBindingSites function.

# For a test run, we use a subset of the data within chr1:5000-100000.
# For real data, use the merged peak list as the grange input.
# Drerio is the short name of BSgenome.Drerio.UCSC.danRer10.
seqlev <- "chr1"
bindingSites <-
  prepareBindingSites(motifs, Drerio, seqlev,
                      grange=GRanges("chr1", IRanges(5000, 100000)),
                      p.cutoff = 5e-05) # set a higher p.cutoff to get more sites
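To make concrete what scanning a sequence for binding sites involves, here is a minimal base-R sketch that scores every window of a short sequence against a toy log-odds PWM. The matrix values, the sequence, the helper name scan_pwm, and the raw score threshold are all made up for illustration; prepareBindingSites does the real work on a BSgenome object and applies a p-value cutoff rather than a raw score. Only the forward strand is scanned here.

```r
# Toy log-odds PWM for a 4-bp motif with consensus ACGT (invented values).
pwm <- matrix(c( 1.2, -1.0, -1.0, -1.0,
                -1.0,  1.1, -1.0, -1.0,
                -1.0, -1.0,  1.3, -1.0,
                -1.0, -1.0, -1.0,  0.9),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Hypothetical helper: score each window and keep those reaching min_score.
scan_pwm <- function(dna, pwm, min_score) {
  bases <- strsplit(dna, "")[[1]]
  w <- ncol(pwm)
  starts <- seq_len(length(bases) - w + 1)
  scores <- vapply(starts, function(i) {
    window <- bases[i:(i + w - 1)]
    # pick one matrix cell per motif position: (row of the base, column)
    sum(pwm[cbind(match(window, rownames(pwm)), seq_len(w))])
  }, numeric(1))
  hits <- data.frame(start = starts, score = scores)
  hits[hits$score >= min_score, ]
}

hits <- scan_pwm("TTACGTAAACGA", pwm, min_score = 4)
hits
```

Lowering min_score in this sketch returns more candidate sites, in the same way that raising p.cutoff does for prepareBindingSites.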

3.4 TFEA

Correct insertion sites are the key to the enrichment analysis of TF binding sites. The parameters positive and negative of the TFEA function are used to shift the 5’ ends of the reads to the correct insertion positions. However, this shift does not take soft clipping of the reads into account. The best way to generate correctly shifted bam files is to use ATACseqQC::shiftGAlignmentsList [11] for paired-end or shiftGAlignments for single-end bam files. The one-step TFEA function requires at least two biological replicates per condition.

bamExp <- system.file("extdata",
                      c("KD.shift.rep1.bam",
                        "KD.shift.rep2.bam"),
                      package="ATACseqTFEA")
bamCtl <- system.file("extdata",
                      c("WT.shift.rep1.bam",
                        "WT.shift.rep2.bam"),
                      package="ATACseqTFEA")
res <- TFEA(bamExp, bamCtl, bindingSites=bindingSites,
            positive=0, negative=0) # the bam files were shifted reads
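The effect of the positive and negative parameters can be sketched in base R as follows. This is an illustration only: the helper name shift_5prime and the toy reads are hypothetical, and the offsets of 4 and 5 bases are the commonly used Tn5 values, assumed here rather than taken from the text above. Like the shift described above, the sketch ignores soft clipping.

```r
# Hypothetical helper: move the 5' end of each read to the insertion site.
# Plus-strand reads start at `start`; minus-strand reads start at `end`.
shift_5prime <- function(start, end, strand, positive = 4L, negative = 5L) {
  five_prime <- ifelse(strand == "+", start, end)
  offset <- ifelse(strand == "+", positive, -negative)
  five_prime + offset
}

# Two toy reads, one per strand
reads <- data.frame(start = c(100L, 200L),
                    end = c(150L, 250L),
                    strand = c("+", "-"))
reads$insertion <- shift_5prime(reads$start, reads$end, reads$strand)

# For bam files that are already shifted, no further shift is applied,
# which corresponds to positive=0 and negative=0 in the call above.
reads$unshifted <- shift_5prime(reads$start, reads$end, reads$strand,
                                positive = 0L, negative = 0L)
reads
```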

3.5 View results

The results are saved in a TFEAresults object, and several functions are available to present them. The plotES function returns a ggplot object when a single TF is given and no outfolder is defined. The ESvolcanoplot function provides an overview of the enrichment of all TFs. We can also borrow the factorFootprints function from the ATACseqQC package to view the footprints of one TF.

TF <- "Tal1::Gata1"
## volcanoplot
ESvolcanoplot(TFEAresults=res, TFnameToShow=TF)

### plot enrichment score for one TF
plotES(res, TF=TF, outfolder=NA)

## footprint
sigs <- factorFootprints(c(bamCtl, bamExp), 
                         pfm = as.matrix(motifs[[TF]]),
                         bindingSites = getBindingSites(res, TF=TF),
                         seqlev = seqlev, genome = Drerio,
                         upstream = 100, downstream = 100,
                         group = rep(c("WT", "KD"), each=2))

## export the results into a csv file
write.csv(res$resultsTable, tempfile(fileext = ".csv"), 
          row.names=FALSE)

A sample command-line script, sample_scripts.R, is available in the extdata folder of the package.

4 Do TFEA step by step

The one-step TFEA is a function containing multiple steps, which include:

  1. count the reads in the binding sites, proximal regions, and distal regions;
  2. filter out the binding sites that are not open;
  3. normalize the counts by the width of the counting region;
  4. calculate the binding scores and weight them by the open scores;
  5. run differential analysis of the binding scores with limma;
  6. filter the differential results by P-value and fold change;
  7. run the TF enrichment analysis.

If you want to tune the parameters, it is better to run the analysis step by step so that unchanged steps do not need to be recomputed. Each step is detailed below.

4.1 Counting reads

We count the insertion sites in the binding sites and in the proximal and distal regions by counting the 5’ ends of the reads in a shifted bam file. We suggest setting proximal and distal to the same value.

# prepare the counting region
exbs <- expandBindingSites(bindingSites=bindingSites,
                           proximal=40,
                           distal=40,
                           gap=10)
## count reads by 5'ends
counts <- count5ends(bam=c(bamExp, bamCtl),
                     positive=0L, negative=0L,
                     bindingSites = bindingSites,
                     bindingSitesWithGap=exbs$bindingSitesWithGap,
                     bindingSitesWithProximal=exbs$bindingSitesWithProximal,
                     bindingSitesWithProximalAndGap=
                         exbs$bindingSitesWithProximalAndGap,
                     bindingSitesWithDistal=exbs$bindingSitesWithDistal)
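The nested regions passed to count5ends can be pictured with a small base-R sketch. The layout below is an assumption inferred from the argument names (each region extends the previous one symmetrically on both sides); the helper expand_site and the toy coordinates are hypothetical and are not the package implementation.

```r
# Hypothetical helper: grow one binding site [start, end] step by step.
expand_site <- function(start, end, proximal = 40L, distal = 40L, gap = 10L) {
  grow <- function(by) c(start = start - by, end = end + by)
  rbind(
    bindingSites       = grow(0L),
    withGap            = grow(gap),
    withProximal       = grow(gap + proximal),
    withProximalAndGap = grow(2L * gap + proximal),
    withDistal         = grow(2L * gap + proximal + distal)
  )
}

# A toy 12-bp binding site
regions <- expand_site(1000L, 1011L)
regions
```

Under this assumed layout, the reads of the proximal region would be those falling in withProximal but not in withGap, and the reads of the distal region those in withDistal but not in withProximalAndGap.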

4.2 Filter the counts

We keep only the binding sites with at least 1 read in the proximal region. Users may want to filter the sites by more stringent criteria such as “proximalRegion>1”.

colnames(counts)
## [1] "bindingSites"   "proximalRegion" "distalRegion"
counts <- eventsFilter(counts, "proximalRegion>0")
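To see how the choice of filter string changes the number of retained sites, here is a toy base-R sketch. The data frame, the helper keep_rows, and the idea of evaluating the string against the columns are assumptions for illustration; they are not how eventsFilter is necessarily implemented, and the real counts object holds values per sample.

```r
# Toy counts for four binding sites (invented numbers)
toy <- data.frame(bindingSites   = c(0, 3, 1, 0),
                  proximalRegion = c(0, 5, 1, 2),
                  distalRegion   = c(1, 2, 0, 4))

# Hypothetical helper: evaluate a filter string against the columns
keep_rows <- function(df, filter) {
  keep <- eval(parse(text = filter), envir = df)
  df[keep, , drop = FALSE]
}

loose  <- keep_rows(toy, "proximalRegion>0")
strict <- keep_rows(toy, "proximalRegion>1")
c(loose = nrow(loose), strict = nrow(strict))
```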

4.3 Normalize the counts by width of count region

We normalize the counts to counts per base (CPB).

counts <- countsNormalization(counts, proximal=40, distal=40)
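As a worked example of counts per base, the sketch below divides toy counts by the width of the region they were counted in. It assumes that the proximal and distal regions flank the binding site on both sides, so that each spans twice the given number of bases; the numbers and variable names are invented.

```r
site_width <- 12   # width of one toy binding site
proximal <- 40
distal <- 40

# Toy raw counts for one binding site
raw <- c(bindingSites = 6, proximalRegion = 40, distalRegion = 20)

# Assumed widths: flanking regions are counted on both sides of the site
widths <- c(bindingSites   = site_width,
            proximalRegion = 2 * proximal,
            distalRegion   = 2 * distal)

cpb <- raw / widths
cpb
```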

4.4 Get weighted binding scores

Here we use the open score to weight the binding score. Users can also define the weights for the binding score via the weight parameter of the getWeightedBindingScore function.

counts <- getWeightedBindingScore(counts)
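The idea of weighting a binding score by an open score can be illustrated with the base-R sketch below. The text above does not give the formulas, so everything here is an assumption made for illustration: the binding score is taken as the proximal signal relative to the signal inside the site, the open score as the proximal signal relative to the distal signal, and the weight as the open score rescaled to a maximum of 1. The pseudocount is also hypothetical. Consult the documentation of getWeightedBindingScore for the definitions actually used.

```r
# Toy counts per base for three binding sites (invented numbers)
cpb <- data.frame(bindingSites   = c(0.5, 0.1, 0.4),
                  proximalRegion = c(0.5, 0.8, 0.2),
                  distalRegion   = c(0.25, 0.2, 0.2))

pseudo <- 0.01  # hypothetical pseudocount to avoid division by zero

# Assumed definitions, for illustration only
binding_score <- (cpb$proximalRegion + pseudo) / (cpb$bindingSites + pseudo)
open_score    <- (cpb$proximalRegion + pseudo) / (cpb$distalRegion + pseudo)

# Rescale the open score to (0, 1] and use it as the weight
weight   <- open_score / max(open_score)
weighted <- binding_score * weight
round(weighted, 3)
```

In this toy data the second site, which is both open and depleted of reads inside the motif, receives the highest weighted score.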

4.5 Differential analysis

Here we use DBscore, which borrows the power of the limma package, to do differential binding analysis.

design <- cbind(CTL=1, EXPvsCTL=c(1, 1, 0, 0))
counts <- DBscore(counts, design=design, coef="EXPvsCTL")
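The rows of the design matrix follow the order of the bam files given to count5ends, c(bamExp, bamCtl), so the first two rows are the experimental samples. As a sketch, the same matrix can be built with model.matrix from a grouping factor; the variable names group, design_mm, and design_manual are introduced here for illustration only.

```r
# Sample groups in the same order as the bam files: EXP, EXP, CTL, CTL
group <- factor(c("EXP", "EXP", "CTL", "CTL"), levels = c("CTL", "EXP"))

# Intercept (control level) plus the EXP-versus-CTL contrast
design_mm <- model.matrix(~ group)
colnames(design_mm) <- c("CTL", "EXPvsCTL")

# The manually written design from the vignette
design_manual <- cbind(CTL = 1, EXPvsCTL = c(1, 1, 0, 0))

# Both constructions give the same numbers
all(design_mm == design_manual)
```

Building the design from a factor makes it harder to mismatch rows and samples when the number of replicates grows.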

4.6 Filter the DB results

We can filter the differential binding results with the eventsFilter function to reduce the data size. For the sample data, we skip this step.

4.7 TF enrichment analysis

Finally, we use the function doTFEA to get the enrichment scores.

res <- doTFEA(counts)
res
## This is an object of TFEAresults with 
## slot enrichmentScore ( matrix:  399 x 2166 ), 
## slot bindingSites ( GRanges object with  2166  ranges and  12  metadata columns ), 
## slot motifID ( a list of the positions of binding sites for  399 TFs ), and 
## slot resultsTable ( 399  x  5 ). Here is the top 2 rows:
##          TF enrichmentScore normalizedEnrichmentScore   p_value   adjPval
## NRF1   NRF1       0.1923613                 0.7960275 0.7253614 0.9994472
## Gfi1b Gfi1b       0.3099024                 1.3769160 0.1143751 0.9585665
plotES(res, TF=TF, outfolder=NA) ## shows the same figure as the one above
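For intuition about what an enrichment score measures, here is a minimal base-R sketch of an unweighted GSEA-style running sum [8] over a ranked list of binding sites. It is a simplified illustration, not the scoring used by doTFEA, which may weight the steps differently and also estimates significance; the helper running_es and the toy ranking are hypothetical.

```r
# in_set[i] is TRUE when the i-th ranked site belongs to the TF of interest.
running_es <- function(in_set) {
  n_hit  <- sum(in_set)
  n_miss <- length(in_set) - n_hit
  # step up at each hit, step down at each miss; the walk ends at zero
  step <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  running <- cumsum(step)
  # enrichment score: the largest deviation of the walk from zero
  list(running = running, es = running[which.max(abs(running))])
}

# Toy ranking in which the TF's sites cluster near the top
in_set <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
toy_es <- running_es(in_set)
toy_es$es
```

Sites of a TF that concentrate at one end of the ranking push the walk far from zero, which is what the curve drawn by plotES visualizes.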

5 SessionInfo

sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ATACseqQC_1.28.0                    Rsamtools_2.20.0                   
##  [3] BSgenome.Drerio.UCSC.danRer10_1.4.2 BSgenome_1.72.0                    
##  [5] rtracklayer_1.64.0                  BiocIO_1.14.0                      
##  [7] Biostrings_2.72.0                   XVector_0.44.0                     
##  [9] GenomicRanges_1.56.0                GenomeInfoDb_1.40.0                
## [11] IRanges_2.38.0                      S4Vectors_0.42.0                   
## [13] BiocGenerics_0.50.0                 ATACseqTFEA_1.6.0                  
## [15] BiocStyle_2.32.0                   
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.4.0               bitops_1.0-7               
##   [3] filelock_1.0.3              tibble_3.2.1               
##   [5] R.oo_1.26.0                 graph_1.82.0               
##   [7] XML_3.99-0.16.1             DirichletMultinomial_1.46.0
##   [9] lifecycle_1.0.4             httr2_1.0.1                
##  [11] pwalign_1.0.0               edgeR_4.2.0                
##  [13] lattice_0.22-6              ensembldb_2.28.0           
##  [15] MASS_7.3-60.2               magrittr_2.0.3             
##  [17] limma_3.60.0                sass_0.4.9                 
##  [19] rmarkdown_2.26              jquerylib_0.1.4            
##  [21] yaml_2.3.8                  grImport2_0.3-1            
##  [23] DBI_1.2.2                   CNEr_1.40.0                
##  [25] ade4_1.7-22                 abind_1.4-5                
##  [27] zlibbioc_1.50.0             preseqR_4.0.0              
##  [29] purrr_1.0.2                 R.utils_2.12.3             
##  [31] AnnotationFilter_1.28.0     RCurl_1.98-1.14            
##  [33] pracma_2.4.4                rappdirs_0.3.3             
##  [35] GenomeInfoDbData_1.2.12     ggrepel_0.9.5              
##  [37] seqLogo_1.70.0              annotate_1.82.0            
##  [39] codetools_0.2-20            DelayedArray_0.30.0        
##  [41] xml2_1.3.6                  tidyselect_1.2.1           
##  [43] futile.logger_1.4.3         farver_2.1.1               
##  [45] UCSC.utils_1.0.0            universalmotif_1.22.0      
##  [47] base64enc_0.1-3             matrixStats_1.3.0          
##  [49] BiocFileCache_2.12.0        GenomicAlignments_1.40.0   
##  [51] jsonlite_1.8.8              multtest_2.60.0            
##  [53] motifStack_1.48.0           survival_3.6-4             
##  [55] motifmatchr_1.26.0          tools_4.4.0                
##  [57] progress_1.2.3              TFMPvalue_0.0.9            
##  [59] Rcpp_1.0.12                 glue_1.7.0                 
##  [61] ChIPpeakAnno_3.38.0         SparseArray_1.4.0          
##  [63] xfun_0.43                   MatrixGenerics_1.16.0      
##  [65] dplyr_1.1.4                 HDF5Array_1.32.0           
##  [67] withr_3.0.0                 formatR_1.14               
##  [69] BiocManager_1.30.22         fastmap_1.1.1              
##  [71] rhdf5filters_1.16.0         fansi_1.0.6                
##  [73] caTools_1.18.2              digest_0.6.35              
##  [75] R6_2.5.1                    colorspace_2.1-0           
##  [77] Cairo_1.6-2                 GO.db_3.19.1               
##  [79] gtools_3.9.5                poweRlaw_0.80.0            
##  [81] jpeg_0.1-10                 biomaRt_2.60.0             
##  [83] RSQLite_2.3.6               R.methodsS3_1.8.2          
##  [85] utf8_1.2.4                  tidyr_1.3.1                
##  [87] generics_0.1.3              data.table_1.15.4          
##  [89] prettyunits_1.2.0           InteractionSet_1.32.0      
##  [91] httr_1.4.7                  htmlwidgets_1.6.4          
##  [93] S4Arrays_1.4.0              TFBSTools_1.42.0           
##  [95] regioneR_1.36.0             pkgconfig_2.0.3            
##  [97] gtable_0.3.5                blob_1.2.4                 
##  [99] htmltools_0.5.8.1           bookdown_0.39              
## [101] RBGL_1.80.0                 ProtGenerics_1.36.0        
## [103] scales_1.3.0                Biobase_2.64.0             
## [105] png_0.1-8                   knitr_1.46                 
## [107] lambda.r_1.2.4              tzdb_0.4.0                 
## [109] reshape2_1.4.4              rjson_0.2.21               
## [111] curl_5.2.1                  cachem_1.0.8               
## [113] rhdf5_2.48.0                stringr_1.5.1              
## [115] BiocVersion_3.19.1          KernSmooth_2.23-22         
## [117] parallel_4.4.0              AnnotationDbi_1.66.0       
## [119] restfulr_0.0.15             pillar_1.9.0               
## [121] grid_4.4.0                  vctrs_0.6.5                
## [123] randomForest_4.7-1.1        dbplyr_2.5.0               
## [125] xtable_1.8-4                evaluate_0.23              
## [127] magick_2.8.3                tinytex_0.50               
## [129] readr_2.1.5                 VennDiagram_1.7.3          
## [131] GenomicFeatures_1.56.0      cli_3.6.2                  
## [133] locfit_1.5-9.9              compiler_4.4.0             
## [135] futile.options_1.0.1        rlang_1.1.3                
## [137] crayon_1.5.2                labeling_0.4.3             
## [139] plyr_1.8.9                  stringi_1.8.3              
## [141] BiocParallel_1.38.0         munsell_0.5.1              
## [143] lazyeval_0.2.2              Matrix_1.7-0               
## [145] hms_1.1.3                   bit64_4.0.5                
## [147] ggplot2_3.5.1               Rhdf5lib_1.26.0            
## [149] KEGGREST_1.44.0             statmod_1.5.0              
## [151] highr_0.10                  SummarizedExperiment_1.34.0
## [153] AnnotationHub_3.12.0        GenomicScores_2.16.0       
## [155] memoise_2.0.1               bslib_0.7.0                
## [157] bit_4.0.5                   polynom_1.4-1

References

1. Bailey, T. L., Williams, N., Misleh, C. & Li, W. W. MEME: Discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Research 34, W369–W373 (2006).

2. Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular Cell 38, 576–589 (2010).

3. Chen, E. Y. et al. Enrichr: Interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 14, 128 (2013).

4. Lachmann, A. et al. ChEA: Transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics 26, 2438–2444 (2010).

5. Youn, A., Marquez, E. J., Lawlor, N., Stitzel, M. L. & Ucar, D. BiFET: Sequencing bias-free transcription factor footprint enrichment test. Nucleic Acids Research 47, e11 (2019).

6. Berest, I. et al. Quantification of differential transcription factor activity and multiomics-based classification into activators and repressors: DiffTF. Cell Reports 29, 3147–3159 (2019).

7. Rubin, J. D. et al. Transcription factor enrichment analysis (TFEA): Quantifying the activity of hundreds of transcription factors from a single experiment. Communications Biology 4, 661 (2021).

8. Subramanian, A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102, 15545–15550 (2005).

9. Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: Scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).

10. Almeida, B. P. de, Reiter, F., Pagani, M. & Stark, A. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of enhancers. bioRxiv (2021).

11. Ou, J. et al. ATACseqQC: A Bioconductor package for post-alignment quality assessment of ATAC-seq data. BMC Genomics 19, 169 (2018).