1 Introduction

Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) is an alternative or complementary technique to MNase-seq, DNase-seq, and FAIRE-seq for chromatin accessibility analysis. The results obtained from ATAC-seq are similar to those from DNase-seq and FAIRE-seq. ATAC-seq is gaining popularity because, compared to the other techniques, it does not require cross-linking, has a higher signal-to-noise ratio, requires much less biological material, and is faster and easier to perform¹.

To help researchers quickly assess the quality of ATAC-seq data, we have developed the ATACseqQC package, which makes diagnostic plots following the published guidelines¹. In addition, it has functions to preprocess ATAC-seq data for subsequent peak calling.

2 Quick start

Here is an example using ATACseqQC with a subset of published ATAC-seq data¹. Currently, BAM is the only supported input file format.

First install ATACseqQC and the other packages required to run the examples. Note that the example dataset used here is from human. To analyze a dataset from a different species or a different assembly, install the corresponding BSgenome, TxDb and phastCons packages. For example, to analyze mouse data aligned to mm10, install BSgenome.Mmusculus.UCSC.mm10, TxDb.Mmusculus.UCSC.mm10.knownGene and phastCons60way.UCSC.mm10. The phastCons60way.UCSC.mm10 package is optional and can be obtained by following the GenomicScores vignettes.

library(BiocInstaller)
biocLite(c("ATACseqQC", "ChIPpeakAnno", "MotifDb",
           "BSgenome.Hsapiens.UCSC.hg19", "TxDb.Hsapiens.UCSC.hg19.knownGene",
           "phastCons100way.UCSC.hg19"))

## load the library
library(ATACseqQC)
## input the bamFile from the ATACseqQC package 
bamfile <- system.file("extdata", "GL1.bam", 
                        package="ATACseqQC", mustWork=TRUE)
## strip the literal ".bam" suffix (escaped dot, anchored at the end)
bamfile.labels <- gsub("\\.bam$", "", basename(bamfile))

2.1 Fragment size distribution

First, a large proportion of fragments should be shorter than 100 bp, representing the nucleosome-free region. Second, the fragment size distribution should show a clear periodicity, evident in the inset figure, that is indicative of nucleosome occupancy (nucleosomes present in integer multiples).

## generate fragment size distribution
fragSize <- fragSizeDist(bamfile, bamfile.labels)
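
As a rough numeric companion to the plot, the expectation of a large short-fragment fraction can be checked directly on a vector of fragment lengths. The sketch below uses simulated lengths rather than the GL1.bam data; the helper `fraction_below` and the size-class boundaries are illustrative assumptions and are not part of ATACseqQC.

```r
## Simulated fragment lengths (made-up data, for illustration only):
## mostly short fragments plus some longer, nucleosome-sized fragments.
set.seed(42)
frag_len <- c(sample(30:99,   700, replace = TRUE),
              sample(180:247, 200, replace = TRUE),
              sample(315:473, 100, replace = TRUE))

## hypothetical helper: fraction of fragments shorter than a cutoff
fraction_below <- function(len, cutoff = 100) mean(len < cutoff)
short_fraction <- fraction_below(frag_len)

## tabulate fragments into coarse size classes (boundaries are assumptions)
size_class <- cut(frag_len, breaks = c(0, 99, 179, 247, Inf),
                  labels = c("<100", "100-179", "180-247", ">247"))
size_table <- table(size_class)
print(short_fraction)
print(size_table)
```

On real data, the lengths would come from the alignments themselves; a small short-fragment fraction would be a reason to look more closely at the library.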

2.2 Nucleosome positioning

2.2.1 Adjust the read start sites

Tn5 transposase has been shown to bind as a dimer and insert two adaptors into accessible DNA locations separated by 9 bp².

Therefore, for downstream analysis such as peak calling and footprinting, all reads in the input BAM file need to be shifted. The function shiftGAlignmentsList can be used to shift the reads. By default, all reads aligning to the positive strand are offset by +4 bp, and all reads aligning to the negative strand are offset by -5 bp¹.

The adjusted reads will be written into a new BAM file for peak calling or footprinting.

## bamfile tags to be read in
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
## files will be output into outPath
outPath <- "splited"
dir.create(outPath)
## shift the coordinates of the 5' ends of alignments in the bam file
library(BSgenome.Hsapiens.UCSC.hg19)
seqlev <- "chr1" ## subsample data for quick run
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")
gal <- readBamFile(bamfile, tag=tags, which=which, asMates=TRUE)
gal1 <- shiftGAlignmentsList(gal)
shiftedBamfile <- file.path(outPath, "shifted.bam")
export(gal1, shiftedBamfile)
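
The default +4/-5 offsets can be illustrated on plain coordinates. The following is a minimal sketch in base R, assuming the offset is applied to the 5' end of each read (the start for positive-strand reads, the end for negative-strand reads); `shift_five_prime` and the toy coordinates are hypothetical and do not reproduce the internals of shiftGAlignmentsList.

```r
## toy reads (made-up coordinates), one per strand
reads <- data.frame(start  = c(100L, 200L),
                    end    = c(150L, 250L),
                    strand = c("+", "-"),
                    stringsAsFactors = FALSE)

## hypothetical helper: move the 5' end by +4 bp (positive strand)
## or -5 bp (negative strand); the 3' end is left unchanged
shift_five_prime <- function(reads, positive = 4L, negative = 5L) {
  is_plus <- reads$strand == "+"
  reads$start[is_plus] <- reads$start[is_plus] + positive
  reads$end[!is_plus]  <- reads$end[!is_plus] - negative
  reads
}

shifted <- shift_five_prime(reads)
print(shifted)
```

With these offsets, the two 5' ends that originated from a single Tn5 insertion event (9 bp apart on opposite strands) land on the same position, since 4 + 5 = 9.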

2.2.2 Split reads

The shifted reads will be split into different bins, namely nucleosome-free, mononucleosome, dinucleosome, and trinucleosome. Shifted reads that do not fit into any of these bins will be discarded. Splitting reads is a time-consuming step because a random forest is used to classify the fragments based on fragment length, GC content and conservation scores³.

By default, we assign the top 10% of short reads (reads below 100 bp) as nucleosome-free and the top 10% of intermediate-length reads (reads between 180 and 247 bp) as mononucleosome. These serve as the training set to classify the rest of the fragments using a random forest. The number of trees is set to twice the square root of the training set size.

library(phastCons100way.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txs <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
## run program for chromosome 1 only
txs <- txs[seqnames(txs) %in% "chr1"]
genome <- Hsapiens
## split the reads into NucleosomeFree, mononucleosome, 
## dinucleosome and trinucleosome.
objs <- splitGAlignmentsByCut(gal1, txs=txs, genome=genome,
                              conservation=phastCons100way.UCSC.hg19)
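
To make the default sizing rule concrete, the arithmetic for the training set and the number of trees can be sketched in base R. This only counts fragments and does no classification; the simulated lengths, the helper `training_plan`, and the use of `ceiling` for rounding are assumptions, and how the "top" 10% is ranked is not modeled here.

```r
## simulated fragment lengths (made-up data, for illustration only)
set.seed(1)
frag_len <- c(sample(30:99,   600, replace = TRUE),
              sample(180:247, 300, replace = TRUE),
              sample(248:600, 100, replace = TRUE))

## hypothetical helper: size of the training set and number of trees
training_plan <- function(len, fraction = 0.1) {
  n_short <- sum(len < 100)                 # nucleosome-free candidates
  n_mono  <- sum(len >= 180 & len <= 247)   # mononucleosome candidates
  n_train <- ceiling(fraction * n_short) + ceiling(fraction * n_mono)
  list(n_train = n_train,
       n_tree  = ceiling(2 * sqrt(n_train)))  # rounding is an assumption
}

plan <- training_plan(frag_len)
print(plan)
```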

Save the binned alignments into BAM files.

null <- writeListOfGAlignments(objs, outPath)
## list the files generated by splitBam.
dir(outPath)

##  [1] "NucleosomeFree.bam"     "NucleosomeFree.bam.bai"
##  [3] "dinucleosome.bam"       "dinucleosome.bam.bai"  
##  [5] "inter1.bam"             "inter1.bam.bai"        
##  [7] "inter2.bam"             "inter2.bam.bai"        
##  [9] "inter3.bam"             "inter3.bam.bai"        
## [11] "mononucleosome.bam"     "mononucleosome.bam.bai"
## [13] "others.bam"             "others.bam.bai"        
## [15] "shifted.bam"            "shifted.bam.bai"       
## [17] "trinucleosome.bam"      "trinucleosome.bam.bai"

You can also perform shifting, splitting and saving in one step by calling splitBam.

objs <- splitBam(bamfile, tags=tags, outPath=outPath,
                 txs=txs, genome=genome,
                 conservation=phastCons100way.UCSC.hg19)

Conservation is an optional parameter. If you do not have conservation scores, or you would simply like to split the BAM file by fragment length, run the command without the conservation argument. Without the conservation parameter, the function runs much faster.

2.2.3 Heatmap and coverage curve for nucleosome positions

By averaging the signal across all active TSSs, we should observe that nucleosome-free fragments are enriched at the TSSs, whereas the nucleosome-bound fragments should be enriched both upstream and downstream of the active TSSs and display characteristic phasing of upstream and downstream nucleosomes. Because ATAC-seq reads are concentrated at regions of open chromatin, users should see a strong nucleosome signal at the +1 nucleosome, but the signal decreases at the +2, +3 and +4 nucleosomes.

library(ChIPpeakAnno)
bamfiles <- file.path(outPath,
                     c("NucleosomeFree.bam",
                     "mononucleosome.bam",
                     "dinucleosome.bam",
                     "trinucleosome.bam"))
## Plot the cumulative percentage of tag allocation in nucleosome-free 
## and mononucleosome bam files.
cumulativePercentage(bamfiles[1:2], as(seqinfo(Hsapiens)["chr1"], "GRanges"))

TSS <- promoters(txs, upstream=0, downstream=1)
TSS <- unique(TSS)
## estimate the library size for normalization
(librarySize <- estLibSize(bamfiles))

## splited/NucleosomeFree.bam splited/mononucleosome.bam 
##                      33374                       2142 
##   splited/dinucleosome.bam  splited/trinucleosome.bam 
##                       2041                        454

## calculate the signals around TSSs.
NTILE <- 101
dws <- ups <- 1010
sigs <- enrichedFragments(gal=objs[c("NucleosomeFree", 
                                     "mononucleosome",
                                     "dinucleosome",
                                     "trinucleosome")], 
                          TSS=TSS,
                          librarySize=librarySize,
                          seqlev=seqlev,
                          TSS.filter=0.5,
                          n.tile = NTILE,
                          upstream = ups,
                          downstream = dws)
## log2 transformed signals
sigs.log2 <- lapply(sigs, function(.ele) log2(.ele+1))
## plot heatmap
featureAlignedHeatmap(sigs.log2, reCenterPeaks(TSS, width=ups+dws),
                      zeroAt=.5, n.tile=NTILE)

## get signals normalized for nucleosome-free and nucleosome-bound regions.
out <- featureAlignedDistribution(sigs, 
                                  reCenterPeaks(TSS, width=ups+dws),
                                  zeroAt=.5, n.tile=NTILE, type="l", 
                                  ylab="Averaged coverage")

## rescale the nucleosome-free and nucleosome signals to 0~1
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
out <- apply(out, 2, range01)
matplot(out, type="l", xaxt="n", 
        xlab="Position (bp)", 
        ylab="Fraction of signal")
axis(1, at=seq(0, 100, by=10)+1, 
     labels=c("-1K", seq(-800, 800, by=200), "1K"), las=2)
abline(v=seq(0, 100, by=10)+1, lty=2, col="gray")
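
The axis labels in the last plot follow from how tile indices map to genomic offsets. As a minimal sketch, assuming the window of width ups + dws is cut into NTILE equal-width tiles centered on the TSS (the helper `tile_center` is hypothetical and not part of ATACseqQC):

```r
n_tile <- 101
ups <- 1010
dws <- 1010

## width of one tile in bp
tile_width <- (ups + dws) / n_tile

## hypothetical helper: offset (bp) of the center of tile i from the TSS
tile_center <- function(i) -ups + (i - 0.5) * tile_width

## tick positions used in the axis() call above
ticks <- seq(0, 100, by = 10) + 1
tick_bp <- tile_center(ticks)
print(tile_width)
print(tick_bp)
```

Under this assumption each tile spans 20 bp and the middle tile (index 51) is centered on the TSS, which is why an odd number of tiles and a window of 2020 bp give labels running from -1K to 1K in steps of 200 bp.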

2.3 Plot footprints

ATAC-seq footprints infer factor occupancy genome-wide. The factorFootprints function uses matchPWM to predict binding sites from the input position weight matrix (PWM). It then calculates and plots the accumulated coverage over those binding sites to show the occupancy status genome-wide. Unlike CENTIPEDE⁴, the footprints generated here do not take conservation (PhyloP) into account. The factorFootprints function can also accept the binding sites as a GRanges object.
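
The idea behind predicting binding sites from a matrix can be sketched in base R: slide a window along the sequence, sum the matrix entries for the observed bases, and keep windows whose score reaches a threshold. This is a simplified, forward-strand-only illustration with a made-up three-column matrix, not the Biostrings implementation; it assumes that a percentage threshold such as "90%" is taken relative to the maximum attainable score, and `score_windows` is a hypothetical helper.

```r
## toy matrix (rows A, C, G, T; made-up values) with consensus "AGC"
pwm <- matrix(c(0.7, 0.1, 0.1, 0.1,
                0.1, 0.1, 0.7, 0.1,
                0.1, 0.7, 0.1, 0.1),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

## hypothetical helper: score every window of width ncol(pwm)
score_windows <- function(dna, pwm) {
  bases  <- strsplit(dna, "")[[1]]
  w      <- ncol(pwm)
  starts <- seq_len(length(bases) - w + 1)
  vapply(starts, function(s) {
    rows <- match(bases[s:(s + w - 1)], rownames(pwm))
    sum(pwm[cbind(rows, seq_len(w))])
  }, numeric(1))
}

scores    <- score_windows("TTAGCAGCTT", pwm)
max_score <- sum(apply(pwm, 2, max))   # best attainable score
threshold <- 0.9 * max_score           # analogue of min.score = "90%"
hits      <- which(scores >= threshold)
print(hits)
```

In this toy sequence the two occurrences of "AGC" are the only windows that reach the threshold.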

## footprints
library(MotifDb)
CTCF <- query(MotifDb, c("CTCF"))
CTCF <- as.list(CTCF)
print(CTCF[[1]], digits=2)

##      1    2    3    4    5    6     7     8     9      10    11      12
## A 0.17 0.23 0.29 0.10 0.33 0.06 0.052 0.037 0.023 0.00099 0.245 0.00099
## C 0.42 0.28 0.30 0.32 0.11 0.33 0.562 0.005 0.960 0.99702 0.670 0.68901
## G 0.25 0.23 0.26 0.27 0.42 0.55 0.052 0.827 0.013 0.00099 0.027 0.00099
## T 0.16 0.27 0.15 0.31 0.14 0.06 0.334 0.131 0.004 0.00099 0.058 0.30900
##        13    14    15    16    17      18    19   20
## A 0.00099 0.050 0.253 0.004 0.172 0.00099 0.019 0.19
## C 0.99702 0.043 0.073 0.418 0.150 0.00099 0.063 0.43
## G 0.00099 0.017 0.525 0.546 0.055 0.99702 0.865 0.15
## T 0.00099 0.890 0.149 0.032 0.623 0.00099 0.053 0.23

sigs <- factorFootprints(shiftedBamfile, pfm=CTCF[[1]], 
                         genome=genome,
                         min.score="90%", seqlev=seqlev,
                         upstream=100, downstream=100)

featureAlignedHeatmap(sigs$signal, 
                      feature.gr=reCenterPeaks(sigs$bindingSites,
                                               width=200+width(sigs$bindingSites[1])), 
                      annoMcols="score",
                      sortBy="score",
                      n.tile=ncol(sigs$signal[[1]]))

sigs$spearman.correlation

## $`+`
## 
##  Spearman's rank correlation rho
## 
## data:  predictedBindingSiteScore and highest.sig.windows
## S = 6179300, p-value = 1.241e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2499378 
## 
## 
## $`-`
## 
##  Spearman's rank correlation rho
## 
## data:  predictedBindingSiteScore and highest.sig.windows
## S = 6521300, p-value = 5.731e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2084312
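
The printed results above have the form produced by base R's cor.test with method = "spearman". As a minimal sketch of what such a test reports, assuming two made-up numeric vectors standing in for predicted binding-site scores and footprint signal (the values and variable names are illustrative and not taken from the package):

```r
## made-up values for illustration only
motif_score   <- c(1, 2, 3, 4, 5, 6)
window_signal <- c(2, 1, 4, 3, 6, 5)

## rank-based correlation between the two vectors
res <- cor.test(motif_score, window_signal, method = "spearman")
rho     <- unname(res$estimate)
p_value <- res$p.value
print(res)
```

A positive rho means that sites with higher motif scores tend to have stronger signal; the reported correlations of roughly 0.21 to 0.25 therefore indicate a modest positive association on both strands.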

Here are the CTCF footprints for the full dataset.

3 Session Info

sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] MotifDb_1.20.0                         
##  [2] phastCons100way.UCSC.hg19_3.6.0        
##  [3] GenomicScores_1.2.2                    
##  [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [5] GenomicFeatures_1.30.3                 
##  [6] AnnotationDbi_1.40.0                   
##  [7] Biobase_2.38.0                         
##  [8] BSgenome.Hsapiens.UCSC.hg19_1.4.0      
##  [9] BSgenome_1.46.0                        
## [10] rtracklayer_1.38.3                     
## [11] ChIPpeakAnno_3.12.7                    
## [12] VennDiagram_1.6.19                     
## [13] futile.logger_1.4.3                    
## [14] GenomicRanges_1.30.3                   
## [15] GenomeInfoDb_1.14.0                    
## [16] Biostrings_2.46.0                      
## [17] XVector_0.18.0                         
## [18] IRanges_2.12.0                         
## [19] ATACseqQC_1.2.10                       
## [20] S4Vectors_0.16.0                       
## [21] BiocGenerics_0.24.0                    
## [22] BiocStyle_2.6.1                        
## 
## loaded via a namespace (and not attached):
##  [1] ProtGenerics_1.10.0           bitops_1.0-6                 
##  [3] matrixStats_0.53.1            bit64_0.9-7                  
##  [5] progress_1.1.2                httr_1.3.1                   
##  [7] rprojroot_1.3-2               tools_3.4.3                  
##  [9] backports_1.1.2               rGADEM_2.26.0                
## [11] R6_2.2.2                      splitstackshape_1.4.2        
## [13] seqLogo_1.44.0                colorspace_1.3-2             
## [15] DBI_0.8                       lazyeval_0.2.1               
## [17] ade4_1.7-10                   motifStack_1.22.2            
## [19] prettyunits_1.0.2             RMySQL_0.10.14               
## [21] bit_1.1-12                    curl_3.1                     
## [23] compiler_3.4.3                graph_1.56.0                 
## [25] grImport_0.9-0                DelayedArray_0.4.1           
## [27] bookdown_0.7                  scales_0.5.0                 
## [29] randomForest_4.6-12           RBGL_1.54.0                  
## [31] stringr_1.3.0                 digest_0.6.15                
## [33] Rsamtools_1.30.0              rmarkdown_1.9                
## [35] pkgconfig_2.0.1               htmltools_0.3.6              
## [37] ensembldb_2.2.2               limma_3.34.9                 
## [39] regioneR_1.10.0               htmlwidgets_1.0              
## [41] rlang_0.2.0                   RSQLite_2.0                  
## [43] BiocInstaller_1.28.0          shiny_1.0.5                  
## [45] BiocParallel_1.12.0           RCurl_1.95-4.10              
## [47] magrittr_1.5                  GO.db_3.5.0                  
## [49] GenomeInfoDbData_1.0.0        Matrix_1.2-12                
## [51] munsell_0.4.3                 Rcpp_0.12.15                 
## [53] stringi_1.1.6                 yaml_2.1.17                  
## [55] MASS_7.3-49                   SummarizedExperiment_1.8.1   
## [57] zlibbioc_1.24.0               plyr_1.8.4                   
## [59] AnnotationHub_2.10.1          blob_1.1.0                   
## [61] lattice_0.20-35               splines_3.4.3                
## [63] multtest_2.34.0               knitr_1.20                   
## [65] pillar_1.2.1                  MotIV_1.34.0                 
## [67] seqinr_3.4-5                  biomaRt_2.34.2               
## [69] futile.options_1.0.0          XML_3.98-1.10                
## [71] evaluate_0.10.1               data.table_1.10.4-3          
## [73] lambda.r_1.2                  idr_1.2                      
## [75] httpuv_1.3.6.2                assertthat_0.2.0             
## [77] xfun_0.1                      mime_0.5                     
## [79] xtable_1.8-2                  AnnotationFilter_1.2.0       
## [81] survival_2.41-3               tibble_1.4.2                 
## [83] GenomicAlignments_1.14.1      memoise_1.1.0                
## [85] interactiveDisplayBase_1.16.0

References

1. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature methods 10, 1213–1218 (2013).

2. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome biology 11, R119 (2010).

3. Chen, K. et al. DANPOS: Dynamic analysis of nucleosome position and occupancy by sequencing. Genome research 23, 341–351 (2013).

4. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome research 21, 447–455 (2011).

ATACseqQC Guide

7 March 2018
