To install the latest version of esATAC, you need the latest version of R. esATAC has been part of the Bioconductor project since Bioc 3.6, which is built on R 3.4.2. Please check your current Bioconductor and R versions first. As with other Bioconductor packages, you can download and install esATAC and all of its dependencies like this:
source("http://www.bioconductor.org/biocLite.R")
biocLite("esATAC")
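The version check mentioned above can be sketched in base R. The 3.4.2 threshold comes from the text; the helper name `meets_min_r` is hypothetical, and the Bioconductor query assumes that the BiocVersion package (used by recent Bioconductor releases) is installed, so it is skipped otherwise.

```r
# Hypothetical helper: TRUE if the running R is at least `min_version`.
meets_min_r <- function(min_version = "3.4.2", current = getRversion()) {
  current >= package_version(min_version)
}

print(R.version.string)  # full version string of the running R
print(meets_min_r())     # TRUE when the running R is >= 3.4.2

# Report the Bioconductor release only if BiocVersion is installed
# (assumption: its major.minor version equals the Bioconductor release).
if (requireNamespace("BiocVersion", quietly = TRUE)) {
  print(utils::packageVersion("BiocVersion")[, 1:2])
}
```

If `meets_min_r()` returns `FALSE`, upgrading R before installing esATAC avoids ending up with an older package release.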
NOTE: We recommend using this package in RStudio. If you use the R terminal instead, you have to install pandoc yourself.
As with other R packages, you need to load esATAC each time before using it:
library(esATAC)
Most of the test datasets in this package (esATAC/extdata/) are generated from GEO: SRR891271 (from GSE47753)[7]. The data are ATAC-seq paired-end sequencing reads for the GM12878 cell line. We randomly sampled 20000 mapped fragments from chr20 and rebuilt raw paired-end FASTQ files (file names with the chr20 prefix). We also subsampled the reads in the SAM file and the peak calling BED result files. In addition, the files in “uzmg” and “bt2” are the test files from AdapterRemoval and Bowtie2. For details, see the subsequent sections.
esATAC provides an easy-to-use entry point: you only need to provide your ATAC-seq sequencing files (FASTQ format) and specify the species and genome assembly, and it will do everything for you.
The R scripts below are ready to run; no further editing is needed.
Customize the code marked with “MODIFY” if you want to run the pipeline on your own data.
For your own data, prepare the following:
fastqInput1
: mate 1 FASTQ file(s), given once for the case sample and once for the control sample
fastqInput2
: mate 2 FASTQ file(s), given once for the case sample and once for the control sample
genome
: genome version, for example hg19, hg38, mm10 or mm9
refdir
: (optional) Directory for installing the genome reference and storing it for reuse. Default: “./esATAC_pipeline/refdir”, which will be created if it does not exist.
tmpdir
: (optional) Directory for storing intermediate files and results. Default: “./esATAC_pipeline/esATAC_result”, which will be created if it does not exist.
threads
: (optional) The maximum number of threads allowed to be created. Default: 2.
For other genomes, you can currently build your own pipeline through Customized Pipeline. More genomes will be supported in the future.
Here we show a simple runnable example of case-control analysis. The case and control test samples are both paired-end data, and each consists of two compressed FASTQ files. The test file paths can be obtained like this: system.file(package="esATAC", "extdata", "chr20_1.1.fq.gz"). They are under the folder “esATAC/extdata/”.
library(esATAC)
# call pipeline
# all human motif in JASPAR will be processed
conclusion <-
atacPipe2(
# MODIFY: Change these paths to your own case files!
# e.g. fastqInput1 = "your/own/data/path.fastq"
case=list(fastqInput1 = system.file(package="esATAC", "extdata", "chr20_1.1.fq.gz"),
fastqInput2 = system.file(package="esATAC", "extdata", "chr20_2.1.fq.gz")),
# MODIFY: Change these paths to your own control files!
# e.g. fastqInput1 = "your/own/data/path.fastq"
control=list(fastqInput1 = system.file(package="esATAC", "extdata", "chr20_1.2.fq.bz2"),
fastqInput2 = system.file(package="esATAC", "extdata", "chr20_2.2.fq.bz2")),
# MODIFY: Change this path to a permanent path for future reuse!
# refdir = "./esATAC_pipeline/refdir",
# tmpdir = "./esATAC_pipeline/esATAC_result",
# MODIFY: Set the genome for your data
genome = "hg19")
Note: By default, esATAC builds the bowtie2 index if none is found under the path set by the argument refdir, which may take several hours. If you want to download the bowtie2 index instead, please see section 2.2.
Note: By default, esATAC performs footprint analysis for all of the TF motif PWM matrices in the JASPAR database. For human genome analysis, this step may take from a few hours to 2 days, depending on your hardware. If you only want to analyze a specific motif or your own PWMs, please see section 2.3.
The reference data will be installed in “./esATAC_pipeline/refdir/”, and all temporary data and results will be stored in “./esATAC_pipeline/esATAC_result”.
If you run the scripts above without modification, you will obtain the default HTML example report.
If you download the raw data (GSM2356780: SRR4435490.sra) and (GSM2356795: SRR4435505.sra) from GEO and modify the scripts above like this example script, you will obtain the HTML report for the GSM2356780 and GSM2356795 data.
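Because a full run can take hours, it may be worth confirming that every input path resolves before calling the pipeline; `system.file()` returns an empty string when a file is not found. The following is a minimal base R sketch, not part of esATAC: the helper name `check_fastq_inputs` and the demonstration file names are hypothetical.

```r
# Hypothetical helper: stop early if any FASTQ path is empty or missing.
check_fastq_inputs <- function(...) {
  paths <- unlist(list(...), use.names = FALSE)
  bad <- paths[!nzchar(paths) | !file.exists(paths)]
  if (length(bad) > 0) {
    stop("Missing input file(s): ", paste(shQuote(bad), collapse = ", "))
  }
  invisible(normalizePath(paths))
}

# Demonstration with empty throwaway files standing in for real FASTQ data.
demo_dir <- tempfile("fastq_demo")
dir.create(demo_dir)
mate1 <- file.path(demo_dir, "sample_1.fq.gz")
mate2 <- file.path(demo_dir, "sample_2.fq.gz")
file.create(mate1, mate2)

# Accepts plain paths or lists of paths, such as the case/control lists.
ok <- check_fastq_inputs(case = list(mate1, mate2))
print(basename(ok))
```

The same call could be made with the `case` and `control` lists from the example above before they are handed to the pipeline.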
The R scripts below are ready to run; no further editing is needed.
Customize the code marked with “MODIFY” if you want to run the pipeline on your own data.
For your own data, prepare the following:
fastqInput1
: mate 1 FASTQ file(s)
fastqInput2
: mate 2 FASTQ file(s)
genome
: genome version, for example hg19, hg38, mm10 or mm9
refdir
: (optional) Directory for installing the genome reference and storing it for reuse. Default: “./esATAC_pipeline/refdir”, which will be created if it does not exist.
tmpdir
: (optional) Directory for storing intermediate files and results. Default: “./esATAC_pipeline/esATAC_result”, which will be created if it does not exist.
threads
: (optional) The maximum number of threads allowed to be created. Default: 2.
For other genomes, you can currently build your own pipeline through Customized Pipeline. More genomes will be supported in the future.
Here we show a simple runnable example of single-sample analysis. We reuse the case data from the case-control section.
library(esATAC)
# call pipeline
# for overall example(all human motif in JASPAR will be processed)
conclusion <-
atacPipe(
# MODIFY: Change these paths to your own case files!
# e.g. fastqInput1 = "your/own/data/path.fastq"
fastqInput1 = system.file(package="esATAC", "extdata", "chr20_1.1.fq.gz"),
fastqInput2 = system.file(package="esATAC", "extdata", "chr20_2.1.fq.gz"),
# MODIFY: Change this path to a permanent path for future reuse!
# refdir = "./esATAC_pipeline/refdir",
# tmpdir = "./esATAC_pipeline/esATAC_result",
# MODIFY: Set the genome for your data
genome = "hg19")
Note: By default, esATAC builds the bowtie2 index if none is found under the path set by the argument refdir, which may take several hours. If you want to download the bowtie2 index instead, please see section 2.2.
Note: By default, esATAC performs footprint analysis for all of the TF motif PWM matrices in the JASPAR database. For human genome analysis, this step may take from a few hours to 2 days, depending on your hardware. If you only want to analyze a specific motif or your own PWMs, please see section 2.3.
The reference data will be installed in “./esATAC_pipeline/refdir/”, and all temporary data and results will be stored in “./esATAC_pipeline/esATAC_result”.
If you run the scripts above without modification, you will obtain the default HTML example report.
If you download the raw data (GSM1155957 (SRR891268.sra)) from GEO and modify the scripts above like this example script, you will obtain the HTML report for the GSM1155957 data.
esATAC will automatically download the genome sequence and annotation files, build the bowtie2 index, map the reads, perform quality control analysis, find peak regions, perform GO analysis and motif enrichment analysis, and so on. Finally, you will get a report file in HTML format that includes the analysis results.
Building the bowtie2 index may take some time. If you already have bowtie2 index files, or you want to download them instead of building them, you can let esATAC skip this step by renaming the files to follow the format (genome+suffix) and putting them in the reference installation path (refdir).
Example: hg19 bowtie2 index files
bowtie2 index download path:
ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes
After modifying “refdir” in Starting from Scratch, you can run the example code.
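To see whether a downloaded index is complete before starting the pipeline, a check along the following lines may help. It is a base R sketch under two assumptions: that (genome+suffix) means file names such as hg19.1.bt2, and that the index is a standard small bowtie2 index with six files (large indexes use the .bt2l extension instead). The helper name `missing_bt2_index` is hypothetical.

```r
# The six files of a standard (small) bowtie2 index share these suffixes.
bt2_suffixes <- c(".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2",
                  ".rev.1.bt2", ".rev.2.bt2")

# Hypothetical helper: expected index file names that are absent from refdir.
missing_bt2_index <- function(refdir, genome) {
  expected <- paste0(genome, bt2_suffixes)
  expected[!file.exists(file.path(refdir, expected))]
}

# Demonstration in a throwaway directory holding a partially copied index.
refdir <- tempfile("refdir")
dir.create(refdir)
file.create(file.path(refdir, paste0("hg19", bt2_suffixes[1:4])))

missing <- missing_bt2_index(refdir, "hg19")
print(missing)  # the two reverse-index files were never copied
```

An empty result would mean that all six expected files are present under the genome-prefixed names.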
By default, esATAC performs footprint analysis for all of the TF motif PWM matrices in the JASPAR database. For human genome analysis, this step may take from a few hours to 2 days, depending on your hardware. If you only want to analyze a specific motif or your own PWMs, you can simply do it like this:
### case-control
library(esATAC)
conclusion2 <-
atacPipe2(
case=list(fastqInput1 = system.file(package="esATAC", "extdata", "chr20_1.1.fq.gz"),
fastqInput2 = system.file(package="esATAC", "extdata", "chr20_2.1.fq.gz")),
control=list(fastqInput1 = system.file(package="esATAC", "extdata", "chr20_1.2.fq.bz2"),
fastqInput2 = system.file(package="esATAC", "extdata", "chr20_2.2.fq.bz2")),
genome = "hg19",
motifPWM = getMotifPWM(motif.file = system.file("extdata", "CTCF.txt", package="esATAC"), is.PWM = FALSE))
#===============================================================================
### single sample
library(esATAC)
conclusion <-
atacPipe(
fastqInput1 = system.file(package="esATAC", "extdata", "chr20_1.1.fq.gz"),
fastqInput2 = system.file(package="esATAC", "extdata", "chr20_2.1.fq.gz"),
genome = "hg19",
motifPWM = getMotifPWM(motif.file = system.file("extdata", "CTCF.txt", package="esATAC"), is.PWM = FALSE))
Running the whole pipeline on a typical ATAC-seq data set from a human sample may take about 2 days on a personal computer with a single thread. If your thread or R session is stopped during the process, you can simply resume the analysis by re-entering your command from Starting from Scratch. The program will automatically detect the steps that have already finished and continue the analysis.
The esATAC package provides an end-to-end pipeline, organized as a dataflow graph, for quantifying and annotating ATAC-seq and DNase-seq reads in R. It integrates the functionality of several R packages (such as Rsamtools and ChIPpeakAnno) and external software (e.g. AdapterRemoval[1] and bowtie2[2] through the Rbowtie2 package, and Fseq[3]). Users can process raw FASTQ files through a preset pipeline or customize their own workflow, starting from any intermediate stage, easily and flexibly in a single R script. This makes it convenient to migrate, share and reproduce all details, such as parameter settings and intermediate results. In addition, the preset pipelines create a quality control report in HTML that can be viewed in a web browser.
esATAC can be easily installed on various operating systems (Windows, Linux, Mac OS). No function in the package consumes more than 16 GB of memory, and most consume less than 8 GB, so the package is suitable not only for servers but also for most PCs.
esATAC supports both single-end and paired-end ATAC-seq data generated on the Illumina sequencing platform. It can directly process raw datasets (FASTQ files) from GEO. Intermediate result files in other standard formats (FASTQ, SAM, BAM, BED) generated by other programs (such as BAM and BED files from ENCODE) have also been tested with rebuilt sub-pipelines.
If you do not know where to start with ATAC-seq or DNase-seq data, you can print flowchart like this:
library(esATAC)
printMap()
Following the flowchart, you can find the related functions in the manual. For example, if you want to look up the functions related to “SamToBed” in the flowchart, you can query “SamToBed” directly like this:
?SamToBed
If you know the exact function name, you can add the “atac” prefix to query the manual like this:
?atacSamToBed
or write the name with a lowercase initial letter:
?samToBed
The workflow starts with the “UnzipAndMerge” function atacUnzipAndMerge. It unzips the replicates and merges them into one FASTQ file (two for paired-end data). Read names are then replaced with numbers (1, 2, 3, …) by the “Renamer” function atacRenamer, which makes the files smaller for further analysis. Adapters in the reads can be found and removed by the “RemoveAdapter” function atacRemoveAdapter. The reads are then ready for mapping to the reference genome, which is done by the “Bowtie2Mapping” function atacBowtie2Mapping. “SamToBam”, “Rsortbam”, “BamToBed”, “SamToBed” and “BedUtils” provide general processing methods for SAM files, including conversion to BAM or BED format, sorting by chromosome, start site and end site, conditional filtering of reads, read shifting and so on. Peaks can then be called on the ready-to-use reads in the BED file by the “PeakCallingFseq” function atacPeakCallingFseq.
For preset pipeline (see Quick Start), several summary tables will be shown in an HTML file(Example report) like this:
Item | Case | Control | Reference |
---|---|---|---|
Sequence Files Type | paired end (PE) | paired end (PE) | SE / PE |
Original total reads | 54.1M | 56.5M | |
– Reads after adapter removing (ratio) | 54.1M (100.00%) | 56.5M (100.00%) | >99% |
– – Total mapped reads (ratio of original reads) | 52.7M (97.53%) | 55.1M (97.63%) | >95% |
– – – Unique locations mapped uniquely by reads | 27.3M | 25.6M | |
– – – Non-Redundant Fraction (NRF) | 0.73 | 0.7 | >0.7 |
– – – Locations with only 1 reads mapping uniquely | 24.1M | 23.1M | |
– – – Locations with only 2 reads mapping uniquely | 2.5M | 1.9M | |
– – – PCR Bottlenecking Coefficients 1 (PBC1) | 0.88 | 0.9 | >0.7 |
– – – PCR Bottlenecking Coefficients 2 (PBC2) | 9.71 | 12.28 | >3 |
– – – Non-mitochondrial reads (ratio) | 37.4M (70.87%) | 36.6M (66.31%) | >70% |
– – – – Unique mapped reads (ratio) | 29.1M (55.26%) | 26.6M (48.21%) | |
– – – – – Duplicate removed reads (final for use) | 25.9M (49.13%) | 24.2M (43.97%) | >25M |
– – – – – – Nucleosome free reads (<100bp) | 10.5M (40.47%) | 8.2M (33.92%) | |
– – – – – – – Total peaks | 118157 | 116686 | |
– – – – – – – Peaks overlapped with union DHS ratio | 76.00% | 79.00% | |
– – – – – – – Peaks overlapped with blacklist ratio | 0.10% | 0.10% | |
– – – – – – Fraction of reads in peaks (FRiP) | 52.00% | 66.50% | |
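The PCR bottlenecking coefficients in the table can be reproduced approximately from the location counts. The sketch below uses the commonly cited ENCODE definitions (PBC1 = locations with exactly one read / distinct locations; PBC2 = locations with exactly one read / locations with exactly two reads); that esATAC computes them in exactly this way is an assumption, and the variable names are illustrative.

```r
# Case-sample counts copied from the table above (in millions).
distinct_locations <- 27.3  # unique locations mapped uniquely by reads
one_read_locations <- 24.1  # locations with only 1 read mapping uniquely
two_read_locations <- 2.5   # locations with only 2 reads mapping uniquely

pbc1 <- one_read_locations / distinct_locations
pbc2 <- one_read_locations / two_read_locations

# Compare against the reference thresholds listed in the table.
passes <- c(PBC1 = pbc1 > 0.7, PBC2 = pbc2 > 3)

print(round(c(PBC1 = pbc1, PBC2 = pbc2), 2))
print(passes)
```

With these rounded inputs PBC1 comes out at 0.88, matching the table, while PBC2 comes out at 9.64 rather than the reported 9.71; the gap is consistent with the table counts having been rounded to one decimal place.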
The pipeline also provides quality control elements (e.g. “FragLenDistr”, “FastQC”) and some general genome function analysis elements (e.g. “RMotifScan”, “RPeakAnno”). For more details, see the manual or the examples in the following sections.
Fragments Length Distribution Example
Fourier Transformation Analysis of Distribution
This package is developed and maintained by members of
Ministry of Education Key Laboratory of Bioinformatics,
Center for Synthetic and Systems Biology,
Department of Automation,
Tsinghua University, Beijing, 100084, China
email:{wei-z14,w-zhang16}(at)mails.tsinghua.edu.cn
All sub-processes can be recombined into a new whole pipeline or sub-pipeline easily and flexibly, and they can also be called individually. We show only some of the functions and their combinations here. For details, see the manual.
As with other R packages, you need to load esATAC each time before using it:
library(esATAC)
If you need to use fseq, we recommend setting the maximum memory size for Java (8G, written as 8000M in the example). Otherwise, rJava will use its default setting for fseq.
options(java.parameters = "-Xmx8000m")
Some functions require the BSgenome package, the TxDb known gene package and the OrgDb annotation package. We recommend installing (using biocLite("packageName")) and loading the packages for your species before using esATAC.
library(magrittr)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(R.utils)
Packages for specific genomes are shown below:
genome | BSgenome | TxDb | OrgDb |
---|---|---|---|
hg19 | BSgenome.Hsapiens.UCSC.hg19 | TxDb.Hsapiens.UCSC.hg19.knownGene | org.Hs.eg.db |
hg38 | BSgenome.Hsapiens.UCSC.hg38 | TxDb.Hsapiens.UCSC.hg38.knownGene | org.Hs.eg.db |
mm9 | BSgenome.Mmusculus.UCSC.mm9 | TxDb.Mmusculus.UCSC.mm9.knownGene | org.Mm.eg.db |
mm10 | BSgenome.Mmusculus.UCSC.mm10 | TxDb.Mmusculus.UCSC.mm10.knownGene | org.Mm.eg.db |
danRer10 | BSgenome.Drerio.UCSC.danRer10 | TxDb.Drerio.UCSC.danRer10.refGene | org.Dr.eg.db |
galGal5 | BSgenome.Ggallus.UCSC.galGal5 | TxDb.Ggallus.UCSC.galGal5.refGene | org.Gg.eg.db |
galGal4 | BSgenome.Ggallus.UCSC.galGal4 | TxDb.Ggallus.UCSC.galGal4.refGene | org.Gg.eg.db |
rheMac3 | BSgenome.Mmulatta.UCSC.rheMac3 | TxDb.Mmulatta.UCSC.rheMac3.refGene | org.Mmu.eg.db |
rheMac8 | BSgenome.Mmulatta.UCSC.rheMac8 | TxDb.Mmulatta.UCSC.rheMac8.refGene | org.Mmu.eg.db |
rn6 | BSgenome.Rnorvegicus.UCSC.rn6 | TxDb.Rnorvegicus.UCSC.rn6.refGene | org.Rn.eg.db |
rn5 | BSgenome.Rnorvegicus.UCSC.rn5 | TxDb.Rnorvegicus.UCSC.rn5.refGene | org.Rn.eg.db |
sacCer3 | BSgenome.Scerevisiae.UCSC.sacCer3 | TxDb.Scerevisiae.UCSC.sacCer3.sgdGene | org.Sc.sgd.db |
sacCer2 | BSgenome.Scerevisiae.UCSC.sacCer2 | TxDb.Scerevisiae.UCSC.sacCer2.sgdGene | org.Sc.sgd.db |
susScr3 | BSgenome.Sscrofa.UCSC.susScr3 | TxDb.Sscrofa.UCSC.susScr3.refGene | org.Ss.eg.db |
These configurations are also optional. “tmpdir” is the path where all temporary data are saved and is also the default path for result storage. If it is not configured, the current working directory is used as “tmpdir”. “threads” is the maximum number of threads allowed to be created for data processing. The default value is 1. More threads will consume more memory in some processes.
# we use the temp directory "td" here
# Change it to your own directory because the intermediate files may be huge
td<-tempdir()
options(atacConf=setConfigure("tmpdir",td))
options(atacConf=setConfigure("threads",8))
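Since more threads can mean more memory, one option is to cap the requested thread count at the number of cores the machine reports. This is a small sketch using the `parallel` package that ships with R; the helper name `pick_threads` is hypothetical, and the final `setConfigure` call is left commented out because it requires esATAC to be loaded.

```r
library(parallel)

# Hypothetical helper: never request more threads than detected cores.
pick_threads <- function(requested, available = detectCores()) {
  # detectCores() may return NA when the core count cannot be determined.
  if (is.na(available)) available <- 1L
  max(1L, min(as.integer(requested), as.integer(available)))
}

threads <- pick_threads(8)
print(threads)

# With esATAC loaded, the value could then be passed on like this:
# options(atacConf = setConfigure("threads", threads))
```

Choosing a value below the core count may still be sensible on machines with limited memory.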
We strongly recommend installing the reference data before using the package, although this is optional. “refdir” is the folder in which all of the reference data will be saved. “genome” is the genome name, such as hg19, hg38, mm10 or mm9. The program will detect the elements that have not been installed and install them. Some resources need to be downloaded from the internet, so make sure you are connected during installation; otherwise the installation will fail. Even if all of the reference data have already been installed, these two lines still need to be called to configure the reference data path and the genome.
#uncomment and modify to run:
#options(atacConf=setConfigure("refdir","path/to/refdatafolder"))
#options(atacConf=setConfigure("genome","hg19"))
NOTE: The installation will take several hours for downloading data and building the bowtie2 index, depending on computer performance and network bandwidth.
NOTE: The installation is network based, so please keep your network connection active. You do not need to worry about a disconnection, however: the program will check which parts have finished and only build the unfinished parts.
WARNING: If the reference data is not configured, the related reference arguments of the functions have to be set manually when they are used.
Building the bowtie2 index may take some time. If you already have bowtie2 index files, or you want to download them instead of building them, you can let esATAC skip this step by renaming the files to follow the format (genome+suffix) and putting them in the reference installation path (refdir).
Example: hg19 bowtie2 index files
bowtie2 index download path:
ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes
Users can use %>% to build a pipeline that produces merged, renamed and adapter-removed clean FASTQ file(s) ready for mapping.
# Unzip and merge the replicates, rename the reads, then remove adapters
prefix<-system.file(package="esATAC", "extdata", "uzmg")
(reads_1 <-file.path(prefix,"m1",dir(file.path(prefix,"m1"))))
(reads_2 <-file.path(prefix,"m2",dir(file.path(prefix,"m2"))))
reads_merged_1 <- file.path(td,"reads1.fastq")
reads_merged_2 <- file.path(td,"reads2.fastq")
atacproc <-
atacUnzipAndMerge(fastqInput1 = reads_1,fastqInput2 = reads_2) %>%
atacRenamer %>% atacRemoveAdapter
If you want to modify the parameters of AdapterRemoval, refer to the Rbowtie2 package:
library(Rbowtie2)
adapterremoval_usage()
## [1] "AdapterRemoval ver. 2.2.1a"
## [2] ""
## [3] "This program searches for and removes remnant adapter sequences from"
## [4] "your read data. The program can analyze both single end and paired end"
## [5] "data. For detailed explanation of the parameters, please refer to the"
## [6] "man page. For comments, suggestions and feedback please contact Stinus"
## [7] "Lindgreen (stinus@binf.ku.dk) and Mikkel Schubert (MikkelSch@gmail.com)."
## [8] ""
## [9] "If you use the program, please cite the paper:"
## [10] " Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid"
## [11] " adapter trimming, identification, and read merging."
## [12] " BMC Research Notes, 12;9(1):88."
## [13] ""
## [14] " http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2"
## [15] ""
## [16] ""
## [17] "Arguments: Description:"
## [18] " --help Display this message."
## [19] " --version Print the version string."
## [20] ""
## [21] " --file1 FILE [FILE ...] Input files containing mate 1 reads or"
## [22] " single-ended reads; one or more files"
## [23] " may be listed [REQUIRED]."
## [24] " --file2 [FILE ...] Input files containing mate 2 reads; if"
## [25] " used, then the same number of files as"
## [26] " --file1 must be listed [OPTIONAL]."
## [27] ""
## [28] "FASTQ OPTIONS:"
## [29] " --qualitybase BASE Quality base used to encode Phred scores"
## [30] " in input; either 33, 64, or solexa"
## [31] " [current: 33]."
## [32] " --qualitybase-output BASE Quality base used to encode Phred scores"
## [33] " in output; either 33, 64, or solexa."
## [34] " By default, reads will be written in"
## [35] " the same format as the that specified"
## [36] " using --qualitybase."
## [37] " --qualitymax BASE Specifies the maximum Phred score"
## [38] " expected in input files, and used when"
## [39] " writing output. ASCII encoded values"
## [40] " are limited to the characters '!'"
## [41] " (ASCII = 33) to '~' (ASCII = 126),"
## [42] " meaning that possible scores are 0 -"
## [43] " 93 with offset 33, and 0 - 62 for"
## [44] " offset 64 and Solexa scores [default:"
## [45] " 41]."
## [46] " --mate-separator CHAR Character separating the mate number (1"
## [47] " or 2) from the read name in FASTQ"
## [48] " records [default: '/']."
## [49] " --interleaved This option enables both the"
## [50] " --interleaved-input option and the"
## [51] " --interleaved-output option [current:"
## [52] " off]."
## [53] " --interleaved-input The (single) input file provided"
## [54] " contains both the mate 1 and mate 2"
## [55] " reads, one pair after the other, with"
## [56] " one mate 1 reads followed by one mate"
## [57] " 2 read. This option is implied by the"
## [58] " --interleaved option [current: off]."
## [59] " --interleaved-output If set, trimmed paired-end reads are"
## [60] " written to a single file containing"
## [61] " mate 1 and mate 2 reads, one pair"
## [62] " after the other. This option is"
## [63] " implied by the --interleaved option"
## [64] " [current: off]."
## [65] " --combined-output If set, all reads are written to the"
## [66] " same file(s), specified by --output1"
## [67] " and --output2 (--output1 only if"
## [68] " --interleaved-output is not set). Each"
## [69] " read is further marked by either a"
## [70] " \"PASSED\" or a \"FAILED\" flag, and any"
## [71] " read that has been FAILED (including"
## [72] " the mate for collapsed reads) are"
## [73] " replaced with a single 'N' with Phred"
## [74] " score 0 [current: off]."
## [75] ""
## [76] "OUTPUT FILES:"
## [77] " --basename BASENAME Default prefix for all output files for"
## [78] " which no filename was explicitly set"
## [79] " [current: your_output]."
## [80] " --settings FILE Output file containing information on"
## [81] " the parameters used in the run as well"
## [82] " as overall statistics on the reads"
## [83] " after trimming [default:"
## [84] " BASENAME.settings]"
## [85] " --output1 FILE Output file containing trimmed mate1"
## [86] " reads [default:"
## [87] " BASENAME.pair1.truncated (PE),"
## [88] " BASENAME.truncated (SE), or"
## [89] " BASENAME.paired.truncated (interleaved"
## [90] " PE)]"
## [91] " --output2 FILE Output file containing trimmed mate 2"
## [92] " reads [default:"
## [93] " BASENAME.pair2.truncated (only used in"
## [94] " PE mode, but not if"
## [95] " --interleaved-output is enabled)]"
## [96] " --singleton FILE Output file to which containing paired"
## [97] " reads for which the mate has been"
## [98] " discarded [default:"
## [99] " BASENAME.singleton.truncated]"
## [100] " --outputcollapsed FILE If --collapsed is set, contains"
## [101] " overlapping mate-pairs which have been"
## [102] " merged into a single read (PE mode) or"
## [103] " reads for which the adapter was"
## [104] " identified by a minimum overlap,"
## [105] " indicating that the entire template"
## [106] " molecule is present. This does not"
## [107] " include which have subsequently been"
## [108] " trimmed due to low-quality or"
## [109] " ambiguous nucleotides [default:"
## [110] " BASENAME.collapsed]"
## [111] " --outputcollapsedtruncated FILE Collapsed reads (see --outputcollapsed)"
## [112] " which were trimmed due the presence of"
## [113] " low-quality or ambiguous nucleotides"
## [114] " [default:"
## [115] " BASENAME.collapsed.truncated]"
## [116] " --discarded FILE Contains reads discarded due to the"
## [117] " --minlength, --maxlength or --maxns"
## [118] " options [default: BASENAME.discarded]"
## [119] ""
## [120] "TRIMMING SETTINGS:"
## [121] " --adapter1 SEQUENCE Adapter sequence expected to be found in"
## [122] " mate 1 reads [current:"
## [123] " AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG]."
## [124] " --adapter2 SEQUENCE Adapter sequence expected to be found in"
## [125] " mate 2 reads [current:"
## [126] " AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT]."
## [127] " --adapter-list FILENAME Read table of white-space separated"
## [128] " adapters pairs, used as if the first"
## [129] " column was supplied to --adapter1, and"
## [130] " the second column was supplied to"
## [131] " --adapter2; only the first adapter in"
## [132] " each pair is required SE trimming mode"
## [133] " [current: <not set>]."
## [134] ""
## [135] " --mm MISMATCH_RATE Max error-rate when aligning reads"
## [136] " and/or adapters. If > 1, the max"
## [137] " error-rate is set to 1 /"
## [138] " MISMATCH_RATE; if < 0, the defaults"
## [139] " are used, otherwise the user-supplied"
## [140] " value is used directly [defaults: 1/3"
## [141] " for trimming; 1/10 when identifying"
## [142] " adapters]."
## [143] " --maxns MAX Reads containing more ambiguous bases"
## [144] " (N) than this number after trimming"
## [145] " are discarded [current: 1000]."
## [146] " --shift N Consider alignments where up to N"
## [147] " nucleotides are missing from the 5'"
## [148] " termini [current: 2]."
## [149] ""
## [150] " --trimns If set, trim ambiguous bases (N) at"
## [151] " 5'/3' termini [current: off]"
## [152] " --trimqualities If set, trim bases at 5'/3' termini with"
## [153] " quality scores <= to --minquality"
## [154] " value [current: off]"
## [155] " --trimwindows INT If set, quality trimming will be carried"
## [156] " out using window based approach, where"
## [157] " windows with an average quality less"
## [158] " than --minquality will be trimmed. If"
## [159] " >= 1, this value will be used as the"
## [160] " window size. If the value is < 1, the"
## [161] " value will be multiplied with the read"
## [162] " length to determine a window size per"
## [163] " read. If the resulting window size is"
## [164] " 0 or larger than the read length, the"
## [165] " read length is used as the window"
## [166] " size. This option implies"
## [167] " --trimqualities [current: <not set>]."
## [168] " --minquality PHRED Inclusive minimum; see --trimqualities"
## [169] " for details [current: 2]"
## [170] " --minlength LENGTH Reads shorter than this length are"
## [171] " discarded following trimming [current:"
## [172] " 15]."
## [173] " --maxlength LENGTH Reads longer than this length are"
## [174] " discarded following trimming [current:"
## [175] " 4294967295]."
## [176] " --collapse When set, paired ended read alignments"
## [177] " of --minalignmentlength or more bases"
## [178] " are combined into a single consensus"
## [179] " sequence, representing the complete"
## [180] " insert, and written to either"
## [181] " basename.collapsed or"
## [182] " basename.collapsed.truncated (if"
## [183] " trimmed due to low-quality bases"
## [184] " following collapse); for single-ended"
## [185] " reads, putative complete inserts are"
## [186] " identified as having at least"
## [187] " --minalignmentlength bases overlap"
## [188] " with the adapter sequence, and are"
## [189] " written to the the same files"
## [190] " [current: off]."
## [191] " --minalignmentlength LENGTH If --collapse is set, paired reads must"
## [192] " overlap at least this number of bases"
## [193] " to be collapsed, and single-ended"
## [194] " reads must overlap at least this"
## [195] " number of bases with the adapter to be"
## [196] " considered complete template molecules"
## [197] " [current: 11]."
## [198] " --minadapteroverlap LENGTH In single-end mode, reads are only"
## [199] " trimmed if the overlap between read"
## [200] " and the adapter is at least X bases"
## [201] " long, not counting ambiguous"
## [202] " nucleotides (N); this is independent"
## [203] " of the --minalignmentlength when using"
## [204] " --collapse, allowing a conservative"
## [205] " selection of putative complete inserts"
## [206] " while ensuring that all possible"
## [207] " adapter contamination is trimmed"
## [208] " [current: 0]."
## [209] ""
## [210] "DEMULTIPLEXING:"
## [211] " --barcode-list FILENAME List of barcodes or barcode pairs for"
## [212] " single or double-indexed"
## [213] " demultiplexing. Note that both indexes"
## [214] " should be specified for both"
## [215] " single-end and paired-end trimming, if"
## [216] " double-indexed multiplexing was used,"
## [217] " in order to ensure that the"
## [218] " demultiplexed reads can be trimmed"
## [219] " correctly [current: <not set>]."
## [220] " --barcode-mm N Maximum number of mismatches allowed"
## [221] " when counting mismatches in both the"
## [222] " mate 1 and the mate 2 barcode for"
## [223] " paired reads."
## [224] " --barcode-mm-r1 N Maximum number of mismatches allowed for"
## [225] " the mate 1 barcode; if not set, this"
## [226] " value is equal to the '--barcode-mm'"
## [227] " value; cannot be higher than the"
## [228] " '--barcode-mm value'."
## [229] " --barcode-mm-r2 N Maximum number of mismatches allowed for"
## [230] " the mate 2 barcode; if not set, this"
## [231] " value is equal to the '--barcode-mm'"
## [232] " value; cannot be higher than the"
## [233] " '--barcode-mm value'."
## [234] " --demultiplex-only Only carry out demultiplexing using the"
## [235] " list of barcodes supplied with"
## [236] " --barcode-list; do not attempt to trim"
## [237] " adapters to carry out other"
## [238] " processing."
## [239] ""
## [240] "MISC:"
## [241] " --identify-adapters Attempt to identify the adapter pair of"
## [242] " PE reads, by searching for overlapping"
## [243] " reads [current: off]."
## [244] " --seed SEED Sets the RNG seed used when choosing"
## [245] " between bases with equal Phred scores"
## [246] " when collapsing. Note that runs are"
## [247] " not deterministic if more than one"
## [248] " thread is used. If not specified, a"
## [249] " seed is generated using the current"
## [250] " time."
## [251] " --threads THREADS Maximum number of threads [current: 1]"
## [252] ""
If the reference has not been configured, the bowtie2 index must be built first. The bowtie2 mapping functions can then be used to map reads to the reference genome.
## Building a bowtie2 index
library("Rbowtie2")
refs <- dir(system.file(package="esATAC", "extdata", "bt2","refs"),
full=TRUE)
bowtie2_build(references=refs, bt2Index=file.path(td, "lambda_virus"),
"--threads 4 --quiet",overwrite=TRUE)
## Alignments
reads_1 <- system.file(package="esATAC", "extdata", "bt2", "reads",
"reads_1.fastq")
reads_2 <- system.file(package="esATAC", "extdata", "bt2", "reads",
"reads_2.fastq")
if(file.exists(file.path(td, "lambda_virus.1.bt2"))){
(bowtie2Mapping(bt2Idx = file.path(td, "lambda_virus"),
samOutput = file.path(td, "result.sam"),
fastqInput1=reads_1,fastqInput2=reads_2,threads=3))
head(readLines(file.path(td, "result.sam")))
}
If you want to modify the parameters of bowtie2, refer to the Rbowtie2 package:
library(Rbowtie2)
bowtie2_usage()
## [1] "Bowtie 2 version 2.3.2 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)"
## [2] "Usage: "
## [3] " bowtie2-align [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i>} [-S <sam>]"
## [4] ""
## [5] " <bt2-idx> Index filename prefix (minus trailing .X.bt2)."
## [6] " NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible."
## [7] " <m1> Files with #1 mates, paired with files in <m2>."
## [8] " <m2> Files with #2 mates, paired with files in <m1>."
## [9] " <r> Files with unpaired reads."
## [10] " <i> Files with interleaved paired-end FASTQ reads"
## [11] " <sam> File for SAM output (default: stdout)"
## [12] ""
## [13] " <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be"
## [14] " specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'."
## [15] ""
## [16] "Options (defaults in parentheses):"
## [17] ""
## [18] " Input:"
## [19] " -q query input files are FASTQ .fq/.fastq (default)"
## [20] " --tab5 query input files are TAB5 .tab5"
## [21] " --tab6 query input files are TAB6 .tab6"
## [22] " --qseq query input files are in Illumina's qseq format"
## [23] " -f query input files are (multi-)FASTA .fa/.mfa"
## [24] " -r query input files are raw one-sequence-per-line"
## [25] " -c <m1>, <m2>, <r> are sequences themselves, not files"
## [26] " -s/--skip <int> skip the first <int> reads/pairs in the input (none)"
## [27] " -u/--upto <int> stop after first <int> reads/pairs (no limit)"
## [28] " -5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)"
## [29] " -3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)"
## [30] " --phred33 qualities are Phred+33 (default)"
## [31] " --phred64 qualities are Phred+64"
## [32] " --int-quals qualities encoded as space-delimited integers"
## [33] ""
## [34] " Presets: Same as:"
## [35] " For --end-to-end:"
## [36] " --very-fast -D 5 -R 1 -N 0 -L 22 -i S,0,2.50"
## [37] " --fast -D 10 -R 2 -N 0 -L 22 -i S,0,2.50"
## [38] " --sensitive -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)"
## [39] " --very-sensitive -D 20 -R 3 -N 0 -L 20 -i S,1,0.50"
## [40] ""
## [41] " For --local:"
## [42] " --very-fast-local -D 5 -R 1 -N 0 -L 25 -i S,1,2.00"
## [43] " --fast-local -D 10 -R 2 -N 0 -L 22 -i S,1,1.75"
## [44] " --sensitive-local -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default)"
## [45] " --very-sensitive-local -D 20 -R 3 -N 0 -L 20 -i S,1,0.50"
## [46] ""
## [47] " Alignment:"
## [48] " -N <int> max # mismatches in seed alignment; can be 0 or 1 (0)"
## [49] " -L <int> length of seed substrings; must be >3, <32 (22)"
## [50] " -i <func> interval between seed substrings w/r/t read len (S,1,1.15)"
## [51] " --n-ceil <func> func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)"
## [52] " --dpad <int> include <int> extra ref chars on sides of DP table (15)"
## [53] " --gbar <int> disallow gaps within <int> nucs of read extremes (4)"
## [54] " --ignore-quals treat all quality values as 30 on Phred scale (off)"
## [55] " --nofw do not align forward (original) version of read (off)"
## [56] " --norc do not align reverse-complement version of read (off)"
## [57] " --no-1mm-upfront do not allow 1 mismatch alignments before attempting to"
## [58] " scan for the optimal seeded alignments"
## [59] " --end-to-end entire read must align; no clipping (on)"
## [60] " OR"
## [61] " --local local alignment; ends might be soft clipped (off)"
## [62] ""
## [63] " Scoring:"
## [64] " --ma <int> match bonus (0 for --end-to-end, 2 for --local) "
## [65] " --mp <int> max penalty for mismatch; lower qual = lower penalty (6)"
## [66] " --np <int> penalty for non-A/C/G/Ts in read/ref (1)"
## [67] " --rdg <int>,<int> read gap open, extend penalties (5,3)"
## [68] " --rfg <int>,<int> reference gap open, extend penalties (5,3)"
## [69] " --score-min <func> min acceptable alignment score w/r/t read length"
## [70] " (G,20,8 for local, L,-0.6,-0.6 for end-to-end)"
## [71] ""
## [72] " Reporting:"
## [73] " (default) look for multiple alignments, report best, with MAPQ"
## [74] " OR"
## [75] " -k <int> report up to <int> alns per read; MAPQ not meaningful"
## [76] " OR"
## [77] " -a/--all report all alignments; very slow, MAPQ not meaningful"
## [78] ""
## [79] " Effort:"
## [80] " -D <int> give up extending after <int> failed extends in a row (15)"
## [81] " -R <int> for reads w/ repetitive seeds, try <int> sets of seeds (2)"
## [82] ""
## [83] " Paired-end:"
## [84] " -I/--minins <int> minimum fragment length (0)"
## [85] " -X/--maxins <int> maximum fragment length (500)"
## [86] " --fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)"
## [87] " --no-mixed suppress unpaired alignments for paired reads"
## [88] " --no-discordant suppress discordant alignments for paired reads"
## [89] " --dovetail concordant when mates extend past each other"
## [90] " --no-contain not concordant when one mate alignment contains other"
## [91] " --no-overlap not concordant when mates overlap at all"
## [92] ""
## [93] " Output:"
## [94] " -t/--time print wall-clock time taken by search phases"
## [95] " --quiet print nothing to stderr except serious errors"
## [96] " --met-file <path> send metrics to file at <path> (off)"
## [97] " --met-stderr send metrics to stderr (off)"
## [98] " --met <int> report internal counters & metrics every <int> secs (1)"
## [99] " --no-unal suppress SAM records for unaligned reads"
## [100] " --no-head suppress header lines, i.e. lines starting with @"
## [101] " --no-sq suppress @SQ header lines"
## [102] " --rg-id <text> set read group id, reflected in @RG line and RG:Z: opt field"
## [103] " --rg <text> add <text> (\"lab:value\") to @RG line of SAM header."
## [104] " Note: @RG line only printed when --rg-id is set."
## [105] " --omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments."
## [106] " --sam-noqname-trunc Suppress standard behavior of truncating readname at first whitespace "
## [107] " at the expense of generating non-standard SAM."
## [108] ""
## [109] " Performance:"
## [110] " -p/--threads <int> number of alignment threads to launch (1)"
## [111] " --reorder force SAM output order to match order of input reads"
## [112] " --mm use memory-mapped I/O for index; many 'bowtie's can share"
## [113] ""
## [114] " Other:"
## [115] " --qc-filter filter out reads that are bad according to QSEQ filter"
## [116] " --seed <int> seed for random number generator (0)"
## [117] " --non-deterministic seed rand. gen. arbitrarily instead of using read attributes"
## [118] " --version print version information and quit"
## [119] " -h/--help print this usage message"
The mapping results are stored in a SAM file. The samToBed function converts it into a BED file. Operations such as sorting, shifting, and chromosome filtering can be configured to run during the conversion.
sambzfile <- system.file(package="esATAC", "extdata", "Example.sam.bz2")
samfile <- file.path(td,"Example.sam")
bunzip2(sambzfile,destname=samfile,overwrite=TRUE,remove=FALSE)
samToBed(samInput = samfile)
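To illustrate what "shifting" means during the conversion, here is a minimal base R sketch that turns the leftmost position and template length of a properly paired fragment into a shifted BED interval. The function name `sam_to_bed_fragment` is hypothetical, and the +4/-5 offsets are the convention commonly used for Tn5 insertion sites in ATAC-seq; whether samToBed applies exactly these values by default is an assumption here.

```r
# Convert 1-based SAM coordinates to a 0-based, half-open BED interval and
# shift both ends (assumed offsets: +4 on the left end, -5 on the right end).
sam_to_bed_fragment <- function(chrom, pos, tlen,
                                pos_offset = 4L, neg_offset = -5L) {
  stopifnot(all(tlen > 0))
  start <- pos - 1L + pos_offset          # 0-based start, shifted right
  end   <- pos - 1L + tlen + neg_offset   # exclusive end, shifted left
  data.frame(chrom = chrom, start = start, end = end,
             stringsAsFactors = FALSE)
}

bed <- sam_to_bed_fragment(
  chrom = c("chr20", "chr20"),
  pos   = c(1001L, 5001L),   # leftmost mapped position of each fragment
  tlen  = c(150L, 80L)       # observed template length
)
bed
```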
Keep only the nucleosome-free reads (<100 bp) for peak calling.
bedbzfile <- system.file(package="esATAC", "extdata", "chr20.50000.bed.bz2")
bedfile <- file.path(td,"chr20.50000.bed")
bunzip2(bedbzfile,destname=bedfile,overwrite=TRUE,remove=FALSE)
bedUtils(bedInput = bedfile,maxFragLen = 100, chrFilterList = NULL) %>%
atacPeakCalling
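Conceptually, the length filter keeps only fragments shorter than the threshold. The base R sketch below shows the idea on a hypothetical BED-like data frame; it follows the "<100 bp" wording above and uses a strict comparison, which may differ from how `maxFragLen` treats the boundary value.

```r
# Hypothetical fragments in BED coordinates (0-based start, exclusive end).
bed <- data.frame(
  chrom = "chr20",
  start = c(100L, 400L, 900L, 1500L),
  end   = c(180L, 650L, 999L, 1600L),
  stringsAsFactors = FALSE
)

# Fragment length is end - start for half-open intervals.
frag_len <- bed$end - bed$start

# Keep fragments strictly shorter than 100 bp.
nucleosome_free <- bed[frag_len < 100L, ]
nucleosome_free
```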
ATAC-seq peaks are located in open chromatin regions. Annotating these peaks shows whether they fall in functional regions (such as promoters and enhancers).
Functions “atacPeakAnno” and “peakanno” use the function “annotatePeak” in the package “ChIPseeker” to annotate ATAC-seq peaks. For more information about “ChIPseeker”, see [4].
Functions “atacPeakAnno” and “peakanno” accept a BED file path as input. Users can change parameters such as “tssRegion” and “TxDb” according to their requirements. Bioconductor now provides annotation databases for many species.
The following example shows how to annotate a UCSC BED file.
## extract example peak file from package "esATAC"
p1bz <- system.file("extdata", "Example_peak1.bed.bz2", package="esATAC")
peak1_path <- as.vector(bunzip2(filename = p1bz,
destname = file.path(getwd(), "Example_peak1.bed"),
ext="bz2", FUN=bzfile, overwrite=TRUE, remove = FALSE))
## run peakanno to annotate peaks
AnnoInfo <- peakanno(peakInput = peak1_path, TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene, annoDb = "org.Hs.eg.db")
The output includes a pie chart in PDF format that reports the percentage of peaks located in each type of functional region.
The function also generates a file (with suffix .df) that contains the annotation for all peaks. It is converted from a data frame in R, and users can open it with a text editor or Excel. Part of the output is shown below.
chromatin | start | end | annotation | geneStart | geneEnd | geneId | distanceToTSS | SYMBOL |
---|---|---|---|---|---|---|---|---|
chr1 | 416606 | 416895 | Distal Intergenic | 367659 | 368597 | 729759 | 48947 | OR4F29 |
chr1 | 2313275 | 2313587 | Intron (uc001ajb.1/79906, intron 6 of 13) | 2252696 | 2322993 | 79906 | 9406 | MORN1 |
chr1 | 2516858 | 2518618 | Promoter | 2517899 | 2522908 | 127281 | 0 | FAM213B |
chr1 | 2685755 | 2686581 | Intron (uc021oey.1/100287898, intron 4 of 6) | 2572807 | 2706230 | 100287898 | 19649 | TTC34 |
chr1 | 3418588 | 3418822 | Intron (uc001akk.3/1953, intron 14 of 29) | 3404506 | 3448012 | 1953 | 29190 | MEGF6 |
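The percentages shown in the pie chart can be approximated from the annotation column of such a table. The base R sketch below uses only the five rows shown above and collapses the detailed intron labels into a single "Intron" category; the exact grouping used for the package's chart may differ.

```r
# Annotation values copied from the five example rows above.
annotation <- c(
  "Distal Intergenic",
  "Intron (uc001ajb.1/79906, intron 6 of 13)",
  "Promoter",
  "Intron (uc021oey.1/100287898, intron 4 of 6)",
  "Intron (uc001akk.3/1953, intron 14 of 29)"
)

# Drop the parenthesised transcript detail to get the region category.
category <- sub(" \\(.*$", "", annotation)

# Percentage of peaks per category.
pct <- round(100 * prop.table(table(category)), 1)
pct
```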
GO analysis performs enrichment analysis on gene sets. It establishes the relationship between gene sets and functions, and reports the most significant functions to users.
Functions “atacGOAnalysis” and “goanalysis” use the function “enrichGO” in the package “clusterProfiler” to do GO analysis. For more information about “clusterProfiler”, see [5].
The function needs a set of gene IDs as input. Users can choose among the GO ontologies (molecular function, biological process, and cellular component) through the parameter “ont”.
The following example shows how to do GO analysis on a gene set.
## extract gene ID
library(clusterProfiler)
data(geneList)
geneId <- names(geneList)[1:100]
## do GO analysis
goAna <- goanalysis(gene = geneId, OrgDb = "org.Hs.eg.db", keytype = "ENTREZID", ont = "MF")
The output file (suffix .df) contains the GO terms sorted by p-value. Part of the output is shown below.
ID | Description | GeneRatio | pvalue | qvalue |
---|---|---|---|---|
GO:0008017 | microtubule binding | 13/95 | 0.0e+00 | 1.00e-07 |
GO:0015631 | tubulin binding | 13/95 | 0.0e+00 | 1.10e-06 |
GO:0050786 | RAGE receptor binding | 4/95 | 3.0e-07 | 2.04e-05 |
GO:0003777 | microtubule motor activity | 7/95 | 4.0e-07 | 2.14e-05 |
GO:0045236 | CXCR chemokine receptor binding | 4/95 | 1.5e-06 | 6.60e-05 |
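Over-representation p-values of this kind are typically computed with a hypergeometric test. The base R sketch below reproduces that calculation for one term. The 13 of 95 annotated input genes come from the GeneRatio column above, while the background size and the number of background genes carrying the term are hypothetical values chosen for illustration, so the result will not match the table.

```r
k <- 13L      # input genes annotated with the term (from GeneRatio 13/95)
n <- 95L      # input genes with any annotation (from GeneRatio 13/95)
M <- 200L     # hypothetical: background genes annotated with the term
N <- 17000L   # hypothetical: total annotated background genes

# P(X >= k) when drawing n genes from N, of which M carry the term.
p_value <- phyper(k - 1L, M, N - M, n, lower.tail = FALSE)

# Expected number of annotated genes under the null, for comparison with k.
expected <- n * M / N
```

In a real analysis, the p-values of all tested terms would then be adjusted for multiple testing, which is what the qvalue column reports.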
This function searches for motif occurrences in the given regions.
Functions “atacMotifScan” and “motifscan” use the function “matchPWM” in the package “Biostrings”. For more parameters and usage, see [6].
Multiple motifs are supported, and the output files are named according to the input PWM list. For multiple motifs, parallel computing is available: users can set the parameter “n.cores” to speed up the scan.
The input motif PWMs are stored in a list like the one below.
pwm <- readRDS(system.file("extdata", "motifPWM.rds", package="esATAC"))
pwm
## $CTCF
## 1 2 3 4 5 6
## A 0.02960540 0.03743389 0.04368501 0.02430735 0.001023094 0.05537836
## C 0.04410295 0.03573708 0.02271820 0.05624947 0.057703951 0.00675723
## G 0.02797499 0.04833945 0.04931320 0.01254142 -0.069438277 0.02610485
## T 0.04957887 0.03879208 0.03478530 0.01903153 -0.015525860 0.03014738
## 7 8 9 10 11 12
## A 0.020271900 0.03209952 0.057005153 -0.00460166 0.045748514 0.023909107
## C 0.051259116 0.04889070 0.004838716 -0.06943828 -0.010700941 0.005881734
## G 0.045758368 0.02246973 0.017613999 0.05773064 0.052120515 0.050726179
## T 0.004838716 0.04540858 0.010683413 -0.01070094 0.002433989 0.046034249
## 13 14 15 16 17 18
## A 0.005881734 0.024346457 0.03179891 0.04710491 0.02895828 0.03323194
## C -0.069438277 0.001023094 0.05525243 0.00684137 0.05022974 0.04538490
## G 0.057569617 0.055907445 -0.00460166 0.05082564 0.04481264 0.02756395
## T 0.001023094 0.027215451 0.02651920 0.01005852 0.01942021 0.04786956
## 19
## A 0.04804997
## C 0.03846515
## G 0.04309319
## T 0.02501100
##
## $ATF3
## 1 2 3 4 5 6
## A 0.04367299 0.03800285 0.07142085 0.009661673 -0.02889619 0.07216227
## C 0.05222528 0.03098095 0.03725448 -0.028896186 -0.02889619 -0.02889619
## G 0.04093099 0.07155594 0.04294621 -0.028896186 0.07216227 -0.02889619
## T 0.06989254 0.03510088 0.01473202 0.072148910 -0.02889619 -0.02889619
## 7 8 9 10 11 12
## A 0.01301289 0.02882151 -0.02889619 0.02959694 0.07211734 0.02431670
## C 0.07214318 -0.02889619 -0.02889619 0.07195987 -0.02889619 0.04391574
## G -0.02889619 0.07206064 0.01651982 0.01907820 -0.02889619 0.04041291
## T -0.02889619 -0.02889619 0.07213457 0.02292425 0.02109342 0.07121981
## 13 14
## A 0.03937329 0.06758346
## C 0.07143835 0.04184396
## G 0.03039048 0.05912350
## T 0.03823980 0.04864436
Use the “motifscan” function to search for motifs in the given genome regions. A UCSC BED file is recommended.
sample.path <- system.file("extdata", "chr20_sample_peak.bed.bz2", package="esATAC")
sample.path <- as.vector(bunzip2(filename = sample.path,
destname = file.path(getwd(), "chr20_sample_peak.bed"),
ext="bz2", FUN=bzfile, overwrite=TRUE, remove = FALSE))
motif.data <- motifscan(peak = sample.path, genome = BSgenome.Hsapiens.UCSC.hg19,
motifPWM = pwm, prefix = "test")
This function reports the exact motif positions in the given genome, as shown below (motif: CTCF).
chromatin | start | end | strand | score | sequence |
---|---|---|---|---|---|
chr20 | 189774 | 189792 | + | 0.8794931 | ACTCCTCTAGAGGGTGCTC |
chr20 | 239773 | 239791 | + | 0.9003697 | TTGCCACTGGGGGGAGACA |
chr20 | 247783 | 247801 | - | 0.9337214 | CTGCCGGCAGATGGCGGTA |
chr20 | 281074 | 281092 | - | 0.8511201 | TTGCCTGCAGGGGTGGGAA |
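The idea behind a PWM scan can be sketched in a few lines of base R: slide a window over the sequence, sum the matrix entries of the observed bases, and keep windows whose score reaches a fraction of the highest possible score. The tiny 3-column matrix, the function name `scan_pwm`, and the threshold are hypothetical, and only the forward strand is scanned; this is an illustration of the technique, not the code used by “motifscan” or “matchPWM”.

```r
# Hypothetical 4 x 3 PWM whose best match is "ACG".
pwm <- matrix(c(
   1, -1, -1,   # A
  -1,  1, -1,   # C
  -1, -1,  1,   # G
  -1, -1, -1    # T
), nrow = 4, byrow = TRUE, dimnames = list(c("A", "C", "G", "T"), NULL))

scan_pwm <- function(pwm, sequence, min_frac = 0.8) {
  bases <- strsplit(sequence, "")[[1]]
  w <- ncol(pwm)
  stopifnot(length(bases) >= w)
  starts <- seq_len(length(bases) - w + 1L)
  # Score each window as the sum of the PWM entries of its bases.
  scores <- vapply(starts, function(s) {
    window <- bases[s:(s + w - 1L)]
    sum(pwm[cbind(match(window, rownames(pwm)), seq_len(w))])
  }, numeric(1))
  # Threshold relative to the best achievable score.
  max_score <- sum(apply(pwm, 2, max))
  keep <- scores >= min_frac * max_score
  data.frame(
    start = starts[keep],
    end = starts[keep] + w - 1L,
    score = scores[keep],
    sequence = substring(sequence, starts[keep], starts[keep] + w - 1L),
    stringsAsFactors = FALSE
  )
}

hits <- scan_pwm(pwm, "TTACGTACGA")
hits
```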
The interaction between a TF and DNA leaves a “footprint” at the motif position, but it is not evident at a single site, so footprints must be aggregated across sites. In addition, only Tn5 cut sites are considered. This function is based on the motif scan.
First, collect all cut sites from the BED file (note: every line in the BED file is a DNA fragment) and save them.
## extract cut site position from bed file
fra_path <- system.file("extdata", "chr20.50000.bed.bz2", package="esATAC")
frag <- as.vector(bunzip2(filename = fra_path,
destname = file.path(getwd(), "chr20.50000.bed"),
ext="bz2", FUN=bzfile, overwrite=TRUE, remove = FALSE))
cs.data <- extractcutsite(bedInput = frag, prefix = "ATAC")
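Because each BED line is one fragment, every fragment contributes two cut sites, one at each end. The base R sketch below tabulates cut sites from a hypothetical set of fragments. It assumes 0-based, half-open BED coordinates and reports 1-based positions; the exact coordinate convention used by “extractcutsite” is not stated here and may differ.

```r
# Hypothetical fragments in BED coordinates.
frags <- data.frame(
  chrom = "chr20",
  start = c(100L, 100L, 150L),
  end   = c(180L, 220L, 220L),
  stringsAsFactors = FALSE
)

# One cut site at each end: first base (start + 1) and last base (end).
cut_sites <- data.frame(
  chrom = rep(frags$chrom, 2),
  pos   = c(frags$start + 1L, frags$end),
  n     = 1L,
  stringsAsFactors = FALSE
)

# Count how many fragment ends fall on each position.
counts <- aggregate(n ~ chrom + pos, data = cut_sites, FUN = sum)
counts
```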
Next, plot the footprints for the different motifs.
The motif scan produced a variable named “motif.data” that contains the information for multiple motifs. To plot the footprints of these motifs in a single step, pass the output of the function motifscan (here, “motif.data”) to the next call.
fp <- atacCutSiteCount(atacProcCutSite = cs.data, atacProcMotifScan = motif.data)
The following is the CTCF footprint computed from the example data.
Note: only a small part of chromosome 20 is used in this example.
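The aggregation step can be sketched as follows: for every motif occurrence, express each nearby cut site as an offset relative to the motif start, then sum the counts per offset over all occurrences. The function `aggregate_footprint` and the data are hypothetical, strand is ignored, and all motif sites are assumed to have the same width; this is an illustration of the idea rather than the method implemented in “atacCutSiteCount”.

```r
aggregate_footprint <- function(cut_pos, motif_start, motif_end, flank = 3L) {
  width <- motif_end[1] - motif_start[1] + 1L
  offsets <- seq(-flank, width - 1L + flank)
  profile <- integer(length(offsets))
  names(profile) <- offsets
  for (s in motif_start) {
    # Offset of every cut site relative to this motif's first base.
    rel <- cut_pos - s
    rel <- rel[rel >= -flank & rel <= width - 1L + flank]
    tab <- table(rel)
    profile[names(tab)] <- profile[names(tab)] + as.integer(tab)
  }
  profile
}

# Two hypothetical motif sites of width 3 and seven cut site positions.
profile <- aggregate_footprint(
  cut_pos     = c(7L, 8L, 14L, 27L, 34L, 34L, 11L),
  motif_start = c(10L, 30L),
  motif_end   = c(12L, 32L)
)
profile
```

A depletion of counts over the motif offsets relative to the flanks is what appears as the footprint in the plot.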
esATAC is organized as a data flow graph. Besides referring to the manual, users can print the map to see the workflow order.
bedbzfile <- system.file(package="esATAC", "extdata", "chr20.50000.bed.bz2")
bedfile <- file.path(td,"chr20.50000.bed")
bunzip2(bedbzfile,destname=bedfile,overwrite=TRUE,remove=FALSE)
peakproc <-bedUtils(bedInput = bedfile,maxFragLen = 100, chrFilterList = NULL) %>%
atacPeakCalling
peakproc %>% printMap
Printing the map makes it easy to see which processes can be called next and which preprocessing steps have already been done.
The parameters set for a process can be queried through its ATACProc object. You can list the available parameters like this:
# query all available parameters
getParamItems(peakproc)
## [1] "bedInput" "bedFileList" "inBedDir" "bedOutput"
## [5] "outTmpDir" "fragmentSize" "fileformat" "verbose"
The value of a specific parameter can be obtained like this:
# query a parameter value
getParam(peakproc,"fragmentSize")
## [1] 0
Similarly, the report values calculated by a process can be queried through its ATACProc object.
sambzfile <- system.file(package="esATAC", "extdata", "Example.sam.bz2")
samfile <- file.path(td,"Example.sam")
bunzip2(sambzfile,destname=samfile,overwrite=TRUE,remove=FALSE)
samToBedProc<-samToBed(samInput = samfile)
Once the ATACProc object is obtained, you can query all available report items.
getReportItems(samToBedProc)
## [1] "report" "total"
## [3] "save" "filted"
## [5] "extlen" "unique"
## [7] "multimap" "non-mitochondrial"
## [9] "non-mitochondrial-multimap"
The value of a specific report item can be obtained like this:
# query a report value
getReportVal(samToBedProc,"report")
## Item Value
## 1 total 209
## 2 save 13
## 3 filted 150
## 4 extlen 0
## 5 unique 0
## 6 multimap 15
If a process function is called again after it has already finished, it will not redo the work. To rerun the process, clear the cache first like this:
clearProcCache(peakproc)
process(peakproc)
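The skip-if-finished behaviour can be pictured with a small generic cache. The base R sketch below is not esATAC's implementation; `run_step`, `clear_step`, and `count_lines` are hypothetical names used only to show why a step must be cleared before it runs again.

```r
cache <- new.env()
run_count <- 0L

# Run `compute` only if no result is stored under `key`.
run_step <- function(key, compute) {
  if (!exists(key, envir = cache, inherits = FALSE)) {
    assign(key, compute(), envir = cache)
  }
  get(key, envir = cache, inherits = FALSE)
}

# Forget the stored result so the next call recomputes it.
clear_step <- function(key) {
  if (exists(key, envir = cache, inherits = FALSE)) {
    rm(list = key, envir = cache)
  }
  invisible(NULL)
}

# A stand-in for an expensive step; it records how often it really ran.
count_lines <- function() {
  run_count <<- run_count + 1L
  42L
}

first  <- run_step("peaks", count_lines)   # computed
second <- run_step("peaks", count_lines)   # served from the cache
runs_before_clear <- run_count

clear_step("peaks")
third <- run_step("peaks", count_lines)    # computed again
runs_after_clear <- run_count
```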
We would like to thank Huan Fang for package testing and valuable suggestions,
and Kui Hua for providing package testing on Macbook.
[1] Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357-359.
[2] Schubert, M., Lindgreen, S., & Orlando, L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 9(1), 88.
[3] Boyle, A. P., Guinney, J., Crawford, G. E., & Furey, T. S. (2008). F-Seq: a feature density estimator for high-throughput sequence tags. Bioinformatics, 24(21), 2537-2538.
[4] Yu G, Wang L and He Q (2015). “ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization.” Bioinformatics, 31(14), pp. 2382-2383. doi: 10.1093/bioinformatics/btv145.
[5] Yu G, Wang L, Han Y and He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), pp. 284-287. doi: 10.1089/omi.2011.0118.
[6] Pagès H, Aboyoun P, Gentleman R and DebRoy S (2017). Biostrings: String objects representing biological sequences, and matching algorithms. R package version 2.44.2.
[7] Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y., & Greenleaf, W. J. (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature methods, 10(12), 1213-1218.
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
## [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] clusterProfiler_3.6.0
## [2] DOSE_3.4.0
## [3] ChIPseeker_1.14.0
## [4] Rbowtie2_1.0.1
## [5] R.utils_2.6.0
## [6] R.oo_1.21.0
## [7] R.methodsS3_1.7.1
## [8] org.Hs.eg.db_3.5.0
## [9] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [10] GenomicFeatures_1.30.0
## [11] AnnotationDbi_1.40.0
## [12] Biobase_2.38.0
## [13] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [14] BSgenome_1.46.0
## [15] rtracklayer_1.38.1
## [16] magrittr_1.5
## [17] bindrcpp_0.2
## [18] esATAC_1.0.10
## [19] Rsamtools_1.30.0
## [20] Biostrings_2.46.0
## [21] XVector_0.18.0
## [22] QuasR_1.18.0
## [23] Rbowtie_1.18.0
## [24] GenomicRanges_1.30.0
## [25] GenomeInfoDb_1.14.0
## [26] IRanges_2.12.0
## [27] S4Vectors_0.16.0
## [28] BiocGenerics_0.24.0
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.1 fastmatch_1.1-0
## [3] VGAM_1.0-4 plyr_1.8.4
## [5] igraph_1.1.2 lazyeval_0.2.1
## [7] splines_3.4.3 BiocParallel_1.12.0
## [9] ggplot2_2.2.1 gridBase_0.4-7
## [11] TFBSTools_1.16.0 digest_0.6.12
## [13] BiocInstaller_1.28.0 htmltools_0.3.6
## [15] GOSemSim_2.4.0 viridis_0.4.0
## [17] GO.db_3.5.0 gdata_2.18.0
## [19] memoise_1.1.0 JASPAR2016_1.6.0
## [21] readr_1.1.1 annotate_1.56.1
## [23] matrixStats_0.52.2 prettyunits_1.0.2
## [25] colorspace_1.3-2 blob_1.1.0
## [27] dplyr_0.7.4 RCurl_1.95-4.8
## [29] jsonlite_1.5 bindr_0.1
## [31] TFMPvalue_0.0.6 brew_1.0-6
## [33] VariantAnnotation_1.24.2 glue_1.2.0
## [35] gtable_0.2.0 zlibbioc_1.24.0
## [37] UpSetR_1.3.3 DelayedArray_0.4.1
## [39] Rook_1.1-1 scales_0.5.0
## [41] futile.options_1.0.0 DBI_0.7
## [43] Rcpp_0.12.14 plotrix_3.6-6
## [45] viridisLite_0.2.0 xtable_1.8-2
## [47] progress_1.1.2 bit_1.1-12
## [49] htmlwidgets_0.9 httr_1.3.1
## [51] DiagrammeR_0.9.2 fgsea_1.4.0
## [53] gplots_3.0.1 RColorBrewer_1.1-2
## [55] rJava_0.9-9 pkgconfig_2.0.1
## [57] XML_3.98-1.9 rlang_0.1.4
## [59] reshape2_1.4.2 munsell_0.4.3
## [61] tools_3.4.3 visNetwork_2.0.1
## [63] downloader_0.4 DirichletMultinomial_1.20.0
## [65] RSQLite_2.0 evaluate_0.10.1
## [67] stringr_1.2.0 yaml_2.1.15
## [69] knitr_1.17 bit64_0.9-7
## [71] caTools_1.17.1 purrr_0.2.4
## [73] KEGGREST_1.18.0 poweRlaw_0.70.1
## [75] DO.db_2.9 biomaRt_2.34.0
## [77] compiler_3.4.3 rstudioapi_0.7
## [79] rgexf_0.15.3 png_0.1-7
## [81] tibble_1.3.4 stringi_1.1.6
## [83] highr_0.6 futile.logger_1.4.3
## [85] GenomicFiles_1.14.0 lattice_0.20-35
## [87] CNEr_1.14.0 Matrix_1.2-12
## [89] data.table_1.10.4-3 bitops_1.0-6
## [91] qvalue_2.10.0 R6_2.2.2
## [93] latticeExtra_0.6-28 hwriter_1.3.2
## [95] RMySQL_0.10.13 ShortRead_1.36.0
## [97] KernSmooth_2.23-15 gridExtra_2.3
## [99] lambda.r_1.2 boot_1.3-20
## [101] gtools_3.5.0 assertthat_0.2.0
## [103] seqLogo_1.44.0 SummarizedExperiment_1.8.0
## [105] rprojroot_1.2 GenomicAlignments_1.14.1
## [107] GenomeInfoDbData_0.99.1 hms_0.4.0
## [109] influenceR_0.1.0 VennDiagram_1.6.18
## [111] grid_3.4.3 tidyr_0.7.2
## [113] rmarkdown_1.8 rvcheck_0.0.9