CTCF defines an AnnotationHub resource representing genomic coordinates of FIMO-predicted CTCF binding sites for the JASPAR motif MA0139.1.

  • Human (hg19, hg38) and mouse (mm9, mm10) genomes.
  • The binding sites were detected with the FIMO tool of the MEME Suite, run with default settings.
  • Extra columns include motif name (MA0139.1), score, p-value, q-value, and the motif sequence.

0.1 Installation instructions

Get the latest stable R release from CRAN. Then install CTCF from Bioconductor using the following code:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("CTCF")

0.2 Example

suppressMessages(library(AnnotationHub))
ah <- AnnotationHub()
#> snapshotDate(): 2021-06-15
query_data <- query(ah, "CTCF")
query_data
#> AnnotationHub with 466 records
#> # snapshotDate(): 2021-06-15
#> # $dataprovider: UCSC, Haemcode, UCSC Jaspar, Pazar
#> # $species: Homo sapiens, Mus musculus, NA
#> # $rdataclass: GRanges, BigWigFile
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH22248"]]' 
#> 
#>             title                                             
#>   AH22248 | pazar_CTCF_Cui_20120522.csv                       
#>   AH22249 | pazar_CTCF_HEPG2_Schmidt_20120522.csv             
#>   AH22519 | wgEncodeAwgTfbsBroadDnd41CtcfUniPk.narrowPeak.gz  
#>   AH22521 | wgEncodeAwgTfbsBroadGm12878CtcfUniPk.narrowPeak.gz
#>   AH22524 | wgEncodeAwgTfbsBroadH1hescCtcfUniPk.narrowPeak.gz 
#>   ...       ...                                               
#>   AH28453 | CTCF_GSM918744_Immortalized_Erythroid.csv         
#>   AH95565 | CTCF_hg19.RData                                   
#>   AH95566 | CTCF_hg38.RData                                   
#>   AH95567 | CTCF_mm9.RData                                    
#>   AH95568 | CTCF_mm10.RData

The FIMO-predicted CTCF sites are named “CTCF_<genome>”, e.g., “CTCF_hg38”. Use query_data <- query(ah, "CTCF_hg38") for a more targeted search.

We can check the metadata of an individual record.

query_data["AH95566"]
#> AnnotationHub with 1 record
#> # snapshotDate(): 2021-06-15
#> # names(): AH95566
#> # $dataprovider: UCSC Jaspar
#> # $species: Homo sapiens
#> # $rdataclass: GRanges
#> # $rdatadateadded: 2021-05-18
#> # $title: CTCF_hg38.RData
#> # $description: hg38 genomic coordinates of CTCF binding motif MA0139.1, det...
#> # $taxonomyid: 9606
#> # $genome: hg38
#> # $sourcetype: RData
#> # $sourceurl: https://drive.google.com/drive/folders/19ZXr7IETfks0OdYlmuc1Hq...
#> # $sourcesize: NA
#> # $tags: c("FunctionalAnnotation", "GenomicSequence", "hg38") 
#> # retrieve record with 'object[["AH95566"]]'

Then retrieve the object itself.

CTCF_hg38 <- query_data[["AH95566"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
CTCF_hg38
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> GRanges object with 56049 ranges and 5 metadata columns:
#>           seqnames            ranges strand |       motif     score   p.value
#>              <Rle>         <IRanges>  <Rle> | <character> <numeric> <numeric>
#>       [1]     chr1       11223-11241      - |    MA0139.1   24.4754  1.34e-09
#>       [2]     chr1       11281-11299      - |    MA0139.1   22.7377  1.01e-08
#>       [3]     chr1       24782-24800      - |    MA0139.1   17.3770  7.11e-07
#>       [4]     chr1       91420-91438      + |    MA0139.1   16.2951  1.41e-06
#>       [5]     chr1     104985-105003      - |    MA0139.1   16.7869  1.04e-06
#>       ...      ...               ...    ... .         ...       ...       ...
#>   [56045]     chrY 57044316-57044334      - |    MA0139.1   16.4590  1.27e-06
#>   [56046]     chrY 57189659-57189677      + |    MA0139.1   15.7541  1.95e-06
#>   [56047]     chrY 57203409-57203427      - |    MA0139.1   15.6393  2.09e-06
#>   [56048]     chrY 57215279-57215297      + |    MA0139.1   19.5738  1.53e-07
#>   [56049]     chrY 57215337-57215355      + |    MA0139.1   24.4754  1.34e-09
#>             q.value            sequence
#>           <numeric>         <character>
#>       [1]    0.0216 TCGCCAGCAGGGGGCGCCC
#>       [2]    0.0398 GCGCCAGCAGGGGGCGCTG
#>       [3]    0.2350 CGTCCAGCAGATGGCGGAT
#>       [4]    0.3080 GTGGCACCAGGTGGCAGCA
#>       [5]    0.2750 CCAACAGCAGGTGGCAGCC
#>       ...       ...                 ...
#>   [56045]    0.2990 TGGTCACCTGGGGGCACTA
#>   [56046]    0.3430 TGTCCTCTAGGGGTCAGCC
#>   [56047]    0.3510 CTGCCGCAAGGGGGCGCAT
#>   [56048]    0.1190 gcgccacgagggggcggtg
#>   [56049]    0.0216 tcgccagcagggggcgccc
#>   -------
#>   seqinfo: 24 sequences from hg38 genome
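The metadata columns of the retrieved GRanges can be read and modified like ordinary vectors. The following is a minimal sketch on a toy object built from two rows of the printout above, so it runs without downloading anything; the name `toy` is hypothetical. It checks that each site spans 19 bp and puts the `sequence` column into one case. That the lower-case sequences come from soft-masked regions of the reference genome is an assumption, not something stated by the package.

```r
library(GenomicRanges)

# Toy object: rows [1] and [56049] copied from the printout above
# (illustration only, not the full resource)
toy <- GRanges(
    seqnames = c("chr1", "chrY"),
    ranges = IRanges(start = c(11223, 57215337), end = c(11241, 57215355)),
    strand = c("-", "+"),
    motif = rep("MA0139.1", 2),
    score = c(24.4754, 24.4754),
    sequence = c("TCGCCAGCAGGGGGCGCCC", "tcgccagcagggggcgccc")
)

# Width of each predicted site in base pairs
width(toy)

# Metadata columns are accessed with `$`; here the case is normalized
toy$sequence <- toupper(toy$sequence)
unique(toy$sequence)
```

The same calls should apply unchanged to the full CTCF_hg38 object.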

Note that the default q-value cutoff is 0.5. Depending on the q-value distribution, one may decide to use a more stringent cutoff. For example, keeping only sites with a q-value below 0.3 removes more than half of the predicted sites. The remaining sites may be considered high-confidence CTCF sites.

# Check length before filtering
length(CTCF_hg38)
#> [1] 56049
# Filter and check length after filtering
CTCF_hg38 <- CTCF_hg38[CTCF_hg38$q.value < 0.3]
length(CTCF_hg38)
#> [1] 25474
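Before settling on a cutoff, it can help to tabulate how many sites survive at several thresholds. The sketch below does this on a toy GRanges built from the first five rows of the printout above; the cutoff values and the names `toy`, `cutoffs`, and `n_kept` are illustrative assumptions, not recommendations from the package.

```r
library(GenomicRanges)

# Toy object: first five hg38 sites from the printout above (illustration only)
toy <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(
        start = c(11223, 11281, 24782, 91420, 104985),
        end   = c(11241, 11299, 24800, 91438, 105003)
    ),
    strand = c("-", "-", "-", "+", "-"),
    q.value = c(0.0216, 0.0398, 0.2350, 0.3080, 0.2750)
)

# Candidate cutoffs (hypothetical choices)
cutoffs <- c(0.05, 0.1, 0.3, 0.5)

# Number of sites with a q-value strictly below each cutoff
n_kept <- vapply(cutoffs, function(x) sum(toy$q.value < x), integer(1))
names(n_kept) <- cutoffs
n_kept
```

On the full object, the same `vapply()` call can be run on the unfiltered q.value column, and `hist()` on that column gives a view of the whole distribution.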

0.3 CTCF GRanges for other genome assemblies

# hg19 CTCF coordinates
CTCF_hg19 <- query_data[["AH95565"]]
# mm9 CTCF coordinates
CTCF_mm9 <- query_data[["AH95567"]]
# mm10 CTCF coordinates
CTCF_mm10 <- query_data[["AH95568"]]
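When the genome is chosen programmatically, the record identifiers listed above can be kept in a named vector and looked up by assembly name. This is a small sketch only: `ctcf_ids` and `get_ctcf_id()` are hypothetical helpers, not part of the CTCF package, and the identifiers are those shown in the query output above.

```r
# AnnotationHub record IDs from the query output, keyed by genome assembly
ctcf_ids <- c(
    hg19 = "AH95565",
    hg38 = "AH95566",
    mm9  = "AH95567",
    mm10 = "AH95568"
)

# Hypothetical helper: return the record ID for a supported assembly,
# stopping with an error for anything else
get_ctcf_id <- function(genome) {
    genome <- match.arg(genome, names(ctcf_ids))
    ctcf_ids[[genome]]
}

get_ctcf_id("mm10")
```

The returned identifier can then be passed to the hub, e.g. query_data[[get_ctcf_id("mm10")]].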

See ../inst/scripts/make-data.R for how the CTCF GRanges objects were created.

0.4 Citation

Below is the output of citation('CTCF') in R. Please run it yourself to check for any updates on how to cite CTCF.

print(citation("CTCF"), bibtex = TRUE)
#> 
#> Dozmorov MG, Davis E, Mu W, Lee S, Triche T, Phanstiel D, Love M
#> (2021). _CTCF_. https://github.com/mdozmorov/CTCF/CTCF - R package
#> version 0.99.4, <URL: https://github.com/mdozmorov/CTCF>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {CTCF},
#>     author = {Mikhail G. Dozmorov and Eric Davis and Wancen Mu and Stuart Lee and Tim Triche and Douglas Phanstiel and Michael Love},
#>     year = {2021},
#>     url = {https://github.com/mdozmorov/CTCF},
#>     note = {https://github.com/mdozmorov/CTCF/CTCF - R package version 0.99.4},
#>   }

Date the vignette was generated.

#> [1] "2021-06-17 09:09:59 EDT"

R session information.

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value                                      
#>  version  R version 4.1.0 Patched (2021-05-24 r80367)
#>  os       macOS High Sierra 10.13.6                  
#>  system   x86_64, darwin17.7.0                       
#>  ui       unknown                                    
#>  language (EN)                                       
#>  collate  C                                          
#>  ctype    en_US.UTF-8                                
#>  tz       America/New_York                           
#>  date     2021-06-17                                 
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package                * version  date       lib source        
#>  AnnotationDbi            1.55.1   2021-06-07 [2] Bioconductor  
#>  AnnotationHub          * 3.1.0    2021-05-20 [2] Bioconductor  
#>  assertthat               0.2.1    2019-03-21 [2] CRAN (R 4.1.0)
#>  Biobase                  2.53.0   2021-05-19 [2] Bioconductor  
#>  BiocFileCache          * 2.1.0    2021-05-19 [2] Bioconductor  
#>  BiocGenerics           * 0.39.1   2021-06-08 [2] Bioconductor  
#>  BiocManager              1.30.16  2021-06-15 [3] CRAN (R 4.1.0)
#>  BiocStyle              * 2.21.2   2021-06-07 [2] Bioconductor  
#>  BiocVersion              3.14.0   2021-05-19 [2] Bioconductor  
#>  Biostrings               2.61.1   2021-06-04 [2] Bioconductor  
#>  bit                      4.0.4    2020-08-04 [2] CRAN (R 4.1.0)
#>  bit64                    4.0.5    2020-08-30 [2] CRAN (R 4.1.0)
#>  bitops                   1.0-7    2021-04-24 [2] CRAN (R 4.1.0)
#>  blob                     1.2.1    2020-01-20 [2] CRAN (R 4.1.0)
#>  bookdown                 0.22     2021-04-22 [2] CRAN (R 4.1.0)
#>  bslib                    0.2.5.1  2021-05-18 [2] CRAN (R 4.1.0)
#>  cachem                   1.0.5    2021-05-15 [2] CRAN (R 4.1.0)
#>  cli                      2.5.0    2021-04-26 [2] CRAN (R 4.1.0)
#>  crayon                   1.4.1    2021-02-08 [2] CRAN (R 4.1.0)
#>  CTCF                     0.99.4   2021-06-17 [1] Bioconductor  
#>  curl                     4.3.1    2021-04-30 [2] CRAN (R 4.1.0)
#>  DBI                      1.1.1    2021-01-15 [2] CRAN (R 4.1.0)
#>  dbplyr                 * 2.1.1    2021-04-06 [2] CRAN (R 4.1.0)
#>  digest                   0.6.27   2020-10-24 [2] CRAN (R 4.1.0)
#>  dplyr                    1.0.6    2021-05-05 [2] CRAN (R 4.1.0)
#>  ellipsis                 0.3.2    2021-04-29 [2] CRAN (R 4.1.0)
#>  evaluate                 0.14     2019-05-28 [2] CRAN (R 4.1.0)
#>  fansi                    0.5.0    2021-05-25 [2] CRAN (R 4.1.0)
#>  fastmap                  1.1.0    2021-01-25 [2] CRAN (R 4.1.0)
#>  filelock                 1.0.2    2018-10-05 [2] CRAN (R 4.1.0)
#>  generics                 0.1.0    2020-10-31 [2] CRAN (R 4.1.0)
#>  GenomeInfoDb           * 1.29.0   2021-05-19 [2] Bioconductor  
#>  GenomeInfoDbData         1.2.6    2021-05-24 [2] Bioconductor  
#>  GenomicRanges          * 1.45.0   2021-05-19 [2] Bioconductor  
#>  glue                     1.4.2    2020-08-27 [2] CRAN (R 4.1.0)
#>  highr                    0.9      2021-04-16 [2] CRAN (R 4.1.0)
#>  htmltools                0.5.1.1  2021-01-22 [2] CRAN (R 4.1.0)
#>  httpuv                   1.6.1    2021-05-07 [2] CRAN (R 4.1.0)
#>  httr                     1.4.2    2020-07-20 [2] CRAN (R 4.1.0)
#>  interactiveDisplayBase   1.31.0   2021-05-19 [2] Bioconductor  
#>  IRanges                * 2.27.0   2021-05-19 [2] Bioconductor  
#>  jquerylib                0.1.4    2021-04-26 [2] CRAN (R 4.1.0)
#>  jsonlite                 1.7.2    2020-12-09 [2] CRAN (R 4.1.0)
#>  KEGGREST                 1.33.0   2021-05-19 [2] Bioconductor  
#>  knitr                    1.33     2021-04-24 [2] CRAN (R 4.1.0)
#>  later                    1.2.0    2021-04-23 [2] CRAN (R 4.1.0)
#>  lifecycle                1.0.0    2021-02-15 [2] CRAN (R 4.1.0)
#>  magrittr                 2.0.1    2020-11-17 [2] CRAN (R 4.1.0)
#>  memoise                  2.0.0    2021-01-26 [2] CRAN (R 4.1.0)
#>  mime                     0.10     2021-02-13 [2] CRAN (R 4.1.0)
#>  pillar                   1.6.1    2021-05-16 [2] CRAN (R 4.1.0)
#>  pkgconfig                2.0.3    2019-09-22 [2] CRAN (R 4.1.0)
#>  png                      0.1-7    2013-12-03 [2] CRAN (R 4.1.0)
#>  promises                 1.2.0.1  2021-02-11 [2] CRAN (R 4.1.0)
#>  purrr                    0.3.4    2020-04-17 [2] CRAN (R 4.1.0)
#>  R6                       2.5.0    2020-10-28 [2] CRAN (R 4.1.0)
#>  rappdirs                 0.3.3    2021-01-31 [2] CRAN (R 4.1.0)
#>  Rcpp                     1.0.6    2021-01-15 [2] CRAN (R 4.1.0)
#>  RCurl                    1.98-1.3 2021-03-16 [2] CRAN (R 4.1.0)
#>  rlang                    0.4.11   2021-04-30 [2] CRAN (R 4.1.0)
#>  rmarkdown                2.9      2021-06-15 [2] CRAN (R 4.1.0)
#>  RSQLite                  2.2.7    2021-04-22 [2] CRAN (R 4.1.0)
#>  S4Vectors              * 0.31.0   2021-05-19 [2] Bioconductor  
#>  sass                     0.4.0    2021-05-12 [2] CRAN (R 4.1.0)
#>  sessioninfo            * 1.1.1    2018-11-05 [2] CRAN (R 4.1.0)
#>  shiny                    1.6.0    2021-01-25 [2] CRAN (R 4.1.0)
#>  stringi                  1.6.2    2021-05-17 [2] CRAN (R 4.1.0)
#>  stringr                  1.4.0    2019-02-10 [2] CRAN (R 4.1.0)
#>  tibble                   3.1.2    2021-05-16 [2] CRAN (R 4.1.0)
#>  tidyselect               1.1.1    2021-04-30 [2] CRAN (R 4.1.0)
#>  utf8                     1.2.1    2021-03-12 [2] CRAN (R 4.1.0)
#>  vctrs                    0.3.8    2021-04-29 [2] CRAN (R 4.1.0)
#>  withr                    2.4.2    2021-04-18 [2] CRAN (R 4.1.0)
#>  xfun                     0.24     2021-06-15 [2] CRAN (R 4.1.0)
#>  xtable                   1.8-4    2019-04-21 [2] CRAN (R 4.1.0)
#>  XVector                  0.33.0   2021-05-19 [2] Bioconductor  
#>  yaml                     2.2.1    2020-02-01 [2] CRAN (R 4.1.0)
#>  zlibbioc                 1.39.0   2021-05-19 [2] Bioconductor  
#> 
#> [1] /private/var/folders/sk/hy088prx12l_cqspv3lbpl9s6_3r11/T/RtmpUmbPA2/Rinst16f5420f1d20
#> [2] /Users/ka36530_ca/R-stuff/bin/R-4-1/4.1-Bioc-3.14/library
#> [3] /Users/ka36530_ca/R-stuff/bin/R-4-1/library