1 Introduction
2 Setup
3 Searching Sequences
- 3.1 Finding PWM Matches
4 Analysis of Positional Bias
- 4.1 Testing for Positional Bias
- 4.2 Viewing Matches
5 Testing For Motif Enrichment
- 5.1 Defining a Set of Control Sequences
- 5.2 Testing For Enrichment
6 Working Larger Datasets
References
Appendix
Session info

1 Introduction

Bioinformatic analysis of data from ChIP-Seq and ATAC-Seq commonly involves the analysis of sequences within the regions identified is being of interest. Whilst these analyses are not restricted to transcription factors, this can often form an important component of this type of analysis. Analysis of Transcription Factor Binding Motifs (TFBMs) is often performed using Position Weight Matrices (PWMs) to encode the flexibility in which exact sequence is bound by the particular transcription factor, and is a computationally demanding task with many popular tools enabling analysis outside of R.

The tools within motifTestR aim to build on and expand the existing resources available to the Bioconductor community, performing all analyses inside the R environment, The package offers two complementary approaches to TFBM analysis within XStringSet objects containing multiple sequences. The function testMotifPos() identifies motifs showing positional bias within a set of sequences, whilst overall enrichment within a set of sequences is enabled by testMotifEnrich(). Additional functions aid in the visualisation and preparation of these two key approaches.

2 Setup

2.1 Installation

In order to perform the operations in this vignette, first install motifTestR.

if (!"BiocManager" %in% rownames(installed.packages()))
  install.packages("BiocManager")
BiocManager::install("motifTestR")

Once installed, we can load all required packages, set a default plotting theme and setup how many threads to use during the analysis.

library(motifTestR)
library(extraChIPs)
library(rtracklayer)
library(BSgenome.Hsapiens.UCSC.hg19)
library(parallel)
library(ggplot2)
library(patchwork)
theme_set(theme_bw())
cores <- 1

2.2 Defining a Set of Peaks

The peaks used in this workflow were obtained from the bed files denoting binding sites of the Androgen Receptor and Estrogen Receptor, which are also marked by H3K27ac, in ZR-75-1 cells under DHT treatment (Hickey et al. 2021). The object ar_er_peaks contains a subset of 229 peaks found within chromosome 1, with all peaks resized to 400bp

data("ar_er_peaks")
ar_er_peaks

## GRanges object with 229 ranges and 0 metadata columns:
##         seqnames              ranges strand
##            <Rle>           <IRanges>  <Rle>
##     [1]     chr1     5658040-5658439      *
##     [2]     chr1     6217913-6218312      *
##     [3]     chr1     6543164-6543563      *
##     [4]     chr1     8319267-8319666      *
##     [5]     chr1     8472552-8472951      *
##     ...      ...                 ...    ...
##   [225]     chr1 244490601-244491000      *
##   [226]     chr1 246158593-246158992      *
##   [227]     chr1 246771887-246772286      *
##   [228]     chr1 246868678-246869077      *
##   [229]     chr1 246873126-246873525      *
##   -------
##   seqinfo: 24 sequences from hg19 genome

sq <- seqinfo(ar_er_peaks)

Whilst the example dataset is small for the convenience of an R package, those wishing to work on the complete set of peaks (i.e. not just chromosome 1) may run the code provided in the final section to obtain all peaks. This will produce a greater number of significant results in subsequent analyses but will also increase running times for all functions.

2.3 Obtaining a Set of Sequences for Testing

Now that we have genomic co-ordinates as a set of peaks, we can obtain the sequences that are associated with each peak. The source ranges can optionally be added to the sequences as names by coercing the ranges to a character vector.

test_seq <- getSeq(BSgenome.Hsapiens.UCSC.hg19, ar_er_peaks)
names(test_seq) <- as.character(ar_er_peaks)

2.4 Obtaining a List of PWMs for Testing

A small list of PWMs, obtained from MotifDb are provided with the package and these will be suitable for all downstream analysis.

data("ex_pwm")
names(ex_pwm)

## [1] "ESR1"  "ANDR"  "FOXA1" "ZN143" "ZN281"

ex_pwm$ESR1

##       1     2     3     4     5     6     7     8     9    10    11    12    13
## A 0.638 0.074 0.046 0.094 0.002 0.856 0.108 0.396 0.182 0.104 0.054 0.618 0.040
## C 0.048 0.006 0.018 0.072 0.888 0.006 0.442 0.604 0.376 0.078 0.034 0.198 0.884
## G 0.260 0.808 0.908 0.178 0.048 0.112 0.312 0.000 0.286 0.044 0.908 0.070 0.014
## T 0.054 0.112 0.028 0.656 0.062 0.026 0.138 0.000 0.156 0.774 0.004 0.114 0.062
##      14    15
## A 0.090 0.058
## C 0.822 0.330
## G 0.008 0.066
## T 0.080 0.546

Again, a larger set of motifs may be obtained using or modifying the example code at the end of the vignette

3 Searching Sequences

3.1 Finding PWM Matches

All PWM matches within the test sequences can be returned for any of the PWMs, with getPwmMatches() searching using the PWM and it’s reverse complement by default. Matches are returned showing their position within the sequence, as well as the distance from the centre of the sequence and the matching section within the larger sequence. Whilst there is no strict requirement for sequences of the same width, generally this is good practice for this type of analysis.

getPwmMatches(ex_pwm$ESR1, test_seq)

## DataFrame with 62 rows and 8 columns
##                        seq     score direction     start       end from_centre
##                <character> <numeric>  <factor> <integer> <integer>   <numeric>
## 1     chr1:6543164-6543563     8.644         F       176       190         -17
## 2     chr1:6543164-6543563     9.326         R       176       190         -17
## 3   chr1:15556368-15556767     8.732         F       132       146         -61
## 4   chr1:16934836-16935235     8.748         R       122       136         -71
## 5   chr1:19753686-19754085     8.680         F       141       155         -52
## ...                    ...       ...       ...       ...       ...         ...
## 58  chr1:224638244-22463..     8.672         R       270       284          77
## 59  chr1:224680871-22468..     9.072         R       179       193         -14
## 60  chr1:232458390-23245..     9.128         F       261       275          68
## 61  chr1:235011318-23501..     8.694         R        89       103        -104
## 62  chr1:235099011-23509..     8.832         R        79        93        -114
##     seq_width           match
##     <integer>  <DNAStringSet>
## 1         400 GGGACAGGATGACCC
## 2         400 GGGTCATCCTGTCCC
## 3         400 GGGTAACCCTGACAT
## 4         400 GGGTCAGAGAGTCCT
## 5         400 AGTTCATCAAGACCT
## ...       ...             ...
## 58        400 AGGCCATCTTGACAC
## 59        400 AGGTTTCCCTGACCT
## 60        400 AGGCCAACATGACCA
## 61        400 GGGGCAACCTGAACT
## 62        400 CGGTTACCCTGACCG

Many sequences will contain multiple matches, and we can subset our results to only the ‘best match’ by setting best_only = TRUE. The best match is chosen by the highest score returned for each match. If multiple matches return identical scores, all tied matches are returned by default and will be equally down-weighted during positional analysis. This can be further controlled by setting break_ties to any of c(“random”, “first”, “last”, “central”), which will choose randomly, by sequence order or proximity to centre.

getPwmMatches(ex_pwm$ESR1, test_seq, best_only = TRUE)

## DataFrame with 45 rows and 8 columns
##                        seq     score direction     start       end from_centre
##                <character> <numeric>  <factor> <integer> <integer>   <numeric>
## 1     chr1:6543164-6543563     8.644         F       176       190         -17
## 2     chr1:6543164-6543563     9.326         R       176       190         -17
## 3   chr1:15556368-15556767     8.732         F       132       146         -61
## 4   chr1:16934836-16935235     8.748         R       122       136         -71
## 5   chr1:19753686-19754085     8.680         F       141       155         -52
## ...                    ...       ...       ...       ...       ...         ...
## 41  chr1:224499774-22450..     8.978         F       239       253          46
## 42  chr1:224638244-22463..     8.672         R       270       284          77
## 43  chr1:224680871-22468..     9.072         R       179       193         -14
## 44  chr1:235011318-23501..     8.694         R        89       103        -104
## 45  chr1:235099011-23509..     8.832         R        79        93        -114
##     seq_width           match
##     <integer>  <DNAStringSet>
## 1         400 GGGACAGGATGACCC
## 2         400 GGGTCATCCTGTCCC
## 3         400 GGGTAACCCTGACAT
## 4         400 GGGTCAGAGAGTCCT
## 5         400 AGTTCATCAAGACCT
## ...       ...             ...
## 41        400 AAGTCAACATGACCA
## 42        400 AGGCCATCTTGACAC
## 43        400 AGGTTTCCCTGACCT
## 44        400 GGGGCAACCTGAACT
## 45        400 CGGTTACCCTGACCG

We can return all matches for a complete list of PWMs, as a list of DataFrame objects. This strategy allows for visualisation of results as well as testing for positional bias.

bm_all <- getPwmMatches(
  ex_pwm, test_seq, best_only = TRUE, break_ties = "all",
  mc.cores = cores
)

This same strategy of passing a single, or multiple PWMs can be applied even when simply wishing to count the total matches for each PWM. Counting may be useful for restricting downstream analysis to the set of motifs with more than a given number of matches.

countPwmMatches(ex_pwm, test_seq, mc.cores = cores)

##  ESR1  ANDR FOXA1 ZN143 ZN281 
##    62    61   334     1    63

4 Analysis of Positional Bias

4.1 Testing for Positional Bias

A common tool within MEME-Suite is centrimo (Bailey and Machanick 2012) and motifTestR provides a simple, easily interpretable alternative using testMotifPos(). This function bins the distances from the centre of each sequence and, if no positional bias is expected (i.e. H₀), matches should be equally distributed between bins. Unlike centrimo, no assumption of centrality is made and any notable deviations from a discrete uniform distribution may be considered as significant.

A test within each bin is performed using binom.test() and a single, summarised p-value across all bins is returned using the asymptotically exact harmonic mean p-value (HMP) (Wilson 2019). By default, the binomial test is applied for the null hypothesis to detect matches in each bin which are greater than expected, however, this can also be set by the user. When using the harmonic-mean p-value however, this tends return a more conservative p-value across the entire set of bins.

res_pos <- testMotifPos(bm_all, mc.cores = cores)
head(res_pos)

##       start end centre width total_matches matches_in_region    expected
## ANDR   -195 195      0   390            53                38 21.25520833
## FOXA1  -185 185      0   370           179               108 87.20512821
## ZN281  -195 175    -10   370            36                34 18.51162791
## ESR1   -165 185     10   350            45                38 27.90697674
## ZN143   115 125    120    10             1                 1  0.02631579
##       enrichment prop_total         p       fdr consensus_motif
## ANDR    1.787797  0.7169811 0.7367342 0.9937435    4, 3, 2,....
## FOXA1   1.238459  0.6033520 0.8365378 0.9937435    6, 0, 0,....
## ZN281   1.836683  0.9444444 0.9291655 0.9937435    5, 0, 25....
## ESR1    1.361667  0.8444444 0.9741004 0.9937435    30, 2, 1....
## ZN143  38.000000  1.0000000 0.9937435 0.9937435    1, 0, 0,....

The bins returned by the function represent the widest range of bins where the raw p-values were below the HMP. Wide ranges tend to be associated with lower significance for a specific PWM.

Due to the two-stranded nature of DNA, the distance from zero cn also be assessed by setting abs = TRUE and in this case the first bin begins at zero.

res_abs <- testMotifPos(bm_all, abs = TRUE, mc.cores = cores) 
head(res_abs)

##       start end centre width total_matches matches_in_region     expected
## ESR1     10  40     25    30            45                15   6.99481865
## ANDR      0 190     95   190            53                28  13.80208333
## FOXA1     0 190     95   190           179               110 100.97435897
## ZN281    60 190    125   130            36                29  18.65284974
## ZN143   120 130    125    10             1                 1   0.05263158
##       enrichment prop_total         p       fdr consensus_motif
## ESR1    2.144444  0.3333333 0.2493238 0.7445537    30, 2, 1....
## ANDR    2.028679  0.5283019 0.2978215 0.7445537    4, 3, 2,....
## FOXA1   1.089385  0.6145251 0.8064541 0.9520175    6, 0, 0,....
## ZN281   1.554722  0.8055556 0.8349283 0.9520175    5, 0, 25....
## ZN143  19.000000  1.0000000 0.9520175 0.9520175    1, 0, 0,....

This approach is particularly helpful for detecting co-located transcription factors which can be any distance from the TF which was used to obtain and centre the test set of sequences.

4.2 Viewing Matches

The complete set of matches returned as a list above can be simply passed to ggplot2 for visualisation, in order to asses whether any PWM appears to have a positional bias. By default, smoothed values across all motifs will be overlaid (Figure 1A), however, tailoring using ggplot is simple to produce a wide variety of outputs (Figure 1B)

topMotifs <- rownames(res_pos)[1:4]
A <- plotMatchPos(bm_all[topMotifs], binwidth = 10, se = FALSE)
B <- plotMatchPos(bm_all[topMotifs], binwidth = 10, geom = "col") +
  geom_smooth(se = FALSE, show.legend = FALSE) +
  facet_wrap(~name)
A + B + plot_annotation(tag_levels = "A") & theme(legend.position = "bottom")

Figure 1: Distribution of motif matches around the centres of the set of peaks

Whilst the above will produce figures showing the symmetrical distribution around the peak centres, the distance from the peak centre can also be shown as an absolute distance. In Figure 2 distances shown as a heatmap (A) or as a CDF (B). The latter makes it easy to see that 50% of ESR1 matches occur within a short distance of the centre (~30bp), whilst for ANDR and FOXA1, this distance is roughly doubled. Changing the binwidth argument can either smooth data or increase the fine resolution.

topMotifs <- rownames(res_abs)[1:4]
A <- plotMatchPos(bm_all[topMotifs], abs = TRUE, type = "heatmap") +
  scale_fill_viridis_c()
B <- plotMatchPos(
  bm_all[topMotifs], abs = TRUE, type = "cdf", geom = "line", binwidth = 5
)
A + B + plot_annotation(tag_levels = "A") & theme(legend.position = "bottom")

Figure 2: Distribution of motif matches shown as a distance from the centre of each sequence

5 Testing For Motif Enrichment

As well as providing methods for analysing positional bias within a set of PWM matches, methods to test for enrichment are also implemented in motifTestR. A common approach when testing for motif enrichment is to obtain a set of random or background sequences which represent a suitable control set to define the null hypothesis (H₀). In motifTestR, two alternatives are offered utilising this approach, which both return similar results but involve different levels of computational effort.

The first approach is to sample multiple sets of background sequences and by ‘iterating’ through to obtain a null distribution for PWM matches and comparing our observed counts against this distribution. It has been noticed that this approach commonly produces a set of counts for H₀ which closely resemble a Poisson distribution, and a second approach offered in motifTestR is to sample a suitable large set of background sequences and estimate the parameters for the Poisson distribution for each PWM, and testing against these.

5.1 Defining a Set of Control Sequences

Choosing a suitable set of control sequences can be undertaken by any number of methods. motifTestR enables a strategy of matching sequences by any number of given features. The data object zr75_enh contains the candidate enhancers for ZR-75-1 cells defined by v2.0 of the Enhancer Atlas (Gao and Qian 2019), for chromosome 1 only. A high proportion of our peaks are associated with these regions and choosing control sequences drawn from the same proportion of these regions may be a viable strategy.

data("zr75_enh")
mean(overlapsAny(ar_er_peaks, zr75_enh))

## [1] 0.8122271

First we can annotate each peak by whether there is any overlap with an enhancer, or whether the peak belongs to any other region. Next we can define two sets of GenomicRanges, one representing the enhancers and the other being the remainder of the genome, here restricted to chromosome 1 for consistency. Control regions can be drawn from each with proportions that match the test set of sequences.

ar_er_peaks$feature <- ifelse(
  overlapsAny(ar_er_peaks, zr75_enh), "enhancer", "other"
)
chr1 <- GRanges(sq)[1]
bg_ranges <- GRangesList(
  enhancer = zr75_enh,
  other = GenomicRanges::setdiff(chr1, zr75_enh)
)

The provided object hg19_mask contains regions of the genome which are rich in Ns, such as centromeres and telomeres. Sequences containing Ns produce warning messages when matching PWMs and avoiding these regions may be wise, without introducing any sequence bias. These are then passed to makeRMRanges() as ranges to be excluded, whilst sampling multiple random, size-matched ranges corresponding to our test set of ranges with sequences being analysed, and drawn proportionally from matching genomic regions. Whilst our example only used candidate enhancers, any type and number of genomic regions can be used, with a limitless number of classification strategies being possible.

data("hg19_mask")
set.seed(305)
rm_ranges <- makeRMRanges(
  splitAsList(ar_er_peaks, ar_er_peaks$feature),
  bg_ranges, exclude = hg19_mask,
  n_iter = 100
)

This has now returned a set of control ranges which are randomly-selected (R) size-matched (M) to our peaks and are drawn from a similar distribution of genomic features. By setting n_iter = 100, this set will be 100 times larger than our test set and typically this value can be set to 1000 or even 5000 for better estimates of parameters under the null distribution. However, this will increase the computational burden for analysis.

If not choosing an iterative strategy, a total number of sampled ranges can also be specified. In this case the column iteration will not be added to the returned ranges.

In order to perform the analysis, we can now extract the genomic sequences corresponding to our randomly selected control ranges. Passing the mcols element ensure the iteration numbers are passed to the sequences, as these are required for this approach.

rm_seq <- getSeq(BSgenome.Hsapiens.UCSC.hg19, rm_ranges)
mcols(rm_seq) <- mcols(rm_ranges)

If choosing strategies for enrichment testing outside of motifTestR, these sequences can be exported as a fasta file using writeXStringSet from the Biostrings package.

5.2 Testing For Enrichment

Testing for overall motif enrichment is implemented using multiple strategies, using Poisson, QuasiPoisson or pure Iterative approaches. Whilst some PWMs may closely follow a Poisson distribution under H₀, others may be over-dispersed and more suited to a Quasi-Poisson approach. Each approach has unique advantages and weaknesses as summarised below:

Poisson
- BG Sequences can be of any size, unrelated to the test set
- Modelling is performed based on the expected number of matches per sequence
- The fastest approach
- Anti-conservative p-values where counts are over-dispersed
QuasiPoisson
- Modelling is performed per set of sequences (of identical size to the test set)
- Requires BG Sequences to be in ‘iterative’ blocks
- Fewer ‘iterative blocks’ can still model over-dispersion reasonably well
Iterative
- No model assumptions
- Requires BG Sequences to be in ‘iterative’ blocks
- P-Values derived from Z-scores (using the Central Limit Theorem)
- Sampled p-values from iterations can be used if preferred
- Requires the largest number of iterative blocks (>1000)
- Slowest, but most reliable approach

From a per iteration perspective there is little difference between the Iterative and the modelled QuasiPoisson approaches, however the modelled approaches can still return reliable results from a lower number of iterative blocks, lending a clear speed advantage. Z-scores returned are only used for statistical testing under the iterative approach and are for indicative purposes only under all other models model.

Whilst no guidelines have been developed for an optimal number of sequences, a control set which is orders of magnitude larger than the test set may be prudent. A larger set of control sequences clearly leads to longer analytic time-frames and larger computational resources, so this is left to what is considered appropriate by the researcher, nothing that here, we chose a control set which is 100x larger than our test sequences. If choosing an iterative approach and using the iteration-derived p-values, setting a number of iterations based on the resolution required for these values may be important, noting that the lowest possible p-value is 1/n_iterations.

enrich_res <- testMotifEnrich(
  ex_pwm, test_seq, rm_seq, model = "quasi", mc.cores = cores
)
head(enrich_res)

##       sequences matches expected enrichment          Z            p
## ESR1        229      62    14.00   4.428571 10.5949029 1.571145e-15
## FOXA1       229     334   205.50   1.625304  8.0584819 3.635468e-12
## ZN281       229      63    44.68   1.410027  1.7312522 9.025992e-02
## ANDR        229      61    55.27   1.103673  0.5888032 5.596488e-01
## ZN143       229       1     0.58   1.724138  0.4496872 6.606112e-01
##                fdr n_iter      sd_bg
## ESR1  7.855724e-15    100  4.5304804
## FOXA1 9.088670e-12    100 15.9459314
## ZN281 1.504332e-01    100 10.5819362
## ANDR  6.606112e-01    100  9.7316053
## ZN143 6.606112e-01    100  0.9339825

Setting the model to “iteration” instead uses a classical iterative approach to define the null distributions of counts and Z-scores are calculated from these values. The returned p-values from this test are taken from the Z-scores directly, with p-values derived from the sampled iterations also returned if preferred by the researcher. Whilst requiring greater computational effort, fewer statistical assumptions are made and results may be more conservative than under modelling approaches.

iter_res <- testMotifEnrich(
  ex_pwm, test_seq, rm_seq, mc.cores = cores, model = "iteration"
)
head(iter_res)

##       sequences matches expected enrichment          Z            p
## ESR1        229      62    14.00   4.428571 10.5949029 0.000000e+00
## FOXA1       229     334   205.50   1.625304  8.0584819 7.771561e-16
## ZN281       229      63    44.68   1.410027  1.7312522 8.340680e-02
## ANDR        229      61    55.27   1.103673  0.5888032 5.559933e-01
## ZN143       229       1     0.58   1.724138  0.4496872 6.529360e-01
##                fdr iter_p n_iter      sd_bg
## ESR1  0.000000e+00   0.01    100  4.5304804
## FOXA1 1.942890e-15   0.01    100 15.9459314
## ZN281 1.390113e-01   0.05    100 10.5819362
## ANDR  6.529360e-01   0.24    100  9.7316053
## ZN143 6.529360e-01   0.10    100  0.9339825

Once we have selected our motifs of interest, sequences with matches can be compared to easily assess co-occurrence, using plotOverlaps() from extraChIPs. In our test set, peaks were selected based on co-detection of ESR1 and ANDR, however the rate of co-occurrence is low, revealing key insights into the binding dynamics of these two TFs.

topMotifs <- rownames(enrich_res)[1:4]
ex_pwm[topMotifs] |>
  getPwmMatches(test_seq, mc.cores = cores) |>
  lapply(\(x) x$seq) |>
  plotOverlaps(type = "upset", min_size = 5)

Distribution of select PWM matches within sequences. Each sequence is only considered once and as such, match numbers may be below those returned during testing, which includes multiple matches within a sequence.

Figure 3: Distribution of select PWM matches within sequences
Each sequence is only considered once and as such, match numbers may be below those returned during testing, which includes multiple matches within a sequence.

6 Working Larger Datasets

Vignettes are commonly prepared for compiling with limited resources and as such example datasets and analyses may reveal less information than realistically sized data. Motif analysis is particularly well-known for taking many minutes when working with large datasets. For more comprehensive analysis and realistically sized data, the following code snippets will allow analysis of the above dataset, but without being restricted to chromosome 1.

To obtain the full set of peaks, simply run the following and use these peaks repeating the steps above.

## Not run
base_url <- "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3511nnn"
bed_url <- list(
  AR = file.path(
    base_url, "GSM3511083/suppl/GSM3511083%5FAR%5Fpeaks%5FED.bed.gz"
  ),
  ER = file.path(
    base_url, "GSM3511085/suppl/GSM3511085%5FER%5Fpeaks%5FED.bed.gz"
  ),
  H3K27ac = file.path(
    base_url, "GSM3511087/suppl/GSM3511087%5FH3K27ac%5Fpeaks%5FED.bed.gz"
  )
)
all_peaks <- GRangesList(lapply(bed_url, import.bed))
seqlevels(all_peaks) <- seqnames(sq)
seqinfo(all_peaks) <- sq
## Return the ranges with coverage from 2 or more targets
ar_er_peaks <- makeConsensus(
  all_peaks, p = 2/3, method = "coverage", min_width = 200
) |>
  ## Now subset to the ranges which overlap a peak from every target
  subset(n == 3) |> 
  resize(width = 400, fix = 'center')

The full set of PWMs for HOCOMOCOv11 (core-A) provided in MotifDb can be obtained using the following. Alternatively, query fields can be customised as preferred.

## Not run
library(MotifDb)
ex_pwm <- MotifDb |>
  subset(organism == "Hsapiens") |>
  query("HOCOMOCOv11-core-A") |>
  as.list() 
names(ex_pwm) <- gsub(".+HOCOMOCOv11-core-A-(.+)_.+", "\\1", names(ex_pwm))

Similarly, a set of candidate enhancers found on all chromosomes can be obtained here. If choosing this dataset, note that bg_ranges will need to be drawn from the entire genome, not just chromosome 1.

## Not run
zr75_url <- "http://www.enhanceratlas.org/data/download/enhancer/hs/ZR75-1.bed"
zr75_enh <- import.bed(zr75_url)
zr75_enh <- granges(zr75_enh)
seqlevels(zr75_enh) <- seqnames(sq)
seqinfo(zr75_enh) <- sq
mean(overlapsAny(ar_er_peaks, zr75_enh))

References

Bailey, Timothy L, and Philip Machanick. 2012. “Inferring Direct DNA Binding from ChIP-seq.” Nucleic Acids Res. 40 (17): e128.

Gao, Tianshun, and Jiang Qian. 2019. “EnhancerAtlas 2.0: an updated resource with enhancer annotation in 586 tissue/cell types across nine species.” Nucleic Acids Research 48 (D1): D58–D64. https://doi.org/10.1093/nar/gkz980.

Hickey, Theresa E, Luke A Selth, Kee Ming Chia, Geraldine Laven-Law, Heloisa H Milioli, Daniel Roden, Shalini Jindal, et al. 2021. “The Androgen Receptor Is a Tumor Suppressor in Estrogen Receptor-Positive Breast Cancer.” Nat. Med. 27 (2): 310–20.

Wilson, Daniel J. 2019. “The Harmonic Mean P-Value for Combining Dependent Tests.” Proc. Natl. Acad. Sci. U. S. A. 116 (4): 1195–1200.

Appendix

Session info

## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] patchwork_1.2.0                   BSgenome.Hsapiens.UCSC.hg19_1.4.3
##  [3] BSgenome_1.72.0                   BiocIO_1.14.0                    
##  [5] rtracklayer_1.64.0                extraChIPs_1.8.1                 
##  [7] tibble_3.2.1                      SummarizedExperiment_1.34.0      
##  [9] Biobase_2.64.0                    MatrixGenerics_1.16.0            
## [11] matrixStats_1.3.0                 ggside_0.3.1                     
## [13] BiocParallel_1.38.0               motifTestR_1.0.3                 
## [15] ggplot2_3.5.1                     GenomicRanges_1.56.0             
## [17] Biostrings_2.72.1                 GenomeInfoDb_1.40.1              
## [19] XVector_0.44.0                    IRanges_2.38.0                   
## [21] S4Vectors_0.42.0                  BiocGenerics_0.50.0              
## [23] BiocStyle_2.32.0                 
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.4.0              bitops_1.0-7              
##   [3] filelock_1.0.3             polyclip_1.10-6           
##   [5] XML_3.99-0.16.1            rpart_4.1.23              
##   [7] lifecycle_1.0.4            httr2_1.0.1               
##   [9] edgeR_4.2.0                lattice_0.22-6            
##  [11] ensembldb_2.28.0           MASS_7.3-60.2             
##  [13] backports_1.5.0            magrittr_2.0.3            
##  [15] limma_3.60.2               Hmisc_5.1-3               
##  [17] sass_0.4.9                 rmarkdown_2.27            
##  [19] jquerylib_0.1.4            yaml_2.3.8                
##  [21] metapod_1.12.0             Gviz_1.48.0               
##  [23] DBI_1.2.3                  RColorBrewer_1.1-3        
##  [25] harmonicmeanp_3.0.1        abind_1.4-5               
##  [27] zlibbioc_1.50.0            purrr_1.0.2               
##  [29] AnnotationFilter_1.28.0    biovizBase_1.52.0         
##  [31] RCurl_1.98-1.14            nnet_7.3-19               
##  [33] VariantAnnotation_1.50.0   tweenr_2.0.3              
##  [35] rappdirs_0.3.3             GenomeInfoDbData_1.2.12   
##  [37] ggrepel_0.9.5              codetools_0.2-20          
##  [39] DelayedArray_0.30.1        xml2_1.3.6                
##  [41] ggforce_0.4.2              tidyselect_1.2.1          
##  [43] futile.logger_1.4.3        UCSC.utils_1.0.0          
##  [45] farver_2.1.2               ComplexUpset_1.3.3        
##  [47] universalmotif_1.22.0      BiocFileCache_2.12.0      
##  [49] base64enc_0.1-3            GenomicAlignments_1.40.0  
##  [51] jsonlite_1.8.8             Formula_1.2-5             
##  [53] tools_4.4.0                progress_1.2.3            
##  [55] Rcpp_1.0.12                glue_1.7.0                
##  [57] gridExtra_2.3              SparseArray_1.4.8         
##  [59] mgcv_1.9-1                 xfun_0.44                 
##  [61] dplyr_1.1.4                withr_3.0.0               
##  [63] formatR_1.14               BiocManager_1.30.23       
##  [65] fastmap_1.2.0              latticeExtra_0.6-30       
##  [67] fansi_1.0.6                digest_0.6.35             
##  [69] R6_2.5.1                   colorspace_2.1-0          
##  [71] jpeg_0.1-10                dichromat_2.0-0.1         
##  [73] biomaRt_2.60.0             RSQLite_2.3.7             
##  [75] utf8_1.2.4                 tidyr_1.3.1               
##  [77] generics_0.1.3             data.table_1.15.4         
##  [79] prettyunits_1.2.0          InteractionSet_1.32.0     
##  [81] httr_1.4.7                 htmlwidgets_1.6.4         
##  [83] S4Arrays_1.4.1             pkgconfig_2.0.3           
##  [85] gtable_0.3.5               blob_1.2.4                
##  [87] htmltools_0.5.8.1          bookdown_0.39             
##  [89] ProtGenerics_1.36.0        scales_1.3.0              
##  [91] png_0.1-8                  knitr_1.47                
##  [93] lambda.r_1.2.4             rstudioapi_0.16.0         
##  [95] rjson_0.2.21               nlme_3.1-164              
##  [97] checkmate_2.3.1            curl_5.2.1                
##  [99] cachem_1.1.0               stringr_1.5.1             
## [101] foreign_0.8-86             AnnotationDbi_1.66.0      
## [103] restfulr_0.0.15            pillar_1.9.0              
## [105] grid_4.4.0                 vctrs_0.6.5               
## [107] dbplyr_2.5.0               cluster_2.1.6             
## [109] htmlTable_2.4.2            evaluate_0.23             
## [111] VennDiagram_1.7.3          GenomicFeatures_1.56.0    
## [113] cli_3.6.2                  locfit_1.5-9.9            
## [115] compiler_4.4.0             futile.options_1.0.1      
## [117] Rsamtools_2.20.0           rlang_1.1.4               
## [119] crayon_1.5.2               FMStable_0.1-4            
## [121] labeling_0.4.3             interp_1.1-6              
## [123] forcats_1.0.0              stringi_1.8.4             
## [125] viridisLite_0.4.2          deldir_2.0-4              
## [127] munsell_0.5.1              lazyeval_0.2.2            
## [129] csaw_1.38.0                Matrix_1.7-0              
## [131] hms_1.1.3                  bit64_4.0.5               
## [133] KEGGREST_1.44.0            statmod_1.5.0             
## [135] highr_0.11                 igraph_2.0.3              
## [137] broom_1.0.6                memoise_2.0.1             
## [139] bslib_0.7.0                bit_4.0.5                 
## [141] GenomicInteractions_1.38.0

Motif Analysis Using motifTestR

5 June 2024

Abstract

Package

Contents