seqArchRplus 1.2.0
Analysis of the promoterome of any organism entails analyzing data from an experiment like CAGE (Cap analysis of gene expression) (Kodzius et al. 2006) that provides information on genome-wide transcription start sites at single nucleotide resolution. These promoters can then be further studied to identify different classes (Carninci et al. 2006) among them based on different attributes, for instance, their shape (broad or focused/sharp promoters), gene function (tissue specific vs. housekeeping) etc. These different promoter classes harbour a variety of promoter architectures orchestrated by different proteins together with sequence elements at near-fixed positions in the promoter that determine the position of the transcription start site. The different promoter architectures are known to be used differentially by genes either in different tissues or at different developmental timepoints. For instance, (Haberle et al. 2014) have shown a dynamic interchange of promoter architecture within the same genomic region between maternal and zygotic stages of zebrafish development. Identifying and studying these different promoter architectures further is thus vital.
While the R/Bioconductor package seqArchR enables de novo identification of clusters of promoter sequence architectures (Nikumbh and Lenhard 2023), this package, seqArchRplus, enables performing various steps in their downstream bioinformatic analyses and produce publication-ready plots (building on various other Bioconductor packages). The many steps in the downstream analyses of promoter sequence architectures enabled by seqArchRplus are:
Curate the final set of clusters from seqArchR. See the seqArchR vignette , or the bioRxiv preprint to understand in detail why this may be required
Order the sequence architectures by the interquantile widths (IQWs) of the
tag clusters (aka promoter shape).
See
CAGEr
vignette
for more information on tag clusters and their IQWs
Visualize distributions of IQW, TPM (Tags per million) and conservation scores (or other when available) per cluster
Visualize the percentages of annotations for genome-wide CAGE-derived transcription tag clusters for each architecture-based clusters
Ease of comparison across samples/stages: Visualize the above plots as (publication ready) combined panels
Many per cluster visualizations including:
Produce BED track files of seqArchR clusters for visualization in a genome browser or IGV
Examples of most of these capabilities are provided in this vignette.
We hope that using seqArchRplus (together with seqArchR) will be useful in swiftly, but comprehensively, analyzing promoters identified using CAGE for any organism, and enable insights and hypotheses generation with ease. So far, we have tested seqArchR and seqArchRplus on promoters in drosophila from (Schor et al. 2017) and modENCODE (Chen et al. 2014), zebrafish (Nepal et al. 2013), mice (Fantom (The FANTOM Consortium 2014)), humans (ENCODE), and also in plants (barley and maize).
We already have a plan for a feature in the future version of seqArchRplus: generate HTML reports that help you navigate this wealth of information with relative ease.
The latest stable version of seqArchjRplus can be installed from Bioconductor as shown below.
## When available on Bioconductor
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("seqArchRplus")
In case of any errors or help required, please consider looking up: https://github.com/snikumbh/seqArchRplus and file a new issue.
# Load seqArchRplus
library(seqArchRplus)
library(seqArchR)
library(Biostrings)
library(TxDb.Dmelanogaster.UCSC.dm6.ensGene)
library(org.Dm.eg.db)
library(ChIPseeker)
# Set seed for reproducibility
set.seed(1234)
To begin using seqArchRplus, you require a set of promoters sequences (as a DNAStringSet object) and their clustering information (as a simple list storing the sequence IDs belonging to each cluster). With this, you can already visualize the cluster-wise sequence logos, distribution of chromosome and strand locations, GO term-enrichments, motif occurrence heatmaps. When you have additional information such as the IQWs (shape information), seqArchRplus can use these to order the clusters in the visualizations. This visualization can be supplemented with the distribution of per-cluster TPM values and/or conservation scores etc. when provided.
If your workflow involves CAGEr (for pre-processing raw CAGE data) and seqArchR (for clusters), they can be seamlessly used to utilize the full scope of seqArchRplus. In this case, it will be helpful if you have
the CAGEr object or information on the tag clusters from https://bioconductor.org/packages/release/bioc/html/CAGEr.html, specifically, the width of the tag clusters, total TPM value of the cluster and that of the dominant CTSS in the cluster
the seqArchR result object.
In this vignette, we use the example data provided with the package. The raw CAGE data for different developmental timepoints in Drosophila melanogaster (Schor et al. 2017) were pre-processed with CAGEr to obtain promoter sequences. These were then processed with seqArchR to cluster the promoter DNA sequences. Only a subset of the complete data is provided with the package and used here to enable demonstration of package utility. In particular, the data included is for the timepoint 10-12h after egg laying.
Specifically, the following files are provided:
example_promoters45.fa.gz
and
example_promoters200.fa.gz
)example_clust_info.rds
)example_tc_gr.rds
)example_info_df.rds
)seqArchR_result.rds
)## 1. Raw DNA sequences
raw_seqs <- Biostrings::readDNAStringSet(
filepath = system.file("extdata",
"example_promoters45.fa.gz",
package = "seqArchRplus",
mustWork = TRUE
)
)
## 2. Clustering information (arbitrary order/unordered)
unord_clusts <- readRDS(system.file("extdata", "example_clust_info.rds",
package = "seqArchRplus", mustWork = TRUE
))
## 3. GRanges object
tc_gr <- readRDS(system.file("extdata", "example_tc_gr.rds",
package = "seqArchRplus", mustWork = TRUE
))
## 4. BED file
bed_info_fname <- system.file("extdata", "example_info_df.bed.gz",
package = "seqArchRplus", mustWork = TRUE
)
info_df <- read.delim(file = bed_info_fname, sep = "\t", header = TRUE)
## 5. seqArchR result
seqArchR_result <- readRDS(system.file("extdata", "seqArchR_result.rds",
package = "seqArchRplus", mustWork = TRUE))
## **NOTE** Only for the example seqArchR result object provided with this
## package:
##
## Any seqArchR result object has the raw DNA sequences as a part of it.
## See details here: https://snikumbh.github.io/seqArchR/reference/seqArchR.html
##
## But, in order to reduce the size of the example seqArchR result object
## provided with this (seqArchRplus) package, the `rawSeqs` element of the
## result is missing. This is reassigned below for the purposes of this
## vignette.
##
## The seqArchR result was obtained by processing the DNA sequences in
## `example_promoters45.fa.gz`. Thus we reassign them to
## `seqArchR_result$rawSeqs` here
##
if(!("rawSeqs" %in% names(seqArchR_result)))
seqArchR_result$rawSeqs <- raw_seqs
If you are not using the result from seqArchR or are already using a curated set of clusters, you can skip the following subsection and jump to section 3.2 for a demonstration of generating all plots.
seqArchR
resultRaw clusters from seqArchR result (say, the final iteration) often require curation.
The seqArchR result does have a final clustSol
(the clustering
solution) where the clusters from the final iteration are collated.
seqArchR uses hierarchical clustering for this purpose.
However, hierarchical clustering, with a chosen agglomeration and distance
method, does not necessarily output the best suitable collated set of clusters.
Therefore, some small amount of manual curation (re-assignments) may be
required to reach the ideal collated set of clusters as the final solution.
This is achieved with the help of the seqArchRplus::curate_clusters
function.
The basic idea of the curate_clusters function is available in
help(seqArchRplus::curate_clusters)
.
A more elaborate description is as follows.
It takes as input an agglomeration method and also a distance method (see more
arguments in ).
On the first call to the function, a plot with the associated dendrogram
resulting from the hierarchical clustering is shown together with the per
cluster sequence logos.
This should help in identifying if the chosen agglomeration and distance
methods worked well (i.e., are clusters with similar sequence logos also
together in the dendrogram?).
If this is satisfactory (i.e., chosen agglomeration and distance method have
done well already and only few curations will be required), count the number of
overall clusters that you can visually see.
Generally, if (estimated) 3-4 (even 5-6) clusters out of 30-40 require
curation, I would consider it as a satisfactory outcome from the hierarchical
clustering output.
Now, in the second call to curate_clusters
, specify the number of clusters
based on your count.
View the resulting clustering in the new plot produced as output.
This time the dendrogram shows this clustering with colors and grey-scaled
boxes drawn around the clusters.
Now you can exactly note which clusters need curation.
Imagine a hypothetical scenario where a total of 16 raw clusters are obtained from the final seqArchR iteration. Collating them into 4 clusters using hierarchical clustering results in:
But, due to very minor differences in the sequence logos, you may want to move
raw cluster 9 from collated cluster 1 to collated cluster 3 (that of raw
clusters 11-13).
To do this, set the need_change
argument of curate_clusters
to c(9)
and
change_to
argument to c(11)
.
Any one of the destination cluster members can be specified in change_to
.
Similarly, any other such curations can be added.
Specify all such curations together in one go like this:
need_change <- list(c(9), c(4, 7), c(16))
andchange_to <- list(11, 14, 0)
Here, (a) raw cluster 9 is moved to the collated cluster containing raw cluster 11; (b) 4 and 7 are moved to that of 14; and (c) 16 is moved to a totally new cluster of itself. The result is the re-assigned clusters.
Below, this procedure of curation is re-iterated with the help of actual function calls and resulting figures using a reduced example data available with the package.
sn <- "RAL28_10_to_12h"
use_aggl <- "complete"
use_dist <- "euclid"
## get seqArchR clusters custom curated
seqArchR_curated_clusts <- curate_clusters(sname = sn,
use_aggl = use_aggl, use_dist = use_dist,
seqArchR_result = seqArchR_result, iter = 5,
pos_lab = NULL, regularize = TRUE, topn = 50,
use_cutk = 2, final = FALSE, dir_path = NULL
)
#>
#> ── seqArchR result clusters curation ───────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
seqArchR_curated_clusts$curation_plot
The motive of Figure 1 is to look at the clusters ordered by the hierarchical clustering step. This way, counting the tentative number of clusters (K) is easier.
In Figure 1, we can see that the first and the second clusters can be collated together. Therefore, to demonstrate curation, we will set K=5 clusters and perform this minor curation as shown below.
## Let us set K=5 for this example, and combine clusters 1 and 2 into one.
set_cutk <- 5
## Form the lists `need_change` and `change_to` for re-assignments
need_change <- list(c(2))
change_to <- list(c(1))
seqArchR_curated_clusts <- curate_clusters(sname = sn,
use_aggl = use_aggl, use_dist = use_dist,
seqArchR_result = seqArchR_result, iter = 5,
regularize = TRUE, topn = 50,
use_cutk = set_cutk, #***
need_change = need_change,
change_to = change_to,
final = FALSE, #***
dir_path = NULL
)
#>
#> ── seqArchR result clusters curation ───────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
seqArchR_curated_clusts$curation_plot
Notice that each dendrogram branch (and leaf node) has been assigned a different
color because we set K=5.
We are now ready to make the final call to curate_clusters
and obtain the
curated clusters’ list.
## Satisfied with the re-assignments? now set final = TRUE
seqArchR_curated_clusts <- curate_clusters(sname = sn,
use_aggl = use_aggl, use_dist = use_dist,
seqArchR_result = seqArchR_result, iter = 5,
pos_lab = NULL, regularize = FALSE, topn = 50,
use_cutk = set_cutk, #***
need_change = need_change,
change_to = change_to,
final = TRUE, #***
dir_path = NULL
)
#>
#> ── seqArchR result clusters curation ───────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
seqArchR_curated_clusts$curation_plot
Notice that the re-assigned clusters (1 and 2) now have different colors than the corresponding dendrogram branches.
str(seqArchR_curated_clusts$clusters_list)
#> List of 4
#> $ : int [1:244] 1 2 3 4 5 6 7 8 9 10 ...
#> $ : int [1:111] 245 246 247 248 249 250 251 252 253 254 ...
#> $ : int [1:93] 356 357 358 359 360 361 362 363 364 365 ...
#> $ : int [1:142] 449 450 451 452 453 454 455 456 457 458 ...
This fetches us clusters in arbitrary order, i.e., not related to the IQWs
of the clusters.
[See below for the function call order_clusters_iqw()
that orders these
clusters by their median/mean interquantile widths (section 3.2.2).]
This curated set of clusters can now be used for further downstream analyses.
The following subsections demonstrate how the seqArchRplus functions can be used to generate the various visualizations.
One can visualize the architecture for each cluster of promoters as a sequence
logo using the function per_cluster_seqlogos
.
When the argument one_plot
is set to TRUE, the function returns a single plot
where sequence logos for all clusters are arranged one below the other.
When set to FALSE, a list of sequence logo plots is returned instead.
In the later case, each plot has a title with information on the number of
sequences in the cluster.
##
seqlogos_oneplot_pl <- per_cluster_seqlogos(
sname = "RAL28_10_to_12h",
seqs = raw_seqs,
clusts = unord_clusts,
pos_lab = -45:45, bits_yax = "auto",
strand_sep = FALSE, one_plot = TRUE,
txt_size = 12, dir_path = NULL
)
#>
#> ── All clusters' sequence logos ────────────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
#> Generating architectures for clusters of sequences...
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
seqlogos_oneplot_pl
The sequence logos can also be obtained as a list instead of already combined
into one plot by simply setting the argument one_plot = FALSE
.
This gives you liberty to use them (either individually or otherwise)
together with any other plots as is suitable.
## Obtain the sequence logos as a list for combining later by setting the
## 'one_plot' argument to FALSE
seqlogos_list_pl <- per_cluster_seqlogos(
sname = "RAL28_10_to_12h",
seqs = raw_seqs,
clusts = unord_clusts,
pos_lab = -45:45, bits_yax = "auto",
strand_sep = FALSE, one_plot = FALSE,
dir_path = NULL,
txt_size = 12
)
Obtaining strand-separated sequence logos is demonstrated in section 3.2.7.
The function iqw_tpm_plots
orders the clusters by their IQW values
(when quantitative information on promoter shape is available, for instance, the
interquantile widths of the CAGEr tag clusters) and visualizes the distribution
of the IQW and TPM values for each cluster as boxplots.
The input clusters are ordered from sharp (top) to broad (bottom).
##
iqw_tpm_pl <- iqw_tpm_plots(sname = "RAL28_10_to_12h",
dir_path = NULL,
info_df = info_df,
iqw = TRUE, tpm = TRUE, cons = FALSE,
clusts = unord_clusts,
txt_size = 15
)
#>
#> ── IQW-ordered boxplots ────────────────────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
#> ── Plotting: _IQW _TPM plots ──
#>
#> ── Order clusters by IQW ───────────────────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
#> ── Order clusters by IQW ───────────────────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
iqw_tpm_pl
The cluster order visualized in the IQW_TPM plots above can be fetched using
the function order_cluster_iqw
.
## get clusters ordered by median IQW values
seqArchR_clusts_ord <- order_clusters_iqw(sname = "RAL28_10_to_12h",
clusts = unord_clusts,
info_df = info_df,
order_by_median = TRUE
)
#> ── Order clusters by IQW ───────────────────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
str(unord_clusts)
#> List of 5
#> $ : int [1:140] 1 2 3 4 5 6 7 8 9 10 ...
#> $ : int [1:104] 141 142 143 144 145 146 147 148 149 150 ...
#> $ : int [1:111] 245 246 247 248 249 250 251 252 253 254 ...
#> $ : int [1:93] 356 357 358 359 360 361 362 363 364 365 ...
#> $ : int [1:142] 449 450 451 452 453 454 455 456 457 458 ...
str(seqArchR_clusts_ord)
#> List of 5
#> $ : int [1:140] 1 2 3 4 5 6 7 8 9 10 ...
#> $ : int [1:104] 141 142 143 144 145 146 147 148 149 150 ...
#> $ : int [1:93] 356 357 358 359 360 361 362 363 364 365 ...
#> $ : int [1:142] 449 450 451 452 453 454 455 456 457 458 ...
#> $ : int [1:111] 245 246 247 248 249 250 251 252 253 254 ...
The function seqs_acgt_image
enables visualizing a set of sequences as an
image colored by the nucleotides.
Here, one can choose to order sequences by the clusters (see argument
seqs_ord
).
##
seqs_acgt_image(sname = "RAL28_10_to_12h",
seqs = raw_seqs,
seqs_ord = unlist(seqArchR_clusts_ord),
pos_lab = -45:45, dir_path = NULL
)
Dinucleotides are often deemed important within promoter sequences.
For instance, the common dinucleotides at TSSs are PyPu (a pyrimidine-purine
pair).
Similarly, W-boxes (WW) that occur at ~30 bp upstream of the TSS are an
important TSS-determining feature in Zebrafish maternal stage promoters.
Occurrences of various dinucleotides can be visualized as heatmaps using the
function plot_motif_heatmaps
(based on heatmaps)) or
plot_motif_heatmaps2
(based on seqPattern.
We demonstrate only plot_motif_heatmaps
here.
## Get larger flank FASTA sequences (larger than those used for seqArchR)
use_seqs <- Biostrings::readDNAStringSet(filepath = system.file("extdata",
"example_promoters200.fa.gz",
package = "seqArchRplus",
mustWork = TRUE
)
)
plot_motif_heatmaps(sname = "RAL28_10_to_12h", seqs = use_seqs,
flanks = c(200),
clusts = seqArchR_clusts_ord,
motifs = c("WW", "SS"),
dir_path = NULL,
fheight = 80, fwidth = 160
)
#> ── Motif heatmaps ──────────────────────────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
#> Using default colors
#> ── Using flank: 200
#>
#> Calculating kernel density...
#>
#> Calculating kernel density...
#> plotting heatmap WW
#> plotting heatmap SS
In each cluster, the promoters (more specifically, the transcription start
sites) can be annotated by their genomic feature, i.e., whether the start site
is located within a promoter region of a gene, the 5’ UTR, 3’ UTR, exon, intron
or is intergenic.
seqArchRplus enables visualizing these annotation proportions
per cluster as stacked bar plots via the function per_cluster_annotations
.
When the argument one_plot
is set to TRUE, this function returns all stacked
bars combined within a single plot (see example below).
Otherwise, a list of individual/per-cluster plots is returned.
##
annotations_oneplot_pl <-
per_cluster_annotations(
sname = "RAL28_10_to_12h",
clusts = seqArchR_clusts_ord,
tc_gr = tc_gr,
cager_obj = NULL,
qLow = 0.1, qUp = 0.9,
txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene,
tss_region = c(-500, 100),
orgdb_obj = NULL, dir_path = NULL,
one_plot = TRUE,
txt_size = 12
)
#>
#> ── All clusters' genomic annotations ───────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
#> Loading required package: CAGEr
#> Loading required package: MultiAssayExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'matrixStats'
#> The following objects are masked from 'package:Biobase':
#>
#> anyMissing, rowMedians
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> The following object is masked from 'package:Biobase':
#>
#> rowMedians
#> Registered S3 method overwritten by 'vegan':
#> method from
#> rev.hclust dendextend
#> >> preparing features information... 2023-10-24 19:02:22
#> >> identifying nearest features... 2023-10-24 19:02:23
#> >> calculating distance from peak to TSS... 2023-10-24 19:02:23
#> >> assigning genomic annotation... 2023-10-24 19:02:23
#> >> assigning chromosome lengths 2023-10-24 19:02:28
#> >> done... 2023-10-24 19:02:28
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
annotations_oneplot_pl
## Obtain the per cluster annotations as a list for combining later by setting
## the 'one_plot' argument to FALSE
# annotations_list_pl <- per_cluster_annotations(
# sname = "RAL28_10_to_12h",
# clusts = unord_clusts,
# tc_gr = tc_gr,
# cager_obj = NULL,
# qLow = 0.1, qUp = 0.9,
# txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene,
# tss_region = c(-500, 100),
# orgdb_obj = NULL, dir_path = NULL,
# one_plot = FALSE,
# txt_size = 12
# )
Similar to the sequence logos of all promoters in a cluster (section XY), one
can separate promoters in a cluster that are on the positive strand vs. the
negative strand, and visualize them separately as shown below.
Note that this is the same function per_cluster_seqlogos
with the argument
strand_sep
set to TRUE.
## Obtain strand-separated sequence logos corresponding to each cluster
stranded_seqlogos_pl <- per_cluster_seqlogos(
sname = "RAL28_10_to_12h",
seqs = raw_seqs,
clusts = seqArchR_clusts_ord,
pos_lab = -45:45, bits_yax = "auto",
info_df = info_df,
strand_sep = TRUE, #**
one_plot = FALSE, #**
dir_path = NULL,
txt_size = 12
)
#>
#> ── All clusters' strand-separated sequence logos ───────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Plot title:(1/5) Arch `1': 69 (+ strand) /140
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Plot title:(1/5) Arch `1': 71 (- strand) /140
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Plot title:(2/5) Arch `2': 60 (+ strand) /104
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Plot title:(2/5) Arch `2': 44 (- strand) /104
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Plot title:(3/5) Arch `3': 48 (+ strand) /93
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Plot title:(3/5) Arch `3': 45 (- strand) /93
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Plot title:(4/5) Arch `4': 71 (+ strand) /142
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Plot title:(4/5) Arch `4': 71 (- strand) /142
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Plot title:(5/5) Arch `5': 54 (+ strand) /111
#>
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Plot title:(5/5) Arch `5': 57 (- strand) /111
one_plot <- cowplot::plot_grid(plotlist = stranded_seqlogos_pl, ncol = 1)
one_plot
The seqArchRplus function per_cluster_strand_dist
can enable
visualizing as barplots how promoters in each cluster are distributed on
different chromosomes and strands.
pair_colrs <- RColorBrewer::brewer.pal(n = 5, name = "Set3")
## Obtain strand-separated sequence logos corresponding to each cluster
stranded_dist_pl <- per_cluster_strand_dist(
sname = "RAL28_10_to_12h",
clusts = seqArchR_clusts_ord,
info_df = info_df,
dir_path = NULL,
txt_size = 12
)
#>
#> ── All clusters' strand distributions ──────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
one_plot <- cowplot::plot_grid(plotlist = stranded_dist_pl, ncol = 1)
one_plot
GO terms enriched per cluster can be obtained using the
seqArchRplus per_cluster_go_term_enrichments()
.
This function returns a list of plots showing the GO terms enriched.
go_enrich_pl <- per_cluster_go_term_enrichments(
sname = "RAL28_10_to_12h",
clusts = seqArchR_clusts_ord,
tc_gr = tc_gr,
txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene,
dir_path = NULL,
one_file = FALSE, #***
tss_region = c(-500,100),
orgdb_obj = "org.Dm.eg.db"
)
#> ── All clusters' GO term enrichments ───────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
#> >> preparing features information... 2023-10-24 19:02:39
#> >> identifying nearest features... 2023-10-24 19:02:39
#> >> calculating distance from peak to TSS... 2023-10-24 19:02:39
#> >> assigning genomic annotation... 2023-10-24 19:02:39
#> >> adding gene annotation... 2023-10-24 19:02:40
#> 'select()' returned 1:many mapping between keys and columns
#> >> assigning chromosome lengths 2023-10-24 19:02:40
#> >> done... 2023-10-24 19:02:40
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
one_plot <- cowplot::plot_grid(plotlist = go_enrich_pl, ncol = 2)
one_plot
form_combined_panel(iqw_tpm_pl = iqw_tpm_pl, seqlogos_pl = seqlogos_oneplot_pl,
annot_pl = annotations_oneplot_pl)
In a multi-sample study where the samples designate either different timepoints or cell types, all such (large) figures can be generated for all samples to produce a list of such plots.
generate_html_report(snames = c("RAL28_10_to_12h", "RAL28_10_to_12h"),
file_type = "PDF", dir_path = tempdir())
In a multi-sample study where the samples designate either different timepoints or conditions, all such (large) figures can be generated for all samples to produce a list of such plots which can then be visualised in an HTML file that enables scrolling between the combined panel plots from one sample to another.
We recommend viewing the promoters in the genome browser or IGV to
observe additional details such as the position of the dominant TSS etc.
The relevant seqArchRplus function that writes to disk this
information as browser track BED files is write_seqArchR_cluster_track_bed
.
##
write_seqArchR_cluster_track_bed(
sname = "RAL28_10_to_12h",
clusts = seqArchR_clusts_ord,
info_df = info_df,
use_q_bound = FALSE,
one_zip_all = TRUE,
org_name = "Dmelanogaster",
dir_path = tempdir(),
include_in_report = FALSE,
strand_sep = FALSE
)
#>
#> ── Writing cluster track BED files ─────────────────────────────────────────────
#>
#> ── Sample: RAL28_10_to_12h ──
#>
#> ℹ Preparing cluster-wise BED for RAL28_10_to_12h
#> ! Creating directory: /tmp/RtmpiFFc6t/RAL28_10_to_12h_results
#> ! Creating directory: /tmp/RtmpiFFc6t/RAL28_10_to_12h_results/Cluster_BED_tracks
#> ℹ Writing cluster BED track files at: /tmp/RtmpiFFc6t/RAL28_10_to_12h_results/Cluster_BED_tracks
Depending upon the use case, one may use any ordering for the input clusters for all functions. Looking at their annotations, if some clusters happen to occur predominantly in 5’ UTR or 3’ UTR or exons or introns etc., such clusters can be sequestered from further downstream analyses if suitable for the end goal.
You can let seqArchRplus generate all plots using default settings like so.
generate_all_plots(
sname = "RAL28_10_to_12h",
bed_info_fname = bed_info_fname,
seqArchR_clusts = unord_clusts,
raw_seqs = raw_seqs,
tc_gr = tc_gr,
use_q_bound = FALSE,
order_by_iqw = TRUE,
use_median_iqw = TRUE,
iqw = TRUE, tpm = TRUE, cons = FALSE,
pos_lab = -45:45,
txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene,
org_name = "Dmelanogaster",
qLow = 0.1, qUp = 0.9,
tss_region = c(-500, 100),
raw_seqs_mh = use_seqs,
motifs = c("WW", "SS", "TATAA", "CG"),
motif_heatmaps_flanks = c(50, 100, 200),
dir_path = tempdir(),
txt_size = 25
)
Promoter sequences can be identified using CAGE or variations of CAGE experiment. Clusters of promoter sequences (identified by seqArchR or otherwise) can be further analyzed with seqArchRplus. This vignette demonstrated many of the downstream analyses steps for studying these promoter sequence clusters/architectures.
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] CAGEr_2.8.0
#> [2] MultiAssayExperiment_1.28.0
#> [3] SummarizedExperiment_1.32.0
#> [4] MatrixGenerics_1.14.0
#> [5] matrixStats_1.0.0
#> [6] ChIPseeker_1.38.0
#> [7] org.Dm.eg.db_3.18.0
#> [8] TxDb.Dmelanogaster.UCSC.dm6.ensGene_3.12.0
#> [9] GenomicFeatures_1.54.0
#> [10] AnnotationDbi_1.64.0
#> [11] Biobase_2.62.0
#> [12] Biostrings_2.70.0
#> [13] XVector_0.42.0
#> [14] seqArchR_1.6.0
#> [15] seqArchRplus_1.2.0
#> [16] GenomicRanges_1.54.0
#> [17] GenomeInfoDb_1.38.0
#> [18] IRanges_2.36.0
#> [19] S4Vectors_0.40.0
#> [20] BiocGenerics_0.48.0
#> [21] BiocStyle_2.30.0
#>
#> loaded via a namespace (and not attached):
#> [1] ProtGenerics_1.34.0
#> [2] fs_1.6.3
#> [3] bitops_1.0-7
#> [4] enrichplot_1.22.0
#> [5] EBImage_4.44.0
#> [6] HDO.db_0.99.1
#> [7] httr_1.4.7
#> [8] RColorBrewer_1.1-3
#> [9] tools_4.3.1
#> [10] backports_1.4.1
#> [11] vegan_2.6-4
#> [12] utf8_1.2.4
#> [13] R6_2.5.1
#> [14] mgcv_1.9-0
#> [15] lazyeval_0.2.2
#> [16] Gviz_1.46.0
#> [17] permute_0.9-7
#> [18] withr_2.5.1
#> [19] prettyunits_1.2.0
#> [20] gridExtra_2.3
#> [21] cli_3.6.1
#> [22] factoextra_1.0.7
#> [23] scatterpie_0.2.1
#> [24] labeling_0.4.3
#> [25] sass_0.4.7
#> [26] Rsamtools_2.18.0
#> [27] yulab.utils_0.1.0
#> [28] foreign_0.8-85
#> [29] gson_0.1.0
#> [30] DOSE_3.28.0
#> [31] stringdist_0.9.10
#> [32] dichromat_2.0-0.1
#> [33] plotrix_3.8-2
#> [34] BSgenome_1.70.0
#> [35] VGAM_1.1-9
#> [36] rstudioapi_0.15.0
#> [37] RSQLite_2.3.1
#> [38] generics_0.1.3
#> [39] gridGraphics_0.5-1
#> [40] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#> [41] BiocIO_1.12.0
#> [42] heatmaps_1.26.0
#> [43] gtools_3.9.4
#> [44] car_3.1-2
#> [45] dplyr_1.1.3
#> [46] dendextend_1.17.1
#> [47] GO.db_3.18.0
#> [48] interp_1.1-4
#> [49] Matrix_1.6-1.1
#> [50] fansi_1.0.5
#> [51] abind_1.4-5
#> [52] lifecycle_1.0.3
#> [53] yaml_2.3.7
#> [54] carData_3.0-5
#> [55] CAGEfightR_1.22.0
#> [56] gplots_3.1.3
#> [57] qvalue_2.34.0
#> [58] SparseArray_1.2.0
#> [59] BiocFileCache_2.10.0
#> [60] grid_4.3.1
#> [61] blob_1.2.4
#> [62] promises_1.2.1
#> [63] crayon_1.5.2
#> [64] lattice_0.22-5
#> [65] cowplot_1.1.1
#> [66] KEGGREST_1.42.0
#> [67] magick_2.8.1
#> [68] pillar_1.9.0
#> [69] knitr_1.44
#> [70] fgsea_1.28.0
#> [71] rjson_0.2.21
#> [72] boot_1.3-28.1
#> [73] codetools_0.2-19
#> [74] fastmatch_1.1-4
#> [75] glue_1.6.2
#> [76] ggimage_0.3.3
#> [77] ggfun_0.1.3
#> [78] data.table_1.14.8
#> [79] vctrs_0.6.4
#> [80] png_0.1-8
#> [81] treeio_1.26.0
#> [82] gtable_0.3.4
#> [83] assertthat_0.2.1
#> [84] cachem_1.0.8
#> [85] xfun_0.40
#> [86] S4Arrays_1.2.0
#> [87] mime_0.12
#> [88] tidygraph_1.2.3
#> [89] som_0.3-5.1
#> [90] interactiveDisplayBase_1.40.0
#> [91] ellipsis_0.3.2
#> [92] nlme_3.1-163
#> [93] ggtree_3.10.0
#> [94] bit64_4.0.5
#> [95] progress_1.2.2
#> [96] filelock_1.0.2
#> [97] bslib_0.5.1
#> [98] rpart_4.1.21
#> [99] KernSmooth_2.23-22
#> [100] Hmisc_5.1-1
#> [101] colorspace_2.1-0
#> [102] DBI_1.1.3
#> [103] nnet_7.3-19
#> [104] seqPattern_1.34.0
#> [105] tidyselect_1.2.0
#> [106] bit_4.0.5
#> [107] compiler_4.3.1
#> [108] curl_5.1.0
#> [109] htmlTable_2.4.1
#> [110] xml2_1.3.5
#> [111] DelayedArray_0.28.0
#> [112] bookdown_0.36
#> [113] shadowtext_0.1.2
#> [114] rtracklayer_1.62.0
#> [115] checkmate_2.2.0
#> [116] scales_1.2.1
#> [117] caTools_1.18.2
#> [118] rappdirs_0.3.3
#> [119] tiff_0.1-11
#> [120] stringr_1.5.0
#> [121] digest_0.6.33
#> [122] fftwtools_0.9-11
#> [123] rmarkdown_2.25
#> [124] base64enc_0.1-3
#> [125] htmltools_0.5.6.1
#> [126] pkgconfig_2.0.3
#> [127] jpeg_0.1-10
#> [128] sparseMatrixStats_1.14.0
#> [129] ensembldb_2.26.0
#> [130] dbplyr_2.3.4
#> [131] fastmap_1.1.1
#> [132] rlang_1.1.1
#> [133] htmlwidgets_1.6.2
#> [134] shiny_1.7.5.1
#> [135] DelayedMatrixStats_1.24.0
#> [136] farver_2.1.1
#> [137] jquerylib_0.1.4
#> [138] jsonlite_1.8.7
#> [139] BiocParallel_1.36.0
#> [140] GOSemSim_2.28.0
#> [141] VariantAnnotation_1.48.0
#> [142] RCurl_1.98-1.12
#> [143] magrittr_2.0.3
#> [144] Formula_1.2-5
#> [145] GenomeInfoDbData_1.2.11
#> [146] ggplotify_0.1.2
#> [147] patchwork_1.1.3
#> [148] munsell_0.5.0
#> [149] Rcpp_1.0.11
#> [150] ape_5.7-1
#> [151] viridis_0.6.4
#> [152] stringi_1.7.12
#> [153] ggraph_2.1.0
#> [154] zlibbioc_1.48.0
#> [155] MASS_7.3-60
#> [156] AnnotationHub_3.10.0
#> [157] plyr_1.8.9
#> [158] formula.tools_1.7.1
#> [159] ggseqlogo_0.1
#> [160] parallel_4.3.1
#> [161] HPO.db_0.99.2
#> [162] ggrepel_0.9.4
#> [163] deldir_1.0-9
#> [164] graphlayouts_1.0.1
#> [165] splines_4.3.1
#> [166] hms_1.1.3
#> [167] locfit_1.5-9.8
#> [168] igraph_1.5.1
#> [169] ggpubr_0.6.0
#> [170] ggsignif_0.6.4
#> [171] reshape2_1.4.4
#> [172] biomaRt_2.58.0
#> [173] BiocVersion_3.18.0
#> [174] XML_3.99-0.14
#> [175] evaluate_0.22
#> [176] latticeExtra_0.6-30
#> [177] biovizBase_1.50.0
#> [178] BiocManager_1.30.22
#> [179] operator.tools_1.6.3
#> [180] tweenr_2.0.2
#> [181] httpuv_1.6.12
#> [182] tidyr_1.3.0
#> [183] purrr_1.0.2
#> [184] polyclip_1.10-6
#> [185] ggplot2_3.4.4
#> [186] ggforce_0.4.1
#> [187] broom_1.0.5
#> [188] xtable_1.8-4
#> [189] AnnotationFilter_1.26.0
#> [190] restfulr_0.0.15
#> [191] tidytree_0.4.5
#> [192] MPO.db_0.99.7
#> [193] rstatix_0.7.2
#> [194] later_1.3.1
#> [195] viridisLite_0.4.2
#> [196] tibble_3.2.1
#> [197] clusterProfiler_4.10.0
#> [198] aplot_0.2.2
#> [199] memoise_2.0.1
#> [200] GenomicAlignments_1.38.0
#> [201] cluster_2.1.4
Carninci, Piero, Albin Sandelin, Boris Lenhard, Shintaro Katayama, Kazuro Shimokawa, Jasmina Ponjavic, Colin A M Semple, et al. 2006. “Genome-wide analysis of mammalian promoter architecture and evolution.” Nature Genetics 38 (6): 626–35.
Chen, Zhen-Xia, David Sturgill, Jiaxin Qu, Huaiyang Jiang, Soo Park, Nathan Boley, Ana Maria Suzuki, et al. 2014. “Comparative validation of the D. melanogaster modENCODE transcriptome annotation.” Genome Research 24 (7): 1209–23.
Haberle, Vanja, Nan Li, Yavor Hadzhiev, Charles Plessy, Christopher Previti, Chirag Nepal, Jochen Gehrig, et al. 2014. “Two Independent Transcription Initiation Codes Overlap on Vertebrate Core Promoters.” Nature 507 (7492): 381–85.
Kodzius, Rimantas, Miki Kojima, Hiromi Nishiyori, Mari Nakamura, Shiro Fukuda, Michihira Tagami, Daisuke Sasaki, et al. 2006. “CAGE: cap analysis of gene expression.” Nature Methods 3 (3): 211–22.
Nepal, Chirag, Yavor Hadzhiev, Christopher Previti, Vanja Haberle, Nan Li, Hazuki Takahashi, Ana Maria M Suzuki, et al. 2013. “Dynamic Regulation of the Transcription Initiation Landscape at Single Nucleotide Resolution During Vertebrate Embryogenesis.” Genome Research 23 (11): 1938–50.
Nikumbh, Sarvesh, and Boris Lenhard. 2023. “Identifying Promoter Sequence Architectures via a Chunking-Based Algorithm Using Non-Negative Matrix Factorisation.” bioRxiv. https://doi.org/10.1101/2023.03.02.530868.
Schor, Ignacio E, Jacob F Degner, Dermot Harnett, Enrico Cannavò, Francesco P Casale, Heejung Shim, David A Garfield, et al. 2017. “Promoter Shape Varies Across Populations and Affects Promoter Evolution and Expression Noise.” Nature Genetics 49 (4): 550.
The FANTOM Consortium. 2014. “A promoter-level mammalian expression atlas.” Nature 507 (7493): 462–70.