1 Preface

1.1 Motivation

Most cellular processes are regulated by RNA-binding proteins (RBPs). Knowledge on their exact positioning can be obtained from individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP) experiments. In a recent publication we described a complete analysis workflow to detect RBP binding sites from iCLIP data. The workflow covers all essential steps from quality control of sequencing reads, different peak calling options, to the downstream analysis and definition of binding sites. The pre-processing and peak calling steps rely on publicly available software, whereas the definition of the final binding sites follows a custom procedure implemented in BindingSiteFinder. This vignette explains how equally sized binding sites can be defined from a genome-wide iCLIP coverage.

1.2 Prerequisites

The workflow described herein is based on our recently published complete iCLIP analysis pipeline (Busch et al. 2020). Thus, we expect the user to have preprocessed their iCLIP sequencing reads up to the point of the peak calling step. In brief, this includes basic processing of the sequencing reads, such as quality filtering, barcode handling, mapping and the generation of a single nucleotide crosslink file for all replicates under consideration. As we describe in our manuscript, replicate .bam files may or may not be merged prior to peak calling, for which we suggest PureCLIP (Krakau, Richard, and Marsico 2017). For simplicity, we address only the case in which peak calling was performed on the merge of all replicates.

Overview of the preprocessing workflow.

Overview of the preprocessing workflow.

1.3 Installation

The BindingSiteFinder package is available at https://bioconductor.org and can be installed via BiocManager::install:

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("BindingSiteFinder")

Note: If you use BindingSiteFinder in published research, please cite:

Busch, A., Brüggemann, M., Ebersberger, S., & Zarnack, K. (2020) iCLIP data analysis: A complete pipeline from sequencing reads to RBP binding sites. Methods, 178, 49-62. https://doi.org/10.1016/j.ymeth.2019.11.008

2 Standard workflow

2.2 Merge peaks into binding sites

2.2.1 Construction of the BindingSiteFinder dataset

As input, the BindingSiteFinder package expects two types of data. First, a GRanges object of all ranges that should be merged into binding sites. In our example, this was the created by PureCLIP and we just have to import the file. The second information is the per replicate coverage information in form of a data.frame, which is created below.

# Load clip signal files and define meta data object
files <- system.file("extdata", package="BindingSiteFinder")
clipFilesP <- list.files(files, pattern = "plus.bw$", full.names = TRUE)
clipFilesM <- list.files(files, pattern = "minus.bw$", full.names = TRUE)

meta = data.frame(
  id = c(1:4),
  condition = factor(rep("WT", 4)), 
  clPlus = clipFilesP, clMinus = clipFilesM)
meta
##   id condition
## 1  1        WT
## 2  2        WT
## 3  3        WT
## 4  4        WT
##                                                                            clPlus
## 1 /tmp/RtmpzNcWaO/Rinst3cee351c8975c3/BindingSiteFinder/extdata/rep1_clip_plus.bw
## 2 /tmp/RtmpzNcWaO/Rinst3cee351c8975c3/BindingSiteFinder/extdata/rep2_clip_plus.bw
## 3 /tmp/RtmpzNcWaO/Rinst3cee351c8975c3/BindingSiteFinder/extdata/rep3_clip_plus.bw
## 4 /tmp/RtmpzNcWaO/Rinst3cee351c8975c3/BindingSiteFinder/extdata/rep4_clip_plus.bw
##                                                                            clMinus
## 1 /tmp/RtmpzNcWaO/Rinst3cee351c8975c3/BindingSiteFinder/extdata/rep1_clip_minus.bw
## 2 /tmp/RtmpzNcWaO/Rinst3cee351c8975c3/BindingSiteFinder/extdata/rep2_clip_minus.bw
## 3 /tmp/RtmpzNcWaO/Rinst3cee351c8975c3/BindingSiteFinder/extdata/rep3_clip_minus.bw
## 4 /tmp/RtmpzNcWaO/Rinst3cee351c8975c3/BindingSiteFinder/extdata/rep4_clip_minus.bw

This data.frame has to have at minimum three columns, which must be named condition, clPlus and clMinus. The condition column specifies the different conditions of the replicates (see Work with different conditions for more examples). The clPlus and clMinus columns point towards the strand-specific coverage for each replicate. This information will be imported as Rle objects upon object initialization. The number of ranges and crosslinks imported in the object can be shown once it has been constructed1 Here, we load a previously compiled BindingSiteFinder DataSet to save disc space..

library(BindingSiteFinder)
bds = BSFDataSetFromBigWig(ranges = csFilter, meta = meta, silent = TRUE)
exampleFile <- list.files(files, pattern = ".rda$", full.names = TRUE)
load(exampleFile)
bds

2.2.2 Choosing the binding site width

Choosing how wide the final binding sites should be is not an easy choice and depends on various different factors. For that reason different filter options alongside with diagnostic plots were implemented to guide the decision. At first one must decide on a desired binding site width which can be done via the bsSize parameter. Alongside, one must also choose minWidth, which removes all ranges smaller than minWidth during the merging routine. More specifically, only ranges larger than minWidth are allowed to be extended into the final binding sites2 Keep this option at default or higher to prevent unwanted mapping artifacts to blur binding site definition.

As a first diagnostic, a ratio between the crosslink events within binding sites and the crosslink events in adjacent windows to the binding sites can be computed. The higher the ratio, the better the associated binding site width captures the distribution of the underlying crosslink events. The supportRatioPlot() function offers a quick screen for different width choices3 Note that given the little difference between 5nt, 7nt and 9nt both options seem to be reasonable.

supportRatioPlot(bds, bsWidths = seq(from = 3, to = 29, by = 2), minWidth = 2) 

To further guide the choice, a selection of potential width can be computed explicitly, here for widths 3nt, 9nt, 19nt and 29nt while holding all other parameters constant.

bds1 <- makeBindingSites(object = bds, bsSize = 3, minWidth = 2,
                         minCrosslinks = 2, minClSites = 1)

bds2 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 2,
                         minCrosslinks = 2, minClSites = 1)

bds3 <- makeBindingSites(object = bds, bsSize = 19, minWidth = 2,
                         minCrosslinks = 2, minClSites = 1)

bds4 <- makeBindingSites(object = bds, bsSize = 29, minWidth = 2,
                         minCrosslinks = 2, minClSites = 1)
l = list(`1. bsSize = 3` = bds1, `2. bsSize = 9` = bds2, 
         `3. bsSize = 19` = bds3, `4. bsSize = 29` = bds4)

The effect of the binding site width can be visualized by looking at the total iCLIP coverage. Here, each plot is centered around the binding site’s midpoint and the computed width is indicated by the gray frame. In our example, size = 3 appears too small, since not all of the relevant peak signal seems to be captured. On the contrary, size = 29 appears extremely large. Here, we decided for size = 9 because it seems to capture the central coverage peak best.

rangeCoveragePlot(l, width = 20) 

2.2.3 Applying additional constraints

Once a decision on a binding site width has been made additional filtering options can be used to force more/ less stringent binding sites. Essentially an initially computed set of binding sites is filtered by a certain parameter. The minClSites parameter for instance allows the user to define how many of the initially computed PureCLIP sites should fall within the range of the computed binding site.

bds1 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 2,
                         minCrosslinks = 2, minClSites = 1)

bds2 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 2,
                         minCrosslinks = 2, minClSites = 2)

bds3 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 2,
                         minCrosslinks = 2, minClSites = 3)

bds4 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 2,
                         minCrosslinks = 2, minClSites = 4)

As one might have expected the number of final binding sites that meet the threshold drops with higher constraints. The coverage on the other hand spans wider around the binding site center, resulting in a broader coverage.

l = list(`1. minClSites = 1` = bds1, `2. minClSites = 2` = bds2, 
         `3. minClSites = 3` = bds3, `4. minClSites = 4` = bds4)
mergeSummaryPlot(l, select = "minClSites")

l = list(`1. minClSites = 1` = bds1, `2. minClSites = 2` = bds2, 
         `3. minClSites = 3` = bds3, `4. minClSites = 4` = bds4)
rangeCoveragePlot(l, width = 20)

A further point to consider is the number of positions within the binding site that are actually covered by iCLIP signal (crosslink events). The minCrosslinks parameter can be used to filter for that. In principle, this parameter represents a ‘weaker’ version of the minClSites filter. It should be set in conjunction with the binding site width. Here, for example we aim to have at least about one third of all positions to be covered by crosslink events.

bds10 <- makeBindingSites(object = bds, bsSize = 3, minWidth = 2,
                         minCrosslinks = 1, minClSites = 1)

bds20 <- makeBindingSites(object = bds, bsSize = 9, minWidth = 2,
                         minCrosslinks = 3, minClSites = 1)

bds30 <- makeBindingSites(object = bds, bsSize = 19, minWidth = 2,
                         minCrosslinks = 6, minClSites = 1)

bds40 <- makeBindingSites(object = bds, bsSize = 29, minWidth = 2,
                         minCrosslinks = 10, minClSites = 1)

After all diagnostics, we decide for the following parameter set to derive our binding sites.

bds1 <- makeBindingSites(object = bds, bsSize = 7, minWidth = 2,
                         minCrosslinks = 2, minClSites = 1)

2.3 Reproducibility filter

2.3.1 Choosing a replicate-specific cutoff

Since the initial PureCLIP run was based on the merged signal of all four replicates, an additional reproducibility filter must be used. More specifically, we ask which of the computed binding sites are reproduced by the individual replicates. Since replicates might differ in library size, we will decide on a replicates-specific threshold based on the binding site support distribution, i.e., how many crosslink events fall into the computed binding sites in each replicate. The reproducibiliyCutoffPlot() function visualizes this distribution for each replicate. Here, we decided for the 5% quantile4 A lower boundary of 1 nucleotide is set as default minimum support., which results in the crosslink thresholds 1, 2, 3 and 1 for the respective replicates WT1-4. This means that for replicate WT3, a minimum of 3 crosslink events must be seen in a binding site to call this binding site reproduced by replicate WT3.

reproducibiliyCutoffPlot(bds1, max.range = 20, cutoff = 0.05)

2.3.2 Defining the overall support level

After deciding which quantile cutoff to use, the number of replicates that must meet the selected threshold must be specified. This offers another level of filtering and allows for some adjustment, since not always all replicates are forced to agree on every binding site. For instance, we could implement the rule that 3 out of 4 replicates should agree for a binding site, while setting the quantile cutoff to 5% for all 4 replicates. This can be achieved with the reproducibilityFilter() function.

When setting the returnType = "data.frame", the function returns a plottable object instead of a final filtered GRanges object. The overlap among the four replicates after filtering can be easily visualized as an UpSet plot, using the complexHeatmaps package.

s1 = reproducibilityFilter(bds1, cutoff = 0.05, n.reps = 3, 
                           returnType = "data.frame")
library(ComplexHeatmap)
m1 = make_comb_mat(s1)
UpSet(m1)

Once all decisions on filtering parameters have been made, the reproducibilityFilter() function is used again. The final binding site object can be retrieved through the getRanges() accessor function.

bdsFinal = reproducibilityFilter(bds1, cutoff = 0.05, n.reps = 3)
getRanges(bdsFinal)
## GRanges object with 3689 ranges and 0 metadata columns:
##        seqnames            ranges strand
##           <Rle>         <IRanges>  <Rle>
##      1    chr22 11630137-11630143      +
##      2    chr22 11630412-11630418      +
##      3    chr22 15785522-15785528      +
##      4    chr22 15785545-15785551      +
##      7    chr22 15786267-15786273      +
##    ...      ...               ...    ...
##   4766    chr22 50769683-50769689      -
##   4768    chr22 50776991-50776997      -
##   4769    chr22 50777820-50777826      -
##   4770    chr22 50777868-50777874      -
##   4771    chr22 50783297-50783303      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

In the case that the initial peak calling was performed with PureCLIP, we offer the annotateWithScore() function to re-annotate the computed binding sites with the PureCLIP score for each position. In particular we compute the mean, the maximum and the sum over all PureCLIP sites that fall within the binding site.

bdsFinal = annotateWithScore(bdsFinal, getRanges(bds))
# set nice binding site names
finalRanges = getRanges(bdsFinal)
names(finalRanges) = paste0("BS", seq_along(finalRanges))
bdsFinal = setRanges(bdsFinal, finalRanges)
bindingSites = getRanges(bdsFinal)
bindingSites
## GRanges object with 3689 ranges and 3 metadata columns:
##          seqnames            ranges strand |  scoreSum scoreMean  scoreMax
##             <Rle>         <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##      BS1    chr22 11630137-11630143      + |  592.9750 197.65833 334.18700
##      BS2    chr22 11630412-11630418      + |  121.2398  30.30995  77.58330
##      BS3    chr22 15785522-15785528      + |   24.0186   4.80372   9.21879
##      BS4    chr22 15785545-15785551      + |   30.5446  10.18155  15.20870
##      BS5    chr22 15786267-15786273      + |   37.9989   7.59977  16.53900
##      ...      ...               ...    ... .       ...       ...       ...
##   BS3685    chr22 50769683-50769689      - |  40.51640  20.25820  20.35940
##   BS3686    chr22 50776991-50776997      - |  60.35172  20.11724  37.55070
##   BS3687    chr22 50777820-50777826      - |   9.21558   3.07186   5.38047
##   BS3688    chr22 50777868-50777874      - |  54.12220  18.04073  22.78030
##   BS3689    chr22 50783297-50783303      - |  39.85535  13.28512  16.22960
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Lastly, we make use of the rtracklayer package once more to export the GRanges of our final binding sites as a .bed file for visual inspection eg. in IGV.

library(rtracklayer)
export(object = bindingSites, con = "./bindingSites.bed", format = "BED")

3 Downstream characterization

After the definition of binding sites, one is usually interested in the binding behavior of the RBP. General questions are: Which gene types does my RBP bind? If it binds to protein-coding genes, to which part of the transcripts does it bind? Answering these types of questions requires using an additional gene annotation resource to overlap the binding sites with. In the following, we demonstrate how this could be achieved using the binding sites defined in the standard workflow.

3.1 Target gene assignment

As gene annotation, any desired source can be used. Here, we use GENCODE gene annotations (hg38) from chromosome 22 imported from GFF3 format as an example5 Note that this file is a subset of the respective annotation that we prepared beforehand. Here, we only load the final object (see /inst/scripts/PrepareExampleData.R). First, we make use of the GenomicFeatures package and import our annotation as a TxDB database object. This database can be queried using the genes() function to retrieve the positions of all genes from our resource as GRanges objects.

library(GenomicFeatures)
# Make annotation database from gff3 file
annoFile = "./gencode.v37.annotation.chr22.header.gff3"
annoDb = makeTxDbFromGFF(file = annoFile, format = "gff3")
annoInfo = import(annoFile, format = "gff3")
# Get genes as GRanges
gns = genes(annoDb)
idx = match(gns$gene_id, annoInfo$gene_id)
elementMetadata(gns) = cbind(elementMetadata(gns),
                             elementMetadata(annoInfo)[idx,])
# Load GRanges with genes
geneFile <- list.files(files, pattern = "gns.rds$", full.names = TRUE)
load(geneFile)
gns
## GRanges object with 1387 ranges and 3 metadata columns:
##                      seqnames            ranges strand |            gene_id
##                         <Rle>         <IRanges>  <Rle> |        <character>
##   ENSG00000008735.14    chr22 50600793-50613981      + | ENSG00000008735.14
##   ENSG00000015475.19    chr22 17734138-17774770      - | ENSG00000015475.19
##   ENSG00000025708.14    chr22 50525752-50530032      - | ENSG00000025708.14
##   ENSG00000025770.19    chr22 50508224-50524780      + | ENSG00000025770.19
##   ENSG00000040608.14    chr22 20241415-20283246      - | ENSG00000040608.14
##                  ...      ...               ...    ... .                ...
##    ENSG00000288106.1    chr22 38886837-38903794      + |  ENSG00000288106.1
##    ENSG00000288153.1    chr22 30664961-30674134      + |  ENSG00000288153.1
##    ENSG00000288262.1    chr22 11474744-11479643      + |  ENSG00000288262.1
##    ENSG00000288540.1    chr22 48320412-48333199      - |  ENSG00000288540.1
##    ENSG00000288683.1    chr22 18077989-18131720      + |  ENSG00000288683.1
##                                   gene_type   gene_name
##                                 <character> <character>
##   ENSG00000008735.14         protein_coding    MAPK8IP2
##   ENSG00000015475.19         protein_coding         BID
##   ENSG00000025708.14         protein_coding        TYMP
##   ENSG00000025770.19         protein_coding      NCAPH2
##   ENSG00000040608.14         protein_coding       RTN4R
##                  ...                    ...         ...
##    ENSG00000288106.1                 lncRNA  AL022318.5
##    ENSG00000288153.1 unprocessed_pseudogene  AC003072.2
##    ENSG00000288262.1 unprocessed_pseudogene  CT867976.1
##    ENSG00000288540.1                 lncRNA  AL008720.2
##    ENSG00000288683.1         protein_coding  AC016027.6
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

3.2 Dealing with annotation overlaps

Depending on the source and organism, gene annotations overlap each other to some degree. This complicates assigning binding sites to their respective host genes. The following plot illustrates the degree of overlaps of gene annotations in the specific annotation set that we use. We observe that the vast majority of binding sites reside in exactly one gene annotation. However, a set of 322 binding sites overlap with two different gene annotations and 81 binding sites are even completely outside of any annotation.

# Count the number of annotation overlaps
mcols(bindingSites)$geneOl = factor(countOverlaps(bindingSites, gns))
df = as.data.frame(mcols(bindingSites))

ggplot(df, aes(x = geneOl)) +
    geom_bar() +
    scale_y_log10() +
    geom_text(stat='count', aes(label=..count..), vjust=-0.3) +
    labs(
        title = "Overlapping gene annotation problem",
        subtitle = "Bar plot shows how many times a binding site overlaps with an annotated gene.",
    x = "Number of overlaps", 
    y = "Count (log10)"
    ) + theme_bw()

For the sake of simplicity, we directly remove the 81 binding sites that reside in regions outside of any gene annotation. Removing binding site should be done carefully and only after detailed inspection in IGV ect. (see Export binding sites to IGV for more details). For all other binding sites, we can analyze the gene type of the duplicated annotations by counting the annotation type split by overlap frequency. In the present case, we observe that most overlaps seemed to be caused by lncRNA annotations, probably residing within introns of protein-coding genes. Since we are interested in the binding of our RBP on protein-coding genes, we remove all binding sites on spurious annotations[^7] (see iCLIP data analysis workflow for more details).

[^7] Note: This is a simplified scheme which one might need to adapt for more complex research questions

# Show which gene types cause the most annotation overlaps. Split binding sites
# by the number of overlaps and remove those binding sites that do not overlap
# with any annotation (intergenic)
library(forcats)
idx = findOverlaps(gns, bindingSites)
df = data.frame(geneType = gns$gene_type[queryHits(idx)], 
                overlapType = bindingSites$geneOl[subjectHits(idx)]) %>%
    mutate(geneType = as.factor(geneType)) %>% 
    mutate(geneType = fct_lump_n(geneType, 4)) %>%
    group_by(overlapType, geneType) %>%
    tally() 

ggplot(df, aes(x = overlapType, y = n, fill = geneType)) +
    geom_col(position = "fill") +
    scale_fill_brewer(palette = "Set2") +
    scale_y_continuous(labels = scales::percent) +
    labs(
        title = "Overlapping gene annotation causes",
        subtitle = "Percentage bar plot that show the fraction for each gene type and overlap number",
        fill = "Gene type",
        x = "Number of overlaps", 
        y = "Percent"
    ) + 
    theme_bw()

df = data.frame(geneType = gns$gene_type[queryHits(idx)], 
                overlapType = bindingSites$geneOl[subjectHits(idx)]) %>%
    filter(overlapType == 1) %>%
    group_by(geneType) %>%
    tally() %>% 
    arrange(n) %>%
    mutate(geneType = factor(geneType, levels = geneType))

ggplot(df, aes(x = geneType, y = n, fill = geneType, label = n)) +
    geom_col(color = "black") +
    scale_y_log10() +
    coord_flip() +
    scale_fill_brewer(palette = "Greys", direction = -1) +
    theme_bw() +
    theme(legend.position = "none") +
    geom_text(color = "blue") +
    labs(
        title = "Clean RBP target spectrum",
        x = "",
        y = "#N (log10)"
    )

# Remove all binding sites that overlap multiple gene annotations and retain
# only protein-coding genes and binding sites.
bindingSites = subset(bindingSites, geneOl == 1)
targetGenes = subsetByOverlaps(gns, bindingSites)
targetGenes = subset(targetGenes, gene_type == "protein_coding")

3.3 Transcript region assignment

Now that we know that our RBP targets mainly protein-coding genes, we will have a closer look at its positioning within the transcript regions of the protein-coding genes. For this, we again make use of the TxDB compiled from our GENCODE annotation file6 Note that we prepared this file beforehand and only load the final object (see /inst/scripts/PrepareExampleData.R). As one would expect, the problem of overlapping annotations is increase manifold on the level of transcript regions compared to entire genes.

# Count the overlaps of each binding site for each region of the transcript. 
cdseq = cds(annoDb) 
intrns = unlist(intronsByTranscript(annoDb)) 
utrs3 = unlist(threeUTRsByTranscript(annoDb)) 
utrs5 = unlist(fiveUTRsByTranscript(annoDb)) 
regions = list(CDS = cdseq, Intron = intrns, UTR3 = utrs3, UTR5 = utrs5)
# Load list with transcript regions
regionFile <- list.files(files, pattern = "regions.rds$", full.names = TRUE)
load(regionFile)
# Count the overlaps of each binding site fore each region of the transcript. 
cdseq = regions$CDS %>% countOverlaps(bindingSites,.)
intrns = regions$Intron %>% countOverlaps(bindingSites,.)
utrs3 = regions$UTR3 %>% countOverlaps(bindingSites,.)
utrs5 = regions$UTR5 %>% countOverlaps(bindingSites,.)
countDf = data.frame(CDS = cdseq, Intron = intrns, UTR3 = utrs3, UTR5 = utrs5)
# Count how many times an annotation is not present.
df = data.frame(olClass = apply(countDf,1,function(x) length(x[x == 0]))) 
df = data.frame(type = rev(names(table(df))), count = as.vector(table(df))) 
ggplot(df, aes(x = type, y = count)) +
    geom_col() +
    scale_y_log10() +
    geom_text(aes(label = count), vjust=-0.3) +
    labs(
        title = "Overlapping transcript annotation clashes",
        subtitle = "Bar plot shows how many times a binding site overlaps with an annotation of a different \ntranscript region.",
        x = "Number of overlaps", 
        y = "Count (log10)"
    ) + theme_bw()

Again, the simplest approach would be to remove all 395 binding sites that do not overlap with one distinct annotation. This could however lead to many binding sites getting lost, depending on the complexity of the annotation/ organism etc. Here, we decided to implement a majority vote system followed by a simple priority list to resolve ties (see iCLIP data analysis workflow for more details). In any case, we recommend careful visual inspection in e.g. IGV to complement the presented workflow.

3.3.1 Majority vote-based assignment

Before assigning binding sites to distinct transcript regions, it is worth to first plot the unresolved binding site distribution7 Note: Binding sites are essentially counted multiple times here.. This allows us to later compare to the resolved binding site assignment and identify potential unexpected behaviors. In our example, we would at this point deduce a predominance of intronic binding over 3’UTRs, CDS and 5’UTRs.

# Count the number of binding sites within each transcript region, ignoring the 
# problem of overlapping annotations (counting binding sites multiple times).
plotCountDf = countDf
plotCountDf[plotCountDf > 1] = 1
df = plotCountDf %>% 
    pivot_longer(everything()) %>% 
    group_by(name) %>%
    summarize(count = sum(value), .groups = "drop") %>%
    arrange(count) %>%
    mutate(name = factor(name, levels = name))

ggplot(df, aes(x = name, y = count, fill = name)) + 
    geom_col(color = "black") +
    scale_y_log10() +
    coord_flip() +
    scale_fill_brewer(palette = "Greys", direction = -1) +
    xlab("Number of overlaps") +
    ylab("Count") +
    geom_text(aes(label = count), color = "blue") +
    labs(
        title = "Binding sites per transcript region - unresolved annotation overlaps",
        subtitle = "Bar plot that shows the number of binding sites per transcript region.",
        x = "Transcript region", 
        y = "#N (log10)"
    ) + 
    theme_bw() +
    theme(legend.position = "none") 

We now further visualize the exact overlap conflicts using an UpSet plot representation from the ComplextHeatmap package. From this plot, it becomes again clear that most binding sites are located within regions annotated as introns. But we also observe that most annotation conflicts are between introns and 3’UTRs.

# Show how many annotation overlaps are caused by what transcript regions
m = make_comb_mat(plotCountDf)
UpSet(m)

To resolve these conflicts, we are using a majority vote-based system followed by a hierarchical rule. The majority vote assigns a binding site to a transcript region based on which region is most frequently overlapping. In case of ties, we apply the rule introns > 3’UTR > CDS > 5’UTR which we deduced from the unresolved binding site distribution8 Note: In rare cases, additional intergenic binding sites can be found even though we removed these in a previous step. This is because some binding sites overlap with the annotation of a gene, but not with any of the transcript isoforms associated with the gene..

# Solve the overlapping transcript region problem with majority votes and 
# hierarchical rule. 
rule = c("Intron", "UTR3", "CDS", "UTR5")
# Prepare dataframe for counting (reorder by rule, add intergenic counts)
countDf = countDf[, rule] %>% 
    as.matrix() %>%
    cbind.data.frame(., Intergenic = ifelse(rowSums(countDf) == 0, 1, 0) ) 
# Apply majority vote scheme
regionName = colnames(countDf)
region = apply(countDf, 1, function(x){regionName[which.max(x)]})
mcols(bindingSites)$region = region
df = data.frame(Region = region) %>%
    group_by(Region) %>%
    tally() %>%
    arrange(n) %>%
    mutate(Region = factor(Region, levels = Region))


ggplot(df, aes(x = Region, y = n, fill = Region)) + 
    geom_col(color = "black") +
    scale_y_log10() +
    coord_flip() +
    scale_fill_brewer(palette = "Greys", direction = -1) +
    xlab("Number of overlaps") +
    ylab("Count") +
    geom_text(aes(label = n), color = "blue") +
    labs(
        title = "Binding sites per transcript region - resolved annotation overlaps",
        subtitle = "Bar plot that shows the number of binding sites per transcript region.",
        x = "Transcript region", 
        y = "#N (log10)"
    ) + 
    theme_bw() +
    theme(legend.position = "none") 

3.3.2 Relative region enrichment

For a splicing regulator protein such as U2AF65 binding to introns is expected. However, due to the sheer size difference in introns one might also expect to observe more binding in introns just by chance. For that reason one could divide the number of binding sites by the summed up region length to compute a relative enrichment score9 Note: Such a relative enrichment heavily favors smaller regions such as 5’UTRs. Therefore, absolute numbers and relative enrichment should both be considered..

To calculate the relative binding enrichment per region, we summed up the length of each transcript region that is hosting a binding site. In the case that a binding site overlaps with multiple regions of the same type (which happens quite frequently), we selected the first matching region as representative.

# Function that selects the first hosting region of each binding site and then
# sums up the lengths from all of them
getRegionLengthByFirstMatch <- function(region, bindingsites) {
    # region = regions$Intron
    # bindingsites = bindingSites
    idx = findOverlaps(region, bindingsites) %>% 
        as.data.frame() %>% 
        group_by(subjectHits) %>% 
        arrange(queryHits) %>% 
        filter(row_number() == 1)
    len = region[idx$queryHits] %>%
        width() %>%
        sum()
    return(len)
}
regionLength = lapply(regions, function(x){
    getRegionLengthByFirstMatch(x, bindingSites)
}) %>% unlist() %>% as.data.frame()
# Compute the relative enrichment per transcript region.
bindingSites = subset(bindingSites, region != "Intergenic")
df = as.data.frame(mcols(bindingSites)) %>%
    dplyr::select(region) %>%
    group_by(region) %>%
    tally() %>%
    mutate(regionLength = regionLength[,1]) %>%
    mutate(normalizedCount = n / regionLength) %>%
    arrange(normalizedCount) %>%
    mutate(region = factor(region, levels = region)) 
# 
ggplot(df, aes(x = region, y = normalizedCount, fill = region)) +
    geom_col(color = "black") +
    coord_flip() +
    scale_fill_brewer(palette = "Greys", direction = -1) +
    labs(
        title = "Relative enrichment per transcript region ",
        subtitle = "Bar plot of region count, divided by the summed length.",
        x = "Transcript region", 
        y = "Relative enrichment",
        fill = "Transcript region"
    ) + 
    theme_bw() +
    theme(legend.position = "none") 

4 Additional functions

4.1 Subset data for faster iterations

Subsetting a BSFinderData can be useful in a variety of cases, e.g. for reducing the object size for faster parameter testing, limiting the analysis to some candidate genes etc. Here, we subset the object by a random set of 100 binding sites and plot their count distribution.

set.seed(1234)
bdsSub = bds[sample(seq_along(getRanges(bds)), 100, replace = FALSE)]

cov = coverageOverRanges(bdsSub, returnOptions = "merge_positions_keep_replicates")
df = mcols(cov) %>%
    as.data.frame() %>%
    pivot_longer(everything())

ggplot(df, aes(x = name, y = log2(value+1), fill = name)) +
    geom_violin() +
    geom_boxplot(width = 0.1, fill = "white") +
    scale_fill_brewer(palette = "Greys") +
    theme_bw() +
    theme(legend.position = "none") +
    labs(x = "Samples", y = "#Crosslinks (log2)")

As another example, we here compare the crosslink distribution between binding sites on protein-coding and lncRNA genes.

# candidateGenes = gns[sample(seq_along(gns), 10, replace = FALSE)]
protGenes = gns[gns$gene_type == "protein_coding"]
lncGenes = gns[gns$gene_type == "lncRNA"]

calcCoverageForGeneSet <- function(geneSet, bsfObj) {
    idx = findOverlaps(geneSet, getRanges(bsfObj))
    currBsfObj = bsfObj[subjectHits(idx)]
    cov = coverageOverRanges(currBsfObj, returnOptions = "merge_positions_keep_replicates")
    df = mcols(cov) %>%
        as.data.frame() %>%
        pivot_longer(everything())
    return(df)
}

dfProtGenes = calcCoverageForGeneSet(geneSet = protGenes, bsfObj = bds) %>% 
    cbind.data.frame(geneType = "protein coding")
dfLncGenes = calcCoverageForGeneSet(geneSet = lncGenes, bsfObj = bds) %>% 
    cbind.data.frame(geneType = "lncRNA")
df = rbind(dfProtGenes, dfLncGenes)

ggplot(df, aes(x = name, y = log2(value+1), fill = geneType)) +
    geom_boxplot() +
    scale_fill_brewer(palette = "Greys") +
    theme_bw() +
    labs(x = "Samples", y = "#Crosslinks (log2)")

4.2 Merge replicate signal

Depending on the task at hand, one either wants to keep the iCLIP signal separated by replicates or merge the signal over the replicates (e.g. of the same condition). Merging signal can be done using the collapseReplicates() function. Doing so allows for example to identify the proportion of crosslink events that each sample contributes to the total of a binding site. Here, we did this for the first 100 binding sites. We sort all binding sites by their fraction and color the plot based on the replicate.

bdsMerge = collapseReplicates(bds)[1:100]
covTotal = coverageOverRanges(bdsMerge, returnOptions = "merge_positions_keep_replicates")

covRep = coverageOverRanges(bds[1:100], returnOptions = "merge_positions_keep_replicates")

df = cbind.data.frame(mcols(covTotal), mcols(covRep)) %>%
    mutate(rep1 = round(`1_WT`/ WT, digits = 2) * 100,
           rep2 = round(`2_WT`/ WT, digits = 2) * 100,
           rep3 = round(`3_WT`/ WT, digits = 2) * 100,
           rep4 = round(`4_WT`/ WT, digits = 2) * 100) %>%
    tibble::rowid_to_column("BsID") %>%
    dplyr::select(BsID, rep1, rep2, rep3, rep4) %>%
    pivot_longer(-BsID) %>%
    group_by(BsID) %>%
    arrange(desc(value), .by_group = TRUE) %>%
    mutate(name = factor(name, levels = name)) %>%
    group_by(name) %>%
    arrange(desc(value), .by_group = TRUE) %>%
    mutate(BsID = factor(BsID, levels = BsID))


ggplot(df, aes(x = BsID, y = value, fill = name)) +
    geom_col(position = "fill", width = 1) +
    theme_bw() +
    scale_fill_brewer(palette = "Set3") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 7)) +
    labs(x = "Binding site ID",
         y = "Percentage",
         fill = "Replicate"
         ) +
    scale_y_continuous(labels = scales::percent)

4.3 Make diagnostic coverage plots

It can also be useful to examine the coverage on example binding sites. A simple function for this is the bindingSiteCoveragePlot(), which plots the coverage as bars in a defined range around a selected binding site. The function is based on the Gviz package. The plotIdx indicates which range should be used as center, the flankPos parameter allows to zoom in and out. In addition to the selected range, also all other ranges which fall into the selected window will be shown. Here, we plot the coverage of a random binding site from the final object, once per replicate and once merged per condition.

bindingSiteCoveragePlot(bdsFinal, plotIdx = 8, flankPos = 100, autoscale = TRUE)

bindingSiteCoveragePlot(bdsFinal, plotIdx = 8, flankPos = 100, mergeReplicates = TRUE)

A common case that makes use of this function is when one wants to see why a particular binding site is lost after a certain filtering step. Here, we look at a binding site that was filtered out from the final object by the reproducibility filter. Doing so, we can visually confirm that binding sites 5 and 6 were correctly removed by the reproducibility filter function.

rangesBeforeRepFilter = getRanges(bds1)
rangesAfterRepFilter = getRanges(bdsFinal)
idx = which(!match(rangesBeforeRepFilter, rangesAfterRepFilter, nomatch = 0) > 0)
rangesRemovedByFilter = rangesBeforeRepFilter[idx]
bdsRemovedRanges = setRanges(bds1, rangesRemovedByFilter)

bindingSiteCoveragePlot(bdsRemovedRanges, plotIdx = 2, flankPos = 50)

bindingSiteCoveragePlot(bdsRemovedRanges, plotIdx = 2, flankPos = 50, mergeReplicates = TRUE)

We could also look at the final binding site definition and see how they were derived from the initial PureCLIP sites. To achieve this, we make use of the customRange slot to add these sites at the bottom of the coverage plot.

pSites = getRanges(bds)
bindingSiteCoveragePlot(bdsFinal, plotIdx = 8, flankPos = 20, autoscale = TRUE,
                        customRange = pSites, customRange.name = "pSites", shiftPos = -10)

Another use case would be to check the coverage on some example genes, or as we do here on the binding site with the highest number of crosslinks:

bindingSiteCoverage = coverageOverRanges(bdsFinal, returnOptions = "merge_positions_keep_replicates")
idxMaxCountsBs = which.max(rowSums(as.data.frame(mcols(bindingSiteCoverage))))
bindingSiteCoveragePlot(bdsFinal, plotIdx = idxMaxCountsBs, flankPos = 100, mergeReplicates = FALSE, shiftPos = 50)

It is also possible to anchor the plot on any other GenomicRange. Here, we take annotated CDS regions and ask for the one with the most overlapping binding sites. We then use this ranges as center for the plot and further zoom in to a particular range. We then make use of the customRange slot to re-include the binding site ranges as additional annotation shown underneath the signal. Additionally, one could also add a custom annotation track in the form of a GenomicRanges or TxDB object.

bdsCDS = setRanges(bdsFinal, regions$CDS)
cdsWithMostBs = which.max(countOverlaps(regions$CDS, getRanges(bdsFinal)))

bindingSiteCoveragePlot(bdsCDS, plotIdx = cdsWithMostBs, showCentralRange = FALSE,
                       flankPos = -250, shiftPos = -50, mergeReplicates = TRUE,
                       highlight = FALSE, customRange = getRanges(bdsFinal))

bindingSiteCoveragePlot(bdsCDS, plotIdx = cdsWithMostBs, showCentralRange = FALSE,
                       flankPos = -250, shiftPos = -50, mergeReplicates = TRUE,
                       highlight = FALSE, customRange = getRanges(bdsFinal),
                       customAnnotation = regions$CDS)

4.4 Export binding sites to IGV

For further diagnostics in third party tools, such as IGV, it might be useful to export binding sites into the commonly used BED file format. This task can directly be performed using the rtracklayer package. Here we again export binding sites before and after reproducibility filtering.

library(rtracklayer)
rangesBeforeRepFilter = getRanges(bds1)
rangesAfterRepFilter = getRanges(bdsFinal)
export(object = rangesBeforeRepFilter, con = "./rangesBeforeRepFilter.bed", format = "BED")
export(object = rangesAfterRepFilter, con = "./rangesAfterRepFilter.bed", format = "BED")

4.5 Merge replicate signal

Similar to exporting the binding sites as BED file for external visualization, one can do the same with the iCLIP signal itself. We again use the rtracklayer export function. Here, we first export the signal from a single replicate. Next, we first merge the signal over the replicates and export the combined signal.

sgn = getSignal(bds)
export(sgn$signalPlus$`1_WT`, con = "./WT_1_plus.bw", format = "bigwig")
export(sgn$signalMinus$`1_WT`, con = "./WT_1_minus.bw", format = "bigwig")
bdsMerge = collapseReplicates(bds)
sgn = getSignal(bdsMerge)
export(sgn$signalPlus$WT, con = "./sgn_plus.bw", format = "bigwig")
export(sgn$signalPlus$WT, con = "./sgn_minus.bw", format = "bigwig")

5 Variations of the standard workflow

5.1 Work with different conditions

When dealing with different conditions, one simply adjusts the meta data table in the BSFDataSet object. Here, we artificially label two of our four replicates as KD to exemplary show some applications. Be aware that the condition column is a factor and that the order of the factor levels matters in downstream functions.

# Set artificial KD condition
metaCond = getMeta(bdsFinal)
metaCond$condition = factor(c(rep("WT", 2), rep("KD", 2)), levels = c("WT", "KD"))
bdsCond = setMeta(bdsFinal, metaCond)
# Fix replicate names in signal
namesCond = c("1_WT", "2_WT", "3_KD", "4_KD")
sgn = getSignal(bdsCond)
names(sgn$signalPlus) = namesCond
names(sgn$signalMinus) = namesCond
bdsCond = setSignal(bdsCond, sgn)

5.1.1 Reproducibility filtering

Here, we use the reproducibilityCutoffPlot() to define the replicate support levels for all replicates over both conditions. The function expects a n.reps and cutoff value for each condition level, even if the same cutoffs are applied. At first, we apply the same cutoff to both conditions and then change the settings for the KD condition [^13] [^14].

[^13] Note: Be careful when filtering parameters differ between conditions, as downstream significance testing is impaired. [^14] Note: The order by which the n.reps and cutoff vectors are applied to the samples is defined by the order of the factor levels of the condition column in the meta data.

reproducibiliyCutoffPlot(bdsCond, max.range = c(20), n.reps = c(2,2), cutoff = c(0.1, 0.1))

reproducibiliyCutoffPlot(bdsCond, max.range = c(20), n.reps = c(2,2), cutoff = c(0.1, 0.05))

The final reproducibilityFilter() functions supports the same options as the reproducibiliyCutoffPlot() function and can be used in the same way to apply the tested filter settings.

bdsCond = reproducibilityFilter(bdsCond, n.reps = c(2,2), cutoff = c(0.1, 0.05))

5.1.2 Diagnostic coverage plots

The diagnostic example plots that can be made with the bindingSiteCoveragePlot() function also support multiple conditions and adjust the coloring accordingly.

bindingSiteCoverage = coverageOverRanges(bdsCond, returnOptions = "merge_positions_keep_replicates")
idxMaxCountsBs = which.max(rowSums(as.data.frame(mcols(bindingSiteCoverage))))
bindingSiteCoveragePlot(bdsCond, plotIdx = idxMaxCountsBs, flankPos = 100, mergeReplicates = FALSE, shiftPos = 50)

6 Session info

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] dplyr_1.0.10                forcats_0.5.2              
##  [3] BindingSiteFinder_1.4.0     ComplexHeatmap_2.14.0      
##  [5] tidyr_1.2.1                 ggplot2_3.3.6              
##  [7] rtracklayer_1.58.0          GenomicAlignments_1.34.0   
##  [9] Rsamtools_2.14.0            Biostrings_2.66.0          
## [11] XVector_0.38.0              SummarizedExperiment_1.28.0
## [13] Biobase_2.58.0              MatrixGenerics_1.10.0      
## [15] matrixStats_0.62.0          GenomicRanges_1.50.0       
## [17] GenomeInfoDb_1.34.0         IRanges_2.32.0             
## [19] S4Vectors_0.36.0            BiocGenerics_0.44.0        
## [21] BiocStyle_2.26.0           
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.4.1          circlize_0.4.15          Hmisc_4.7-1             
##   [4] BiocFileCache_2.6.0      plyr_1.8.7               lazyeval_0.2.2          
##   [7] splines_4.2.1            BiocParallel_1.32.0      digest_0.6.30           
##  [10] foreach_1.5.2            ensembldb_2.22.0         htmltools_0.5.3         
##  [13] magick_2.7.3             fansi_1.0.3              magrittr_2.0.3          
##  [16] checkmate_2.1.0          memoise_2.0.1            BSgenome_1.66.0         
##  [19] cluster_2.1.4            doParallel_1.0.17        prettyunits_1.1.1       
##  [22] jpeg_0.1-9               colorspace_2.0-3         blob_1.2.3              
##  [25] rappdirs_0.3.3           xfun_0.34                crayon_1.5.2            
##  [28] RCurl_1.98-1.9           jsonlite_1.8.3           survival_3.4-0          
##  [31] VariantAnnotation_1.44.0 iterators_1.0.14         glue_1.6.2              
##  [34] polyclip_1.10-4          gtable_0.3.1             zlibbioc_1.44.0         
##  [37] GetoptLong_1.0.5         DelayedArray_0.24.0      shape_1.4.6             
##  [40] scales_1.2.1             DBI_1.1.3                Rcpp_1.0.9              
##  [43] progress_1.2.2           htmlTable_2.4.1          clue_0.3-62             
##  [46] foreign_0.8-83           bit_4.0.4                Formula_1.2-4           
##  [49] htmlwidgets_1.5.4        httr_1.4.4               RColorBrewer_1.1-3      
##  [52] ellipsis_0.3.2           farver_2.1.1             pkgconfig_2.0.3         
##  [55] XML_3.99-0.12            Gviz_1.42.0              nnet_7.3-18             
##  [58] sass_0.4.2               dbplyr_2.2.1             deldir_1.0-6            
##  [61] utf8_1.2.2               labeling_0.4.2           tidyselect_1.2.0        
##  [64] rlang_1.0.6              AnnotationDbi_1.60.0     munsell_0.5.0           
##  [67] tools_4.2.1              cachem_1.0.6             cli_3.4.1               
##  [70] generics_0.1.3           RSQLite_2.2.18           evaluate_0.17           
##  [73] stringr_1.4.1            fastmap_1.1.0            yaml_2.3.6              
##  [76] knitr_1.40               bit64_4.0.5              purrr_0.3.5             
##  [79] KEGGREST_1.38.0          AnnotationFilter_1.22.0  xml2_1.3.3              
##  [82] biomaRt_2.54.0           compiler_4.2.1           rstudioapi_0.14         
##  [85] filelock_1.0.2           curl_4.3.3               png_0.1-7               
##  [88] tweenr_2.0.2             tibble_3.1.8             bslib_0.4.0             
##  [91] stringi_1.7.8            highr_0.9                GenomicFeatures_1.50.0  
##  [94] lattice_0.20-45          ProtGenerics_1.30.0      Matrix_1.5-1            
##  [97] vctrs_0.5.0              pillar_1.8.1             lifecycle_1.0.3         
## [100] BiocManager_1.30.19      jquerylib_0.1.4          GlobalOptions_0.1.2     
## [103] data.table_1.14.4        bitops_1.0-7             R6_2.5.1                
## [106] BiocIO_1.8.0             latticeExtra_0.6-30      bookdown_0.29           
## [109] gridExtra_2.3            codetools_0.2-18         dichromat_2.0-0.1       
## [112] MASS_7.3-58.1            assertthat_0.2.1         rjson_0.2.21            
## [115] withr_2.5.0              GenomeInfoDbData_1.2.9   parallel_4.2.1          
## [118] hms_1.1.2                rpart_4.1.19             rmarkdown_2.17          
## [121] Cairo_1.6-0              biovizBase_1.46.0        ggforce_0.4.1           
## [124] base64enc_0.1-3          interp_1.1-3             restfulr_0.0.15

Bibliography

Busch, Anke, Mirko Brüggemann, Stefanie Ebersberger, and Kathi Zarnack. 2020. “ICLIP Data Analysis: A Complete Pipeline from Sequencing Reads to Rbp Binding Sites.” Methods 178 (June): 49–62. https://doi.org/10.1016/j.ymeth.2019.11.008.

Krakau, Sabrina, Hugues Richard, and Annalisa Marsico. 2017. “PureCLIP: Capturing Target-Specific ProteinRNA Interaction Footprints from Single-Nucleotide Clip-Seq Data.” Genome Biology 18 (1). https://doi.org/10.1186/s13059-017-1364-2.