1 Introduction

Cross-linking immunoprecipitation (CLIP) is a technique that combines UV cross-linking and immunoprecipitation to analyse protein-RNA interactions or to pinpoint RNA modifications (e.g. m6A). CLIP-based methods, such as iCLIP and eCLIP, allow precise mapping of RNA modification sites or RNA-binding protein (RBP) binding sites on a genome-wide scale. These techniques help to unravel post-transcriptional regulatory networks. To make the visualization of CLIP data easier, we developed the cliProfiler package. It includes seven functions that allow users to easily create different profile plots.

The cliProfiler package is available at https://bioconductor.org and can be installed via BiocManager::install:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("cliProfiler")

A package only needs to be installed once. Load the package into an R session with

library(cliProfiler)

2 Requirements for the Data and Annotation File

The input data for all functions in cliProfiler should be a peak calling result or a similar object that represents RBP binding sites or RNA modification positions. These peaks/signals must be stored in a GRanges object. GRanges is an S4 class defined by the GenomicRanges package; it is a container for genomic locations and their associated annotations. For more information about GRanges objects, please see the GenomicRanges package. An example of a GRanges object is shown below:

testpath <- system.file("extdata", package = "cliProfiler")
## loading the test GRanges object
test <- readRDS(file.path(testpath, "test.rds"))
## Show an example of GRanges object
test
## GRanges object with 100 ranges and 0 metadata columns:
##         seqnames              ranges strand
##            <Rle>           <IRanges>  <Rle>
##     [1]    chr17   28748198-28748218      +
##     [2]    chr10 118860137-118860157      -
##     [3]     chr2 148684461-148684481      +
##     [4]     chr2   84602546-84602566      -
##     [5]    chr18     6111874-6111894      -
##     ...      ...                 ...    ...
##    [96]     chr7 127254692-127254712      +
##    [97]     chr2   28833830-28833850      -
##    [98]     chr9   44607255-44607275      +
##    [99]     chr1 133621331-133621351      -
##   [100]     chr4 130316598-130316618      -
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
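If the peaks are available as a plain table rather than as a GRanges object, they need to be converted first. The following is a minimal sketch of one way to do this with GenomicRanges; the table peaks_df and its values are hypothetical and only stand in for the output of a peak caller.

```r
library(GenomicRanges)

## Hypothetical peak table; in practice this might be read from a
## tab-separated file with read.table().
peaks_df <- data.frame(
    chr    = c("chr17", "chr10", "chr2"),
    start  = c(28748198, 118860137, 148684461),
    end    = c(28748218, 118860157, 148684481),
    strand = c("+", "-", "+")
)

## makeGRangesFromDataFrame() recognises the column names
## chr/start/end/strand used above.
peaks_gr <- makeGRangesFromDataFrame(peaks_df)
peaks_gr
```

If the table uses 0-based start coordinates, as BED files do, the starts.in.df.are.0based argument of makeGRangesFromDataFrame() may be needed.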

The annotation file required by the functions exonProfile, geneTypeProfile, intronProfile, spliceSiteProfile and metaGeneProfile should be in gff3 format and downloaded from https://www.gencodegenes.org/. The cliProfiler package includes a test gff3 file.

## the path for the test gff3 file
test_gff3 <- file.path(testpath, "annotation_test.gff3")
## the gff3 file can be loaded by import.gff3 function in rtracklayer package
shown_gff3 <- rtracklayer::import.gff3(test_gff3)
## show the test gff3 file
shown_gff3
## GRanges object with 3068 ranges and 23 metadata columns:
##          seqnames              ranges strand |   source            type
##             <Rle>           <IRanges>  <Rle> | <factor>        <factor>
##      [1]     chr1   72159442-72212307      - |   HAVANA     transcript 
##      [2]     chr1   72212017-72212307      - |   HAVANA     exon       
##      [3]     chr1   72212017-72212111      - |   HAVANA     CDS        
##      [4]     chr1   72212109-72212111      - |   HAVANA     start_codon
##      [5]     chr1   72192043-72192202      - |   HAVANA     exon       
##      ...      ...                 ...    ... .      ...             ...
##   [3064]     chrX 153392866-153392868      + |   HAVANA stop_codon     
##   [3065]     chrX 153237748-153238092      + |   HAVANA five_prime_UTR 
##   [3066]     chrX 153308852-153308924      + |   HAVANA five_prime_UTR 
##   [3067]     chrX 153370845-153370846      + |   HAVANA five_prime_UTR 
##   [3068]     chrX 153392869-153396132      + |   HAVANA three_prime_UTR
##              score     phase                     ID               gene_id
##          <numeric> <integer>            <character>           <character>
##      [1]        NA      <NA>   ENSMUST00000048860.8  ENSMUSG00000039395.8
##      [2]        NA      <NA> exon:ENSMUST00000048..  ENSMUSG00000039395.8
##      [3]        NA         0 CDS:ENSMUST000000488..  ENSMUSG00000039395.8
##      [4]        NA         0 start_codon:ENSMUST0..  ENSMUSG00000039395.8
##      [5]        NA      <NA> exon:ENSMUST00000048..  ENSMUSG00000039395.8
##      ...       ...       ...                    ...                   ...
##   [3064]        NA         0 stop_codon:ENSMUST00.. ENSMUSG00000041649.13
##   [3065]        NA      <NA> UTR5:ENSMUST00000112.. ENSMUSG00000041649.13
##   [3066]        NA      <NA> UTR5:ENSMUST00000112.. ENSMUSG00000041649.13
##   [3067]        NA      <NA> UTR5:ENSMUST00000112.. ENSMUSG00000041649.13
##   [3068]        NA      <NA> UTR3:ENSMUST00000112.. ENSMUSG00000041649.13
##               gene_type   gene_name       level      mgi_id
##             <character> <character> <character> <character>
##      [1] protein_coding        Mreg           2 MGI:2151839
##      [2] protein_coding        Mreg           2 MGI:2151839
##      [3] protein_coding        Mreg           2 MGI:2151839
##      [4] protein_coding        Mreg           2 MGI:2151839
##      [5] protein_coding        Mreg           2 MGI:2151839
##      ...            ...         ...         ...         ...
##   [3064] protein_coding        Klf8           2 MGI:2442430
##   [3065] protein_coding        Klf8           2 MGI:2442430
##   [3066] protein_coding        Klf8           2 MGI:2442430
##   [3067] protein_coding        Klf8           2 MGI:2442430
##   [3068] protein_coding        Klf8           2 MGI:2442430
##                   havana_gene               Parent        transcript_id
##                   <character>      <CharacterList>          <character>
##      [1] OTTMUSG00000049069.1 ENSMUSG00000039395.8 ENSMUST00000048860.8
##      [2] OTTMUSG00000049069.1 ENSMUST00000048860.8 ENSMUST00000048860.8
##      [3] OTTMUSG00000049069.1 ENSMUST00000048860.8 ENSMUST00000048860.8
##      [4] OTTMUSG00000049069.1 ENSMUST00000048860.8 ENSMUST00000048860.8
##      [5] OTTMUSG00000049069.1 ENSMUST00000048860.8 ENSMUST00000048860.8
##      ...                  ...                  ...                  ...
##   [3064] OTTMUSG00000019377.5 ENSMUST00000112574.8 ENSMUST00000112574.8
##   [3065] OTTMUSG00000019377.5 ENSMUST00000112574.8 ENSMUST00000112574.8
##   [3066] OTTMUSG00000019377.5 ENSMUST00000112574.8 ENSMUST00000112574.8
##   [3067] OTTMUSG00000019377.5 ENSMUST00000112574.8 ENSMUST00000112574.8
##   [3068] OTTMUSG00000019377.5 ENSMUST00000112574.8 ENSMUST00000112574.8
##          transcript_type transcript_name transcript_support_level
##              <character>     <character>              <character>
##      [1]  protein_coding        Mreg-201                        1
##      [2]  protein_coding        Mreg-201                        1
##      [3]  protein_coding        Mreg-201                        1
##      [4]  protein_coding        Mreg-201                        1
##      [5]  protein_coding        Mreg-201                        1
##      ...             ...             ...                      ...
##   [3064]  protein_coding        Klf8-202                        1
##   [3065]  protein_coding        Klf8-202                        1
##   [3066]  protein_coding        Klf8-202                        1
##   [3067]  protein_coding        Klf8-202                        1
##   [3068]  protein_coding        Klf8-202                        1
##                                                     tag    havana_transcript
##                                         <CharacterList>          <character>
##      [1]                  basic,appris_principal_1,CCDS OTTMUST00000125321.1
##      [2]                  basic,appris_principal_1,CCDS OTTMUST00000125321.1
##      [3]                  basic,appris_principal_1,CCDS OTTMUST00000125321.1
##      [4]                  basic,appris_principal_1,CCDS OTTMUST00000125321.1
##      [5]                  basic,appris_principal_1,CCDS OTTMUST00000125321.1
##      ...                                            ...                  ...
##   [3064] alternative_5_UTR,basic,appris_principal_1,... OTTMUST00000046245.1
##   [3065] alternative_5_UTR,basic,appris_principal_1,... OTTMUST00000046245.1
##   [3066] alternative_5_UTR,basic,appris_principal_1,... OTTMUST00000046245.1
##   [3067] alternative_5_UTR,basic,appris_principal_1,... OTTMUST00000046245.1
##   [3068] alternative_5_UTR,basic,appris_principal_1,... OTTMUST00000046245.1
##                    protein_id      ccdsid   trans_len exon_number
##                   <character> <character> <character> <character>
##      [1] ENSMUSP00000041878.7 CCDS15032.1        2284        <NA>
##      [2] ENSMUSP00000041878.7 CCDS15032.1        2284           1
##      [3] ENSMUSP00000041878.7 CCDS15032.1        2284           1
##      [4] ENSMUSP00000041878.7 CCDS15032.1        2284           1
##      [5] ENSMUSP00000041878.7 CCDS15032.1        2284           2
##      ...                  ...         ...         ...         ...
##   [3064] ENSMUSP00000108193.2 CCDS30481.1        4752           7
##   [3065] ENSMUSP00000108193.2 CCDS30481.1        4752           1
##   [3066] ENSMUSP00000108193.2 CCDS30481.1        4752           2
##   [3067] ENSMUSP00000108193.2 CCDS30481.1        4752           3
##   [3068] ENSMUSP00000108193.2 CCDS30481.1        4752           7
##                       exon_id
##                   <character>
##      [1]                 <NA>
##      [2] ENSMUSE00000600755.2
##      [3] ENSMUSE00000600755.2
##      [4] ENSMUSE00000600755.2
##      [5] ENSMUSE00000262166.1
##      ...                  ...
##   [3064] ENSMUSE00000692289.2
##   [3065] ENSMUSE00000745002.1
##   [3066] ENSMUSE00000692290.1
##   [3067] ENSMUSE00000253395.2
##   [3068] ENSMUSE00000692289.2
##   -------
##   seqinfo: 19 sequences from an unspecified genome; no seqlengths

The function windowProfile allows users to examine the enrichment of peaks relative to a customized annotation. This customized annotation should be stored in a GRanges object.
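One plausible way to obtain such a customized annotation is to subset a larger GRanges object by a metadata column. The sketch below uses a small hypothetical object, toy_anno, that mimics the type column of an imported gff3 file; it only illustrates the subsetting step and does not show the windowProfile call itself.

```r
library(GenomicRanges)

## Hypothetical annotation that mimics an imported gff3 file.
toy_anno <- GRanges(
    seqnames = "chr1",
    ranges   = IRanges(start = c(100, 100, 400, 100),
                       end   = c(900, 250, 900, 250)),
    strand   = "+",
    type     = c("transcript", "exon", "exon", "five_prime_UTR")
)

## Keep only the features of interest, here the exons.
custom_anno <- toy_anno[toy_anno$type == "exon"]
custom_anno
```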

3 metaGeneProfile

metaGeneProfile() outputs a meta profile, which shows the location of binding sites or modification sites (peaks/signals) along transcript regions (5’UTR, CDS and 3’UTR). The input of this function should be a GRanges object.

Besides the GRanges object, metaGeneProfile requires the path to a gff3 annotation file downloaded from Gencode.

The output of metaGeneProfile is a List object. The first element of the List is a GRanges object with the calculation results, which can be used in different ways later.

meta <- metaGeneProfile(object = test, annotation = test_gff3)
meta[[1]]
## GRanges object with 100 ranges and 5 metadata columns:
##         seqnames              ranges strand |    center    location
##            <Rle>           <IRanges>  <Rle> | <integer> <character>
##     [1]    chr10 118860137-118860157      - | 118860147         CDS
##     [2]     chr2   84602546-84602566      - |  84602556        UTR3
##     [3]    chr18     6111874-6111894      - |   6111884         CDS
##     [4]    chr11   33213145-33213165      - |  33213155        UTR3
##     [5]    chr11   96819422-96819442      - |  96819432         CDS
##     ...      ...                 ...    ... .       ...         ...
##    [96]     chr8   72222842-72222862      + |  72222852          NO
##    [97]    chr18   36648184-36648204      + |  36648194         CDS
##    [98]     chr8 105216021-105216041      + | 105216031        UTR3
##    [99]     chr7 127254692-127254712      + | 127254702        UTR3
##   [100]     chr9   44607255-44607275      + |  44607265        UTR5
##                       Gene_ID         Transcript_ID  Position
##                   <character>           <character> <numeric>
##     [1]  ENSMUSG00000028630.9  ENSMUST00000004281.9  0.674444
##     [2] ENSMUSG00000034101.14  ENSMUST00000067232.9  0.122384
##     [3] ENSMUSG00000041225.16 ENSMUST00000077128.12  0.199836
##     [4] ENSMUSG00000040594.19  ENSMUST00000102815.9  0.159303
##     [5] ENSMUSG00000038615.17  ENSMUST00000107658.7  0.889039
##     ...                   ...                   ...       ...
##    [96]                   Nan                  <NA> 5.0000000
##    [97]  ENSMUSG00000117942.1  ENSMUST00000140061.7 0.1694561
##    [98] ENSMUSG00000031885.14  ENSMUST00000109392.8 0.0457421
##    [99]  ENSMUSG00000054716.4  ENSMUST00000052509.5 0.3978495
##   [100] ENSMUSG00000032097.10  ENSMUST00000217034.1 0.5779817
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths

Here is an explanation of the metadata columns of the output GRanges object:

  • center The center position of each peak. This center position is used for calculating the position of the peak within the assigned genomic region.
  • location The genomic region to which this peak/signal belongs.
  • Gene_ID The gene to which this peak/signal belongs.
  • Transcript_ID The transcript to which this peak/signal is assigned.
  • Position The relative position of each peak/signal within the genomic region. A value close to 0 means the peak is located close to the 5’ end of the genomic feature. A value close to 1 means the peak is located close to the 3’ end of the genomic feature. A value of 5 means the peak cannot be mapped to any annotation.
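The meaning of the Position value can be illustrated with a small base R sketch. The helper relative_position() is hypothetical and not part of cliProfiler; it assumes a single continuous feature and measures the distance from the 5’ end, so the exact values reported by metaGeneProfile may be computed differently.

```r
## Hypothetical helper, not part of cliProfiler: relative position of a
## peak center inside a single feature, measured from the feature's 5' end.
relative_position <- function(center, feat_start, feat_end, strand) {
    span <- feat_end - feat_start
    if (strand == "-") {
        ## On the minus strand the 5' end is the larger coordinate.
        (feat_end - center) / span
    } else {
        (center - feat_start) / span
    }
}

## Center of a 21-nt peak, analogous to the center column.
peak_start <- 1011
peak_end   <- 1031
center     <- (peak_start + peak_end) %/% 2

## The same peak in a feature spanning 1001-1101 on either strand.
pos_plus  <- relative_position(center, 1001, 1101, "+")
pos_minus <- relative_position(center, 1001, 1101, "-")
```

In this toy case the peak lies near the 5’ end on the plus strand and near the 3’ end on the minus strand, which is why the strand has to be taken into account.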

The second element of the List is the meta plot, which is an object of class ggplot. Users can apply any function from ggplot2 to adjust the details of this plot.

library(ggplot2)
## For example, to give the plot a new title
meta[[2]] + ggtitle("Meta Profile 2")

For advanced usage, metaGeneProfile provides two methods to calculate the relative position. The first method returns the relative position of the peak/signal in the genomic feature without the introns. The second method returns the relative position of the peak/signal in the genomic feature with the introns. The parameter include_intron switches between these two methods. If the data come from a poly(A)+ library, we recommend setting include_intron = FALSE.

meta <- metaGeneProfile(object = test, annotation = test_gff3, 
                        include_intron = TRUE)
meta[[2]]
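The difference between the two methods can be sketched in base R with a toy feature made of two exons and one intron on the plus strand. All coordinates are hypothetical, and the arithmetic only illustrates the idea; it is not taken from the cliProfiler implementation.

```r
## Toy feature on the plus strand: two exons separated by one intron.
exon_start <- c(101, 401)
exon_end   <- c(200, 500)
center     <- 451   # peak center, located inside exon 2

## With introns: position along the full genomic span of the feature.
span_start <- min(exon_start)
span_end   <- max(exon_end)
pos_with_intron <- (center - span_start) / (span_end - span_start)

## Without introns: position along the spliced, exon-only sequence.
exon_len <- exon_end - exon_start + 1
hit      <- which(center >= exon_start & center <= exon_end)
upstream <- sum(exon_len[seq_len(hit - 1)])   # exonic bases before the hit exon
offset   <- upstream + (center - exon_start[hit])
pos_without_intron <- offset / (sum(exon_len) - 1)

c(with_intron = pos_with_intron, without_intron = pos_without_intron)
```

Because the intron contributes to the length only in the first calculation, the same peak receives a different relative position depending on the method.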

The group option allows users to make a meta plot with multiple conditions. Here is an example:

test$Treat <- c(rep("Treatment 1",50), rep("Treatment 2", 50))
meta <- metaGeneProfile(object = test, annotation = test_gff3, 
                        group = "Treat")
meta[[2]]
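When the conditions are stored in separate GRanges objects, a grouping column can be added before the objects are combined. This is a minimal sketch with two hypothetical peak sets, cond_a and cond_b; the column name Treat is reused from the example above.

```r
library(GenomicRanges)

## Two hypothetical peak sets, e.g. from two experimental conditions.
cond_a <- GRanges("chr1", IRanges(start = c(100, 500), width = 21),
                  strand = "+")
cond_b <- GRanges("chr1", IRanges(start = c(300, 900), width = 21),
                  strand = "-")

## Label each set, then combine them into one object.
cond_a$Treat <- "Condition A"
cond_b$Treat <- "Condition B"
combined <- c(cond_a, cond_b)

table(combined$Treat)
```

The combined object could then be passed as object to metaGeneProfile together with group = "Treat".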