Contents

1 Installation

EventPointer can be installed from Bioconductor using the BiocManager package:

library(BiocManager)

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install("EventPointer")

2 Introduction

EventPointer R package provides users a simplified way to identify, classify and visualize alternative splicing events. The steps required by the algorithm are almost identical for both technologies. The algorithm only differs in its inital step.

Figure 1. Definition of alternative splicing events for EventPointer

Figure 1. Definition of alternative splicing events for EventPointer

Figure 2 shows each of the main steps in a graphical layout.

This vignette is divided in two sections. In the first one, the complete analysis for junction arrays is described and the second one describes the analysis for RNA-Seq data.

To cite EventPointer:

Figure 2. Overview of EventPointer

Figure 2. Overview of EventPointer

3 Analysis of junction arrays

3.1 Overview of junction arrays

EventPointer is prepared to work with different Affymetrix arrays, such as: HTA 2.0, Clariom-D, RTA and MTA.

To build the CDF file (to work under the aroma.affymetrix framework), EventPointer requires a GTF file with the reference transcriptome information. In case not provided, the algorithm automatically downloads the required information from different databases such as ENSEMBL or UCSC. As the probes for the HTA 2.0 array are mapped to the HG19 genome, the latests versions of the ensembl and ucsc genome, mapped to the hg19 version, will be downloaded. For the other arrays, the following genomes are used: ClariomD = GRCh38, MTA = mm10 and RTA = rn6.

The required files are:

  1. Exon probes genomic position (Tab separated .txt file)
  2. Junction probes genomic position (Tab separated .txt file)
  3. Reference transcriptome (GTF file)

Files 1 and 2 contain probe information for the array. These files and the corresponding CDF files can be downloaded from the following URL: EventPointer Dropbox

The format of these files is briefly explained in the following paragraphs:

For the Exon Probes, the file corresponds to a tab separated .txt file composed of 11 columns that include: Probe Id, X coordinate in the array, Y coordinate in the array, Transcript Cluster Id, Probeset Id, Probeset name, Probe sequence, chromosome, start, end and strand.

The Junction probes file is also a tab separated .txt composed of 10 columns: Probe Id, X coordinate in the array, Y coordinate in the array, Transcript Cluster Id, Probeset Id, Probeset name, Probe sequence, chromosome and probe alignments.

For the GTF the standard format is used. (For more information see GTF)

This vignette uses the probes and annotation file for the DONSON gene in order to run the examples in a short amount of time. To generate the corresponding CDF file for the whole genome, the files from the Dropbox link must be used.

Note: It is advisable to work with reference GTF files, as the probes are annotated to them. However, if other database is used, the algorithm will only include probes that are mapped to such database.

3.2 CDF file creation

This step can be skipped if the corresponding CDF file is doownloaded from the Dropbox link. The available CDF files were generated using the GTF files for each of the arrays, if users want to generate a CDF for other databases (Ensembl or UCSC), this step should be used.

The function CDFfromGTF generates the CDF file used afterwards in the aroma.affymetrix pre-processing pipeline. Internally, it calls flat2cdf function written by Elizabeth Purdom. More information about it can be seen in the following webpage: Create CDF from scratch

The required input for the function is described below:

  • input : Reference transcriptome. Available options are : “Ensembl”,“UCSC” , “AffyGTF” and “CustomGTF”.
  • inputFile: If input is “AffyGTF” or “CustomGTF”, inputFile should point to the GTF file to be used.
  • PSR: Path to the Exon probes txt file
  • Junc: Path to the Junction probes txt file
  • PathCDF: Directory where the output will be saved
  • microarray: Microarray used to create the CDF file. Must be one of: “HTA-2_0”, “ClariomD”, “RTA” or “MTA”.

This function takes a couple of hours to complete (depending on the computer), and creates the following files:

  1. EventsFound.txt : Tab separated file with all the information of all the alternative splcing events found.
  2. .flat file : Used to build the corresponding CDF file.
  3. .CDF file: Output required for the aroma.affymetrix preprocessing pipeline.

The following code chunks show examples on how to run the function using part of the Affymetrix GTF and the example data included in the package. This data corresponds to the Affymetrix HTA 2.0 GTF representing only the DONSON gene and the probes that are mapped to it.

Using Affymetrix GTF as reference transcriptome


# Set input variables
   PathFiles<-system.file("extdata",package="EventPointer")
   DONSON_GTF<-paste(PathFiles,"/DONSON.gtf",sep="")
   PSRProbes<-paste(PathFiles,"/PSR_Probes.txt",sep="")
   JunctionProbes<-paste(PathFiles,"/Junction_Probes.txt",sep="")
   Directory<-tempdir()
   array<-"HTA-2_0"
   
# Run the function

   CDFfromGTF(input="AffyGTF",inputFile=DONSON_GTF,
              PSR=PSRProbes,Junc=JunctionProbes,
              PathCDF=Directory,microarray=array)
## Creating SG Information...
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
## 
## Reading Information On Probes...Done
## Indexing Genes and Probes...Done
## Finding Events...
## 
  |                                                                         
  |                                                                   |   0%
  |                                                                         
  |===================================================================| 100%
## 
## Creating .flat ...DoneReading TXT file ... Done.
## Splitting TXT file into units ... split into 1 initial chunks ... unwrapped into 8 chunks ... Done.
## Creating structure for 8 units (dot=250):
## 
## Writing CDF file...
##   Pathname: /tmp/Rtmp7nKw2m/HTA-2_0.cdf
##   Writes CDF header and unit names...
##   Writes CDF header and unit names...done
##   Writes QC units...
##     Units left: 0
##   Writes QC units...done
##   Writes 8 units...
##     Units left: 0
##   Writes 8 units...done
## Writing CDF file...done

Note: Both the .flat and .CDF file take large amounts of space in the hard drive, it is recommended that the directory has at least 1.5 GB of free space.

Figure 3 shows a screen shot with the output of the CDFfromGTF function

Figure 3. Output of CDFfromGTF

Figure 3. Output of CDFfromGTF

Once the file is created, the name of the cdf file can be changed. Internally, the algorithm gives the name as HTA-2_0, but the actual name of the file can be different. In the rest of the vignette, we have renamed our CDF file as EP_HTA-2_0.

Once the CDF file has been created, it can be used for the analysis of different experiments.

3.3 Statistical Analysis

3.3.1 aroma.affymetrix pre-processing pipeline

For microarray data, a pre-processing pipeline must be applied to the .cel files of the experiment. Usually this pre-processing is done using the aroma.affymetrix R package. This procedure normalizes and summarizes the expression of the different probesets into single values.

The aroma.affymetrix R package provides users different functions to work with affymetrix arrays. The functions are used to extract all the information contained in the .cel files and to do all the required pre-processing steps such as background correction, normalization and summarization. The package requires a certain folder structure in order to work correctly. For more information about aroma.affymetrix visit the webpage:aroma project

The following code chunk displays the pipeline used to get the results required by EventPointer after the pre-processing using aroma.affymetrix. The code is not intended to be a runable example, but just to show users the settings and functions that can be used. In order for users to have a runable example, the corrrespoding folder structure and files are required.

# Simple example of Aroma.Affymetrix Preprocessing Pipeline

verbose <- Arguments$getVerbose(-8);
timestampOn(verbose);
projectName <- "Experiment"
cdfGFile <- "EP_HTA-2_0,r"
cdfG <- AffymetrixCdfFile$byChipType(cdfGFile)
cs <- AffymetrixCelSet$byName(projectName, cdf=cdfG)
bc <- NormExpBackgroundCorrection(cs, method="mle", tag=c("*","r11"));
csBC <- process(bc,verbose=verbose,ram=20);
qn <- QuantileNormalization(csBC, typesToUpdate="pm");
csN <- process(qn,verbose=verbose,ram=20);
plmEx <- ExonRmaPlm(csN, mergeGroups=FALSE)
fit(plmEx, verbose=verbose)
cesEx <- getChipEffectSet(plmEx)
ExFit <- extractDataFrame(cesEx, addNames = TRUE)

3.3.2 EventPointer function

EventPointer is the main function of the algorithm

The function requires the following parameters:

  • Design : The design matrix for the experiment.
  • Contrast: The contrast matrix for the experiment.
  • ExFit: [aroma.affymetrix] pre-processed variable after using extractDataFrame(affy, addNames=TRUE)
  • Eventstxt: Path to the EventsFound.txt file generated by CDFfromGTF function.
  • Filter: Boolean variable to indicate if an expression filter is applied.
  • Qn: Quantile used to filter the events (Bounded between 0-1, Q1 would be 0.25).
  • Statistic: Statistical test to identify differential splicing events, must be one of : “LogFC”,“Dif_LogFC” and “DRS”.
  • PSI: Boolean variable to indicate if \(\Delta \Psi\) should be calculated for every splicing event.

If the Filter variable is TRUE, the following is performed:

The summarized expression value of all the reference paths is obtained and the threshold is set depending on the Qn value used.

An event is considered if at least one sample in all paths is expressed above the threshold.

The rest of the events are not shown unless the Filter variable is set to FALSE

   data(ArraysData)

   Dmatrix<-matrix(c(1,1,1,1,0,0,1,1),nrow=4,ncol=2,byrow=FALSE)
   Cmatrix<-t(t(c(0,1)))
   EventsFound<-paste(system.file("extdata",package="EventPointer"),"/EventsFound.txt",sep="")
   
   Events<-EventPointer(Design=Dmatrix,
                      Contrast=Cmatrix,
                      ExFit=ArraysData,
                      Eventstxt=EventsFound,
                      Filter=FALSE,
                      Qn=0.25,
                      Statistic="LogFC",
                      PSI=TRUE)
## 10:28:11 PM Running EventPointer:  
##  ** Statistical Analysis: Logarithm of the fold change of both isoforms 
##  ** Delta PSI will be calculated 
##  ** No expression filter 
##             ---------------------------------------------------------------- 
##  ** Calculating PSI...Done 
##  ** Running Statistical Analysis...Done 
## 
##  10:28:11 PM  Analysis Completed

Table 1 displays the output of EventPointer function

(#tab:EP_Arrays_Res_Table)Table 1: EventPointer Arrays results
Gene name Event Type Genomic Position Splicing Z Value Splicing Pvalue Delta PSI
TC21001058.hg_8 TC21001058.hg Alternative 3’ Splice Site 21:34957032-34958284 6.86631 0.0000 0.00000
TC21001058.hg_6 TC21001058.hg Complex Event 21:34950750-34953608 6.33338 0.0000 -0.09861
TC21001058.hg_9 TC21001058.hg Alternative Last Exon 21:34951868-34956896 6.08929 0.0000 -0.44545
TC21001058.hg_10 TC21001058.hg Complex Event 21:34955972-34958284 -5.03597 0.0000 0.04857
TC21001058.hg_4 TC21001058.hg Complex Event 21:34955972-34958284 1.43180 0.1522 0.00000

The columns of the data.frame are:

  • Gene name : Gene identifier
  • Event Type: Type of alternative splicing event (Cassette, Alternative 3’,Alternative 5’, Alternative Last Exon, Alternative First Exon, Mutually Exclusive Exons or Complex Event)
  • Genomic Position: Genomic Coordinates for the event
  • Splicing Z Value: Corresponding Z value for the statistical test performed
  • Splicing P Value: Corresponding P-value for the statistical test performed
  • Delta PSI: \(\Delta \Psi\) parameter for each event

3.4 IGV visualization

EventPointer creates two different GTF files to visualize the alternative splicing events. Figure 4 displays the cassette exon for the COPS7A gene (4th ranked event in Table 1). In the IGV visualization, the reference path is shown in gray color, the path 1 in red and path 2 in green. Below the transcripts, the different probes that are measuring each of the paths are represented in the same colors.

Figure 4. EventPointer visualization using IGV

Figure 4. EventPointer visualization using IGV

To create the GTF files, the algorithm uses the EventPointer_IGV functions with the following parameters:

  • Events : Data.frame generated by EventPointer with the events to be included in the GTF file.
  • input: Reference transcriprome. Must be one of: “Ensembl”, “UCSC” , “AffyGTF” or “CustomGTF”.
  • inputFile: If input is “AffyGTF” or “CustomGTF”, inputFile should point to the GTF file to be used.
  • PSR: Path to the Exon probes txt file.
  • Junc: Path to the Junction probes txt file.
  • PathGTF: Directory where to write the GTF files.
  • EventsFile: Path to EventsFound.txt file generated with CDFfromGTF function.
  • microarray: Microarray used to create the CDF file. Must be one of: HTA-2_0, ClariomD, RTA or MTA

The inital process of the function is slow, as the splicing graphs must be created every time the function is executed. A progress bar is displayed to the user to inform about the progress of the function.

Once the process is completed two GTF files are generated in the specified directory:

  1. paths.gtf : GTF file representing the alternative splicing events.
  2. probes.gtf : GTF file representing the probes that measure each event and each path.
Figure 5. GTF files generated with EventPointer_IGV

Figure 5. GTF files generated with EventPointer_IGV


# Set Input Variables
  
   DONSON_GTF<-paste(PathFiles,"/DONSON.gtf",sep="")
   PSRProbes<-paste(PathFiles,"/PSR_Probes.txt",sep="")
   JunctionProbes<-paste(PathFiles,"/Junction_Probes.txt",sep="")
   Directory<-tempdir()
   EventsFound<-paste(system.file("extdata",package="EventPointer"),"/EventsFound.txt",sep="")
   array<-"HTA-2_0"

                      
# Generate Visualization files  

 EventPointer_IGV(Events[1,,drop=FALSE],"AffyGTF",DONSON_GTF,PSRProbes,JunctionProbes,Directory,EventsFound,array)
## Creating SG Information...
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
## 
## Reading Information On Probes...Done
## Indexing Genes and Probes...Done
##  Generating GTF Files...
  |                                                                         
  |                                                                   |   0%
  |                                                                         
  |===================================================================| 100%

3.5 Domain Enrichment

The EventPointer paper states that the algorithm provides a function to identify the protein domains that are affected by alternative splicing using an enrichment analysis.

As we have updated the algorithm, we are currently working on the implementation of this feature in the newer version. As soon as it is implemented, it will be uploaded along with the vignette to show its use.

4 RNA-Seq analysis

EventPointer has two pipelines for RNA-Seq analysis: Analysis from BAM files and analysis from a transcriptome reference. These two pipelines are described in section 4.1 and 4.2.

4.1 Analysis from BAM files

4.1.1 Overview of RNA-Seq

EventPointer is also able to identify alternative splicing events from RNA-Seq data. The only required files are the corresponding .BAM files from the experiment.

Each time an experiment is analyzed with EventPointer, the complete process needs to be executed which may cause long waiting times to get the results. To avoid this issue, we have divided every step of the algorithm in different functions so as the user can reuse previous result and thus reduce computational time.

For the examples in this part of the vignette, we will use .bam files depicted in the SGSeq vignette that correspond to paired-end RNA-seq data from four tumor and four normal colorectal samples, which are part of a data set published in Seshagiri et al. 2012. As stated in SGSeq vignette the bam files only include reads mapping to a single gene of interest (FBXO31).

Note: For sequencing data, there are two requirements for the BAM files in order to get EventPointer working correctly:

  1. The BAM files should include the XS-flag, the flag can be included in the files when running the alignment. Most of the available software has parameters to incude the flag. For example, in the case of STAR the flag –outSAMattributes XS must be included when mapping

  2. All files to be analyzed should have the corresponding index files (.bai) in the same directory as the BAM files. Create the index before running EventPointer.

4.1.2 BAM Preparation

The first step to analyze alternative splicing events in RNA-Seq data, is the creation of the splicing graphs. This step relies entirely on SGSeq R package.

The function PrepareBam_EP transforms all the information found in the bam files into splicing graph features and counts

  • Samples : Name of the .bam files to be analyzed (Sample1.bam,Sample2.bam,…,etc)
  • SamplePath: Path where the bam files are stored
  • Ref_Transc: Reference transcriptome used to name the genes found. Options are: “Ensembl”,“UCSC” or “GTF”.
  • fileTransc: Path to the GTF reference transcriptome if Ref_Transc is “GTF”.
  • cores: Number of cores used for parallel processing
  • Alpha: Internal SGSeq parameter to include or exclude regions (See Advanced Use)
# Obtain the samples and directory for .bam files

# the object si contains example sample information from the SGSeq R package 
# use ?si to see the corresponding documentation 
   
   BamInfo<-si
   Samples<-BamInfo[,2]
   PathToSamples <- system.file("extdata/bams", package = "SGSeq")
   PathToGTF<-paste(system.file("extdata",package="EventPointer"),"/FBXO31.gtf",sep="")

  # Run PrepareBam function
   SG_RNASeq<-PrepareBam_EP(Samples=Samples,
                            SamplePath=PathToSamples,
                            Ref_Transc="GTF",
                            fileTransc=PathToGTF,
                            cores=1)

The output of PrepareBam_EP function is a SGFeaturesCounts object, for more information check SGSeq Vignette. Briefly the SGFeaturesCounts contains a GRanges object with all the required elements to create the different splicing graphs found in the given samples. It also contains the number of counts associated with each element of the splicing graph.

4.1.3 Event Detection

After running PrepareBam_EP, we have all the elements to analyze each of the splicing graphs. The next step is to identify and classify all the events, that are present in the BAM files.

For this purpose, the function EventDetection is used.

  • Input : Output of the PrepareBam_EP function
  • cores: Number of cores used for parallel processing
  • Path: Directory where to write the EventsFound.txt file
  # Run EventDetection function
   data(SG_RNASeq)
   TxtPath<-tempdir()
   AllEvents_RNASeq<-EventDetection(SG_RNASeq,cores=1,Path=TxtPath)
## 
##  Obtaining Events
  |                                                                         
  |                                                                   |   0%
  |                                                                         
  |===========                                                        |  17%
  |                                                                         
  |======================                                             |  33%
  |                                                                         
  |==================================                                 |  50%
  |                                                                         
  |=============================================                      |  67%
  |                                                                         
  |========================================================           |  83%
  |                                                                         
  |===================================================================| 100%

This function retireves a list with all the events found for all the genes present in the experiment. It also generates a file called EventsFound_RNASeq.txt with the information for every detected event.

The list is organized in the following way:

Events[[i]][[j]]

The list will have as many \(i\) values as genes and \(j\) values as many events detected for the \(i_{th}\) gene. In other words, the command above will display the \(j_{th}\) event detected for the \(i_{th}\) gene.

4.1.4 Statistical Analysis

The statistical analysis of the alternative splicing events is done in exactly the same way as for junction arrays. With the Design and Contrast matrices, the algorithm gives the statistical significance and \(\Delta \Psi\).

The function for the statistical analysis using EventPointer method.

  • Events : Output from EventDetection function
  • Design: The design matrix for the experiment.
  • Contrast: The contrast matrix for the experiment.
  • Statistic: Statistical test to identify differential splicing events, must be one of : “LogFC”,“Dif_LogFC” and “DRS”.
  • PSI: Boolean variable to indicate if \(\Delta \Psi\) should be calculated for every splicing event.

The algorithm displays the different parameters that are selected to perform the analysis.

Following our example, the code chunk to obtain the results:

   Dmatrix<-matrix(c(1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1),ncol=2,byrow=FALSE)
   Cmatrix<-t(t(c(0,1)))
   Events <- EventPointer_RNASeq(AllEvents_RNASeq,Dmatrix,Cmatrix,Statistic="LogFC",PSI=TRUE)
## 10:28:16 PM Running EventPointer:  
##  ** Statistical Analysis: Logarithm of the fold change of both isoforms 
##  ** Delta PSI will be calculated 
##             ---------------------------------------------------------------- 
##  ** Calculating PSI...Done
##  Analysis Finished
##  Done 
## 
##  10:28:16 PM  Analysis Completed

Table 2 displays the output of EventPointer function

(#tab:EP_RNASeq_Res_Table)Table 2: EventPointer RNASeq results
Gene Event_Type Position Pvalue Zvalue Delta PSI
3_17 TC16001330.hg Alternative Last Exon 16:87423454-87445125 0.03439 2.11544 -1.00000
3_6 TC16001330.hg Complex Event 16:87377272-87380780 0.09905 1.64946 -0.02136
3_2 TC16001330.hg Alternative 3’ Splice Site 16:87369063-87369767 0.10470 1.62248 -0.01995
3_7 TC16001330.hg Complex Event 16:87369867-87377343 0.11808 -1.56287 0.00626
3_14 TC16001330.hg Cassette Exon 16:87380856-87393901 0.17744 1.34868 -0.23687

4.1.5 IGV visualization

EventPointer creates one GTF file that can be loaded into IGV to visualize the alternative splicing events. Figure 6 displays an example result showed in IGV (5th ranked event in Table 2). Also, in the figure a reference transcriptome is displayed (blue track), and it can be seen that the displayed event corresponds to a novel event discovered with sequencing data and that it will not be detected using junction arrays.

Figure 6. EventPointer visualization using IGV

Figure 6. EventPointer visualization using IGV

To create the GTF files, the algorithm uses the following code.

  • Events : Data.frame generated by EventPointer_RNASeq with the events to be included in the GTF file.
  • SG_RNASeq: Output from PrepareBam_EP function. Contains splicing graphs components.
  • EventsTxt: Path to EventsFound.txt file generated with EventDetection function
  • PathGTF: Directory where to write the GTF files.

A progress bar is displayed to the user to inform about the progress of the function.

Once the process is completed the GTF file is generated in the specified directory:

  • paths_RNASeq.gtf : GTF file representing the alternative splicing events.

   # IGV Visualization
   EventsTxt<-paste(system.file("extdata",package="EventPointer"),"/EventsFound_RNASeq.txt",sep="")
   PathGTF<-tempdir()
   EventPointer_RNASeq_IGV(Events,SG_RNASeq,EventsTxt,PathGTF)
## 
##  Generating GTF Files...
  |                                                                         
  |                                                                   |   0%
  |                                                                         
  |====                                                               |   6%
  |                                                                         
  |========                                                           |  12%
  |                                                                         
  |============                                                       |  18%
  |                                                                         
  |================                                                   |  24%
  |                                                                         
  |====================                                               |  29%
  |                                                                         
  |========================                                           |  35%
  |                                                                         
  |============================                                       |  41%
  |                                                                         
  |================================                                   |  47%
  |                                                                         
  |===================================                                |  53%
  |                                                                         
  |=======================================                            |  59%
  |                                                                         
  |===========================================                        |  65%
  |                                                                         
  |===============================================                    |  71%
  |                                                                         
  |===================================================                |  76%
  |                                                                         
  |=======================================================            |  82%
  |                                                                         
  |===========================================================        |  88%
  |                                                                         
  |===============================================================    |  94%
  |                                                                         
  |===================================================================| 100%

4.2 Analysis From Transcriptome Reference

The idea of this pipeline is to apply a statistical analysis to the alternative splicing by using quantification data obtained from kallisto. For this purpose we have to follow the pipeline described in figure 7.