IsoformSwitchAnalyzeR

Enabling Identification and Analysis of Isoform Switches with Functional Consequences from RNA-sequencing data

Kristoffer Vitting-Seerup

2018-04-30

Abstract

Recent breakthroughs in bioinformatics now allow us to accurately reconstruct and quantify full-length gene isoforms from RNA-sequencing data (via tools such as Cufflinks, Kallisto and Salmon). These tools make it possible to analyzing alternative isoform usage, but unfortunatly this is rarely done. Thus, RNA-se data is typically not used to its full potential.

To solve this problem we developed IsoformSwitchAnalyzeR. IsoformSwitchAnalyzeR is an easy-to use-R package that enables statistical identification of isoform switching from RNA-seq derived quantification of novel and/or annotated full-length isoforms. IsoformSwitchAnalyzeR facilitates integration of many sources of (predicted) annotation including features, including Open Reading Frame (ORF), protein domains (via Pfam), signal peptides (via SignalP), coding potential (via CPAT) and sensitivity to Non-sense Mediated Decay (NMD). The combination of identified isoform switches and their annotation also enables IsoformSwitchAnalyzeR to predict potential functional consequences of the identified isoform switches — such as loss of protein domains or coding potential — thereby identifying isoform switches of particular interest. Lastly, IsoformSwitchAnalyzeR provides article-ready visualization methods for isoform switches, and summary statistics describing the genome-wide occurences of isoform switches, their consequences as well as the associated alternative splicing.

In summary, IsoformSwitchAnalyzeR enables analysis of RNA-seq data with isoform resolution with a focus on isoform switching (with predicted consequences), thereby expanding the usability of RNA-seq data.


Table of Content

Preliminaries

What To Cite (please remember)

Quick Start

Detailed Workflow

Analyzing Alternative Splicing (new)

Other workflows

Frequently Asked Questions, Problems and Errors

Final Remarks

Sessioninfo


Preliminaries

Background and Package Description

The usage of Alternative Transcription Start sites (aTSS), Alternative Splicing (AS) and alternative Transcription Termination Sites (aTTS) are collectively collectively results in the production of different isoforms. Alternative isoforms are widely used as recently demonstrated by The ENCODE Consortium, which found that on average, 6.3 different transcripts are generated per gene; a number whic may var considerably per gene.

The importance of analyzing isoforms instead of genes has been highlighted by many examples showing functionally important changes. One of these examples is the pyruvate kinase. In normal adult homeostasis, cells use the adult isoform (M1), which supports oxidative phosphorylation. However, almost all cancer cells use the embryonic isoform (M2), which promotes aerobic glycolysis, one of the hallmarks of cancer. Such shifts in isoform usage are termed ‘isoform switching’ and cannot be detected at when only analyzing data on gene level.

Tools such as Cufflinks, Salmon and Kallisto allows for reconstruction and quantification of full-length transcripts from RNA-seq data. Such data has the potential to facilitate genome-wide analysis of alternative isoform usage and identification of isoform switching — but unfortunately these types of analyses are still only rarely done; most analyses are on gene level only.

We hypothesize that there are multiple reasons why RNA-seq data is not used to its full potential:

  1. There is still a lack of tools that can identify isoform switches with isoform resolution
  2. Although there are many excellent tools to perform sequence analysis, there is no common framework which allows for integration of the analysis provided by these tools.
  3. There is a lack of tools facilitating easy and article-ready visualization of isoform switches.

To solve all these problems, we developed IsoformSwitchAnalyzeR.

IsoformSwitchAnalyzeR is an easy-to-use R package that enables the user to import (novel) full-length derived isoforms from an RNA-seq experiment into R. If annotated transcripts are analyzed, IsoformSwitchAnalyzeR offers integration with the multi-layer information stored in a GTF file including the annotated coding sequences (CDS). If transcript structures are predicted (either de-novo or guided) IsoformSwitchAnalyzeR offers an accurate tool for identifying the dominant ORF of the isoforms. The knowledge of isoform positions for the CDS/ORF allows for prediction of sensitivity to Nonsense Mediated Decay (NMD) — the mRNA quality control machinery that degrades isoforms with pre-mature termination codons (PTC).

IsoformSwitchAnalyzeR facilitates identification of isoform switches via newly developed statistical methods that tests each individual isoform for differential usage and thereby identifies the exact isoforms involved in an isoform switch.

Since we know the exon structure of the full-length isoform, IsoformSwitchAnalyzeR can extract the underlying nucleotide sequence from a reference genome. This enables integration with the Coding Potential Assessment Tool (CPAT) which predicts the coding potential of an isoform and can be used to increase accuracy of ORF predictions. By combining the CDS/ORF isoform positions with the nucleotide sequence, we can extract the most likely amino acid sequence of the CDS/ORF. The amino acid sequence enables integration of analysis of protein domains (via Pfam) and signal peptides (via SignalP) — both supported by IsoformSwitchAnalyzeR. Lastly, since the structures of all expressed isoforms from a given gene are known, one can also annotate alternative splicing - including retentions - a functionality also implemented in IsoformSwitchAnalyzeR.

Thus, in summary, IsoformSwitchAnalyzeR enables annotation of isoforms with intron retention, ORF, NMD sensitivity, coding potential, protein domains and signal peptides (and many more), resulting in the ability to predict important functional consequences of isoform switches.

IsoformSwitchAnalyzeR contains tools that allow the user to create article-ready visualizations of:

  1. Individual isoform switches
  2. Genome-wide analysis of isoform switches and their predicted consequences
  3. Genome-wide analusis of alternative splicing.

These visualizations are easy to understand and integrate all information gathered throughout the workflow. Example of visualizations can be found in the Examples of Switch Visualizations section.

Lastly IsoformSwitchAnalyzeR is based on standard Bioconductor classes such as GRanges and BSgenome. Thus, it supports all species- and annotation versions facilitated by the Bioconductor annotation packages.

Back to Table of Content.

Installation

IsoformSwitchAnalyzeR is part of the Bioconductor repository and community which means it is distributed with, and dependent on, Bioconductor. Installation of IsoformSwitchAnalyzeR is easy and can be done from within the R terminal. If it is the first time you use Bioconductor, simply copy-paste the following into an R session to install the basic Bioconductor packages:

source("http://bioconductor.org/biocLite.R")
biocLite()

If you already have installed Bioconductor, running these two commands will check whether updates for installed packages are available.


After you have installed the basic Bioconductor packages you can install IsoformSwitchAnalyzeR by copy-pasting the following into an R session:

source("http://bioconductor.org/biocLite.R")
biocLite("IsoformSwitchAnalyzeR")

This will install the IsoformSwitchAnalyzeR package as well as other R packages that are needed for IsoformSwitchAnalyzeR to work.


If you need to install from the developmental branch of Bioconductor you need to specify that - note that this is for advanced uses and should not be done unless you have good reason to:

source("http://bioconductor.org/biocLite.R")
useDevel(devel = TRUE)
biocLite("IsoformSwitchAnalyzeR")


How To Get Help

This R package comes with plenty of documentation. Much information can be found in the R help files (which can easily be accessed by running the following command in R “?functionName”, for example “?isoformSwitchTest”). Furthermore, this vignette contains a lot of information and will be continously updated, so make sure to read both sources carefully as it contains the answers to the most Frequently Asked Questions, Problems and Errors.

If you want to report a bug/error (found in the newest version of the R package!) please make an issue with a reproducible example at github — remember to add the appropriate label.

If you have unanswered questions or comments regarding IsoformSwitchAnalyzeR, please post them on the associated google group (plase make sure the question was not already answered there).

If you have suggestions for improvements, please put them on github. This will allow other people to upvote your idea, thereby showing us there is wide support of implementing your idea.

Back to Table of Content.


What To Cite

The analysis performed by IsoformSwitchAnalyzeR is only possible due to a string of other tools and scientific discoveries — please read this section thoroughly and cite the appropriate articles to give credit where credit is due.

If you are using the


Refrences:

  1. Vitting-Seerup et al. The Landscape of Isoform Switches in Human Cancers. Cancer Res. (2017)
  2. Ferguson et al. P-value calibration for multiple testing problems in genomics. Stat. Appl. Genet. Mol. Biol. 2014, 13:659-673.
  3. Nowicka et al. DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Research, 5(0), 1356.
  4. Vitting-Seerup et al. spliceR: an R package for classification of alternative splicing and prediction of coding potential from RNA-seq data. BMC Bioinformatics 2014, 15:81.
  5. Weischenfeldt et al. Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome Biol 2012, 13:R35
  6. Huber et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods, 2015, 12:115-121.
  7. Wang et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 2013, 41:e74.
  8. Finn et al. The Pfam protein families database. Nucleic Acids Research (2014) Database Issue 42:D222-D230
  9. Petersen et al. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature Methods, 8:785-786, 2011
  10. Soneson et al. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research 4, 1521 (2015).
  11. Robinson et al. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology (2010)


Quick Start

Workflow Overview

The idea behind IsoformSwitchAnalyzeR is to make it easy to do advanced post-analysis of full-length RNA-seq derived transcripts with a focus on finding, annotating and visualizing isoform switches with functional consequences. In the switch analysis workflow (see Analyzing Alternative Splicing for the alternative splicing workflow) IsoformSwitchAnalyzeR performs three high-level tasks:

  • Identification of isoform switches.
  • Annotation of the transcripts involved in the isoform switches.
  • Visualization of predicted consequences of the isoform switches, for indivsual switches and globally.

A normal workflow for identification and analysis of isoform switches with functional consequences can be divided into two parts (also illustrated below in Figure 1).


Part 1) Extract Isoform Switches and Their Sequences. This part includes importing the data into R, identifying isoform switches, annotating those switches with open reading frames (ORF) and extracting the nucleotide and amino acid (peptide) sequences. The latter step enables the usage of external sequence analysis tools such as

  • CPAT : The Coding-Potential Assessment Tool, which can be run either locally or via their webserver.
  • Pfam : Prediction of protein domains, which can be run either locally or via their webserver.
  • SignalP : Prediction of Signal Peptides, which can be run either locally or via their webserver.

All of the above steps are performed by the high-level function:

isoformSwitchAnalysisPart1()

See below for example of usage, and Detailed Workflow for details on the individual steps.


Part 2) Plot All Isoform Switches and Their annotation. This part involves importing and incorporating the results of the external sequence analysis, identifying intron retention, predicting functional consequences and plotting i) all genes with isoform switches and ii) summaries of general consequences of switching.

All of this can be done using the function:

isoformSwitchAnalysisPart2()

See below for usage example, and Detailed Workflow for details on the individual steps.


Alternatively if one does not plan to incorporate external sequence analysis, it is possible to run the full workflow using:

isoformSwitchAnalysisCombined()

This corresponds to running isoformSwitchAnalysisPart1() and isoformSwitchAnalysisPart2() without adding the external results.

Figure 1: Workflow overview. Grey transparent boxes indicate the two parts of a normal workflow for analyzing isoform switches. The individual steps in the two sub-workflows are indicated by arrows. Speech bubbles summarize how this full analysis can be done in a two-step process using the high-level functions (isoformSwitchAnalysisPart1() and isoformSwitchAnalysisPart2()).

Back to Table of Content.


Short Example Workflow

As indicted above, a full, but less customizable, analysis of isoform switches can be done using the two high-level functions isoformSwitchAnalysisPart1() and isoformSwitchAnalysisPart2(). This section aims to show how these functions are used as well as illustrate what IsoformSwitchAnalyzeR can be used for.

Lets start by loading the R package

library(IsoformSwitchAnalyzeR)


Note that newer versions of RStudio support auto-completion of package names.

Importing the Data

The first step is to import all data needed for the analysis and store them in the switchAnalyzeRlist object. IsoformSwitchAnalyzeR contains different functions for importing data from tools such as Salmon/Kallisto/RSEM/Cufflinks, but can be used with all isoform-level expression data using implemented general-purpose functions . For the purpose of illustrating data import, let’s use Cufflinks/Cuffdiff data as (See Importing Data Into R) for details on importing data from other tools).

cuffDB <- prepareCuffExample()

This function is just a wrapper for readCufflinks() which makes the sql database from the example Cufflinks/Cuffdiff results data included in the R package “cummeRbund” so when analyzing you own data you should just use the readCufflinks() function instead.

Once you have a CuffSet (the object type generated by readCufflinks() and prepareCuffExample()), a switchAnalyzeRlist is then created by:

aSwitchList <- importCufflinksCummeRbund(cuffDB)


Unfortunately, this example dataset is not ideal for illustrating the usability of IsoformSwitchAnalyzeR, as it only has two replicates, and the isoform switch tests rely on replicates. To illustrate the workflow, let’s instead use some of the test data from the IsoformSwitchAnalyzeR package:

data("exampleSwitchList")
exampleSwitchList
#> This switchAnalyzeRlist list contains:
#>  259 isoforms from 84 genes
#>  1 comparison from 2 conditions
#> 
#> Switching features:
#>            Comparison switchingIsoforms switchingGenes
#> 1 hESC vs Fibroblasts                 0              0
#> 
#> Feature analyzed:
#> [1] "Isoform Swich Identification"

The data stored in this example switch list correspond to a small subset of the result of using as illustrated above. Note that although a switch identification analysis was performed (by CuffDiff), no genes with differential usage were found. This small data subset is ideal as a showcase due to the short runtimes.


Part 1

Before we can run the analysis it is nessesary to know that IsoformSwitchAnalyzeR measures isoform usageas via isoform fraction (IF) values which quantifies the fration of the parent gene expression originating from a specific isoform (calculated as / ). Consequently the difference in isoform usage is quantifed as the difference in isoform fraction (dIF) caluclated as IF2 - IF1.

We can now run the first part of the isoform switch analysis workflow which filters for non-expressed genes/isoforms, identifies isoform switches, annotates open reading frames (ORF), switches with and extracts both the nucleotide and peptide (amino acid) sequences.

### isoformSwitchAnalysisPart1 needs the genomic sequence to predict ORFs

# Genome sequences are available from Biocindoctor as BSgenome objects: 
# http://bioconductor.org/packages/release/BiocViews.html#___BSgenome
# Here we use the gg19 reference genome - which can be downloaded by
# copy/pasting the following two lines into the R terminal:
# source("https://bioconductor.org/biocLite.R")
# biocLite("BSgenome.Hsapiens.UCSC.hg19")

library(BSgenome.Hsapiens.UCSC.hg19)

exampleSwitchList <- isoformSwitchAnalysisPart1(
    input=exampleSwitchList,
    genomeObject = Hsapiens, # the reference to the human BS genome
    dIFcutoff = 0.4,         # Cutoff for finding switches - set high for short runtime in example data
    outputSequences = FALSE  # prevents outputting of the fasta files used for external sequence analysis
)

Note that:

  1. It is possible to supply the CuffSet (the object which links to the cufflinks/cuffdiff SQL database, created with readCufflinks{cummeRbund}) directly to the input argument instead of creating the switchAnalyzeRlist first.

  2. The isoformSwitchAnalysisPart1() function has an argument, overwritePvalues, which overwrites the result of user-supplied p-values (such as those imported by cufflinks) with the result of running isoformSwitchTestDRIMSeq().

  3. The switchAnalyzeRlist produced by isoformSwitchAnalysisPart1() has been reduced to only contain genes where an isoform switch (as defined by the alpha and dIFcutoff arguments) was identified. This enables much faster runtimes for the rest of the pipeline, as only isoforms from a gene with a switch are analyzed.

The number of switching features is easily summarized via:

extractSwitchSummary(exampleSwitchList, dIFcutoff = 0.4) # supply the same cutoff to the summary function
#>            Comparison nrIsoforms nrGenes
#> 1 hESC vs Fibroblasts         12       7


Part 2

The second part of the isoform switch analysis workflow, which includes importing and incorporating external sequence annotation, analysis of alternative splicing, predicting functional consequences and visualizing both the general effects of isoform switches and the individual isoform switches. The combined analysis can be done by:

exampleSwitchList <- isoformSwitchAnalysisPart2(
    switchAnalyzeRlist      = exampleSwitchList, 
    dIFcutoff               = 0.4,   # Cutoff for finding switches - set high for short runtime in example data
    n                       = 10,    # if plotting was enabled, it would only output the top 10 switches
    removeNoncodinORFs      = TRUE,  # Because ORF was predicted de novo
    pathToCPATresultFile    = system.file("extdata/cpat_results.txt"   , package = "IsoformSwitchAnalyzeR"),
    pathToPFAMresultFile    = system.file("extdata/pfam_results.txt"   , package = "IsoformSwitchAnalyzeR"),
    pathToSignalPresultFile = system.file("extdata/signalP_results.txt", package = "IsoformSwitchAnalyzeR"),
    codingCutoff            = 0.725, # the coding potential cutoff we suggested for human 
    outputPlots             = FALSE  # keeps the function from outputting the plots from this example
)

The exampleSwitchList now contains all the information needed to analyze isoform switches and alterantive splicing both for individual genes as well as genome-wide summary statistics or analysis.

The number of isoform switches with functional consequences are easily extracted by setting “filterForConsequences = TRUE”:

extractSwitchSummary(exampleSwitchList, filterForConsequences = TRUE, dIFcutoff = 0.4) # supply the same cutoff to the summary function
#>            Comparison nrIsoforms nrGenes
#> 1 hESC vs Fibroblasts         10       5

For each of these top genes, a switch plot will be generated if “outputPlots=TRUE” were used. Let’s take a closer look at the top candidates:

The top genes with isoform switches are:

extractTopSwitches(exampleSwitchList, filterForConsequences = TRUE, n=3)
#>            gene_ref           gene_id gene_name condition_1 condition_2
#> 1 geneComp_00000098 XLOC_000088:KIF1B     KIF1B        hESC Fibroblasts
#> 2 geneComp_00000389       XLOC_001345  ARHGEF19        hESC Fibroblasts
#> 3 geneComp_00000205       XLOC_000177   LDLRAD2        hESC Fibroblasts
#>   gene_switch_q_value switchConsequencesGene Rank
#> 1       2.507552e-263                   TRUE    1
#> 2        1.413555e-80                   TRUE    2
#> 3        9.231366e-45                   TRUE    3


Examples of Switch Visualizations

Let’s take a look at the isoform switch in the KIF1B gene via the switch plot produced by IsoformSwitchAnalyzeR:

switchPlot(exampleSwitchList, gene='KIF1B')

From this plot, we first note that the gene expression is not significantly changed (bottom left). Next, we see the large significant switch in isoform usage across conditions (bottom right). By comparing which isoforms are chaning (bottom right) to the isoform structure (top) it can be seen that in hESC it is primarly the long isoforms that are used, but as the cells are differentiated to fibroblasts, there is a change towards mainly using the short isoform. Interestingly, this isoform switch results in the loss of several protein domains including the PH domain which plays a role in anchoring proteins to cell membranes. This could lead to the hypothesis that KIF1B would lose its ability to bind vesicles — , which would have to be confirmed experimentally.

Note that if you want to save this plot as a pdf file via the pdf command, you need to specify “onefile = FALSE”. The folloing colde chunk that will produce a nicely-sized figure:

pdf(file = '<outoutDirAndFileName>.pdf', onefile = FALSE, height=5, width = 8)
switchPlot(exampleSwitchList, gene='KIF1B')
dev.off()

Furthermore, note that if you are analyzing multiple conditions, you will also need to specify the ‘condition1’ and ‘condition2’ arguments in the switchPlot() function to indicate specifically which comparison you want to plot.


To illustrate the genome-wide analysis of consequences of isoform switching in the different comparisons, we will use a larger dataset which is a subset of two of the TCGA Cancer types analyzed in Vitting-Seerup et al 2017.

data("exampleSwitchListAnalyzed")

The first step is to take a look at the number and overlap of isoform switches. The numbers can be extracted via the extractSwitchSummary() function and by setting “filterForConsequences=TRUE” we only extract the features involved in a switch which is predicted to have a functional consequence:

extractSwitchSummary(
    exampleSwitchListAnalyzed,
    filterForConsequences=TRUE
)
#>                 Comparison nrIsoforms nrGenes
#> 1 COAD_ctrl vs COAD_cancer        690     368
#> 2 LUAD_ctrl vs LUAD_cancer        422     218
#> 3                 combined       1008     529


The number, and overlap, between condtions can be viusalized with the extractSwitchOverlap() function:

extractSwitchOverlap(
    exampleSwitchListAnalyzed,
    filterForConsequences=TRUE
)