IsoformSwitchAnalyzeR

Enabling Identification and Analysis of Isoform Switches with Functional Consequences from RNA-sequencing data

Kristoffer Vitting-Seerup

2018-10-30

Abstract

Recent breakthroughs in bioinformatics now allow us to accurately reconstruct and quantify full-length gene isoforms from RNA-sequencing data (via tools such as Cufflinks, StringTie, Kallisto and Salmon). These tools make it possible to analyze alternative isoform usage, but unfortunately this is rarely done, meaning RNA-seq data is typically not used to its full potential.

To solve this problem we developed IsoformSwitchAnalyzeR. IsoformSwitchAnalyzeR is an easy-to-use R package that enables statistical identification of isoform switching from RNA-seq derived quantification of novel and/or annotated full-length isoforms. IsoformSwitchAnalyzeR facilitates integration of many sources of (predicted) annotation such as Open Reading Frame (ORF), protein domains (via Pfam), signal peptides (via SignalP), coding potential (via CPAT) and sensitivity to Nonsense Mediated Decay (NMD). The combination of identified isoform switches and their annotation also enables IsoformSwitchAnalyzeR to predict potential functional consequences of the identified isoform switches, such as loss of protein domains or coding potential, thereby identifying isoform switches of particular interest. Lastly, IsoformSwitchAnalyzeR provides article-ready visualization methods for isoform switches, and summary statistics describing the genome-wide occurrences of isoform switches, their consequences and the associated alternative splicing.

In summary, IsoformSwitchAnalyzeR enables analysis of RNA-seq data with isoform resolution with a focus on isoform switching (with predicted consequences), thereby expanding the usability of RNA-seq data.

Table of Content

Preliminaries

Background and Package Description
Installation
How To Get Help

What To Cite (please remember)

Quick Start

Overview of Isoform Switch Workflow
Example Workflow (a.k.a. the “Too long; didn’t read” section)
Examples Visualizations

Detailed Workflow

Overview of Detailed Workflow
- IsoformSwitchAnalyzeR Background Information
Importing Data Into R
Filtering
Identifying Isoform Switches
Analyzing Open Reading Frames
Extracting Nucleotide and Amino Acid Sequences
Advice for Running External Sequence Analysis Tools
Importing External Sequence Analysis
Predicting Alternative Splicing
Predicting Switch Consequences
Removal of annotation data not needed anymore
Post Analysis of Isoform Switches with Consequences
- Analysis of Individual Isoform Switching
- Genome-Wide Analysis of Isoform Switching

Analyzing Alternative Splicing (new)

Overview of Alternative Splicing Workflow
Genome-Wide Analysis of Alternative Splicing

Other workflows

Augmenting ORF Predictions with Pfam Results
Analyze Small Upstream ORFs
Remove Sequences Stored in SwitchAnalyzeRlist
Adding Uncertain Category to Coding Potential Predictions
Quality control of ORF of known annotation
Analyzing the Biological Mechanisms Behind Isoform Switching
Analysing experiments without replicates

Frequently Asked Questions, Problems and Errors

What Quantification Tool(s) Should I Use?
How to handle cofounding effects (including batches)
What constitute an independent biological replicate?
etc

Final Remarks

Sessioninfo

Preliminaries

Background and Package Description

The usage of Alternative Transcription Start Sites (aTSS), Alternative Splicing (AS) and alternative Transcription Termination Sites (aTTS) collectively results in the production of different isoforms. Alternative isoforms are widely used, as recently demonstrated by The ENCODE Consortium, which found that on average 6.3 different transcripts are generated per gene, a number which may vary considerably between genes.

The importance of analyzing isoforms instead of genes has been highlighted by many examples showing functionally important changes. One of these examples is pyruvate kinase. In normal adult homeostasis, cells use the adult isoform (M1), which supports oxidative phosphorylation. However, almost all cancer cells use the embryonic isoform (M2), which promotes aerobic glycolysis, one of the hallmarks of cancer. Such shifts in isoform usage are termed ‘isoform switching’ and cannot be detected when analyzing data only at the gene level.

On a more systematic level, several recent studies suggest that isoform switches are quite common, since they often identify hundreds of switch events.

Tools such as Cufflinks, Salmon and Kallisto allow for reconstruction and quantification of full-length transcripts from RNA-seq data. Such data has the potential to facilitate genome-wide analysis of alternative isoform usage and identification of isoform switching, but unfortunately these types of analyses are still only rarely done; most analyses are on gene level only.

We hypothesize that there are multiple reasons why RNA-seq data is not used to its full potential:

There is still a lack of tools that can identify isoform switches with isoform resolution
Although there are many excellent tools to perform sequence analysis, there is no common framework which allows for integration of the analysis provided by these tools.
There is a lack of tools facilitating easy and article-ready visualization of isoform switches.

To solve all these problems, we developed IsoformSwitchAnalyzeR.

IsoformSwitchAnalyzeR is an easy-to-use R package that enables the user to import (novel) full-length derived isoforms from an RNA-seq experiment into R. If annotated transcripts are analyzed, IsoformSwitchAnalyzeR offers integration with the multi-layer information stored in a GTF file including the annotated coding sequences (CDS). If transcript structures are predicted (either de-novo or guided) IsoformSwitchAnalyzeR offers an accurate tool for identifying the dominant ORF of the isoforms. The knowledge of isoform positions for the CDS/ORF allows for prediction of sensitivity to Nonsense Mediated Decay (NMD), the mRNA quality control machinery that degrades isoforms with premature termination codons (PTC).

IsoformSwitchAnalyzeR facilitates identification of isoform switches via newly developed statistical methods that test each individual isoform for differential usage and thereby identify the exact isoforms involved in an isoform switch.

Since we know the exon structure of the full-length isoform, IsoformSwitchAnalyzeR can extract the underlying nucleotide sequence from a reference genome. This enables integration with the Coding Potential Assessment Tool (CPAT), which predicts the coding potential of an isoform and can be used to increase accuracy of ORF predictions. By combining the CDS/ORF isoform positions with the nucleotide sequence, we can extract the most likely amino acid sequence of the CDS/ORF. The amino acid sequence enables integration of analysis of protein domains (via Pfam) and signal peptides (via SignalP), both supported by IsoformSwitchAnalyzeR. Lastly, since the structures of all expressed isoforms from a given gene are known, one can also annotate alternative splicing, including intron retentions, a functionality also implemented in IsoformSwitchAnalyzeR.

Thus, in summary, IsoformSwitchAnalyzeR enables annotation of isoforms with intron retention, ORF, NMD sensitivity, coding potential, protein domains and signal peptides (and many more), resulting in the ability to predict important functional consequences of isoform switches.

IsoformSwitchAnalyzeR contains tools that allow the user to create article-ready visualizations of:

Individual isoform switches
Genome-wide analysis of isoform switches and their predicted consequences
Genome-wide analysis of alternative splicing and isoform switches and their predicted consequences.

These visualizations are easy to understand and integrate all information gathered throughout the workflow. Examples of visualizations can be found in the Examples Visualizations section.

Lastly, IsoformSwitchAnalyzeR is based on standard Bioconductor classes such as GRanges and BSgenome. Thus, it supports all species and annotation versions facilitated by the Bioconductor annotation packages.

Installation

IsoformSwitchAnalyzeR is part of the Bioconductor repository and community which means it is distributed with, and dependent on, Bioconductor. Installation of IsoformSwitchAnalyzeR is easy and can be done from within the R terminal. If it is the first time you use Bioconductor, simply copy-paste the following into an R session to install the basic Bioconductor packages:

install.packages("BiocManager")
BiocManager::install()

If you have already installed Bioconductor, running these two commands will check whether updates for installed packages are available.

After you have installed the basic Bioconductor packages you can install IsoformSwitchAnalyzeR by copy-pasting the following into an R session:

BiocManager::install("IsoformSwitchAnalyzeR")

This will install the IsoformSwitchAnalyzeR package as well as other R packages that are needed for IsoformSwitchAnalyzeR to work.
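To confirm that the installation succeeded before continuing, a small check such as the following can help. This is a minimal sketch using only base R; the helper name missingPackages is hypothetical and not part of IsoformSwitchAnalyzeR or BiocManager.

```r
# Hypothetical helper (not part of IsoformSwitchAnalyzeR): returns the
# subset of 'pkgs' that cannot be found in the current library paths.
missingPackages <- function(pkgs) {
    isInstalled <- vapply(
        pkgs,
        function(p) nzchar(system.file(package = p)),  # "" means not found
        logical(1)
    )
    pkgs[!isInstalled]
}

toInstall <- missingPackages(c("BiocManager", "IsoformSwitchAnalyzeR"))
if (length(toInstall) > 0) {
    message("Still missing: ", paste(toInstall, collapse = ", "))
} else {
    message("All required packages are installed.")
}
```

Any package reported as missing can then be installed with the commands shown above.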

If you need to install from the developmental branch of Bioconductor it is necessary to explicitly specify that. Please note that this is for advanced users and should not be done unless you have good reason to. Installation from the developmental branch can be done by copy/pasting:

BiocManager::install("IsoformSwitchAnalyzeR", version = "devel")

How To Get Help

This R package comes with plenty of documentation. Much information can be found in the R help files, which can be accessed by running “?functionName” in R (for example “?importRdata”). Here a lot of information can be found in the individual argument descriptions as well as in the details section. This vignette also contains a lot of information and will be continuously updated, so make sure to read both sources carefully, as they contain the answers to the most Frequently Asked Questions, Problems and Errors.

If you want to report a bug/error (found in the newest version of the R package!) please make an issue with a reproducible example on GitHub.

If you have unanswered questions or comments regarding IsoformSwitchAnalyzeR or how to run it, please post them on the associated Google group (after making sure the question has not already been answered).

If you have suggestions for improvements, please put them on GitHub. This will allow other people to upvote your idea, thereby showing us there is wide support for implementing your idea.

What To Cite

The analysis performed by IsoformSwitchAnalyzeR is only possible due to a string of other tools and scientific discoveries — please read this section thoroughly and cite the appropriate articles to give credit where credit is due (and allow people to continue both maintaining and developing bioinformatic tools).

If you are using:

Import of data from Salmon/Kallisto/RSEM/StringTie : Please cite reference 10.
Inter-library normalization of abundance values : Please cite references 10 and 11.
Isoform switch test implemented utilizing DEXSeq via IsoformSwitchAnalyzeR (default) : Please cite references 1, 12 and 13.
Isoform switch test implemented in the DRIMSeq package : Please cite references 1 and 3.
Prediction of open reading frames (ORF) : Please cite references 1 and 4.
Prediction of premature termination codons (PTC) and thereby NMD-sensitivity : Please cite references 1, 4, 5 and 6.
CPAT : Please cite reference 7.
Pfam : Please cite reference 8.
SignalP : Please cite reference 9.
Prediction of consequences : Please cite reference 1.
Visualizations (plots) implemented in the IsoformSwitchAnalyzeR package : Please cite reference 1.
Alternative splicing analysis : Please cite both references 1 and 4.
Genome-wide enrichment analysis : Please cite both references 1 and 2.

References:

1. Vitting-Seerup et al. The Landscape of Isoform Switches in Human Cancers. Cancer Res. (2017)
2. Vitting-Seerup et al. IsoformSwitchAnalyzeR: Analysis of changes in genome-wide patterns of alternative splicing and its functional consequences. bioRxiv (2018)
3. Nowicka et al. DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Research, 5(0), 1356.
4. Vitting-Seerup et al. spliceR: an R package for classification of alternative splicing and prediction of coding potential from RNA-seq data. BMC Bioinformatics 2014, 15:81.
5. Weischenfeldt et al. Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome Biol 2012, 13:R35
6. Huber et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods, 2015, 12:115-121.
7. Wang et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 2013, 41:e74.
8. Finn et al. The Pfam protein families database. Nucleic Acids Research (2014) Database Issue 42:D222-D230
9. Petersen et al. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature Methods, 8:785-786, 2011
10. Soneson et al. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research 4, 1521 (2015).
11. Robinson et al. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology (2010)
12. Ritchie et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research (2015)
13. Anders et al. Detecting differential usage of exons from RNA-seq data. Genome Research (2012)

Quick Start

Overview of Isoform Switch Workflow

The idea behind IsoformSwitchAnalyzeR is to make it easy to do advanced post-analysis of full-length RNA-seq derived transcript quantifications with a focus on finding, annotating and visualizing isoform switches with functional consequences. Furthermore, IsoformSwitchAnalyzeR also allows for analysis of alternative splicing. Here we will go through the isoform switch workflow. For the workflow of alternative splicing see Analyzing Alternative Splicing. If you want to know more about recommendations for bioinformatic tools which can perform the initial isoform quantifications, please refer to What Quantification Tool(s) Should I Use?.

From the isoform quantifications IsoformSwitchAnalyzeR performs three high-level tasks:

Statistical identification of isoform switches.
Annotation of the transcripts involved in the isoform switches.
Visualization of predicted consequences of the isoform switches, for individual switches and globally.

Please note that just like any other statistical tool IsoformSwitchAnalyzeR requires independent biological replicates (see What constitute an independent biological replicate?) and that recent benchmarks highlight that at least three independent replicates are needed for good statistical performance. Most tools have a hard time controlling the False Discovery Rate (FDR) with only two replicates, so extra caution is needed when interpreting results based on few replicates.
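The replicate requirement can be checked before starting an analysis. The following is a minimal base-R sketch; the design data frame and its sample names are hypothetical and only mirror the sampleID/condition layout of the design matrix used later in this vignette.

```r
# Hypothetical design: one row per sample (sampleID/condition layout)
design <- data.frame(
    sampleID  = c("ctrl_1", "ctrl_2", "ctrl_3", "treated_1", "treated_2"),
    condition = c("ctrl", "ctrl", "ctrl", "treated", "treated"),
    stringsAsFactors = FALSE
)

# Count independent replicates per condition
replicatesPerCondition <- table(design$condition)

# Conditions with fewer than three replicates deserve extra caution
fewReplicates <- names(replicatesPerCondition)[replicatesPerCondition < 3]
if (length(fewReplicates) > 0) {
    message(
        "Fewer than 3 replicates for: ",
        paste(fewReplicates, collapse = ", "),
        " - interpret the results with extra caution."
    )
}
```

In this toy design only the "treated" condition is flagged, since it has two replicates.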

A normal workflow for identification and analysis of isoform switches with functional consequences can be divided into two parts (also illustrated below in Figure 1).

Part 1) Extract Isoform Switches and Their Sequences. This part includes importing the data into R, identifying isoform switches, annotating those switches with open reading frames (ORF) and extracting the nucleotide and amino acid (peptide) sequences. The latter step enables the usage of external sequence analysis tools such as

CPAT : The Coding-Potential Assessment Tool, which can be run either locally or via their webserver.
Pfam : Prediction of protein domains, which can be run either locally or via their webserver.
SignalP : Prediction of Signal Peptides, which can be run either locally or via their webserver.

All of the above steps are performed by the high-level function:

isoformSwitchAnalysisPart1()

See below for example of usage, and Detailed Workflow for details on the individual steps.

Part 2) Plot All Isoform Switches and Their Annotation. This part involves importing and incorporating the results of the external sequence analysis, identifying intron retention, predicting functional consequences and plotting i) all genes with isoform switches and ii) summaries of general consequences of switching.

All of this can be done using the function:

isoformSwitchAnalysisPart2()

See below for usage example, and Detailed Workflow for details on the individual steps.

Alternatively if one does not plan to incorporate external sequence analysis, it is possible to run the full workflow using:

isoformSwitchAnalysisCombined()

This corresponds to running isoformSwitchAnalysisPart1() and isoformSwitchAnalysisPart2() without adding the external results.

Figure 1: Workflow overview. Grey transparent boxes indicate the two parts of a normal workflow for analyzing isoform switches. The individual steps in the two sub-workflows are indicated by arrows. Speech bubbles summarize how this full analysis can be done in a two-step process using the high-level functions (isoformSwitchAnalysisPart1() and isoformSwitchAnalysisPart2()).

Example Workflow

As indicated above, a full, but less customizable, analysis of isoform switches can be done using the two high-level functions isoformSwitchAnalysisPart1() and isoformSwitchAnalysisPart2(). This section aims to show how these functions are used as well as illustrate what IsoformSwitchAnalyzeR can be used for.

Let’s start by loading the R package:

library(IsoformSwitchAnalyzeR)

Note that newer versions of RStudio support auto-completion of package names.

Importing the Data

The first step is to import all data needed for the analysis and store them in a switchAnalyzeRlist object. IsoformSwitchAnalyzeR has different functions for importing data from tools such as Salmon/Kallisto/RSEM/StringTie/Cufflinks, but can be used with all isoform-level expression data using the implemented general-purpose functions. For the purpose of illustrating data import, let’s use some data quantified via Salmon. Note that the approach for Kallisto/RSEM/StringTie would be identical; for other sources of quantification (including Cufflinks/Cuffdiff) see Importing Data Into R.

To illustrate the data import let’s use some Salmon data included in the IsoformSwitchAnalyzeR package.

### Please note
# The way of importing files in the following example with
# system.file('pathToFile', package="IsoformSwitchAnalyzeR") is a
# specialized way of accessing the example data in the IsoformSwitchAnalyzeR package
# and not something you need to do - just supply the string, e.g.
# "/mySalmonQuantifications/", pointing to the parent directory (where
# each sample is a separate sub-directory) to the function.

### Import quantifications
salmonQuant <- importIsoformExpression(
    parentDir = system.file("extdata/",package="IsoformSwitchAnalyzeR")
)
#> Step 1 of 3: Identifying which algorithm was used...
#>     The quantification algorithm used was: Salmon
#> Step 2 of 3: Reading data...
#>     Found 6 quantification file(s) of interest
#> reading in files with read_tsv
#> 1 2 3 4 5 6 
#> Step 3 of 3: Normalizing FPKM/TxPM values via edgeR...
#> Done

This results in a list containing both count and abundance estimates for each isoform:

head(salmonQuant$abundance, 2)
#>       isoform_id Fibroblasts_0 Fibroblasts_1 hESC_0    hESC_1    iPS_0
#> 1 TCONS_00000001      11.36932       14.0215      0 0.0000000  0.00000
#> 2 TCONS_00000002       0.00000        0.0000      0 0.6904076 13.07866
#>       iPS_1
#> 1 11.303557
#> 2  6.838089

head(salmonQuant$counts, 2)
#>       isoform_id Fibroblasts_0 Fibroblasts_1 hESC_0    hESC_1    iPS_0
#> 1 TCONS_00000001      12.30707      14.02487      0 0.0000000  0.00000
#> 2 TCONS_00000002       0.00000       0.00000      0 0.1116201 21.10248
#>      iPS_1
#> 1 18.13313
#> 2 10.96964

Apart from the isoform quantification we need three additional pieces of information.

The transcript structure of the isoforms (in genomic coordinates). This is typically stored in a GTF file - a file format directly supported by IsoformSwitchAnalyzeR (as described below).
The information about which isoforms belong to the same gene (also in the GTF file).
A design matrix indicating which of the independent biological replicates belong to which condition (and if there are other covariates that should be taken into account).

The design matrix will have to be generated manually to ensure correct grouping:

myDesign <- data.frame(
    sampleID = colnames(salmonQuant$abundance)[-1],
    condition = gsub('_.*', '', colnames(salmonQuant$abundance)[-1])
)
myDesign
#>        sampleID   condition
#> 1 Fibroblasts_0 Fibroblasts
#> 2 Fibroblasts_1 Fibroblasts
#> 3        hESC_0        hESC
#> 4        hESC_1        hESC
#> 5         iPS_0         iPS
#> 6         iPS_1         iPS

Note that if you have additional covariates in the data of interest these can also be added to the design matrix to ensure they are taken into account during the statistical analysis. Covariates are (unwanted) sources of variation not due to the experimental groups you are interested in. This can be anything from a batch effect (e.g. data produced at different points in time) to an effect you are not interested in but which might influence the result (e.g. sex or age). See How to handle cofounding effects (including batches) for more info.
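As an illustration, a design matrix with one extra covariate column could be built as follows. This is only a sketch: the column name batch and the batch labels are hypothetical (the example data has no documented batches), and the cross-tabulation is a generic sanity check rather than something IsoformSwitchAnalyzeR requires.

```r
# Sample names as in the Salmon example data above
sampleIDs <- c("Fibroblasts_0", "Fibroblasts_1", "hESC_0", "hESC_1", "iPS_0", "iPS_1")

designWithBatch <- data.frame(
    sampleID  = sampleIDs,
    condition = gsub("_.*", "", sampleIDs),
    batch     = c("a", "b", "a", "b", "a", "b"),  # hypothetical batch labels
    stringsAsFactors = FALSE
)

# Cross-tabulate condition and batch: a covariate can only be separated
# from the condition effect if each batch occurs in more than one condition
crossTab <- table(designWithBatch$condition, designWithBatch$batch)
crossTab

# TRUE if every batch is spread over at least two conditions
batchIsSeparable <- all(colSums(crossTab > 0) >= 2)
batchIsSeparable
```

If a batch coincides exactly with one condition, the two effects cannot be told apart, whatever statistical tool is used.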

Now that we have the isoform quantifications, the design matrix and the isoform annotation (in a GTF file which IsoformSwitchAnalyzeR will import itself) we can combine and store all the relevant data in a switchAnalyzeRlist.

Please note that:

It is highly recommended to supply both count and abundance expression matrices to importRdata(), but a switchAnalyzeRlist can also be created with only one of them.
It is essential that the isoforms quantified (with Salmon, Kallisto etc.) and the annotation stored in the GTF file are identical. We highly recommend using files from GENCODE genes, where the “Comprehensive gene annotation” GTF and the “Transcript sequences” fasta file are a perfect pair. For more information (including instructions for how to use Ensembl annotation) refer to the FAQ The error “The annotation does not fit the expression data”.
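Whether the quantification and the annotation agree can be checked up front by comparing the two sets of isoform identifiers. The sketch below uses small hypothetical ID vectors; with real data the first vector would come from the isoform_id column of the imported quantification, and the second from the transcript IDs of the GTF file (how these are read is left open here).

```r
# Hypothetical isoform IDs from the quantification and from the GTF file
quantifiedIds <- c("TCONS_00000001", "TCONS_00000002", "TCONS_00000003")
annotatedIds  <- c("TCONS_00000001", "TCONS_00000002", "TCONS_00000004")

# IDs present in only one of the two sources
onlyInQuantification <- setdiff(quantifiedIds, annotatedIds)
onlyInAnnotation     <- setdiff(annotatedIds, quantifiedIds)

if (length(onlyInQuantification) > 0 || length(onlyInAnnotation) > 0) {
    message(
        length(onlyInQuantification), " isoform(s) lack annotation and ",
        length(onlyInAnnotation), " annotated isoform(s) were not quantified."
    )
}
```

A non-empty difference in either direction suggests that the quantification and the GTF file originate from different annotation versions.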

Once we have all the necessary data the easiest way to create a switchAnalyzeRlist is with the importRdata() function, which can also directly handle import and integration of annotation from a GTF file:

### Please note
# The way of importing files in the following example with
# system.file("extdata/example.gtf.gz", package="IsoformSwitchAnalyzeR") is a
# specialized way of accessing the example data in the IsoformSwitchAnalyzeR package
# and not something you need to do - just supply the string, e.g.
# "/myAnnotation/myQuantified.gtf", to the isoformExonAnnoation argument

### Create switchAnalyzeRlist
aSwitchList <- importRdata(
    isoformCountMatrix   = salmonQuant$counts,
    isoformRepExpression = salmonQuant$abundance,
    designMatrix         = myDesign,
    isoformExonAnnoation = system.file("extdata/example.gtf.gz", package="IsoformSwitchAnalyzeR"),
    showProgress = FALSE
)
aSwitchList
#> This switchAnalyzeRlist list contains:
#>  1092 isoforms from 362 genes
#>  3 comparison from 3 conditions

Note that by supplying the GTF file to the “isoformExonAnnoation” argument IsoformSwitchAnalyzeR will automatically import and integrate CDS regions from the GTF file as the ORF regions used by IsoformSwitchAnalyzeR (if addAnnotatedORFs=TRUE (default)).

For more information about the switchAnalyzeRlist see IsoformSwitchAnalyzeR Background Information.

To illustrate the IsoformSwitchAnalyzeR workflow we will use a smaller example dataset originally quantified with Cufflinks/CuffDiff. The corresponding switchAnalyzeRlist is included in IsoformSwitchAnalyzeR and can be loaded as follows:

data("exampleSwitchList")
exampleSwitchList
#> This switchAnalyzeRlist list contains:
#>  259 isoforms from 84 genes
#>  1 comparison from 2 conditions
#> 
#> Switching features:
#>            Comparison switchingIsoforms switchingGenes
#> 1 hESC vs Fibroblasts                 0              0
#> 
#> Feature analyzed:
#> [1] "Isoform Swich Identification"

Note that although an isoform switch identification analysis was performed (by CuffDiff), no genes with differential usage were found.

Part 1

Before we can run the analysis it is necessary to know that IsoformSwitchAnalyzeR measures isoform usage via isoform fraction (IF) values, which quantify the fraction of the parent gene expression originating from a specific isoform (calculated as isoform_exp / gene_exp). Consequently, the difference in isoform usage is quantified as the difference in isoform fraction (dIF), calculated as IF2 - IF1, and these dIF values are used to measure the effect size (like fold changes are in gene/isoform expression analysis).
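To make the IF and dIF definitions concrete, here is a small worked example in base R. The gene, isoform names and expression values are made up for illustration, and the sketch works on one already-averaged value per condition; how IsoformSwitchAnalyzeR aggregates replicates internally is not shown here.

```r
# Toy abundances for one hypothetical gene with two isoforms,
# one value per condition (values are made up for illustration)
isoExp <- data.frame(
    isoform_id = c("isoA", "isoB"),
    gene_id    = c("gene1", "gene1"),
    cond1      = c(8, 2),
    cond2      = c(3, 7)
)

# Gene expression = sum of the expression of all isoforms of the gene
geneExp1 <- ave(isoExp$cond1, isoExp$gene_id, FUN = sum)
geneExp2 <- ave(isoExp$cond2, isoExp$gene_id, FUN = sum)

# Isoform fraction per condition and the difference between conditions
isoExp$IF1 <- isoExp$cond1 / geneExp1
isoExp$IF2 <- isoExp$cond2 / geneExp2
isoExp$dIF <- isoExp$IF2 - isoExp$IF1
isoExp
```

Here the gene expression is 10 in both conditions, yet isoA drops from an IF of 0.8 to 0.3 (dIF of about -0.5) while isoB rises correspondingly. This is exactly the kind of change that is invisible at gene level.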

We can now run the first part of the isoform switch analysis workflow, which filters out non-expressed genes/isoforms, identifies isoform switches, annotates open reading frames (ORF), extracts both the nucleotide and peptide (amino acid) sequences and outputs them as two separate fasta files.

### isoformSwitchAnalysisPart1 needs the genomic sequence to predict ORFs

# Genome sequences are available from Bioconductor as BSgenome objects: 
# http://bioconductor.org/packages/release/BiocViews.html#___BSgenome
# Here we use the hg19 reference genome - which can be downloaded by
# copy/pasting the following line into the R terminal:
# BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")

library(BSgenome.Hsapiens.UCSC.hg19)

exampleSwitchList <- isoformSwitchAnalysisPart1(
    switchAnalyzeRlist = exampleSwitchList,
    genomeObject = Hsapiens, # the reference to the human BS genome
    dIFcutoff = 0.3,         # Cutoff for finding switches - set high for short runtime in example data
    # pathToOutput = 'path/to/where/output/should/be/',
    outputSequences = FALSE # prevents outputting of the fasta files used for external sequence analysis
)

Note that:

In the example above we set outputSequences=FALSE only to make the example data run nicely (i.e. to prevent the function from writing the two fasta files of the example data to your computer). When you analyze your own data you want to set outputSequences=TRUE and use the pathToOutput argument to specify where the files should be saved. The two output files are needed to do the external analysis as described below.
The isoformSwitchAnalysisPart1() function has an argument, overwritePvalues, which overwrites the result of an already existing switch test (such as those imported from Cufflinks) with the result of running isoformSwitchTestDEXSeq().
The switchAnalyzeRlist returned by isoformSwitchAnalysisPart1() has been reduced to only contain genes where an isoform switch (as defined by the alpha and dIFcutoff arguments) was identified. This enables much faster runtimes for the rest of the pipeline, as only isoforms from a gene with a switch are analyzed.

The number of switching features is easily summarized as follows:

extractSwitchSummary(
    exampleSwitchList,
    dIFcutoff = 0.3 # supply the same cutoff to the summary function
)
#>            Comparison nrIsoforms nrGenes
#> 1 hESC vs Fibroblasts         31      18

In a typical workflow the user would here use the produced fasta files to run the external analysis tools (Pfam (protein domains), SignalP (signal peptides), CPAT (coding potential)). For more information on how to run those tools refer to Advice for Running External Sequence Analysis Tools. To illustrate the workflow we will here use the result of running those tools on the example data, which we have also included in our R package.

Part 2

The second part of the isoform switch analysis workflow includes importing and incorporating external sequence annotation, analysis of alternative splicing, predicting functional consequences and visualizing both the general effects of isoform switches and the individual isoform switches. The combined analysis can be done by:

# Please note that in the following the part of the examples using
# the "system.file()" command is not necessary when using your own
# data - just supply the path as a string
# (e.g. pathToCPATresultFile = "/myFiles/myCPATresults.txt" )

exampleSwitchList <- isoformSwitchAnalysisPart2(
    switchAnalyzeRlist      = exampleSwitchList, 
    dIFcutoff               = 0.3,   # Cutoff for finding switches - set high for short runtime in example data
    n                       = 10,    # if plotting was enabled, it would only output the top 10 switches
    removeNoncodinORFs      = TRUE,  # Because ORF was predicted de novo
    pathToCPATresultFile    = system.file("extdata/cpat_results.txt"   , package = "IsoformSwitchAnalyzeR"),
    pathToPFAMresultFile    = system.file("extdata/pfam_results.txt"   , package = "IsoformSwitchAnalyzeR"),
    pathToSignalPresultFile = system.file("extdata/signalP_results.txt", package = "IsoformSwitchAnalyzeR"),
    codingCutoff            = 0.725, # the coding potential cutoff we suggested for human 
    outputPlots             = FALSE  # keeps the function from outputting the plots from this example
)

The exampleSwitchList now contains all the information needed to analyze isoform switches and alternative splicing, both for individual genes and as genome-wide summary statistics or analyses.

The number of isoform switches with predicted functional consequences is extracted by setting “filterForConsequences = TRUE”:

extractSwitchSummary(
    exampleSwitchList,
    filterForConsequences = TRUE,
    dIFcutoff = 0.3               # supply the same cutoff to the summary function
) 
#>            Comparison nrIsoforms nrGenes
#> 1 hESC vs Fibroblasts         24      12

For each of these top genes, a switch plot would have been generated if “outputPlots=TRUE” had been used. Let’s take a closer look at the top candidates.

The top genes with isoform switches are:

extractTopSwitches(exampleSwitchList, filterForConsequences = TRUE, n=3)
#>            gene_ref           gene_id gene_name condition_1 condition_2
#> 1 geneComp_00000237 XLOC_000202:SRRM1     SRRM1        hESC Fibroblasts
#> 2 geneComp_00000098 XLOC_000088:KIF1B     KIF1B        hESC Fibroblasts
#> 3 geneComp_00000389       XLOC_001345  ARHGEF19        hESC Fibroblasts
#>   gene_switch_q_value switchConsequencesGene Rank
#> 1       1.553471e-185                   TRUE    1
#> 2        1.879647e-87                   TRUE    2
#> 3        1.312385e-83                   TRUE    3
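The Rank column in the output above is consistent with ordering genes by their switch q-value, smallest first. As a rough base-R analogue of that ranking (an illustration only, not the package's implementation), the three genes can be re-ranked from a shuffled toy table that reuses the values printed above:

```r
# Toy table reusing the gene names and q-values printed above,
# deliberately given in a different order
topSwitches <- data.frame(
    gene_name           = c("KIF1B", "ARHGEF19", "SRRM1"),
    gene_switch_q_value = c(1.879647e-87, 1.312385e-83, 1.553471e-185),
    stringsAsFactors    = FALSE
)

# Smallest q-value first, then assign ranks
ranked <- topSwitches[order(topSwitches$gene_switch_q_value), ]
ranked$Rank <- seq_len(nrow(ranked))
ranked
```

This reproduces the order SRRM1, KIF1B, ARHGEF19 shown in the Rank column.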

Examples Visualizations

Let’s take a look at the isoform switch in the SRRM1 gene via the switch plot produced by IsoformSwitchAnalyzeR:

switchPlot(exampleSwitchList, gene='SRRM1')

From this plot, we first note that the gene expression is not significantly changed (bottom left). Next, we see the large significant switch in isoform usage across conditions (bottom right). By comparing which isoforms are changing (bottom right) to the isoform structure (top) it can be deduced that in hESC it is primarily the short isoforms that are used, but as the cells are differentiated to fibroblasts, there is a change towards mainly using the long isoform. Interestingly, this isoform switch seems to result in a truncation of the PWI domain, which is important for DNA/RNA binding. SRRM1 is a splice factor and this switch could lead to the hypothesis that SRRM1 is nonfunctional in hESC but functional in fibroblasts even though the total output from the gene is the same. This hypothesis would naturally have to be confirmed experimentally.

Note that if you want to save this plot as a pdf file via the pdf command, you need to specify “onefile = FALSE”. The following code chunk will produce a nicely-sized figure:

pdf(file = '<outputDirAndFileName>.pdf', onefile = FALSE, height=5, width = 8)
switchPlot(exampleSwitchList, gene='SRRM1')
dev.off()

Furthermore, note that:

If the switchAnalyzeRlist contains multiple comparisons, you will also need to specify the ‘condition1’ and ‘condition2’ arguments in the switchPlot() function to indicate specifically which comparison you want to plot.
The differential isoform/gene expression analysis is not a part of the IsoformSwitchAnalyzeR workflow but can easily be added as described in Adding differential gene expression.
The switchPlot() function has an argument called IFcutoff, which requires the Isoform Fraction of an isoform to be larger than IFcutoff (default 0.01) for the isoform to be included in the plot. Increasing this cutoff can result in “cleaner” plots as minor isoforms will be removed.

To illustrate the genome-wide analysis of consequences of isoform switching in the different comparisons, we will use a larger dataset which is a subset of two of the TCGA Cancer types analyzed in Vitting-Seerup et al 2017.

data("exampleSwitchListAnalyzed")

The first step is to take a look at the number and overlap of isoform switches. The numbers can be extracted via the extractSwitchSummary() function and by setting “filterForConsequences=TRUE” we only extract the features involved in a switch which is predicted to have a functional consequence:

extractSwitchSummary(
    exampleSwitchListAnalyzed,
    filterForConsequences=TRUE
)
#>                 Comparison nrIsoforms nrGenes
#> 1 COAD_ctrl vs COAD_cancer        690     368
#> 2 LUAD_ctrl vs LUAD_cancer        422     218
#> 3                 combined       1008     529
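Note that the “combined” row is smaller than the sum of the two comparisons. Assuming that it counts each isoform or gene only once across comparisons (an interpretation, not something stated above), the number of features shared between the two comparisons follows by inclusion-exclusion. A minimal base R sketch using only the numbers printed above:

```r
# Numbers copied from the extractSwitchSummary() output above
switchSummary <- data.frame(
    Comparison = c('COAD_ctrl vs COAD_cancer', 'LUAD_ctrl vs LUAD_cancer', 'combined'),
    nrIsoforms = c(690, 422, 1008),
    nrGenes    = c(368, 218, 529)
)

# Assumption: 'combined' is the number of unique features across comparisons,
# so shared = (COAD + LUAD) - combined
perComparison <- switchSummary[switchSummary$Comparison != 'combined', ]
combined      <- switchSummary[switchSummary$Comparison == 'combined', ]

sharedIsoforms <- sum(perComparison$nrIsoforms) - combined$nrIsoforms
sharedGenes    <- sum(perComparison$nrGenes)    - combined$nrGenes

sharedIsoforms
sharedGenes
```

Under that assumption, only a minority of the switching isoforms and genes are shared between the two cancer types.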

The number of switches, and the overlap between conditions, can be visualized with the extractSwitchOverlap() function:

extractSwitchOverlap(
    exampleSwitchListAnalyzed,
    filterForConsequences=TRUE
)

We can look more into the details of the consequences by considering each type of consequence separately as follows:

extractConsequenceSummary(
    exampleSwitchListAnalyzed,
    consequencesToAnalyze='all',
    plotGenes = FALSE,           # enables analysis of genes (instead of isoforms)
    asFractionTotal = FALSE      # enables analysis of fraction of significant features
)

Note the “consequencesToAnalyze” argument enables analysis of only a subset of features.

From this summary plot many conclusions are possible. First of all, the most frequent changes are changes affecting protein domains and ORFs. Secondly, intron retention is more common in LUAD than in COAD. Lastly, when considering opposite consequences (e.g. the gain vs loss of protein domains), it is quite easy to see they are unevenly distributed (e.g. more protein domain loss than protein domain gain). This uneven distribution can be systematically analyzed using the built-in enrichment analysis:

extractConsequenceEnrichment(
    exampleSwitchListAnalyzed,
    consequencesToAnalyze='all',
    analysisOppositeConsequence = TRUE
)

For each pair of opposite consequences (e.g. protein domain loss vs gain) (y-axis), this plot shows the fraction of switches affected by either of the opposing consequences that results in the consequence indicated (e.g. protein domain loss) (x-axis). If this fraction is significantly different from 0.5, it indicates there is a systematic bias in which consequence is detected.

From the analysis above, it is therefore quite clear that many of the opposing consequences are significantly unevenly distributed. In other words, many types of consequences seem to be used in a group-specific manner.
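To make the comparison against 0.5 concrete, here is a minimal sketch in base R. The counts are hypothetical, and binom.test() is used purely for illustration; the text above does not state which test the package applies internally.

```r
# Hypothetical counts for one pair of opposing consequences
nDomainLoss <- 80
nDomainGain <- 20
nTotal      <- nDomainLoss + nDomainGain

# Fraction of switches resulting in the indicated consequence (the x-axis value)
fractionLoss <- nDomainLoss / nTotal

# Two-sided exact test of whether the fraction differs from 0.5
testResult <- binom.test(x = nDomainLoss, n = nTotal, p = 0.5)

fractionLoss
testResult$p.value < 0.05   # is the imbalance significant at the 5% level?
testResult$conf.int         # 95% confidence interval for the fraction
```

A fraction whose confidence interval excludes 0.5 corresponds to a pair of consequences that is unevenly distributed.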

When comparing the two cancer types (right vs left plot) the overall pattern seems similar, but there might be differences. This can formally be analyzed with the extractConsequenceEnrichmentComparison() function. For each opposing set of consequences (e.g. protein domain loss vs gain), this function contrasts the individual comparisons in a pairwise manner to assess whether the ratios of losses/gains are indeed significantly different:

extractConsequenceEnrichmentComparison(
    exampleSwitchListAnalyzed,
    consequencesToAnalyze=c('domains_identified','intron_retention','coding_potential'),
    analysisOppositeConsequence = TRUE
)

#>                                             comparisonsCompared
#> 1 COAD_ctrl vs COAD_cancer compared to LUAD_ctrl vs LUAD_cancer
#> 2 COAD_ctrl vs COAD_cancer compared to LUAD_ctrl vs LUAD_cancer
#> 3 COAD_ctrl vs COAD_cancer compared to LUAD_ctrl vs LUAD_cancer
#>                                                  consequence
#> 1                      Domain loss (paired with Domain gain)
#> 2  Intron retention loss (paired with Intron retention gain)
#> 3 Transcript is coding (paired with Transcript is Noncoding)
#>   propUp_comparison_1 propUp_comparison_2 fisherQvalue Significant
#> 1           0.8220065           0.6405229 7.891585e-05        TRUE
#> 2           0.3661972           0.2881356 3.575517e-01       FALSE
#> 3           0.1851852           0.3529412 1.920463e-01       FALSE

The analysis shows that compared to LUAD, COAD has a significantly higher fraction of switches resulting in protein domain loss, but there is no significant difference in terms of intron retention loss or switches from noncoding to coding transcripts.
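The column name fisherQvalue suggests that the comparison is based on Fisher's exact test. As a rough sketch of that idea in base R, consider the 2x2 table below. The counts are illustrative values chosen to reproduce the two domain-loss proportions in the table; the real counts are not shown in the output, and the package reports multiple-testing corrected q-values rather than the raw p-value computed here.

```r
# Illustrative 2x2 table: rows = comparison, columns = consequence
# (counts are assumptions chosen so the proportions match the output above)
domainCounts <- matrix(
    c(254, 55,    # COAD: domain loss, domain gain
       98, 55),   # LUAD: domain loss, domain gain
    nrow = 2, byrow = TRUE,
    dimnames = list(
        comparison  = c('COAD', 'LUAD'),
        consequence = c('loss', 'gain')
    )
)

# Proportion of switches resulting in domain loss in each comparison
propLoss <- domainCounts[, 'loss'] / rowSums(domainCounts)
round(propLoss, 3)

# Fisher's exact test of whether the loss/gain ratio differs between comparisons
fisherResult <- fisher.test(domainCounts)
fisherResult$p.value < 0.05
```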

Such a result leads to the question of what cellular mechanism underlies these changes. Due to the detailed alternative splicing analysis performed, we can analyze this using the built-in summary function:

extractSplicingEnrichment(exampleSwitchListAnalyzed)

This is equivalent to the consequence enrichment analysis described above. From this analysis, we see that although the patterns of consequences looked somewhat similar (as illustrated above), the underlying splicing changes are quite different. COAD utilizes alternative 3’ acceptor sites (A3), multiple exon skipping (MES) and alternative transcription start sites (ATSS) more, while LUAD utilizes intron retention (IR) and alternative transcription termination sites (ATTS) more. For more details (and functionality) regarding splicing analysis see the Genome-Wide Analysis of Alternative Splicing section.

Since all the data about isoform and gene expression is saved in the switchAnalyzeRlist, we can also make a set of overview plots:

data("exampleSwitchListAnalyzed")

### Volcano-like plot:
ggplot(data=exampleSwitchListAnalyzed$isoformFeatures, aes(x=dIF, y=-log10(isoform_switch_q_value))) +
     geom_point(
        aes( color=abs(dIF) > 0.1 & isoform_switch_q_value < 0.05 ), # default cutoff
        size=1
    ) +
    geom_hline(yintercept = -log10(0.05), linetype='dashed') + # default cutoff
    geom_vline(xintercept = c(-0.1, 0.1), linetype='dashed') + # default cutoff
    facet_wrap( ~ condition_2) +
    #facet_grid(condition_1 ~ condition_2) + # alternative to facet_wrap if you have overlapping conditions
    scale_color_manual('Significant\nIsoform Switch', values = c('black','red')) +
    labs(x='dIF', y='-Log10 ( Isoform Switch Q Value )') +
    theme_bw()

As there are many dIF values (effect sizes) very close to zero which have a significant isoform switch (black dots above the dashed horizontal line), this nicely illustrates why cutoffs on both the dIF and the q-value are necessary.
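The combined cutoff used for the coloring in the plot can also be applied directly to a table of isoform-level results. The following sketch uses a small toy data.frame with hypothetical values (not taken from the example data) and only mirrors the two column names used in the plot:

```r
# Toy isoform-level results (hypothetical values, not from the example data)
toyFeatures <- data.frame(
    isoform_id             = c('isoA', 'isoB', 'isoC', 'isoD'),
    dIF                    = c(0.02, -0.35, 0.15, 0.40),
    isoform_switch_q_value = c(1e-10, 1e-4, 0.20, NA)
)

alpha     <- 0.05  # q-value cutoff used in the plot above
dIFcutoff <- 0.1   # effect size cutoff used in the plot above

# A switch must pass both cutoffs; untested isoforms (NA) are never called
toyFeatures$isSwitch <- with(
    toyFeatures,
    !is.na(isoform_switch_q_value) &
        abs(dIF) > dIFcutoff &
        isoform_switch_q_value < alpha
)

toyFeatures
```

Here isoA is highly significant but has a negligible effect size, and isoC has a sizeable dIF but is not significant, so only isoB is called as a switch.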

Another interesting overview plot can be made as follows:

### Switch vs Gene changes:
ggplot(data=exampleSwitchListAnalyzed$isoformFeatures, aes(x=gene_log2_fold_change, y=dIF)) +
    geom_point(
        aes( color=abs(dIF) > 0.1 & isoform_switch_q_value < 0.05 ), # default cutoff
        size=1
    ) + 
    facet_wrap(~ condition_2) +
    #facet_grid(condition_1 ~ condition_2) + # alternative to facet_wrap if you have overlapping conditions
    geom_hline(yintercept = 0, linetype='dashed') +
    geom_vline(xintercept = 0, linetype='dashed') +
    scale_color_manual('Significant\nIsoform Switch', values = c('black','red')) +
    labs(x='Gene log2 fold change', y='dIF') +
    theme_bw()

Here, it is clear that changes in gene expression and isoform switches are not necessarily mutually exclusive, as there are many genes which are both differentially expressed (large gene log2FC) and contain isoform switches (color). This highlights the importance of also analyzing RNA-seq data at isoform-level resolution.

Back to Table of Content.

Detailed Workflow

Overview of Detailed Workflow

Before you start on this section, we recommend that you read through the Quick Start section, as it gives the broad overview, introduces some basic concepts and gives a couple of tips not repeated here.

Compared to the workflow presented in Quick Start, a full workflow for analyzing isoform switches has many sub-steps which can each be customized/optimized. In this section, we will go into more depth with each of the steps as well as provide tips and shortcuts for working with IsoformSwitchAnalyzeR. Specifically, each of the main functions behind these steps will be described and illustrated. For a more comprehensive and detailed description of each individual function please refer to the individual function documentation built into the R package (easily accessed via ?functionName).

First we will start by getting an overview of the detailed workflow before diving into the individual parts. Then we will introduce the IsoformSwitchAnalyzeR Background Information, which covers all you need to know before doing the detailed analysis. This is followed by a detailed step-by-step isoform switch analysis workflow, and lastly an overview of useful Other Tools in IsoformSwitchAnalyzeR is provided.

The detailed workflow consists of the following steps (illustrated in Figure 2) which, just like before, can be divided into two parts:

Part 1) Extract Isoform Switches and Their Sequences.

Importing Data Into R
Filtering
Identifying Isoform Switches
Analyzing Open Reading Frames
Extracting Nucleotide and Amino Acid Sequences
Advice for Running External Sequence Analysis Tools

This corresponds to running the following functions in sequential order (which incidentally is just what the isoformSwitchAnalysisPart1() function does):

### Import data into R via one of:
myQuantifications <- importIsoformExpression() # RSEM/Kallisto/Salmon/StringTie
myCuffDb         <- readCufflinks()           # Cufflinks/Cuffdiff

### Create SwitchAnalyzeRlist
mySwitchList <- importRdata()         # OR
mySwitchList <- importCufflinksData()

### Filter
mySwitchList <- preFilter( mySwitchList )

### Test for isoform switches
mySwitchList <- isoformSwitchTestDEXSeq( mySwitchList )   # OR
mySwitchList <- isoformSwitchTestDRIMSeq( mySwitchList ) 

### If novel isoforms (else use CDS from ORF as explained in importRdata() )
mySwitchList <- analyzeORF( mySwitchList )

### Extract Sequences
mySwitchList <- extractSequence( mySwitchList )

### Summary
extractSwitchSummary(mySwitchList)

Part 2) Plot All Isoform Switches and Their Annotation.

Importing External Sequence Analysis
Predicting Alternative Splicing
Predicting Switch Consequences
Post Analysis of Isoform Switches with Consequences
- Analysis of Individual Isoform Switching
- Genome-Wide Analysis of Isoform Switching

This corresponds to running the following functions in sequential order (just like the isoformSwitchAnalysisPart2() function does):

### Add annotation
mySwitchList <- analyzeCPAT( mySwitchList )
mySwitchList <- analyzePFAM( mySwitchList )
mySwitchList <- analyzeSignalP( mySwitchList )
mySwitchList <- analyzeAlternativeSplicing( mySwitchList )

### Analyse consequences
mySwitchList <- analyzeSwitchConsequences( mySwitchList )

### Visual analysis
# Individual switches
switchPlotTopSwitches( mySwitchList )

### Summary
extractSwitchSummary(mySwitchList, filterForConsequences = TRUE)

For a normal workflow this would then typically be followed by the global analysis of alternative splicing and consequences:

# global consequence analysis
extractConsequenceSummary( mySwitchList )
extractConsequenceEnrichment( mySwitchList )
extractConsequenceGenomeWide( mySwitchList )

# global splicing analysis
extractSplicingSummary( mySwitchList )
extractSplicingEnrichment( mySwitchList )
extractSplicingGenomeWide( mySwitchList )

The combined workflow is visualized here:

Figure 2: Detailed workflow overview. The grey transparent boxes indicate the two parts of a normal workflow for analyzing isoform switches. The individual steps in the two sub-workflows are indicated by arrows along with a description and the main R functions for performing the steps.

IsoformSwitchAnalyzeR Background Information

The switchAnalyzeRlist

The switchAnalyzeRlist object is created to specifically contain and summarize all relevant information about the isoforms involved in isoform switches. The switchAnalyzeRlist object is a named list, meaning each entry in the list can be accessed by its name via the ‘$’ symbol or by using “[[‘entryName’]]”. A newly created switchAnalyzeRlist object contains 6 entries, and as the isoforms are gradually annotated and analyzed more entries are added.

data("exampleSwitchList")         # A newly created switchAnalyzeRlist + switch analysis
names(exampleSwitchList)
#> [1] "isoformFeatures"       "exons"                 "conditions"           
#> [4] "designMatrix"          "isoformCountMatrix"    "sourceId"             
#> [7] "isoformSwitchAnalysis"

data("exampleSwitchListAnalyzed") # A fully analyzed switchAnalyzeRlist
names(exampleSwitchListAnalyzed)
#> [1] "isoformFeatures"             "exons"                      
#> [3] "conditions"                  "sourceId"                   
#> [5] "orfAnalysis"                 "domainAnalysis"             
#> [7] "signalPeptideAnalysis"       "AlternativeSplicingAnalysis"
#> [9] "switchConsequence"
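Because a switchAnalyzeRlist is a named list, the two access styles mentioned above are interchangeable. The sketch below uses a small toy list as a stand-in so that it runs without the package; its contents are made up and only the entry names mirror the output above.

```r
# Toy stand-in for a switchAnalyzeRlist (contents are hypothetical)
toyList <- list(
    isoformFeatures = data.frame(
        isoform_id = c('iso1', 'iso2'),
        dIF        = c(0.25, -0.25)
    ),
    conditions = data.frame(
        condition    = c('hESC', 'Fibroblasts'),
        nrReplicates = c(2, 2)
    )
)

names(toyList)

# Both styles return the same entry
viaDollar   <- toyList$conditions
viaBrackets <- toyList[['conditions']]
identical(viaDollar, viaBrackets)

# The bracket style also works with the entry name stored in a variable
entryName <- 'isoformFeatures'
head(toyList[[entryName]])
```

The bracket style is convenient when looping over several entries by name.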

The first entry, ‘isoformFeatures’, is a data.frame where all relevant data about each comparison of an isoform (between conditions), as well as the analyses performed and annotation incorporated via IsoformSwitchAnalyzeR, is stored. Amongst the default information is isoform and gene id, gene and isoform expression, as well as the isoform_switch_q_value and gene_switch_q_value columns, where the result of the differential isoform analysis is stored. The comparisons made can be identified as “from ‘condition_1’ to ‘condition_2’”, meaning ‘condition_1’ is considered the ground state and ‘condition_2’ the changed state. This also means that a positive dIF value indicates that the isoform usage is increased in ‘condition_2’ compared to ‘condition_1’. Since the ‘isoformFeatures’ entry is the most relevant part of the switchAnalyzeRlist object, the most-used standard methods have also been implemented to work directly on isoformFeatures.

### Preview
head(exampleSwitchList, 2)
#>             iso_ref          gene_ref     isoform_id     gene_id
#> 19 isoComp_00000007 geneComp_00000005 TCONS_00000007 XLOC_000005
#> 22 isoComp_00000008 geneComp_00000005 TCONS_00000008 XLOC_000005
#>    condition_1 condition_2 gene_name nearest_ref_id class_code
#> 19        hESC Fibroblasts      <NA>     uc009vjk.2          =
#> 22        hESC Fibroblasts      <NA>     uc001aau.2          =
#>    TSS_group_id length              locus gene_status gene_overall_mean
#> 19         TSS2   2750 chr1:322036-328580          OK          193.8339
#> 22         TSS3   4369 chr1:322036-328580          OK          171.6605
#>    gene_value_1 gene_value_2 gene_stderr_1 gene_stderr_2
#> 19      696.704      48.0566      3.592857      2.307488
#> 22      696.704      48.0566      3.592857      2.307488
#>    gene_log2_fold_change gene_p_value gene_q_value gene_significant
#> 19              -3.85774  2.66665e-09  3.20379e-08              yes
#> 22              -3.85774  2.66665e-09  3.20379e-08              yes
#>    iso_status iso_overall_mean iso_value_1 iso_value_2 iso_stderr_1
#> 19         OK         372.3803     358.383    29.28480     2.091049
#> 22         OK         372.3803     338.308     5.01291     1.322809
#>    iso_stderr_2 iso_log2_fold_change iso_p_value iso_q_value
#> 19    17.489950             -3.61328 4.83698e-05 0.000548629
#> 22     6.999295             -6.07655 2.64331e-03 0.015898000
#>    iso_significant IF_overall       IF1       IF2         dIF
#> 19             yes  0.5618896 0.5143978 0.6093814  0.09498364
#> 22             yes  0.2949481 0.4855835 0.1043126 -0.38127092
#>    isoform_switch_q_value gene_switch_q_value
#> 19                     NA                   1
#> 22                     NA                   1
# tail(exampleSwitchList, 2)

### Dimensions
dim(exampleSwitchList$isoformFeatures)
#> [1] 259  38

nrow(exampleSwitchList)
#> [1] 259
ncol(exampleSwitchList)
#> [1] 38
dim(exampleSwitchList)
#> [1] 259  38
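The sign convention for dIF can be checked by hand against the two rows previewed above. The sketch below assumes that the isoform fraction (IF) is the isoform expression divided by the parent gene expression and that dIF = IF2 - IF1. Both assumptions are consistent with the printed values, but the function documentation remains the reference for the exact definitions.

```r
# Values copied from the head(exampleSwitchList, 2) preview above
previewRows <- data.frame(
    isoform_id   = c('TCONS_00000007', 'TCONS_00000008'),
    gene_value_1 = c(696.704, 696.704),
    gene_value_2 = c(48.0566, 48.0566),
    iso_value_1  = c(358.383, 338.308),
    iso_value_2  = c(29.28480, 5.01291)
)

# Assumed definition: isoform fraction = isoform expression / gene expression
previewRows$IF1 <- previewRows$iso_value_1 / previewRows$gene_value_1
previewRows$IF2 <- previewRows$iso_value_2 / previewRows$gene_value_2

# Assumed definition: dIF = change in isoform fraction from condition_1 to condition_2
previewRows$dIF <- previewRows$IF2 - previewRows$IF1

cbind(
    previewRows['isoform_id'],
    round(previewRows[, c('IF1', 'IF2', 'dIF')], 4)
)
```

The first isoform has a positive dIF (its relative usage increases in Fibroblasts) and the second a negative dIF, even though the absolute expression of both isoforms decreases.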

A very useful functionality implemented in IsoformSwitchAnalyzeR is the subsetSwitchAnalyzeRlist() function, which allows for removal of isoforms and all their associated information across all entries in a switchAnalyzeRlist. The function subsets the switchAnalyzeRlist based on a vector of logicals matching the isoformFeatures entry of the list.

exampleSwitchList
#> This switchAnalyzeRlist list contains:
#>  259 isoforms from 84 genes
#>  1 comparison from 2 conditions
#> 
#> Switching features:
#>            Comparison switchingIsoforms switchingGenes
#> 1 hESC vs Fibroblasts                 0              0
#> 
#> Feature analyzed:
#> [1] "Isoform Swich Identification"

### Subset
subsetSwitchAnalyzeRlist(
    exampleSwitchList,
    exampleSwitchList$isoformFeatures$gene_name == 'ARHGEF19'
)
#> This switchAnalyzeRlist list contains:
#>  3 isoforms from 1 genes
#>  1 comparison from 2 conditions
#> 
#> Switching features:
#>            Comparison switchingIsoforms switchingGenes
#> 1 hESC vs Fibroblasts                 0              0
#> 
#> Feature analyzed:
#> [1] "Isoform Swich Identification"
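One practical detail when building the logical vector: as the preview above shows, gene_name can be NA, and a comparison with == then yields NA rather than FALSE. How subsetSwitchAnalyzeRlist() treats NA entries is not described here, so a cautious option is to build the vector with %in%, which never returns NA. A base R sketch on a toy vector (hypothetical values):

```r
# Toy gene names, including a missing annotation (hypothetical values)
geneNames <- c('ARHGEF19', NA, 'SRRM1', 'ARHGEF19')

viaEquals <- geneNames == 'ARHGEF19'      # NA propagates
viaIn     <- geneNames %in% 'ARHGEF19'    # NA becomes FALSE

viaEquals
viaIn

# Alternative that keeps '==' but makes the NA handling explicit
viaExplicit <- !is.na(geneNames) & geneNames == 'ARHGEF19'
identical(viaIn, viaExplicit)
```

Either NA-free vector can then be passed as the second argument, as in the subsetting example above.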

Transcript structure information is stored in the ‘exons’ entry of the switchAnalyzeRlist and contains the genomic coordinates for each exon in each isoform, and a column indicating which isoform it originates from. This information is stored as GenomicRanges (GRanges), which is very useful for overlapping genomic features and interacting with other Bioconductor packages.

head(exampleSwitchList$exons,2)
#> GRanges object with 2 ranges and 2 metadata columns:
#>       seqnames        ranges strand |     isoform_id     gene_id
#>          <Rle>     <IRanges>  <Rle> |    <character> <character>
#>   [1]     chr1 322037-322228      + | TCONS_00000007 XLOC_000005
#>   [2]     chr1 324288-324345      + | TCONS_00000007 XLOC_000005
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

A full description of the initial switchAnalyzeRlist can be found via ?switchAnalyzeRlist and each additional entry added is described in the “value” part of the documentation for the specific function adding the entry.
