psichomics tutorial: command-line interface (CLI)

Nuno Saraiva-Agostinho

22 April 2016


psichomics is an interactive R package for integrative analyses of alternative splicing using data from The Cancer Genome Atlas (TCGA) (containing molecular data associated with 34 tumour types) and from the Genotype-Tissue Expression (GTEx) project (containing data for multiple normal human tissues). The data leveraged from these projects includes clinical information and transcriptomic data, such as the quantification of RNA-Seq reads aligning to splice junctions (henceforth called junction quantification) and exons.

Installing and starting the program

Install psichomics by typing the following in an R console (the R environment is required):

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("psichomics")

After the installation, load psichomics by typing:

library(psichomics)

Retrieving data

The quantification of each alternative splicing event is based on the proportion of junction reads that support the inclusion isoform, known as percent spliced-in or PSI (E. T. Wang et al. 2008).

To estimate this value for each splicing event, both alternative splicing annotation and junction quantification are required. While alternative splicing annotation is provided by the package, junction quantification may be retrieved from TCGA.

Download and load data from TCGA/Firebrowse

Data is downloaded from Firebrowse, a service that hosts proccessed data from TCGA as required to run the downstream analyses. Before downloading data, check the following available options:

# Available tumour types
cohorts <- getFirebrowseCohorts()

# Available sample dates
date <- getFirebrowseDates()

# Available data types
dataTypes <- getFirebrowseDataTypes()

After deciding on the options to use, download and load data of interest using:

# Set download folder
folder <- getDownloadsFolder()

# Download and load most recent junction quantification and clinical data from
# TCGA/Firebrowse for Adrenocortical Carcinoma
data <- loadFirebrowseData(folder=folder,
                         cohort="BRCA",
                         data=c("Clinical", "junction_quantification"),
                         date="2016-01-28")

# Select clinical and junction quantification dataset
clinical <- data[[1]]$`Clinical data`
junctionQuant <- data[[1]]$`Junction quantification (Illumina HiSeq)`

Data is only downloaded if the files are not present in the given folder. In other words, if the files were already downloaded, the function will just load the files, so it is possible to reuse the code above just to load the requested files.

Load local files

To load local files instead, indicate the folder of interest. Any files located in this folder and sub-folders will be attempted to load. To avoid errors during this process, files of interest should be put in a dedicated folder.

For instance, to load GTEx files, create a directory called GTEx, put all files of interest inside that folder and follow these commands:

folder <- "~/Downloads/GTEx/"
ignore <- c(".aux.", ".mage-tab.")
data <- loadLocalFiles(folder, ignore=ignore)

# Select clinical and junction quantification dataset
clinical <- data[[1]]$`Clinical data`
junctionQuant <- data[[1]]$`Junction quantification (Illumina HiSeq)`

Quantifying alternative splicing

As previously mentioned, alternative splicing is quantified from the previously loaded junction quantification and an alternative splicing annotation file. To check current annotation files available:

# Available alternative splicing annotation
annotList <- listSplicingAnnotations()
annotList
##                            Human (hg19/GRCh37) 
## "annotationHub_alternativeSplicingEvents.hg19"

Custom alternative splicing annotation obtained from SUPPA, rMATS, VAST-TOOLS or MISO can also be provided. Note that SUPPA and rMATS are able to create their splicing annotation based on transcript annotation. Read more here.

To quantify alternative splicing, first select the junction quantification, alternative splicing annotation and alternative splicing event type(s) of interest:

# Load Human (hg19/GRCh37 assembly) annotation
human <- listSplicingAnnotations()[[1]]
annotation <- loadAnnotation(human)
# Available alternative splicing event types (skipped exon, alternative 
# first/last exon, mutually exclusive exons, etc.)
eventType <- getSplicingEventTypes()
eventType
##               Skipped exon    Mutually exclusive exon 
##                       "SE"                      "MXE" 
## Alternative 5' splice site Alternative 3' splice site 
##                     "A5SS"                     "A3SS" 
##     Alternative first exon      Alternative last exon 
##                      "AFE"                      "ALE"
eventType <- "SE"

Afterwards, quantify alternative splicing using the previously defined parameters:

# Min. reads threshold: number of reads required to quantify a splicing event
minReads <- 10

psi <- quantifySplicing(annotation, junctionQuant, eventType=eventType, 
                        minReads=minReads)
# Check the identifier of the splicing events in the resulting table
events <- rownames(psi)
head(events)
## [1] "SE_1_-_2125078_2124414_2124284_2121220_C1orf86"        
## [2] "SE_1_+_46033849_46034157_46034356_46034599_AKR1A1"     
## [3] "SE_1_-_113249700_113247790_113247722_113246428_RHOC"   
## [4] "SE_1_-_113249700_113248874_113247722_113246428_RHOC"   
## [5] "SE_1_-_154186933_154186422_154186369_154185100_C1orf43"
## [6] "SE_1_+_154241430_154241838_154241888_154242676_UBAP2L"

Note that the event identifier (for instance, SE_1_-_2125078_2124414_2124284_2121220_C1orf86) is composed of:

WARNING: all the examples shown below are performed using a small, but representative subset of the data available.

Survival analysis

Survival data can be analysed based on clinical attributes, for instance, by tumour stage and patient gender, using time to death as the follow-up time and death as the event of interest.

Around 80% of breast cancers have cells that express estrogen receptors (ER) and require estrogen binding to grow. Estrogen receptors may be blocked by tamoxifen and other drugs. These drugs are thus used as treatment or prevention for ER-positive breast cancers.

To compare the overall survival of ER-positive patients treated with and without tamoxifen, the following groups will be created:

# Get available groups in clinical data
cols <- colnames(clinical)

# Check the name from which to retrieve groups of interest
er_expr <- grep("estrogen_receptor_status", cols, value=TRUE)[1]
er_expr <- createGroupByAttribute(col=er_expr, dataset=clinical)

# Discard values other than "positive" and "negative" for ER expression
er_expr <- er_expr[c("positive", "negative")]

############################################################################
# Now the same for patients treated with tamoxifen. However, TCGA presents #
# data for the administred drugs spread through multiple columns, so...    #
############################################################################

# Look for the appropriate columns with the drugs administred
drug_name <- grep("drug_name", cols, value=TRUE)[1:23]

# Using the previous columns, look for either "tamoxifen" or "tamoxiphen"
tamoxifen <- lapply(drug_name, function(i) grep("tamoxi.*", clinical[ , i]))

# Collect all previous results and create a single list named tamoxifen
tamoxifen <- sort(unique(unlist(tamoxifen)))
tamoxifen <- list("tamoxifen"=tamoxifen)

# Combine all previously created groups
groups <- c(er_expr, tamoxifen)

# Assign each patient to a group
patients <- nrow(clinical)
g <- groupPerPatient(groups[c("tamoxifen", "positive")], patients)

# Append the created groups to the clinical dataset
cl <- cbind(clinical, groups=g)

Before continuing, be sure you undestand how to use formulas in R.

help(formula)

To plot the survival curves:

# Create the right-hand side of a formula
formulaStr <- "groups"

# Events are retrieved based on the information available for time to event: if 
# there is info for a patient, the event is considered as occurred
daysToDeath <- "days_to_death"
survTerms  <- processSurvTerms(cl, censoring="right", event=daysToDeath,
                               timeStart=daysToDeath,
                               formulaStr=formulaStr, scale="years")

require("survival")
## Loading required package: survival
surv <- survfit(survTerms)
pvalue <- testSurvival(survTerms)

plotSurvivalCurves(surv, pvalue=pvalue)

Information regarding number of individuals and events is returned when hovering over each survival curve in the plot. The plot also allows zooming in by clicking and dragging and to omit data series by clicking on their name in the legend.

Cox Proportional Hazards model

Calculate Cox proportional hazards model using the previously used parameters by adding the argument coxph=TRUE when processing survival terms.

survTermsCox <- processSurvTerms(cl, censoring="right", event=daysToDeath,
                                 timeStart=daysToDeath,
                                 formulaStr=formulaStr, coxph=TRUE)
summary(survTermsCox)
## Call:
## coxph(formula = form, data = survData)
## 
##   n= 841, number of events= 66 
##    (255 observations deleted due to missingness)
## 
##                        coef exp(coef) se(coef)      z Pr(>|z|)   
## tamoxifen            0.9762    2.6545   0.7241  1.348  0.17756   
## tamoxifen, positive -1.1990    0.3015   0.4308 -2.783  0.00538 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                     exp(coef) exp(-coef) lower .95 upper .95
## tamoxifen              2.6545     0.3767    0.6422   10.9722
## tamoxifen, positive    0.3015     3.3167    0.1296    0.7014
## 
## Concordance= 0.582  (se = 0.038 )
## Rsquare= 0.015   (max possible= 0.523 )
## Likelihood ratio test= 12.53  on 2 df,   p=0.001901
## Wald test            = 10.07  on 2 df,   p=0.006492
## Score (logrank) test = 11.58  on 2 df,   p=0.003065

Exploring principal component analysis

Principal component analysis (PCA) is a technique to reduce data dimensionality by identifying variable combinations (called principal components) that explain the variance in the data (Ringnér 2008).

To explore alternative splicing quantification groups by estrogen receptor (ER) expression:

# Percentage of missing values tolerated by splicing event: tolerated missing
# values are replaced with the median value of that splicing event
naTolerance <- 0

# Center bu do not scale values (they are already scaled between 0 and 1)
center <- TRUE
scale  <- FALSE

# Match patients with samples
samples <- colnames(psi)
match <- getPatientFromSample(samples, clinical)

# Filter alternative splicing quantification to colour values by positive or
# negative ER expression
erGroups <- getMatchingSamples(groups[c("positive", "negative")], samples, 
                               clinical, match=match)
filtered_psi <- psi[ , unlist(erGroups)]

# Perform principal component analysis (transpose alternative splicing
# quantification to have samples as rows)
pca <- performPCA(t(filtered_psi), center=center, scale.=scale,
                  naTolerance=naTolerance)

# Plot the variance explained by each principal component
plotVariance(pca)

Now plot the score plot of the PCA coloured according to the ER expression:

plotPCA(pca, pcX="PC1", pcY="PC2", erGroups)

Moreover, let’s plot the corresponding loadings plot that displays the variables (in this case, alternative splicing events):

plotPCA(pca, pcX="PC1", pcY="PC2", individuals = FALSE, loadings = TRUE)

The bubble size of the loadings plot represent the total contribution of each alternative splicing event to the selected principal components.

Note that the clinical samples from ER-positive individuals (i.e. patients whose cancer cells express estrogen receptors) seem to separate from samples from ER-negative individuals along the principal component 1. There is one alternative splicing event that may contribute to this separation: SE 10 + 79797062 79799962 79799983 79800373 RPS24.

Differential splicing analysis

By default, differential splicing analysis is performed by sample types (i.e. tumour, normal, metastasis, etc) for multiple parametric and non-parametric statistical tests. Let’s analyse the previous splicing event found:

# Choose a method to use when adjusting p-values ("none" is valid)
# help(p.adjust.methods)
pvalueAdjust <- "BH"

# Check sample types available (normal, tumour, etc.)
types <- parseSampleGroups(colnames(psi))
unique(types)
## [1] "Primary solid Tumor" "Solid Tissue Normal" "Metastatic"
# Analyse by sample types (by default) and perform all statistical analyses (by 
# default)
event <- "SE_10_+_79797062_79799962_79799983_79800373_RPS24"
eventPSI <- psi[event, ]
stats <- diffAnalyses(eventPSI, pvalueAdjust=pvalueAdjust,
                      analyses = c("wilcoxRankSum", "ttest", "kruskal", 
                                   "levene", "fligner"))
## Performing statistical analysis 5
## Time difference of 0.0364778 secs
## Preparing data
## Time difference of 0.008715868 secs
## Adjusting p-values BH
## Time difference of 0.001695395 secs
## Include splicing event information
# View(stats)
plotDistribution(as.numeric(eventPSI), groups=types)

Differential splicing analysis returns a data frame with the results of the statistical tests, as well as other info, such as associated gene and number of samples per group. The following statistical tests are available:

Not all statistical tests are performed depending on the number of groups available. Check if any tests show statistical significance using only tumour against normal samples (for instance, Log-rank p-values below 0.05):

filter <- grep("Tumor|Normal", types)
gr_event <- types[filter]
eventPSI <- eventPSI[filter]

stats <- diffAnalyses(eventPSI, groups=gr_event, pvalueAdjust=pvalueAdjust,
                      analyses = c("wilcoxRankSum", "ttest", "kruskal", 
                                   "levene", "fligner"))
## Performing statistical analysis 5
## Time difference of 0.162817 secs
## Preparing data
## Time difference of 0.01178217 secs
## Calculating delta variance and median
## Time difference of 0.0007202625 secs
## Adjusting p-values BH
## Time difference of 0.001231194 secs
## Include splicing event information
# View(stats)
plotDistribution(as.numeric(eventPSI), groups=gr_event)

Now, change the groups to positive and negative for ER expression and check if any of the tests show statistical significance:

# Filter alternative splicing quantification by positive or negative ER
# expression; match between patient information and samples as such:
erGroups <- getMatchingSamples(groups[c("negative", "positive")], samples, 
                               clinical, match=match)
eventPSI <- psi[event, unlist(erGroups)]
erGroups <- rep(names(erGroups), sapply(erGroups, length))

stats <- diffAnalyses(eventPSI, groups=erGroups, pvalueAdjust=pvalueAdjust,
                      analyses = c("wilcoxRankSum", "ttest", "kruskal", 
                                   "levene", "fligner"))
## Performing statistical analysis 5
## Time difference of 0.03852177 secs
## Preparing data
## Time difference of 0.01090145 secs
## Calculating delta variance and median
## Time difference of 0.0007030964 secs
## Adjusting p-values BH
## Time difference of 0.001366615 secs
## Include splicing event information
# View(stats)
plotDistribution(as.numeric(eventPSI), groups=erGroups)

Survival analysis

To study the impact of an alternative splicing event on prognosis, Kaplan-Meier curves may be plotted for groups of patients separated by a given PSI cut-off for the selected alternative splicing event. Let’s carry on with the splicing event from before:

event <- "SE_10_+_79797062_79799962_79799983_79800373_RPS24"

# Assign alternative splicing quantification to patients based on their samples
clinicalPSI <- getPSIperPatient(psi, match, clinical)
eventPSI <- as.numeric(clinicalPSI[event, ])

# Statistics of the alternative splicing event's quantification
summary(eventPSI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.01419 0.46876 0.71263 0.64484 0.83518 1.00000       4

The optimal PSI cut-off that maximises the significance of their difference in survival (i.e. minimises the p-value of the Wald/Log/Logrank tests of difference in survival between individuals with PSI below and above that threshold) can be calculated as per:

opt <- optimalPSIcutoff(clinical, eventPSI, censoring="right", 
                        event="days_to_death", timeStart="days_to_death")
optimalCutoff <- opt$par # Optimal exon inclusion level
optimalCutoff
## [1] 0.6355068
optimalPvalue <- opt$value # Respective p-value
optimalPvalue
## [1] 0.02102

Finally, plot survival curves separated by the optimal quantification cut-off or any other cut-off of interest:

cutoff <- optimalCutoff
group <- labelBasedOnCutoff(eventPSI, cutoff, label="PSI values")
survTerms <- processSurvTerms(clinical, censoring="right",
                              event="days_to_death", timeStart="days_to_death", 
                              group=group)

require("survival")
surv <- survfit(survTerms)
pvalue <- testSurvival(survTerms)
plotSurvivalCurves(surv, pvalue=pvalue)

Literature support and external database information

If an event is differentially spliced and has an impact on patient survival, its association with the studied disease might be already described in the literature. Check for relevant research articles on PubMed.

It is also possible to plot transcripts and proteins related to the gene associated to a given splicing event. To retrieve and plot transcript information from ENSEMBL:

parsed <- parseSplicingEvent(event)
info <- queryEnsemblByEvent(event, species="human", assembly="hg19")
plotTranscripts(info, parsed$pos[[1]])

To retrieve and plot protein information from UniProt:

parsed <- parseSplicingEvent(event)
info <- queryEnsemblByEvent(event, species="human", assembly="hg19")

# Some of these transcripts have no corresponding protein
proteins <- info$Transcript$Translation$id
protein <- proteins[[6]]

# Convert protein identifier from ENSEMBL to UniProt
uniprot <- ensemblToUniprot(protein)
uniprot
## RS24_HUMAN (UniProtKB/Swiss-Prot) 
##                          "P62847"
uniprot <- uniprot[[1]]

plotProtein(uniprot)
## Retrieving protein annotation from UniProt...
## Plotting protein domains...

The protein plot shows the UniProt matches for the selected transcript. Hover the protein’s rendered domains to obtain more information on them.

Other database of interest include: - Human Protein Atlas (Cancer Atlas) allows to check the evidence of a gene at protein level for multiple cancer tissues. - VastDB shows multi-species alternative splicing profiles for diverse tissues and cell types. - UCSC Genome Browser may reveal protein domain disruptions caused by the alternative splicing event. To check so, activate the Pfam in UCSC Gene and UniProt tracks (in Genes and Gene Predictions) and check if any domains are annotated in the alternative and/or constitutive exons of the splicing event.

Exploring differential splicing analysis

To analyse differencial splicing, choose which clinical groups on which to perform the analyses. For instance, splicing events can be analysed based on sample types (e.g., tumour versus normal tissue, if available) or clinical groups of the patients (e.g. stage of the disease).

Let’s perform differential splicing analysis based on ER expression:

# Filter alternative splicing quantification by positive or negative ER
# expression; match between patient information and samples as such:
erGroups <- getMatchingSamples(groups[c("negative", "positive")], samples, 
                               clinical, match=match)
eventsPSI <- psi[ , unlist(erGroups)]
erGroups  <- rep(names(erGroups), sapply(erGroups, length))

diff <- diffAnalyses(eventsPSI, erGroups)
## Performing statistical analysis 7
## Time difference of 1.658497 secs
## Preparing data
## Time difference of 0.03277755 secs
## Calculating delta variance and median
## Time difference of 0.0003833771 secs
## Adjusting p-values BH
## Time difference of 0.001128197 secs
## Include splicing event information
# View(diff)

Survival analysis

To study the impact of the alternative splicing events on prognosis, significant differences in survival may be identified between patients separated by a given splicing quantification cut-off. Given the time-consuming process of identifying a cut-off that minimises the survival difference, it is recommended to filter differentially spliced events supported by statistical significance before calculating the optimal cut-off and associated p-value.

# Filter statistically significant events as you see fit
filter <- diff$`Wilcoxon p-value (BH adjusted)` <= 0.05
filter <- filter[!is.na(filter)]
events <- rownames(diff[filter, ])

# Assign alternative splicing quantification to patients based on their samples
clinicalPSI <- getPSIperPatient(psi, match, clinical)

# Get time for all survival analyses of interest (the same for all, so this is
# faster)
daysToDeath <- "days_to_death"
survTime <- getColumnsTime(clinical, event=daysToDeath, timeStart=daysToDeath)

# Prepare progress bar
min <- 1
max <- nrow(clinicalPSI)
pb <- txtProgressBar(min, max, "Calculating optimal PSI cut-off", style=3)

# Prepare empty vectors where information will be stored (faster)
optimalCutoff <- rep(NA, max)
optimalPvalue <- rep(NA, max)

# Retrieve the optimal PSI cut-off and respective p-value for multiple events
for (row in min:max) {
    singlePSI <- as.numeric(clinicalPSI[row, ])
    opt <- optimalPSIcutoff(clinical, singlePSI, censoring="right", 
                            event=daysToDeath, timeStart=daysToDeath,
                            survTime=survTime)
    optimalCutoff[row] <- opt$par # Optimal exon inclusion level
    optimalPvalue[row] <- opt$value # Respective p-value
    setTxtProgressBar(pb, row)
}

# Bind columns to differential splicing analysis if desired
diff <- cbind(diff, optimalCutoff, optimalPvalue)
# View(diff)

Afterwards, check literature information and search external databases for more information on the events of interest, including on UCSC Genome Browser for putative protein domain disruptions resulting from the splicing event.

Exploring an alternative splicing event of interest

The event SE 6 - 46823711 46822518 46822452 46821808 GPR116 has been previously reported to have potential prognostic value in breast cancer patients, where patients with higher PSI values for this event have a lower 5-year survival rate than patients with lower PSI values (Tsai et al. 2015).

Confirm so by performing differential splicing (for example, normal vs tumour samples) on this event and survival analysis by PSI cut-off. Also, check literature information and search external databases for more information, including on UCSC Genome Browser for putative protein domain disruptions resulting from the alternative splicing event.

Feedback

All feedback on the program, documentation and associated material (including this tutorial) is welcome. Please send any suggestions and comments to:

Nuno Saraiva-Agostinho (nunodanielagostinho@gmail.com) Computation Biology Lab, Instituto de Medicina Molecular (Portugal)

References

Ringnér, Markus. 2008. “What is principal component analysis?” Nature Biotechnology 26 (3): 303–4.

Tsai, Yihsuan S, Daniel Dominguez, Shawn M Gomez, and Zefeng Wang. 2015. “Transcriptome-wide identification and study of cancer-specific splicing events across multiple tumors.” Oncotarget 6 (9): 6825–39.

Wang, E. T., R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S. F. Kingsmore, G. P. Schroth, and C. B. Burge. 2008. “Alternative isoform regulation in human tissue transcriptomes.” Nature 456 (7221): 470–76.