psichomics is an interactive R package for integrative analyses of alternative splicing and gene expression based on The Cancer Genome Atlas (TCGA) (containing molecular data associated with 34 tumour types), the Genotype-Tissue Expression (GTEx) project (containing data for multiple normal human tissues), Sequence Read Archive and user-provided data. The data from GTEx, TCGA and select SRA projects include subject/sample-associated information and transcriptomic data, such as the quantification of RNA-Seq reads aligning to splice junctions (henceforth called junction quantification) and exons.
Install psichomics by typing the following in an R console (the R environment is required):
After the installation, load psichomics by typing:
psichomics
: Start the visual interface of psichomicsparseSplicingEvent
: Parse splicing eventsgetSplicingEventFromGenes
: Get alternative splicing events from genesgetGenesFromSplicingEvents
: Get genes from alternative splicing eventsTCGA/Firebrowse
getDownloadsFolder
: Get the user’s Downloads folderisFirebrowseUp
: Check if Firebrowse web API is onlinegetFirebrowseCohorts
: Query the Firebrowse web API for TCGA cohortsgetFirebrowseDataTypes
: Query the Firebrowse web API for TCGA data typesgetFirebrowseDates
: Query the Firebrowse web API for processing datesloadFirebrowseData
: Download and load TCGA data through the Firebrowse web APIparseTcgaSampleInfo
: Parse sample information from TCGA samplesGTEx
getGtexTissues
: Check available tissues from a file containing sample metadataloadGtexData
: Load GTEx dataSRA
recount::recount_abstract
: Check available SRA projects from recountloadSRAproject
: Download and load SRA projects through the recount R packageCustom and/or local files
loadLocalFiles
: Load local files from a given folderprepareSRAmetadata
: Prepare metadata from SRA (as obtained from Run Selector page)prepareJunctionQuant
: Prepare junction quantification files from splice-aware aligners (currently, psichomics supports the output of the splice-aware aligner STAR)prepareGeneQuant
: Prepare gene quantification files from splice-aware aligners (currently, psichomics supports the output of the splice-aware aligner STAR)plotRowStats
: Plot statistics (median, variance, etc.) per geneplotGeneExprPerSample
: Plot distribution of gene expression per samplefilterGeneExpr
: Filter genes based on their expressionnormaliseGeneExpression
: Normalise gene expression dataconvertGeneIdentifiers
: Convert between different gene identifiersgetSplicingEventTypes
: Get quantifiable splicing event typeslistSplicingAnnotations
: List available alternative splicing annotation filesloadAnnotation
: Load an alternative splicing annotation filequantifySplicing
: Quantify alternative splicingplotRowStats
: Plot statistics (median, variance, etc.) per alternative splicing eventfilterPSI
: Filter alternative splicing quantificationCustom alternative splicing annotation preparation
prepareAnnotationFromEvents
parseMatsAnnotation
: Parse splicing annotation from rMATSparseMisoAnnotation
: Parse splicing annotation from MISOparseSuppaAnnotation
: Parse splicing annotation from SUPPAparseVastToolsAnnotation
: Parse splicing annotation from VAST-TOOLSgetGeneList
: Get pre-created, literature-based lists of genescreateGroupByAttribute
: Split elements into groups based on a given attributegetSampleFromSubject
: Get samples matching the given patientsgetSubjectFromSample
: Get patients matching the given samplesgroupPerElem
: Return a vector with one group for each elementtestGroupIndependence
: Test multiple contigency tables comprised by two groups (one reference group and another containing remaing elements) and provided groupsplotGroupIndependence
: Plot -log10(p-values) of the results obtained after multiple group independence testingPrincipal component analysis (PCA)
performPCA
: Perform PCAplotVariance
: Render variance plotcalculateLoadingsContribution
: Calculate the contribution of the variables and return values in a data frameplotPCA
: Plot PCA individuals (scores) and/or variable contributionsIndependent component analysis (ICA)
performICA
: Perform ICAplotICA
: Plot ICA scoresgetAttributesTime
: Get time for given columns in a clinical datasetassignValuePerSubject
: Assign average samples values to their corresponding patientslabelBasedOnCutoff
: Label groups based on a given cutoffoptimalSurvivalCutoff
: Calculate optimal data cutoff that best separates survival curvestestSurvival
: Test the survival difference between groups of patientsprocessSurvTerms
: Process survival curves terms to calculate survival curvessurvfit
: Compute estimates of survival curvessurvdiffTerms
: Test differences between survival curvesplotSurvivalCurves
: Plot survival curvesplotSurvivalPvaluesByCutoff
: Plot p-values of survival difference between groups based on multiple cutoffsdiffAnalyses
: Perform statistical analyses (including differential splicing and gene expression)plotDistribution
: Plot distribution using a density plotcorrelateGEandAS
: Test for association between paired samples’ gene expression (for any genes of interest) and alternative splicing quantificationplotCorrelation
: Scatter plots of the correlation resultsas.table
: Table of the correlation resultsqueryEnsemblByEvent
: Query Ensembl based on an alternative splicing eventqueryEnsemblByGene
: Query Ensembl based on a geneensemblToUniprot
: Convert an Ensembl identifier to the respective UniProt identifierplotProtein
: Plot domains of a given proteinplotTranscripts
: Plot transcripts of a given geneThe following case study was adapted from psichomics’ original article:
Nuno Saraiva-Agostinho and Nuno L. Barbosa-Morais (2019). psichomics: graphical application for alternative splicing quantification and analysis. Nucleic Acids Research.
Breast cancer is the cancer type with the highest incidence and mortality in women (Torre et al., 2015) and multiple studies have suggested that transcriptome-wide analyses of alternative splicing changes in breast tumours are able to uncover tumour-specific biomarkers (Tsai et al., 2015; Danan-Gotthold et al., 2015; Anczuków et al., 2015). Given the relevance of early detection of breast cancer to patient survival, we can use psichomics to identify novel tumour stage-I-specific molecular signatures based on differentially spliced events.
The quantification of each alternative splicing event is based on the proportion of junction reads that support the inclusion isoform, known as percent spliced-in or PSI (Wang et al., 2008).
To estimate this value for each splicing event, both alternative splicing annotation and junction quantification are required. While alternative splicing annotation is provided by the package, junction quantification may be retrieved from TCGA, GTEx, SRA or user-provided files.
Data is downloaded from Firebrowse, a service that hosts proccessed data from TCGA, as required to run the downstream analyses. Before downloading data, check the following options:
# Available tumour types
cohorts <- getFirebrowseCohorts()
# Available sample dates
date <- getFirebrowseDates()
# Available data types
dataTypes <- getFirebrowseDataTypes()
Note there is also the option for Gene expression (normalised by RSEM). However, we recommend to load the raw gene expression data instead, followed by filtering and normalisation as demonstrated afterwards.
After deciding on the options to use, download and load breast cancer data as follows:
# Set download folder
folder <- getDownloadsFolder()
# Download and load most recent junction quantification and clinical data from
# TCGA/Firebrowse for Breast Cancer
data <- loadFirebrowseData(folder=folder,
cohort="BRCA",
data=c("clinical", "junction_quantification",
"RSEM_genes"),
date="2016-01-28")
# Select clinical and junction quantification dataset
clinical <- data[[1]]$`Clinical data`
sampleInfo <- data[[1]]$`Sample metadata`
junctionQuant <- data[[1]]$`Junction quantification (Illumina HiSeq)`
geneExpr <- data[[1]]$`Gene expression`
Data is only downloaded if the files are not present in the given folder. In other words, if the files were already downloaded, the function will just load the files, so it is possible to reuse the code above just to load the requested files.
Windows limitations: If you are using Windows, note that the downloaded files have huge names that may be over Windows Maximum Path Length. A workaround would be to manually rename the downloaded files to have shorter names, move all downloaded files to a single folder and load such folder. Read how in section Load unspecified local files at the end of this document.
As this package does not focuses on gene expression analysis, we suggest to read the RNA-seq section of limma
’s user guide. Nevertheless, we present the following commands to quickly filter and normalise gene expression:
# Check genes where min. counts are available in at least N samples and filter
# out genes with mean expression and variance of 0
checkCounts <- rowSums(geneExpr >= 10) >= 10
filter <- rowMeans(geneExpr) > 0 & rowVars(geneExpr) > 0 & checkCounts
geneExprFiltered <- geneExpr[filter, ]
# Normalise gene expression and perform log2-transformation (after adding 0.5
# to avoid zeroes)
geneExprNorm <- normaliseGeneExpression(geneExprFiltered, log2transform=TRUE)
After loading the clinical and alternative splicing junction quantification data from TCGA, quantify alternative splicing by clicking the green panel Alternative splicing quantification.
As previously mentioned, alternative splicing is quantified from the previously loaded junction quantification and an alternative splicing annotation file. To check current annotation files available:
## Human hg19/GRCh37 (2017-10-20)
## "annotationHub_alternativeSplicingEvents.hg19_V2.rda"
## Human hg19/GRCh37 (2016-10-11)
## "annotationHub_alternativeSplicingEvents.hg19.rda"
## Human hg38 (2018-04-30)
## "annotationHub_alternativeSplicingEvents.hg38_V2.rda"
Custom splicing annotation: Additional alternative splicing annotations can be prepared for psichomics by parsing the annotation from programs like VAST-TOOLS, MISO, SUPPA and rMATS. Note that SUPPA and rMATS are able to create their splicing annotation based on transcript annotation. For more information, read this tutorial.
To quantify alternative splicing, first select the junction quantification, alternative splicing annotation and alternative splicing event type(s) of interest:
# Load Human (hg19/GRCh37 assembly) annotation
human <- listSplicingAnnotations()[[1]]
annotation <- loadAnnotation(human)
# Available alternative splicing event types (skipped exon, alternative
# first/last exon, mutually exclusive exons, etc.)
getSplicingEventTypes()
## Skipped exon
## "SE"
## Mutually exclusive exon
## "MXE"
## Alternative 5' splice site
## "A5SS"
## Alternative 3' splice site
## "A3SS"
## Alternative first exon
## "AFE"
## Alternative last exon
## "ALE"
## Alternative first exon (exon-centred - less reliable)
## "AFE_exon"
## Alternative last exon (exon-centred - less reliable)
## "ALE_exon"
Afterwards, quantify alternative splicing using the previously defined parameters:
# Discard alternative splicing quantified using few reads
minReads <- 10 # default
psi <- quantifySplicing(annotation, junctionQuant, minReads=minReads)
# Check the identifier of the splicing events in the resulting table
events <- rownames(psi)
head(events)
## [1] "SE_3_+_13661331_13663275_13663415_13667945_FBLN2"
## [2] "SE_3_+_57908750_57911572_57911661_57913023_SLMAP"
## [3] "ALE_3_+_57908750_57911572_57913023_SLMAP"
## [4] "SE_3_-_37136283_37133029_37132958_37125297_LRRFIP2"
## [5] "SE_12_-_56558432_56558152_56558087_56557549_SMARCC2"
## [6] "AFE_4_+_56755098_56750094_56756389_EXOC1"
Note that the event identifier (for instance, SE_1_-_2125078_2124414_2124284_2121220_C1orf86
) is composed of:
SE
stands for skipped exon)1
)-
)C1orf86
)Warning: all examples shown in this case study are performed using a small, yet representative subset of the available data. Therefore, values shown here may correspond to those when performing the whole analysis.
Let us create groups based on available samples types (i.e. Metastatic, Primary solid Tumor and Solid Tissue Normal) and tumour stages. As tumour stages are divided by sub-stages, we will merge sub-stages so as to have only tumour samples from stages I, II, III and IV (stage X samples are discarded as they are uncharacterised tumour samples).
# Group by normal and tumour samples
types <- createGroupByAttribute("Sample types", sampleInfo)
normal <- types$`Solid Tissue Normal`
tumour <- types$`Primary solid Tumor`
# Group by tumour stage (I, II, III or IV) or normal samples
stages <- createGroupByAttribute(
"patient.stage_event.pathologic_stage_tumor_stage", clinical)
groups <- list()
for (i in c("i", "ii", "iii", "iv")) {
stage <- Reduce(union,
stages[grep(sprintf("stage %s[a|b|c]{0,1}$", i), names(stages))])
# Include only tumour samples
stageTumour <- names(getSubjectFromSample(tumour, stage))
elem <- list(stageTumour)
names(elem) <- paste("Tumour Stage", toupper(i))
groups <- c(groups, elem)
}
groups <- c(groups, Normal=list(normal))
# Prepare group colours (for consistency across downstream analyses)
colours <- c("#6D1F95", "#FF152C", "#00C7BA", "#FF964F", "#00C65A")
names(colours) <- names(groups)
attr(groups, "Colour") <- colours
# Prepare normal versus tumour stage I samples
normalVSstage1Tumour <- groups[c("Tumour Stage I", "Normal")]
attr(normalVSstage1Tumour, "Colour") <- attr(groups, "Colour")
# Prepare normal versus tumour samples
normalVStumour <- list(Normal=normal, Tumour=tumour)
attr(normalVStumour, "Colour") <- c(Normal="#00C65A", Tumour="#EFE35C")
PCA is a technique to reduce data dimensionality by identifying variable combinations (called principal components) that explain the variance in the data (Ringnér, 2008). Use the following commands to perform PCA:
# PCA of PSI between normal and tumour stage I samples
psi_stage1Norm <- psi[ , unlist(normalVSstage1Tumour)]
pcaPSI_stage1Norm <- performPCA(t(psi_stage1Norm))
As PCA cannot be performed on data with missing values, missing values need to be either removed (thus discarding data from whole splicing events or genes) or impute them (i.e. attributing to missing values the median of the non-missing ones). Use the argument
missingValues
within functionperformPCA
to select the number of missing values that are tolerable per event (i.e. if a splicing event or gene has less than N missing values, those missing values will be imputed; otherwise, the event is discarded from PCA).