Abstract
This Quick-Start is a runnable example showing the functionalities of the SpliceWiz workflow. Version 1.2.3
SpliceWiz is a graphical interface for differential alternative splicing and visualization in R. It differs from other alternative splicing tools as it is designed for users with basic bioinformatic skills to analyze datasets containing up to hundreds of samples! SpliceWiz contains a number of innovations including:
This vignette is a runnable working example of the SpliceWiz workflow. The purpose is to quickly demonstrate the basic functionalities of SpliceWiz.
We provide here a brief outline of the workflow for users to get started as quickly as possible. However, we also provide more details for those wishing to know more. Many sections will contain extra information that can be displayed when clicked on, such as these:
How does SpliceWiz measure alternative splicing?
SpliceWiz defines alternative splicing events (ASEs) as binary events between two possibilities, the included and excluded isoform. It detects and measures: skipped (casette) exons (SE), mutually-exclusive exons (MXE), alternative 5’/3’ splice site usage (A5SS / A3SS), alternate first / last exon usage (AFE / ALE), and retained introns (IR or RI).
SpliceWiz uses splice-specific read counts to measure ASEs. Namely, these are junction reads (reads that align across splice sites). The exception is intron retention (IR) whereby the (trimmed) mean read depth across the intron is measured (identical to the method used in IRFinder).
SpliceWiz provides two metrics:
SpliceOver
or SpliceMax
method (the latter is identical to that used in IRFinder)
Does SpliceWiz detect novel splicing events?
Novel splicing events are those in which at least one isoform is not an annotated transcript in the given gene annotation. SpliceWiz DOES detect novel splicing events.
It detects novel events by using novel junctions, using pairs of junctions that originate from or terminate at a common coordinate (novel alternate splice site usage).
Additionally, SpliceWiz detects “tandem junction reads”. These are reads that span across two or more splice junctions. The region between splice junctions can then be annotated as novel exons (if they are not identical to annotated exons). These novel exons can then be used to measure novel casette exon usage.
The basic steps of SpliceWiz are as follows:
To install SpliceWiz, start R (version “4.3”) and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SpliceWiz")
Setting up a Conda environment to use SpliceWiz
For those wishing to set up a self-contained environment with SpliceWiz installed (e.g. on a high performance cluster), we recommend using miniconda. For installation instructions, see the documentation on how to install miniconda
After installing miniconda, create a conda environment as follows:
After following the prompts, activate the environment:
Next, install R 4.2.1 as follows:
NB: We have not been able to successfuly use r-base=4.3, so we recommend using r-base=4.2.1 (until further notice).
Many of SpliceWiz’s dependencies are up-to-date from the conda-forge channel, so they are best installed via conda:
conda install -c conda-forge r-devtools r-essentials r-xml r-biocmanager \
r-fst r-plotly r-rsqlite r-rcurl
After this is done, the remainder of the packages need to be installed from the R terminal. This is because most Bioconductor packages are from the bioconda channel and appear not to be routinely updated.
So, lets enter the R terminal from the command line:
Set up Bioconductor 3.16 (which is the latest version compatible with R 4.2):
Again, follow the prompts to update any necessary packages.
Once this is done, install SpliceWiz (devel) from github:
The last step will install any remaining dependencies, taking approximately 20-30 minutes depending on your system.
Enabling OpenMP (multi-threading) for MacOS users (Optional)
For MacOS users, make sure OpenMP libraries are installed correctly. We recommend users follow this guide, but the quickest way to get started is to install libomp
via brew:
Installing statistical package dependencies (Optional)
SpliceWiz uses established statistical tools to perform alternative splicing differential analysis:
To install all of these packages:
NxtIRFdata
data package. This data package contains the example “chrZ” genome / annotations and 6 example BAM files that are used in this working example. Also, NxtIRFdata provides pre-generated mappability exclusion annotations for building human and mouse SpliceWiz references
SpliceWiz offers a graphical user interface (GUI) for interactive users, e.g. in the RStudio environment. To start using SpliceWiz GUI:
Why do we need the SpliceWiz reference?
SpliceWiz first needs to generate a set of reference files. The SpliceWiz reference is used to quantitate alternative splicing in BAM files, as well as in downstream collation, differential analysis and visualisation.
Using the example FASTA and GTF files, use the buildRef()
function to build the SpliceWiz reference:
ref_path <- file.path(tempdir(), "Reference")
buildRef(
reference_path = ref_path,
fasta = chrZ_genome(),
gtf = chrZ_gtf(),
ontologySpecies = "Homo sapiens"
)
The SpliceWiz reference can be viewed as data frames using various getter functions. For example, to view the annotated alternative splicing events (ASE):
See ?View-Reference-methods
for a comprehensive list of getter functions
Using the GUI
After starting the SpliceWiz GUI in demo mode, click the Reference
tab from the menu side bar. The following interface will be shown:
Load Demo FASTA/GTF
(5), and then click Build Reference
(6)
Where did the FASTA and GTF files come from?
The helper functions chrZ_genome()
and chrZ_gtf()
returns the paths to the example genome (FASTA) and transcriptome (GTF) file included with the NxtIRFdata
package that contains the working example used by SpliceWiz:
What is the gene ontology species?
SpliceWiz supports gene ontology analysis. To enable this capability, we first need to generate the gene ontology annotations for the appropriate species.
To see a list of supported species:
getAvailableGO()
#> [1] "Anopheles gambiae" "Arabidopsis thaliana"
#> [3] "Bos taurus" "Canis familiaris"
#> [5] "Gallus gallus" "Pan troglodytes"
#> [7] "Escherichia coli" "Drosophila melanogaster"
#> [9] "Homo sapiens" "Mus musculus"
#> [11] "Sus scrofa" "Rattus norvegicus"
#> [13] "Macaca mulatta" "Caenorhabditis elegans"
#> [15] "Xenopus laevis" "Saccharomyces cerevisiae"
#> [17] "Danio rerio"
Note that, if genome_type
is specified to a supported genome, the human / mouse gene ontology annotation will be automatically generated.
What is Mappability and why should I care about it?
For the most part, the SpliceWiz reference can be built with just the FASTA and GTF files. This is sufficient for assessment for most forms of alternative splicing events.
For intron retention, accurate assessment of intron depth is important. However, introns contain many repetitive regions that are difficult to map. We refer to these regions as “mappability exclusions”.
We adopt IRFinder’s algorithm to identify these mappability exclusions. This is determined empirically by generating synthetic reads systematically from the genome, then aligning these reads back to the same genome. Regions that contain less than the expected coverage depth of reads define “mappability exclusions”.
See the vignette: SpliceWiz cookbook for details on how to generate “mappability exclusions” for any genome.How do I use pre-built mappability exclusions to generate human and mouse references?
For human and mouse genomes, SpliceWiz provides pre-built mappability exclusion references that can be used to build the SpliceWiz reference. SpliceWiz provides these annotations via the NxtIRFdata
package.
Simply specify the genome in the parameter genome_type
in the buildRef()
function (which accepts hg38
, hg19
, mm10
and mm9
).
Additionally, a reference for non-polyadenylated transcripts is used. This has a minor role in QC of samples (to assess the adequacy of polyA capture).
For example, assuming your genome file "genome.fa"
and a transcript annotation "transcripts.gtf"
are in the working directory, a SpliceWiz reference can be built using the built-in hg38
low mappability regions and non-polyadenylated transcripts as follows:
The function SpliceWiz_example_bams()
retrieves 6 example BAM files from ExperimentHub and places a copy of these in the temporary directory.
How can I easily locate multiple BAM files?
Often, alignment pipelines process multiple samples. SpliceWiz provides convenience functions to recursively locate all the BAM files in a given folder, and tries to ascertain sample names. Often sample names can be gleaned when: * The BAM files are named by their sample names, e.g. “sample1.bam”, “sample2.bam”. In this case, level = 0
* The BAM files have generic names but are contained inside parent directories labeled by their sample names, e.g. “sample1/Unsorted.bam”, “sample2/Unsorted.bam”. In this case, level = 1
Process these BAM files using SpliceWiz:
pb_path <- file.path(tempdir(), "pb_output")
processBAM(
bamfiles = bams$path,
sample_names = bams$sample,
reference_path = ref_path,
output_path = pb_path
)
Using the GUI
After building the demo reference as shown in the previous section, start SpliceWiz GUI in demo mode. Then, click the Experiment
tab from the menu side bar. The following interface will be shown: