pariedGSEA
is a user-friendly framework for paired differential gene
expression and splicing analyses.
Providing bulk RNA-seq count data, pariedGSEA
combines the results of
DESeq2 (Love, Huber, and Anders 2014) and
DEXSeq (Anders, Reyes, and Huber 2012), aggregates the p-values to
gene level and allows
you to run a subsequent gene set over-representation analysis using
fgsea’s fora
function (Korotkevich, Sukhov, and Sergushichev 2019).
Since version 0.99.2, you can also do the differential analyses using
limma .
pairedGSEA
was developed to highlight the importance of differential
splicing analysis. It was build in a way that yields comparable results between
splicing and expression-related events. It, by default, accounts for
surrogate variables in the data, and facilitates exploratory data analysis
either through storing intermediate results or through plotting functions
for the over-representation analysis.
This vignette will guide you through how to use the main functions of
pairedGSEA
.
pariedGSEA
assumes you have already preprocessed and aligned your sequencing
reads to transcript level.
Before starting, you should therefore have a counts matrix and a metadata file.
This data may also be in the format of a
SummarizedExperiment (Morgan et al. 2022) or
DESeqDataSet
.
Importantly, please ensure that the rownames have the format:
gene:transcript
.
The metadata should, as a minimum, contain the sample IDs
corresponding to
the column names of the count matrix, a group
column containing information
on which samples corresponds to the baseline (controls) and
the case (condition). Bear in mind, the column names can be as you wish,
but the names must be provided in the sample_col
and group_col
parameters,
respectively.
To see an example of what such data could look like,
please see ?pairedGSEA::example_se
.
To install this package, start R (version “4.3”) and enter:
Bioconductor dependencies
# Install Bioconductor dependencies
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c(
"SummarizedExperiment", "S4Vectors", "DESeq2", "DEXSeq",
"fgsea", "sva", "BiocParallel"))
Install pairedGSEA
from Bioconductor
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("pairedGSEA")
Install development version from GitHub
Running a paired Differential Gene Expression (DGE) and
Differential Gene Splicing (DGS) analysis is the first step in
the pairedGSEA
workflow.
But before running the paired_diff
function, we recommend storing
the experiment parameters in a set of variables at the top of your script
for future reference and easy access:
library("pairedGSEA")
# Defining plotting theme
ggplot2::theme_set(ggplot2::theme_classic(base_size = 20))
## Load data
# In this vignette we will use the included example Summarized Experiment.
# See ?example_se for more information about the data.
data("example_se")
if(FALSE){ # Examples of other data imports
# Example of count matrix
tx_count <- read.csv("path/to/counts.csv") # Or other load function
md_file <- "path/to/metadata.xlsx" # Can also be a .csv file or a data.frame
# Example of locally stored DESeqDataSet
dds <- readRDS("path/to/dds.RDS")
# Example of locally stored SummarizedExperiment
se <- readRDS("path/to/se.RDS")
}
## Experiment parameters
group_col <- "group_nr" # Column with the groups you would like to compare
sample_col <- "id" # Name of column that specifies the sample id of each sample.
# This is used to ensure the metadata and count data contains the same samples
# and to arrange the data according to the metadata
# (important for underlying tools)
baseline <- 1 # Name of baseline group
case <- 2 # Name of case group
experiment_title <- "TGFb treatment for 12h" # Name of your experiment. This is
# used in the file names that are stored if store_results is TRUE (recommended)
# Check if parameters above fit with metadaata
SummarizedExperiment::colData(example_se)
#> DataFrame with 6 rows and 5 columns
#> study id source final_description
#> <character> <character> <character> <character>
#> GSM1499784 GSE61220 GSM1499784 small airway epithel.. Epi,Control
#> GSM1499785 GSE61220 GSM1499785 small airway epithel.. Epi,Control
#> GSM1499786 GSE61220 GSM1499786 small airway epithel.. Epi,Control
#> GSM1499790 GSE61220 GSM1499790 small airway epithel.. Epi,TNF,12hr
#> GSM1499791 GSE61220 GSM1499791 small airway epithel.. Epi,TNF,12hr
#> GSM1499792 GSE61220 GSM1499792 small airway epithel.. Epi,TNF,12hr
#> group_nr
#> <factor>
#> GSM1499784 1
#> GSM1499785 1
#> GSM1499786 1
#> GSM1499790 2
#> GSM1499791 2
#> GSM1499792 2
The paired DGE/DGS analysis is run with paired_diff()
.
paired_diff()
is essentially a wrapper function around
DESeq2::DESeq
(Love, Huber, and Anders 2014) and DEXSeq::DEXSeq
(Anders, Reyes, and Huber 2012), the latter takes in
the ballpark of 20-30 minutes to run depending on the size of the data
and computational resources. Please visit their individual vignettes
for further information.
set.seed(500) # For reproducible results
diff_results <- paired_diff(
object = example_se,
metadata = NULL, # Use with count matrix or if you want to change it in
# the input object
group_col = group_col,
sample_col = sample_col,
baseline = baseline,
case = case,
experiment_title = experiment_title,
store_results = FALSE # Set to TRUE (recommended) if you want
# to store intermediate results, such as the results on transcript level
)
#> Running TGFb treatment for 12h
#> Preparing metadata
#>
#> Removing 0 rows with a summed count lower than 10
#> Removing 108 rows with counts in less than 2 samples.
#> Running SVA
#> No significant surrogate variables
#>
#> Found 0 surrogate variables
#> Running DESeq2
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> Extracting results
#> Initiating DEXSeq
#> Creating DEXSeqDataSet
#> converting counts to integer mode
#> Warning in DESeqDataSet(rse, design, ignoreRank = TRUE): some variables in
#> design formula are characters, converting to factors
#>
#> Running DEXSeq -- This might take a while
#> TGFb treatment for 12h is analysed.
#> Aggregating p values
#> Done.
After running the analyses, paired_diff
will aggregate the p-values to
gene level using lancaster aggregation
(Lancaster 1961) and calculate
the FDR-adjusted p-values (see ?pairedGSEA:::aggregate_pvalue
for more information).
For the DGE transcripts, the log2 fold changes will be aggregated
using a weighted mean, whereas the DGS log2 fold changes will be
assigned to the log2 fold change of the transcript with the lowest p-value.
Use the latter with a grain of salt.
From here, feel free to analyse the gene-level results using your
preferred method.
If you set store_results = TRUE
, you could extract the transcript
level results found in the results
folder under the names
"*_expression_results.RDS"
and "*_splicing_results.RDS" for the DGE and DGS analysis, respectively
(The * corresponds to the provided experiment title).
There are a range of other parameters you can play with to tailor
the experience. Here, the default settings are showed.
See ?pairedGSEA::paired_diff
for further details.
covariates = NULL,
run_sva = TRUE,
use_limma = FALSE,
prefilter = 10,
fit_type = "local",
test = "LRT",
quiet = FALSE,
parallel = TRUE,
BPPARAM = BiocParallel::bpparam(),
...
To highlight some examples of use:
limma::eBayes
and limma::diffSplice
for the analyses with
use_limma = TRUE
.covariates
to a character vector of the
specific names. This will be used in SVA, DGE, and DGS....
parameters will be fed to DESeq2::DESeq
,
see their manual for options.pairedGSEA
comes with a wrapper function for fgsea::fora
(Korotkevich, Sukhov, and Sergushichev 2019).
If you wish,
feel free to use that directly or any other gene set analysis method
- investigate the diff_results
object before use to see what
information it contains.
The inbuilt wrapper also computes the relative risk for each gene set and an
enrichment score (log2(relative_risk + 0.06)
, the pseudo count is
for plotting purposes).
Before you get going, you will need a list of gene sets (aka. pathways)
according to the species you are working with and the category of
gene sets of interest.
For this purpose, feel free to use the msigdbr
(Dolgalev 2022) wrapper function in
pairedGSEA
: pairedGSEA::prepare_msigdb
. If you do,
see ?pairedGSEA::prepare_msigdb
for further details.
The inbuilt ORA function is called paired_ora
and is run as follows
# Define gene sets in your preferred way
gene_sets <- pairedGSEA::prepare_msigdb(
species = "Homo sapiens",
category = "C5",
gene_id_type = "ensembl_gene"
)
ora <- paired_ora(
paired_diff_result = diff_results,
gene_sets = gene_sets,
experiment_title = NULL # experiment_title - if not NULL,
# the results will be stored in an RDS object in the 'results' folder
)
#> Running over-representation analyses
#> Joining result
There are many options for investigating your ORA results.
pairedGSEA
comes with an inbuilt scatter plot function that plots the
enrichment score of DGE against those of DGS.
The function allows you to interactively look at the placement of the significant pathways using plotly (Sievert 2020). You can color specific points based on a regular expression for gene sets of interest.
# Scatter plot function with default settings
plot_ora(
ora,
plotly = FALSE,
pattern = "Telomer", # Identify all gene sets about telomeres
cutoff = 0.05, # Only include significant gene sets
lines = TRUE # Guide lines
)
As mentioned, this function can be utilized in a few different ways.
The default settings will plot the enrichment scores of each analysis
and draw dashed lines for the y = x
line, y = 0
, and x = 0
.
Remove those by setting lines = FALSE
.
Make the plot interactive with plotly = TRUE
.
Highlight gene sets containing a specific regex pattern
by setting pattern
to the regex pattern of interest.
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] DEXSeq_1.48.0 DESeq2_1.42.0
#> [3] SummarizedExperiment_1.32.0 GenomicRanges_1.54.0
#> [5] IRanges_2.36.0 S4Vectors_0.40.0
#> [7] pairedGSEA_1.2.0 BiocStyle_2.30.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.1.3 bitops_1.0-7 biomaRt_2.58.0
#> [4] rlang_1.1.1 magrittr_2.0.3 msigdbr_7.5.1
#> [7] aggregation_1.0.1 matrixStats_1.0.0 compiler_4.3.1
#> [10] RSQLite_2.3.1 mgcv_1.9-0 png_0.1-8
#> [13] vctrs_0.6.4 sva_3.50.0 stringr_1.5.0
#> [16] pkgconfig_2.0.3 crayon_1.5.2 fastmap_1.1.1
#> [19] magick_2.8.1 dbplyr_2.3.4 XVector_0.42.0
#> [22] labeling_0.4.3 utf8_1.2.4 Rsamtools_2.18.0
#> [25] rmarkdown_2.25 bit_4.0.5 xfun_0.40
#> [28] zlibbioc_1.48.0 cachem_1.0.8 GenomeInfoDb_1.38.0
#> [31] jsonlite_1.8.7 progress_1.2.2 blob_1.2.4
#> [34] DelayedArray_0.28.0 BiocParallel_1.36.0 parallel_4.3.1
#> [37] prettyunits_1.2.0 R6_2.5.1 bslib_0.5.1
#> [40] stringi_1.7.12 RColorBrewer_1.1-3 limma_3.58.0
#> [43] genefilter_1.84.0 jquerylib_0.1.4 Rcpp_1.0.11
#> [46] bookdown_0.36 knitr_1.44 Matrix_1.6-1.1
#> [49] splines_4.3.1 tidyselect_1.2.0 abind_1.4-5
#> [52] yaml_2.3.7 codetools_0.2-19 hwriter_1.3.2.1
#> [55] curl_5.1.0 lattice_0.22-5 tibble_3.2.1
#> [58] withr_2.5.1 Biobase_2.62.0 KEGGREST_1.42.0
#> [61] evaluate_0.22 survival_3.5-7 BiocFileCache_2.10.0
#> [64] xml2_1.3.5 Biostrings_2.70.0 pillar_1.9.0
#> [67] BiocManager_1.30.22 filelock_1.0.2 MatrixGenerics_1.14.0
#> [70] stats4_4.3.1 generics_0.1.3 RCurl_1.98-1.12
#> [73] hms_1.1.3 ggplot2_3.4.4 munsell_0.5.0
#> [76] scales_1.2.1 xtable_1.8-4 glue_1.6.2
#> [79] tools_4.3.1 data.table_1.14.8 fgsea_1.28.0
#> [82] annotate_1.80.0 locfit_1.5-9.8 babelgene_22.9
#> [85] XML_3.99-0.14 fastmatch_1.1-4 cowplot_1.1.1
#> [88] grid_4.3.1 edgeR_4.0.0 AnnotationDbi_1.64.0
#> [91] colorspace_2.1-0 nlme_3.1-163 GenomeInfoDbData_1.2.11
#> [94] cli_3.6.1 rappdirs_0.3.3 fansi_1.0.5
#> [97] S4Arrays_1.2.0 dplyr_1.1.3 gtable_0.3.4
#> [100] sass_0.4.7 digest_0.6.33 BiocGenerics_0.48.0
#> [103] SparseArray_1.2.0 farver_2.1.1 geneplotter_1.80.0
#> [106] memoise_2.0.1 htmltools_0.5.6.1 lifecycle_1.0.3
#> [109] httr_1.4.7 statmod_1.5.0 bit64_4.0.5
Anders, Simon, Alejandro Reyes, and Wolfgang Huber. 2012. “Detecting Differential Usage of Exons from Rna-Seq Data.” Genome Research 22 (10): 2008–17. https://doi.org/10.1101/gr.133744.111.
Dolgalev, Igor. 2022. Msigdbr: MSigDB Gene Sets for Multiple Organisms in a Tidy Data Format. https://CRAN.R-project.org/package=msigdbr.
Korotkevich, Gennady, Vladimir Sukhov, and Alexey Sergushichev. 2019. “Fast Gene Set Enrichment Analysis.” bioRxiv. https://doi.org/10.1101/060012.
Lancaster, H. O. 1961. “THE Combination of Probabilities: AN Application of Orthonormal Functions.” Australian Journal of Statistics 3 (1): 20–33. https://doi.org/https://doi.org/10.1111/j.1467-842X.1961.tb00058.x.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.
Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2022. SummarizedExperiment: SummarizedExperiment Container. https://bioconductor.org/packages/SummarizedExperiment.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.