Contents


1 Introduction

pariedGSEA is a user-friendly framework for paired differential gene expression and splicing analyses. Providing bulk RNA-seq count data, pariedGSEA combines the results of DESeq2 (Love, Huber, and Anders 2014) and DEXSeq (Anders, Reyes, and Huber 2012), aggregates the p-values to gene level and allows you to run a subsequent gene set over-representation analysis using fgsea’s fora function (Korotkevich, Sukhov, and Sergushichev 2019). Since version 0.99.2, you can also do the differential analyses using limma .

pairedGSEA was developed to highlight the importance of differential splicing analysis. It was build in a way that yields comparable results between splicing and expression-related events. It, by default, accounts for surrogate variables in the data, and facilitates exploratory data analysis either through storing intermediate results or through plotting functions for the over-representation analysis.

This vignette will guide you through how to use the main functions of pairedGSEA.

pariedGSEA assumes you have already preprocessed and aligned your sequencing reads to transcript level. Before starting, you should therefore have a counts matrix and a metadata file. This data may also be in the format of a SummarizedExperiment (Morgan et al. 2022) or DESeqDataSet. Importantly, please ensure that the rownames have the format: gene:transcript.

The metadata should, as a minimum, contain the sample IDs corresponding to the column names of the count matrix, a group column containing information on which samples corresponds to the baseline (controls) and the case (condition). Bear in mind, the column names can be as you wish, but the names must be provided in the sample_col and group_col parameters, respectively.

To see an example of what such data could look like, please see ?pairedGSEA::example_se.

2 Installation

To install this package, start R (version “4.2”) and enter:

Bioconductor dependencies

# Install Bioconductor dependencies
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c(
    "SummarizedExperiment", "S4Vectors", "DESeq2", "DEXSeq",
    "fgsea", "sva", "BiocParallel"))

Install pairedGSEA from GitHub

# Install pairedGSEA from github
devtools::install_github("shdam/pairedGSEA", build_vignettes = TRUE)

Install pairedGSEA from Bioconductor

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("pairedGSEA", version = 'devel')

3 Paired differential expression and splicing

3.1 Preparing the experiment parameters

Running a paired Differential Gene Expression (DGE) and Differential Gene Splicing (DGS) analysis is the first step in the pairedGSEA workflow.

But before running the paired_diff function, we recommend storing the experiment parameters in a set of variables at the top of your script for future reference and easy access:

library("pairedGSEA")
#> Warning: replacing previous import 'utils::findMatches' by
#> 'S4Vectors::findMatches' when loading 'AnnotationDbi'

# Defining plotting theme
ggplot2::theme_set(ggplot2::theme_classic(base_size = 20))

## Load data
# In this vignette we will use the included example Summarized Experiment.
# See ?example_se for more information about the data.
data("example_se") 

if(FALSE){ # Examples of other data imports
    # Example of count matrix
    tx_count <- read.csv("path/to/counts.csv") # Or other load function
    md_file <- "path/to/metadata.xlsx" # Can also be a .csv file or a data.frame
    
    # Example of locally stored DESeqDataSet
    dds <- readRDS("path/to/dds.RDS")
    
    # Example of locally stored SummarizedExperiment
    se <- readRDS("path/to/se.RDS")
}

## Experiment parameters
group_col <- "group_nr" # Column with the groups you would like to compare
sample_col <- "id" # Name of column that specifies the sample id of each sample.
# This is used to ensure the metadata and count data contains the same samples
# and to arrange the data according to the metadata
# (important for underlying tools)
baseline <- 1 # Name of baseline group
case <- 2 # Name of case group
experiment_title <- "TGFb treatment for 12h" # Name of your experiment. This is
# used in the file names that are stored if store_results is TRUE (recommended)

3.2 Check your metadata

# Check if parameters above fit with metadaata
SummarizedExperiment::colData(example_se)
#> DataFrame with 6 rows and 5 columns
#>                  study          id                 source final_description
#>            <character> <character>            <character>       <character>
#> GSM1499784    GSE61220  GSM1499784 small airway epithel..       Epi,Control
#> GSM1499785    GSE61220  GSM1499785 small airway epithel..       Epi,Control
#> GSM1499786    GSE61220  GSM1499786 small airway epithel..       Epi,Control
#> GSM1499790    GSE61220  GSM1499790 small airway epithel..      Epi,TNF,12hr
#> GSM1499791    GSE61220  GSM1499791 small airway epithel..      Epi,TNF,12hr
#> GSM1499792    GSE61220  GSM1499792 small airway epithel..      Epi,TNF,12hr
#>            group_nr
#>            <factor>
#> GSM1499784        1
#> GSM1499785        1
#> GSM1499786        1
#> GSM1499790        2
#> GSM1499791        2
#> GSM1499792        2
# Check that all data samples are in the metadata
all(colnames(example_se) %in%
        SummarizedExperiment::colData(example_se)[[sample_col]])
#> [1] TRUE

3.3 Running the analysis

The paired DGE/DGS analysis is run with paired_diff(). paired_diff() is essentially a wrapper function around DESeq2::DESeq (Love, Huber, and Anders 2014) and DEXSeq::DEXSeq (Anders, Reyes, and Huber 2012), the latter takes in the ballpark of 20-30 minutes to run depending on the size of the data and computational resources. Please visit their individual vignettes for further information.

set.seed(500) # For reproducible results

diff_results <- paired_diff(
    object = example_se,
    metadata = NULL, # Use with count matrix or if you want to change it in
    # the input object
    group_col = group_col,
    sample_col = sample_col,
    baseline = baseline,
    case = case,
    experiment_title = experiment_title,
    store_results = FALSE # Set to TRUE (recommended)  if you want
    # to store intermediate results, such as the results on transcript level 
    )
#> Running TGFb treatment for 12h
#> Preparing metadata
#> 
#> Removing 0 rows with a summed count lower than 10
#> Removing 108 rows with counts in less than 2 samples.
#> Running SVA
#> No significant surrogate variables
#> 
#> Found 0 surrogate variables
#> Running DESeq2
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> Extracting results
#> Initiating DEXSeq
#> Creating DEXSeqDataSet
#> converting counts to integer mode
#> Warning in DESeqDataSet(rse, design, ignoreRank = TRUE): some variables in
#> design formula are characters, converting to factors
#> 
#> Running DEXSeq -- This might take a while
#> TGFb treatment for 12h is analysed.
#> Aggregating p values
#> Done.

After running the analyses, paired_diff will aggregate the p-values to gene level using lancaster aggregation (Lancaster 1961) and calculate the FDR-adjusted p-values (see ?pairedGSEA:::aggregate_pvalue for more information). For the DGE transcripts, the log2 fold changes will be aggregated using a weighted mean, whereas the DGS log2 fold changes will be assigned to the log2 fold change of the transcript with the lowest p-value. Use the latter with a grain of salt.

From here, feel free to analyse the gene-level results using your preferred method. If you set store_results = TRUE, you could extract the transcript level results found in the results folder under the names "*_expression_results.RDS" and "*_splicing_results.RDS" for the DGE and DGS analysis, respectively (The * corresponds to the provided experiment title).

3.4 Additional parameters

There are a range of other parameters you can play with to tailor the experience. Here, the default settings are showed. See ?pairedGSEA::paired_diff for further details.

covariates = NULL,
run_sva = TRUE,
use_limma = FALSE,
prefilter = 10,
fit_type = "local",
test = "LRT",
quiet = FALSE,
parallel = TRUE,
BPPARAM = BiocParallel::bpparam(),
...

To highlight some examples of use:

  1. You can use limma::eBayes and limma::diffSplice for the analyses with use_limma = TRUE.
  2. You can use additional columns in your metadata as covariates in the model matrix by setting covariates to a character vector of the specific names. This will be used in SVA, DGE, and DGS.
  3. The test should be kept at default settings, but advanced users may use the “wald” test instead, if they wish.
  4. The ... parameters will be fed to DESeq2::DESeq, see their manual for options.
  5. If you want a stricter, more loose, or more advanced pre-filtering (generally not necessary, as DESeq2 and DEXSeq performs their own pre-filtering), you can set the parameter to a different value or NULL and use your own pre-filtering method directly on the counts matrix.


4 Over-Representation Analysis

pairedGSEA comes with a wrapper function for fgsea::fora (Korotkevich, Sukhov, and Sergushichev 2019). If you wish, feel free to use that directly or any other gene set analysis method - investigate the diff_results object before use to see what information it contains.

The inbuilt wrapper also computes the relative risk for each gene set and an enrichment score (log2(relative_risk + 0.06), the pseudo count is for plotting purposes).

4.0.1 Running the inbuilt ORA function

Before you get going, you will need a list of gene sets (aka. pathways) according to the species you are working with and the category of gene sets of interest. For this purpose, feel free to use the msigdbr (Dolgalev 2022) wrapper function in pairedGSEA: pairedGSEA::prepare_msigdb. If you do, see ?pairedGSEA::prepare_msigdb for further details.

The inbuilt ORA function is called paired_ora and is run as follows

# Define gene sets in your preferred way
gene_sets <- pairedGSEA::prepare_msigdb(
    species = "Homo sapiens", 
    category = "C5", 
    gene_id_type = "ensembl_gene"
    )

ora <- paired_ora(
    paired_diff_result = diff_results,
    gene_sets = gene_sets,
    experiment_title = NULL # experiment_title - if not NULL,
    # the results will be stored in an RDS object in the 'results' folder
    )
#> Running over-representation analyses
#> Joining result

4.1 Additional parameters

cutoff = 0.05,
min_size = 25,
quiet = FALSE

Please investigate the returned object to see the column names and what they contain.


5 Analysing ORA results

There are many options for investigating your ORA results. pairedGSEA comes with an inbuilt scatter plot function that plots the enrichment score of DGE against those of DGS.

The function allows you to interactively look at the placement of the significant pathways using plotly (Sievert 2020). You can color specific points based on a regular expression for gene sets of interest.

5.1 Scatter plot of enrichment scores

# Scatter plot function with default settings
plot_ora(
    ora,
    plotly = FALSE,
    pattern = "Telomer", # Identify all gene sets about telomeres
    cutoff = 0.05, # Only include significant gene sets
    lines = TRUE # Guide lines
    )

As mentioned, this function can be utilized in a few different ways. The default settings will plot the enrichment scores of each analysis and draw dashed lines for the y = x line, y = 0, and x = 0. Remove those by setting lines = FALSE.

Make the plot interactive with plotly = TRUE.

Highlight gene sets containing a specific regex pattern by setting pattern to the regex pattern of interest.

6 Session Info

sessionInfo()
#> R version 4.3.0 RC (2023-04-13 r84269)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pairedGSEA_1.0.0 BiocStyle_2.28.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.1.3                   bitops_1.0-7               
#>   [3] biomaRt_2.56.0              rlang_1.1.0                
#>   [5] magrittr_2.0.3              msigdbr_7.5.1              
#>   [7] aggregation_1.0.1           matrixStats_0.63.0         
#>   [9] compiler_4.3.0              RSQLite_2.3.1              
#>  [11] mgcv_1.8-42                 png_0.1-8                  
#>  [13] vctrs_0.6.2                 sva_3.48.0                 
#>  [15] stringr_1.5.0               pkgconfig_2.0.3            
#>  [17] crayon_1.5.2                fastmap_1.1.1              
#>  [19] magick_2.7.4                dbplyr_2.3.2               
#>  [21] XVector_0.40.0              labeling_0.4.2             
#>  [23] utf8_1.2.3                  Rsamtools_2.16.0           
#>  [25] rmarkdown_2.21              bit_4.0.5                  
#>  [27] xfun_0.39                   zlibbioc_1.46.0            
#>  [29] cachem_1.0.7                GenomeInfoDb_1.36.0        
#>  [31] jsonlite_1.8.4              progress_1.2.2             
#>  [33] blob_1.2.4                  highr_0.10                 
#>  [35] DelayedArray_0.26.0         BiocParallel_1.34.0        
#>  [37] parallel_4.3.0              prettyunits_1.1.1          
#>  [39] R6_2.5.1                    bslib_0.4.2                
#>  [41] stringi_1.7.12              RColorBrewer_1.1-3         
#>  [43] limma_3.56.0                genefilter_1.82.0          
#>  [45] GenomicRanges_1.52.0        jquerylib_0.1.4            
#>  [47] Rcpp_1.0.10                 bookdown_0.33              
#>  [49] SummarizedExperiment_1.30.0 knitr_1.42                 
#>  [51] IRanges_2.34.0              Matrix_1.5-4               
#>  [53] splines_4.3.0               tidyselect_1.2.0           
#>  [55] yaml_2.3.7                  codetools_0.2-19           
#>  [57] hwriter_1.3.2.1             curl_5.0.0                 
#>  [59] lattice_0.21-8              tibble_3.2.1               
#>  [61] withr_2.5.0                 Biobase_2.60.0             
#>  [63] KEGGREST_1.40.0             evaluate_0.20              
#>  [65] survival_3.5-5              BiocFileCache_2.8.0        
#>  [67] xml2_1.3.3                  Biostrings_2.68.0          
#>  [69] pillar_1.9.0                BiocManager_1.30.20        
#>  [71] filelock_1.0.2              MatrixGenerics_1.12.0      
#>  [73] stats4_4.3.0                generics_0.1.3             
#>  [75] RCurl_1.98-1.12             S4Vectors_0.38.0           
#>  [77] hms_1.1.3                   ggplot2_3.4.2              
#>  [79] munsell_0.5.0               scales_1.2.1               
#>  [81] xtable_1.8-4                glue_1.6.2                 
#>  [83] tools_4.3.0                 data.table_1.14.8          
#>  [85] fgsea_1.26.0                annotate_1.78.0            
#>  [87] locfit_1.5-9.7              babelgene_22.9             
#>  [89] XML_3.99-0.14               fastmatch_1.1-3            
#>  [91] cowplot_1.1.1               grid_4.3.0                 
#>  [93] DEXSeq_1.46.0               edgeR_3.42.0               
#>  [95] AnnotationDbi_1.62.0        colorspace_2.1-0           
#>  [97] nlme_3.1-162                GenomeInfoDbData_1.2.10    
#>  [99] cli_3.6.1                   rappdirs_0.3.3             
#> [101] fansi_1.0.4                 dplyr_1.1.2                
#> [103] gtable_0.3.3                DESeq2_1.40.0              
#> [105] sass_0.4.5                  digest_0.6.31              
#> [107] BiocGenerics_0.46.0         farver_2.1.1               
#> [109] geneplotter_1.78.0          memoise_2.0.1              
#> [111] htmltools_0.5.5             lifecycle_1.0.3            
#> [113] httr_1.4.5                  statmod_1.5.0              
#> [115] bit64_4.0.5

References

Anders, Simon, Alejandro Reyes, and Wolfgang Huber. 2012. “Detecting Differential Usage of Exons from Rna-Seq Data.” Genome Research 22 (10): 2008–17. https://doi.org/10.1101/gr.133744.111.

Dolgalev, Igor. 2022. Msigdbr: MSigDB Gene Sets for Multiple Organisms in a Tidy Data Format. https://CRAN.R-project.org/package=msigdbr.

Korotkevich, Gennady, Vladimir Sukhov, and Alexey Sergushichev. 2019. “Fast Gene Set Enrichment Analysis.” bioRxiv. https://doi.org/10.1101/060012.

Lancaster, H. O. 1961. “THE Combination of Probabilities: AN Application of Orthonormal Functions.” Australian Journal of Statistics 3 (1): 20–33. https://doi.org/https://doi.org/10.1111/j.1467-842X.1961.tb00058.x.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.

Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2022. SummarizedExperiment: SummarizedExperiment Container. https://bioconductor.org/packages/SummarizedExperiment.

Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.