Version Info

R version: R version 4.2.0 RC (2022-04-19 r82224)

Bioconductor version: 3.15

Package version: 1.4.0

Introduction

ExpHunterSuite implements a comprehensive protocol for the analysis of transcriptional data using established R packages and combining their results. It covers all key steps in DEG detection, CEG detection and functional analysis for RNA-seq data. It has been implemented as an R package containing functions that can be run interactively. In addition, it also contains scripts that wrap the functions and can be run directly from the command line.

Standard Package Usage

In this section we will describe how the functions in ExpHunterSuite can be used interactively or joined together in user-written scripts. We will also describe how the output reports can be generated from this data.

Differential Expression Analysis

The most basic use of the package is to perform differential expression (DE) gene analysis. ExpHunterSuite will, following some initial preprocessing, run the different methods, combine the results, and produce an output report, as well as a single output table containing the results of all of the methods used, and their combined scores. The combined scores consist of the mean logFC value and the combined adjusted p-value (FDR) values, calculated by Fishers method.

To use ExpHunterSuite with only a single DE package, one can run the following command:

library(ExpHunterSuite)
data(toc)
data(target)
degh_out_one_pack <- main_degenes_Hunter(raw=toc, 
                                         target=target,
                                         modules="D") # D for DESeq2

Where toc is a data.frame of aligned reads per samples and target is a data.frame relating each sample to its sample meta data. Here we include a minimal example of the target file whcih includes the samples (CTL/epm2a), the samples condition (Ctrl or Treat):

head(toc)

##                    CTL_1 CTL_2 CTL_3 CTL_4 epm2a_1 epm2a_2 epm2a_3
## ENSMUSG00000102693     0     0     0     0       0       0       0
## ENSMUSG00000064842     0     0     0     0       0       0       0
## ENSMUSG00000051951   497   687   645   585     563     533     478
## ENSMUSG00000102851     0     0     0     0       0       0       0
## ENSMUSG00000103377     0     0     0     0       1       1       2
## ENSMUSG00000104017     0     2     0     0       0       4       0

head(target)

##    sample treat
## 1   CTL_1  Ctrl
## 2   CTL_2  Ctrl
## 3   CTL_3  Ctrl
## 4   CTL_4  Ctrl
## 5 epm2a_1 Treat
## 6 epm2a_2 Treat

The files containing this data are contained within the extData package directory and can be accessed in the following manner. We will come back to them in the section on command line usage.

system.file("extData", "table_of_counts.txt", package = "ExpHunterSuite")

## [1] "/tmp/RtmpfvQNw4/Rinst214a8671063e0e/ExpHunterSuite/extData/table_of_counts.txt"

system.file("extData", "target.txt", package = "ExpHunterSuite")

## [1] "/tmp/RtmpfvQNw4/Rinst214a8671063e0e/ExpHunterSuite/extData/target.txt"

To use it with multiple packages, one can run the following command:

degh_out_multi_pack <- main_degenes_Hunter(raw=toc, 
                                    target=target, 
                                    modules="DEL") # D:DESeq2 E:EdgeR, L:limma

The output is a list, which includes, in slot DE_all_genes a data.frame containing, for each gene, logFC/p-values/adjusted p-values for the different DE methods implemented:

head(degh_out_multi_pack$DE_all_genes)

##                    logFC_DESeq2    FDR_DESeq2 pvalue_DESeq2 logFC_edgeR
## ENSMUSG00000055493    -4.762721 1.621389e-131 6.841303e-134   -4.707185
## ENSMUSG00000026822     7.163151  2.054983e-86  2.601244e-88    7.171445
## ENSMUSG00000024164     3.941761 2.820463e-128 2.380137e-130    3.984540
## ENSMUSG00000097971     2.377971  8.318050e-84  1.403890e-85    2.422976
## ENSMUSG00000034855     4.460555  2.184106e-19  1.474502e-20    4.485882
## ENSMUSG00000069516     2.428258  6.232232e-61  1.314817e-62    2.476535
##                        FDR_edgeR  pvalue_edgeR logFC_limma    FDR_limma
## ENSMUSG00000055493 1.044257e-207 8.812294e-210   -4.695295 5.079109e-08
## ENSMUSG00000026822 2.664949e-237 1.124451e-239    7.613741 7.533045e-06
## ENSMUSG00000024164 3.806399e-131 4.818227e-133    3.987732 8.916977e-07
## ENSMUSG00000097971  2.984579e-90  6.296579e-92    2.416537 4.914712e-07
## ENSMUSG00000034855 5.770358e-126 9.739002e-128    4.776115 8.879106e-06
## ENSMUSG00000069516  2.536675e-65  6.421961e-67    2.463295 4.234716e-06
##                    pvalue_limma DESeq2_DEG edgeR_DEG limma_DEG DEG_counts
## ENSMUSG00000055493 2.143084e-10       TRUE      TRUE      TRUE          3
## ENSMUSG00000026822 4.132050e-07       TRUE      TRUE      TRUE          3
## ENSMUSG00000024164 1.504975e-08       TRUE      TRUE      TRUE          3
## ENSMUSG00000097971 4.147437e-09       TRUE      TRUE      TRUE          3
## ENSMUSG00000034855 5.619687e-07       TRUE      TRUE      TRUE          3
## ENSMUSG00000069516 1.923696e-07       TRUE      TRUE      TRUE          3
##                     combined_FDR FDR_labeling mean_logFCs     genes_tag
## ENSMUSG00000055493  0.000000e+00         SIGN   -4.721734 PREVALENT_DEG
## ENSMUSG00000026822 1.185758e-322         SIGN    7.316113 PREVALENT_DEG
## ENSMUSG00000024164 1.774813e-259         SIGN    3.971344 PREVALENT_DEG
## ENSMUSG00000097971 1.040397e-174         SIGN    2.405828 PREVALENT_DEG
## ENSMUSG00000034855 6.620145e-145         SIGN    4.574184 PREVALENT_DEG
## ENSMUSG00000069516 3.027486e-126         SIGN    2.456029 PREVALENT_DEG

It also contains information on whether the genes are considered to be DE, in the column genes_tag The tag PREVALENT_DEGS refers to those genes that are considered significant in at least n of the DE methods used POSSIBLE_DEGS are those considered significant by at least one method. As such, PREVALENT_DEGS and POSSIBLE_DEGS will be the same when n = 1. N is controlled by the argument minlibraries.

To be considered significant for a given method, a gene must have an adjusted-pvalue < 0.05 and |logFC| > 1; these values are adjustable using the arguments p_val_cutoff and lfc.

The genes_tag columns includes the labels NOT_DEGS and FILTERED_OUT to refer to those genes not detected as DE by at least one DE method and those that do not pass the initial low-count filtering step, controlled by parameters reads and minlibraries.

There is another column, combined_FDR – which is POS/NEG depending on whether the combined adjusted p-value as described above is less than or equal to 0.05 (or whatever the value of the argument p_val_cutoff is).

More complex model designs.

In order to control for specific variables (such as individuals in paired designs, potential confounding factors such as age, etc.),

For example, if we consider our previous experiment, but add an extra column to the target, indicating different age groupings for the samples we obtain the following:

target_multi <- data.frame(target,
  age_group = c("adult", "child", "adult", "child", "adult", "adult", "child"))
target_multi

##    sample treat age_group
## 1   CTL_1  Ctrl     adult
## 2   CTL_2  Ctrl     child
## 3   CTL_3  Ctrl     adult
## 4   CTL_4  Ctrl     child
## 5 epm2a_1 Treat     adult
## 6 epm2a_2 Treat     adult
## 7 epm2a_3 Treat     child

We may wish to control for the effects of age_group on the experiment.

This can be achieved using the argument model_variables. The variables given to this argument will be used in the model when calculating differential expression between the Treat and Ctrl samples:

degh_out_model <- main_degenes_Hunter(raw=toc, target=target_multi, 
                                      modules="D", model_variables="age_group")

This works by using the variable age_group to create a linear model formula to be passed to the different DE methods (with the exception of NOISeq).

The output has the same structure as the original analysis.

Custom model designs can also be specified in the model_variables argument, based on the R model syntax, see help(“formula”) for more details. If a custom formula is used, the custom_model argument must be set to true.

Co-expression Analysis

Co-expression analysis is included via the R package Weighted correlation network analysis (WGCNA). The idea is to look for groups (modules) of genes showing correlated expression. The groups can then be correlated with experimental factors, such as treatment vs. non treatment, as well as other groupings such as the age grouping mentioned earlier, or numeric factors such as known values of metabolites related to the experiment.

WGCNA is activated using by adding “W” to the modules argument. The traits to be correlated with the modules are specified using the string_factors and numeric_factors options:

degh_out_coexp <- main_degenes_Hunter(raw=toc, target=target_multi, 
                                      modules="DW", string_factors="age_group")

Please note that WGCNA requires a normalized expression matrix as input, as such it cannot be run alone, it must be run alongside at least one DE method, which is specified with the argument WGCNA_norm_method.

Functional Analysis

fh_out_one_pack <- functional_hunter( #Perform enrichment analysis
        degh_out_one_pack,
        'Mouse', # Use specified organism database 
        func_annot_db = "gKR", # Enrichment analysis for GO, KEGG and Reactome
        GO_subont = "BMC",
        analysis_type= "o" # Use overepresentation analysis only (Not GSEA)
)
fh_out_coexp <- functional_hunter( # Perform enrichment analisys
        degh_out_coexp,
        'Mouse', # Use specified organism database 
        func_annot_db = "gKR", # Enrichment analysis for GO, KEGG and Reactome
        GO_subont = "BMC",
        analysis_type= "o" # Use overepresentation analysi only (Not GSEA)
)

Obtaining Reports

To obtain highly detailed html reports including multiple plots to visualize the data and the results of the different analysis methods, the following commands can be used:

print(getwd())
write_expression_report(exp_results=degh_out_coexp)
write_enrich_files(func_results=fh_out_one_pack)
write_functional_report(hunter_results=degh_out_coexp, 
                        func_results=fh_out_coexp)

In all cases, the output folder for each report can be specified with the output_files option.

Command-line Package Usage

The package also includes a number of scripts, in the folder inst/scripts, which can be used to run the above functions from the command line.

input_toc <- system.file("extdata", "table_of_counts.txt", 
                         package = "mypackage")
input_toc
input_target <- system.file("extdata", "target.txt", package = "mypackage")
input_target

We recommend the user first creates a folder in which to install the ExpHunterSuite command line scripts, then copies the scripts there and make them command line accesible using these commands:

mkdir install_folder
Rscript -e "ExpHunterSuite::install_DEgenes_hunter('install_folder')"
export PATH=path_to_install_folder:$PATH

This export PATH can also be added to the .bashrc or .bash_profile files.

The user can then run the protocol from the command line with scripts such as the following, which will implement the functions and create the output reports, all from a single script.

degenes_Hunter.R -t $TARGET_FILE -i $TOC -o $EXP_RESULTS
functional_Hunter.R -i $EXP_RESULTS -m Organism -o FUNC_RESULTS

Full details of the arguments to give the the script can be found by running degenes_Hunter.R -h or functional_Hunter.R -h. More examples are given in the README file for this packet

The Expression Hunter Suite

James R. Perkins

04/29/2022