1 Introduction

Geneset enrichment is an important step in many biological data analysis workflows, particularly in bioinformatics and computational biology. At a basic level, one tests whether a group of genes overlaps significantly with each of a series of pre-defined genesets, which typically carry some biological meaning. The R package hypeR enables users to easily perform this type of analysis via a hypergeometric test, with default compatibility with the Molecular Signatures Database (MSigDB). While hypeR is similar to other geneset enrichment programs, such as the popular Enrichr, it has some unique features: setting the background population size as an integer, reducing genesets to their intersection with a background set of genes, and functions designed for R Markdown-style reports. Additionally, users can supply custom genesets that are easily defined, extending the analysis to other entities of interest such as proteins, microbes, and metabolites. The hypeR package is designed to make routine geneset enrichment seamless for scientists working in R.
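To make the test concrete, the overlap between a group of genes and a single geneset can be scored with the hypergeometric distribution in base R. The following is a minimal sketch of that idea, not hypeR's internal code; the helper name overlap_pval, the placeholder gene names, and the background size of 100 are illustrative assumptions, and the group of genes is assumed to lie within the background.

```r
# Hypothetical helper: probability of observing at least the seen overlap
# when `signature` is drawn at random from a background of `bg` genes.
overlap_pval <- function(signature, geneset, bg) {
    k <- length(intersect(signature, geneset))  # observed overlap
    m <- length(geneset)                        # genes inside the geneset
    n <- bg - m                                 # background genes outside it
    q <- length(signature)                      # number of genes drawn
    # P(X >= k) for X ~ Hypergeometric(m, n, q)
    stats::phyper(k - 1, m, n, q, lower.tail = FALSE)
}

# Placeholder gene names, for illustration only
signature <- c("G1", "G2", "G3", "G4", "G5")
geneset   <- c("G1", "G2", "G3", "G50", "G51", "G52", "G53", "G54", "G55", "G56")

pval <- overlap_pval(signature, geneset, bg = 100)
```

Subtracting one from the overlap before calling phyper() with lower.tail = FALSE turns the strict "greater than" tail into "at least as large as", which is the quantity of interest for enrichment.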

2 Installation

Install the package from Bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("hypeR")

Or install the development version of the package from GitHub.

BiocManager::install("montilab/hypeR")

Load the package into an R session.

library(hypeR)

Download all available MSigDB genesets so that they can be retrieved with hypeR::db_get().

msigdb_info <- hypeR::download_msigdb(species="Homo sapiens") 
## v6.2.1
## Downloading Gene Sets to...
## /tmp/RtmpuJkbZ8
## - C1 -> 326 Gene Sets 
## - C2.CGP -> 3433 Gene Sets 
## - C2.CP -> 252 Gene Sets 
## - C2.CP.BIOCARTA -> 217 Gene Sets 
## - C2.CP.KEGG -> 186 Gene Sets 
## - C2.CP.REACTOME -> 674 Gene Sets 
## - C3.MIR -> 221 Gene Sets 
## - C3.TFT -> 615 Gene Sets 
## - C4.CGN -> 427 Gene Sets 
## - C4.CM -> 431 Gene Sets 
## - C5.BP -> 4436 Gene Sets 
## - C5.CC -> 580 Gene Sets 
## - C5.MF -> 901 Gene Sets 
## - C6 -> 189 Gene Sets 
## - C7 -> 4872 Gene Sets 
## - H -> 50 Gene Sets

3 Workflows

3.1 Example Data

Here we define our genes of interest as a group of genes known to be involved in the tricarboxylic acid cycle.

symbols <- c("IDH3B","DLST","PCK2","CS","PDHB","PCK1","PDHA1","LOC642502",
             "PDHA2","LOC283398","FH","SDHD","OGDH","SDHB","IDH3A","SDHC",
             "IDH2","IDH1","OGDHL","PC","SDHA","SUCLG1","SUCLA2","SUCLG2")

3.3 Loading Gene Sets

Use hypeR::db_get() to retrieve a downloaded geneset collection. In this example, we are interested in all three of the following collections, so we concatenate them. A collection is simply a list of character vectors, so one can use any custom genesets in an analysis, as long as they are defined in the same way.

BIOCARTA <- db_get(msigdb_info, "C2.CP.BIOCARTA")
KEGG     <- db_get(msigdb_info, "C2.CP.KEGG")
REACTOME <- db_get(msigdb_info, "C2.CP.REACTOME")

gsets <- c(BIOCARTA, KEGG, REACTOME)
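Because a collection is just a list of character vectors, a custom one can be written by hand and concatenated in the same way. A small sketch follows; the set names and groupings are made up for illustration and do not come from MSigDB.

```r
# Hypothetical custom genesets: a named list of character vectors
custom <- list(
    "MY_TCA_CORE"    = c("CS", "IDH2", "OGDH", "SDHA", "FH"),
    "MY_PDH_COMPLEX" = c("PDHA1", "PDHB", "DLAT", "DLD")
)
more <- list("MY_SUCCINATE" = c("SUCLG1", "SUCLA2", "SDHB"))

# c() concatenates lists and keeps their names
combined <- c(custom, more)

# Basic sanity checks before running an enrichment
stopifnot(is.list(combined),
          all(vapply(combined, is.character, logical(1))),
          !is.null(names(combined)),
          !anyDuplicated(names(combined)))
```

Checking for unique, non-missing names up front is a cheap safeguard, since results are usually reported per geneset name.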

3.4 Hyper Enrichment

hyp <- hypeR(symbols, gsets, bg=7842, fdr=0.05)
## Number of genes =  24 
## Number of gene sets =  1077 
## Background population size =  7842 
## P-Value cutoff =  1 
## FDR cutoff =  0.05
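The FDR cutoff reported above filters results on p-values adjusted for multiple testing. As a rough illustration of that idea in base R, the sketch below assumes a Benjamini-Hochberg adjustment and uses made-up p-values and set names; it is not necessarily how hypeR computes its output.

```r
# Made-up p-values for five hypothetical genesets
pvals <- c(SET_A = 0.0001, SET_B = 0.004, SET_C = 0.02, SET_D = 0.2, SET_E = 0.8)

# Benjamini-Hochberg adjustment controls the false discovery rate
fdr <- p.adjust(pvals, method = "BH")

# Keep genesets whose adjusted p-value passes the cutoff
fdr_cutoff <- 0.05
significant <- names(fdr)[fdr <= fdr_cutoff]
```

With a p-value cutoff of 1, as in the output above, only the FDR threshold removes genesets from the results.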

3.5 Visualize Results

hyp_plot(hyp)

3.6 Interactive Table

hyp_show(hyp)

3.7 Save Results to Excel

hyp_to_excel(hyp, file.path="pathways.xlsx")

3.8 Save Results to Table

hyp_to_table(hyp, file.path="pathways.txt")

4 Alternative Functionality

4.1 Use Custom Gene Sets

As mentioned previously, one can use custom genesets with hypeR. In this example, we download one of the many publicly available genesets hosted by Enrichr. Once downloaded, one performs hyper enrichment as normal.

url <- "http://amp.pharm.mssm.edu/Enrichr/geneSetLibrary?mode=text&libraryName=Cancer_Cell_Line_Encyclopedia"
r <- httr::GET(url)
text <- httr::content(r, "text", encoding="ISO-8859-1")
# One geneset per line: name in field 1, gene symbols from field 3 onward
text.split <- strsplit(text, "\n")[[1]]
# lapply() always returns a list, whereas sapply() may simplify to a matrix
gsets <- lapply(text.split, function(x) {
    genes <- strsplit(x, "\t")[[1]]
    return(genes[3:length(genes)])
})
names(gsets) <- unlist(lapply(text.split, function(x) strsplit(x, "\t")[[1]][1]))

hyp <- hypeR(symbols, gsets, bg=7842, fdr=0.05)
## Number of genes =  24 
## Number of gene sets =  967 
## Background population size =  7842 
## P-Value cutoff =  1 
## FDR cutoff =  0.05
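The parsing step can be tried without a network connection. The sketch below applies the same logic to two made-up lines in the layout the code above assumes (set name, a skipped second field, then gene symbols, all tab-separated). The helper name parse_gsets and the example lines are hypothetical.

```r
# Hypothetical helper: turn tab-delimited text into a named list of genesets
parse_gsets <- function(text) {
    lines <- strsplit(text, "\n")[[1]]
    lines <- lines[nzchar(lines)]                   # drop empty lines
    fields <- strsplit(lines, "\t")
    gsets <- lapply(fields, function(f) f[-(1:2)])  # genes start at field 3
    names(gsets) <- vapply(fields, `[`, character(1), 1)
    gsets
}

# Two made-up lines in the assumed layout
text <- "SET_ONE\t\tCS\tFH\tSDHA\nSET_TWO\t\tIDH1\tIDH2\n"
gsets_demo <- parse_gsets(text)
```

Negative indexing with f[-(1:2)] returns an empty vector for a line with no genes, whereas a range such as 3:length(f) would count backwards on such a line.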

4.2 Specify a Background Population of Genes

When the background population is small, it is advisable to first reduce genesets to their intersection with the background population of genes. If a character vector of background genes is provided instead of an integer, hypeR does just that. Here is an example of an experiment that only uses sex-linked genes, so genesets are restricted to genes included in the background population.

url <- "https://www.genenames.org/cgi-bin/download/custom?col=gd_app_sym&chr=X&chr=Y&format=text"
r <- httr::GET(url)
text <- httr::content(r, "text", encoding="ISO-8859-1")
text.split <- strsplit(text, "\n")[[1]]

bg <- text.split[2:length(text.split)]
head(bg)
## [1] "ABCB7"  "ABCD1"  "ACE2"   "ACOT9"  "ACSL4"  "ACTBP1"
hyp <- hypeR(symbols, gsets, bg=bg)
## Number of genes =  24 
## Number of gene sets =  967 
## Background population size =  2387 
## P-Value cutoff =  1 
## FDR cutoff =  1
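The reduction of genesets to a background can be pictured with base R set operations. This is a sketch of the idea rather than hypeR's internal code; the toy genesets and the five-gene background are invented for illustration, and dropping emptied genesets is an optional step assumed here.

```r
# Toy genesets and a toy background, for illustration only
toy_gsets <- list(
    SET_X = c("ACE2", "ABCD1", "TP53", "BRCA1"),
    SET_Y = c("ACSL4", "MYC"),
    SET_Z = c("EGFR", "KRAS")
)
toy_bg <- c("ABCB7", "ABCD1", "ACE2", "ACOT9", "ACSL4")

# Keep only the members of each geneset that occur in the background
reduced <- lapply(toy_gsets, intersect, toy_bg)

# Optionally drop genesets that are left empty after the reduction
reduced <- reduced[lengths(reduced) > 0]

# The background population size is the number of unique background genes
bg_size <- length(unique(toy_bg))
```

Without this step, genesets would be counted at their full size even though most of their members could never have been observed, which distorts the test when the background is small.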

5 Session Info

sessionInfo()
## R Under development (unstable) (2019-03-18 r76245)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] hypeR_1.00.0     testthat_2.0.1   BiocStyle_2.11.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1         tidyr_0.8.3        prettyunits_1.0.2 
##  [4] ps_1.3.0           assertthat_0.2.1   rprojroot_1.3-2   
##  [7] digest_0.6.18      mime_0.6           R6_2.4.0          
## [10] plyr_1.8.4         backports_1.1.3    evaluate_0.13     
## [13] httr_1.4.0         ggplot2_3.1.0      pillar_1.3.1      
## [16] rlang_0.3.2        curl_3.3           lazyeval_0.2.2    
## [19] rstudioapi_0.10    data.table_1.12.0  callr_3.2.0       
## [22] DT_0.5             rmarkdown_1.12     desc_1.2.0        
## [25] devtools_2.0.1     stringr_1.4.0      htmlwidgets_1.3   
## [28] munsell_0.5.0      shiny_1.2.0        compiler_3.6.0    
## [31] httpuv_1.5.0       xfun_0.5           pkgconfig_2.0.2   
## [34] pkgbuild_1.0.3     htmltools_0.3.6    tidyselect_0.2.5  
## [37] tibble_2.1.1       bookdown_0.9       viridisLite_0.3.0 
## [40] crayon_1.3.4       dplyr_0.8.0.1      withr_2.1.2       
## [43] later_0.8.0        grid_3.6.0         jsonlite_1.6      
## [46] xtable_1.8-3       gtable_0.2.0       magrittr_1.5      
## [49] scales_1.0.0       zip_2.0.1          cli_1.1.0         
## [52] stringi_1.4.3      msigdbr_6.2.1      fs_1.2.7          
## [55] promises_1.0.1     remotes_2.0.2      openxlsx_4.1.0    
## [58] tools_3.6.0        glue_1.3.1         purrr_0.3.2       
## [61] crosstalk_1.0.0    processx_3.3.0     pkgload_1.0.2     
## [64] yaml_2.2.0         colorspace_1.4-1   BiocManager_1.30.4
## [67] sessioninfo_1.1.1  memoise_1.1.0      plotly_4.8.0      
## [70] knitr_1.22         usethis_1.4.0