1 GSEA algorithm

A common approach to analyzing gene expression profiles is to identify differentially expressed genes that are deemed interesting. The enrichment analyses demonstrated in the Disease enrichment analysis vignette are based on such differentially expressed genes. This approach finds genes where the difference is large, but it will not detect a situation where the difference is small yet evidenced in a coordinated way across a set of related genes. Gene Set Enrichment Analysis (GSEA)¹ directly addresses this limitation. All genes can be used in GSEA; it aggregates the per-gene statistics across the genes within a gene set, which makes it possible to detect situations where all genes in a predefined set change in a small but coordinated way. This matters because many relevant phenotypic differences are likely to be manifested by small but consistent changes in a set of genes.

Genes are ranked by a statistic that measures their association with the phenotype. Given an a priori defined set of genes S (e.g., genes sharing the same DO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or primarily found at the top or bottom.

There are three key elements of the GSEA method:

Calculation of an Enrichment Score.
The enrichment score (ES) represents the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter a gene not in S. The magnitude of the increment depends on the gene statistic (e.g., correlation of the gene with the phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic¹.
Estimation of Significance Level of ES.
The p-value of the ES is calculated using a permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permuted data, which generates a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
Adjustment for Multiple Hypothesis Testing.
When all gene sets have been evaluated, DOSE adjusts the estimated significance levels to account for multiple hypothesis testing, and q-values are also calculated for FDR control.
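
The three elements above can be sketched in base R. This is a minimal illustration on a toy ranked list, not the code used inside DOSE or fgsea: the function names `enrichment_score` and `permutation_pvalue`, the `exponent` weight, the toy gene names, and the same-sign p-value convention are all assumptions made for the example, and only the Benjamini-Hochberg adjustment is shown (q-values are omitted).

```r
# Toy ranked list: 20 hypothetical genes, statistics sorted in decreasing order.
set.seed(42)
stats <- setNames(sort(rnorm(20), decreasing = TRUE), paste0("g", 1:20))

# 1. Enrichment score: weighted running sum over the ranked list.
#    Hits are weighted by |statistic|^exponent (exponent = 1 is an assumption).
enrichment_score <- function(stats, gene_set, exponent = 1) {
  hit   <- names(stats) %in% gene_set
  n_hit <- sum(hit)
  stopifnot(n_hit > 0, n_hit < length(stats))
  up   <- ifelse(hit, abs(stats)^exponent, 0)
  up   <- up / sum(up)                              # increments sum to +1
  down <- ifelse(hit, 0, 1 / (length(stats) - n_hit))  # decrements sum to -1
  running <- cumsum(up - down)
  running[which.max(abs(running))]                  # signed maximum deviation
}

# 2. Significance: shuffle the gene labels and recompute the ES to build a
#    null distribution, then compare the observed ES with null scores of the
#    same sign. Adding 1 avoids a p-value of exactly zero.
permutation_pvalue <- function(stats, gene_set, n_perm = 1000) {
  observed <- enrichment_score(stats, gene_set)
  null_es <- replicate(n_perm, {
    shuffled <- setNames(stats, sample(names(stats)))
    enrichment_score(shuffled, gene_set)
  })
  if (observed >= 0) {
    p <- (sum(null_es >= observed) + 1) / (sum(null_es >= 0) + 1)
  } else {
    p <- (sum(null_es <= observed) + 1) / (sum(null_es < 0) + 1)
  }
  list(ES = observed, pvalue = p)
}

# 3. Multiple testing: adjust the p-values across all tested gene sets.
gene_sets <- list(top   = c("g1", "g2", "g3", "g4"),
                  mixed = c("g2", "g9", "g15", "g19"))
res   <- lapply(gene_sets, function(s) permutation_pvalue(stats, s, n_perm = 200))
pvals <- vapply(res, function(r) r$pvalue, numeric(1))
padj  <- p.adjust(pvals, method = "BH")
```

With only a few hundred permutations the smallest attainable p-value is coarse, which is why a low permutation count is suitable for quick demonstrations only.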

We implemented the GSEA algorithm proposed by Subramanian et al.¹ Alexey Sergushichev implemented an algorithm for fast GSEA in the fgsea² package.

In DOSE³, users can choose between the GSEA algorithm implemented in DOSE and the one in fgsea by specifying the parameter by="DOSE" or by="fgsea". By default, DOSE uses fgsea since it is much faster.
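
As a hedged illustration of the `by` parameter, the sketch below runs the same analysis with each back end. It assumes the DOSE package and its bundled `geneList` data set are installed; the variable names `backends` and `results` are introduced only for this example, and the block does nothing when DOSE is unavailable.

```r
# The two back ends named in the text.
backends <- c("fgsea", "DOSE")

# Guarded so the block is a no-op when DOSE is not installed.
if (requireNamespace("DOSE", quietly = TRUE)) {
  library(DOSE)
  data(geneList)
  # Run gseDO once per back end, keeping all other settings identical.
  results <- lapply(backends, function(b) {
    gseDO(geneList,
          minGSSize    = 120,
          pvalueCutoff = 0.2,
          by           = b,
          verbose      = FALSE)
  })
  names(results) <- backends
  # Inspect the top rows from each back end side by side.
  print(head(results[["fgsea"]], 3))
  print(head(results[["DOSE"]], 3))
}
```

Because both back ends rely on random permutations or sampling, p-values are expected to differ slightly between them and between runs.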

1.1 Leading edge analysis and core enriched genes

Leading edge analysis reports three statistics: Tags, the percentage of genes contributing to the enrichment score; List, the position in the ranked list at which the enrichment score is attained; and Signal, the enrichment signal strength.

It is also useful to obtain the core enriched genes, i.e. the genes that contribute to the enrichment.

DOSE supports leading edge analysis and reports the core enriched genes in GSEA analysis.
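
To make the three statistics concrete, here is a small base R sketch on a toy ranked list. It is not the DOSE implementation: the function name `leading_edge_stats`, the toy data, and the signal formula `tags * (1 - list) * N / (N - Nh)` (taken here from the usual GSEA definition, with N genes in the list and Nh genes in the set) are assumptions of this example.

```r
leading_edge_stats <- function(stats, gene_set, exponent = 1) {
  hit   <- names(stats) %in% gene_set
  n     <- length(stats)
  n_hit <- sum(hit)

  # Running enrichment score, as in the weighted running-sum statistic.
  up   <- ifelse(hit, abs(stats)^exponent, 0)
  up   <- up / sum(up)
  down <- ifelse(hit, 0, 1 / (n - n_hit))
  running <- cumsum(up - down)
  peak <- which.max(abs(running))

  # Leading edge: positions up to the peak for a positive ES,
  # positions from the trough onwards for a negative ES.
  edge <- if (running[peak] >= 0) seq_len(peak) else seq(peak, n)

  core      <- names(stats)[edge][hit[edge]]   # core enriched genes
  tags      <- length(core) / n_hit            # fraction of set members in the edge
  list_frac <- length(edge) / n                # fraction of the ranked list covered
  signal    <- tags * (1 - list_frac) * n / (n - n_hit)

  list(core = core, tags = tags, list = list_frac, signal = signal)
}

# Ten hypothetical genes with decreasing statistics; four of them form the set.
stats <- setNames(c(5, 4, 3, 2, 1, -1, -2, -3, -4, -5), paste0("g", 1:10))
le <- leading_edge_stats(stats, c("g1", "g2", "g4", "g9"))
le$core
```

In this toy case the running score peaks after the second gene, so only the set members ranked at or before that position are reported as core enriched genes.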

1.2 `gseDO` function

To speed up the compilation of this document, the following example tests only gene sets containing at least 120 genes and performs only 100 permutations.

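library(DOSE)
# geneList: named numeric vector of gene statistics, sorted in decreasing order
data(geneList)
y <- gseDO(geneList,
           nPerm         = 100,    # number of permutations (kept small for speed)
           minGSSize     = 120,    # skip gene sets with fewer than 120 genes
           pvalueCutoff  = 0.2,
           pAdjustMethod = "BH",   # Benjamini-Hochberg adjustment
           verbose       = FALSE)
head(y, 3)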

##                  ID            Description setSize enrichmentScore
## DOID:374   DOID:374      nutrition disease     313      -0.3421127
## DOID:1492 DOID:1492 eye and adnexa disease     459      -0.3105160
## DOID:5614 DOID:5614            eye disease     450      -0.3125247
##                 NES     pvalue   p.adjust    qvalues rank
## DOID:374  -1.467844 0.01234568 0.09334126 0.05069153 1464
## DOID:1492 -1.381221 0.01250000 0.09334126 0.05069153 1793
## DOID:5614 -1.385393 0.01250000 0.09334126 0.05069153 1768
##                             leading_edge
## DOID:374  tags=22%, list=12%, signal=20%
## DOID:1492 tags=22%, list=14%, signal=19%
## DOID:5614 tags=22%, list=14%, signal=19%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      core_enrichment
## DOID:374                                                                                                                                                                 2169/1490/7840/4887/4314/595/4018/6403/590/3087/866/66036/5919/5176/3953/164656/5950/2638/2166/5243/5468/5108/10560/4023/3485/7350/3952/1149/585/1513/3489/79068/4671/477/4313/3625/9369/6720/7494/2099/3480/3991/23446/6678/4915/5167/8228/165/2152/185/367/4982/3667/4128/9607/3572/150/563/1489/3479/9370/9122/5105/2167/5346/79689/5241
## DOID:1492 3371/3082/5914/2878/4153/3791/23247/1543/80184/6750/1958/2098/7450/596/9187/2034/482/948/1490/1280/3931/5737/4314/4881/2261/3426/187/629/6403/7042/6785/7507/2934/5176/4060/1277/7078/5950/2057/727/10516/4311/2247/1295/358/10203/2192/582/10218/57125/3485/585/1675/6310/2202/4313/2944/4254/3075/1501/2099/3480/4653/1195/6387/3305/1471/857/4016/1909/4053/6678/1296/7033/4915/55812/1191/5654/10631/2152/2697/7043/2952/6935/2200/3572/7177/7031/3479/2006/10451/9370/771/3117/125/652/4693/5346/1524
## DOID:5614           3082/5914/2878/4153/3791/23247/1543/80184/6750/1958/2098/7450/596/9187/2034/482/948/1490/1280/3931/5737/4314/4881/2261/3426/187/629/6403/7042/6785/7507/2934/5176/4060/1277/7078/5950/2057/727/10516/4311/2247/1295/358/10203/2192/582/10218/57125/3485/585/1675/6310/2202/4313/2944/4254/3075/1501/2099/3480/4653/6387/3305/1471/857/4016/1909/4053/6678/1296/7033/4915/55812/1191/5654/10631/2152/2697/7043/2952/6935/2200/3572/7177/7031/3479/2006/10451/9370/771/3117/125/652/4693/5346/1524

1.3 `gseNCG` function

ncg <- gseNCG(geneList,
              nPerm         = 100, 
              minGSSize     = 120,
              pvalueCutoff  = 0.2, 
              pAdjustMethod = "BH",
              verbose       = FALSE)
# Convert Entrez gene IDs in the result to gene symbols
ncg <- setReadable(ncg, 'org.Hs.eg.db')
head(ncg, 3)

##                ID Description setSize enrichmentScore       NES     pvalue
## lung         lung        lung     173      -0.3880662 -1.593916 0.01369863
## breast     breast      breast     133      -0.4869070 -1.902163 0.01492537
## lymphoma lymphoma    lymphoma     188       0.2999589  1.369619 0.02857143
##            p.adjust    qvalues rank                   leading_edge
## lung     0.04477612 0.02356638 2775 tags=31%, list=22%, signal=25%
## breast   0.04477612 0.02356638 2930 tags=33%, list=23%, signal=26%
## lymphoma 0.05714286 0.03007519 2087 tags=21%, list=17%, signal=18%
##                                                                                                                                                                                                         core_enrichment
## lung     SETD2/ATXN3L/LRP1B/BRD3/ARID1A/INHBA/RB1/ADCY1/LYRM9/NF1/CTNNB1/TP53/SATB2/STK11/CTIF/CTNNA3/KDR/COL11A1/FLT3/APC/ADGRL3/FGFR3/NCAM2/DIP2C/APLNR/SLIT2/EPHA3/RUNX1T1/ZMYND10/ZFHX4/GLI3/TNN/PLSCR4/DACH1/ERBB4
## breast                                                                                   KMT2A/ERBB3/SETD2/ARID1A/GPS2/NCOR1/RB1/MAP2K4/NF1/TP53/PIK3R1/STK11/CDKN1B/PTGFR/APC/CCND1/TRAF5/MAP3K1/ESR1/TBX3/FOXA1/GATA3
## lymphoma                                        DUSP2/EZH2/PRDM1/MYC/ZWILCH/IKZF3/PLCG2/IDH2/HIST1H1C/MAGEC3/CD79B/ETV6/HIST1H1E/HIST1H1B/IRF8/CD28/SLC29A2/DUSP9/TNFAIP3/DNMT3A/SYK/TNF/BCR/HIST1H1D/DSC3/UBE2A/PABPC1
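
The `leading_edge` and `core_enrichment` columns are stored as plain strings, so further processing usually starts by splitting them. The base R sketch below does this for values copied from the "breast" row above; the variable names are chosen for this example only.

```r
# Values copied verbatim from the "breast" row of the output above.
core_enrichment <- "KMT2A/ERBB3/SETD2/ARID1A/GPS2/NCOR1/RB1/MAP2K4/NF1/TP53/PIK3R1/STK11/CDKN1B/PTGFR/APC/CCND1/TRAF5/MAP3K1/ESR1/TBX3/FOXA1/GATA3"
leading_edge    <- "tags=33%, list=23%, signal=26%"

# Core enriched genes: split the slash-separated string into a character vector.
core_genes <- strsplit(core_enrichment, "/", fixed = TRUE)[[1]]

# Leading edge statistics: split into key=value pairs, then strip the percent
# sign and convert each value to a fraction between 0 and 1.
parts <- strsplit(leading_edge, ", ", fixed = TRUE)[[1]]
kv    <- strsplit(parts, "=", fixed = TRUE)
edge  <- setNames(
  as.numeric(sub("%", "", vapply(kv, `[`, character(1), 2), fixed = TRUE)) / 100,
  vapply(kv, `[`, character(1), 1)
)

core_genes
edge
```

The resulting `core_genes` vector can be passed to any downstream step that expects gene symbols, and `edge` is a named numeric vector with the entries `tags`, `list` and `signal`.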

1.4 `gseDGN` function

dgn <- gseDGN(geneList,
              nPerm         = 100, 
              minGSSize     = 120,
              pvalueCutoff  = 0.2, 
              pAdjustMethod = "BH",
              verbose       = FALSE)
# Convert Entrez gene IDs in the result to gene symbols
dgn <- setReadable(dgn, 'org.Hs.eg.db')
head(dgn, 3)

##                          ID      Description setSize enrichmentScore
## umls:C0029456 umls:C0029456     Osteoporosis     375      -0.3439046
## umls:C0004936 umls:C0004936 Mental disorders     348      -0.3087838
## umls:C0032914 umls:C0032914    Pre-Eclampsia     334      -0.3066475
##                     NES     pvalue p.adjust   qvalues rank
## umls:C0029456 -1.482463 0.01190476 0.146771 0.1040272 1766
## umls:C0004936 -1.331668 0.01234568 0.146771 0.1040272 2007
## umls:C0032914 -1.321802 0.01234568 0.146771 0.1040272 1909
##                                 leading_edge
## umls:C0029456 tags=23%, list=14%, signal=20%
## umls:C0004936 tags=20%, list=16%, signal=17%
## umls:C0032914 tags=29%, list=15%, signal=25%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           core_enrichment
## umls:C0029456                                          HGF/PTH1R/CYP1A1/JAG1/ROR2/FLT3/CUL9/EEF1A2/THSD4/BCL2/ITGAV/WIF1/GREM2/COL15A1/HPGDS/VGLL3/SLIT3/NRIP1/TMEM135/MGP/PLCL1/OSBPL1A/PIBF1/SELP/SPRY1/MMP13/ID4/SPP2/COL1A2/AOX1/ARHGEF3/GSN/TSC22D3/ATP1B1/NR5A2/ANKH/COL1A1/LEPR/THSD7A/GC/FGF2/PPARG/NOX4/ZNF266/GHRH/BHLHE40/SLC19A2/THBD/FLNB/KL/LEP/HSD17B4/CTSK/FTO/MMP2/ESR1/IGF1R/PTN/IRAK3/HSPA1L/CST3/GHR/SPARC/KDM4B/LRP1/INPP4B/BMPR1B/PTHLH/DPT/FRZB/GSTT1/AR/TNFRSF11B/IRS1/WLS/GSTM3/TGFBR3/TPH1/IGF1/SFRP4/CORIN/BMP4/CHAD/FOXA1/PGR
## umls:C0004936                                                                                                                                                 BDNF/BCL9/NR3C1/PCLO/DEAF1/FZD4/CREBBP/ITIH3/GLI1/GAD1/GRM5/CUL3/NUCB2/CHRM2/WFS1/PROC/HTR1A/PRKG1/HPGDS/DBH/MMP3/APC/FZD10/OXTR/NDN/PTPRN2/TPPP/SGCE/NGF/LEPR/ABCB1/CRHBP/SHH/MAGEL2/GLT8D1/LEP/FTO/PER2/ATXN1/CADPS2/NRXN3/CTNND2/XBP1/TCF4/ESR1/NTF3/HSPG2/CACNA1C/GABBR1/MAGI1/NR3C2/NTRK2/ANOS1/APOD/TAC1/ZBTB20/GRIA2/UCN/MAOA/CARTPT/TPH1/SLC1A1/CACNA1D/MAOB/ADH1B/NLGN4X/ERBB4/GRP
## umls:C0032914 PLAC1/PSG5/ERCC2/ADD1/ACTG2/PECAM1/PGF/VEGFC/DDAH2/F7/PDE5A/ADAM12/CAPN10/LTF/SOD3/COL4A6/TEK/IL5/PRCP/HPGD/SCNN1A/MBL2/CYP1A1/IL1R1/INSR/PROC/HP/VWF/HDC/EFNA1/FABP2/MMP3/NPR1/OXTR/LPA/EDIL3/MGP/APLNR/PYGM/SELP/FGF1/GJA4/FGF14/MMP13/SLC22A5/COL1A2/ANG/COL1A1/LEPR/PROS1/FGF2/PPARG/CRHBP/SYNPO/COL3A1/LPL/THBD/MMP10/COL5A2/LEP/PTGER3/MMP2/PDGFC/GSTM1/CFH/NOV/ESR1/IGF1R/TPBG/HSPA1L/HSPG2/VCAN/COL5A1/SPARC/NR3C2/CLU/ENPP1/F13A1/HTRA1/F3/AGTR1/GSTT1/PLAT/AR/IRS1/IL6ST/COL4A5/THBS4/IGF1/ELN/ADIPOQ/CORIN/HLA-DQA1/FABP4/CX3CR1

Disease Gene Set Enrichment Analysis

Guangchuang Yu (guangchuangyu@gmail.com)
School of Public Health, The University of Hong Kong

2017-01-19

Contents

1 GSEA algorithm

1.1 Leading edge analysis and core enriched genes

1.2 `gseDO` function

1.3 `gseNCG` function

1.4 `gseDGN` function

2 Visualization

2.1 cnetplot

2.2 enrichMap

2.3 gseaplot

References
