1 Abstract

Disease Ontology (DO) aims to provide an open source ontology for the integration of biomedical data associated with human disease. We developed the DOSE package to promote the investigation of diseases. DOSE provides five methods (Resnik, Lin, Jiang, Rel and Wang) for measuring semantic similarities among DO terms and gene products. A hypergeometric model and gene set enrichment analysis are also implemented for extracting disease association insight from genome-wide expression profiles.

2 Citation

If you use DOSE in published research, please cite G. Yu (2015). In addition, please cite G. Yu (2012) when using compareCluster in clusterProfiler, G. Yu (2015) when applying enrichment analysis to NGS data using ChIPseeker, and G. Yu (2010) when using GOSemSim for GO semantic similarity analysis.

G Yu, LG Wang, GR Yan, QY He.
DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis
Bioinformatics 2015, 31(4):608-609.

URL: http://dx.doi.org/10.1093/bioinformatics/btu684

G Yu, F Li, Y Qin, X Bo, Y Wu, S Wang. 
GOSemSim: an R package for measuring semantic similarity among GO terms and gene products.
Bioinformatics 2010, 26(7):976-978.

URL: http://dx.doi.org/10.1093/bioinformatics/btq064

G Yu, LG Wang, Y Han, QY He.
clusterProfiler: an R package for comparing biological themes among gene clusters.
OMICS: A Journal of Integrative Biology 2012, 16(5):284-287.

URL: http://dx.doi.org/10.1089/omi.2011.0118

G Yu, LG Wang, QY He.
ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization.
Bioinformatics 2015, 31(14):2382-2383.

URL: http://dx.doi.org/10.1093/bioinformatics/btv145

3 Introduction

Public health is an important driving force behind biological and medical research. A major challenge of the post-genomic era is bridging the gap between fundamental biological research and its clinical applications. Recent research has increasingly demonstrated that many seemingly dissimilar diseases share common molecular mechanisms. Understanding similarities among diseases aids early diagnosis and new drug development.

A formal knowledge representation of gene-disease associations is needed for this purpose. Ontologies, such as Gene Ontology, have been successfully applied to represent biological knowledge, and many related techniques have been adopted to extract information. Disease Ontology (DO) [1] was developed to create a consistent description of gene products from a disease perspective, and is essential for supporting functional genomics in a disease context. Accurate disease descriptions can help discover new relationships between genes and disease, and new functions for previously uncharacterized genes and alleles.

Unlike other clinical vocabularies that define disease-related concepts disparately, DO is organized as a directed acyclic graph, laying the foundation for quantitative computation of disease knowledge. The application of DO is still in its infancy, and programs for mining DO knowledge automatically are lacking.

Here, we present an R package, DOSE [2], for analyzing semantic similarities among DO terms and gene products annotated with DO terms, and for extracting disease association insight from genome-wide expression profiles.

Four information content (IC)-based methods and one graph structure-based method are implemented for measuring semantic similarity. A hypergeometric test and Gene Set Enrichment Analysis are implemented for extracting biological insight.

To get started with the DOSE package, type the following code:

library(DOSE)
help(DOSE)

4 DO term semantic similarity measurement

Four methods that determine the semantic similarity of two terms based on the Information Content of their common ancestor term were proposed by Resnik [3], Jiang [4], Lin [5] and Schlicker [6]. Wang [7] presented a method that measures similarity based on the graph structure. Each of these methods has its own advantages and weaknesses. DOSE implements all of these methods to compute semantic similarity among DO terms and gene products. We have developed another package, GOSemSim [8], to explore functional similarity from the GO perspective, covering molecular function (MF), biological process (BP) and cellular component (CC).

For algorithm details, please refer to the vignette of GOSemSim.
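As a rough illustration of the IC-based idea, the sketch below computes the textbook Resnik and Lin similarities on a tiny made-up ontology using base R only. The term names, the parent table and the annotation probabilities are all hypothetical, and the code is not the DOSE or GOSemSim implementation (which may, for example, normalize scores differently).

```r
# Toy ontology: each term maps to its direct parents
# (hypothetical terms, not real DOIDs)
parents <- list(
  root = character(0),
  A    = "root",
  B    = "root",
  A1   = "A",
  A2   = "A",
  AB   = c("A", "B")
)

# All ancestors of a term, including the term itself
ancestors <- function(term) {
  out <- term
  for (p in parents[[term]]) out <- union(out, ancestors(p))
  out
}

# Hypothetical annotation probabilities p(t); IC(t) = -log(p(t))
p  <- c(root = 1, A = 0.5, B = 0.4, A1 = 0.2, A2 = 0.1, AB = 0.1)
ic <- -log(p)

# IC of the most informative common ancestor (MICA)
mica_ic <- function(t1, t2) {
  common <- intersect(ancestors(t1), ancestors(t2))
  max(ic[common])
}

# Textbook definitions: Resnik = IC(MICA); Lin rescales by the terms' own IC
resnik <- function(t1, t2) mica_ic(t1, t2)
lin    <- function(t1, t2) 2 * mica_ic(t1, t2) / (ic[[t1]] + ic[[t2]])

resnik("A1", "A2")  # MICA is A
lin("A1", "A2")
resnik("A1", "B")   # only the root is shared
```

Terms that share only the root get a Resnik score of zero, because the root carries no information.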

4.1 doSim function

In DOSE, we implemented doSim for calculating semantic similarity between two DO terms or between two sets of DO terms.

data(DO2EG)
set.seed(123)
a <- sample(names(DO2EG), 10)
a
##  [1] "DOID:1407"  "DOID:5844"  "DOID:2034"  "DOID:8432"  "DOID:9146" 
##  [6] "DOID:10584" "DOID:3209"  "DOID:848"   "DOID:3341"  "DOID:2512"
b <- sample(names(DO2EG), 5)
b
## [1] "DOID:9409"  "DOID:2481"  "DOID:4465"  "DOID:3498"  "DOID:11252"
doSim(a[1], b[1], measure="Wang")
## [1] 0.1132852
doSim(a[1], b[1], measure="Resnik")
## [1] 0.0809122
doSim(a[1], b[1], measure="Lin")
## [1] 0.09407429
s <- doSim(a, b, measure="Wang")
s
##             DOID:9409  DOID:2481  DOID:4465  DOID:3498 DOID:11252
## DOID:1407  0.11328518 0.08595600 0.01522238 0.01522238 0.08190022
## DOID:5844  0.14897652 0.07834240 0.02801328 0.02801328 0.11564838
## DOID:2034  0.17347273 0.47553108 0.03676194 0.03676194 0.13877811
## DOID:8432  0.17347273 0.09998997 0.03676194 0.03676194 0.42199873
## DOID:9146  0.07142995 0.04117234 0.03676194 0.03676194 0.05714393
## DOID:10584 0.12108370 0.09869482 0.01802164 0.01802164 0.08927900
## DOID:3209  0.14897652 0.07834240 0.02801328 0.02801328 0.11564838
## DOID:848   0.14897652 0.07834240 0.02801328 0.02801328 0.11564838
## DOID:3341  0.13240905 0.06367153 0.02208240 0.02208240 0.09998997
## DOID:2512  0.07142995 0.04117234 0.03676194 0.03676194 0.05714393

The doSim function requires three parameters: DOID1, DOID2 and measure. DOID1 and DOID2 should be vectors of DO terms, while measure should be one of Resnik, Jiang, Lin, Rel and Wang.
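When two vectors are supplied, the result is a matrix with one row per term in DOID1 and one column per term in DOID2. The base R sketch below shows one way to pull out the most similar pair from such a matrix. The values are copied from three rows and three columns of the Wang similarity output shown above; the variable names are illustrative only.

```r
# Subset of the Wang similarity matrix printed above
s_sub <- matrix(c(0.17347273, 0.47553108, 0.13877811,
                  0.17347273, 0.09998997, 0.42199873,
                  0.07142995, 0.04117234, 0.05714393),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("DOID:2034", "DOID:8432", "DOID:9146"),
                                c("DOID:9409", "DOID:2481", "DOID:11252")))

# Row and column position of the largest similarity value
idx  <- which(s_sub == max(s_sub), arr.ind = TRUE)
best <- c(rownames(s_sub)[idx[1, "row"]], colnames(s_sub)[idx[1, "col"]])
best
```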

We also implemented a plot function, simplot, to visualize the similarity result.

simplot(s, 
        color.low="white", color.high="red", 
        labs=TRUE, digits=2, labs.size=5, 
        font.size=14, xlab="", ylab="")

The parameters color.low and color.high set the color gradient; labs is a logical parameter indicating whether to show the similarity values, digits sets the number of decimal places to display, and labs.size controls the font size of the similarity values; font.size sets the font size of the axes and labels of the coordinate system.

5 Gene semantic similarity measurement

On the basis of semantic similarity between DO terms, DOSE can also compute semantic similarity among gene products. DOSE provides four methods, max, avg, rcmax and BMA, to combine the semantic similarity scores of multiple DO terms. Similarities among genes and gene clusters annotated with multiple DO terms are calculated using these combine methods. For calculation details, please refer to the vignette of GOSemSim.
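The four combine strategies can be sketched in base R on a small term-by-term similarity matrix. The definitions used here follow our reading of the GOSemSim vignette (rcmax as the larger of the mean row maximum and mean column maximum, BMA as the average of all row and column maxima); the matrix values and the function name combine_scores are made up for illustration, and the packaged implementation may differ in details such as rounding or handling of missing values.

```r
# Hypothetical similarities: rows = DO terms of gene 1, cols = DO terms of gene 2
m <- matrix(c(0.2, 0.8, 0.5,
              0.9, 0.1, 0.4),
            nrow = 2, byrow = TRUE)

combine_scores <- function(m, combine = c("max", "avg", "rcmax", "BMA")) {
  combine <- match.arg(combine)
  row_max <- apply(m, 1, max)  # best match for each term of gene 1
  col_max <- apply(m, 2, max)  # best match for each term of gene 2
  switch(combine,
         max   = max(m),
         avg   = mean(m),
         rcmax = max(mean(row_max), mean(col_max)),
         BMA   = sum(row_max, col_max) / (nrow(m) + ncol(m)))
}

sapply(c("max", "avg", "rcmax", "BMA"), function(k) combine_scores(m, k))
```

max and avg use every pairwise score, whereas rcmax and BMA look only at each term's best match, which makes them less sensitive to the many unrelated term pairs between two genes.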

5.1 geneSim function

In DOSE, we implemented geneSim to measure semantic similarities among genes.

data(EG2DO)
g1 <- sample(names(EG2DO), 5)
g1
## [1] "84842" "2521"  "10592" "3069"  "91746"
g2 <- sample(names(EG2DO), 4)
g2
## [1] "84289" "6045"  "56999" "9869"
geneSim(g1[1], g2[1], measure="Wang", combine="BMA")
## [1] 0.057
gs <- geneSim(g1, g2, measure="Wang", combine="BMA")
gs
##       84289  6045 56999  9869
## 84842 0.057 0.135 0.355 0.098
## 2521  0.573 0.253 0.511 0.482
## 10592 0.057 0.187 0.296 0.128
## 3069  0.573 0.517 1.000 1.000
## 91746 0.573 0.308 0.527 0.501

The geneSim function requires four parameters: geneID1, geneID2, measure and combine. geneID1 and geneID2 should be vectors of entrez gene IDs; measure should be one of Resnik, Jiang, Lin, Rel and Wang, while combine should be one of max, avg, rcmax and BMA, as described previously.

The simplot function works with the output of both doSim and geneSim.

5.2 clusterSim and mclusterSim

We also implemented clusterSim for calculating semantic similarity between two gene clusters and mclusterSim for calculating semantic similarities among multiple gene clusters.

clusterSim(g1, g2, measure="Wang", combine="BMA")
## [1] 0.501
clusters <- list(a=g1, b=g2, c=sample(names(EG2DO), 6))
mclusterSim(clusters, measure="Wang", combine="BMA")
##       a     b     c
## a 1.000 0.501 0.403
## b 0.501 1.000 0.623
## c 0.403 0.623 1.000

6 DO term enrichment analysis

6.1 Hypergeometric model

The over-representation test [9] is a widely used approach to identify biological themes. Here we implement a hypergeometric model to assess whether the number of selected genes associated with a disease is larger than expected.

To determine whether any terms annotate a specified list of genes at a frequency greater than would be expected by chance, DOSE calculates a p-value using the hypergeometric distribution:

\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{{M \choose i}{{N-M} \choose {n-i}}} {{N \choose n}}\)

In this equation, N is the total number of genes in the background distribution, M is the number of genes within that distribution that are annotated (either directly or indirectly) to the node of interest, n is the size of the list of genes of interest and k is the number of genes within that list which are annotated to the node. By default, the background distribution is all genes that have an annotation. Users can set the background via the universe parameter.
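The formula can be checked numerically with the hypergeometric functions in base R. The sketch below plugs in the counts reported for DOID:162 (cancer) in the enrichDO output shown in this document (GeneRatio 266/331, BgRatio 4259/6274) and evaluates the tail probability in two equivalent ways. It is an independent check under those assumed counts, not a copy of the package internals.

```r
# Counts taken from the DOID:162 row of the enrichDO output in this document
N <- 6274  # genes in the background
M <- 4259  # background genes annotated to the term
n <- 331   # genes of interest
k <- 266   # genes of interest annotated to the term

# Upper tail P(X >= k) from the built-in distribution function
p_tail <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same quantity written as in the formula: 1 - sum_{i=0}^{k-1} P(X = i).
# dhyper() is used instead of raw choose() calls, which overflow to Inf
# for counts of this size.
p_sum <- 1 - sum(dhyper(0:(k - 1), M, N - M, n))

c(p_tail = p_tail, p_sum = p_sum)
```

Passing k - 1 with lower.tail = FALSE is what makes the tail include k itself; passing k would give P(X > k) and understate the p-value.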

P-values were adjusted for multiple comparison, and q-values were also calculated for FDR control.
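As a small illustration of the adjustment step, the base R sketch below applies the Benjamini & Hochberg correction with p.adjust and reproduces it by hand. The p-values are made up, and q-value estimation is not covered here.

```r
# Hypothetical raw p-values for five tested terms
pvals <- c(0.001, 0.01, 0.02, 0.04, 0.2)

adj_bh   <- p.adjust(pvals, method = "BH")
adj_bonf <- p.adjust(pvals, method = "bonferroni")

# Manual BH: scale each p-value by m / rank, then enforce monotonicity
# by taking running minima from the largest p-value downwards
m <- length(pvals)
o <- order(pvals, decreasing = TRUE)
manual_bh <- numeric(m)
manual_bh[o] <- pmin(1, cummin(pvals[o] * m / (m:1)))

data.frame(pvals, adj_bh, manual_bh, adj_bonf)
```

Bonferroni multiplies every p-value by the number of tests, so its adjusted values are never smaller than the BH ones.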

6.2 enrichDO function

DOSE provides an example dataset, geneList, which was derived from the R package breastCancerMAINZ. That package contains 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of the geometric means of grade III samples to the geometric means of grade I samples. The base 2 logarithms of these ratios are stored in the geneList dataset.

In the following example, we select genes whose absolute log2 fold change exceeds 1.5 as the differential genes and analyze their disease associations.
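To make the construction concrete, the sketch below builds a vector of the same shape from a tiny made-up expression matrix: a named numeric vector of log2 ratios of geometric means, with entrez gene IDs as names. The expression values, gene IDs, sample names and the decreasing sort are assumptions for illustration; the actual preprocessing of breastCancerMAINZ is not shown in this document.

```r
# Hypothetical expression values (rows = entrez gene IDs, columns = samples)
expr <- matrix(c( 8, 16, 64, 128,
                 10, 10, 10,  10,
                 32, 32,  4,   4),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("1001", "1002", "1003"),
                               c("gI_1", "gI_2", "gIII_1", "gIII_2")))
grade <- c("I", "I", "III", "III")

# Geometric mean of a vector of positive values
geo_mean <- function(x) exp(mean(log(x)))

gm_grade1 <- apply(expr[, grade == "I"],   1, geo_mean)
gm_grade3 <- apply(expr[, grade == "III"], 1, geo_mean)

# log2 ratio of grade III to grade I, sorted in decreasing order
toyGeneList <- sort(log2(gm_grade3 / gm_grade1), decreasing = TRUE)

# Genes whose absolute log2 ratio exceeds a threshold
selected <- names(toyGeneList)[abs(toyGeneList) > 1.5]
selected
```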

data(geneList)
gene <- names(geneList)[abs(geneList) > 1.5]
head(gene)
## [1] "4312"  "8318"  "10874" "55143" "55388" "991"
x <- enrichDO(gene          = gene,
              ont           = "DO", 
              pvalueCutoff  = 0.05,
              pAdjustMethod = "BH",
              universe      = names(geneList), 
              minGSSize     = 5,
              qvalueCutoff  = 0.05,
              readable      = FALSE)
head(summary(x))
##                        ID                       Description GeneRatio
## DOID:162         DOID:162                            cancer   266/331
## DOID:14566     DOID:14566 disease of cellular proliferation   267/331
## DOID:0050686 DOID:0050686               organ system cancer   187/331
## DOID:2994       DOID:2994                  germ cell cancer    47/331
## DOID:193         DOID:193         reproductive organ cancer    61/331
## DOID:10283     DOID:10283                   prostate cancer    40/331
##                BgRatio       pvalue     p.adjust       qvalue
## DOID:162     4259/6274 1.409970e-07 0.0001230903 9.483901e-05
## DOID:14566   4307/6274 3.266965e-07 0.0001426030 1.098732e-04
## DOID:0050686 2756/6274 1.610655e-06 0.0004687006 3.611258e-04
## DOID:2994     483/6274 2.269485e-05 0.0045378581 3.496342e-03
## DOID:193      691/6274 2.599002e-05 0.0045378581 3.496342e-03
## DOID:10283    394/6274 3.777164e-05 0.0054957741 4.234400e-03
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 geneID
## DOID:162          4312/8318/10874/55143/991/6280/2305/1062/4605/9833/9133/6279/10403/597/7153/6278/79733/259266/1381/3627/27074/6241/55165/9787/7368/11065/55355/55872/22974/10563/4751/6373/8140/820/10635/1844/4283/27338/890/9415/983/10232/4085/6362/9837/5080/7850/81930/5918/81620/332/3832/6286/5163/2146/3002/7272/2568/366/9212/8208/1111/9055/3833/4321/10112/3902/3620/3887/51514/6790/4521/891/8544/10578/4174/9232/10855/29968/3695/4001/4171/2921/7941/1033/6364/23532/9928/1164/3161/3595/11004/993/8061/990/5347/29851/29127/4102/55215/2173/4318/4105/5004/701/2633/55723/4069/9156/79682/3576/1978/1515/4821/79852/8836/6890/875/10321/3159/53335/1894/79623/7980/8438/9636/9700/5888/7083/333/898/23594/56649/952/8792/3294/1493/6352/25907/9768/6663/51378/8842/4998/11339/3070/4288/2151/2331/4485/23541/3551/3730/2487/4886/347/10788/2152/1264/214/247/2697/3131/9590/63923/185/4330/7043/3357/6863/2205/5104/9828/1036/2953/2952/3913/6935/9737/27123/6297/3487/5327/367/4982/23261/1473/3667/51466/2200/1634/5157/131578/4128/4059/2947/4056/3572/7177/1287/4582/23371/7049/563/3679/4117/2053/27324/7503/7060/7122/7031/6926/3479/9723/9254/4680/6424/2006/57535/10451/9338/1846/10266/80310/9370/1602/3708/23090/9122/4629/771/3117/80129/125/4857/5174/2018/2532/4250/23362/2167/652/4036/4137/8839/2066/3169/1408/9547/2922/11283/64499/1524/1580/10647/5304/8614/2625/7021/9/5241/10551/10974/57758/4969
## DOID:14566   4312/8318/10874/55143/991/6280/2305/1062/4605/9833/9133/6279/10403/597/7153/6278/79733/259266/1381/3627/27074/6241/55165/9787/7368/11065/55355/55872/22974/10563/4751/6373/8140/820/10635/1844/4283/27338/890/9415/983/10232/4085/6362/9837/5080/7850/81930/5918/81620/332/3832/6286/5163/2146/3002/7272/2568/366/9212/8208/1111/9055/3833/4321/10112/3902/3620/3887/51514/6790/4521/891/8544/10578/4174/9232/10855/29968/3695/4001/4171/2921/7941/1033/6364/23532/9928/1164/3161/3595/11004/993/8061/990/5347/29851/29127/4102/55215/2173/4318/4105/5004/701/2633/55723/4069/9156/79682/3576/1978/1515/4821/79852/8836/6890/875/10321/3159/53335/1894/79623/7980/8438/9636/9700/5888/7083/333/898/23594/56649/952/8792/3294/1493/6352/25907/9768/6663/51378/8842/4998/11339/3070/4288/2151/2331/4485/23541/3551/3730/2487/4886/347/10788/2152/1264/214/247/2697/3131/9590/63923/185/4330/7043/1811/3357/6863/2205/5104/9828/1036/2953/2952/3913/6935/9737/27123/6297/3487/5327/367/4982/23261/1473/3667/51466/2200/1634/5157/131578/4128/4059/2947/4056/3572/7177/1287/4582/23371/7049/563/3679/4117/2053/27324/7503/7060/7122/7031/6926/3479/9723/9254/4680/6424/2006/57535/10451/9338/1846/10266/80310/9370/1602/3708/23090/9122/4629/771/3117/80129/125/4857/5174/2018/2532/4250/23362/2167/652/4036/4137/8839/2066/3169/1408/9547/2922/11283/64499/1524/1580/10647/5304/8614/2625/7021/9/5241/10551/10974/57758/4969
## DOID:0050686                                                                                                                                                                                                                                                                                                                                                                                                                                               4312/10874/991/6280/2305/1062/4605/9133/6279/597/7153/6278/79733/259266/1381/3627/6241/55165/9787/11065/55872/10563/4751/6373/8140/820/10635/1844/27338/890/983/10232/4085/6362/5080/5918/332/6286/2146/3002/366/9212/1111/4321/10112/3902/3620/3887/6790/4521/891/8544/10578/4174/9232/10855/4001/4171/2921/1033/6364/23532/9928/1164/3161/3595/993/8061/990/5347/4102/2173/4318/4105/5004/701/2633/4069/3576/1978/1515/4821/79852/8836/6890/875/10321/3159/53335/1894/7980/8438/9636/5888/7083/898/952/8792/3294/1493/6352/9768/51378/8842/4288/2331/3551/2487/347/10788/2152/1264/214/247/2697/3131/9590/185/4330/7043/6863/2205/5104/1036/2952/6935/27123/6297/3487/5327/367/4982/23261/3667/2200/1634/4128/4059/2947/4056/3572/7177/1287/4582/23371/7049/563/3679/4117/7122/7031/3479/4680/6424/57535/10451/1846/10266/80310/9370/23090/4629/771/3117/125/5174/2532/4250/2167/652/4036/4137/8839/2066/3169/9547/2922/64499/1524/10647/5304/8614/2625/7021/9/5241/10551
## DOID:2994                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      4312/10403/259266/6241/4751/4085/332/366/3620/891/10855/4171/6364/9928/5347/4102/4318/5004/9156/3576/4821/8836/875/9636/5888/333/898/8842/4288/2487/2697/3357/2952/367/3667/4059/3572/4582/7503/3479/6424/10451/80310/771/4250/652/2066
## DOID:193                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   4312/6280/6279/597/7153/3627/820/983/10232/6362/332/6286/2146/9212/4321/6790/4521/891/4174/4171/5347/4102/4318/701/3576/1978/79852/10321/7083/898/1493/6352/8842/4288/3551/2152/214/247/2952/3487/367/3667/4128/3572/4582/563/3679/4117/7122/7031/3479/4680/6424/57535/10451/80310/3117/652/4036/5241/10551
## DOID:10283                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        4312/6280/6279/597/3627/332/6286/2146/4321/4521/891/5347/4102/4318/701/3576/79852/10321/6352/4288/3551/2152/247/2952/3487/367/3667/4128/4582/563/3679/4117/7031/3479/6424/10451/80310/652/4036/10551
##              Count
## DOID:162       266
## DOID:14566     267
## DOID:0050686   187
## DOID:2994       47
## DOID:193        61
## DOID:10283      40
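The GeneRatio and BgRatio columns in the output above are printed as "count/total" strings. A minimal base R sketch for turning them into numbers and deriving a fold enrichment is given below. The three rows are copied from the output above, while ratio_to_numeric and the fold column are illustrative names that are not part of DOSE.

```r
# Three rows copied from the enrichDO output above
res <- data.frame(
  ID        = c("DOID:162", "DOID:2994", "DOID:10283"),
  GeneRatio = c("266/331", "47/331", "40/331"),
  BgRatio   = c("4259/6274", "483/6274", "394/6274"),
  stringsAsFactors = FALSE
)

# Convert "a/b" strings to the numeric value a / b
ratio_to_numeric <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}

# Fold enrichment: proportion in the gene list relative to the background
res$fold <- ratio_to_numeric(res$GeneRatio) / ratio_to_numeric(res$BgRatio)
res
```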

The enrichDO function requires a vector of entrezgene IDs as input, typically the list of differentially expressed genes from a gene expression profiling study. If users need to convert other gene ID types to entrezgene IDs, we recommend the bitr function provided by clusterProfiler.

The ont parameter can be "DO" or "DOLite". DOLite [10] was constructed to aggregate redundant DO terms. Because the DOLite data is no longer updated, we recommend using ont="DO". pvalueCutoff sets the cutoff for both the p-value and the adjusted p-value; pAdjustMethod sets the p-value correction method, one of Bonferroni ("bonferroni"), Holm ("holm"), Hochberg ("hochberg"), Hommel ("hommel"), Benjamini & Hochberg ("BH") and Benjamini & Yekutieli ("BY"); qvalueCutoff sets the cutoff for q-values.

The universe parameter sets the background gene universe for testing. If users do not set this parameter explicitly, enrichDO sets the universe to all human genes that have a DO annotation.

The minGSSize parameter indicates that only those DO terms that have more than minGSSize genes annotated will be tested.

The readable parameter is a logical value indicating whether the entrezgene IDs will be mapped to gene symbols.

We also implemented the setReadable function, which helps users convert entrezgene IDs to gene symbols.

x <- setReadable(x)
head(summary(x))
##                        ID                       Description GeneRatio
## DOID:162         DOID:162                            cancer   266/331
## DOID:14566     DOID:14566 disease of cellular proliferation   267/331
## DOID:0050686 DOID:0050686               organ system cancer   187/331
## DOID:2994       DOID:2994                  germ cell cancer    47/331
## DOID:193         DOID:193         reproductive organ cancer    61/331
## DOID:10283     DOID:10283                   prostate cancer    40/331
##                BgRatio       pvalue     p.adjust       qvalue
## DOID:162     4259/6274 1.409970e-07 0.0001230903 9.483901e-05
## DOID:14566   4307/6274 3.266965e-07 0.0001426030 1.098732e-04
## DOID:0050686 2756/6274 1.610655e-06 0.0004687006 3.611258e-04
## DOID:2994     483/6274 2.269485e-05 0.0045378581 3.496342e-03
## DOID:193      691/6274 2.599002e-05 0.0045378581 3.496342e-03
## DOID:10283    394/6274 3.777164e-05 0.0054957741 4.234400e-03
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           geneID
## DOID:162             MMP1/CDC45/NMU/CDCA8/CDC20/S100A9/FOXM1/CENPE/MYBL2/MELK/CCNB2/S100A8/NDC80/BCL2A1/TOP2A/S100A7/E2F8/ASPM/CRABP1/CXCL10/LAMP3/RRM2/CEP55/DLGAP5/UGT8/UBE2C/HJURP/PBK/TPX2/CXCL13/NEK2/CXCL11/SLC7A5/CAMP/RAD51AP1/DUSP2/CXCL9/UBE2S/CCNA2/FADS2/CDK1/MSLN/MAD2L1/CCL18/GINS1/PAX6/IL1R2/KIF18A/RARRES1/CDT1/BIRC5/KIF11/S100P/PDK1/EZH2/GZMB/TTK/GABRP/AQP9/AURKB/CHAF1B/CHEK1/PRC1/KIFC1/MMP12/KIF20A/LAG3/IDO1/KRT81/DTL/AURKA/NUDT1/CCNB1/PIR/GNLY/MCM5/PTTG1/HPSE/PSAT1/ITGB7/LMNB1/MCM2/CXCL3/PLA2G7/CDKN3/CCL20/PRAME/KIF14/CKS2/HMMR/IL12RB2/KIF2C/CDC25A/FOSL1/CDC6/PLK1/ICOS/RACGAP1/MAGEA3/FANCI/FABP7/MMP9/MAGEA6/ORM1/BUB1B/GBP1/ASF1B/LYZ/EXO1/CENPU/CXCL8/EIF4EBP1/CTSV/NKX2-2/EPHX3/GGH/TAP1/CBS/CRISP3/HMGA1/BCL11A/ECT2/GALNT14/TFPI2/RAD54L/ISG15/ESPL1/RAD51/TK1/APLP1/CCNE1/ORC6/TMPRSS4/CD38/TNFRSF11A/HSD17B2/CTLA4/CCL5/TMEM158/KIAA0101/SOX10/ANGPT4/PROM1/ORC1/OIP5/HELLS/MKI67/F2RL2/FMOD/MST1/SEC14L2/IKBKB/ANOS1/FRZB/NPY1R/APOD/IQGAP2/F3/CNN1/ALCAM/ALOX15B/GJA1/HLF/AKAP12/TNN/AGTR1/MN1/TGFB3/HTR2B/TAC1/FCER1A/SERPINA5/ARHGEF17/CDO1/GSTT2/GSTT1/LAMB2/ZEB1/GPRASP1/DKK2/SALL2/IGFBP4/PLAT/AR/TNFRSF11B/CAMTA1/CST5/IRS1/EVL/FBN1/DCN/PDGFRL/LRRC15/MAOA/BCAM/GSTM3/LTC4S/IL6ST/TPSAB1/COL4A5/MUC1/TNS2/TGFBR3/AZGP1/ITGA7/MAK/EPHX2/TOX3/XIST/THBS4/CLDN5/TFF1/TBX3/IGF1/SEMA3E/CACNA2D2/CEACAM6/SFRP4/ELN/KIAA1324/VAV3/TCEAL1/DUSP4/RAMP2/PDGFD/ADIPOQ/DACH1/ITPR1/ZNF423/SLC16A4/MYH11/CA12/HLA-DQA1/CCDC170/ADH1B/NOVA1/PDZK1/EMX2/ACKR1/SCGB2A2/PSD3/FABP4/BMP4/LRP2/MAPT/WISP2/ERBB4/FOXA1/CRY2/CXCL14/GRP/CYP4F8/TPSB2/CX3CR1/CYP4B1/SCGB1D2/PIP/STC2/GATA3/TFAP2B/NAT1/PGR/AGR2/ADIRF/SCUBE2/OGN
## DOID:14566   MMP1/CDC45/NMU/CDCA8/CDC20/S100A9/FOXM1/CENPE/MYBL2/MELK/CCNB2/S100A8/NDC80/BCL2A1/TOP2A/S100A7/E2F8/ASPM/CRABP1/CXCL10/LAMP3/RRM2/CEP55/DLGAP5/UGT8/UBE2C/HJURP/PBK/TPX2/CXCL13/NEK2/CXCL11/SLC7A5/CAMP/RAD51AP1/DUSP2/CXCL9/UBE2S/CCNA2/FADS2/CDK1/MSLN/MAD2L1/CCL18/GINS1/PAX6/IL1R2/KIF18A/RARRES1/CDT1/BIRC5/KIF11/S100P/PDK1/EZH2/GZMB/TTK/GABRP/AQP9/AURKB/CHAF1B/CHEK1/PRC1/KIFC1/MMP12/KIF20A/LAG3/IDO1/KRT81/DTL/AURKA/NUDT1/CCNB1/PIR/GNLY/MCM5/PTTG1/HPSE/PSAT1/ITGB7/LMNB1/MCM2/CXCL3/PLA2G7/CDKN3/CCL20/PRAME/KIF14/CKS2/HMMR/IL12RB2/KIF2C/CDC25A/FOSL1/CDC6/PLK1/ICOS/RACGAP1/MAGEA3/FANCI/FABP7/MMP9/MAGEA6/ORM1/BUB1B/GBP1/ASF1B/LYZ/EXO1/CENPU/CXCL8/EIF4EBP1/CTSV/NKX2-2/EPHX3/GGH/TAP1/CBS/CRISP3/HMGA1/BCL11A/ECT2/GALNT14/TFPI2/RAD54L/ISG15/ESPL1/RAD51/TK1/APLP1/CCNE1/ORC6/TMPRSS4/CD38/TNFRSF11A/HSD17B2/CTLA4/CCL5/TMEM158/KIAA0101/SOX10/ANGPT4/PROM1/ORC1/OIP5/HELLS/MKI67/F2RL2/FMOD/MST1/SEC14L2/IKBKB/ANOS1/FRZB/NPY1R/APOD/IQGAP2/F3/CNN1/ALCAM/ALOX15B/GJA1/HLF/AKAP12/TNN/AGTR1/MN1/TGFB3/SLC26A3/HTR2B/TAC1/FCER1A/SERPINA5/ARHGEF17/CDO1/GSTT2/GSTT1/LAMB2/ZEB1/GPRASP1/DKK2/SALL2/IGFBP4/PLAT/AR/TNFRSF11B/CAMTA1/CST5/IRS1/EVL/FBN1/DCN/PDGFRL/LRRC15/MAOA/BCAM/GSTM3/LTC4S/IL6ST/TPSAB1/COL4A5/MUC1/TNS2/TGFBR3/AZGP1/ITGA7/MAK/EPHX2/TOX3/XIST/THBS4/CLDN5/TFF1/TBX3/IGF1/SEMA3E/CACNA2D2/CEACAM6/SFRP4/ELN/KIAA1324/VAV3/TCEAL1/DUSP4/RAMP2/PDGFD/ADIPOQ/DACH1/ITPR1/ZNF423/SLC16A4/MYH11/CA12/HLA-DQA1/CCDC170/ADH1B/NOVA1/PDZK1/EMX2/ACKR1/SCGB2A2/PSD3/FABP4/BMP4/LRP2/MAPT/WISP2/ERBB4/FOXA1/CRY2/CXCL14/GRP/CYP4F8/TPSB2/CX3CR1/CYP4B1/SCGB1D2/PIP/STC2/GATA3/TFAP2B/NAT1/PGR/AGR2/ADIRF/SCUBE2/OGN
## DOID:0050686                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    MMP1/NMU/CDC20/S100A9/FOXM1/CENPE/MYBL2/CCNB2/S100A8/BCL2A1/TOP2A/S100A7/E2F8/ASPM/CRABP1/CXCL10/RRM2/CEP55/DLGAP5/UBE2C/PBK/CXCL13/NEK2/CXCL11/SLC7A5/CAMP/RAD51AP1/DUSP2/UBE2S/CCNA2/CDK1/MSLN/MAD2L1/CCL18/PAX6/RARRES1/BIRC5/S100P/EZH2/GZMB/AQP9/AURKB/CHEK1/MMP12/KIF20A/LAG3/IDO1/KRT81/AURKA/NUDT1/CCNB1/PIR/GNLY/MCM5/PTTG1/HPSE/LMNB1/MCM2/CXCL3/CDKN3/CCL20/PRAME/KIF14/CKS2/HMMR/IL12RB2/CDC25A/FOSL1/CDC6/PLK1/MAGEA3/FABP7/MMP9/MAGEA6/ORM1/BUB1B/GBP1/LYZ/CXCL8/EIF4EBP1/CTSV/NKX2-2/EPHX3/GGH/TAP1/CBS/CRISP3/HMGA1/BCL11A/ECT2/TFPI2/RAD54L/ISG15/RAD51/TK1/CCNE1/CD38/TNFRSF11A/HSD17B2/CTLA4/CCL5/KIAA0101/ANGPT4/PROM1/MKI67/FMOD/IKBKB/FRZB/APOD/IQGAP2/F3/CNN1/ALCAM/ALOX15B/GJA1/HLF/AKAP12/AGTR1/MN1/TGFB3/TAC1/FCER1A/SERPINA5/CDO1/GSTT1/ZEB1/DKK2/SALL2/IGFBP4/PLAT/AR/TNFRSF11B/CAMTA1/IRS1/FBN1/DCN/MAOA/BCAM/GSTM3/LTC4S/IL6ST/TPSAB1/COL4A5/MUC1/TNS2/TGFBR3/AZGP1/ITGA7/MAK/CLDN5/TFF1/IGF1/CEACAM6/SFRP4/KIAA1324/VAV3/DUSP4/RAMP2/PDGFD/ADIPOQ/ZNF423/MYH11/CA12/HLA-DQA1/ADH1B/PDZK1/ACKR1/SCGB2A2/FABP4/BMP4/LRP2/MAPT/WISP2/ERBB4/FOXA1/CXCL14/GRP/TPSB2/CX3CR1/SCGB1D2/PIP/STC2/GATA3/TFAP2B/NAT1/PGR/AGR2
## DOID:2994                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MMP1/NDC80/ASPM/RRM2/NEK2/MAD2L1/BIRC5/AQP9/IDO1/CCNB1/HPSE/MCM2/CCL20/KIF14/PLK1/MAGEA3/MMP9/ORM1/EXO1/CXCL8/NKX2-2/GGH/CBS/ISG15/RAD51/APLP1/CCNE1/PROM1/MKI67/FRZB/GJA1/HTR2B/GSTT1/AR/IRS1/BCAM/IL6ST/MUC1/XIST/IGF1/SFRP4/VAV3/PDGFD/CA12/SCGB2A2/BMP4/ERBB4
## DOID:193                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MMP1/S100A9/S100A8/BCL2A1/TOP2A/CXCL10/CAMP/CDK1/MSLN/CCL18/BIRC5/S100P/EZH2/AURKB/MMP12/AURKA/NUDT1/CCNB1/MCM5/MCM2/PLK1/MAGEA3/MMP9/BUB1B/CXCL8/EIF4EBP1/EPHX3/CRISP3/TK1/CCNE1/CTLA4/CCL5/PROM1/MKI67/IKBKB/F3/ALCAM/ALOX15B/GSTT1/IGFBP4/AR/IRS1/MAOA/IL6ST/MUC1/AZGP1/ITGA7/MAK/CLDN5/TFF1/IGF1/CEACAM6/SFRP4/KIAA1324/VAV3/PDGFD/HLA-DQA1/BMP4/LRP2/PGR/AGR2
## DOID:10283                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    MMP1/S100A9/S100A8/BCL2A1/CXCL10/BIRC5/S100P/EZH2/MMP12/NUDT1/CCNB1/PLK1/MAGEA3/MMP9/BUB1B/CXCL8/EPHX3/CRISP3/CCL5/MKI67/IKBKB/F3/ALOX15B/GSTT1/IGFBP4/AR/IRS1/MAOA/MUC1/AZGP1/ITGA7/MAK/TFF1/IGF1/SFRP4/VAV3/PDGFD/BMP4/LRP2/AGR2
##              Count
## DOID:162       266
## DOID:14566     267
## DOID:0050686   187
## DOID:2994       47
## DOID:193        61
## DOID:10283      40

6.3 Visualize enrichment result

We also implement a bar plot and a category-gene network for visualization. Enrichment results are commonly shown as bar or pie charts; because we consider pie charts misleading, only the bar chart is provided.

barplot(x)

dotplot is a good alternative to barplot.

dotplot(x)

To account for the biological complexity in which a gene may belong to multiple annotation categories, we developed the cnetplot function, which displays the associations between genes and diseases as a network.

cnetplot(x, categorySize="pvalue", foldChange=geneList)
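
The gene-disease associations drawn by such a network can be thought of as an edge list in which each gene is paired with every term it belongs to. The following base R sketch illustrates that idea only; it is not how cnetplot is implemented. The data frame res is a hypothetical, truncated excerpt that mimics the "/"-separated geneID column shown in the enrichment output, and the variable names are made up for this example.

```r
# Hypothetical, truncated excerpt of an enrichment result: one row per
# disease term, member genes collapsed into a "/"-separated string.
res <- data.frame(
  ID     = c("DOID:2994", "DOID:193", "DOID:10283"),
  geneID = c("MMP1/NDC80/BIRC5", "MMP1/S100A9/BIRC5", "MMP1/S100A9/EZH2"),
  stringsAsFactors = FALSE
)

# Split each geneID string and pair every gene with its term.
geneSets <- strsplit(res$geneID, "/", fixed = TRUE)
edges <- data.frame(
  term = rep(res$ID, lengths(geneSets)),
  gene = unlist(geneSets),
  stringsAsFactors = FALSE
)

# Count the terms linked to each gene; a count above 1 marks a gene
# shared by several disease categories.
termsPerGene <- table(edges$gene)
sharedGenes  <- names(termsPerGene)[termsPerGene > 1]

print(edges)
print(sharedGenes)
```

Genes that appear in sharedGenes are the ones that would connect several disease nodes in a category-gene network.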

6.4 Disease association comparison

We have developed an R package, clusterProfiler [11], for comparing biological themes among gene clusters. DOSE works with clusterProfiler, so gene clusters can also be compared from a disease perspective.

require(clusterProfiler)
data(gcSample)
cdo <- compareCluster(gcSample, fun="enrichDO")
plot(cdo)

7 Disease analysis of NGS data

Disease analysis using NGS data (e.g., RNA-Seq and ChIP-Seq) can be performed by linking coding and non-coding regions to coding genes via the ChIPseeker package, which can annotate genomic regions to their nearest genes, host genes, and flanking genes, respectively. In addition, it provides a function, seq2gene, that simultaneously considers host genes, promoter regions, and flanking genes of intergenic regions that may be under cis-regulatory control. This function maps genomic regions to genes in a many-to-many manner and facilitates functional analysis. For more details, please refer to ChIPseeker [12].

8 Gene set enrichment analysis

8.1 GSEA algorithm

A common approach to analyzing gene expression profiles is to identify differentially expressed genes that are deemed interesting. The enrichment analyses demonstrated previously were based on such differentially expressed genes. This approach finds genes for which the difference is large, but it does not detect a situation where the difference is small yet occurs in a coordinated way across a set of related genes. Gene Set Enrichment Analysis (GSEA) [13] directly addresses this limitation. All genes can be used in GSEA; it aggregates the per-gene statistics across the genes within a gene set, making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. This matters because many relevant phenotypic differences are likely to be manifested by small but consistent changes in a set of genes.

Genes are ranked by a statistic that reflects their association with the phenotype (for example, fold change). Given an a priori defined set of genes S (e.g., genes sharing the same DO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or primarily found at the top or bottom.
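
The idea of checking whether the members of S concentrate at the top or bottom of L can be illustrated with a weighted running-sum statistic in the spirit of Subramanian et al. The following is a minimal base R sketch under stated assumptions, not the code used inside DOSE: the function name enrichment_score, its exponent argument, and the toy gene names and values are all made up for illustration.

```r
# geneList: named numeric vector sorted in decreasing order
# geneSet:  character vector of gene identifiers (the set S)
enrichment_score <- function(geneList, geneSet, exponent = 1) {
  stopifnot(!is.null(names(geneList)), !is.unsorted(rev(geneList)))
  inSet <- names(geneList) %in% geneSet
  nMiss <- sum(!inSet)
  stopifnot(any(inSet), nMiss > 0)

  # Walking down the list, a hit raises the running sum in proportion
  # to |statistic|^exponent, while a miss lowers it by a constant step.
  hitWeight    <- abs(geneList)^exponent * inSet
  pHit         <- cumsum(hitWeight) / sum(hitWeight)
  pMiss        <- cumsum(!inSet) / nMiss
  runningScore <- pHit - pMiss

  # The score is the running-sum value that deviates most from zero.
  unname(runningScore[which.max(abs(runningScore))])
}

# Toy ranked list (hypothetical values).
toyList <- sort(c(g1 = 2.5, g2 = 2.0, g3 = 1.2, g4 = 0.4,
                  g5 = -0.3, g6 = -0.9, g7 = -1.6, g8 = -2.2),
                decreasing = TRUE)

esTop    <- enrichment_score(toyList, c("g1", "g2", "g3"))  # set at the top
esBottom <- enrichment_score(toyList, c("g7", "g8"))        # set at the bottom
print(c(top = esTop, bottom = esBottom))
```

A set concentrated at the top of the list yields a positive score and a set concentrated at the bottom yields a negative one, which matches the sign convention of the enrichmentScore column reported later in this section.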

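There are three key elements of the GSEA method [13]:

1. Calculation of an enrichment score (ES), which reflects the degree to which S is over-represented at the top or bottom of the ranked list L.
2. Estimation of the significance level of the ES by permutation.
3. Adjustment for multiple hypothesis testing when many gene sets are evaluated.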

8.2 gseAnalyzer function

In DOSE, we implemented the GSEA algorithm proposed by Subramanian et al. [13] in the gseAnalyzer function.

In the following example, to speed up the compilation of this document, only large gene sets (minGSSize = 120) were tested and only 100 permutations were performed.

y <- gseAnalyzer(geneList,
                 setType       = "DO",
                 nPerm         = 100, 
                 minGSSize     = 120,
                 pvalueCutoff  = 0.2, 
                 pAdjustMethod = "BH",
                 verbose       = FALSE)
res <- summary(y)
head(res)
##                        ID                   Description setSize
## DOID:0050828 DOID:0050828                artery disease     853
## DOID:5614       DOID:5614                   eye disease     545
## DOID:1287       DOID:1287 cardiovascular system disease    1176
## DOID:178         DOID:178              vascular disease    1004
## DOID:5679       DOID:5679               retinal disease     394
## DOID:1492       DOID:1492        eye and adnexa disease     554
##              enrichmentScore       NES     pvalue  p.adjust    qvalues
## DOID:0050828      -0.2924694 -1.347109 0.01086957 0.1302469 0.09226771
## DOID:5614         -0.2854687 -1.285156 0.01098901 0.1302469 0.09226771
## DOID:1287         -0.2732810 -1.266774 0.01111111 0.1302469 0.09226771
## DOID:178          -0.2780269 -1.279185 0.01111111 0.1302469 0.09226771
## DOID:5679         -0.3054442 -1.364227 0.01111111 0.1302469 0.09226771
## DOID:1492         -0.2838433 -1.273374 0.01123596 0.1302469 0.09226771
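
The p.adjust column reflects a correction for testing many gene sets at once. As a rough illustration of what a "BH" (Benjamini-Hochberg) adjustment does, the sketch below applies base R's p.adjust to a few made-up p-values and then filters them with a cutoff of 0.2. Whether gseAnalyzer calls p.adjust internally, and whether pvalueCutoff is applied to the adjusted values in exactly this way, is an assumption here; the set names and p-values are hypothetical.

```r
# Hypothetical raw p-values for four gene sets.
pvalues <- c(setA = 0.01, setB = 0.02, setC = 0.03, setD = 0.50)

# Benjamini-Hochberg adjustment controls the false discovery rate.
adjusted <- p.adjust(pvalues, method = "BH")

# Keep the gene sets whose adjusted p-value is below the cutoff.
cutoff <- 0.2
kept   <- names(adjusted)[adjusted < cutoff]

print(adjusted)
print(kept)
```

With only 100 permutations the raw p-values are coarse, so the adjusted values should be read as approximate; a larger nPerm gives finer resolution at the cost of run time.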

The setType should be one of “DO” or “DOLite” and is required by gseAnalyzer to prepare the corresponding gene sets. The setType can also be “GO” or “KEGG” if clusterProfiler is loaded, or “Reactome” if ReactomePA is loaded.

topID <- res[1,1]
topID
## [1] "DOID:0050828"
plot(y, geneSetID = topID)

The geneSetID parameter can also be numeric. The following command generates the same figure as the one above.

plot(y, geneSetID = 1)

9 enrichMap

An enrichment map can be drawn with the enrichMap function, which supports both enrichment results and GSEA results.

enrichMap(x)

10 GO semantic similarity calculation

GO semantic similarity can be calculated by GOSemSim [8].

11 Other enrichment analysis tools

We provide GO and KEGG enrichment analysis in clusterProfiler [11] and Reactome pathway enrichment analysis in the ReactomePA package. Both the hypergeometric test and GSEA are supported.

12 External document

13 Bugs/Feature Requests

If you have any, let me know.

14 Session Information

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.2.4 Revised (2016-03-16 r70336)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.2.3    DO.db_2.9             AnnotationDbi_1.32.3 
##  [4] IRanges_2.4.8         S4Vectors_0.8.11      Biobase_2.30.0       
##  [7] BiocGenerics_0.16.1   clusterProfiler_2.4.3 DOSE_2.8.3           
## [10] RSQLite_1.0.0         DBI_0.3.1             BiocStyle_1.8.0      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.4       XVector_0.10.0    formatR_1.3      
##  [4] plyr_1.8.3        zlibbioc_1.16.0   tools_3.2.4      
##  [7] digest_0.6.9      lattice_0.20-33   evaluate_0.8.3   
## [10] gtable_0.2.0      png_0.1-7         graph_1.48.0     
## [13] igraph_1.0.1      yaml_2.1.13       SparseM_1.7      
## [16] topGO_2.22.0      stringr_1.0.0     httr_1.1.0       
## [19] knitr_1.12.3      Biostrings_2.38.4 grid_3.2.4       
## [22] qvalue_2.2.2      R6_2.1.2          GOSemSim_1.28.2  
## [25] rmarkdown_0.9.5   GO.db_3.2.2       ggplot2_2.1.0    
## [28] reshape2_1.4.1    magrittr_1.5      scales_0.4.0     
## [31] htmltools_0.3.5   splines_3.2.4     KEGGREST_1.10.1  
## [34] colorspace_1.2-6  labeling_0.3      stringi_1.0-1    
## [37] munsell_0.4.3

References

1. Schriml, L. M. et al. Disease Ontology: A backbone for disease semantic integration. Nucleic Acids Research 40, D940–D946 (2011).

2. Yu, G., Wang, L.-G., Yan, G.-R. & He, Q.-Y. DOSE: An R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 31, 608–609 (2015).

3. Resnik, P. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11, 95–130 (1999).

4. Jiang, J. J. & Conrath, D. W. Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of 10th International Conference on Research In Computational Linguistics (1997).

5. Lin, D. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning 296–304 (1998).

6. Schlicker, A., Domingues, F. S., Rahnenführer, J. & Lengauer, T. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 7, 302 (2006).

7. Wang, J. Z., Du, Z., Payattakool, R., Yu, P. S. & Chen, C.-F. A new method to measure the semantic similarity of GO terms. Bioinformatics 23, 1274–1281 (2007).

8. Yu, G. et al. GOSemSim: An R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976–978 (2010).

9. Boyle, E. I. et al. GO::TermFinder: Open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20, 3710–3715 (2004).

10. Du, P. et al. From disease ontology to disease-ontology lite: Statistical methods to adapt a general-purpose ontology for the test of gene-ontology associations. Bioinformatics 25, i63–i68 (2009).

11. Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16, 284–287 (2012).

12. Yu, G., Wang, L.-G. & He, Q.-Y. ChIPseeker: An R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383 (2015).

13. Subramanian, A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102, 15545–15550 (2005).