# 1 Abstract

Disease Ontology (DO) aims to provide an open source ontology for the integration of biomedical data that is associated with human disease. We developed DOSE package to promote the investigation of diseases. DOSE provides five methods including Resnik, Lin, Jiang, Rel and Wang for measuring semantic similarities among DO terms and gene products; Hypergeometric model and gene set enrichment analysis were also implemented for extracting disease association insight from genome wide expression profiles.

# 2 Citation

If you use DOSE in published research, please cite G. Yu (2015). In addition please cite G. Yu (2012) when using compareCluster in clusterProfiler, G. Yu (2015) when applying enrichment analysis to NGS data by using ChIPseeker and G. Yu (2010) when using GOSemSim for GO semantic similarity analysis.

G Yu, LG Wang, GR Yan, QY He.
DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis
Bioinformatics 2015, 31(4):608-609.
G Yu, F Li, Y Qin, X Bo, Y Wu, S Wang.
GOSemSim: an R package for measuring semantic similarity among GO terms and gene products.
Bioinformatics 2010, 26(7):976-978.
G Yu, LG Wang, Y Han, QY He.
clusterProfiler: an R package for comparing biological themes among gene clusters.
OMICS: A Journal of Integrative Biology 2012, 16(5):284-287.
G Yu, LG Wang, QY He.
ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization.
Bioinformatics 2015, 31(14):2382-2383.

# 3 Introduction

Public health is an important driving force behind biological and medical research. A major challenge of the post-genomic era is bridging the gap between fundamental biological research and its clinical applications. Recent research has increasingly demonstrated that many seemingly dissimilar diseases have common molecular mechanisms. Understanding similarities among disease aids in early diagnosis and new drug development.

Formal knowledge representation of gene-disease association is demanded for this purpose. Ontologies, such as Gene Ontology, have been successfully applied to represent biological knowledge, and many related techniques have been adopted to extract information. Disease Ontology (DO)1 was developed to create a consistent description of gene products with disease perspectives, and is essential for supporting functional genomics in disease context. Accurate disease descriptions can discover new relationships between genes and disease, and new functions for previous uncharacteried genes and alleles.

Unlike other clinical vocabularies that defined disease related concepts disparately, DO is organized as a directed acyclic graph, laying the foundation for quantitative computation of disease knowledge. The application of disease ontology is in its infancy, lacking programs for mining DO knowledge automatically.

Here, we present an R package DOSE[2) for analyzing semantic similarities among DO terms and gene products annotated with DO terms, and extracting disease association insight from genome wide expression profiles.

Four information content (IC)-based methods and one graph structure-based method were implemented for measuring semantic similarity. Hypergeometric test and Gene Set Enrichment Analysis were implemented for extracting biological insight.

To start with DOSE package, type following code below:

library(DOSE)
help(DOSE)

# 4 DO term semantic similarity measurement

Four methods determine the semantic similarity of two terms based on the Information Content of their common ancestor term were proposed by Resnik3, Jiang4, Lin5 and Schlicker6. Wang7 presented a method to measure the similarity based on the graph structure. Each of these methods has its own advantage and weakness. DOSE implemented all these methods to compute semantic similarity among DO terms and gene products. We have developed another package GOSemSim8 to explore the functional similarity at GO perspective, including molecular function (MF), biological process (BP) and cellular component (CC).

For algorithm details, please refer to the vignette of GOSemSim.

## 4.1 doSim function

In DOSE, we implemented doSim for calculating semantic similarity between two DO terms and two set of DO terms.

data(DO2EG)
set.seed(123)
a <- sample(names(DO2EG), 10)
a
##  [1] "DOID:14095" "DOID:5844"  "DOID:2044"  "DOID:8432"  "DOID:9146"
##  [6] "DOID:10588" "DOID:3209"  "DOID:848"   "DOID:3341"  "DOID:252"
b <- sample(names(DO2EG), 5)
b
## [1] "DOID:9409"  "DOID:2491"  "DOID:4467"  "DOID:3498"  "DOID:11256"
doSim(a[1], b[1], measure="Wang")
## [1] 0.07142995
doSim(a[1], b[1], measure="Resnik")
## [1] 0
doSim(a[1], b[1], measure="Lin")
## [1] 0
s <- doSim(a, b, measure="Wang")
s
##             DOID:9409  DOID:2491  DOID:4467  DOID:3498 DOID:11256
## DOID:14095 0.07142995 0.05714393 0.03676194 0.03676194 0.52749870
## DOID:5844  0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:2044  0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:8432  0.17347273 0.13877811 0.03676194 0.03676194 0.07142995
## DOID:9146  0.07142995 0.05714393 0.03676194 0.03676194 0.17347273
## DOID:10588 0.13240905 0.18401515 0.02208240 0.02208240 0.05452137
## DOID:3209  0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:848   0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:3341  0.13240905 0.09998997 0.02208240 0.02208240 0.05452137
## DOID:252   0.06134327 0.04761992 0.02801328 0.02801328 0.06134327

The doSim function requires three parameter DOID1, DOID2 and measure. DOID1 and DOID2 should be a vector of DO terms, while measure should be one of Resnik, Jiang, Lin, Rel, and Wang.

We also implement a plot function simplot to visualize the similarity result.

simplot(s,
color.low="white", color.high="red",
labs=TRUE, digits=2, labs.size=5,
font.size=14, xlab="", ylab="")

Parameter color.low and colow.high are used to setting the color gradient; labs is a logical parameter indicating whether to show the similarity values or not, digits to indicate the number of decimal places to be used and labs.size control the font size of similarity values; font.size setting the font size of axis and label of the coordinate system.

# 5 Gene semantic similarity measurement

On the basis of semantic similarity between DO terms, DOSE can also compute semantic similarity among gene products. DOSE provides four methods which called max, avg, rcmax and BMA to combine semantic similarity scores of multiple DO terms. The similarities among genes and gene clusters which annotated by multiple DO terms were also calculated by these combine methods. For calculation details, please refer to the vignette of GOSemSim.

## 5.1 geneSim function

In DOSE, we implemented geneSim to measure semantic similarities among genes.

data(EG2DO)
g1 <- sample(names(EG2DO), 5)
g1
## [1] "84842" "2524"  "10590" "3070"  "91746"
g2 <- sample(names(EG2DO), 4)
g2
## [1] "84289" "6045"  "56999" "9869"
geneSim(g1[1], g2[1], measure="Wang", combine="BMA")
## [1] 0.051
gs <- geneSim(g1, g2, measure="Wang", combine="BMA")
gs
##       84289  6045 56999  9869
## 84842 0.051 0.135 0.355 0.098
## 2524  0.284 0.172 0.517 0.517
## 10590 0.150 0.173 0.242 0.265
## 3070  0.573 0.517 1.000 1.000
## 91746 0.351 0.308 0.527 0.501

The geneSim requires four parameter geneID1, geneID2, measure and combine. geneID1 and geneID2 should be a vector of entrez gene IDs; measure should be one of Resnik, Jiang, Lin, Rel, and Wang, while combine should be one of max, avg, rcmax and BMA as described previously.

The simplot works well with both the output of doSim and geneSim.

## 5.2 clusterSim and mclusterSim

We also implemented clusterSim for calculating semantic similarity between two gene clusters and mclusterSim for calculating semantic similarities among multiple gene clusters.

clusterSim(g1, g2, measure="Wang", combine="BMA")
## [1] 0.51
clusters <- list(a=g1, b=g2, c=sample(names(EG2DO), 6))
mclusterSim(clusters, measure="Wang", combine="BMA")
##       a    b     c
## a 1.000 0.51 0.379
## b 0.510 1.00 0.620
## c 0.379 0.62 1.000

# 6 DO term enrichment analysis

## 6.1 Hypergeometric model

Over-representation test9 is a widely used approach to identify biological themes. Here we implement hypergeometric model to assess whether the number of selected genes associated with disease is larger than expected.

To determine whether any terms annotate a specified list of genes at frequency greater than that would be expected by chance, DOSE calculates a p-value using the hypergeometric distribution:

$$p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{{M \choose i}{{N-M} \choose {n-i}}} {{N \choose n}}$$

In this equation, N is the total number of genes in the background distribution, M is the number of genes within that distribution that are annotated (either directly or indirectly) to the node of interest, n is the size of the list of genes of interest and k is the number of genes within that list which are annotated to the node. The background distribution by default is all the genes that have annotation. User can set the background via universe parameter.

P-values were adjusted for multiple comparison, and q-values were also calculated for FDR control.

## 6.2 enrichDO function

DOSE provides an example dataset geneList which was derived from R package breastCancerMAINZ that contained 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of geometric means of grade III samples versus geometric means of grade I samples. Logarithm of these ratios (base 2) were stored in geneList dataset.

In the following example, we selected fold change above 1 as the differential genes and analyzing their disease association.

data(geneList)
gene <- names(geneList)[abs(geneList) > 1.5]
head(gene)
## [1] "4312"  "8318"  "10874" "55143" "55388" "991"
x <- enrichDO(gene          = gene,
ont           = "DO",
pvalueCutoff  = 0.05,
universe      = names(geneList),
minGSSize     = 5,
qvalueCutoff  = 0.05,
head(summary(x))
##                    ID                    Description GeneRatio  BgRatio
## DOID:170     DOID:170         endocrine gland cancer    48/331 472/6268
## DOID:10283 DOID:10283                prostate cancer    40/331 394/6268
## DOID:3459   DOID:3459               breast carcinoma    37/331 357/6268
## DOID:3856   DOID:3856 male reproductive organ cancer    40/331 404/6268
## DOID:824     DOID:824                  periodontitis    16/331 109/6268
## DOID:3905   DOID:3905                 lung carcinoma    43/331 465/6268
##                  pvalue    p.adjust      qvalue
## DOID:170   5.662129e-06 0.004648608 0.003707204
## DOID:10283 3.859157e-05 0.013526328 0.010787071
## DOID:3459  4.942629e-05 0.013526328 0.010787071
## DOID:3856  6.821467e-05 0.014001061 0.011165664
## DOID:824   1.699304e-04 0.018323811 0.014613001
## DOID:3905  1.749754e-04 0.018323811 0.014613001
##                                                                                                                                                                                                                                                   geneID
## DOID:170   10874/7153/1381/6241/11065/10232/332/6286/2146/10112/891/9232/4171/993/5347/4318/3576/1515/4821/8836/3159/7980/5888/333/898/9768/4288/3551/2152/9590/185/7043/3357/2952/5327/3667/1634/1287/4582/7122/3479/4680/6424/80310/652/8839/9547/1524
## DOID:10283                                          4312/6280/6279/597/3627/332/6286/2146/4321/4521/891/5347/4102/4318/701/3576/79852/10321/6352/4288/3551/2152/247/2952/3487/367/3667/4128/4582/563/3679/4117/7031/3479/6424/10451/80310/652/4036/10551
## DOID:3459                                                          4312/6280/6279/7153/4751/890/4085/332/6286/6790/891/9232/10855/4171/5347/4318/701/2633/3576/9636/898/8792/4288/2952/4982/4128/4582/7031/3479/771/4250/2066/3169/10647/5304/5241/10551
## DOID:3856                                           4312/6280/6279/597/3627/332/6286/2146/4321/4521/891/5347/4102/4318/701/3576/79852/10321/6352/4288/3551/2152/247/2952/3487/367/3667/4128/4582/563/3679/4117/7031/3479/6424/10451/80310/652/4036/10551
## DOID:824                                                                                                                                                                   4312/6279/820/7850/4321/3595/4318/4069/3576/1493/6352/8842/185/2952/5327/4982
## DOID:3905                          4312/6280/2305/9133/6279/7153/6278/6241/55165/11065/8140/10232/332/6286/3002/9212/4521/891/4171/9928/8061/4318/3576/1978/1894/7980/7083/898/6352/8842/4288/2152/2697/2952/3572/4582/7049/563/3479/1846/3117/2532/2922
##            Count
## DOID:170      48
## DOID:10283    40
## DOID:3459     37
## DOID:3856     40
## DOID:824      16
## DOID:3905     43

The enrichDO function requires an entrezgene ID vector as input, mostly is the differential gene list of gene expression profile studies. If user needs to convert other gene ID type to entrezgene ID, we recommend using bitr function provided by clusterProfiler.

The ont parameter can be “DO” or “DOLite”, DOLite10 was constructed to aggregate the redundant DO terms. The DOLite data is not updated, we recommend user use ont=“DO”. pvalueCutoff setting the cutoff value of p value and p value adjust; pAdjustMethod setting the p value correction methods, include the Bonferroni correction (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”), Hommel (“hommel”), Benjamini & Hochberg (“BH”) and Benjamini & Yekutieli (“BY”) while qvalueCutoff is used to control q-values.

The universe setting the background gene universe for testing. If user do not explicitly setting this parameter, enrichDO will set the universe to all human genes that have DO annotation.

The minGSSize indicates that only those DO terms that have more than genes annotated will be tested.

The readable is a logical parameter, indicates whether the entrezgene IDs will mapping to gene symbols or not.

We also implement setReadable function that helps the user to convert entrezgene IDs to gene symbols.

x <- setReadable(x, 'org.Hs.eg.db')
head(summary(x))
##                    ID                    Description GeneRatio  BgRatio
## DOID:170     DOID:170         endocrine gland cancer    48/331 472/6268
## DOID:10283 DOID:10283                prostate cancer    40/331 394/6268
## DOID:3459   DOID:3459               breast carcinoma    37/331 357/6268
## DOID:3856   DOID:3856 male reproductive organ cancer    40/331 404/6268
## DOID:824     DOID:824                  periodontitis    16/331 109/6268
## DOID:3905   DOID:3905                 lung carcinoma    43/331 465/6268
##                  pvalue    p.adjust      qvalue
## DOID:170   5.662129e-06 0.004648608 0.003707204
## DOID:10283 3.859157e-05 0.013526328 0.010787071
## DOID:3459  4.942629e-05 0.013526328 0.010787071
## DOID:3856  6.821467e-05 0.014001061 0.011165664
## DOID:824   1.699304e-04 0.018323811 0.014613001
## DOID:3905  1.749754e-04 0.018323811 0.014613001
##                                                                                                                                                                                                                                                                                             geneID
## DOID:10283                                                      MMP1/S100A9/S100A8/BCL2A1/CXCL10/BIRC5/S100P/EZH2/MMP12/NUDT1/CCNB1/PLK1/MAGEA3/MMP9/BUB1B/CXCL8/EPHX3/CRISP3/CCL5/MKI67/IKBKB/F3/ALOX15B/GSTT1/IGFBP4/AR/IRS1/MAOA/MUC1/AZGP1/ITGA7/MAK/TFF1/IGF1/SFRP4/VAV3/PDGFD/BMP4/LRP2/AGR2
## DOID:3856                                                       MMP1/S100A9/S100A8/BCL2A1/CXCL10/BIRC5/S100P/EZH2/MMP12/NUDT1/CCNB1/PLK1/MAGEA3/MMP9/BUB1B/CXCL8/EPHX3/CRISP3/CCL5/MKI67/IKBKB/F3/ALOX15B/GSTT1/IGFBP4/AR/IRS1/MAOA/MUC1/AZGP1/ITGA7/MAK/TFF1/IGF1/SFRP4/VAV3/PDGFD/BMP4/LRP2/AGR2
## DOID:824                                                                                                                                                                                           MMP1/S100A8/CAMP/IL1R2/MMP12/IL12RB2/MMP9/LYZ/CXCL8/CTLA4/CCL5/PROM1/AGTR1/GSTT1/PLAT/TNFRSF11B
## DOID:3905                               MMP1/S100A9/FOXM1/CCNB2/S100A8/TOP2A/S100A7/RRM2/CEP55/UBE2C/SLC7A5/MSLN/BIRC5/S100P/GZMB/AURKB/NUDT1/CCNB1/MCM2/KIF14/FOSL1/MMP9/CXCL8/EIF4EBP1/ECT2/TFPI2/TK1/CCNE1/CCL5/PROM1/MKI67/F3/GJA1/GSTT1/IL6ST/MUC1/TGFBR3/AZGP1/IGF1/DUSP4/HLA-DQA1/ACKR1/GRP
##            Count
## DOID:170      48
## DOID:10283    40
## DOID:3459     37
## DOID:3856     40
## DOID:824      16
## DOID:3905     43

## 6.3 Visualze enrichment result

We also implement a bar plot and category-gene-network for visualization. It is very common to visualize the enrichment result in bar or pie chart. We believe the pie chart is misleading and only provide bar chart.

barplot(x)

dotplot is a good alternative to barplot.

dotplot(x)

In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories, we developed cnetplot function to extract the complex association between genes and diseases.

cnetplot(x, categorySize="pvalue", foldChange=geneList)

## 6.4 Disease association comparison

We have developed an R package clusterProfiler11 for comparing biological themes among gene clusters. DOSE works fine with clusterProfiler and can compare biological themes at disease perspective.

require(clusterProfiler)
data(gcSample)
cdo <- compareCluster(gcSample, fun="enrichDO")
plot(cdo)

# 7 Disease analysis of NGS data

Disease analysis using NGS data (eg, RNA-Seq and ChIP-Seq) can be performed by linking coding and non-coding regions to coding genes via ChIPseeker package, which can annotates genomic regions to their nearest genes, host genes, and flanking genes respectivly. In addtion, it provides a function, seq2gene, that simultaneously considering host genes, promoter region and flanking gene from intergenic region that may under control via cis-regulation. This function maps genomic regions to genes in a many-to-many manner and facilitate functional analysis. For more details, please refer to ChIPseeker12.

# 8 Gene set enrichment analysis

## 8.1 GSEA algorithm

A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)13 directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.

Genes are ranked based on their phenotypes. Given a priori defined set of gens S (e.g., genes shareing the same DO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or primarily found at the top or bottom.

There are three key elements of the GSEA method:

• Calculation of an Enrichment Score.
• The enrichment score (ES) represent the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing when it is not. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic13.
• Esimation of Significance Level of ES.
• The p-value of the is calculated using permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permutated data, which generate a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
• Adjustment for Multiple Hypothesis Testing.
• When the entire DO gene sets is evaluated, DOSE adjust the estimated significance level to account for multiple hypothesis testing and also q-values were calculated for FDR control.

## 8.2 gseAnalyzer fuction

In DOSE, we implemented GSEA algorithm proposed by Subramanian13 in gseDOr function.

In the following example, in order to speedup the compilation of this document, only gene sets with size above 120 were tested and only 100 permutations were performed.

y <- gseDO(geneList,
nPerm         = 100,
minGSSize     = 120,
pvalueCutoff  = 0.2,
verbose       = FALSE)
res <- summary(y)
head(res)
##                  ID            Description setSize enrichmentScore
## DOID:9970 DOID:9970                obesity     289      -0.3316172
## DOID:374   DOID:374      nutrition disease     313      -0.3421127
## DOID:654   DOID:654          overnutrition     298      -0.3490198
## DOID:1492 DOID:1492 eye and adnexa disease     459      -0.3105160
## DOID:114   DOID:114          heart disease     462      -0.2978223
## DOID:5614 DOID:5614            eye disease     450      -0.3125247
##                 NES     pvalue   p.adjust    qvalues
## DOID:9970 -1.419542 0.01176471 0.09389952 0.04973558
## DOID:374  -1.460875 0.01190476 0.09389952 0.04973558
## DOID:654  -1.484608 0.01190476 0.09389952 0.04973558
## DOID:1492 -1.372303 0.01204819 0.09389952 0.04973558
## DOID:114  -1.315930 0.01234568 0.09389952 0.04973558
## DOID:5614 -1.376323 0.01234568 0.09389952 0.04973558
topID <- res[1,1]
topID
## [1] "DOID:9970"
plot(y, geneSetID = topID)