1 Introduction

Functional enrichment analysis methods such as gene set enrichment analysis (GSEA) are widely used for analyzing gene expression data. GSEA infers results at the level of gene sets by calculating enrichment scores for predefined sets of genes, and it therefore depends on the availability and accuracy of those gene sets. Because multiple terms may describe a single biological process, terms of gene sets or categories often overlap, which leads to redundancy among enriched terms. In other words, sets of related terms are overlapping. Using deep learning, this package aims to predict enrichment scores for unique tokens or words from the text in gene set names, in order to mitigate this overlap. Furthermore, a new term can be coined by combining tokens, and its enrichment score can be obtained by predicting on the combined tokens.

Text can be seen as sequential data, either as a sequence of characters or as a sequence of words. A recurrent neural network (RNN) is a type of neural network that operates on sequential data, and it has been applied to a variety of tasks in natural language processing such as machine translation. However, RNNs suffer from the long-term dependency problem: they struggle to retain information over long stretches of a sequence. A Long Short-Term Memory (LSTM) network is a special kind of RNN designed to solve the long-term dependency problem. A bidirectional LSTM network consists of two distinct LSTM networks, the forward LSTM and the backward LSTM, which process the sequence in opposite directions. The Gated Recurrent Unit (GRU) is a simplified version of the LSTM with fewer parameters, so the total number of parameters can be greatly reduced for a large neural network. LSTM and GRU are known to be successful remedies to the long-term dependency problem. These models take terms of gene sets as input and enrichment scores as output in order to predict enrichment scores of new terms.
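The parameter saving of the GRU can be made concrete with a small base R calculation (an illustration only, using the classical gate formulation; actual Keras implementations may add extra bias terms):

```r
# Weight count of one recurrent layer (classical formulation):
# each gate has an input matrix (n x d), a recurrent matrix (n x n),
# and a bias vector (n). An LSTM uses 4 gates, a GRU only 3.
recurrent_params <- function(n_gates, d, n) n_gates * (n * d + n^2 + n)

d <- 50  # embedding dimension (illustrative)
n <- 32  # hidden units (illustrative)
recurrent_params(4, d, n)  # LSTM: 10624 weights
recurrent_params(3, d, n)  # GRU:   7968 weights, i.e. 25% fewer
```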



2 Example

2.1 Terms of gene sets

2.1.1 GSEA

Consider a simple example. Once GSEA is performed, the result calculated from GSEA is fed into the algorithm to train the deep learning models.

library(ttgsea)
library(fgsea)
data(examplePathways)
data(exampleRanks)
names(examplePathways) <- gsub("_", " ", substr(names(examplePathways), 9, 1000))

set.seed(1)
fgseaRes <- fgseaSimple(examplePathways, exampleRanks, nperm = 10000)
data.table::data.table(fgseaRes[order(fgseaRes$NES, decreasing = TRUE),])
##                                                        pathway         pval
##    1:                                     Mitotic Prometaphase 0.0001520219
##    2:                  Resolution of Sister Chromatid Cohesion 0.0001537043
##    3:                                      Cell Cycle, Mitotic 0.0001255020
##    4:                             RHO GTPases Activate Formins 0.0001534684
##    5:                                               Cell Cycle 0.0001227446
##   ---                                                                      
## 1416: Downregulation of SMAD2 3:SMAD4 transcriptional activity 0.0022655188
## 1417:                                  HATs acetylate histones 0.0002779322
## 1418:                   TRAF3-dependent IRF activation pathway 0.0010962508
## 1419:                                     Nephrin interactions 0.0013498313
## 1420:                                  Interleukin-6 signaling 0.0004174494
##              padj         ES       NES nMoreExtreme size
##    1: 0.004481064  0.7253270  2.963541            0   82
##    2: 0.004481064  0.7347987  2.954314            0   74
##    3: 0.004481064  0.5594755  2.751403            0  317
##    4: 0.004481064  0.6705979  2.717798            0   78
##    5: 0.004481064  0.5388497  2.688064            0  369
##   ---                                                   
## 1416: 0.028982313 -0.6457899 -1.984552            9   16
## 1417: 0.006365544 -0.4535612 -1.994238            0   68
## 1418: 0.017020529 -0.7176839 -2.022102            4   12
## 1419: 0.019558780 -0.6880106 -2.025979            5   14
## 1420: 0.008590987 -0.8311374 -2.079276            1    8
##                                      leadingEdge
##    1:   66336,66977,12442,107995,66442,52276,...
##    2:   66336,66977,12442,107995,66442,52276,...
##    3:   66336,66977,12442,107995,66442,12571,...
##    4:   66336,66977,107995,66442,52276,67629,...
##    5:   66336,66977,12442,107995,66442,19361,...
##   ---                                           
## 1416:    66313,20482,20481,17127,17128,83814,...
## 1417: 74026,319190,244349,75560,13831,246103,...
## 1418:   56489,12914,54131,54123,56480,217069,...
## 1419:   109711,14360,20742,17973,18708,12488,...
## 1420:        16194,16195,16451,12402,16452,20848
### convert from gene set defined by BiocSet::BiocSet to list
#library(BiocSet)
#genesets <- BiocSet(examplePathways)
#gsc_list <- as(genesets, "list")

# convert from gene set defined by GSEABase::GeneSetCollection to list
#library(GSEABase)
#genesets <- BiocSet(examplePathways)
#gsc <- as(genesets, "GeneSetCollection")
#gsc_list <- list()
#for (i in 1:length(gsc)) {
#  gsc_list[[setName(gsc[[i]])]] <- geneIds(gsc[[i]])
#}

#set.seed(1)
#fgseaRes <- fgseaSimple(gsc_list, exampleRanks, nperm = 10000)


2.1.2 deep learning and embedding

Since deep learning architectures cannot process characters or words in their raw form, the text must first be converted to numbers. Word embeddings are such numeric representations of text. For tokenization, unigram and bigram sequences are used by default. An integer is assigned to each token, and each term is then converted to a sequence of integers. Sequences longer than the given maximum length are truncated, whereas shorter sequences are padded with zeros. Keras is a higher-level library built on top of TensorFlow, available in R through the keras package. The input to the Keras embedding layer consists of these token integers, and the embedding layer acts as the first hidden layer of the neural network.
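As a rough sketch of this preprocessing in base R (the package itself relies on the keras tokenizer; bigrams are omitted here, and keras pads on the left by default whereas zeros are appended on the right below for readability):

```r
terms <- c("Cell Cycle", "DNA Replication", "Cell Cycle Mitotic")

# split each term into lowercase word tokens (unigrams only)
tokens <- strsplit(tolower(terms), "\\s+")

# assign an integer index to each unique token
vocab <- unique(unlist(tokens))
index <- setNames(seq_along(vocab), vocab)

# convert each term to an integer sequence, then truncate or zero-pad
length_seq <- 4
to_padded <- function(x) {
  s <- head(unname(index[x]), length_seq)
  c(s, rep(0L, length_seq - length(s)))
}
t(sapply(tokens, to_padded))
```

Each row of the resulting matrix is one term as a fixed-length integer sequence, ready for an embedding layer.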

if (keras::is_keras_available() && reticulate::py_available()) {
  # model parameters
  num_tokens <- 1000
  length_seq <- 30
  batch_size <- 32
  embedding_dim <- 50
  num_units <- 32
  epochs <- 10
  
  # algorithm
  ttgseaRes <- fit_model(fgseaRes, "pathway", "NES",
                         model = bi_lstm(num_tokens, embedding_dim,
                                         length_seq, num_units),
                         num_tokens = num_tokens,
                         length_seq = length_seq,
                         epochs = epochs,
                         batch_size = batch_size,
                         use_generator = FALSE,
                         callbacks = keras::callback_early_stopping(
                            monitor = "loss",
                            patience = 10,
                            restore_best_weights = TRUE))
  
  # prediction for every token
  ttgseaRes$token_pred
  ttgseaRes$token_gsea[["TGF beta"]][,1:5]
}
## Epoch 1/10
## 45/45 [==============================] - 5s 33ms/step - loss: 1.5523 - pearson_correlation: 0.1222
## Epoch 2/10
## 45/45 [==============================] - 1s 32ms/step - loss: 1.3735 - pearson_correlation: 0.3988
## Epoch 3/10
## 45/45 [==============================] - 1s 31ms/step - loss: 1.0450 - pearson_correlation: 0.6163
## Epoch 4/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.8515 - pearson_correlation: 0.6649
## Epoch 5/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.7417 - pearson_correlation: 0.7126
## Epoch 6/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.6707 - pearson_correlation: 0.7451
## Epoch 7/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.6176 - pearson_correlation: 0.7669
## Epoch 8/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.5858 - pearson_correlation: 0.7810
## Epoch 9/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.5621 - pearson_correlation: 0.7889
## Epoch 10/10
## 45/45 [==============================] - 1s 33ms/step - loss: 0.5416 - pearson_correlation: 0.8006
##                                                                       pathway
## 614                               TGF-beta receptor signaling activates SMADs
## 615                                    Signaling by TGF-beta Receptor Complex
## 624                             Downregulation of TGF-beta receptor signaling
## 1267 TGF-beta receptor signaling in EMT epithelial to mesenchymal transition 
##             pval       padj         ES       NES
## 614  0.042648445 0.21175102 -0.4407843 -1.551889
## 615  0.004676539 0.04614047 -0.4244875 -1.763638
## 624  0.033777574 0.18377071 -0.4868887 -1.591561
## 1267 0.726840855 0.88913295 -0.2872284 -0.788917


2.1.3 Monte Carlo p-value

Deep learning models predict only enrichment scores; p-values for those scores are not provided by the model. So the Monte Carlo p-value method is used within the algorithm. Computing the p-value for a statistical test can be accomplished easily via Monte Carlo simulation. Ordinary Monte Carlo is a simulation technique for approximating the expectation of a function of a random variable when the exact expectation cannot be found analytically. The Monte Carlo p-value method simulates many datasets under the null hypothesis, computes the statistic for each simulated dataset, and then takes the percentile rank of the observed value among these simulated values. The number of tokens used for each simulation is the same as the length of the sequence of the corresponding term. If a new text does not contain any known tokens, its p-value is not available.
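A toy version of the procedure in base R (illustrative values only, not the package's internal implementation):

```r
set.seed(1)

# hypothetical predicted scores for 100 individual tokens
token_scores <- rnorm(100)

# hypothetical observed score for a new term built from 2 tokens
observed <- 2.1

# null distribution: statistic from 2 randomly drawn tokens,
# matching the sequence length of the term, repeated many times
n_sim <- 10000
null_stat <- replicate(n_sim, mean(sample(token_scores, 2)))

# one-sided Monte Carlo p-value with the usual +1 correction
mc_p <- (sum(null_stat >= observed) + 1) / (n_sim + 1)
mc_p
```

The p-value is simply the proportion of simulated statistics at least as extreme as the observed one.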

if (exists("ttgseaRes")) {
  # prediction with MC p-value
  set.seed(1)
  new_text <- c("Cell Cycle DNA Replication",
                "Cell Cycle",
                "DNA Replication",
                "Cycle Cell",
                "Replication DNA",
                "TGF-beta receptor")
  print(predict_model(ttgseaRes, new_text))
  print(predict_model(ttgseaRes, "data science"))
}
##                     new_text test_value MC_p_value adj_p_value
## 1 Cell Cycle DNA Replication  3.5655158      0.000       0.000
## 2                 Cell Cycle  2.0632474      0.006       0.009
## 3            DNA Replication  2.6883779      0.002       0.006
## 4                 Cycle Cell  0.6710172      0.256       0.256
## 5            Replication DNA  1.7267684      0.006       0.009
## 6        TGF - beta receptor -1.3684919      0.060       0.072
##        new_text test_value MC_p_value adj_p_value
## 1 datum science 0.08944048         NA          NA


2.1.4 Visualization

A summary and a visualization of the model architecture can be created.

if (exists("ttgseaRes")) {
  summary(ttgseaRes$model)
  plot_model(ttgseaRes$model)
}
## Model: "model"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  input_1 (InputLayer)               [(None, 30)]                    0           
##  embedding (Embedding)              (None, 30, 50)                  50050       
##  bidirectional (Bidirectional)      (None, 64)                      21248       
##  dense (Dense)                      (None, 1)                       65          
## ================================================================================
## Total params: 71,363
## Trainable params: 71,363
## Non-trainable params: 0
## ________________________________________________________________________________
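The parameter counts in the summary above can be reproduced by hand. Assuming the embedding layer reserves one extra index for padding (consistent with 50050 = 1001 × 50), and with `embedding_dim = 50` and `num_units = 32`, each LSTM direction has 4 × (input + units + bias) × units weights:

```r
embedding     <- (1000 + 1) * 50        # 50050: one vector per token plus padding
lstm_one_dir  <- 4 * (50 + 32 + 1) * 32 # 10624: input, recurrent, and bias weights per gate
bidirectional <- 2 * lstm_one_dir       # 21248: forward and backward LSTMs
dense         <- 2 * 32 + 1             # 65: 64 concatenated units plus a bias
embedding + bidirectional + dense       # 71363, matching "Total params" above
```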


2.2 Leading edge genes

Take another example. A list of ranked gene names can also be seen as sequential data. In the result of GSEA, the names of the leading edge genes are given for each gene set. The leading edge subset contains the genes that contribute most to the enrichment score. Thus, scores can be predicted for one or more genes of the leading edge subset.

if (keras::is_keras_available() && reticulate::py_available()) {
  # leading edge
  LE <- unlist(lapply(fgseaRes$leadingEdge, function(x) gsub(",", "", toString(x))))
  fgseaRes <- cbind(fgseaRes, LE)
  
  # model parameters
  num_tokens <- 1000
  length_seq <- 30
  batch_size <- 32
  embedding_dim <- 50
  num_units <- 32
  epochs <- 10
  
  # algorithm
  ttgseaRes <- fit_model(fgseaRes, "LE", "NES",
                         model = bi_lstm(num_tokens, embedding_dim,
                                         length_seq, num_units),
                         num_tokens = num_tokens,
                         length_seq = length_seq,
                         epochs = epochs,
                         batch_size = batch_size,
                         verbose = 0,
                         callbacks = callback_early_stopping(
                            monitor = "loss",
                            patience = 5,
                            restore_best_weights = TRUE))
  
  # prediction for every token
  ttgseaRes$token_pred
  
  # prediction with MC p-value
  set.seed(1)
  new_text <- c("107995 56150", "16952")
  predict_model(ttgseaRes, new_text)
}



3 Case Study

The “airway” dataset contains four cell lines under two conditions: control and treatment with dexamethasone. Using the package “DESeq2”, differentially expressed genes between control and treated samples are identified from the gene expression data. The log2 fold change is then used as the ranking score for GSEA. For GSEA, GO biological process (GOBP) gene sets for human are obtained from the package “org.Hs.eg.db” via the package “BiocSet”. GSEA is performed with the package “fgsea”; since “fgsea” accepts a list, the gene sets are converted to a list. Finally, the result of GSEA is fitted to a deep learning model, and enrichment scores of new terms can then be predicted.

if (keras::is_keras_available() && reticulate::py_available()) {
  ## data preparation
  library(airway)
  data(airway)
  
  ## differentially expressed genes
  library(DESeq2)
  des <- DESeqDataSet(airway, design = ~ dex)
  des <- DESeq(des)
  res <- results(des)
  head(res)
  # log2FC used for GSEA
  statistic <- res$"log2FoldChange"
  names(statistic) <- rownames(res)
  statistic <- na.omit(statistic)
  head(statistic)
  
  ## gene set
  library(org.Hs.eg.db)
  library(BiocSet)
  go <- go_sets(org.Hs.eg.db, "ENSEMBL", ontology = "BP")
  go <- as(go, "list")
  # convert GO id to term name
  library(GO.db)
  names(go) <- Term(GOTERM)[names(go)]
  
  ## GSEA
  library(fgsea)
  set.seed(1)
  fgseaRes <- fgsea(go, statistic)
  head(fgseaRes)
  
  ## tokenizing text of GSEA
  # model parameters
  num_tokens <- 5000
  length_seq <- 30
  batch_size <- 64
  embedding_dim <- 128
  num_units <- 32
  epochs <- 20
  # algorithm
  ttgseaRes <- fit_model(fgseaRes, "pathway", "NES",
                         model = bi_lstm(num_tokens, embedding_dim,
                                         length_seq, num_units),
                         num_tokens = num_tokens,
                         length_seq = length_seq,
                         epochs = epochs,
                         batch_size = batch_size,
                         callbacks = keras::callback_early_stopping(
                           monitor = "loss",
                           patience = 5,
                           restore_best_weights = TRUE))
  # prediction
  ttgseaRes$token_pred
  set.seed(1)
  predict_model(ttgseaRes, c("translation response",
                             "cytokine activity",
                             "rhodopsin mediate",
                             "granzyme",
                             "histone deacetylation",
                             "helper T cell",
                             "Wnt"))
}



4 Session information

sessionInfo()
## R version 4.3.1 Patched (2023-06-17 r84564)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] fgsea_1.28.0  ttgsea_1.10.0 keras_2.13.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0     dplyr_1.1.3          tensorflow_2.14.0   
##  [4] fastmap_1.1.1        textshape_1.7.3      digest_0.6.33       
##  [7] lifecycle_1.0.3      ellipsis_0.3.2       koRpus_0.13-8       
## [10] tokenizers_0.3.0     NLP_0.2-1            magrittr_2.0.3      
## [13] compiler_4.3.1       rlang_1.1.1          sass_0.4.7          
## [16] tools_4.3.1          utf8_1.2.4           yaml_2.3.7          
## [19] qdapRegex_0.7.8      data.table_1.14.8    knitr_1.44          
## [22] stopwords_2.3        htmlwidgets_1.6.2    reticulate_1.34.0   
## [25] xml2_1.3.5           textclean_0.9.3      RColorBrewer_1.1-3  
## [28] BiocParallel_1.36.0  purrr_1.0.2          grid_4.3.1          
## [31] fansi_1.0.5          tm_0.7-11            colorspace_2.1-0    
## [34] ggplot2_3.4.4        scales_1.2.1         zeallot_0.1.0       
## [37] cli_3.6.1            rmarkdown_2.25       DiagrammeR_1.0.10   
## [40] crayon_1.5.2         generics_0.1.3       rstudioapi_0.15.0   
## [43] tfruns_1.5.1         visNetwork_2.1.2     cachem_1.0.8        
## [46] sylly.en_0.1-3       parallel_4.3.1       textstem_0.1.4      
## [49] base64enc_0.1-3      vctrs_0.6.4          Matrix_1.6-1.1      
## [52] jsonlite_1.8.7       slam_0.1-50          koRpus.lang.en_0.1-4
## [55] lgr_0.4.4            jquerylib_0.1.4      glue_1.6.2          
## [58] codetools_0.2-19     cowplot_1.1.1        sylly_0.1-6         
## [61] stringi_1.7.12       gtable_0.3.4         munsell_0.5.0       
## [64] mlapi_0.1.1          tibble_3.2.1         pillar_1.9.0        
## [67] htmltools_0.5.6.1    float_0.3-1          rsparse_0.5.1       
## [70] R6_2.5.1             evaluate_0.22        lattice_0.22-5      
## [73] lexicon_1.2.1        png_0.1-8            SnowballC_0.7.1     
## [76] syuzhet_1.0.7        RhpcBLASctl_0.23-42  bslib_0.5.1         
## [79] text2vec_0.6.3       Rcpp_1.0.11          fastmatch_1.1-4     
## [82] whisker_0.4.1        xfun_0.40            pkgconfig_2.0.3


