Functional enrichment analysis methods such as gene set enrichment analysis (GSEA) have been widely used for analyzing gene expression data. GSEA is a powerful method that summarizes gene expression data at the level of gene sets by calculating enrichment scores for predefined sets of genes. GSEA depends on the availability and accuracy of these gene sets. Because multiple terms may describe a single biological process, gene set terms or categories often overlap, which leads to redundancy among the enriched terms. In other words, sets of related terms overlap. Using deep learning, this package aims to predict enrichment scores for the unique tokens or words that appear in the names of gene sets, thereby resolving this overlap issue. Furthermore, we can coin a new term by combining tokens and estimate its enrichment score by predicting the score of the combined tokens.
Text can be seen as sequential data, either as a sequence of characters or as a sequence of words. A recurrent neural network (RNN) is a type of neural network that operates on sequential data. RNNs have been applied to a variety of tasks, including natural language processing tasks such as machine translation. However, plain RNNs suffer from the long-term dependency problem: they struggle to retain information over long periods of time. A Long Short-Term Memory (LSTM) network is a special kind of RNN designed to solve the long-term dependency problem. A bidirectional LSTM consists of two distinct LSTM networks, termed the forward LSTM and the backward LSTM, which process the sequence in opposite directions. The Gated Recurrent Unit (GRU) is a simplified version of the LSTM with fewer parameters, so the total number of parameters can be greatly reduced for a large neural network. LSTM and GRU are known to be successful remedies to the long-term dependency problem. These models take the terms of gene sets as input and enrichment scores as output in order to predict enrichment scores for new terms.
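As a concrete sketch, a bidirectional LSTM regressor of the kind used in the examples below could be built with keras along the following lines. The dimensions match the model summary shown later in this vignette; ttgsea provides ready-made constructors such as bi_lstm() and bi_gru(), whose internals may differ from this illustration.
library(keras)
# sketch: integer token sequence -> embedding -> bidirectional LSTM ->
# a single regression output (the predicted enrichment score)
build_bi_lstm <- function(num_tokens, embedding_dim, length_seq, num_units) {
  inputs <- layer_input(shape = c(length_seq))
  outputs <- inputs %>%
    layer_embedding(input_dim = num_tokens + 1,   # +1 for the padding index
                    output_dim = embedding_dim) %>%
    bidirectional(layer_lstm(units = num_units)) %>%
    layer_dense(units = 1)
  keras_model(inputs, outputs)
}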
Consider a simple example. Once GSEA has been performed, its result is fed into the algorithm to train the deep learning model.
library(ttgsea)
library(fgsea)
data(examplePathways)
data(exampleRanks)
names(examplePathways) <- gsub("_", " ", substr(names(examplePathways), 9, 1000))
set.seed(1)
fgseaRes <- fgseaSimple(examplePathways, exampleRanks, nperm = 10000)
data.table::data.table(fgseaRes[order(fgseaRes$NES, decreasing = TRUE),])
## pathway pval
## 1: Mitotic Prometaphase 0.0001520219
## 2: Resolution of Sister Chromatid Cohesion 0.0001537043
## 3: Cell Cycle, Mitotic 0.0001255020
## 4: RHO GTPases Activate Formins 0.0001534684
## 5: Cell Cycle 0.0001227446
## ---
## 1416: Downregulation of SMAD2 3:SMAD4 transcriptional activity 0.0022655188
## 1417: HATs acetylate histones 0.0002779322
## 1418: TRAF3-dependent IRF activation pathway 0.0010962508
## 1419: Nephrin interactions 0.0013498313
## 1420: Interleukin-6 signaling 0.0004174494
## padj ES NES nMoreExtreme size
## 1: 0.004481064 0.7253270 2.963541 0 82
## 2: 0.004481064 0.7347987 2.954314 0 74
## 3: 0.004481064 0.5594755 2.751403 0 317
## 4: 0.004481064 0.6705979 2.717798 0 78
## 5: 0.004481064 0.5388497 2.688064 0 369
## ---
## 1416: 0.028982313 -0.6457899 -1.984552 9 16
## 1417: 0.006365544 -0.4535612 -1.994238 0 68
## 1418: 0.017020529 -0.7176839 -2.022102 4 12
## 1419: 0.019558780 -0.6880106 -2.025979 5 14
## 1420: 0.008590987 -0.8311374 -2.079276 1 8
## leadingEdge
## 1: 66336,66977,12442,107995,66442,52276,...
## 2: 66336,66977,12442,107995,66442,52276,...
## 3: 66336,66977,12442,107995,66442,12571,...
## 4: 66336,66977,107995,66442,52276,67629,...
## 5: 66336,66977,12442,107995,66442,19361,...
## ---
## 1416: 66313,20482,20481,17127,17128,83814,...
## 1417: 74026,319190,244349,75560,13831,246103,...
## 1418: 56489,12914,54131,54123,56480,217069,...
## 1419: 109711,14360,20742,17973,18708,12488,...
## 1420: 16194,16195,16451,12402,16452,20848
# convert from gene set defined by BiocSet::BiocSet to list
#library(BiocSet)
#genesets <- BiocSet(examplePathways)
#gsc_list <- as(genesets, "list")
# convert from gene set defined by GSEABase::GeneSetCollection to list
#library(GSEABase)
#genesets <- BiocSet(examplePathways)
#gsc <- as(genesets, "GeneSetCollection")
#gsc_list <- list()
#for (i in 1:length(gsc)) {
# gsc_list[[setName(gsc[[i]])]] <- geneIds(gsc[[i]])
#}
#set.seed(1)
#fgseaRes <- fgseaSimple(gsc_list, exampleRanks, nperm = 10000)
Since deep learning architectures cannot process characters or words in their raw form, the text must be converted to numbers before it is used as input. Word embeddings are these numeric representations of text. For tokenization, unigram and bigram sequences are used by default. An integer is assigned to each token, and each term is then converted to a sequence of integers. Sequences longer than the given maximum length are truncated, whereas shorter sequences are padded with zeros. Keras is a higher-level library built on top of TensorFlow, available in R through the keras package. The input to the Keras embedding layer is these integer-encoded tokens. This representation is passed to the embedding layer, which acts as the first hidden layer of the neural network.
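As a toy illustration of this integer encoding and padding, consider the following sketch; it uses the keras tokenizer on unigrams only, whereas ttgsea's own preprocessing also generates bigram tokens.
library(keras)
terms <- c("Cell Cycle", "DNA Replication", "Cell Cycle Mitotic")
# assign an integer to each token and encode each term as an integer sequence
tokenizer <- text_tokenizer(num_words = 1000)
tokenizer <- fit_text_tokenizer(tokenizer, terms)
seqs <- texts_to_sequences(tokenizer, terms)
# truncate longer sequences and zero-pad shorter ones to a fixed length
pad_sequences(seqs, maxlen = 5)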
if (keras::is_keras_available() & reticulate::py_available()) {
# model parameters
num_tokens <- 1000
length_seq <- 30
batch_size <- 32
embedding_dim <- 50
num_units <- 32
epochs <- 10
# algorithm
ttgseaRes <- fit_model(fgseaRes, "pathway", "NES",
model = bi_lstm(num_tokens, embedding_dim,
length_seq, num_units),
num_tokens = num_tokens,
length_seq = length_seq,
epochs = epochs,
batch_size = batch_size,
use_generator = FALSE,
callbacks = keras::callback_early_stopping(
monitor = "loss",
patience = 10,
restore_best_weights = TRUE))
# prediction for every token
ttgseaRes$token_pred
ttgseaRes$token_gsea[["TGF beta"]][,1:5]
}
## Epoch 1/10
## 45/45 [==============================] - 5s 33ms/step - loss: 1.5523 - pearson_correlation: 0.1222
## Epoch 2/10
## 45/45 [==============================] - 1s 32ms/step - loss: 1.3735 - pearson_correlation: 0.3988
## Epoch 3/10
## 45/45 [==============================] - 1s 31ms/step - loss: 1.0450 - pearson_correlation: 0.6163
## Epoch 4/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.8515 - pearson_correlation: 0.6649
## Epoch 5/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.7417 - pearson_correlation: 0.7126
## Epoch 6/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.6707 - pearson_correlation: 0.7451
## Epoch 7/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.6176 - pearson_correlation: 0.7669
## Epoch 8/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.5858 - pearson_correlation: 0.7810
## Epoch 9/10
## 45/45 [==============================] - 1s 32ms/step - loss: 0.5621 - pearson_correlation: 0.7889
## Epoch 10/10
## 45/45 [==============================] - 1s 33ms/step - loss: 0.5416 - pearson_correlation: 0.8006
## pathway
## 614 TGF-beta receptor signaling activates SMADs
## 615 Signaling by TGF-beta Receptor Complex
## 624 Downregulation of TGF-beta receptor signaling
## 1267 TGF-beta receptor signaling in EMT epithelial to mesenchymal transition
## pval padj ES NES
## 614 0.042648445 0.21175102 -0.4407843 -1.551889
## 615 0.004676539 0.04614047 -0.4244875 -1.763638
## 624 0.033777574 0.18377071 -0.4868887 -1.591561
## 1267 0.726840855 0.88913295 -0.2872284 -0.788917
Deep learning models predict only enrichment scores; p-values for these scores are not provided by the model. Therefore, the Monte Carlo p-value method is used within the algorithm. Computing the p-value for a statistical test can be easily accomplished via Monte Carlo simulation. Ordinary Monte Carlo is a simulation technique for approximating the expectation of a function of a random variable when the exact expectation cannot be found analytically. The Monte Carlo p-value method simply simulates many datasets under the null hypothesis, computes the statistic for each simulated dataset, and then computes the percentile rank of the observed value among the simulated values. The number of tokens used in each simulation is the same as the length of the token sequence of the corresponding term. If a new text does not contain any known tokens, its p-value is not available.
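The computation can be sketched generically as follows. This is an illustration of a Monte Carlo p-value, not ttgsea's internal code; in the package, the null statistics are predicted scores for randomly sampled token sequences of the same length as the query term.
# toy example: null distribution of a standardized mean under H0
set.seed(1)
observed <- 2.1
null_stats <- replicate(1000, sqrt(30) * mean(rnorm(30)))
# two-sided Monte Carlo p-value from the percentile rank of the observed value
(sum(abs(null_stats) >= abs(observed)) + 1) / (length(null_stats) + 1)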
if (exists("ttgseaRes")) {
# prediction with MC p-value
set.seed(1)
new_text <- c("Cell Cycle DNA Replication",
"Cell Cycle",
"DNA Replication",
"Cycle Cell",
"Replication DNA",
"TGF-beta receptor")
print(predict_model(ttgseaRes, new_text))
print(predict_model(ttgseaRes, "data science"))
}
## new_text test_value MC_p_value adj_p_value
## 1 Cell Cycle DNA Replication 3.5655158 0.000 0.000
## 2 Cell Cycle 2.0632474 0.006 0.009
## 3 DNA Replication 2.6883779 0.002 0.006
## 4 Cycle Cell 0.6710172 0.256 0.256
## 5 Replication DNA 1.7267684 0.006 0.009
## 6 TGF - beta receptor -1.3684919 0.060 0.072
## new_text test_value MC_p_value adj_p_value
## 1 datum science 0.08944048 NA NA
The architecture of the fitted model can also be visualized as a layer-by-layer summary.
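For instance, the summary below can be produced along these lines, assuming (as in the earlier examples) that the fitted Keras model is available as ttgseaRes$model:
if (exists("ttgseaRes")) {
# layer-by-layer architecture of the fitted bidirectional LSTM
summary(ttgseaRes$model)
}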
## Model: "model"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## input_1 (InputLayer) [(None, 30)] 0
## embedding (Embedding) (None, 30, 50) 50050
## bidirectional (Bidirectional) (None, 64) 21248
## dense (Dense) (None, 1) 65
## ================================================================================
## Total params: 71,363
## Trainable params: 71,363
## Non-trainable params: 0
## ________________________________________________________________________________
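As a sanity check on these counts: the embedding layer contributes (1000 + 1) × 50 = 50,050 parameters (one row per token plus the padding index), the bidirectional LSTM contributes 2 × 4 × (50 + 32 + 1) × 32 = 21,248 (two directions, each with four gates acting on the 50 embedding inputs, the 32 recurrent units, and a bias), and the dense layer contributes 64 + 1 = 65 (the 64 concatenated outputs plus a bias), giving the total of 71,363 trainable parameters.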
Take another example. A set of names of ranked genes can be seen as sequential data. The GSEA result provides the names of the leading edge genes for each gene set; the leading edge subset contains the genes that contribute most to the enrichment score. Thus the scores of one or more genes from the leading edge subset can be predicted.
if (keras::is_keras_available() & reticulate::py_available()) {
# leading edge
LE <- unlist(lapply(fgseaRes$leadingEdge, function(x) gsub(",", "", toString(x))))
fgseaRes <- cbind(fgseaRes, LE)
# model parameters
num_tokens <- 1000
length_seq <- 30
batch_size <- 32
embedding_dim <- 50
num_units <- 32
epochs <- 10
# algorithm
ttgseaRes <- fit_model(fgseaRes, "LE", "NES",
model = bi_lstm(num_tokens, embedding_dim,
length_seq, num_units),
num_tokens = num_tokens,
length_seq = length_seq,
epochs = epochs,
batch_size = batch_size,
verbose = 0,
callbacks = keras::callback_early_stopping(
monitor = "loss",
patience = 5,
restore_best_weights = TRUE))
# prediction for every token
ttgseaRes$token_pred
# prediction with MC p-value
set.seed(1)
new_text <- c("107995 56150", "16952")
predict_model(ttgseaRes, new_text)
}
The “airway” dataset contains four cell lines under two conditions, control and treatment with dexamethasone. Using the package “DESeq2”, differentially expressed genes between control and treated samples are identified from the gene expression data. The log2 fold change is then used as the score for GSEA. For GSEA, GO biological process (BP) gene sets for human are obtained from the package “org.Hs.eg.db” by using the package “BiocSet”. GSEA is performed with the package “fgsea”. Since “fgsea” accepts a list, the gene sets are converted to a list. Finally, the result of GSEA is fitted to a deep learning model, and enrichment scores of new terms can then be predicted.
if (keras::is_keras_available() & reticulate::py_available()) {
## data preparation
library(airway)
data(airway)
## differentially expressed genes
library(DESeq2)
des <- DESeqDataSet(airway, design = ~ dex)
des <- DESeq(des)
res <- results(des)
head(res)
# log2FC used for GSEA
statistic <- res$"log2FoldChange"
names(statistic) <- rownames(res)
statistic <- na.omit(statistic)
head(statistic)
## gene set
library(org.Hs.eg.db)
library(BiocSet)
go <- go_sets(org.Hs.eg.db, "ENSEMBL", ontology = "BP")
go <- as(go, "list")
# convert GO id to term name
library(GO.db)
names(go) <- Term(GOTERM)[names(go)]
## GSEA
library(fgsea)
set.seed(1)
fgseaRes <- fgsea(go, statistic)
head(fgseaRes)
## tokenizing text of GSEA
# model parameters
num_tokens <- 5000
length_seq <- 30
batch_size <- 64
embedding_dim <- 128
num_units <- 32
epochs <- 20
# algorithm
ttgseaRes <- fit_model(fgseaRes, "pathway", "NES",
model = bi_lstm(num_tokens, embedding_dim,
length_seq, num_units),
num_tokens = num_tokens,
length_seq = length_seq,
epochs = epochs,
batch_size = batch_size,
callbacks = keras::callback_early_stopping(
monitor = "loss",
patience = 5,
restore_best_weights = TRUE))
# prediction
ttgseaRes$token_pred
set.seed(1)
predict_model(ttgseaRes, c("translation response",
"cytokine activity",
"rhodopsin mediate",
"granzyme",
"histone deacetylation",
"helper T cell",
"Wnt"))
}
## R version 4.3.1 Patched (2023-06-17 r84564)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] fgsea_1.28.0 ttgsea_1.10.0 keras_2.13.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 dplyr_1.1.3 tensorflow_2.14.0
## [4] fastmap_1.1.1 textshape_1.7.3 digest_0.6.33
## [7] lifecycle_1.0.3 ellipsis_0.3.2 koRpus_0.13-8
## [10] tokenizers_0.3.0 NLP_0.2-1 magrittr_2.0.3
## [13] compiler_4.3.1 rlang_1.1.1 sass_0.4.7
## [16] tools_4.3.1 utf8_1.2.4 yaml_2.3.7
## [19] qdapRegex_0.7.8 data.table_1.14.8 knitr_1.44
## [22] stopwords_2.3 htmlwidgets_1.6.2 reticulate_1.34.0
## [25] xml2_1.3.5 textclean_0.9.3 RColorBrewer_1.1-3
## [28] BiocParallel_1.36.0 purrr_1.0.2 grid_4.3.1
## [31] fansi_1.0.5 tm_0.7-11 colorspace_2.1-0
## [34] ggplot2_3.4.4 scales_1.2.1 zeallot_0.1.0
## [37] cli_3.6.1 rmarkdown_2.25 DiagrammeR_1.0.10
## [40] crayon_1.5.2 generics_0.1.3 rstudioapi_0.15.0
## [43] tfruns_1.5.1 visNetwork_2.1.2 cachem_1.0.8
## [46] sylly.en_0.1-3 parallel_4.3.1 textstem_0.1.4
## [49] base64enc_0.1-3 vctrs_0.6.4 Matrix_1.6-1.1
## [52] jsonlite_1.8.7 slam_0.1-50 koRpus.lang.en_0.1-4
## [55] lgr_0.4.4 jquerylib_0.1.4 glue_1.6.2
## [58] codetools_0.2-19 cowplot_1.1.1 sylly_0.1-6
## [61] stringi_1.7.12 gtable_0.3.4 munsell_0.5.0
## [64] mlapi_0.1.1 tibble_3.2.1 pillar_1.9.0
## [67] htmltools_0.5.6.1 float_0.3-1 rsparse_0.5.1
## [70] R6_2.5.1 evaluate_0.22 lattice_0.22-5
## [73] lexicon_1.2.1 png_0.1-8 SnowballC_0.7.1
## [76] syuzhet_1.0.7 RhpcBLASctl_0.23-42 bslib_0.5.1
## [79] text2vec_0.6.3 Rcpp_1.0.11 fastmatch_1.1-4
## [82] whisker_0.4.1 xfun_0.40 pkgconfig_2.0.3