DeProViR 1.0.0
Emerging infectious diseases, including zoonoses, pose a significant threat to public health and the global economy, as exemplified by the COVID-19 pandemic caused by the zoonotic severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Understanding the protein-protein interactions (PPIs) between host and viral proteins is crucial for identifying targets for antiviral therapies and for comprehending the mechanisms underlying pathogen replication and immune evasion. Experimental techniques such as yeast two-hybrid screening and affinity purification mass spectrometry have provided valuable insights into host-virus interactomes. However, these approaches are limited by experimental noise and cost, resulting in incomplete interaction maps. Computational models based on machine learning have been developed to predict host-virus PPIs using sequence-derived features. Although these models have been successful, they often overlook the semantic information embedded in protein sequences and depend on effective encoding schemes. Here, we introduce DeProViR, a deep learning (DL) framework that predicts interactions between viruses and human hosts using only primary amino acid sequences. DeProViR employs a Siamese-like neural network architecture, incorporating convolutional and bidirectional long short-term memory (Bi-LSTM) networks to capture local and global contextual information. It uses GloVe embeddings to represent amino acid sequences, allowing the integration of semantic associations between residues. The proposed framework addresses limitations of existing models, such as the need for feature engineering and the dependence on the choice of encoding scheme. DeProViR thus offers a promising approach for accurate and efficient prediction of host-virus interactions and can contribute to the development of antiviral therapies and the understanding of infectious diseases.
The DeProViR framework is composed of a two-step automated computational
workflow: (1) learning sequence representations of both host and viral proteins
and (2) inferring host-viral PPIs through a hybrid deep learning architecture.
More specifically, in the first step, host and viral protein sequences are
separately encoded into sequences of tokens via a tokenizer and padded to
the same length of 1000 with a pad token. The 100-dimensional embedding
matrix E is then generated by applying the unsupervised GloVe embedding model
to the host or viral profile representation, learning an implicit yet
low-dimensional vector space from the corpus of tokens. Next, the embedding
layer is fed with sequences of integers, i.e., amino acid token indices,
which are mapped to the corresponding pre-trained vectors in the GloVe
embedding matrix E, turning the tokens into a dense real-valued 3D matrix M.
In the second step, DeProViR uses a Siamese-like neural network architecture
composed of two identical sub-networks with the same configuration and
weights. Each sub-network combines convolutional and recurrent neural networks
(bidirectional LSTM, Bi-LSTM) to accurately capture the local and global
contextual relatedness of amino acids.
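To make the first step concrete, the snippet below is a minimal sketch of character-level tokenization and padding with the keras R interface. The toy sequences and tokenizer settings are illustrative assumptions, not DeProViR's exact preprocessing code.

library(keras)

# Two toy protein fragments (illustrative only)
seqs <- c("MSTNPKPQRKTKRNTNRRPQDVKFPGG",
          "MFVFLVLLPLVSSQCVNLTTRTQLPPA")

# Character-level tokenizer over the amino acid letters
tok <- text_tokenizer(char_level = TRUE) %>%
  fit_text_tokenizer(seqs)

idx <- texts_to_sequences(tok, seqs)        # amino acid token indices
x   <- pad_sequences(idx, maxlen = 1000, padding = "post")  # pad to length 1000
dim(x)                                      # 2 x 1000 integer matrix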
To achieve the best-performing DL architecture, we fine-tuned the hyperparameters for each block on the validation set by random search, using auROC as the performance metric. We determined the number of epochs through an early stopping strategy on the validation set, with the patience threshold set to 3. The optimized DL architecture achieved an auROC of 0.96 using 5-fold cross-validation and 0.90 on the test set. This architecture includes 32 filters (1-D kernel of size 16) in the first CNN layer, which generates a feature map from the input layer (i.e., embedding matrix M) through a convolution operation and a non-linear transformation of its input with the ReLU activation function. Next, the hidden features generated by the first convolutional layer are transformed by the second CNN layer with 64 filters (1-D kernel of size 7) in the same way. After the convolutional layers, a k-max pooling layer performs max pooling, with k set to 30. Subsequently, the flattened pooling output is fed into a bidirectional LSTM of 64 hidden neurons, which connects to a fully connected dense layer of 8 neurons and finally to the output layer with a sigmoid activation function that outputs the predicted probability score.
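For orientation, the following is a minimal sketch of this architecture written with the keras R interface. It is an illustrative approximation, not the package's internal code: the layer choices follow the description above, while details such as how the GloVe matrix initializes the embedding layer and how the two sub-networks are merged are assumptions.

library(keras)

# One shared sub-network: embedding -> two 1-D convolutions -> pooling -> Bi-LSTM
build_subnetwork <- function() {
  input <- layer_input(shape = 1000)          # padded amino acid token indices
  output <- input %>%
    # 20 amino acid tokens + 1 pad token, 100-dimensional vectors;
    # in DeProViR this layer is initialized with the pre-trained GloVe matrix
    layer_embedding(input_dim = 21, output_dim = 100, input_length = 1000) %>%
    layer_conv_1d(filters = 32, kernel_size = 16, activation = "relu") %>%
    layer_conv_1d(filters = 64, kernel_size = 7, activation = "relu") %>%
    layer_max_pooling_1d(pool_size = 30) %>%
    layer_bidirectional(layer_lstm(units = 64))
  keras_model(input, output)
}

host_input  <- layer_input(shape = 1000)
virus_input <- layer_input(shape = 1000)

shared <- build_subnetwork()                  # same configuration and weights
merged <- layer_concatenate(list(shared(host_input), shared(virus_input))) %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")  # predicted interaction probability

model <- keras_model(list(host_input, virus_input), merged)
model %>% compile(optimizer = "adam",
                  loss = "binary_crossentropy",
                  metrics = "AUC")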
The modular structure of this package is designed to give users the flexibility to either use their own training set or load the fine-tuned pre-trained model constructed previously (see the previous section). This dual capability empowers researchers to tailor their model development approach to their specific needs and preferences.
In the first approach, users can use their own training data to train a model tailored to their specific needs and subsequently apply the trained model to make predictions on uncharted interactions. This capability proves particularly valuable when users wish to undertake diverse tasks, such as predicting interactions between hosts and bacterial pathogens, drug-target interactions, or other protein-protein interactions.
Alternatively, the second approach streamlines the process by allowing users to leverage a fine-tuned pre-trained model. This model has been trained on a comprehensive dataset, as detailed in the accompanying paper, achieving an auROC > 0.90 in both cross-validation and the external test set. In this scenario, users simply load the pre-trained model and initiate predictions without the need for additional training. This approach offers the advantage of speed and convenience, since it bypasses the time-consuming training phase. By employing a pre-trained model, users can swiftly obtain predictions and insights, making it a time-efficient option for their research needs.
It is important to note that for the second approach, a random search strategy was employed to carefully tune the hyperparameters of the pre-trained model. This tuning process ensures the best-performing model for the given training set. However, if you intend to alter the training input, we strongly recommend that you exercise caution and take the time to carefully fine-tune the hyperparameters using tfruns to achieve optimal results, as sketched below.
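As a hedged illustration, one way to organize such a search with tfruns is sketched below. Here "train.R" is a hypothetical training script that reads these flags via tfruns::flags() and passes them to the model; the flag names simply mirror the modelTraining arguments.

library(tfruns)

# Randomly sample 30% of the flag grid (random search); each run executes
# the hypothetical training script "train.R" with one flag combination.
runs <- tuning_run(
  "train.R",
  sample = 0.3,
  flags = list(
    filters_layer1CNN     = c(16, 32, 64),
    kernel_size_layer1CNN = c(8, 16),
    layer_lstm            = c(32, 64, 128),
    batch_size            = c(64, 128)
  )
)

# Inspect the runs; the metric column names depend on what train.R logs
# (e.g., validation AUC if the model is compiled with metrics = "AUC")
head(runs)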
The modelTraining
function included in this package allows users to train the model on their
own training dataset. It begins by converting protein sequences into amino
acid tokens, where tokens are mapped to positive integers. Next, it represents
each amino acid token using pre-trained co-occurrence embedding vectors
acquired from GloVe. Following this, it uses an embedding layer to convert
sequences of amino acid token indices into dense vectors based on the GloVe
token vectors. Finally, it leverages a Siamese-like neural network architecture
for model training, employing a k-fold cross-validation strategy. Please ensure
that any newly imported training set adheres to the format of the sample
training set stored in the inst/extdata/training_Set directory of
the DeProViR package.
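For reference, the bundled sample training set (the format template) can be located and inspected as follows; the file names returned depend on the installed package version.

library(data.table)

# Locate the sample training set shipped with DeProViR
ts_dir <- system.file("extdata", "training_Set", package = "DeProViR")
list.files(ts_dir)

# Preview one of the files to see the expected columns
# (replace the index with a file name returned by list.files above)
# head(fread(file.path(ts_dir, list.files(ts_dir)[1])))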
The modelTraining
function takes the following parameters:
url_path
URL path to the GloVe embedding. Defaults to
"https://nlp.stanford.edu/data". See \code{\link[DeProViR]{gloveImport}}.
training_dir
Directory containing the viral-host training set. Defaults to
inst/extdata/training_Set. See \code{\link[DeProViR]{loadTrainingSet}}.
input_dim
Integer. Size of the vocabulary, i.e., the number of amino acid tokens.
Defaults to 20. See \code{keras}.
output_dim
Integer. Dimension of the dense embedding, i.e., the GloVe vector size.
Defaults to 100. See \code{keras}.
filters_layer1CNN
Integer, the dimensionality of the output space (i.e., the number of
output filters in the first convolution). Defaults to 32. See \code{keras}.
kernel_size_layer1CNN
Integer specifying the length of the 1-D convolution window in the first
layer. Defaults to 16. See \code{keras}.
filters_layer2CNN
Integer, the dimensionality of the output space (i.e., the number of
output filters in the second convolution). Defaults to 64. See \code{keras}.
kernel_size_layer2CNN
Integer specifying the length of the 1-D convolution window in the second
layer. Defaults to 7. See \code{keras}.
pool_size
Down-samples the input representation by taking the maximum value over a
window of size pool_size. Defaults to 30. See \code{keras}.
layer_lstm
Number of units in the Bi-LSTM layer. Defaults to 64. See \code{keras}.
units
Number of units in the dense (MLP) layer. Defaults to 8. See \code{keras}.
metrics
Vector of metric names to be evaluated by the model during training and
testing. Defaults to "AUC". See \code{keras}.
cv_fold
Number of partitions for cross-validation. Defaults to 10.
epochs
Number of epochs to train the model. Defaults to 100. See \code{keras}.
batch_size
Number of samples per gradient update. Defaults to 128. See \code{keras}.
plots
If TRUE, generates a PDF file containing performance plots of the
predictive learning algorithm achieved via cross-validation. Defaults to
TRUE. See \code{\link[DeProViR]{ModelPerformance_evalPlots}}.
tpath
A character string indicating the path to the project directory. If the
directory is missing, the PDF file containing performance measures will be
stored in the temp directory.
See \code{\link[DeProViR]{ModelPerformance_evalPlots}}.
save_model_weights
If TRUE, the trained model weights are saved. Defaults to TRUE.
See \code{keras}.
filepath
A character string indicating the path where the model weights are saved
after training. Defaults to tempdir(). See \code{keras}.
To run modelTraining, we can use the following commands:
options(timeout=240)
library(tensorflow)
library(data.table)
library(DeProViR)
tensorflow::set_random_seed(101)
model_training <- modelTraining(
url_path = "https://nlp.stanford.edu/data",
training_dir = system.file("extdata", "training_Set",
package = "DeProViR"),
input_dim = 20,
output_dim = 100,
filters_layer1CNN = 32,
kernel_size_layer1CNN = 16,
filters_layer2CNN = 64,
kernel_size_layer2CNN = 7,
pool_size = 30,
layer_lstm = 64,
units = 8,
metrics = "AUC",
cv_fold = 2,
epochs = 5, # for the sake of this example
batch_size = 128,
plots = FALSE,
tpath = tempdir(),
save_model_weights = FALSE,
filepath = tempdir())
## .Epoch 1/5
## 2/2 - 7s - loss: 0.7638 - auc: 0.5040 - 7s/epoch - 4s/step
## Epoch 2/5
## 2/2 - 0s - loss: 0.5202 - auc: 0.4928 - 352ms/epoch - 176ms/step
## Epoch 3/5
## 2/2 - 0s - loss: 0.3505 - auc: 0.5018 - 310ms/epoch - 155ms/step
## Epoch 4/5
## 2/2 - 0s - loss: 0.2660 - auc: 0.5568 - 307ms/epoch - 154ms/step
## Epoch 5/5
## 2/2 - 0s - loss: 0.2533 - auc: 0.5473 - 311ms/epoch - 155ms/step
## 8/8 - 2s - 2s/epoch - 199ms/step
## .Epoch 1/5
## 2/2 - 0s - loss: 0.3229 - auc: 0.5350 - 287ms/epoch - 143ms/step
## Epoch 2/5
## 2/2 - 0s - loss: 0.3245 - auc: 0.5466 - 289ms/epoch - 144ms/step
## Epoch 3/5
## 2/2 - 0s - loss: 0.3135 - auc: 0.5642 - 286ms/epoch - 143ms/step
## Epoch 4/5
## 2/2 - 0s - loss: 0.2951 - auc: 0.6449 - 266ms/epoch - 133ms/step
## Epoch 5/5
## 2/2 - 0s - loss: 0.2916 - auc: 0.5947 - 248ms/epoch - 124ms/step
## 8/8 - 0s - 191ms/epoch - 24ms/step
When the plots argument is set to TRUE, the modelTraining function generates
a PDF file containing three figures, shown below, indicating the performance
of the DL model using k-fold cross-validation.
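Assuming plots = TRUE and tpath = tempdir() in the call above, the generated performance report can be located as follows (the exact file name is set by ModelPerformance_evalPlots):

list.files(tempdir(), pattern = "\\.pdf$", full.names = TRUE)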
For the second approach, users can employ the loadPreTrainedModel
function to load the fine-tuned pre-trained model for predictive purposes.
options(timeout=240)
library(tensorflow)
library(data.table)
library(DeProViR)
pre_trainedmodel <-
loadPreTrainedModel()
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DeProViR_1.0.0 keras_2.15.0 data.table_1.15.4 tensorflow_2.16.0
## [5] knitr_1.46 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.2 pROC_1.18.5 rlang_1.1.3
## [4] magrittr_2.0.3 compiler_4.4.0 RSQLite_2.3.6
## [7] png_0.1-8 vctrs_0.6.5 reshape2_1.4.4
## [10] stringr_1.5.1 crayon_1.5.2 pkgconfig_2.0.3
## [13] fastmap_1.1.1 dbplyr_2.5.0 PRROC_1.3.1
## [16] utf8_1.2.4 rmarkdown_2.26 prodlim_2023.08.28
## [19] tzdb_0.4.0 purrr_1.0.2 bit_4.0.5
## [22] xfun_0.43 cachem_1.0.8 jsonlite_1.8.8
## [25] recipes_1.0.10 blob_1.2.4 parallel_4.4.0
## [28] R6_2.5.1 bslib_0.7.0 stringi_1.8.3
## [31] reticulate_1.36.1 parallelly_1.37.1 rpart_4.1.23
## [34] lubridate_1.9.3 jquerylib_0.1.4 Rcpp_1.0.12
## [37] bookdown_0.39 iterators_1.0.14 future.apply_1.11.2
## [40] base64enc_0.1-3 readr_2.1.5 Matrix_1.7-0
## [43] splines_4.4.0 nnet_7.3-19 timechange_0.3.0
## [46] tidyselect_1.2.1 yaml_2.3.8 timeDate_4032.109
## [49] codetools_0.2-20 curl_5.2.1 listenv_0.9.1
## [52] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
## [55] withr_3.0.0 evaluate_0.23 archive_1.1.8
## [58] future_1.33.2 survival_3.6-4 BiocFileCache_2.12.0
## [61] pillar_1.9.0 BiocManager_1.30.22 filelock_1.0.3
## [64] whisker_0.4.1 foreach_1.5.2 stats4_4.4.0
## [67] generics_0.1.3 vroom_1.6.5 hms_1.1.3
## [70] ggplot2_3.5.1 munsell_0.5.1 scales_1.3.0
## [73] globals_0.16.3 class_7.3-22 glue_1.7.0
## [76] tools_4.4.0 ModelMetrics_1.2.2.2 gower_1.0.1
## [79] fmsb_0.7.6 grid_4.4.0 ipred_0.9-14
## [82] colorspace_2.1-0 nlme_3.1-164 cli_3.6.2
## [85] tfruns_1.5.3 fansi_1.0.6 lava_1.8.0
## [88] dplyr_1.1.4 gtable_0.3.5 zeallot_0.1.0
## [91] sass_0.4.9 digest_0.6.35 caret_6.0-94
## [94] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4
## [97] hardhat_1.3.1 httr_1.4.7 bit64_4.0.5
## [100] MASS_7.3-60.2