1 Introduction

Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and materials science. Machine learning has enabled us to generate useful protein sequences on a variety of scales. Generative models are machine learning methods that seek to model the distribution underlying the data, allowing for the generation of novel samples with properties similar to those on which the model was trained. Generative models of proteins can learn biologically meaningful representations helpful for a variety of downstream tasks. Furthermore, they can learn to generate protein sequences that have not been observed before and to assign higher probability to protein sequences that satisfy desired criteria. This package provides common deep generative models for protein sequences, such as the variational autoencoder (VAE), the generative adversarial network (GAN), and autoregressive models. In the VAE and GAN, Word2vec is used for embedding. The Transformer encoder is applied to protein sequences for the autoregressive model.

The first step in molecular machine learning is to convert the molecular data into a numerical format suitable for the machine learning models. An example of a raw representation is the sparse or one-hot encoding, in which each amino acid is encoded as a vector with a single bit set according to its index in the set of known amino acids. One-hot encoding is the naive way of representing a word in vector form, but it is very inefficient for a large corpus. A more effective representation places semantically similar words at nearby points, so that the vector carries information about a word's actual meaning; models that learn such representations are called word embedding models. In natural language processing, large one-hot encoded vectors are often transformed into vectors of smaller dimension by finding word relations from the contexts of nearby neighbors, that is, from the proximity or positions of similar words. Word2vec is one of the most common word embedding models; it produces a low-dimensional dense vector (word embedding) for each term (word or phrase). Word2vec has also been applied to amino acids as a basis for traditional machine learning methods.
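
As a concrete illustration, the following minimal sketch one-hot encodes a toy tripeptide over the 20 standard amino acids; the alphabet and peptide are illustrative examples, not data from this package.

aa_alphabet <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
peptide <- strsplit("MKT", "")[[1]]
# one 20-dimensional sparse vector per residue (a 20 x 3 matrix)
one_hot <- sapply(peptide, function(aa) as.integer(aa_alphabet == aa))
dim(one_hot)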

The generative adversarial network (GAN) is able to mimic a distribution of data in domains including images, music, speech, and prose. The GAN is an example of a network that uses unsupervised learning to train two models in parallel. The network is forced to represent the training data efficiently, making it more effective at generating data similar to the training data. The GAN is made up of a discriminator and a generator that compete in a two-player minimax game. The generator produces a simulated example from an input drawn from a specified distribution. The objective of the generator is to produce an output so close to real data that the discriminator cannot differentiate the fake data from the real data. Both the generated examples and authentic ones are fed to the discriminator network, whose job is to distinguish between the fake and the real data. Generative adversarial nets can be extended to a conditional model if both the generator and the discriminator are conditioned on some extra information, such as class labels or data from other views. The conditional GAN (CGAN) is based on the vanilla GAN with additional conditional input to the generator and discriminator, and features are extracted based on the modeling of the conditional input. The auxiliary classifier GAN (ACGAN) is an extension of the CGAN that adds the conditional input only to the generator.
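
As a minimal sketch of the two competing networks, the keras code below defines a toy generator and discriminator with illustrative layer sizes and a 4-dimensional latent noise vector; it is not the architecture built by "fit_GAN".

library(keras)
# toy generator: maps 4-dimensional noise to a 10-dimensional "data" vector
generator <- keras_model_sequential() %>%
    layer_dense(units = 16, activation = "relu", input_shape = 4) %>%
    layer_dense(units = 10, activation = "tanh")
# toy discriminator: scores a 10-dimensional vector as real (1) or fake (0)
discriminator <- keras_model_sequential() %>%
    layer_dense(units = 16, activation = "relu", input_shape = 10) %>%
    layer_dense(units = 1, activation = "sigmoid")
discriminator %>% compile(optimizer = "adam", loss = "binary_crossentropy")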

The autoencoder is an unsupervised neural network approach for learning how to encode data efficiently in a latent space. In the autoencoder, an encoder maps data from the input space to the latent space, and a decoder is used to reconstruct the input data from the encoded latent data. The variational autoencoder (VAE) is a class of autoencoder in which the encoder module is used to learn the parameters of a distribution and the decoder is used to generate examples from samples drawn from the learned distribution. The VAE does not allow us to constrain the generated sample to have a particular characteristic, but one may want to draw samples with a desired feature. A question then arises as to how to endow the model to create targeted samples rather than completely random ones. The conditional variational autoencoder (CVAE) is designed to generate desired samples by including additional conditioning information.
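
The following minimal sketch shows, under an assumed 2-dimensional latent space with made-up encoder outputs, how a latent sample is drawn from the learned distribution (the reparameterization step) before being passed to the decoder.

set.seed(1)
z_mean <- c(0.1, -0.3)        # mean learned by the encoder (illustrative values)
z_log_var <- c(-1.0, -0.5)    # log-variance learned by the encoder (illustrative values)
epsilon <- rnorm(2)           # standard normal noise
z <- z_mean + exp(0.5 * z_log_var) * epsilon   # latent sample fed to the decoder
z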

Language models learn the syntactic order of a language from a given set of examples as probabilities used to predict a sequence of words. A language model is an indispensable ingredient in many advanced natural language processing (NLP) tasks such as text summarization, machine translation, and language generation. Language modeling is a typical self-supervised objective, since it does not require any labels. Language models developed for NLP tasks have evolved from statistical language models to neural language models. Approaches such as n-gram and hidden Markov models have been the primary ingredients of statistical language models, whereas neural language models are built with various deep learning neural networks. In autoregressive (AR) language modeling, models are tasked with generating subsequent tokens based on previously generated tokens. Thus the autoregressive generative model predicts the next amino acid in a protein given the amino acid sequence up to that point. The autoregressive model generates proteins one amino acid at a time. For one step of generation, it takes a context sequence of amino acids as input and outputs a probability distribution over amino acids. We sample from that distribution and then update the context sequence with the sampled amino acid. This process repeats until a protein of the desired length has been generated.
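
The loop below is a minimal sketch of this sampling process; the hypothetical "next_aa_probs" function stands in for a trained model and simply returns random probabilities over the 20 standard amino acids.

set.seed(1)
aa_alphabet <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
# placeholder for the trained model: returns a probability for each amino acid
next_aa_probs <- function(context) {
    p <- runif(length(aa_alphabet))
    p / sum(p)
}
context <- "M"                              # starting residue
for (i in 1:9) {                            # generate 9 more residues
    p <- next_aa_probs(context)
    context <- c(context, sample(aa_alphabet, 1, prob = p))
}
paste(context, collapse = "")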

The Transformer architecture demonstrates impressive text generation capabilities. This perspective is adapted to protein engineering by training the autoregressive language model with the Transformer encoder on amino acid sequences. Compared to conventional sequential models such as the RNN, the Transformer is a newer architecture that is more effective in modeling long-term dependencies in temporal sequences, since all tokens are considered equally during the attention operation. It is also more efficient to train, because it eliminates sequential dependence on previous tokens. The Transformer models can generally overcome the inherent limitations of classic neural network architectures. For example, they overcome the problem of speed inherent in the RNN, LSTM, or GRU, which require sequential operations that are slow in nature. They also overcome the limitation of the CNN, which cannot accurately handle long-range dependencies in a text corpus. Unlike the RNN, the Transformer cannot inherently consider the order of the input data, so it uses not only token embeddings but also positional embeddings. Since the Transformer has a very powerful ability to model sequential data, it has become the most popular backbone of NLP applications. The Transformer architecture is a nonrecurrent architecture with a series of attention-based blocks. Each block is composed of a multi-head attention layer and a position-wise feedforward layer with an add and normalize layer in between. These layers process input sequences simultaneously and in parallel, independently of sequential order.
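
As a rough sketch of one such block, the keras code below wires a self-attention layer and a position-wise feedforward layer together with add-and-normalize steps. It assumes the installed keras R package provides layer_multi_head_attention() (keras >= 2.6) and uses illustrative dimensions rather than the architecture built by "fit_ART".

library(keras)
embedding_dim <- 16
inputs <- layer_input(shape = c(10, embedding_dim))      # 10 positions
# multi-head self-attention (query and value are both the input sequence)
attn <- layer_multi_head_attention(list(inputs, inputs),
                                   num_heads = 2, key_dim = embedding_dim)
out1 <- layer_layer_normalization(layer_add(list(inputs, attn)))   # add & normalize
# position-wise feedforward layer
ffn <- out1 %>%
    layer_dense(units = 16, activation = "relu") %>%
    layer_dense(units = embedding_dim)
out2 <- layer_layer_normalization(layer_add(list(out1, ffn)))      # add & normalize
block <- keras_model(inputs, out2)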



2 Example

2.1 GAN

The sequences of PTEN are used to train the GAN. The input sequences must be aligned so that they have the same length. To train the model, we can use the function “fit_GAN” with the aligned sequence data. To generate sequences, we can use the function “gen_GAN” with the trained model. It is expected that the model can rapidly generate highly diverse novel functional proteins within the allowed biological constraints of the sequence space. Note that the same dataset is used for training and validation here.

if (keras::is_keras_available() & reticulate::py_available()) {
    library(GenProSeq)
    data("example_PTEN")
    
    # model parameters
    length_seq <- 403
    embedding_dim <- 8
    latent_dim <- 4
    epochs <- 20
    batch_size <- 64
    
    # GAN
    GAN_result <- fit_GAN(prot_seq = example_PTEN,
                        length_seq = length_seq,
                        embedding_dim = embedding_dim,
                        latent_dim = latent_dim,
                        intermediate_generator_layers = list(
                            layer_dense(units = 16),
                            layer_dense(units = 128)),
                        intermediate_discriminator_layers = list(
                            layer_dense(units = 128, activation = "relu"),
                            layer_dense(units = 16, activation = "relu")),
                        prot_seq_val = example_PTEN,
                        epochs = epochs,
                        batch_size = batch_size)
}
## Loading required package: keras
## Loading required package: mclust
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
## pre-processing...
## at least one of protein sequences may not be valid
## at least one of protein sequences may not be valid
## training...
## Epoch 1/20 
## 
1/5 [=====>........................] - ETA: 5s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 2s 126ms/step
## generator (train) : loss 0.653551 
## generator (test) : loss 0.488780 
## discriminator (train) : loss 0.448236 
## discriminator (test) : loss 0.502688 
## 
## Epoch 2/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 125ms/step
## generator (train) : loss 0.581863 
## generator (test) : loss 0.493665 
## discriminator (train) : loss 0.512578 
## discriminator (test) : loss 0.484252 
## 
## Epoch 3/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 117ms/step
## generator (train) : loss 0.694403 
## generator (test) : loss 0.673079 
## discriminator (train) : loss 0.442630 
## discriminator (test) : loss 0.370503 
## 
## Epoch 4/20 
## 
1/5 [=====>........................] - ETA: 1s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 120ms/step
## generator (train) : loss 0.877796 
## generator (test) : loss 0.850033 
## discriminator (train) : loss 0.338323 
## discriminator (test) : loss 0.297111 
## 
## Epoch 5/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 115ms/step
## generator (train) : loss 1.091484 
## generator (test) : loss 0.994955 
## discriminator (train) : loss 0.265729 
## discriminator (test) : loss 0.250368 
## 
## Epoch 6/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 177ms/step
## generator (train) : loss 1.214748 
## generator (test) : loss 1.060876 
## discriminator (train) : loss 0.241903 
## discriminator (test) : loss 0.236725 
## 
## Epoch 7/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 196ms/step
## generator (train) : loss 1.381744 
## generator (test) : loss 1.152964 
## discriminator (train) : loss 0.219514 
## discriminator (test) : loss 0.197336 
## 
## Epoch 8/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 124ms/step
## generator (train) : loss 1.457041 
## generator (test) : loss 1.294999 
## discriminator (train) : loss 0.187199 
## discriminator (test) : loss 0.170960 
## 
## Epoch 9/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 127ms/step
## generator (train) : loss 1.649266 
## generator (test) : loss 1.588225 
## discriminator (train) : loss 0.156309 
## discriminator (test) : loss 0.130028 
## 
## Epoch 10/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 185ms/step
## generator (train) : loss 2.025208 
## generator (test) : loss 1.793276 
## discriminator (train) : loss 0.118015 
## discriminator (test) : loss 0.104245 
## 
## Epoch 11/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 126ms/step
## generator (train) : loss 2.150779 
## generator (test) : loss 1.828507 
## discriminator (train) : loss 0.105818 
## discriminator (test) : loss 0.102226 
## 
## Epoch 12/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 1s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 189ms/step
## generator (train) : loss 2.182957 
## generator (test) : loss 1.750320 
## discriminator (train) : loss 0.098217 
## discriminator (test) : loss 0.108683 
## 
## Epoch 13/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 134ms/step
## generator (train) : loss 2.254692 
## generator (test) : loss 1.909065 
## discriminator (train) : loss 0.103280 
## discriminator (test) : loss 0.091943 
## 
## Epoch 14/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 155ms/step
## generator (train) : loss 2.800352 
## generator (test) : loss 2.418029 
## discriminator (train) : loss 0.079559 
## discriminator (test) : loss 0.055859 
## 
## Epoch 15/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 172ms/step
## generator (train) : loss 3.439265 
## generator (test) : loss 2.681406 
## discriminator (train) : loss 0.050705 
## discriminator (test) : loss 0.047287 
## 
## Epoch 16/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 188ms/step
## generator (train) : loss 3.518890 
## generator (test) : loss 2.226120 
## discriminator (train) : loss 0.052609 
## discriminator (test) : loss 0.078491 
## 
## Epoch 17/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 185ms/step
## generator (train) : loss 3.175209 
## generator (test) : loss 2.127767 
## discriminator (train) : loss 0.082337 
## discriminator (test) : loss 0.078460 
## 
## Epoch 18/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 203ms/step
## generator (train) : loss 3.268610 
## generator (test) : loss 2.359840 
## discriminator (train) : loss 0.069393 
## discriminator (test) : loss 0.089111 
## 
## Epoch 19/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 0s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 317ms/step
## generator (train) : loss 3.505233 
## generator (test) : loss 2.652624 
## discriminator (train) : loss 0.071827 
## discriminator (test) : loss 0.050651 
## 
## Epoch 20/20 
## 
1/5 [=====>........................] - ETA: 0s
2/5 [===========>..................] - ETA: 1s
3/5 [=================>............] - ETA: 0s
4/5 [=======================>......] - ETA: 0s
5/5 [==============================] - 1s 261ms/step
## generator (train) : loss 3.493079 
## generator (test) : loss 2.565350 
## discriminator (train) : loss 0.049255 
## discriminator (test) : loss 0.089612

The model architecture of the generator is shown below.

if (keras::is_keras_available() & reticulate::py_available()) {
    ttgsea::plot_model(GAN_result$generator)
}

The model architecture of the discriminator is shown below.

if (keras::is_keras_available() & reticulate::py_available()) {
    ttgsea::plot_model(GAN_result$discriminator)
}

A sequence logo graphically shows the probability of occurrence of each symbol at specific positions; the height of a symbol indicates the relative frequency of the symbol at that position. The sequence logo of the first 20 amino acids of the generated protein sequences is shown below.

if (keras::is_keras_available() & reticulate::py_available()) {
    set.seed(1)
    gen_prot_GAN <- gen_GAN(GAN_result, num_seq = 100)
    ggseqlogo::ggseqlogo(substr(gen_prot_GAN$gen_seq, 1, 20))
}
## generating...
## post-processing...

The sequence logo is a visualization approach to graphically represent sequence conservation. The sequence logo of the first 20 amino acids of the real protein sequences is shown below; here, the first amino acid is assumed to be a conserved position.

if (keras::is_keras_available() & reticulate::py_available()) {
    ggseqlogo::ggseqlogo(substr(example_PTEN, 1, 20))
}



2.2 VAE

Consider the aligned sequences of luxA for training the CVAE with labels. Suppose that the label is the third amino acid of each sequence; thus there are two class labels. Using the function “fit_VAE”, we build an encoder model that takes a protein sequence and projects it onto the latent space, and a decoder model that goes from the latent space back to the amino acid representation. Then, the function “gen_VAE” generates sequences with the desired labels. Note that the same dataset is used for training and validation.

if (keras::is_keras_available() & reticulate::py_available()) {
    library(GenProSeq)
    data("example_luxA")
    label <- substr(example_luxA, 3, 3)
    
    # model parameters
    length_seq <- 360
    embedding_dim <- 8
    batch_size <- 128
    epochs <- 20
    
    # CVAE
    VAE_result <- fit_VAE(prot_seq = example_luxA,
                        label = label,
                        length_seq = length_seq,
                        embedding_dim = embedding_dim,
                        embedding_args = list(iter = 20),
                        intermediate_encoder_layers = list(layer_dense(units = 128),
                                                            layer_dense(units = 16)),
                        intermediate_decoder_layers = list(layer_dense(units = 16),
                                                            layer_dense(units = 128)),
                        prot_seq_val = example_luxA,
                        label_val = label,
                        epochs = epochs,
                        batch_size = batch_size,
                        use_generator = FALSE,
                        optimizer = keras::optimizer_adam(clipnorm = 0.1),
                        callbacks = keras::callback_early_stopping(
                            monitor = "val_loss",
                            patience = 10,
                            restore_best_weights = TRUE))
}
## pre-processing...
## training...
## Train on 2283 samples, validate on 2283 samples
## Epoch 1/20
## 
 128/2283 [>.............................] - ETA: 11s - loss: 2944.8125
 256/2283 [==>...........................] - ETA: 5s - loss: 3365.6166 
 512/2283 [=====>........................] - ETA: 2s - loss: 3151.8544
 768/2283 [=========>....................] - ETA: 1s - loss: 2952.5732
1024/2283 [============>.................] - ETA: 1s - loss: 2711.8234
1408/2283 [=================>............] - ETA: 0s - loss: 2410.1599
1920/2283 [========================>.....] - ETA: 0s - loss: 2167.0106
2283/2283 [==============================] - 2s 854us/sample - loss: 2056.7743 - val_loss: 1401.9514
## Epoch 2/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 1340.0389
 640/2283 [=======>......................] - ETA: 0s - loss: 1314.4655
1024/2283 [============>.................] - ETA: 0s - loss: 1301.0657
1408/2283 [=================>............] - ETA: 0s - loss: 1277.0421
1920/2283 [========================>.....] - ETA: 0s - loss: 1253.2888
2283/2283 [==============================] - 0s 193us/sample - loss: 1240.3888 - val_loss: 1139.4235
## Epoch 3/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 1174.5713
 640/2283 [=======>......................] - ETA: 0s - loss: 1120.6635
1152/2283 [==============>...............] - ETA: 0s - loss: 1118.5337
1664/2283 [====================>.........] - ETA: 0s - loss: 1130.9497
2176/2283 [===========================>..] - ETA: 0s - loss: 1127.3133
2283/2283 [==============================] - 0s 168us/sample - loss: 1124.6103 - val_loss: 1078.2798
## Epoch 4/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 1081.1053
 640/2283 [=======>......................] - ETA: 0s - loss: 1102.0820
1152/2283 [==============>...............] - ETA: 0s - loss: 1088.5882
1664/2283 [====================>.........] - ETA: 0s - loss: 1134.2393
2176/2283 [===========================>..] - ETA: 0s - loss: 1125.1715
2283/2283 [==============================] - 0s 171us/sample - loss: 1124.2168 - val_loss: 1110.8737
## Epoch 5/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 1120.0709
 768/2283 [=========>....................] - ETA: 0s - loss: 1122.2733
1280/2283 [===============>..............] - ETA: 0s - loss: 1093.0923
1792/2283 [======================>.......] - ETA: 0s - loss: 1088.9455
2283/2283 [==============================] - ETA: 0s - loss: 1085.5296
2283/2283 [==============================] - 0s 164us/sample - loss: 1085.5296 - val_loss: 1052.1776
## Epoch 6/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 1102.2684
 640/2283 [=======>......................] - ETA: 0s - loss: 1082.8192
1152/2283 [==============>...............] - ETA: 0s - loss: 1078.3766
1664/2283 [====================>.........] - ETA: 0s - loss: 1069.6396
2176/2283 [===========================>..] - ETA: 0s - loss: 1067.5581
2283/2283 [==============================] - 0s 175us/sample - loss: 1070.6305 - val_loss: 1138.5284
## Epoch 7/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 1199.3724
 640/2283 [=======>......................] - ETA: 0s - loss: 1140.7950
1152/2283 [==============>...............] - ETA: 0s - loss: 1141.7722
1536/2283 [===================>..........] - ETA: 0s - loss: 1151.1373
2048/2283 [=========================>....] - ETA: 0s - loss: 1125.3036
2283/2283 [==============================] - 0s 185us/sample - loss: 1120.2692 - val_loss: 1083.0261
## Epoch 8/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 1029.4960
 640/2283 [=======>......................] - ETA: 0s - loss: 1031.0318
1152/2283 [==============>...............] - ETA: 0s - loss: 1036.0177
1536/2283 [===================>..........] - ETA: 0s - loss: 1034.3129
1792/2283 [======================>.......] - ETA: 0s - loss: 1029.9614
2283/2283 [==============================] - ETA: 0s - loss: 1033.7944
2283/2283 [==============================] - 0s 203us/sample - loss: 1033.7944 - val_loss: 999.0151
## Epoch 9/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 932.5660
 640/2283 [=======>......................] - ETA: 0s - loss: 1067.2371
1152/2283 [==============>...............] - ETA: 0s - loss: 1073.0478
1664/2283 [====================>.........] - ETA: 0s - loss: 1052.6579
2176/2283 [===========================>..] - ETA: 0s - loss: 1043.8867
2283/2283 [==============================] - 0s 177us/sample - loss: 1040.1666 - val_loss: 1006.2224
## Epoch 10/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 1016.9093
 640/2283 [=======>......................] - ETA: 0s - loss: 1007.4061
1152/2283 [==============>...............] - ETA: 0s - loss: 1005.4475
1664/2283 [====================>.........] - ETA: 0s - loss: 1002.0151
2176/2283 [===========================>..] - ETA: 0s - loss: 1003.4408
2283/2283 [==============================] - 0s 167us/sample - loss: 1004.2828 - val_loss: 1064.3781
## Epoch 11/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 1102.4353
 640/2283 [=======>......................] - ETA: 0s - loss: 1074.3068
1152/2283 [==============>...............] - ETA: 0s - loss: 1047.4878
1664/2283 [====================>.........] - ETA: 0s - loss: 1032.4947
2283/2283 [==============================] - ETA: 0s - loss: 1036.7877
2283/2283 [==============================] - 0s 156us/sample - loss: 1036.7877 - val_loss: 1030.6162
## Epoch 12/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 1012.4503
 640/2283 [=======>......................] - ETA: 0s - loss: 1048.8463
1152/2283 [==============>...............] - ETA: 0s - loss: 1035.5734
1664/2283 [====================>.........] - ETA: 0s - loss: 1031.4187
2176/2283 [===========================>..] - ETA: 0s - loss: 1026.0377
2283/2283 [==============================] - 0s 175us/sample - loss: 1024.4923 - val_loss: 1013.0031
## Epoch 13/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 989.6087
 512/2283 [=====>........................] - ETA: 0s - loss: 1006.1228
 768/2283 [=========>....................] - ETA: 0s - loss: 985.0687 
1152/2283 [==============>...............] - ETA: 0s - loss: 989.8110
1536/2283 [===================>..........] - ETA: 0s - loss: 997.7690
2048/2283 [=========================>....] - ETA: 0s - loss: 999.0004
2283/2283 [==============================] - 0s 205us/sample - loss: 992.8652 - val_loss: 979.1031
## Epoch 14/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 950.4587
 640/2283 [=======>......................] - ETA: 0s - loss: 985.7572
1280/2283 [===============>..............] - ETA: 0s - loss: 976.4915
1792/2283 [======================>.......] - ETA: 0s - loss: 972.3396
2283/2283 [==============================] - ETA: 0s - loss: 972.6481
2283/2283 [==============================] - 0s 161us/sample - loss: 972.6481 - val_loss: 963.7095
## Epoch 15/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 1011.2346
 640/2283 [=======>......................] - ETA: 0s - loss: 970.5861 
1152/2283 [==============>...............] - ETA: 0s - loss: 959.2633
1664/2283 [====================>.........] - ETA: 0s - loss: 953.4568
2176/2283 [===========================>..] - ETA: 0s - loss: 952.1132
2283/2283 [==============================] - 0s 185us/sample - loss: 956.4455 - val_loss: 946.2434
## Epoch 16/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 989.6850
 640/2283 [=======>......................] - ETA: 0s - loss: 948.0811
1152/2283 [==============>...............] - ETA: 0s - loss: 960.8510
1664/2283 [====================>.........] - ETA: 0s - loss: 958.2889
2048/2283 [=========================>....] - ETA: 0s - loss: 951.5613
2283/2283 [==============================] - 0s 179us/sample - loss: 947.5279 - val_loss: 940.7115
## Epoch 17/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 922.0276
 640/2283 [=======>......................] - ETA: 0s - loss: 927.9200
1024/2283 [============>.................] - ETA: 0s - loss: 945.4638
1408/2283 [=================>............] - ETA: 0s - loss: 943.7610
1920/2283 [========================>.....] - ETA: 0s - loss: 943.1145
2283/2283 [==============================] - 0s 198us/sample - loss: 937.7190 - val_loss: 931.6530
## Epoch 18/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 926.1196
 512/2283 [=====>........................] - ETA: 0s - loss: 919.1359
1024/2283 [============>.................] - ETA: 0s - loss: 924.9029
1408/2283 [=================>............] - ETA: 0s - loss: 929.0735
1920/2283 [========================>.....] - ETA: 0s - loss: 938.1148
2283/2283 [==============================] - 0s 183us/sample - loss: 928.6614 - val_loss: 916.6278
## Epoch 19/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 897.7328
 640/2283 [=======>......................] - ETA: 0s - loss: 944.5674
1152/2283 [==============>...............] - ETA: 0s - loss: 922.9570
1792/2283 [======================>.......] - ETA: 0s - loss: 919.1797
2283/2283 [==============================] - 0s 157us/sample - loss: 926.1693 - val_loss: 910.6818
## Epoch 20/20
## 
 128/2283 [>.............................] - ETA: 0s - loss: 937.5427
 640/2283 [=======>......................] - ETA: 0s - loss: 925.0431
1152/2283 [==============>...............] - ETA: 0s - loss: 904.8269
1664/2283 [====================>.........] - ETA: 0s - loss: 909.8147
2176/2283 [===========================>..] - ETA: 0s - loss: 914.4016
2283/2283 [==============================] - 0s 186us/sample - loss: 917.5846 - val_loss: 914.1966

The model architecture of the CVAE is shown below.

if (keras::is_keras_available() & reticulate::py_available()) {
    VAExprs::plot_vae(VAE_result$model)
}

The sequence logo of the first 20 amino acids of the generated protein sequences with the label “I” is shown below.

if (keras::is_keras_available() & reticulate::py_available()) {
    set.seed(1)
    gen_prot_VAE <- gen_VAE(VAE_result, label = rep("I", 100), num_seq = 100)
    ggseqlogo::ggseqlogo(substr(gen_prot_VAE$gen_seq, 1, 20))
}
## generating...
## post-processing...

The sequence logo of the first 20 amino acids of the generated protein sequences with the label “L” is shown below.

if (keras::is_keras_available() & reticulate::py_available()) {
    gen_prot_VAE <- gen_VAE(VAE_result, label = rep("L", 100), num_seq = 100)
    ggseqlogo::ggseqlogo(substr(gen_prot_VAE$gen_seq, 1, 20))
}
## generating...
## post-processing...

The sequence logo of the first 20 amino acids of the real protein sequences is shown below.

if (keras::is_keras_available() & reticulate::py_available()) {
    ggseqlogo::ggseqlogo(substr(example_luxA, 1, 20))
}



2.3 AR with Transformer

The SARS-CoV-2 3C-like protease is used to train the autoregressive language model with the Transformer. The same dataset is used for training and validation.

if (keras::is_keras_available() & reticulate::py_available()) {
    library(GenProSeq)
    prot_seq <- DeepPINCS::SARS_CoV2_3CL_Protease
    
    # model parameters
    length_seq <- 10
    embedding_dim <- 16
    num_heads <- 2
    ff_dim <- 16
    num_transformer_blocks <- 2
    batch_size <- 32
    epochs <- 100
    
    # ART
    ART_result <- fit_ART(prot_seq = prot_seq,
                        length_seq = length_seq,
                        embedding_dim = embedding_dim,
                        num_heads = num_heads,
                        ff_dim = ff_dim,
                        num_transformer_blocks = num_transformer_blocks,
                        layers = list(layer_dropout(rate = 0.1),
                                    layer_dense(units = 32, activation = "relu"),
                                    layer_dropout(rate = 0.1)),
                        prot_seq_val = prot_seq,
                        epochs = epochs,
                        batch_size = batch_size,
                        use_generator = FALSE,
                        callbacks = callback_early_stopping(
                            monitor = "val_loss",
                            patience = 50,
                            restore_best_weights = TRUE))
}
## pre-processing...
## training...
## Train on 296 samples, validate on 296 samples
## Epoch 1/100
## 
 32/296 [==>...........................] - ETA: 6s - loss: 3.1402 - accuracy: 0.0000e+00
256/296 [========================>.....] - ETA: 0s - loss: 3.0316 - accuracy: 0.0391    
296/296 [==============================] - 1s 4ms/sample - loss: 3.0376 - accuracy: 0.0405 - val_loss: 2.9605 - val_accuracy: 0.0743
## Epoch 2/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.9268 - accuracy: 0.1250
256/296 [========================>.....] - ETA: 0s - loss: 2.9852 - accuracy: 0.0508
296/296 [==============================] - 0s 387us/sample - loss: 2.9779 - accuracy: 0.0642 - val_loss: 2.9569 - val_accuracy: 0.0642
## Epoch 3/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.9552 - accuracy: 0.1562
256/296 [========================>.....] - ETA: 0s - loss: 2.9753 - accuracy: 0.0664
296/296 [==============================] - 0s 383us/sample - loss: 2.9847 - accuracy: 0.0642 - val_loss: 2.9390 - val_accuracy: 0.0777
## Epoch 4/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 3.0149 - accuracy: 0.0938
256/296 [========================>.....] - ETA: 0s - loss: 2.9605 - accuracy: 0.0977
296/296 [==============================] - 0s 370us/sample - loss: 2.9632 - accuracy: 0.1014 - val_loss: 2.9260 - val_accuracy: 0.1081
## Epoch 5/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.9846 - accuracy: 0.1250
256/296 [========================>.....] - ETA: 0s - loss: 2.9668 - accuracy: 0.1016
296/296 [==============================] - 0s 394us/sample - loss: 2.9514 - accuracy: 0.1149 - val_loss: 2.9042 - val_accuracy: 0.0912
## Epoch 6/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.9206 - accuracy: 0.0312
256/296 [========================>.....] - ETA: 0s - loss: 2.9080 - accuracy: 0.0742
296/296 [==============================] - 0s 357us/sample - loss: 2.9181 - accuracy: 0.0642 - val_loss: 2.8891 - val_accuracy: 0.1081
## Epoch 7/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.9011 - accuracy: 0.1250
296/296 [==============================] - ETA: 0s - loss: 2.9288 - accuracy: 0.0878
296/296 [==============================] - 0s 292us/sample - loss: 2.9288 - accuracy: 0.0878 - val_loss: 2.8891 - val_accuracy: 0.0980
## Epoch 8/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.8684 - accuracy: 0.2188
256/296 [========================>.....] - ETA: 0s - loss: 2.9161 - accuracy: 0.0742
296/296 [==============================] - 0s 377us/sample - loss: 2.9151 - accuracy: 0.0811 - val_loss: 2.8745 - val_accuracy: 0.1014
## Epoch 9/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.9054 - accuracy: 0.1250
256/296 [========================>.....] - ETA: 0s - loss: 2.8976 - accuracy: 0.1016
296/296 [==============================] - 0s 387us/sample - loss: 2.8960 - accuracy: 0.0980 - val_loss: 2.8534 - val_accuracy: 0.1014
## Epoch 10/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.8661 - accuracy: 0.0938
256/296 [========================>.....] - ETA: 0s - loss: 2.8549 - accuracy: 0.1328
296/296 [==============================] - 0s 392us/sample - loss: 2.8731 - accuracy: 0.1182 - val_loss: 2.8334 - val_accuracy: 0.1284
## Epoch 11/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.8607 - accuracy: 0.0938
256/296 [========================>.....] - ETA: 0s - loss: 2.8524 - accuracy: 0.1328
296/296 [==============================] - 0s 374us/sample - loss: 2.8545 - accuracy: 0.1216 - val_loss: 2.8261 - val_accuracy: 0.1149
## Epoch 12/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.8343 - accuracy: 0.1250
256/296 [========================>.....] - ETA: 0s - loss: 2.8425 - accuracy: 0.1133
296/296 [==============================] - 0s 379us/sample - loss: 2.8500 - accuracy: 0.1115 - val_loss: 2.8021 - val_accuracy: 0.1588
## Epoch 13/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.7550 - accuracy: 0.1562
256/296 [========================>.....] - ETA: 0s - loss: 2.7942 - accuracy: 0.1523
296/296 [==============================] - 0s 375us/sample - loss: 2.8194 - accuracy: 0.1419 - val_loss: 2.7762 - val_accuracy: 0.1655
## Epoch 14/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.7119 - accuracy: 0.1562
256/296 [========================>.....] - ETA: 0s - loss: 2.7981 - accuracy: 0.1367
296/296 [==============================] - 0s 368us/sample - loss: 2.7941 - accuracy: 0.1419 - val_loss: 2.7581 - val_accuracy: 0.1689
## Epoch 15/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.7237 - accuracy: 0.0625
256/296 [========================>.....] - ETA: 0s - loss: 2.7841 - accuracy: 0.1172
296/296 [==============================] - 0s 378us/sample - loss: 2.7923 - accuracy: 0.1284 - val_loss: 2.7361 - val_accuracy: 0.1858
## Epoch 16/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.8761 - accuracy: 0.0312
256/296 [========================>.....] - ETA: 0s - loss: 2.8137 - accuracy: 0.1133
296/296 [==============================] - 0s 372us/sample - loss: 2.8009 - accuracy: 0.1149 - val_loss: 2.7151 - val_accuracy: 0.1824
## Epoch 17/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.7828 - accuracy: 0.1250
256/296 [========================>.....] - ETA: 0s - loss: 2.7776 - accuracy: 0.1523
296/296 [==============================] - 0s 374us/sample - loss: 2.7751 - accuracy: 0.1588 - val_loss: 2.7194 - val_accuracy: 0.1926
## Epoch 18/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.8538 - accuracy: 0.1250
256/296 [========================>.....] - ETA: 0s - loss: 2.7979 - accuracy: 0.1328
296/296 [==============================] - 0s 377us/sample - loss: 2.7824 - accuracy: 0.1453 - val_loss: 2.6935 - val_accuracy: 0.2095
## Epoch 19/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.5380 - accuracy: 0.2812
256/296 [========================>.....] - ETA: 0s - loss: 2.7274 - accuracy: 0.1562
296/296 [==============================] - 0s 374us/sample - loss: 2.7379 - accuracy: 0.1520 - val_loss: 2.6682 - val_accuracy: 0.1959
## Epoch 20/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.8285 - accuracy: 0.1250
256/296 [========================>.....] - ETA: 0s - loss: 2.7198 - accuracy: 0.1641
296/296 [==============================] - 0s 376us/sample - loss: 2.7430 - accuracy: 0.1655 - val_loss: 2.6590 - val_accuracy: 0.1993
## Epoch 21/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.7130 - accuracy: 0.1562
256/296 [========================>.....] - ETA: 0s - loss: 2.6922 - accuracy: 0.1914
296/296 [==============================] - 0s 382us/sample - loss: 2.6916 - accuracy: 0.1959 - val_loss: 2.6407 - val_accuracy: 0.2061
## Epoch 22/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.5563 - accuracy: 0.2188
256/296 [========================>.....] - ETA: 0s - loss: 2.6779 - accuracy: 0.1641
296/296 [==============================] - 0s 376us/sample - loss: 2.6678 - accuracy: 0.1824 - val_loss: 2.6239 - val_accuracy: 0.2027
## Epoch 23/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.5020 - accuracy: 0.2500
256/296 [========================>.....] - ETA: 0s - loss: 2.6768 - accuracy: 0.1719
296/296 [==============================] - 0s 373us/sample - loss: 2.6910 - accuracy: 0.1689 - val_loss: 2.6333 - val_accuracy: 0.2027
## Epoch 24/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.6507 - accuracy: 0.2812
288/296 [============================>.] - ETA: 0s - loss: 2.6645 - accuracy: 0.2153
296/296 [==============================] - 0s 371us/sample - loss: 2.6633 - accuracy: 0.2162 - val_loss: 2.5953 - val_accuracy: 0.1993
## Epoch 25/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.5507 - accuracy: 0.1875
256/296 [========================>.....] - ETA: 0s - loss: 2.6711 - accuracy: 0.1641
296/296 [==============================] - 0s 368us/sample - loss: 2.6637 - accuracy: 0.1723 - val_loss: 2.6010 - val_accuracy: 0.1993
## Epoch 26/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.5389 - accuracy: 0.2500
256/296 [========================>.....] - ETA: 0s - loss: 2.6492 - accuracy: 0.1719
296/296 [==============================] - 0s 362us/sample - loss: 2.6502 - accuracy: 0.1723 - val_loss: 2.6109 - val_accuracy: 0.2061
## Epoch 27/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4997 - accuracy: 0.2188
256/296 [========================>.....] - ETA: 0s - loss: 2.6891 - accuracy: 0.1719
296/296 [==============================] - 0s 369us/sample - loss: 2.6842 - accuracy: 0.1723 - val_loss: 2.5573 - val_accuracy: 0.2399
## Epoch 28/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.5812 - accuracy: 0.1875
256/296 [========================>.....] - ETA: 0s - loss: 2.6195 - accuracy: 0.1914
296/296 [==============================] - 0s 367us/sample - loss: 2.6103 - accuracy: 0.2061 - val_loss: 2.5166 - val_accuracy: 0.2432
## Epoch 29/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.6837 - accuracy: 0.0625
288/296 [============================>.] - ETA: 0s - loss: 2.6072 - accuracy: 0.1806
296/296 [==============================] - 0s 356us/sample - loss: 2.6011 - accuracy: 0.1892 - val_loss: 2.5037 - val_accuracy: 0.2500
## Epoch 30/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.6130 - accuracy: 0.2188
256/296 [========================>.....] - ETA: 0s - loss: 2.5736 - accuracy: 0.1992
296/296 [==============================] - 0s 381us/sample - loss: 2.5718 - accuracy: 0.2027 - val_loss: 2.4746 - val_accuracy: 0.2365
## Epoch 31/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.8445 - accuracy: 0.1562
256/296 [========================>.....] - ETA: 0s - loss: 2.5866 - accuracy: 0.2070
296/296 [==============================] - 0s 381us/sample - loss: 2.5755 - accuracy: 0.2095 - val_loss: 2.4629 - val_accuracy: 0.2568
## Epoch 32/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4381 - accuracy: 0.2812
256/296 [========================>.....] - ETA: 0s - loss: 2.5332 - accuracy: 0.1953
296/296 [==============================] - 0s 382us/sample - loss: 2.5319 - accuracy: 0.2061 - val_loss: 2.4333 - val_accuracy: 0.2601
## Epoch 33/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4927 - accuracy: 0.2500
256/296 [========================>.....] - ETA: 0s - loss: 2.5385 - accuracy: 0.2305
296/296 [==============================] - 0s 410us/sample - loss: 2.5315 - accuracy: 0.2331 - val_loss: 2.4256 - val_accuracy: 0.2635
## Epoch 34/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.7188 - accuracy: 0.1875
256/296 [========================>.....] - ETA: 0s - loss: 2.5413 - accuracy: 0.2344
296/296 [==============================] - 0s 388us/sample - loss: 2.5251 - accuracy: 0.2230 - val_loss: 2.4070 - val_accuracy: 0.3041
## Epoch 35/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4851 - accuracy: 0.2812
256/296 [========================>.....] - ETA: 0s - loss: 2.5190 - accuracy: 0.2148
296/296 [==============================] - 0s 398us/sample - loss: 2.4974 - accuracy: 0.2264 - val_loss: 2.3731 - val_accuracy: 0.2905
## Epoch 36/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4488 - accuracy: 0.2812
256/296 [========================>.....] - ETA: 0s - loss: 2.4781 - accuracy: 0.2461
296/296 [==============================] - 0s 374us/sample - loss: 2.4868 - accuracy: 0.2365 - val_loss: 2.3834 - val_accuracy: 0.2703
## Epoch 37/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4207 - accuracy: 0.3438
256/296 [========================>.....] - ETA: 0s - loss: 2.4936 - accuracy: 0.2109
296/296 [==============================] - 0s 410us/sample - loss: 2.4956 - accuracy: 0.2095 - val_loss: 2.4077 - val_accuracy: 0.2365
## Epoch 38/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.3904 - accuracy: 0.2188
256/296 [========================>.....] - ETA: 0s - loss: 2.4123 - accuracy: 0.2891
296/296 [==============================] - 0s 398us/sample - loss: 2.4352 - accuracy: 0.2703 - val_loss: 2.3442 - val_accuracy: 0.3007
## Epoch 39/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4444 - accuracy: 0.1875
256/296 [========================>.....] - ETA: 0s - loss: 2.4702 - accuracy: 0.2383
296/296 [==============================] - 0s 397us/sample - loss: 2.4499 - accuracy: 0.2365 - val_loss: 2.3084 - val_accuracy: 0.2939
## Epoch 40/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.3571 - accuracy: 0.3750
256/296 [========================>.....] - ETA: 0s - loss: 2.4111 - accuracy: 0.2617
296/296 [==============================] - 0s 410us/sample - loss: 2.4112 - accuracy: 0.2601 - val_loss: 2.2672 - val_accuracy: 0.3243
## Epoch 41/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4141 - accuracy: 0.2500
256/296 [========================>.....] - ETA: 0s - loss: 2.3984 - accuracy: 0.2461
296/296 [==============================] - 0s 425us/sample - loss: 2.3876 - accuracy: 0.2534 - val_loss: 2.2561 - val_accuracy: 0.3446
## Epoch 42/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4922 - accuracy: 0.2500
256/296 [========================>.....] - ETA: 0s - loss: 2.3834 - accuracy: 0.2500
296/296 [==============================] - 0s 412us/sample - loss: 2.3881 - accuracy: 0.2534 - val_loss: 2.2408 - val_accuracy: 0.3311
## Epoch 43/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4103 - accuracy: 0.2500
256/296 [========================>.....] - ETA: 0s - loss: 2.3649 - accuracy: 0.2500
296/296 [==============================] - 0s 425us/sample - loss: 2.3803 - accuracy: 0.2399 - val_loss: 2.2162 - val_accuracy: 0.3243
## Epoch 44/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.2861 - accuracy: 0.2500
256/296 [========================>.....] - ETA: 0s - loss: 2.3807 - accuracy: 0.2305
296/296 [==============================] - 0s 415us/sample - loss: 2.3730 - accuracy: 0.2432 - val_loss: 2.2024 - val_accuracy: 0.3581
## Epoch 45/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.3188 - accuracy: 0.2500
224/296 [=====================>........] - ETA: 0s - loss: 2.3287 - accuracy: 0.3036
296/296 [==============================] - 0s 461us/sample - loss: 2.3418 - accuracy: 0.2939 - val_loss: 2.1830 - val_accuracy: 0.3547
## Epoch 46/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.3938 - accuracy: 0.3125
256/296 [========================>.....] - ETA: 0s - loss: 2.3298 - accuracy: 0.2969
296/296 [==============================] - 0s 419us/sample - loss: 2.3042 - accuracy: 0.3176 - val_loss: 2.1327 - val_accuracy: 0.3784
## Epoch 47/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4666 - accuracy: 0.3125
256/296 [========================>.....] - ETA: 0s - loss: 2.2356 - accuracy: 0.3359
296/296 [==============================] - 0s 399us/sample - loss: 2.2449 - accuracy: 0.3243 - val_loss: 2.1086 - val_accuracy: 0.3750
## Epoch 48/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.2743 - accuracy: 0.2500
256/296 [========================>.....] - ETA: 0s - loss: 2.2562 - accuracy: 0.2969
296/296 [==============================] - 0s 380us/sample - loss: 2.2677 - accuracy: 0.2905 - val_loss: 2.0790 - val_accuracy: 0.3851
## Epoch 49/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4834 - accuracy: 0.2188
288/296 [============================>.] - ETA: 0s - loss: 2.2830 - accuracy: 0.2743
296/296 [==============================] - 0s 386us/sample - loss: 2.2858 - accuracy: 0.2703 - val_loss: 2.0650 - val_accuracy: 0.3986
## Epoch 50/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.1293 - accuracy: 0.3438
256/296 [========================>.....] - ETA: 0s - loss: 2.2544 - accuracy: 0.3203
296/296 [==============================] - 0s 399us/sample - loss: 2.2341 - accuracy: 0.3277 - val_loss: 2.0649 - val_accuracy: 0.3818
## Epoch 51/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.4560 - accuracy: 0.1875
256/296 [========================>.....] - ETA: 0s - loss: 2.2317 - accuracy: 0.3164
296/296 [==============================] - 0s 404us/sample - loss: 2.2442 - accuracy: 0.3108 - val_loss: 2.0362 - val_accuracy: 0.4054
## Epoch 52/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 1.9535 - accuracy: 0.4375
256/296 [========================>.....] - ETA: 0s - loss: 2.1790 - accuracy: 0.3281
296/296 [==============================] - 0s 391us/sample - loss: 2.1943 - accuracy: 0.3345 - val_loss: 2.0058 - val_accuracy: 0.4358
## Epoch 53/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.0280 - accuracy: 0.3438
256/296 [========================>.....] - ETA: 0s - loss: 2.0859 - accuracy: 0.3867
296/296 [==============================] - 0s 385us/sample - loss: 2.1008 - accuracy: 0.3818 - val_loss: 1.9928 - val_accuracy: 0.3953
## Epoch 54/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.2516 - accuracy: 0.2812
256/296 [========================>.....] - ETA: 0s - loss: 2.1956 - accuracy: 0.2930
296/296 [==============================] - 0s 384us/sample - loss: 2.1760 - accuracy: 0.3142 - val_loss: 1.9859 - val_accuracy: 0.4561
## Epoch 55/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.0756 - accuracy: 0.3750
256/296 [========================>.....] - ETA: 0s - loss: 2.0800 - accuracy: 0.3750
296/296 [==============================] - 0s 458us/sample - loss: 2.1353 - accuracy: 0.3649 - val_loss: 1.9842 - val_accuracy: 0.3953
## Epoch 56/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.1583 - accuracy: 0.4375
256/296 [========================>.....] - ETA: 0s - loss: 2.1675 - accuracy: 0.3359
296/296 [==============================] - 0s 385us/sample - loss: 2.1661 - accuracy: 0.3378 - val_loss: 1.9238 - val_accuracy: 0.4561
## Epoch 57/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.0477 - accuracy: 0.3750
256/296 [========================>.....] - ETA: 0s - loss: 2.1341 - accuracy: 0.3438
296/296 [==============================] - 0s 385us/sample - loss: 2.1171 - accuracy: 0.3547 - val_loss: 1.9091 - val_accuracy: 0.4628
## Epoch 58/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.0975 - accuracy: 0.3125
256/296 [========================>.....] - ETA: 0s - loss: 2.1393 - accuracy: 0.3281
296/296 [==============================] - 0s 373us/sample - loss: 2.1305 - accuracy: 0.3446 - val_loss: 1.9802 - val_accuracy: 0.4088
## Epoch 59/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.2319 - accuracy: 0.3438
288/296 [============================>.] - ETA: 0s - loss: 2.1782 - accuracy: 0.3368
296/296 [==============================] - 0s 330us/sample - loss: 2.1618 - accuracy: 0.3378 - val_loss: 1.9340 - val_accuracy: 0.4257
## Epoch 60/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 1.9591 - accuracy: 0.4375
256/296 [========================>.....] - ETA: 0s - loss: 2.0711 - accuracy: 0.3164
296/296 [==============================] - 0s 412us/sample - loss: 2.0857 - accuracy: 0.3209 - val_loss: 1.8592 - val_accuracy: 0.4831
## Epoch 61/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 1.9968 - accuracy: 0.4375
256/296 [========================>.....] - ETA: 0s - loss: 2.0452 - accuracy: 0.4062
296/296 [==============================] - 0s 364us/sample - loss: 2.0509 - accuracy: 0.3953 - val_loss: 1.8304 - val_accuracy: 0.4797
## Epoch 62/100
## 
 32/296 [==>...........................] - ETA: 0s - loss: 2.2761 - accuracy: 0.3125
296/296 [==============================] - 0s 345us/sample - loss: 2.0224 - accuracy: 0.3953 - val_loss: 1.7899 - val_accuracy: 0.4966
## ... (per-epoch output for epochs 63-99 omitted: training loss decreases from about 2.1 to 1.6 and validation accuracy rises from 0.50 to about 0.66)
## Epoch 100/100
## 
296/296 [==============================] - 0s 385us/sample - loss: 1.4657 - accuracy: 0.5236 - val_loss: 1.1861 - val_accuracy: 0.6655

The model architecture is shown below.

if (keras::is_keras_available() & reticulate::py_available()) {
    ttgsea::plot_model(ART_result$model)
}

The learned autoregressive model generates new protein sequences. After the language model produces a conditional probability distribution over the vocabulary for a given input sequence, we need to decide how to choose the next token from that distribution. Greedy search simply selects the token with the highest probability. Beam search, rather than keeping only the single most probable token, keeps the top b candidates, where b is a parameter called the beam size; to extend the sequence, each of the b partial sequences is continued and the top b continuations are retained again, and the process repeats until the end of the sequence. In temperature sampling, the token is drawn from the conditional distribution rescaled by a temperature value, so that lower temperatures concentrate probability on the most likely tokens. In top-k sampling, only the k most likely next tokens are kept and their probabilities are renormalized among those k tokens. Similarly, top-p (nucleus) sampling excludes very low-probability tokens by finding the smallest set of tokens whose cumulative probability is at least p and sampling from that set.
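To make these strategies concrete, the short sketch below (not part of GenProSeq; the five-token probability vector and the helper functions are invented purely for illustration) shows how each strategy reshapes a toy next-token distribution before a token is drawn.

# toy next-token distribution over five amino acids (illustrative values only)
probs <- c(A = 0.50, R = 0.25, N = 0.15, D = 0.07, C = 0.03)

# greedy search: always pick the most probable token
names(probs)[which.max(probs)]

# temperature sampling: rescale log-probabilities by 1/t and renormalize;
# t < 1 sharpens the distribution, t > 1 flattens it
temperature <- function(p, t) { q <- exp(log(p) / t); q / sum(q) }
sample(names(probs), 1, prob = temperature(probs, t = 0.5))

# top-k sampling: keep only the k most probable tokens and renormalize
top_k <- function(p, k) { q <- sort(p, decreasing = TRUE)[seq_len(k)]; q / sum(q) }
kept_k <- top_k(probs, k = 3)
sample(names(kept_k), 1, prob = kept_k)

# top-p (nucleus) sampling: keep the smallest set of tokens whose cumulative
# probability reaches the cutoff, then renormalize
top_p <- function(p, cutoff) {
    q <- sort(p, decreasing = TRUE)
    keep <- seq_len(which(cumsum(q) >= cutoff)[1])
    q[keep] / sum(q[keep])
}
kept_p <- top_p(probs, cutoff = 0.75)
sample(names(kept_p), 1, prob = kept_p)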

if (keras::is_keras_available() & reticulate::py_available()) {
    set.seed(1)
    # 10-residue seed taken from the beginning of the real sequence
    seed_prot <- "SGFRKMAFPS"
    # generate 20 amino acids after the seed with each decoding method,
    # printing the real 30-residue prefix after each result for comparison
    print(gen_ART(ART_result, seed_prot, length_AA = 20, method = "greedy"))
    print(substr(prot_seq, 1, 30))
    print(gen_ART(ART_result, seed_prot, length_AA = 20, method = "beam", b = 5))
    print(substr(prot_seq, 1, 30))
    print(gen_ART(ART_result, seed_prot, length_AA = 20, method = "temperature", t = 0.1))
    print(substr(prot_seq, 1, 30))
    print(gen_ART(ART_result, seed_prot, length_AA = 20, method = "top_k", k = 3))
    print(substr(prot_seq, 1, 30))
    print(gen_ART(ART_result, seed_prot, length_AA = 20, method = "top_p", p = 0.75))
    print(substr(prot_seq, 1, 30))
}
## generating...
## [1] "SGFRKMAFPSLKFVAAGDPQIFSAVLVCSG"
## [1] "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGL"
## generating...
## [1] "SGFRKMAFPSLKFFSVNVCYKYVHLGQTQT"
## [1] "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGL"
## generating...
## [1] "SGFRKMAFPSLKFFKVNVANYYYELLDDDR"
## [1] "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGL"
## generating...
## [1] "SGFRKMAFPSTKKLAADDNFKFGKFVRVVG"
## [1] "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGL"
## generating...
## [1] "SGFRKMAFPSVLQLNFQTGSNVLVELCNVQ"
## [1] "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGL"

We can compute pairwise similarities between the real and generated protein sequences. The values returned by the function "stringsim" range from 0, for strings that are not similar at all, to 1, for strings that are identical. One advantage of string similarity over string distance is that similarities are easier to interpret because they are normalized.
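For an edit-based metric, the similarity is one minus the distance divided by the maximum possible distance (the length of the longer string). The toy example below is purely illustrative; the two six-residue strings are invented and are not taken from the vignette data.

# two invented six-residue strings differing at the last two positions
stringdist::stringdist("MKVLAA", "MKVLSG")   # edit distance of 2
stringdist::stringsim("MKVLAA", "MKVLSG")    # 1 - 2/6, about 0.67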

if (keras::is_keras_available() & reticulate::py_available()) {
    # similarity between greedily generated sequences and real-sequence prefixes of
    # the same total length (10-residue seed plus 20, 30, 40, or 50 generated residues)
    print(stringdist::stringsim(gen_ART(ART_result, seed_prot, length_AA = 20, method = "greedy"),
                        substr(prot_seq, 1, 30)))
    print(stringdist::stringsim(gen_ART(ART_result, seed_prot, length_AA = 30, method = "greedy"),
                        substr(prot_seq, 1, 40)))
    print(stringdist::stringsim(gen_ART(ART_result, seed_prot, length_AA = 40, method = "greedy"),
                        substr(prot_seq, 1, 50)))
    print(stringdist::stringsim(gen_ART(ART_result, seed_prot, length_AA = 50, method = "greedy"),
                        substr(prot_seq, 1, 60)))
}
## generating...
## [1] 0.4333333
## generating...
## [1] 0.35
## generating...
## [1] 0.32
## generating...
## [1] 0.3



3 Session information

sessionInfo()
## R version 4.3.1 Patched (2023-06-17 r84564)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] GenProSeq_1.6.0 mclust_6.0.0    keras_2.13.0   
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          rstudioapi_0.15.0          
##   [3] jsonlite_1.8.7              magrittr_2.0.3             
##   [5] ggbeeswarm_0.7.2            farver_2.1.1               
##   [7] rmarkdown_2.25              zlibbioc_1.48.0            
##   [9] vctrs_0.6.4                 DelayedMatrixStats_1.24.0  
##  [11] RCurl_1.98-1.12             ttgsea_1.10.0              
##  [13] base64enc_0.1-3             PRROC_1.3.1                
##  [15] koRpus.lang.en_0.1-4        htmltools_0.5.6.1          
##  [17] S4Arrays_1.2.0              BiocNeighbors_1.20.0       
##  [19] SparseArray_1.2.0           sass_0.4.7                 
##  [21] bslib_0.5.1                 htmlwidgets_1.6.2          
##  [23] tokenizers_0.3.0            cachem_1.0.8               
##  [25] whisker_0.4.1               lifecycle_1.0.3            
##  [27] pkgconfig_2.0.3             rsvd_1.0.5                 
##  [29] Matrix_1.6-1.1              R6_2.5.1                   
##  [31] fastmap_1.1.1               GenomeInfoDbData_1.2.11    
##  [33] MatrixGenerics_1.14.0       digest_0.6.33              
##  [35] colorspace_2.1-0            S4Vectors_0.40.0           
##  [37] scater_1.30.0               VAExprs_1.8.0              
##  [39] irlba_2.3.5.1               GenomicRanges_1.54.0       
##  [41] SnowballC_0.7.1             beachmat_2.18.0            
##  [43] labeling_0.4.3              fansi_1.0.5                
##  [45] tfruns_1.5.1                httr_1.4.7                 
##  [47] abind_1.4-5                 compiler_4.3.1             
##  [49] withr_2.5.1                 BiocParallel_1.36.0        
##  [51] sylly_0.1-6                 viridis_0.6.4              
##  [53] koRpus_0.13-8               tensorflow_2.14.0          
##  [55] float_0.3-1                 DelayedArray_0.28.0        
##  [57] CatEncoders_0.1.1           tools_4.3.1                
##  [59] vipor_0.4.5                 beeswarm_0.4.0             
##  [61] word2vec_0.4.0              stopwords_2.3              
##  [63] sylly.en_0.1-3              ggseqlogo_0.1              
##  [65] webchem_1.3.0               glue_1.6.2                 
##  [67] lgr_0.4.4                   DiagrammeR_1.0.10          
##  [69] grid_4.3.1                  stringdist_0.9.10          
##  [71] generics_0.1.3              gtable_0.3.4               
##  [73] data.table_1.14.8           BiocSingular_1.18.0        
##  [75] ScaledMatrix_1.10.0         xml2_1.3.5                 
##  [77] utf8_1.2.4                  XVector_0.42.0             
##  [79] BiocGenerics_0.48.0         ggrepel_0.9.4              
##  [81] pillar_1.9.0                stringr_1.5.0              
##  [83] dplyr_1.1.3                 lattice_0.22-5             
##  [85] tidyselect_1.2.0            SingleCellExperiment_1.24.0
##  [87] tm_0.7-11                   scuttle_1.12.0             
##  [89] knitr_1.44                  gridExtra_2.3              
##  [91] NLP_0.2-1                   IRanges_2.36.0             
##  [93] SummarizedExperiment_1.32.0 textstem_0.1.4             
##  [95] RhpcBLASctl_0.23-42         stats4_4.3.1               
##  [97] xfun_0.40                   Biobase_2.62.0             
##  [99] matrixStats_1.0.0           visNetwork_2.1.2           
## [101] stringi_1.7.12              yaml_2.3.7                 
## [103] rsparse_0.5.1               evaluate_0.22              
## [105] codetools_0.2-19            data.tree_1.0.0            
## [107] tibble_3.2.1                cli_3.6.1                  
## [109] matlab_1.0.4                reticulate_1.34.0          
## [111] munsell_0.5.0               jquerylib_0.1.4            
## [113] Rcpp_1.0.11                 GenomeInfoDb_1.38.0        
## [115] DeepPINCS_1.10.0            mlapi_0.1.1                
## [117] zeallot_0.1.0               png_0.1-8                  
## [119] parallel_4.3.1              ellipsis_0.3.2             
## [121] ggplot2_3.4.4               sparseMatrixStats_1.14.0   
## [123] bitops_1.0-7                viridisLite_0.4.2          
## [125] slam_0.1-50                 text2vec_0.6.3             
## [127] scales_1.2.1                purrr_1.0.2                
## [129] crayon_1.5.2                rlang_1.1.1                
## [131] rvest_1.0.3



4 References

Barbosa, V. A. F., Santana, M. A., Andrade, M. K. S., Lima, R. C. F., & Santos, W. P. (2020). Deep Learning for Data Analytics: Foundations, Biomedical Applications, and Challenges. Academic Press.

Cinelli, L. P., Marins, M. A., da Silva, E. A. B., & Netto, S. L. (2021). Variational Methods for Machine Learning with Applications to Deep Networks. Springer.

Dash, S., Acharya, B. R., Mittal, M., Abraham, A., & Kelemen, A. (Eds.). (2020). Deep learning techniques for biomedical and health informatics. Springer.

Deepak, P., Chakraborty, T., & Long, C. (2021). Data Science for Fake News: Surveys and Perspectives. Springer.

Dong, G., & Pei, J. (2007). Sequence data mining. Springer.

Frazer, J., Notin, P., Dias, M., Gomez, A., Brock, K., Gal, Y., & Marks, D. (2020). Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning. bioRxiv.

Gagniuc, P. A. (2021). Algorithms in Bioinformatics: Theory and Implementation. Wiley & Sons.

Hawkins-Hooker, A., Depardieu, F., Baur, S., Couairon, G., Chen, A., & Bikard, D. (2020). Generating functional protein variants with variational autoencoders. bioRxiv.

Hemanth, J., Bhatia, M., & Geman, O. (2020). Data Visualization and Knowledge Engineering: Spotting Data Points with Artificial Intelligence. Springer.

Lappin, S. (2021). Deep learning and linguistic representation. CRC Press.

Liebowitz, J. (Ed.). (2020). Data Analytics and AI. CRC Press.

Liu, Z., Lin, Y., & Sun, M. (2020). Representation learning for natural language processing. Springer.

Madani, A., McCann, B., Naik, N., Keskar, N. S., Anand, N., Eguchi, R. R., Huang, P., & Socher, R. (2020). Progen: Language modeling for protein generation. arXiv:2004.03497.

Pearson, R. K. (2018). Exploratory data analysis using R. CRC Press.

Pedrycz, W., & Chen, S. M. (Eds.). (2020). Deep Learning: Concepts and Architectures. Springer.

Peter, J. D., Fernandes, S. L., Thomaz, C. E., & Viriri, S. (Eds.). (2019). Computer aided intervention and diagnostics in clinical and medical images. Springer.

Repecka, D., et al. (2019). Expanding functional protein sequence space using generative adversarial networks. bioRxiv.

Suguna, S. K., Dhivya, M., & Paiva, S. (Eds.). (2021). Artificial Intelligence (AI): Recent Trends and Applications. CRC Press.

Sun, S., Mao, L., Dong, Z., & Wu, L. (2019). Multiview machine learning. Springer.

Wolkenhauer, O. (2020). Systems Medicine: Integrative, Qualitative and Computational Approaches. Academic Press.

Wu, Z., Johnston, K. E., Arnold, F. H., & Yang, K. K. (2021). Protein sequence design with deep generative models. arXiv:2104.04457.