| Title: | Deep Significance Clustering for Clinical Risk Stratification |
| Version: | 0.1.2 |
| Description: | We provide an R implementation of Deep Significance Clustering (DICE), a self-supervised learning framework designed to identify clinically meaningful and risk-stratified patient subgroups from electronic health record (EHR) data. DICE jointly optimizes deep representation learning, clustering, and outcome prediction while enforcing statistical significance between predicted outcomes and cluster membership. This integrated optimization produces subgroups that are both clinically coherent and predictive, addressing a gap where traditional unsupervised clustering methods and supervised risk prediction models alone may fail to generate actionable clinical groupings. See Huang et al. (2021) <doi:10.1093/jamia/ocab203>. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| Imports: | argparser, ggplot2, torch, pROC |
| Suggests: | knitr, rmarkdown, scales |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2026-05-21 22:05:42 UTC; sarahayton |
| Author: | Sarah Ayton |
| Maintainer: | Sarah Ayton <saa7050@med.cornell.edu> |
| Config/roxygen2/version: | 8.0.0 |
| Repository: | CRAN |
| Date/Publication: | 2026-05-28 12:50:09 UTC |
DICEr Function for Training and Evaluating a Deep Learning Model
Description
This function orchestrates the training and evaluation of a deep learning model, specifically for autoencoder-based representation learning followed by clustering and classification. It handles data loading, model training, testing, and saving the best model based on performance criteria.
Usage
DICEr(args)
Arguments
args |
A list of arguments that configure the model training and evaluation process. The list should include:
|
Details
The DICEr function executes the following steps:
Sets the random seed for reproducibility.
Loads and preprocesses the training and test datasets.
Initializes the model, optimizer, and loss functions.
Trains an autoencoder for representation learning, saving the model and plotting loss curves.
Performs k-means clustering on the learned representations and reassigns cluster labels based on outcome ratios.
Iteratively trains the model for clustering, classification, and outcome prediction, optimizing the combined loss function.
Saves the best model based on the outcome prediction likelihood and checks for p-value significance.
Outputs and saves relevant training metrics, including loss curves and model checkpoints.
Value
Called for its side effects: trains the model and saves results to
disk. Returns NULL invisibly.
Examples
## Prepare data in DICErClust format (length-3 RDS list)
data_dir <- tempdir()
n <- 100L; p <- 4L; q <- 2L
set.seed(1)
saveRDS(list(matrix(runif(n * p), n, p), # data_x
matrix(as.double(rbinom(n * q, 1, 0.5)), n, q), # data_v (float)
rbinom(n, 1, 0.3)), # data_y
file.path(data_dir, "train.rds"))
saveRDS(list(matrix(runif(30L * p), 30L, p),
matrix(as.double(rbinom(30L * q, 1, 0.5)), 30L, q),
rbinom(30L, 1, 0.3)),
file.path(data_dir, "test.rds"))
args <- list(
seed = 1L, input_path = paste0(data_dir, "/"),
filename_train = "train.rds", filename_test = "test.rds",
n_input_fea = p, n_hidden_fea = 2L,
lstm_layer = 1L, lstm_dropout = 0.0, K_clusters = 2L,
n_dummy_demov_fea = q, cuda = FALSE, lr = 1e-4,
init_AE_epoch = 2L, iter = 5L, epoch_in_iter = 1L,
lambda_AE = 1.0, lambda_classifier = 1.0,
lambda_outcome = 1.0, lambda_p_value = 1.0
)
old_wd <- setwd(tempdir())
DICEr(args)
setwd(old_wd)
DecoderRNN Neural Network Module
Description
A neural network module implementing a decoder using an LSTM architecture, based on the torch package. This module is typically used in sequence-to-sequence tasks during the decoding phase.
Usage
DecoderRNN(input_size, nhidden, nlayers, dropout)
Arguments
input_size |
A numeric value representing the number of features in the input data. |
|
A numeric value representing the number of hidden units in each LSTM layer. | |
nlayers |
A numeric value indicating the number of LSTM layers. |
dropout |
A numeric value between 0 and 1 indicating the dropout probability to be applied between LSTM layers. |
Value
A tensor representing the output sequence, flipped along the sequence dimension.
Usage
DecoderRNN <- nn_module(
"DecoderRNN",
initialize = function(input_size, nhidden, nlayers, dropout) {
...
},
forward = function(x, h) {
...
}
)
Methods
initialize(input_size, nhidden, nlayers, dropout)-
Initializes the
DecoderRNNmodule.\itemize{ \item{\code{input_size}: A numeric value representing the number of features in the input data.} \item{\code{nhidden}: A numeric value representing the number of hidden units in each LSTM layer.} \item{\code{nlayers}: A numeric value indicating the number of LSTM layers.} \item{\code{dropout}: A numeric value between 0 and 1 indicating the dropout probability to be applied between LSTM layers.} } forward(x, h)-
Executes the forward pass of the
DecoderRNNmodule.\itemize{ \item{\code{x}: A tensor of shape \code{(batch_size, sequence_length, input_size)} representing the input data.} \item{\code{h}: A list containing the hidden state (\code{hn}) and cell state (\code{cn}) from the LSTM, typically passed from the encoder or previous decoder step.} } \value{ A tensor representing the output sequence, flipped along the sequence dimension. }
Examples
dec <- DecoderRNN(input_size = 4L, nhidden = 8L, nlayers = 1L, dropout = 0.0)
enc <- EncoderRNN(input_size = 4L, nhidden = 8L, nlayers = 1L, dropout = 0.0)
x <- torch::torch_randn(2L, 1L, 4L)
enc_out <- enc(x)
dec_out <- dec(enc_out[[3]], enc_out[[2]])
EncoderRNN Neural Network Module
Description
An RNN-based encoder module using Long Short-Term Memory (LSTM) architecture, implemented with the torch package.
This module is designed to process sequential data and is commonly used in tasks like sequence-to-sequence modeling, language modeling, and time series forecasting.
Usage
EncoderRNN(input_size, nhidden, nlayers, dropout)
Arguments
input_size |
A numeric value representing the number of features in the input data. |
|
A numeric value representing the number of hidden units in each LSTM layer. | |
nlayers |
A numeric value indicating the number of LSTM layers. |
dropout |
A numeric value between 0 and 1 indicating the dropout probability to be applied between LSTM layers. |
Value
An nn_module object (torch LSTM-based encoder). When called
with a tensor of shape (batch_size, seq_len, input_size), returns a
list of: (1) processed output tensor, (2) list of hidden and cell states
(hn, cn), and (3) the input with a leading zero tensor prepended.
Usage
EncoderRNN <- nn_module(
"EncoderRNN",
initialize = function(input_size, nhidden, nlayers, dropout) {
...
},
forward = function(x) {
...
}
)
Methods
initialize(input_size, nhidden, nlayers, dropout)-
Initializes the EncoderRNN module.
\itemize{ \item{\code{input_size}: A numeric value representing the number of features in the input data.} \item{\code{nhidden}: A numeric value representing the number of hidden units in each LSTM layer.} \item{\code{nlayers}: A numeric value indicating the number of LSTM layers.} \item{\code{dropout}: A numeric value between 0 and 1 indicating the dropout probability to be applied between LSTM layers.} } forward(x)-
Executes the forward pass of the EncoderRNN module.
\itemize{ \item{\code{x}: A tensor of shape (batch_size, sequence_length, input_size) representing the input data.} } \value{ A list containing: \itemize{ \item{\code{output}: The output tensor from the LSTM, processed and flipped.} \item{\code{list(hn, cn)}: A list containing the hidden state (\code{hn}) and cell state (\code{cn}) from the LSTM.} \item{\code{newinput}: A tensor representing the input data with an additional zero tensor at the beginning.} } }
Examples
enc <- EncoderRNN(input_size = 4L, nhidden = 8L, nlayers = 1L, dropout = 0.0)
x <- torch::torch_randn(2L, 1L, 4L) # batch=2, seq_len=1, features=4
out <- enc(x)
dim(out[[1]]) # encoded output
Analyze P-Value and Outcome Ratios for Clusters
Description
This function performs an analysis of p-values and outcome ratios related to clusters in a dataset. It calculates the ratio of positive outcomes within each cluster and computes p-values for the association between clusters and the outcome variable.
Usage
analysis_p_value_related(data_train, num_clusters, if_check = FALSE)
Arguments
data_train |
A list representing the training data. This list must include the following components:
|
num_clusters |
An integer specifying the number of clusters in the data. |
if_check |
A logical value indicating whether to print intermediate steps for debugging purposes. Default is |
Details
The analysis_p_value_related function performs the following steps:
Converts the cluster assignments into a one-hot encoded matrix and counts the occurrences of each cluster.
Calculates the ratio of positive outcomes within each cluster.
Combines the cluster one-hot encodings with additional covariates to create a matrix of predictor variables.
Iteratively removes one or two clusters from the predictor matrix and calculates the p-value for the remaining clusters using the
p_value_calculatefunction.
Value
A list containing the following components:
-
dict_outcome_ratio: A numeric vector representing the ratio of positive outcomes (wheredata_yequals 1) within each cluster. -
dict_p_value: A list of p-values, with each element corresponding to a pair of clusters. The p-values are calculated using a likelihood ratio test for the association between the clusters and the outcome variable.
Examples
set.seed(1)
n <- 80L
data_train <- list(
C = sample(0:1, n, replace = TRUE),
data_v = matrix(rbinom(n * 3L, 1, 0.5), n, 3L),
data_y = rbinom(n, 1, 0.3)
)
result <- analysis_p_value_related(data_train, num_clusters = 2L)
result$dict_outcome_ratio # outcome rate per cluster
result$dict_p_value # LRT p-value for the cluster pair
Analyze P-Value and Outcome Ratios for Clusters Using Only Cluster Information
Description
This function performs an analysis of p-values and outcome ratios related to clusters in a dataset, using only the cluster assignments and not additional covariates. It calculates the ratio of positive outcomes within each cluster and computes p-values for the association between clusters and the outcome variable.
Usage
analysis_p_value_related_onlyc(data_train, num_clusters, if_check = FALSE)
Arguments
data_train |
A list representing the training data. This list must include the following components:
|
num_clusters |
An integer specifying the number of clusters in the data. |
if_check |
A logical value indicating whether to print intermediate steps for debugging purposes. Default is |
Details
The analysis_p_value_related_onlyc function performs the following steps:
Converts the cluster assignments into a one-hot encoded matrix and counts the occurrences of each cluster.
Calculates the ratio of positive outcomes within each cluster.
Iteratively removes one or two clusters from the predictor matrix and calculates the p-value for the remaining clusters using the
p_value_calculatefunction.
Value
A list containing the following components:
-
dict_outcome_ratio: A numeric vector representing the ratio of positive outcomes (wheredata_yequals 1) within each cluster. -
dict_p_value: A list of p-values, with each element corresponding to a pair of clusters. The p-values are calculated using a likelihood ratio test for the association between the clusters and the outcome variable.
Examples
set.seed(1)
n <- 80L
data_train <- list(
C = sample(0:1, n, replace = TRUE),
data_v = matrix(rbinom(n * 3L, 1, 0.5), n, 3L),
data_y = rbinom(n, 1, 0.3)
)
result <- analysis_p_value_related_onlyc(data_train, num_clusters = 2L)
result$dict_outcome_ratio
result$dict_p_value
Reassign Cluster Labels Based on Outcome Ratios
Description
This function reassigns cluster labels such that clusters with higher ratios of positive outcomes are assigned lower labels, and those with lower ratios are assigned higher labels. This relabeling can be useful for aligning cluster labels with the significance of outcomes.
Usage
change_label_from_highratio_to_lowratio(args, oldlabel, data_train)
Arguments
args |
A list of arguments, which should include the following component:
|
oldlabel |
An integer vector of the original cluster labels assigned to each sample. |
data_train |
A list representing the training data. This list must include the following components:
|
Details
The change_label_from_highratio_to_lowratio function performs the following steps:
Calculates the count of samples and the count of positive outcomes (where
data_yequals 1) for each cluster.Computes the ratio of positive outcomes within each cluster.
Orders the clusters by their outcome ratios in descending order.
Reassigns cluster labels such that clusters with higher outcome ratios receive lower labels.
Returns the new cluster labels and the mapping from old labels to new labels.
Value
A list containing:
-
new_list_c: A tensor of the new cluster labels, with clusters relabeled according to the outcome ratios. -
order_c_map: A named vector that maps the old cluster labels to the new ones, where the names represent the original labels and the values represent the new labels.
Examples
## Called internally by DICEr() after each k-means step.
## Requires a trained model object; see ?DICEr for a full usage example.
Analyze Test Error and Performance Metrics for a Model
Description
This function evaluates a trained model on a test dataset, calculating various performance metrics including autoencoder loss, outcome likelihood, classification accuracy, and AUC score for outcome prediction.
Usage
func_analysis_test_error_D0406(args, model, data_test, dataloader_test)
Arguments
args |
A list of arguments, typically containing configurations such as whether to use CUDA for computation. |
model |
A trained model that will be evaluated. The model should have a |
data_test |
A list representing the test dataset. This list will be updated with predicted cluster assignments. |
dataloader_test |
An object representing the test data loader, which provides batches of test data to the model. |
Details
The func_analysis_test_error_D0406 function performs the following steps:
Sets the model to evaluation mode to ensure proper handling of layers like dropout.
Iterates over the test data batches provided by
dataloader_test.For each batch, it performs a forward pass through the model using the "outcome_logistic_regression" mode, obtaining the encoded representations, decoded outputs, raw cluster predictions, and outcome predictions.
Computes the mean squared error (MSE) for the autoencoder loss, binary cross-entropy (BCE) for the outcome prediction loss, and classification accuracy for cluster prediction.
Accumulates the true outcomes and predicted probabilities for calculating the AUC score.
Updates the
data_testlist with the predicted cluster assignments.Returns the average values of the computed metrics across all test batches.
Value
A list containing the following performance metrics:
-
test_AE_loss: The mean autoencoder loss (MSE) across all batches in the test dataset. -
test_classifier_c_accuracy: The classification accuracy of the model in predicting cluster assignments. -
test_outcome_likelihood: The mean binary cross-entropy loss for the outcome prediction. -
outcome_auc_score: The AUC (Area Under the Curve) score for the outcome prediction, which measures the model's ability to distinguish between positive and negative classes.
Examples
## Called internally by DICEr() to evaluate the held-out test set.
## Requires a trained DICEr model object; see ?DICEr for a full example.
Main Function for Training and Evaluating a Deep Learning Model
Description
This function orchestrates the training and evaluation of a deep learning model, specifically for autoencoder-based representation learning followed by clustering and classification. It handles data loading, model training, testing, and saving the best model based on performance criteria.
Usage
main(args)
Arguments
args |
A list of arguments that configure the model training and evaluation process. The list should include:
|
Details
The main function executes the following steps:
Sets the random seed for reproducibility.
Loads and preprocesses the training and test datasets.
Initializes the model, optimizer, and loss functions.
Trains an autoencoder for representation learning, saving the model and plotting loss curves.
Performs k-means clustering on the learned representations and reassigns cluster labels based on outcome ratios.
Iteratively trains the model for clustering, classification, and outcome prediction, optimizing the combined loss function.
Saves the best model based on the outcome prediction likelihood and checks for p-value significance.
Outputs and saves relevant training metrics, including loss curves and model checkpoints.
Value
Called for its side effects: trains the model and saves results to
disk. Returns NULL invisibly.
Examples
## See ?DICEr for a complete usage example.
## main() is an alias for DICEr() kept for backwards compatibility.
model_2 Neural Network Module
Description
A neural network module implementing a multi-functional model with
encoder-decoder architecture using the torch package.
The module is constructed using nn_module() and configured
via its initialize() method.
Usage
model_2(input_size, nhidden, nlayers, dropout, n_clusters, n_dummy_demov_fea, para_cuda)
Arguments
input_size |
Integer. Number of features in the input data. |
|
Integer. Number of hidden units in each LSTM layer. | |
nlayers |
Integer. Number of LSTM layers. |
dropout |
Numeric. Dropout probability between LSTM layers (0–1). |
n_clusters |
Integer. Number of clusters/classes. |
n_dummy_demov_fea |
Integer. Number of demographic dummy features. |
para_cuda |
Logical. Whether to use CUDA (GPU acceleration). The |
Details
After initialization, the module can be used via
model$forward(x, function_name, demov = NULL, mask_BoolTensor = NULL).
Arguments:
-
x: Input tensor. -
function_name: One of "autoencoder", "get_representation", "classifier", "outcome_logistic_regression". -
demov: Optional demographic tensor. -
mask_BoolTensor: Optional boolean mask tensor.
Value
An nn_module object.
Examples
if (requireNamespace("torch", quietly = TRUE)) {
mod <- model_2(
input_size = 10L, nhidden = 64L, nlayers = 2L, dropout = 0.1,
n_clusters = 3L, n_dummy_demov_fea = 2L, para_cuda = FALSE
)
}
Calculate P-Value for Logistic Regression Model
Description
This function calculates the p-value for a logistic regression model by comparing the likelihoods of a full model and a null model.
Usage
p_value_calculate(X, y, is_intercept, X_null = NULL)
Arguments
X |
A matrix of predictor variables for the full model. |
y |
A binary response vector, with values of 0 or 1. |
is_intercept |
A logical value indicating whether the null model should include only an intercept (i.e., no predictors). If |
X_null |
(Optional) A matrix of predictor variables for the null model. This argument is required if |
Details
The p_value_calculate function performs the following steps:
Fits a full logistic regression model using
Xas the predictor matrix.Depending on the value of
is_intercept, it either:Calculates the null model's log-likelihood using only an intercept (if
is_intercept = TRUE).Fits a null logistic regression model using
X_nullas the predictor matrix (ifis_intercept = FALSE).
Computes the likelihood ratio statistic
G, which is twice the difference in log-likelihoods between the full and null models.Calculates the p-value using the chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the full and null models.
Value
A numeric value representing the p-value from the likelihood ratio test between the full and null models.
Examples
# Example with only an intercept in the null model
X <- matrix(rnorm(100 * 5), ncol = 5)
y <- rbinom(100, 1, 0.5)
p_value <- p_value_calculate(X, y, is_intercept = TRUE)
print(p_value)
# Example with a specified null model
X_null <- X[, 1:2] # Using only the first two predictors in the null model
p_value <- p_value_calculate(X, y, is_intercept = FALSE, X_null = X_null)
print(p_value)
Parse Command-Line Arguments for PPD-Aware Clustering
Description
This function parses command-line arguments for a script that performs PPD-aware clustering, setting up various parameters necessary for training and evaluating a model.
Usage
parse_args()
Details
The parse_args function uses the argparser package to define and parse a series of command-line arguments. These arguments are essential for configuring various aspects of the PPD-aware clustering process, including model architecture, training settings, and input/output paths.
The following command-line arguments are supported:
-
--init_AE_epoch: Integer. Number of epochs for representation initialization. -
--n_hidden_fea: Integer. Number of hidden units in the LSTM layers. -
--output_path: Character. Location where output results will be saved. -
--input_path: Character. Location of the input dataset. -
--filename_train: Character. Path to the training data file. -
--filename_test: Character. Path to the test data file. -
--n_input_fea: Integer. Number of original input features. -
--n_dummy_demov_fea: Integer. Number of dummy demographic features. -
--lstm_layer: Integer. Number of layers in the LSTM. Default is 1. -
--lr: Numeric. Learning rate for training. Default is 1e-4. -
--lstm_dropout: Numeric. Dropout rate for the LSTM layers. Default is 0.0. -
--K_clusters: Integer. Number of initial clusters. -
--iter: Integer. Maximum number of iterations for merging clusters. Default is 20. -
--epoch_in_iter: Integer. Number of epochs per iteration when merging clusters. Default is 1. -
--seed: Integer. Random seed for reproducibility. Default is 1111. -
--cuda: Integer. Flag indicating whether to use CUDA (GPU acceleration). Default is 1. -
--lambda_AE: Numeric. Regularization weight for the autoencoder loss in each iteration. Default is 1.0. -
--lambda_classifier: Numeric. Regularization weight for the classifier loss in each iteration. Default is 1.0. -
--lambda_outcome: Numeric. Regularization weight for the outcome loss in each iteration. Default is 1.0. -
--lambda_p_value: Numeric. Regularization weight for the p-value adjustment in each iteration. Default is 1.0.
Value
A list of parsed command-line arguments, each corresponding to a specific parameter required by the clustering model. This list is typically used to configure the model training and evaluation process.
Examples
## parse_args() reads from the command line; use the args list interface
## directly when calling DICEr() from within R (see ?DICEr).
args <- parse_args()
Evaluate Autoencoder Model on Test Data
Description
This function evaluates the performance of an autoencoder model on a test dataset, calculating the mean squared error (MSE) between the original inputs and the reconstructed outputs.
Usage
test_AE(args, model, dataloader_test)
Arguments
args |
A list of arguments, typically containing configurations such as whether to use CUDA for computation. |
model |
A trained autoencoder model that will be evaluated. The model is expected to have a |
dataloader_test |
A list representing the test data loader. Each element in the list should be a batch containing test data that the model will evaluate. |
Details
The test_AE function performs the following steps:
Initializes the mean squared error (MSE) loss criterion.
Sets the model to evaluation mode to ensure layers like dropout are appropriately handled.
Iterates over the test data batches provided by
dataloader_test.For each batch, the function:
Extracts the data, ensuring it is in numeric format and reshapes it as necessary.
Converts the data into torch tensors, moving them to CUDA if specified.
Performs a forward pass through the model in autoencoder mode, obtaining both the encoded representation and the reconstructed output.
Computes the MSE loss between the input and the reconstructed output.
Accumulates the loss for each batch.
Computes the mean of the accumulated losses to return the overall test error.
Value
A numeric value representing the mean test error (MSE) across all batches in the test dataset.
Examples
## Called internally by DICEr() after each autoencoder warm-up epoch.
## Requires a trained EncoderRNN/DecoderRNN model; see ?DICEr.
Update Cluster Embeddings in Matrix M
Description
This function updates the matrix M, which contains cluster embeddings, based on the current cluster assignments and representations in the training data.
Usage
update_M(data_train)
Arguments
data_train |
A list containing the training data. The list must include the following components:
|
Details
The update_M function works as follows:
It first checks that the number of columns in the
Mmatrix matches the number of clusters (n_cat). If not, an error is thrown.It then initializes a list of lists,
dict_c_embedding, where each sublist will hold the embeddings for samples belonging to a specific cluster.The function iterates over all samples, adding their representations to the corresponding cluster sublist in
dict_c_embedding.For each cluster, the function computes the mean of the embeddings across all samples assigned to that cluster.
The computed mean embedding is then used to update the corresponding column in the
Mmatrix.
Value
The function updates the M matrix in-place within the data_train list, reflecting the new average embeddings for each cluster.
Examples
## Called internally by DICEr() to refresh cluster centroids each iteration.
## Requires a data_train list with rep, C, n_cat, and M fields; see ?DICEr.
Update Test Set Representations, Cluster Assignments, and Cluster Centers
Description
This function updates the representations, cluster assignments, and cluster centers for a test dataset based on a trained model and a reference training dataset.
Usage
update_testset_R_C_M_K(args, model, data_test, dataloader_test, data_train)
Arguments
args |
A list of arguments, typically containing configurations such as the number of hidden features ( |
model |
A trained model, typically an autoencoder, that will be used to generate new representations for the test data. |
data_test |
A list representing the test dataset. This list will be updated with new representations, cluster assignments, and cluster centers. |
dataloader_test |
An object representing the test data loader, which provides batches of test data to the model. |
data_train |
A list representing the training dataset, which contains reference cluster centers and other attributes needed for updating the test dataset. |
Details
The update_testset_R_C_M_K function performs the following steps:
Initializes a tensor,
final_embed, to store the embeddings for the test dataset.Sets the model to evaluation mode.
Iterates through the test data batches using
dataloader_test, processing each batch through the model to obtain the embeddings.Updates the
repattribute ofdata_testwith the new embeddings.Updates
data_test$n_catanddata_test$Mto match the training dataset.For each test sample, computes the closest cluster center (based on Euclidean distance) and assigns the corresponding cluster to
pred_C.
Value
The function updates the following attributes within the data_test list:
-
rep: The new representations (embeddings) for the test dataset. -
n_cat: The number of clusters, set to match the training dataset. -
M: The cluster centers, copied from the training dataset. -
pred_C: The predicted cluster assignments for each test sample, based on the updated representations and cluster centers.
Examples
## Called internally by DICEr() to assign test-set cluster labels.
## Requires a trained DICEr model and data objects; see ?DICEr.