Introduction to DICErClust

Sarah Ayton and Yiye Zhang

2026-05-21

What is DICErClust?

DICErClust provides an R implementation of Deep Significance Clustering (DICE), a self-supervised framework that discovers clinically meaningful patient subgroups from electronic health record (EHR) data. Unlike conventional unsupervised clustering, DICE simultaneously optimises four objectives:

  1. Reconstruction fidelity — an LSTM autoencoder learns a compact latent representation of time-varying continuous features.
  2. Cluster cohesion — a soft k-means classifier assigns patients to clusters in the latent space.
  3. Outcome prediction — a logistic regression head predicts a binary clinical outcome from the cluster assignment and auxiliary demographic features.
  4. Statistical significance — a likelihood-ratio test (LRT) penalty ensures at least one cluster pair shows a significantly different outcome rate (p < 0.05) at the saved checkpoint.

The result is a partition into risk-stratified subgroups that are both data-driven and statistically validated.
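
As a rough illustration of how the four objectives can be traded off, the sketch below combines four loss values into one weighted sum. The weight names mirror the lambda_* arguments of DICEr(); the helper combine_losses() and the loss values are hypothetical, and the plain weighted-sum form is an assumption rather than a description of the package internals.

```r
# Hypothetical helper: weighted sum of the four DICE loss terms.
# Assumption: the total objective is a plain weighted sum; the package
# internals may differ.
combine_losses <- function(losses, lambdas) {
  stopifnot(setequal(names(losses), names(lambdas)))
  sum(losses[names(lambdas)] * lambdas)
}

lambdas <- c(AE = 1.0, classifier = 1.0, outcome = 1.0, p_value = 1.0)
# Made-up loss values for one training step
losses  <- c(AE = 0.42, classifier = 0.10, outcome = 0.65, p_value = 0.00)

total <- combine_losses(losses, lambdas)
total
```

Raising one weight relative to the others shifts the optimisation towards that objective.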

Reference: Huang Y, Du C, Zhu F, et al. (2021). Self-supervised deep clustering of patient subgroups for heart failure with preserved ejection fraction. J Am Med Inform Assoc, 28, 2394–2403. doi:10.1093/jamia/ocab203


Installation

## From a local source tarball:
install.packages(
  "/path/to/DICErClust_0.1.1.tar.gz",
  repos = NULL, type = "source"
)

DICErClust depends on the torch package for R. If you have not installed torch before, run torch::install_torch() once after installing the package.
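
If you are unsure whether torch is fully set up, a small check like the one below can help. The helper torch_status() is hypothetical and not part of DICErClust; it only reports the current state and installs nothing.

```r
# Hypothetical helper: report whether the torch R package and its backend
# libraries are available, without installing anything.
torch_status <- function() {
  if (!requireNamespace("torch", quietly = TRUE)) {
    return("torch package not installed")
  }
  if (!torch::torch_is_installed()) {
    return("torch installed, backend missing: run torch::install_torch()")
  }
  "torch ready"
}

torch_status()
```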


Data format

DICEr() reads training and test data from RDS files. Each file must be a length-3 list:

Position  Name    Type                      Description
[[1]]     data_x  numeric matrix, n × p     Continuous features (LSTM encoder input)
[[2]]     data_v  numeric matrix, n × q     Binary demographic covariates (auxiliary input to the outcome head)
[[3]]     data_y  integer vector, length n  Binary outcome (0/1)

Important: data_v must use R numeric (float64) storage, not integer. The torch backend infers tensor dtype from the R storage mode; integer columns produce int64 tensors that are incompatible with the float32 model weights.
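
The requirements above can be checked before training with a small helper. The sketch below is one possible approach; check_dice_input() is a hypothetical function, not part of DICErClust, and it assumes the list is accessed by position as in the table.

```r
# Hypothetical helper: validate a DICEr input list and coerce data_v to
# double storage if it was stored as integer.
check_dice_input <- function(d) {
  stopifnot(is.list(d), length(d) == 3L)
  x <- d[[1]]; v <- d[[2]]; y <- d[[3]]
  stopifnot(is.matrix(x), is.numeric(x), is.matrix(v), is.numeric(v))
  stopifnot(nrow(x) == nrow(v), nrow(x) == length(y))
  stopifnot(all(v %in% c(0, 1)), all(y %in% c(0, 1)))
  # Integer storage would become an int64 tensor; coerce to double.
  if (storage.mode(v) != "double") storage.mode(v) <- "double"
  list(x, v, as.integer(y))
}

bad <- list(
  matrix(runif(12), 4, 3),
  matrix(c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L), 4, 2),  # integer storage
  c(0L, 1L, 0L, 1L)
)
fixed <- check_dice_input(bad)
storage.mode(bad[[2]])    # "integer"
storage.mode(fixed[[2]])  # "double"
```

The coerced list can then be written with saveRDS() in place of the original.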

## Build a minimal synthetic dataset ----------------------------------------
set.seed(42)
n_train <- 120L; n_test <- 40L; p <- 6L; q <- 3L

make_rds <- function(n, path) {
  saveRDS(
    list(
      matrix(runif(n * p), n, p),                   # data_x: continuous
      matrix(as.numeric(rbinom(n * q, 1, 0.5)), n, q), # data_v: binary float
      rbinom(n, 1, 0.3)                             # data_y: outcome
    ),
    path
  )
}

data_dir <- file.path(tempdir(), "dice_intro")
dir.create(data_dir, showWarnings = FALSE)
make_rds(n_train, file.path(data_dir, "train.rds"))
make_rds(n_test,  file.path(data_dir, "test.rds"))

## Verify format
d <- readRDS(file.path(data_dir, "train.rds"))
cat("data_x:", nrow(d[[1]]), "×", ncol(d[[1]]), " storage:", storage.mode(d[[1]]), "\n")
cat("data_v:", nrow(d[[2]]), "×", ncol(d[[2]]), " storage:", storage.mode(d[[2]]), "\n")
cat("data_y: length", length(d[[3]]), " table:", paste(table(d[[3]]), collapse = "/"), "\n")

Quick start

library(DICErClust)

args <- list(
  seed              = 42L,
  input_path        = data_dir,
  filename_train    = "train.rds",
  filename_test     = "test.rds",
  n_input_fea       = p,       # columns in data_x
  n_hidden_fea      = 3L,      # LSTM latent dimension
  lstm_layer        = 1L,
  lstm_dropout      = 0.0,
  K_clusters        = 2L,      # number of clusters
  n_dummy_demov_fea = q,       # columns in data_v
  cuda              = FALSE,   # set TRUE to use GPU
  lr                = 1e-4,
  init_AE_epoch     = 5L,      # Stage 1 warm-up epochs
  iter              = 20L,     # Stage 2 iterations
  epoch_in_iter     = 2L,
  lambda_AE         = 1.0,
  lambda_classifier = 1.0,
  lambda_outcome    = 1.0,
  lambda_p_value    = 1.0
)

old_wd <- setwd(tempdir())
tryCatch(
  DICEr(args),               # writes output to hn_3_K_2/part2_AE_nhidden_3/
  finally = setwd(old_wd)    # restore the working directory even if training fails
)

Loading results

part2_dir <- file.path(tempdir(), "hn_3_K_2", "part2_AE_nhidden_3")

res_train <- readRDS(file.path(part2_dir, "data_train_iter.rds"))
res_test  <- readRDS(file.path(part2_dir, "data_test_iter.rds"))

## Cluster assignments
## Training set: use res_train$C   (k-means labels, re-ordered by outcome rate)
## Test set:     use res_test$pred_C (nearest-centroid assignments)
table(res_test$pred_C)
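
Once you have cluster labels, a natural next step is to compare outcome rates across clusters. The sketch below uses stand-in vectors rather than real DICEr output, because the fields of the result lists other than C and pred_C are not described here; substitute your own labels and the matching outcome vector.

```r
# Stand-in vectors (assumptions): replace `cluster` with the labels from
# res_test$pred_C and `outcome` with the matching binary outcome vector.
cluster <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
outcome <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 0)

# One row per cluster: size and observed outcome rate
cluster_summary <- data.frame(
  cluster      = sort(unique(cluster)),
  n            = as.vector(table(cluster)),
  outcome_rate = as.vector(tapply(outcome, cluster, mean))
)
cluster_summary
```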

Hyperparameters at a glance

Argument           Default  Effect
n_hidden_fea                LSTM latent dimension; controls representation capacity
K_clusters                  Number of clusters
init_AE_epoch      5        Stage 1 warm-up length
iter               20       Maximum Stage 2 iterations
epoch_in_iter      1        Gradient-update epochs per iteration
lr                 1e-4     Adam learning rate
lambda_AE          1.0      Weight on reconstruction loss
lambda_classifier  1.0      Weight on cluster-assignment loss
lambda_outcome     1.0      Weight on outcome BCE loss
lambda_p_value     1.0      Weight on LRT significance penalty

All four lambda weights default to 1.0, so the four loss terms are weighted equally. The LRT significance threshold is fixed at 3.841, the 0.95 quantile of the χ² distribution with 1 degree of freedom (α = 0.05), and is not user-tunable.
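
To make the fixed threshold concrete, here is a minimal base-R sketch of a likelihood-ratio test for a difference in outcome rate between two clusters. It uses simulated data and glm(); it is a generic illustration of the test, not the code DICErClust runs internally, where the test enters training as a penalty.

```r
# Simulated data (assumption): two clusters with different outcome rates.
set.seed(1)
cluster <- factor(rep(c("A", "B"), each = 100))
outcome <- rbinom(200, 1, ifelse(cluster == "A", 0.15, 0.45))

# Nested logistic models: intercept only vs. intercept + cluster indicator
fit_null <- glm(outcome ~ 1,       family = binomial)
fit_alt  <- glm(outcome ~ cluster, family = binomial)

# LRT statistic: drop in deviance between the nested models (1 df)
lrt_stat  <- deviance(fit_null) - deviance(fit_alt)
threshold <- qchisq(0.95, df = 1)   # 3.841459
p_value   <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

c(statistic = lrt_stat, threshold = threshold, p_value = p_value)
lrt_stat > threshold   # TRUE means the two clusters differ at alpha = 0.05
```

Comparing the statistic with the threshold is equivalent to comparing the p-value with 0.05.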


Output directory structure

After a successful run, DICEr creates:

<working_dir>/
└── hn_<n_hidden>_K_<K>/
    ├── part1_AE_nhidden_<n>/          # Stage 1 autoencoder outputs
    │   └── part1_loss_AE.png
    └── part2_AE_nhidden_<n>/          # Stage 2 best checkpoint
        ├── data_train_iter.rds        # training set with C assignments
        └── data_test_iter.rds         # test set with pred_C assignments
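
The naming pattern above can be turned into a path programmatically, which is convenient when looping over several settings. The helper dice_part2_dir() below is hypothetical and not part of DICErClust; it only assembles the path and checks for the two result files.

```r
# Hypothetical helper: build the Stage 2 output directory from the naming
# pattern hn_<n_hidden>_K_<K>/part2_AE_nhidden_<n_hidden>.
dice_part2_dir <- function(base_dir, n_hidden, K) {
  file.path(
    base_dir,
    sprintf("hn_%d_K_%d", n_hidden, K),
    sprintf("part2_AE_nhidden_%d", n_hidden)
  )
}

out_dir  <- dice_part2_dir(tempdir(), n_hidden = 3L, K = 2L)
expected <- c("data_train_iter.rds", "data_test_iter.rds")

# Named logical vector: which result files are present in out_dir
setNames(file.exists(file.path(out_dir, expected)), expected)
```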

data_train_iter.rds and data_test_iter.rds are the input data lists enriched with cluster-assignment fields: C in the training file and pred_C in the test file.


Full worked example

For a complete end-to-end analysis on the UCI Heart Failure Clinical Records dataset — including preprocessing, training, cluster evaluation (AUC = 0.823, χ² = 32.99, p < 0.001), and publication-quality figures — see:

vignette("heart-failure-example", package = "DICErClust")