This vignette walks through a complete DICErClust workflow on the UCI Heart Failure Clinical Records dataset (Chicco & Jurman 2020). By the end you will have:
Approximate run time on a single CPU core: 15–20 minutes (the training step is the bottleneck; all other steps are fast).
DICErClust is distributed as a source tarball. Install it once with
install.packages(), then load it like any other
package.
## Install from local tarball (run once):
# install.packages(
# "/path/to/DICErClust_0.1.1.tar.gz",
# repos = NULL, type = "source"
# )
library(DICErClust)
library(ggplot2)
library(pROC)Chicco & Jurman (2020) collected 299 heart-failure patients from
the Faisalabad Institute of Cardiology in Pakistan. The dataset contains
12 clinical features and a binary outcome DEATH_EVENT (1 =
died during follow-up; 0 = survived). It is freely available from the
UCI Machine Learning Repository (dataset #519).
hf_url <- paste0(
"https://archive.ics.uci.edu/ml/",
"machine-learning-databases/00519/",
"heart_failure_clinical_records_dataset.csv"
)
hf_dest <- tempfile(fileext = ".csv")
download.file(hf_url, hf_dest, quiet = TRUE)
hf <- read.csv(hf_dest)
cat(sprintf("Rows: %d Columns: %d\n", nrow(hf), ncol(hf)))
print(table(DEATH_EVENT = hf$DEATH_EVENT))The dataset is small but class-imbalanced: roughly 68% of patients survived (DEATH_EVENT = 0) and 32% died (DEATH_EVENT = 1).
DICEr expects the input to be split into two feature matrices:
| Matrix | Contents | Role |
|---|---|---|
data_x |
Continuous laboratory / physiological measurements | Encoder input: the LSTM compresses these into a low-dimensional latent representation used for clustering |
data_v |
Binary demographic / comorbidity indicators | Auxiliary outcome-head input: the likelihood-ratio test assesses whether the cluster explains outcome above and beyond these covariates |
This design mirrors real-world EHR practice: data_x
captures time-varying lab values while data_v captures
static patient characteristics.
## Continuous lab features → LSTM encoder (data_x)
x_cols <- c("age", "creatinine_phosphokinase", "ejection_fraction",
"platelets", "serum_creatinine", "serum_sodium", "time")
## Binary demographic indicators → outcome head (data_v)
v_cols <- c("anaemia", "diabetes", "high_blood_pressure", "sex", "smoking")
## Min-max scale continuous features to [0, 1].
## Scaling prevents any single lab value from dominating the MSE
## reconstruction loss relative to others.
scale_01 <- function(x) {
r <- range(x, na.rm = TRUE)
if (diff(r) == 0) return(x * 0)
(x - r[1]) / diff(r)
}
X_x <- apply(as.matrix(hf[, x_cols]), 2, scale_01) # 299 × 7, numeric
X_v <- apply(as.matrix(hf[, v_cols]), 2, as.numeric) # 299 × 5, binary as float
## Note: data_v *must* be stored as numeric (float), not integer.
## torch_tensor() infers dtype from R storage mode; integer columns produce
## int64 tensors that are incompatible with the float32 model weights.
cat(sprintf("data_x: %d × %d\ndata_v: %d × %d\n",
nrow(X_x), ncol(X_x), nrow(X_v), ncol(X_v)))
n_x <- ncol(X_x) # 7 continuous predictors
n_v <- ncol(X_v) # 5 binary demographics
outcome <- hf$DEATH_EVENTA 70 / 30 stratified split preserves the ~32% event rate in both partitions, which is important given the small sample size.
set.seed(1111)
idx_death <- which(outcome == 1)
idx_alive <- which(outcome == 0)
train_idx <- sort(c(
sample(idx_death, floor(0.70 * length(idx_death))),
sample(idx_alive, floor(0.70 * length(idx_alive)))
))
test_idx <- setdiff(seq_len(nrow(hf)), train_idx)
cat(sprintf("Train: %d patients (deaths: %d, %.0f%%)\n",
length(train_idx), sum(outcome[train_idx]),
100 * mean(outcome[train_idx])))
cat(sprintf("Test : %d patients (deaths: %d, %.0f%%)\n",
length(test_idx), sum(outcome[test_idx]),
100 * mean(outcome[test_idx])))DICEr() reads training and test sets from RDS files.
Each file must contain a length-3 list:
data_x — numeric matrix, shape n ×
pdata_v — numeric matrix, shape n ×
qdata_y — integer vector of 0/1 outcomes, length
ndata_dir <- file.path(tempdir(), "dice_hf")
dir.create(data_dir, showWarnings = FALSE)
saveRDS(
list(X_x[train_idx, ], X_v[train_idx, ], as.integer(outcome[train_idx])),
file.path(data_dir, "hf_train.rds")
)
saveRDS(
list(X_x[test_idx, ], X_v[test_idx, ], as.integer(outcome[test_idx])),
file.path(data_dir, "hf_test.rds")
)The argument list controls both the architecture and the training schedule.
args <- list(
seed = 1111, # reproducibility seed
input_path = data_dir, # directory containing RDS files
filename_train = "hf_train.rds",
filename_test = "hf_test.rds",
## ── Architecture ──────────────────────────────────────────
n_input_fea = n_x, # 7 continuous LSTM input features
n_hidden_fea = 4, # LSTM latent dimension (7 → 4)
lstm_layer = 1, # single LSTM layer
lstm_dropout = 0.0, # no dropout (small dataset)
K_clusters = 2, # binary risk partition: high vs. low
## ── Auxiliary features ────────────────────────────────────
n_dummy_demov_fea = n_v, # 5 binary demographic covariates
## ── Hardware ──────────────────────────────────────────────
cuda = FALSE, # set TRUE for GPU acceleration
## ── Optimiser ─────────────────────────────────────────────
lr = 1e-4, # Adam learning rate
## ── Training schedule ─────────────────────────────────────
init_AE_epoch = 5, # Stage 1: autoencoder warm-up epochs
iter = 30, # Stage 2: number of clustering iterations
epoch_in_iter = 2, # gradient-update epochs per iteration
## ── Loss weights ──────────────────────────────────────────
## Combined loss: L = λ_AE·L_AE + λ_clf·L_classifier
## + λ_out·L_outcome + λ_p·L_p_value
## L_p_value = 3.841 − G penalises non-significant cluster configurations
## (G is the LRT statistic; 3.841 is the χ²(1) critical value at α = 0.05)
lambda_AE = 1.0,
lambda_classifier = 1.0,
lambda_outcome = 1.0,
lambda_p_value = 1.0
)DICEr runs three sequential stages:
LSTM autoencoder warm-up
(init_AE_epoch epochs): trains the encoder and decoder to
reconstruct data_x, establishing a compact latent
representation before any clustering begins.
Joint optimisation (iter iterations
× epoch_in_iter epochs each): alternates between
data_y = 1
rate is relabelled cluster 0 (the high-risk cluster).Model selection: saves the checkpoint with the lowest test negative log-likelihood that also satisfies the likelihood-ratio test p < 0.05 between at least one cluster pair.
## DICEr writes output files relative to the working directory.
## We temporarily switch to tempdir() to keep them self-contained.
old_wd <- setwd(tempdir())
suppressWarnings(DICEr(args))
setwd(old_wd)Output is written to hn_4_K_2/part2_AE_nhidden_4/
relative to the working directory used during training. The key files
are:
data_train_iter.rds — training-set data frame with
cluster assignments (C)data_test_iter.rds — test-set data frame with cluster
assignments (pred_C)part2_dir <- file.path(tempdir(), "hn_4_K_2", "part2_AE_nhidden_4")
if (!file.exists(file.path(part2_dir, "data_train_iter.rds"))) {
stop(
"No checkpoint found — the p < 0.05 criterion was not met in ",
args$iter, " iterations. Increase args$iter and rerun."
)
}
res_train <- readRDS(file.path(part2_dir, "data_train_iter.rds"))
res_test <- readRDS(file.path(part2_dir, "data_test_iter.rds"))Note on
Cvspred_C:data_test$Cis initialised to 0 insideDICEr()and is never updated during training. Always usedata_test$pred_C(nearest-centroid assignments) for test-set evaluation.
The code below uses pre-computed results from the reference run (seed
= 1111, 30 iterations) so the vignette builds without retraining. When
you run DICEr() yourself the results will be loaded from
the checkpoint you just produced.
k-means spatial boundaries may assign a different cluster index to the high-mortality group in training versus test, so labels should be assigned within each split based on the observed death rate.
label_by_rate <- function(df) {
rates <- tapply(df$death, df$cluster, mean)
hi <- as.integer(names(which.max(rates)))
df$Cluster <- factor(
ifelse(df$cluster == hi, "High-risk", "Low-risk"),
levels = c("High-risk", "Low-risk")
)
df
}
train_df <- label_by_rate(train_df)
test_df <- label_by_rate(test_df)summarise_clusters <- function(df, split_name) {
do.call(rbind, lapply(split(df, df$Cluster), function(d) {
data.frame(
Split = split_name,
Cluster = as.character(d$Cluster[1]),
N = nrow(d),
Deaths = sum(d$death),
DeathRate = round(mean(d$death), 3)
)
}))
}
cluster_summary <- rbind(
summarise_clusters(train_df, "Train"),
summarise_clusters(test_df, "Test")
)[, c("Split", "Cluster", "N", "Deaths", "DeathRate")]
rownames(cluster_summary) <- NULL
print(cluster_summary)test_score <- as.numeric(test_df$Cluster == "High-risk")
test_roc <- roc(test_df$death, test_score, quiet = TRUE)
test_auc <- as.numeric(auc(test_roc))
cat(sprintf("Test AUC: %.4f\n", test_auc))ct <- table(Cluster = test_df$Cluster, Death = test_df$death)
chisq_res <- suppressWarnings(chisq.test(ct))
print(ct)
cat(sprintf("Chi-squared = %.3f, df = %d, p %s\n",
chisq_res$statistic,
chisq_res$parameter,
ifelse(chisq_res$p.value < 0.001, "< 0.001",
sprintf("= %.4f", chisq_res$p.value))))An AUC of 0.823 and a chi-squared p-value < 0.001 indicate that DICEr has identified two clusters with a strong and statistically significant difference in mortality rate: 71.9% in the High-risk cluster versus 10.3% in the Low-risk cluster.
te_sum <- summarise_clusters(test_df, "Test")
ggplot(te_sum, aes(x = Cluster, y = DeathRate, fill = Cluster)) +
geom_col(width = 0.5, colour = "black", linewidth = 0.4) +
geom_text(aes(label = paste0(Deaths, "/", N)),
vjust = -0.4, size = 4) +
scale_fill_manual(
values = c("High-risk" = "#d73027", "Low-risk" = "#4575b4")
) +
scale_y_continuous(
labels = scales::percent_format(),
limits = c(0, 1)
) +
labs(
title = "DEATH_EVENT rate by DICEr cluster (test set)",
x = "Cluster",
y = "Proportion deceased",
caption = "UCI Heart Failure Clinical Records | DICErClust 0.1.1"
) +
theme_bw(base_size = 13) +
theme(legend.position = "none")roc_df <- data.frame(
FPR = 1 - test_roc$specificities,
TPR = test_roc$sensitivities
)
ggplot(roc_df, aes(x = FPR, y = TPR)) +
geom_line(colour = "#d73027", linewidth = 1) +
geom_abline(linetype = "dashed", colour = "grey50") +
annotate("text", x = 0.55, y = 0.15,
label = sprintf("AUC = %.3f", test_auc),
size = 5, colour = "#d73027") +
labs(
title = "ROC curve — DICEr cluster vs. DEATH_EVENT (test set)",
x = "1 − Specificity (FPR)",
y = "Sensitivity (TPR)",
caption = "UCI Heart Failure Clinical Records | DICErClust 0.1.1"
) +
theme_bw(base_size = 13)| Metric | Value |
|---|---|
| Best checkpoint iteration | 19 |
| LRT p-value at checkpoint | 0.0100 |
| Test negative log-likelihood | 0.6493 |
| Test AUC | 0.823 |
| High-risk mortality (test) | 71.9% (23/32) |
| Low-risk mortality (test) | 10.3% (6/58) |
| Chi-squared statistic | 32.99 |
| Chi-squared p-value | < 0.001 |
Chicco D, Jurman G (2020). “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone.” BMC Medical Informatics and Decision Making, 20, 16. doi:10.1186/s12911-020-1023-5
Dua D, Graff C (2019). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records
Huang Y, Du C, Zhu F, et al. (2021). “Self-supervised deep clustering of patient subgroups for heart failure with preserved ejection fraction.” Journal of the American Medical Informatics Association, 28, 2394–2403. doi:10.1093/jamia/ocab203