---
title: "Getting started with clusteredMSM"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with clusteredMSM}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4
)
```

## Overview

`clusteredMSM` provides nonparametric analysis of clustered multistate
process data, based on Bakoyannis (2021) and Bakoyannis & Bandyopadhyay (2022). It exposes one user-facing
function, `patp()`, that:

* Estimates the population-averaged transition probability
  P(X(t) = j | X(s) = h) using the working-independence
  Aalen-Johansen estimator.
* Returns cluster-bootstrap pointwise confidence intervals and
  simultaneous bands.
* Conducts a two-sample Kolmogorov-Smirnov-type test when the formula
  has a grouping variable on the right-hand side.
* Supports a landmark variant for non-Markov processes.
* Natively handles non-monotone (recovery) multistate models.

The package depends only on `survival`.

## Input data format

`clusteredMSM` expects data in **interval format**: one row per
mutually-exclusive time interval per subject, with columns for
interval start time (`Tstart`), end time (`Tstop`), starting state
(`Sstart`), and ending state (`Sstop`). The column names are flexible
— they are passed by name through the formula.

Censoring is encoded as `Sstart == Sstop` on the **final row** of a
subject's record. Within each subject, intervals must be temporally
contiguous (`Tstop[k] == Tstart[k+1]`) and state contiguous
(`Sstop[k] == Sstart[k+1]`).

## A progressive illness-death example

```{r setup}
library(clusteredMSM)

# 1=Healthy, 2=Ill, 3=Dead. Allowed: 1->2, 1->3, 2->3.
tmat <- trans_mat(
  list(c(2, 3), 3, integer(0)),
  names = c("Healthy", "Ill", "Dead")
)
tmat
```

A small example dataset in interval format:

```{r data}
mydata <- data.frame(
  pid     = c(1, 1, 2, 3, 4, 4),
  site    = c(1, 1, 1, 2, 2, 2),
  treatment = c("A", "A", "A", "B", "B", "B"),
  t0      = c(0.0, 1.5, 0.0, 0.0, 0.0, 1.0),
  t1      = c(1.5, 3.0, 2.0, 1.0, 1.0, 2.5),
  s0      = c(1,   2,   1,   1,   1,   2),
  s1      = c(2,   3,   1,   3,   2,   3)
)
mydata
```

Subject 1: H -> I -> D. Subject 2: censored healthy at t = 2.0
(`s0 == s1`). Subject 3: died at t = 1.0. Subject 4: H -> I -> D.

## One-sample estimation

```{r estimate, eval = FALSE}
fit <- patp(
  msm(t0, t1, s0, s1) ~ 1,
  data    = mydata,
  tmat    = tmat,
  id      = "pid",
  cluster = "site",
  h = 1, j = 2, s = 0,
  B = 1000, cband = TRUE,
  seed = 1
)
fit
```

`fit$curves` contains the time-indexed point estimate of 
\begin{eqnarray*}
P(X(t) = j \,|\, X(s) = h) &=& P(X(t) = 2 \,|\, X(0) = 1) \\
&=& P(X(t) = 2),
\end{eqnarray*}
(second equality due to the fact that all start at state = 1 in the classical illness-death model), standard error, pointwise confidence interval, and (with `cband = TRUE`) simultaneous
band limits.

## Two-sample comparison (estimation + test in one call)

If the formula's right-hand side names a binary grouping variable,
`patp()` automatically estimates both curves and conducts a
Kolmogorov-Smirnov-type test:

```{r test, eval = FALSE}
tt <- patp(
  msm(t0, t1, s0, s1) ~ treatment,
  data    = mydata,
  tmat    = tmat,
  id      = "pid",
  cluster = "site",
  h = 1, j = 2, s = 0,
  B = 1000,
  seed = 1
)
tt
```

The `$curves` slot now has one row block per group. The `$test` slot
contains the observed K-S statistic and the bootstrap p-value.

## Recovery models (non-monotone processes)

For processes with cyclic transitions (e.g., illness-death with
recovery), use the long event format directly. The same `patp()` call
works:

```{r recovery, eval = FALSE}
tmat_rec <- trans_mat(
  list(c(2, 3), c(1, 3), integer(0)),
  names = c("Healthy", "Ill", "Dead")
)

# Subject who went Healthy -> Ill -> Healthy -> censored:
recovery_data <- data.frame(
  pid = c(1, 1, 1),
  t0  = c(0.0, 1.0, 2.0),
  t1  = c(1.0, 2.0, 3.5),
  s0  = c(1,   2,   1),
  s1  = c(2,   1,   1)    # last row: censored healthy
)

patp(msm(t0, t1, s0, s1) ~ 1,
     data = recovery_data, tmat = tmat_rec,
     id = "pid",
     h = 1, j = 2, s = 0,
     B = 500, seed = 1)
```

## Landmark variant

When `s > 0` and the Markov assumption is questionable, use
`LMAJ = TRUE`:

```{r lmaj, eval = FALSE}
patp(msm(t0, t1, s0, s1) ~ 1,
     data = mydata, tmat = tmat,
     id = "pid", cluster = "site",
     h = 1, j = 2, s = 1.0,
     LMAJ = TRUE, B = 1000, seed = 1)
```

## Inverse-cluster-size weighting

When cluster size is potentially informative (e.g., sicker centers contribute more patients), use the weighted estimator:

```{r weighted, eval = FALSE}
patp(msm(t0, t1, s0, s1) ~ 1,
     data = mydata, tmat = tmat,
     id = "pid", cluster = "site",
     h = 1, j = 2, s = 0,
     weighted = TRUE, B = 1000, seed = 1)
```

`weighted = TRUE` requires the `cluster` argument.

## What `B = 0` does

Setting `B = 0` returns the point estimate without any bootstrap
inference. This is permitted only for one-sample formulas; two-sample
formulas always require `B > 0` (since the entire point of a
two-sample analysis is the test). Use `B = 0` for fast exploratory
work, then switch to `B = 1000` for inference.

## References

Bakoyannis G (2021). Nonparametric analysis of nonhomogeneous
multistate processes with clustered observations. *Biometrics*
77(2):533-546. [doi:10.1111/biom.13327](https://doi.org/10.1111/biom.13327)

Bakoyannis G, Bandyopadhyay D (2022). Nonparametric tests for
multistate processes with clustered data. *Annals of the Institute
of Statistical Mathematics* 74(5):837-867.
[doi:10.1007/s10463-021-00819-x](https://doi.org/10.1007/s10463-021-00819-x)

Putter H, Spitoni C (2018). Non-parametric estimation of transition
probabilities in non-Markov multi-state models: the landmark
Aalen-Johansen estimator. *Statistical Methods in Medical Research*
27(7):2081-2092.
[doi:10.1177/0962280216674497](https://doi.org/10.1177/0962280216674497)


