Higher-Order MLE Analysis

library(nabla)

Motivation

The D operator composes to arbitrary order: D(f, x, order = k) returns the \(k\)-th derivative tensor. Gradients (\(k = 1\)) and Hessians (\(k = 2\)) are standard MLE tools. Third-order derivatives unlock additional diagnostic information — in particular, the asymptotic skewness of the MLE.
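
To make the composition concrete, here is a small check on a polynomial with known derivatives (a sketch, assuming orders 1 and 2 reduce to the familiar gradient and Hessian; expected values are in the comments):

f_toy <- function(theta) theta[1]^2 * theta[2]
theta0 <- c(1, 2)
D(f_toy, theta0, order = 1)  # gradient: c(2 * 1 * 2, 1^2) = c(4, 1)
D(f_toy, theta0, order = 2)  # Hessian: rows c(4, 2) and c(2, 0)
D(f_toy, theta0, order = 3)  # 2x2x2 tensor: 2 at each permutation of (1,1,2), 0 elsewhere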

This vignette demonstrates D(f, x, order = 3) on a Gamma MLE problem, computing:

  1. The score and observed information (first and second derivatives)
  2. The third-order derivative tensor (a \(2 \times 2 \times 2\) array)
  3. The asymptotic skewness of the MLE
  4. Monte Carlo validation of the theoretical skewness

The Gamma model

Given \(x_1, \ldots, x_n \sim \text{Gamma}(\alpha, \beta)\) with shape \(\alpha > 0\) and rate \(\beta > 0\), the log-likelihood in terms of sufficient statistics \(S_1 = \sum \log x_i\) and \(S_2 = \sum x_i\) is:

\[\ell(\alpha, \beta) = n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha - 1)S_1 - \beta S_2\]

This model is ideal for third-order AD because every derivative of \(\ell\) has a closed form in polygamma functions, so each AD result can be verified exactly, and because the data enter only through two sufficient statistics, keeping evaluation cheap at any derivative order.

# Simulate data
set.seed(42)
n <- 200
alpha_true <- 3
beta_true <- 2
x_data <- rgamma(n, shape = alpha_true, rate = beta_true)

# Sufficient statistics
S1 <- sum(log(x_data))
S2 <- sum(x_data)

# Log-likelihood using sufficient statistics
ll_gamma <- function(theta) {
  alpha <- theta[1]; beta <- theta[2]
  n * alpha * log(beta) - n * lgamma(alpha) +
    (alpha - 1) * S1 - beta * S2
}
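
As a quick sanity check, the sufficient-statistic form should agree exactly with the sum of dgamma log-densities at any test point:

# Both lines should print the same number
ll_gamma(c(2, 1))
sum(dgamma(x_data, shape = 2, rate = 1, log = TRUE))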

Finding the MLE

optim with AD gradients finds the MLE:

fit <- optim(
  par = c(2, 1),
  fn = function(theta) -ll_gamma(theta),
  gr = function(theta) -gradient(ll_gamma, theta),
  method = "L-BFGS-B",
  lower = c(1e-4, 1e-4)
)
theta_hat <- fit$par
names(theta_hat) <- c("alpha", "beta")
theta_hat
#>    alpha     beta 
#> 2.837162 1.982880
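
The score equation for \(\beta\) gives the closed-form relation \(\hat\beta = n\hat\alpha / S_2\), which the numerical optimum should satisfy up to the optimizer's tolerance:

# From the beta score: n * alpha / beta - S2 = 0 at the MLE
theta_hat[1] * n / S2  # should reproduce theta_hat[2]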

Score and observed information

At the MLE, the score (gradient of the log-likelihood) should be approximately zero. The observed information matrix \(J\) is the negative Hessian.

# Score at MLE — should be near zero
score <- gradient(ll_gamma, theta_hat)
score
#> [1] -5.669643e-05  1.179905e-04
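
The score has closed forms, \(U_\alpha = n\log\beta - n\,\psi(\alpha) + S_1\) and \(U_\beta = n\alpha/\beta - S_2\), so the AD gradient can be checked directly:

# Analytical score at the MLE
score_analytical <- c(
  n * log(theta_hat[2]) - n * digamma(theta_hat[1]) + S1,
  n * theta_hat[1] / theta_hat[2] - S2
)
max(abs(score - score_analytical))  # should be near zero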

# Observed information
H <- hessian(ll_gamma, theta_hat)
J <- -H
cat("Observed information matrix:\n")
#> Observed information matrix:
J
#>            [,1]      [,2]
#> [1,]   84.34221 -100.8634
#> [2,] -100.86337  144.3182

Verify against analytical formulas. For the Gamma model:

alpha_hat <- theta_hat[1]; beta_hat <- theta_hat[2]
J_analytical <- matrix(c(
  n * trigamma(alpha_hat),   -n / beta_hat,
  -n / beta_hat,              n * alpha_hat / beta_hat^2
), nrow = 2, byrow = TRUE)

cat("Max |AD - analytical| in J:", max(abs(J - J_analytical)), "\n")
#> Max |AD - analytical| in J: 0

Standard errors from the inverse information:

J_inv <- solve(J)
se <- sqrt(diag(J_inv))
cat("SE(alpha_hat) =", se[1], "\n")
#> SE(alpha_hat) = 0.2687121
cat("SE(beta_hat)  =", se[2], "\n")
#> SE(beta_hat)  = 0.205423
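
A natural follow-up is a Wald interval built from these standard errors (standard asymptotics, sketched here):

# 95% Wald confidence intervals
cbind(
  lower = theta_hat - qnorm(0.975) * se,
  upper = theta_hat + qnorm(0.975) * se
)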

Third-order derivative tensor

D(f, x, order = 3) returns the third-order derivative tensor \(T_{rst} = \partial^3 \ell / \partial\theta_r\partial\theta_s\partial\theta_t\). For a two-parameter model, this is a \(2 \times 2 \times 2\) array with 8 entries, of which only four are distinct because mixed partials commute.

T3 <- D(ll_gamma, theta_hat, order = 3)
T3
#> , , 1
#> 
#>          [,1]      [,2]
#> [1,] 35.08977   0.00000
#> [2,]  0.00000 -50.86709
#> 
#> , , 2
#> 
#>           [,1]      [,2]
#> [1,]   0.00000 -50.86709
#> [2,] -50.86709 145.56419
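
Since mixed partials of a smooth function commute, the tensor should be invariant under any permutation of its indices; aperm makes this easy to confirm:

# Symmetry check: permute array dimensions and compare
max(abs(T3 - aperm(T3, c(2, 1, 3))))  # swap first two indices; should be 0
max(abs(T3 - aperm(T3, c(3, 2, 1))))  # swap first and third; should be 0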

Verify each entry analytically. For the Gamma model:

Entry                              Formula                    Note
\(T_{111}\)                        \(-n\,\psi''(\alpha)\)     tetragamma
\(T_{112} = T_{121} = T_{211}\)    \(0\)
\(T_{122} = T_{212} = T_{221}\)    \(-n / \beta^2\)
\(T_{222}\)                        \(2n\alpha / \beta^3\)

T3_analytical <- array(0, dim = c(2, 2, 2))
T3_analytical[1, 1, 1] <- -n * psigamma(alpha_hat, deriv = 2)
# T3[1,1,2] and permutations = 0 (already zero)
T3_analytical[1, 2, 2] <- -n / beta_hat^2
T3_analytical[2, 1, 2] <- -n / beta_hat^2
T3_analytical[2, 2, 1] <- -n / beta_hat^2
T3_analytical[2, 2, 2] <- 2 * n * alpha_hat / beta_hat^3

cat("Max |AD - analytical| in T3:", max(abs(T3 - T3_analytical)), "\n")
#> Max |AD - analytical| in T3: 0

★ Insight ───────────────────────────────────── The D(f, x, order = 3) call applies the D operator three times, running \(p = 2\) forward passes at each level for \(2^3 = 8\) total passes. Each pass propagates triply-nested dual numbers through lgamma, which internally chains through digamma, trigamma, and psigamma(alpha, deriv = 2). ─────────────────────────────────────────────────
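
The chain the Insight describes can be checked in isolation on lgamma itself, whose third derivative is the tetragamma function (a sketch, assuming D accepts a length-one parameter vector as it does elsewhere):

# Third derivative of lgamma at 2.5, via AD and via base R
drop(D(function(t) lgamma(t[1]), c(2.5), order = 3))
psigamma(2.5, deriv = 2)  # the two values should agree to machine precision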

Asymptotic skewness of the MLE

For a regular model whose Hessian is non-random (as in the Gamma model, where all second derivatives depend on the parameters alone), the asymptotic skewness of the \(a\)-th component of the MLE is:

\[\gamma_1(\hat\theta_a) = \frac{2 \sum_{r,s,t} J^{-1}_{ar}\, J^{-1}_{as}\, J^{-1}_{at}\, T_{rst}}{\left[J^{-1}_{aa}\right]^{3/2}}\]

The factor 2 collects two appearances of the third-order tensor: the third cumulant of the score, which equals \(-T_{rst}\) by the third Bartlett identity when the Hessian is non-random, and the quadratic term in the expansion of the score equation around the MLE, which contributes \(+3\,T_{rst}\). Since \(J\) and \(T\) both grow linearly in \(n\), the expression is \(O(n^{-1/2})\), vanishing at the usual rate.

mle_skewness <- function(J_inv, T3) {
  p <- nrow(J_inv)
  skew <- numeric(p)
  for (a in seq_len(p)) {
    total <- 0
    for (r in seq_len(p)) {
      for (s in seq_len(p)) {
        for (t in seq_len(p)) {
          total <- total + 2 * J_inv[a, r] * J_inv[a, s] * J_inv[a, t] *
            T3[r, s, t]
        }
      }
    }
    skew[a] <- total / J_inv[a, a]^(3/2)
  }
  skew
}

theoretical_skew <- mle_skewness(J_inv, T3)
cat("Theoretical skewness of alpha_hat:", theoretical_skew[1], "\n")
#> Theoretical skewness of alpha_hat: 0.3975014
cat("Theoretical skewness of beta_hat: ", theoretical_skew[2], "\n")
#> Theoretical skewness of beta_hat:  0.4002118
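
For a single component the triple loop is just a tensor contraction, which base R can express directly (an equivalent form, shown for \(a = 1\)):

v <- J_inv[1, ]  # first row of the inverse information
kappa3 <- 2 * sum(T3 * outer(outer(v, v), v))  # 2 * sum_{r,s,t} v_r v_s v_t T_{rst}
kappa3 / J_inv[1, 1]^(3/2)  # reproduces theoretical_skew[1]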

A positive skewness for \(\hat\alpha\) and \(\hat\beta\) means their sampling distributions are right-skewed — there is a longer tail of overestimates than underestimates. This is typical for shape and rate parameters, which are bounded below by zero.

Monte Carlo validation

To validate the theoretical skewness, simulate many datasets and compute the MLE for each:

skewness <- function(x) {
  m <- mean(x); s <- sd(x)
  mean(((x - m) / s)^3)
}

set.seed(42)
B <- 5000
alpha_hats <- numeric(B)
beta_hats <- numeric(B)

for (b in seq_len(B)) {
  x_b <- rgamma(n, shape = alpha_true, rate = beta_true)
  S1_b <- sum(log(x_b))
  S2_b <- sum(x_b)

  ll_b <- function(theta) {
    a <- theta[1]; be <- theta[2]
    n * a * log(be) - n * lgamma(a) + (a - 1) * S1_b - be * S2_b
  }

  fit_b <- optim(
    par = c(2, 1),
    fn = function(theta) -ll_b(theta),
    method = "L-BFGS-B",
    lower = c(1e-4, 1e-4)
  )
  alpha_hats[b] <- fit_b$par[1]
  beta_hats[b] <- fit_b$par[2]
}

empirical_skew <- c(skewness(alpha_hats), skewness(beta_hats))
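
Under approximate normality, the Monte Carlo standard error of a sample skewness is roughly \(\sqrt{6/B}\), which calibrates how closely the two columns below should agree:

# Rough Monte Carlo standard error of each empirical skewness
sqrt(6 / B)  # about 0.035 with B = 5000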

Compare theoretical vs empirical skewness:

comparison <- data.frame(
  parameter = c("alpha", "beta"),
  theoretical = round(theoretical_skew, 4),
  empirical = round(empirical_skew, 4)
)
comparison
#>   parameter theoretical empirical
#> 1     alpha      0.3975    0.3926
#> 2      beta      0.4002    0.3578

The theoretical predictions (computed from third-order AD) match the Monte Carlo estimates to within Monte Carlo error, validating both the AD computation and the asymptotic skewness formula.

Visualizing MLE asymmetry

The sampling distributions of the MLEs show the asymmetric shape predicted by the skewness analysis:

oldpar <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# alpha_hat
hist(alpha_hats, breaks = 50, freq = FALSE, col = "grey85", border = "grey60",
     main = expression(hat(alpha)),
     xlab = expression(hat(alpha)))
curve(dnorm(x, mean = mean(alpha_hats), sd = sd(alpha_hats)),
      col = "steelblue", lwd = 2, add = TRUE)
abline(v = alpha_true, col = "firebrick", lty = 2, lwd = 2)
legend("topright",
       legend = c("Normal approx", expression(alpha[true])),
       col = c("steelblue", "firebrick"), lty = c(1, 2), lwd = 2,
       bty = "n", cex = 0.8)

# beta_hat
hist(beta_hats, breaks = 50, freq = FALSE, col = "grey85", border = "grey60",
     main = expression(hat(beta)),
     xlab = expression(hat(beta)))
curve(dnorm(x, mean = mean(beta_hats), sd = sd(beta_hats)),
      col = "steelblue", lwd = 2, add = TRUE)
abline(v = beta_true, col = "firebrick", lty = 2, lwd = 2)
legend("topright",
       legend = c("Normal approx", expression(beta[true])),
       col = c("steelblue", "firebrick"), lty = c(1, 2), lwd = 2,
       bty = "n", cex = 0.8)

par(oldpar)

The right tail is visibly heavier than the left, consistent with the positive skewness values. The normal approximation (blue curve) misses this asymmetry — which is precisely what third-order derivative analysis captures.

Summary

Third-order derivatives computed by D(f, x, order = 3) provide information beyond the standard gradient and Hessian: here, the full third-derivative tensor of the log-likelihood, verified entry by entry against polygamma formulas, and from it the asymptotic skewness of each MLE component, confirmed by Monte Carlo simulation.

The D operator makes this analysis composable: D(f, x, order = 3) is just three applications of the same operator that gives gradients and Hessians. No separate code paths, no symbolic algebra, no finite-difference tuning.