modified: Tue Jan 28 20:29:29 2020 compiled: Mon Apr 27 20:02:24 2020

1 Introduction

Power and sample size analysis or sample size determination is concerned with the question of determining the minimum number of samples necessary to demonstrate the existence (or absence) of a difference between two or more populations of interest. The number of samples should be sufficient in that the statistical test will reject the null hypothesis, when there really exists a difference, with high probability or power.

Sample size determination for experiments involving high-dimensional data has several challenges as a multiple testing problem is involved, furthermore the occurrence of differentially and nondifferentialy expressed genes and that each gene individually has an effect size and a standard deviation leading to a distribution of effect sizes and standard deviations complicates the problem even further.

Power and sample size analysis for high-dimensional data was initiated by (Lee and Whitmore 2002). The multiple testing problem was controlled using the easy to apply family-wise error rate. Since controlling the family-wise error rate is too conservative for microarray data, methods were proposed that control the multiple testing problem using the false discovery rate (Pawitan et al. 2005; Liu and Hwang 2007; Tong and Zhao 2008). Recently, those methods were adapted for using pilot-data in order to obtain more realistic estimates of the power (Ferreira and Zwinderman 2006; Ruppert, Nettleton, and Hwang 2007; Midelfart and Bones 2008).

This vignette describes SSPA, a package implementing a general pilot data-based method for power and sample size determination for high-dimensional genomic data, such as those obtained from microarray or next-generation sequencing experiments. The method is based on the work of Ferreira and Zwinderman (Ferreira and Zwinderman 2006) and generalized as described by van Iterson et al. (Iterson et al. 2009, 2013).

By means of two simple commands (pilotData(), sampleSize()), a researcher can read in their data –a vector of test statistics– and compute the desired estimates. Other functions are provided that facilitate interpretation of results. Simulation data is used to show the general functionality and flexibility of the package and two interesting biological case study are shown.

2 Simulated data

This simulation study represents a two group microarray experiment with \(m = 5000\) features and \(N = 10\) samples per group. Let \(R_{ij}\) and \(S_{ij}\), \(i = 1, \ldots, m\), \(j = 1, \ldots, N\) be random variables where \(S_{ij} \sim N(\frac{-\mu_i}{2}, 1)\) and \(R_{ij} \sim N(\frac{\mu_i}{2}, 1)\) with the first \(\mu_{i} = 0\) for \(i = 1, \ldots, m_0\) and the remaining \(\mu_i\) for \(i = m_0 + 1, \ldots, m\) were drawn from a symmetric bi-triangular distribution as described by (Langaas, Lindqvist, and Ferkingstad 2005). A total of \(25\) data sets were simulated with the proportion of non-differentially expressed features fixed at \(\pi_0 = 0.8\). Additional technical noise is added to the model using \(X \sim \log(e^R + e^\epsilon)\) and \(Y \sim \log(e^S + e^\epsilon)\) with \(\epsilon \sim N(0, 0.1^2)\), so that the noise is positive and additive.

library(SSPA)
library(genefilter)
set.seed(12345)
m <- 5000
J <- 10
pi0 <- 0.8
m0 <- as.integer(m * pi0)
mu <- rbitri(m - m0,
             a = log2(1.2),
             b = log2(4),
             m = log2(2))
data <- simdat(
  mu,
  m = m,
  pi0 = pi0,
  J = J,
  noise = 0.01
)
statistics <-
  rowttests(data, factor(rep(c(0, 1), each = J)))$statistic

2.1 Deconvolution

The first step in doing the sample size and power analysis is creating a object of class PilotData which will contain all the necessary information needed for the following power and sample size analysis; a vector of test-statistics and the sample sizes used to compute them. A user-friendly interface for creating an object of PilotData is available as pilotData().

Several ways of viewing the content of the PilotData object are possible either graphically or using a show-method by just typing the name of the created object of PilotData:

pdD <- pilotData(
  statistics = statistics,
  samplesize = sqrt(1 / (1 / J + 1 / J)),
  distribution = "norm"
)
pdD
## Formal class 'PilotData' [package "SSPA"] with 5 slots
##   ..@ statistics  : num [1:5000] -1.98854 -0.00589 -1.26628 0.74421 -0.76088 ...
##   ..@ samplesize  : num 2.24
##   ..@ pvalues     : num [1:5000] 0.0468 0.9953 0.2054 0.4567 0.4467 ...
##   ..@ distribution: chr "norm"
##   ..@ args        : list()
plot(pdD)
Exploratory plots of the pilot-data: Upper left panel shows a histogram of the test statistics, upper right panel a histogram of the p-values. Lower left panel shows a qq-plot for the test statistics using the user-defined null distributions. Lower right panel shows the sorted p-values against their ranks.

Figure 1: Exploratory plots of the pilot-data: Upper left panel shows a histogram of the test statistics, upper right panel a histogram of the p-values
Lower left panel shows a qq-plot for the test statistics using the user-defined null distributions. Lower right panel shows the sorted p-values against their ranks.

Now we can create an object of class SampleSize which will perform the estimation of the proportion of non-differentially expressed genes and the density of effect sizes. Several options are available see ?sampleSize.

The density of effect size can be shown by a call to plot(). When there are both up- and down-regulated genes a bimodal density is observed.

ssD <- sampleSize(pdD, control = list(from = -6, to = 6))
ssD
## Formal class 'SampleSize' [package "SSPA"] with 10 slots
##   ..@ pi0         : num 0.781
##   ..@ lambda      : num [1:512] 0 0 0 0 0 0 0 0 0 0 ...
##   ..@ theta       : num [1:512] -6 -5.98 -5.95 -5.93 -5.91 ...
##   ..@ control     :List of 11
##   .. ..$ method    : chr "deconv"
##   .. ..$ pi0Method : chr "Langaas"
##   .. ..$ pi0       : num [1:90] 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 ...
##   .. ..$ adjust    : logi TRUE
##   .. ..$ a         : num 0.5
##   .. ..$ bandwith  : NULL
##   .. ..$ kernel    : chr "fan"
##   .. ..$ from      : num -6
##   .. ..$ to        : num 6
##   .. ..$ resolution: num 512
##   .. ..$ verbose   : logi FALSE
##   ..@ info        :List of 1
##   .. ..$ pi0: num 0.767
##   ..@ statistics  : num [1:5000] -1.98854 -0.00589 -1.26628 0.74421 -0.76088 ...
##   ..@ samplesize  : num 2.24
##   ..@ pvalues     : num [1:5000] 0.0468 0.9953 0.2054 0.4567 0.4467 ...
##   ..@ distribution: chr "norm"
##   ..@ args        : list()
plot(
  ssD,
  panel = function(x, y, ...)
  {
    panel.xyplot(x, y)
    panel.curve(dbitri(x),
                lwd = 2,
                lty = 2,
                n = 500)
  },
  ylim = c(0, 0.6)
)