This is an unfinished port of the package’s “old” Sweave/PDF vignette into Rmarkdown/HTML. There are still bugs in here, such as unevaluated LaTeX markup (including footnotes), unmapped cross-references and figure formatting problems. Please bear with me. If in doubt, please consult the existing PDF vignette.

1 Getting started

VSN is a method to preprocess microarray intensity data. This can be as simple as

library("vsn")
data("kidney")
xnorm = justvsn(kidney)

where kidney is an ExpressionSet object with unnormalised data and xnorm the resulting ExpressionSet with calibrated and glog\(_2\)-transformed data.

M = exprs(xnorm)[,1] - exprs(xnorm)[,2]

produces the vector of generalised log-ratios between the data in the first and second column.

VSN is a model-based method, and the more explicit way of doing the above is

fit = vsn2(kidney)
ynorm = predict(fit, kidney)

where fit is an object of class vsn that contains the fitted calibration and transformation parameters, and the method predict applies the fit to the data. The two-step protocol is useful when you want to fit the parameters on a subset of the data, e.g. a set of control or spike-in features, and then apply the model to the complete set of data (see Section~ for details). It also lets you inspect the fit object, e.g. for the purpose of quality assessment.
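
The two-step protocol on a subset of features can be sketched as follows. The kidney data carry no annotated control or spike-in features, so the row subset `sel` below is a hypothetical stand-in (a random sample of rows) that only serves to show the mechanics; the names `sel`, `fitsub` and `ysub` are likewise made up for this illustration.

```r
library("vsn")
data("kidney")

## Hypothetical stand-in for a set of control features: kidney has no
## annotated spike-ins, so we simply sample 1000 rows at random.
set.seed(1)
sel = sample(nrow(kidney), 1000)

## Step 1: fit calibration and transformation parameters on the subset only.
fitsub = vsn2(kidney[sel, ])

## Step 2: apply the fitted model to the complete data set.
ysub = predict(fitsub, newdata = kidney)
dim(exprs(ysub))
```

With real control features, `sel` would be replaced by the indices of those features; whether a random subset is a sensible choice depends on the experiment.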

Besides ExpressionSets, there are also justvsn methods for AffyBatch objects from the affy package and RGList objects from the limma package. They are described in this vignette.

The so-called glog\(_2\) (short for generalised logarithm) is a function that behaves like the logarithm (base 2) for large values (large compared to the amplitude of the background noise), but is less steep for smaller values. Differences between the transformed values are the generalised log-ratios. These are shrinkage estimators of the logarithm of the fold change. The usual log-ratio is another example of an estimator of the log fold change. There is also a close relationship between background correction of the intensities and the variance properties of the different estimators. Please see Section~ for more explanation of these issues.
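
To make the behaviour of a generalised logarithm concrete, here is a minimal base-R sketch. The function `glog2` and its parameter `c0` are assumptions made for illustration: this is one common form of a generalised logarithm, not necessarily the exact parametrisation that vsn fits.

```r
## Illustrative generalised logarithm (base 2). The parameter `c0` is a
## hypothetical noise scale, not a vsn argument.
glog2 = function(x, c0 = 1) log2((x + sqrt(x^2 + c0^2)) / 2)

x = c(-2, 0, 1, 10, 1000)
suppressWarnings(log2(x))   # NaN and -Inf for the non-positive entries
glog2(x)                    # finite everywhere, close to log2(x) for large x

## Differences of transformed values are generalised log-ratios.
glog2(1000) - glog2(10)     # close to log2(1000/10) for large intensities
glog2(3) - glog2(0.5)       # shrunk towards 0 relative to log2(3/0.5)
```

The last two lines show the shrinkage property: for intensities well above `c0` the generalised log-ratio is close to the usual log-ratio, while near zero it is pulled towards 0.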

How does VSN work? There are two components: First, an affine transformation whose aim is to calibrate systematic experimental factors such as labelling efficiency or detector sensitivity. Second, a glog\(_2\) transformation whose aim is variance stabilisation.

An affine transformation is simply a shifting and scaling of the data, i.e. a mapping of the form \(x\mapsto (x-a)/s\) with offset \(a\) and scaling factor \(s\). By default, a different offset and a different scaling factor are used for each column, but the same for all rows within a column. There are two parameters of the function vsn2 to control this behaviour: With the parameter strata, you can ask vsn2 to choose different offset and scaling factors for different groups (“strata”) of rows. These strata could, for example, correspond to sectors on the array. With the parameter calib, you can ask vsn2 to choose the same offset and scaling factor throughout. This can be useful, for example, if the calibration has already been done by other means, e.g. quantile normalisation.
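
The column-wise affine mapping \(x\mapsto (x-a)/s\) can be illustrated in base R. The matrix `x` and the values of `a` and `s` below are made up for this sketch; vsn2 estimates its calibration parameters from the data rather than taking them as input.

```r
## Toy intensity matrix: 6 rows (features), 2 columns (arrays or colours).
x = cbind(green = c(120, 340, 560, 80, 900, 45),
          red   = c(260, 700, 1150, 170, 1820, 95))

## Hypothetical per-column offsets a and scaling factors s; vsn2 estimates
## its calibration parameters from the data, these values are made up.
a = c(green = 20, red = 50)
s = c(green = 1,  red = 2)

## x -> (x - a) / s, with one (a, s) pair per column, shared by all rows.
xcal = sweep(sweep(x, 2, a, "-"), 2, s, "/")
xcal
```

After this step the two columns are on a comparable scale, which is the purpose of the calibration component.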

Note that VSN’s variance stabilisation only addresses the dependence of the variance on the mean intensity. There may be other factors influencing the variance, such as gene-inherent properties or changes of the tightness of transcriptional control in different conditions. These need to be addressed by other methods.

2 Running VSN on data from a single two-colour array

The dataset kidney contains example data from a spotted cDNA two-colour microarray on which cDNA from two adjacent tissue samples of the same kidney was hybridised, one labelled in green (Cy3), one in red (Cy5). The two columns of the matrix exprs(kidney) contain the green and red intensities, respectively. A local background estimate was calculated by the image analysis software and subtracted, hence some of the intensities in kidney are close to zero or negative. In Figure 1 you can see the scatterplot of the calibrated and transformed data. For comparison, the scatterplot of the log-transformed raw intensities is also shown.

library("ggplot2")
allpositive = (rowSums(exprs(kidney) <= 0) == 0)

## Bring the data into one long data.frame for ggplot. The raw data can only
## be log2-transformed for rows where both intensities are positive.
df1 = data.frame(log2(exprs(kidney)[allpositive, ]),
                 type = "raw",
                 allpositive = TRUE)
df2 = data.frame(exprs(xnorm),
                 type = "vsn",
                 allpositive = allpositive)
df = rbind(df1, df2)
names(df)[1:2] = c("x", "y")

ggplot(df, aes(x, y, col = allpositive)) + geom_hex(bins = 40) +
  coord_fixed() + facet_grid( ~ type)

Figure 1: Scatterplots of the kidney example data, which were obtained from a two-colour cDNA array by quantitating spots and subtracting a local background estimate. a) unnormalised and \(\log_2\)-transformed. b) normalised and transformed with VSN. Panel b shows the data from the complete set of 8704 spots on the array, Panel a only the 7806 spots for which both red and green net intensities were greater than 0. Those spots which are missing in Panel a are coloured in orange in Panel b.

To verify the variance stabilisation, there is the function meanSdPlot. For each feature \(k=1,\ldots,n\) it shows the empirical standard deviation \(\hat{\sigma}_k\) on the \(y\)-axis versus the rank of the average \(\hat{\mu}_k\) on the \(x\)-axis, where

\begin{equation}
\hat{\mu}_k =\frac{1}{d} \sum_{i=1}^d h_{ki}\quad\quad \hat{\sigma}_k^2=\frac{1}{d-1}\sum_{i=1}^d (h_{ki}-\hat{\mu}_k)^2
\end{equation}

Here \(h_{ki}\) denotes the transformed value of feature \(k\) in column \(i\), and \(d\) is the number of columns.

meanSdPlot(xnorm, ranks = TRUE)
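
The two per-feature quantities shown by such a plot can be computed by hand in base R. The sketch below uses a simulated matrix `h` as a stand-in for transformed data (an assumption made so that the example runs without any package); it illustrates the formulas for the per-feature mean and standard deviation and is not a reimplementation of meanSdPlot.

```r
## Simulated stand-in for a matrix of transformed values h, with n = 500
## features (rows) and d = 4 columns; not real microarray data.
set.seed(42)
h = matrix(rnorm(500 * 4, mean = 8, sd = 0.3), nrow = 500, ncol = 4)
d = ncol(h)

mu    = rowSums(h) / d                       # per-feature mean
sigma = sqrt(rowSums((h - mu)^2) / (d - 1))  # per-feature standard deviation

## Standard deviation against the rank of the mean, as in a plot
## drawn with ranks = TRUE.
plot(rank(mu), sigma, pch = ".", xlab = "rank(mean)", ylab = "sd")
```

If the variance is stabilised, the cloud of points should show no systematic trend along the \(x\)-axis.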