Harman is a PCA and constrained optimisation based technique that maximises the removal of batch effects from datasets, with the constraint that the probability of overcorrection (i.e. removing genuine biological signal along with batch noise) is kept to a fraction which is set by the end-user.

The harman package is used to remove unwanted batch (technical) effects from datasets. It was designed for microarray and deep sequencing data, but it is applicable to any high-dimensional data where the experimental variable of interest is not completely confounded with the batch variable. Harman uses a PCA and constrained optimisation based technique. This vignette is a tutorial for using Harman in a number of common scenarios and also provides insight on how to interpret results.

For help with Harman, use the package help system. A good place to start is typing `?harman`

in the console. Harman required three inputs: 1. `datamatrix`

, a `matrix`

of type `numeric`

or `integer`

, or a `data.frame`

that can be coerced into such using `as.matrix()`

, 2. `expt`

, a vector or factor containing the experimental grouping variable, 3. `batch`

, a vector or factor containing the batch grouping variable.

The `matrix`

is to be constructed with data values (typically microarray probes or sequencing counts) in rows and samples in columns, much like the `assayData`

slot in the canonical Bioconductor `eSet`

object, or any object which inherits from it. The data should have normalisation and any other global adjustment for noise reduction (such as background correction) applied prior to using Harman.

The fourth parameter of interest in Harman is `limit`

, which sets the confidence limit for batch effect removal. The default is `limit=0.95`

. We can consider the confidence limit an inverse of the alpha level, so there is `1 - limit`

chance (default of `0.05`

, or 5%) that some of the variance from the experimental variable is being removed along with the batch variance. This user specified `limit`

parameter allows one to tune the aggressiveness of the batch effect removal. The more samples in each batch, the more opportunity Harman has to model the batch distributions, which allows a more precise identification (and subsequent removal) of batch effects.

*The Harman package is available from Bioconductor (*Harman*) or Github (*Harman*).*

```
library(Harman)
library(HarmanData)
library(RColorBrewer)
```

First off, let’s load an example dataset from *HarmanData* where the batches and experimental variable are balanced.

`data(OLF)`

The Olfactory stem cell study is an experiment to gauge the response of human olfactory neurosphere-derived (hONS) cells established from adult donors to ZnO nanoparticles. A total of 28 Affymetrix HuGene 1.0 ST arrays were normalised together using RMA.

The data comprises six treatment groups plus a control group, each consisting of four replicates:

`table(olf.info)`

```
## Batch
## Treatment 1 2 3 4
## 1 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 1
## 5 1 1 1 1
## 6 1 1 1 1
## 7 1 1 1 1
```

The `olf.info`

data.frame has the `expt`

and `batch`

variables across 2 columns, with each sample described in one row to give 28 rows.

`olf.info[1:7,]`

```
## Treatment Batch
## c1 1 1
## c2 2 1
## c3 3 1
## c4 4 1
## c5 5 1
## c6 6 1
## c7 7 1
```

The data is a data.frame of normalised log2 expression values with dimensions: 33297 rows x 28 columns. This data can be handed to Harman as is, or first coerced into a matrix.

`head(olf.data)`

```
## c1 c2 c3 c4 c5 c6 c7 c8 c9
## p1 5.05866 4.58076 5.58438 2.90481 5.39752 4.24041 2.46891 5.34241 2.86128
## p2 4.23886 4.08143 3.21386 3.53045 4.18741 3.70027 3.05552 4.62957 4.09687
## p3 3.66121 2.79664 4.13699 2.86271 3.17795 2.92988 3.05603 3.42135 2.70507
## p4 8.61399 9.09654 9.16841 9.10928 8.94949 8.70754 8.75558 9.31429 8.63934
## p5 2.84004 2.66609 3.03612 3.26561 3.22945 3.32247 3.05079 3.02775 3.18419
## p6 3.12234 3.05058 3.85761 3.07707 3.67759 3.72965 3.43910 3.15980 2.40544
## c10 c11 c12 c13 c14 c15 c16 c17 c18
## p1 3.03601 4.16908 2.58603 3.14912 2.80076 5.84228 4.51905 3.63710 5.40139
## p2 3.59235 3.61548 3.87856 3.28656 4.12426 4.75150 4.44983 4.59084 4.87332
## p3 3.02844 3.11758 2.78865 3.82057 2.98588 4.34385 3.04461 3.47405 2.96032
## p4 8.67534 8.68344 9.06311 8.57974 9.01710 8.80506 8.82719 9.17436 8.72230
## p5 3.39400 3.27891 2.20645 3.26020 3.10178 3.72065 3.11529 3.75199 3.43958
## p6 3.33047 2.81670 3.34086 2.61524 2.38151 3.06240 4.51134 3.69132 3.08967
## c19 c20 c21 c22 c23 c24 c25 c26 c27
## p1 4.61271 5.07018 5.11652 3.11507 3.63363 4.00769 4.57562 4.34872 4.81612
## p2 4.24876 4.43532 5.59789 2.64399 3.54669 3.26678 3.31107 4.03487 3.60095
## p3 3.31022 2.89191 3.84660 3.24867 2.56718 2.99377 3.18528 2.99099 3.14193
## p4 9.14509 9.27561 9.17592 8.79834 8.92382 9.42164 9.07371 8.91382 8.76901
## p5 3.41088 3.42294 3.77549 3.47989 2.79997 2.41906 2.64609 3.14506 3.39344
## p6 5.20479 4.35899 3.80101 2.95787 2.82707 2.76646 2.89902 3.28622 4.27340
## c28
## p1 2.64101
## p2 4.10095
## p3 2.74735
## p4 8.30663
## p5 3.82901
## p6 3.21685
```

**Okay, so let’s run Harman.**

```
olf.harman <- harman(olf.data, expt=olf.info$Treatment, batch=olf.info$Batch,
limit=0.95)
```

This creates a `harmanresults`

S3 object which has a number of slots. These include the `centre`

and `rotation`

slots of a `prcomp`

object which is returned after calling `prcomp(t(x))`

, where `x`

is the datamatrix supplied. These two slots are required for reconstructing the PCA scores later. The other slots are the `factors`

supplied for the analysis (specified as `expt`

and `batch`

), the runtime `parameters`

and some summary `stats`

for Harman. Finally, there are the `original`

and `corrected`

PC scores.

`str(olf.harman)`

```
## List of 7
## $ factors :'data.frame': 28 obs. of 4 variables:
## ..$ expt : Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7 1 2 3 ...
## ..$ batch : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 2 2 2 ...
## ..$ expt.numeric : int [1:28] 1 2 3 4 5 6 7 1 2 3 ...
## ..$ batch.numeric: int [1:28] 1 1 1 1 1 1 1 2 2 2 ...
## $ parameters:List of 4
## ..$ limit : num 0.95
## ..$ numrepeats: int 100000
## ..$ randseed : num 0
## ..$ forceRand : logi FALSE
## $ stats :'data.frame': 27 obs. of 3 variables:
## ..$ dimension : Factor w/ 27 levels "PC1","PC10","PC11",..: 1 12 21 22 23 24 25 26 27 2 ...
## ..$ confidence: num [1:27] 0.95 0.949 0.95 0.95 0.951 ...
## ..$ correction: num [1:27] 0.25 0.33 0.5 0.9 0.44 0.85 0.74 1 1 1 ...
## $ center : Named num [1:33297] 4.13 3.95 3.19 8.92 3.19 ...
## ..- attr(*, "names")= chr [1:33297] "p1" "p2" "p3" "p4" ...
## $ rotation : num [1:33297, 1:28] -0.020637 -0.011028 -0.006488 -0.000785 -0.00551 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:33297] "p1" "p2" "p3" "p4" ...
## .. ..$ : chr [1:28] "PC1" "PC2" "PC3" "PC4" ...
## $ original : num [1:28, 1:27] -26.72 3.79 -19.76 25.66 -3.14 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:28] "c1" "c2" "c3" "c4" ...
## .. ..$ : chr [1:27] "PC1" "PC2" "PC3" "PC4" ...
## $ corrected : num [1:28, 1:27] -27.17 3.34 -20.21 25.21 -3.59 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:28] "c1" "c2" "c3" "c4" ...
## .. ..$ : chr [1:27] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "harmanresults"
```

Harman objects can be inspected with methods such as `pcaPlot`

and `arrowPlot`

, as well as the generic functions `plot`

and `summary`

.

`plot(olf.harman)`

`arrowPlot(olf.harman)`

Using `summary`

or inspecting the `stats`

slot we can see Harman corrected the first 7 PCs and left the rest uncorrected.

`summary(olf.harman)`

```
## Total factor levels:
##
## expt batch
## 7 4
##
## Experiment x Batch Design:
##
## batch
## expt 1 2 3 4
## 1 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 1
## 5 1 1 1 1
## 6 1 1 1 1
## 7 1 1 1 1
##
## Simulation parameters:
## 100000 simulations (with seed of 0). ForceRand is FALSE.
##
## Harman results with confidence limit of 0.95:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
## 0.25 0.33 0.50 0.90 0.44 0.85 0.74 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##
## Top batch-effected PCs:
## PC1 PC2 PC5 PC3 PC7
## 0.25 0.33 0.44 0.50 0.74
```

`kable(olf.harman$stats)`

dimension | confidence | correction |
---|---|---|

PC1 | 0.9500996 | 0.25 |

PC2 | 0.9490076 | 0.33 |

PC3 | 0.9500139 | 0.50 |

PC4 | 0.9497873 | 0.90 |

PC5 | 0.9509498 | 0.44 |

PC6 | 0.9505247 | 0.85 |

PC7 | 0.9507247 | 0.74 |

PC8 | 0.8353300 | 1.00 |

PC9 | 0.8897871 | 1.00 |

PC10 | 0.7478036 | 1.00 |

PC11 | 0.7534338 | 1.00 |

PC12 | 0.6490559 | 1.00 |

PC13 | 0.8538693 | 1.00 |

PC14 | 0.5361958 | 1.00 |

PC15 | 0.7655424 | 1.00 |

PC16 | 0.4155103 | 1.00 |

PC17 | 0.6931610 | 1.00 |

PC18 | 0.7644623 | 1.00 |

PC19 | 0.2869585 | 1.00 |

PC20 | 0.1424034 | 1.00 |

PC21 | 0.8732809 | 1.00 |

PC22 | 0.8205077 | 1.00 |

PC23 | 0.8612620 | 1.00 |

PC24 | 0.9488360 | 1.00 |

PC25 | 0.5062507 | 1.00 |

PC26 | 0.5842571 | 1.00 |

PC27 | 0.9466016 | 1.00 |

As the confidence (`limit`

) was 0.95, Harman will look for batch effects until this limit is met. It does this by reducing the batch means in an iterative way using a binary search algorithm until the chance that biological variance is being removed (the factor given in `expt`

) is too high. Consider the value specified in the stats column `confidence`

as the likelihood that separation of samples in this PC is due to batch alone and not due to the experimental variable. If this confidence value is less than `limit`

, Harman will not make another iterative correction. For example, on PC8 the confidence is about `0.835`

which is below the limit of `0.95`

, so Harman did not alter data on this PC. Let’s check this on a plot.

`arrowPlot(olf.harman, 1, 8)`

The horizontal arrows show us that, on PC1 the scores were shifted, but on PC8 they were not.

If both dimensions were not shifted, the `arrowPlot`

will default to printing `x`

instead of arrows (*can’t have 0 length arrows!*)

`arrowPlot(olf.harman, 8, 9)`

We can also easily colour the PCA plot by the experimental variable to compare:

```
par(mfrow=c(1,2))
pcaPlot(olf.harman, colBy='batch', pchBy='expt', main='Col by Batch')
pcaPlot(olf.harman, colBy='expt', pchBy='batch', main='Col by Expt')
```

So, if corrections have been made and we’re happy with the value of `limit`

, then we reconstruct corrected data from the Harman adjusted PC scores. To do this a `harmanresults`

object is handed to the function `reconstructData()`

. This function produces a matrix of the same dimensions as the input matrix filled with corrected data.

`olf.data.corrected <- reconstructData(olf.harman)`

This new data can be over-written into something like an `ExpressionSet`

object with a command like `exprs(eset) <- data.corrected`

. An example of this is below.

To convince you that Harman is working as it should in reconstructing the data back from the PCA domain, let’s test this principle on the original data.

```
olf.data.remade <- reconstructData(olf.harman, this='original')
all.equal(as.matrix(olf.data), olf.data.remade)
```

`## [1] TRUE`

*So, the data is indeed the same (to within the machine epsilon limit for floating point error)!*

Let’s try graphically plotting the differences using `image()`

. First, the rows (probesets) are ordered by the variation in the difference between the original and corrected data.

```
olf.data.diff <- abs(as.matrix(olf.data) - olf.data.corrected)
diff_rowvar <- apply(olf.data.diff, 1, var)
by_rowvar <- order(diff_rowvar)
par(mfrow=c(1,1))
image(x=1:ncol(olf.data.diff),
y=1:nrow(olf.data.diff),
z=t(olf.data.diff[by_rowvar, ]),
xlab='Sample',
ylab='Probeset',
main='Harman probe adjustments (ordered by variance)',
cex.main=0.7,
col=brewer.pal(9, 'Reds'))
```

We can see that most Harman adjusted probes were from batch 3 (samples 15 to 21), while batch 1 is relatively unchanged. In batches 2 and 3, often the same probes were adjusted to a larger extent in batch 3. This suggests some probes on this array design are prone to a batch effect.

The IMR90 Human lung fibroblast cell line data that was published in a paper by Johnson et al doi: 10.1093/biostatistics/kxj037 comes with Harman as an example dataset.

```
require(Harman)
require(HarmanData)
data(IMR90)
```

The data is structured like so:

Treatment | Batch | |
---|---|---|

_23_1CON | 1 | 1 |

_23_2CON | 2 | 1 |

_23_3NO0 | 3 | 1 |

_23_4NO7 | 4 | 1 |

_26_1CON | 1 | 2 |

_26_2CON | 2 | 2 |

_26_3NO0 | 3 | 2 |

_26_4NO7 | 4 | 2 |

CONTROLZ | 1 | 3 |

CONTROL7 | 2 | 3 |

NO_ZERO | 3 | 3 |

NO_7_5 | 4 | 3 |

It can be seen from the below table the experimental structure is completely balanced.

`table(expt=imr90.info$Treatment, batch=imr90.info$Batch)`

```
## batch
## expt 1 2 3
## 1 1 1 1
## 2 1 1 1
## 3 1 1 1
## 4 1 1 1
```

While there isn’t glaring batch effects in PC1 and PC2, they are more apparent when plotting PC1 and PC3.

```
# define a quick PCA and plot function
pca <- function(exprs, pc_x=1, pc_y=2, cols, ...) {
pca <- prcomp(t(exprs), retx=TRUE, center=TRUE)
if(is.factor(cols)) {
factor_names <- levels(cols)
num_levels <- length(factor_names)
palette <- rainbow(num_levels)
mycols <- palette[cols]
} else {
mycols <- cols
}
plot(pca$x[, pc_x], pca$x[, pc_y],
xlab=paste('PC', pc_x, sep=''),
ylab=paste('PC', pc_y, sep=''),
col=mycols, ...)
if(is.factor(cols)) {
legend(x=min(pca$x[, pc_x]), y=max(pca$x[, pc_y]),
legend=factor_names, fill=palette, cex=0.5)
}
}
par(mfrow=c(1,2), mar=c(4, 4, 4, 2) + 0.1)
imr90.pca <- prcomp(t(imr90.data), scale. = TRUE)
prcompPlot(imr90.pca, pc_x=1, pc_y=2, colFactor=imr90.info$Batch,
pchFactor=imr90.info$Treatment, main='IMR90, PC1 v PC2')
prcompPlot(imr90.pca, pc_x=1, pc_y=3, colFactor=imr90.info$Batch,
pchFactor=imr90.info$Treatment, main='IMR90, PC1 v PC3')
```

```
imr90.hm <- harman(as.matrix(imr90.data), expt=imr90.info$Treatment,
batch=imr90.info$Batch)
```

We can see the most batch correction was actually on the 3rd principal component and the data was also corrected on the 1st and 4th component.

dimension | confidence | correction |
---|---|---|

PC1 | 0.9498640 | 0.76 |

PC2 | 0.9247775 | 1.00 |

PC3 | 0.9503938 | 0.35 |

PC4 | 0.9502018 | 0.69 |

PC5 | 0.6654655 | 1.00 |

PC6 | 0.2424223 | 1.00 |

PC7 | 0.2775396 | 1.00 |

PC8 | 0.7810260 | 1.00 |

PC9 | 0.1267338 | 1.00 |

PC10 | 0.3193122 | 1.00 |

PC11 | 0.1313642 | 1.00 |

Plotting PC1 v. PC2 and PC1 v. PC3…

`plot(imr90.hm, pc_x=1, pc_y=2)`

`plot(imr90.hm, pc_x=1, pc_y=3)`

While no batch effect was found on PC2, there are batch effects on PC1, 3 and 4. An `arrowPlot()`

shows the extent of the correction:

`arrowPlot(imr90.hm, pc_x=1, pc_y=3)`

On PC1 the data has been shifted only a little, while on PC3 a large batch-effect signature seems to be present. The corrected data on the PC1 v. PC3 plot is still a little separated by batch. If we wish to aggressively stomp on the batch effect, (with increased risk of removing some experimental variance), we may specify something like `limit=0.9`

.

```
imr90.hm <- harman(as.matrix(imr90.data), expt=imr90.info$Treatment,
batch=imr90.info$Batch, limit=0.9)
plot(imr90.hm, pc_x=1, pc_y=3)
```

This time the samples across batches are less separated on the plot and there is also a batch effect found on PC2.

dimension | confidence | correction |
---|---|---|

PC1 | 0.8990548 | 0.54 |

PC2 | 0.9006838 | 0.86 |

PC3 | 0.8990320 | 0.25 |

PC4 | 0.8985044 | 0.49 |

PC5 | 0.6654655 | 1.00 |

PC6 | 0.2424223 | 1.00 |

PC7 | 0.2775396 | 1.00 |

PC8 | 0.7810260 | 1.00 |

PC9 | 0.1267338 | 1.00 |

PC10 | 0.3193122 | 1.00 |

PC11 | 0.1313642 | 1.00 |

Again, as before we reconstuct the data. Let’s also use this data to do a comparison: a PCA plot of the original data and shiny new reconstructed data.

```
imr90.data.corrected <- reconstructData(imr90.hm)
par(mfrow=c(1,2))
#pca_cols <- rainbow(3)[imr90.hm$factors$batch]
pca(imr90.data, 1, 3, cols=imr90.hm$factors$batch, cex=1.5, pch=16,
main='PCA, Original')
pca(imr90.data.corrected, 1, 3, cols=imr90.hm$factors$batch, cex=1.5, pch=16,
main='PCA, Corrected')
```

The dataset (*bladderbatch*) for this exercise is that discussed by Jeff Leek and others here and arises from this paper. This was a bladder cancer study comparing Affymetrix HG-U133A microarray expression profiles across five groups: superficial transitional cell carcinoma (sTCC) with surrounding carcinoma *in situ* (CIS) lesions (sTCC+CIS) or without surrounding CIS lesions (sTCC-CIS), muscle invasive carcinomas (mTCC) with normal bladder mucosa from patients without a bladder cancer history (Normal) and biopsies from cystectomy specimens (Biopsy). The arrays were run on 5 different days (5 batches).

First, loading the data.

`library(bladderbatch, quietly=TRUE)`

```
##
## Attaching package: 'BiocGenerics'
```

```
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
```

```
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
```

```
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, cbind, colnames, do.call, duplicated, eval,
## evalq, get, grep, grepl, intersect, is.unsorted, lapply,
## lengths, mapply, match, mget, order, paste, pmax, pmax.int,
## pmin, pmin.int, rank, rbind, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit
```

```
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
```

```
library(Harman)
# This loads an ExpressionSet object called bladderEset
data(bladderdata)
bladderEset
```

```
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 57 samples
## element names: exprs, se.exprs
## protocolData: none
## phenoData
## sampleNames: GSM71019.CEL GSM71020.CEL ... GSM71077.CEL (57
## total)
## varLabels: sample outcome batch cancer
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu133a
```

The phenotype data.

sample | outcome | batch | cancer | |
---|---|---|---|---|

GSM71019.CEL | 1 | Normal | 3 | Normal |

GSM71020.CEL | 2 | Normal | 2 | Normal |

GSM71021.CEL | 3 | Normal | 2 | Normal |

GSM71022.CEL | 4 | Normal | 3 | Normal |

GSM71023.CEL | 5 | Normal | 3 | Normal |

GSM71024.CEL | 6 | Normal | 3 | Normal |

GSM71025.CEL | 7 | Normal | 2 | Normal |

GSM71026.CEL | 8 | Normal | 2 | Normal |

GSM71028.CEL | 9 | sTCC+CIS | 5 | Cancer |

GSM71029.CEL | 10 | sTCC-CIS | 2 | Cancer |

GSM71030.CEL | 11 | sTCC-CIS | 5 | Cancer |

GSM71031.CEL | 12 | sTCC-CIS | 2 | Cancer |

GSM71032.CEL | 13 | sTCC+CIS | 5 | Cancer |

GSM71033.CEL | 14 | sTCC-CIS | 2 | Cancer |

GSM71034.CEL | 15 | sTCC+CIS | 5 | Cancer |

GSM71035.CEL | 16 | sTCC+CIS | 5 | Cancer |

GSM71036.CEL | 17 | sTCC-CIS | 2 | Cancer |

GSM71037.CEL | 18 | mTCC | 1 | Cancer |

GSM71038.CEL | 19 | sTCC+CIS | 5 | Cancer |

GSM71039.CEL | 20 | mTCC | 1 | Cancer |

GSM71040.CEL | 21 | mTCC | 2 | Cancer |

GSM71041.CEL | 22 | mTCC | 1 | Cancer |

GSM71042.CEL | 23 | sTCC-CIS | 2 | Cancer |

GSM71043.CEL | 24 | sTCC+CIS | 5 | Cancer |

GSM71044.CEL | 25 | sTCC-CIS | 2 | Cancer |

GSM71045.CEL | 26 | sTCC-CIS | 2 | Cancer |

GSM71046.CEL | 27 | sTCC+CIS | 5 | Cancer |

GSM71047.CEL | 28 | mTCC | 1 | Cancer |

GSM71048.CEL | 29 | mTCC | 1 | Cancer |

GSM71049.CEL | 30 | sTCC-CIS | 2 | Cancer |

GSM71050.CEL | 31 | mTCC | 1 | Cancer |

GSM71051.CEL | 32 | mTCC | 1 | Cancer |

GSM71052.CEL | 33 | mTCC | 1 | Cancer |

GSM71053.CEL | 34 | sTCC+CIS | 5 | Cancer |

GSM71054.CEL | 35 | mTCC | 1 | Cancer |

GSM71055.CEL | 36 | sTCC-CIS | 2 | Cancer |

GSM71056.CEL | 37 | sTCC-CIS | 2 | Cancer |

GSM71058.CEL | 38 | sTCC-CIS | 2 | Cancer |

GSM71059.CEL | 39 | sTCC-CIS | 2 | Cancer |

GSM71060.CEL | 40 | mTCC | 1 | Cancer |

GSM71061.CEL | 41 | sTCC+CIS | 5 | Cancer |

GSM71062.CEL | 42 | sTCC+CIS | 5 | Cancer |

GSM71063.CEL | 43 | sTCC+CIS | 5 | Cancer |

GSM71064.CEL | 44 | sTCC-CIS | 2 | Cancer |

GSM71065.CEL | 45 | sTCC-CIS | 5 | Cancer |

GSM71066.CEL | 46 | mTCC | 1 | Cancer |

GSM71067.CEL | 47 | sTCC-CIS | 5 | Cancer |

GSM71068.CEL | 48 | sTCC+CIS | 5 | Cancer |

GSM71069.CEL | 49 | Biopsy | 4 | Biopsy |

GSM71070.CEL | 50 | Biopsy | 4 | Biopsy |

GSM71071.CEL | 51 | Biopsy | 5 | Biopsy |

GSM71072.CEL | 52 | Biopsy | 5 | Biopsy |

GSM71073.CEL | 53 | Biopsy | 5 | Biopsy |

GSM71074.CEL | 54 | Biopsy | 5 | Biopsy |

GSM71075.CEL | 55 | Biopsy | 4 | Biopsy |

GSM71076.CEL | 56 | Biopsy | 4 | Biopsy |

GSM71077.CEL | 57 | Biopsy | 4 | Biopsy |

A table of `batch`

by `outcome`

.

`table(batch=pData(bladderEset)$batch, expt=pData(bladderEset)$outcome)`

```
## expt
## batch Biopsy mTCC Normal sTCC-CIS sTCC+CIS
## 1 0 11 0 0 0
## 2 0 1 4 13 0
## 3 0 0 4 0 0
## 4 5 0 0 0 0
## 5 4 0 0 3 12
```

The experimental design is very poor in controlling for a batch effect. Ideally, the five factors should be distributed across the five run dates (batches). Instead, four out of the five experimental factors are distributed across two batches. The `sTCC+CIS`

samples are all within batch `5`

. In this instance, batch is completely confounded with the experimental variable, so any variation in the data will be a sum of the biological variance and technical variance. Any inference about the difference of `sTCC+CIS`

to other groups needs to be made very guardedly. It is also worth noting the spread of the `mTCC`

samples across batches is suboptimal. While there are 11 samples in batch 1, there is only a single sample in batch 2. Finally, batches 3 and 4 have only one sample type.

Let’s explore the multivariate grouping of the data by generating some PCA plots, first with colouring by batch and then by experimental variable.

```
par(mfrow=c(1,2))
pca(exprs(bladderEset), cols=as.factor(pData(bladderEset)$batch),
pch=as.integer(pData(bladderEset)$outcome), main='Col by Batch')
pca(exprs(bladderEset), cols=pData(bladderEset)$outcome,
pch=pData(bladderEset)$batch, main='Col by Expt')
```

The batch structures are highly unbalanced and so a batch effect is not immediately obvious to casual observation. However, a closer examination of samples that are represented well across two batches (Biopsy and Normal), shows a batch effect is certainly present. Certainly, with an older microarray technology like the Affymetrix HG-U133A, batch effects will be present.

Even though the batch structures are highly unbalanced, this dataset is quite large - so, we expect Harman to be well powered to find batch effects. Let’s take a look.

```
expt <- pData(bladderEset)$outcome
batch <- pData(bladderEset)$batch
bladder_harman <- harman(exprs(bladderEset), expt=expt, batch=batch)
sum(bladder_harman$stats$correction < 1)
```

`## [1] 45`

Harman found 45 of the 56 PCs had a batch effect.

`summary(bladder_harman)`

```
## Total factor levels:
##
## expt batch
## 5 5
##
## Experiment x Batch Design:
##
## batch
## expt 1 2 3 4 5
## Biopsy 0 0 0 5 4
## mTCC 11 1 0 0 0
## Normal 0 4 4 0 0
## sTCC-CIS 0 13 0 0 3
## sTCC+CIS 0 0 0 0 12
##
## Simulation parameters:
## 100000 simulations (with seed of 0). ForceRand is FALSE.
##
## Harman results with confidence limit of 0.95:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
## 0.00 0.56 0.07 0.11 0.27 0.56 0.38 0.10 0.22 0.25 0.24 0.42 0.60 0.28 0.36
## PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30
## 0.46 0.24 1.00 0.63 0.15 0.24 0.50 0.70 0.41 0.35 0.60 0.44 0.37 0.82 0.38
## PC31 PC32 PC33 PC34 PC35 PC36 PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44 PC45
## 0.33 0.32 0.47 0.54 0.25 0.46 0.61 1.00 0.34 0.88 1.00 1.00 1.00 0.81 1.00
## PC46 PC47 PC48 PC49 PC50 PC51 PC52 PC53 PC54 PC55 PC56
## 0.47 1.00 0.75 1.00 0.72 0.89 1.00 0.97 0.94 1.00 1.00
##
## Top batch-effected PCs:
## PC1 PC3 PC8 PC4 PC20
## 0.00 0.07 0.10 0.11 0.15
```

Now plotting the original and corrected data.

```
par(mfrow=c(1,1))
plot(bladder_harman)
```

The `summary()`

shows PC3 to be highly affected by batch.

`plot(bladder_harman, 1, 3)`

Now the PC scores changes displayed on an `arrowPlot`

.

`arrowPlot(bladder_harman, 1, 3)`

First create a new object and then fill the `exprs`

slot of the ExpressionSet object with Harman corrected data. Alternatively, the existing object can have the `exprs`

slot overwritten.

```
CorrectedBladderEset <- bladderEset
exprs(CorrectedBladderEset) <- reconstructData(bladder_harman)
comment(bladderEset) <- 'Original'
comment(CorrectedBladderEset) <- 'Corrected'
```

Let’s check the effects of Harman with a *limma* analysis. First fitting the original data:

`library(limma, quietly=TRUE)`

```
##
## Attaching package: 'limma'
```

```
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
```

```
design <- model.matrix(~0 + expt)
colnames(design) <- c("Biopsy", "mTCC", "Normal", "sTCC", "sTCCwCIS")
contrast_matrix <- makeContrasts(sTCCwCIS - sTCC, sTCCwCIS - Normal,
Biopsy - Normal,
levels=design)
fit <- lmFit(exprs(bladderEset), design)
fit2 <- contrasts.fit(fit, contrast_matrix)
fit2 <- eBayes(fit2)
summary(classifyTestsT(fit2))
```

```
## sTCCwCIS - sTCC sTCCwCIS - Normal Biopsy - Normal
## -1 4284 6404 795
## 0 16155 12342 21076
## 1 1844 3537 412
```

Now a linear model on the Harman corrected data:

```
fit_hm <- lmFit(exprs(CorrectedBladderEset), design)
fit2_hm <- contrasts.fit(fit_hm, contrast_matrix)
fit2_hm <- eBayes(fit2_hm)
summary(classifyTestsT(fit2_hm))
```

```
## sTCCwCIS - sTCC sTCCwCIS - Normal Biopsy - Normal
## -1 17 629 1
## 0 22258 21515 22245
## 1 8 139 37
```

We can see the dramatic effect Harman has had in reducing the number of significant microarray probes. The huge reduction in the `Biopsy - Normal`

contrast makes biological sense, about half of the biopsies were from cystectomies were histologically normal. The Harman corrected data on the `sTCCwCIS - sTCC`

contrast suggests that surrounding *in situ* lesions (CIS) does not overly impact the transcriptome of superficial transitional cell carcinoma (sTCC).

As an example, let’s consider the Illumina Infinium HumanMethylation450 BeadChip data (450k array). First up, an important tip, **put M-values into Harman, not Beta-values.** Harman is designed for continuous ordinal data, not data which is constrained, such as Beta-values; which by definition are between 0 and 1. Input M-values into Harman and if Beta-values are needed downstream, convert the corrected M-values back into Beta-values using something like the `m2beta()`

function in the *lumi* package, or the `ilogit2()`

function in *minfi*.

M-values also have the property of having far more centrality than Beta-values. PCA is a parametric technique and so it works best with an underlying Gaussian distribution to the data. While it has been historically shown that PCA works rather well in non-parametric settings, Harman might be expected to be more sensitive if the data is centred (M-values), rather than bimodal at the extremes (Beta-values).

Beta values of exactly `0`

or `1`

will translate to minus infinity and infinity, respectively, in M-value space. An M-value is the `logit2()`

of a Beta value and Beta-values are `ilogit2`

of the M-values. These infinite values will make the internal singular value decomposition (SVD) step of Harman throw an exception. Further, infinite M-values are not commutative. If an M-value of `Inf`

or `-Inf`

is transformed back into a Beta-value it will have the value `NaN`

or `0`

, respectively. For these reasons we do not recommend a normalisation step which creates Beta-values of exactly `0`

or `1`

.

`library(minfi, quietly = TRUE)`

```
##
## Attaching package: 'S4Vectors'
```

```
## The following objects are masked from 'package:base':
##
## colMeans, colSums, expand.grid, rowMeans, rowSums
```

`## locfit 1.5-9.1 2013-03-22`

`## Setting options('download.file.method.GEOquery'='auto')`

`## Setting options('GEOquery.inmemory.gpl'=FALSE)`

`logit2`

```
## function (x)
## {
## log2(x) - log2(1 - x)
## }
## <environment: namespace:minfi>
```

`ilogit2`

```
## function (x)
## {
## 2^(x)/(1 + 2^(x))
## }
## <environment: namespace:minfi>
```

`ilogit2(Inf)`

`## [1] NaN`

`ilogit2(-Inf)`

`## [1] 0`

A normalisation technique such as `preprocessIllumina()`

in the minfi package will give Beta-values of exactly `0`

or `1`

, while a technique such as `SWAN()`

from *missMethyl* does not. Instead it incorporates a small offset, alpha, as suggested by Pan Du et al. While we do not recommend use of a normalisation technique that generate exact `0`

or `1`

Beta-values, sometimes, we realise, only this normalised data is available. For instance, when the data has been downloaded from a public resource, pre-normalised with no raw red and green channel data. For this case, we have supplied the `shiftBetas()`

helper function in Harman to resolve this problem. We shift beta values of *exactly* `0`

or `1`

by a very small amount (typically `1e-4`

) before transformation into M-values. As an example:

```
betas <- seq(0, 1, by=0.05)
betas
```

```
## [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65
## [15] 0.70 0.75 0.80 0.85 0.90 0.95 1.00
```

```
newBetas <- shiftBetas(betas, shiftBy=1e-4)
newBetas
```

```
## [1] 0.0001 0.0500 0.1000 0.1500 0.2000 0.2500 0.3000 0.3500 0.4000 0.4500
## [11] 0.5000 0.5500 0.6000 0.6500 0.7000 0.7500 0.8000 0.8500 0.9000 0.9500
## [21] 0.9999
```

`range(betas)`

`## [1] 0 1`

`range(newBetas)`

`## [1] 0.0001 0.9999`

`logit2(betas)`

```
## [1] -Inf -4.2479275 -3.1699250 -2.5025003 -2.0000000 -1.5849625
## [7] -1.2223924 -0.8930848 -0.5849625 -0.2895066 0.0000000 0.2895066
## [13] 0.5849625 0.8930848 1.2223924 1.5849625 2.0000000 2.5025003
## [19] 3.1699250 4.2479275 Inf
```

`logit2(newBetas)`

```
## [1] -13.2875681 -4.2479275 -3.1699250 -2.5025003 -2.0000000
## [6] -1.5849625 -1.2223924 -0.8930848 -0.5849625 -0.2895066
## [11] 0.0000000 0.2895066 0.5849625 0.8930848 1.2223924
## [16] 1.5849625 2.0000000 2.5025003 3.1699250 4.2479275
## [21] 13.2875681
```

So, let’s get underway and analyse the toy dataset supplied with minfi.

```
library(minfiData, quietly = TRUE)
baseDir <- system.file("extdata", package="minfiData")
targets <- read.metharray.sheet(baseDir)
```

`## [read.metharray.sheet] Found the following CSV files:`

`## [1] "/home/biocbuild/bbs-3.3-bioc/R/library/minfiData/extdata/SampleSheet.csv"`

Read in the files, normalise using `SWAN()`

and `preprocessIllumina()`

and filter out poorly called CpG sites.

`library(missMethyl, quietly = TRUE)`

`## `

```
rgSet <- read.metharray.exp(targets=targets)
mRaw <- preprocessRaw(rgSet)
mSetSw <- SWAN(mRaw)
```

```
## [SWAN] Preparing normalization subset
## [SWAN] Normalizing methylated channel
## [SWAN] Normalizing unmethylated channel
```

```
mSet <- preprocessIllumina(rgSet, bg.correct=TRUE, normalize="controls",
reference=2)
detP <- detectionP(rgSet)
keep <- rowSums(detP < 0.01) == ncol(rgSet)
mSetIl <- mSet[keep,]
mSetSw <- mSetSw[keep,]
```

This dataset has quite a few phenotype variables of interest. The samples are paired cancer-normal samples from three people, 1 male and 2 females.

```
kable(pData(mSetSw)[,c('Sample_Group', 'person', 'sex', 'status', 'Array',
'Slide')])
```

Sample_Group | person | sex | status | Array | Slide | |
---|---|---|---|---|---|---|

5723646052_R02C02 | GroupA | id3 | M | normal | R02C02 | 5723646052 |

5723646052_R04C01 | GroupA | id2 | F | normal | R04C01 | 5723646052 |

5723646052_R05C02 | GroupB | id3 | M | cancer | R05C02 | 5723646052 |

5723646053_R04C02 | GroupB | id1 | F | cancer | R04C02 | 5723646053 |

5723646053_R05C02 | GroupA | id1 | F | normal | R05C02 | 5723646053 |

5723646053_R06C02 | GroupB | id2 | F | cancer | R06C02 | 5723646053 |

Doing some exploratory data analysis using MDS (considering it is methylation data), we can see that `status`

is separated on PC1 and `sex`

on PC2. This makes biological sense as we know cancer and gender both have large effects on DNA methylation. In cancer samples, there is typically global hypomethylation and focal hypermethylation at some CpG islands. While in the case of gender, female and male samples have very different X chromosome methylation due to imprinting, as well as some autosomal CpG site differences.

If the plot is recoloured by `Slide`

, there is some suggestion a batch effect is intermingled with these two experimental variables. The tricky question is how to use Harman to remove the influence of `Slide`

on the data, without also removing variance due to the experimental variables of interest?

```
par(mfrow=c(1, 1))
cancer_by_gender_factor <- as.factor(paste(pData(mSetSw)$sex,
pData(mSetSw)$status)
)
mdsPlot(mSetSw, sampGroups=cancer_by_gender_factor, pch=16)
```

`mdsPlot(mSetSw, sampGroups=as.factor(pData(mSet)$Slide), pch=16)`

The experiment is very small, with only 6 (paired) samples. The MDS plots suggests there are two main phenotypes of influence, `status`

and `sex`

. The batch variable in this case is `Slide`

. The `sex`

phenotype variable, which is highly influencing the data, is spread unevenly across the two slides (batches).

`table(gender=pData(mSetSw)$sex, slide=pData(mSetSw)$Slide)`

```
## slide
## gender 5723646052 5723646053
## F 1 3
## M 2 0
```

From the table, we can see that only slide `5723646052`

has samples of male origin. If we specify a simple cancer-based model with `expt=pData(mSetSw)$status`

and `batch=pData(mSetSw)$Slide`

, Harman will attribute the male-specific methylation signature of the two paired male samples on slide `5723646052`

as batch effect and will try and eliminate it.

There are two strategies here: 1. Form a compound factor by joining the `status`

and `sex`

factors together. 2. Generate two Harman corrections, one for `status`

and one for `sex`

.

As this experiment is so small, the first strategy won’t be very effective. Only the factor level `F normal`

is shared by both batches and for only one replicate. It’s very hard for Harman to get an idea of the batch distributions.

```
cancer_by_gender_factor <- as.factor(paste(pData(mSetSw)$sex,
pData(mSetSw)$status)
)
table(expt=cancer_by_gender_factor, batch=pData(mSetSw)$Slide)
```

```
## batch
## expt 5723646052 5723646053
## F cancer 0 2
## F normal 1 1
## M cancer 1 0
## M normal 1 0
```

In the second strategy, both levels of `expt`

are shared across both levels of `batch`

. This is a far more ideal way to find batch effects. Of course though, *there is confounding between sex and Slide!* So, in these cases, do not consider a batch confounded factor in downstream differential analysis.

`table(expt=pData(mSetSw)$status, batch=pData(mSetSw)$Slide)`

```
## batch
## expt 5723646052 5723646053
## cancer 1 2
## normal 2 1
```

If the Beta-values are shifted away from `0`

or `1`

by a very small amount, (say `1e-7`

), they will generate extreme M-values. Instead a moderate correction (such as `1e-4`

) seems the preferred option.

```
par(mfrow=c(1,2))
library(lumi, quietly = TRUE)
shifted_betas <- shiftBetas(betas=getBeta(mSetIl), shiftBy=1e-7)
shifted_ms <- beta2m(shifted_betas) # same as logit2() from package minfi
plot(density(shifted_ms, 0.05), main="Shifted M-values, shiftBy = 1e-7",
cex.main=0.7)
shifted_betas <- shiftBetas(betas=getBeta(mSetIl), shiftBy=1e-4)
shifted_ms <- beta2m(shifted_betas) # same as logit2() from package minfi
plot(density(shifted_ms, 0.05), main="Shifted M-values, shiftBy = 1e-4",
cex.main=0.7)
```

It can be seen that M-values have far more central-tendency (more Gaussian-like) than Beta-values.

```
par(mfrow=c(1,2))
plot(density(shifted_betas, 0.1), main="Beta-values, shiftBy = 1e-4",
cex.main=0.7)
plot(density(shifted_ms, 0.1), main="M-values, shiftBy = 1e-4",
cex.main=0.7)
```

A comparison of M-values produced by the GenomeStudio-like `preprocessIllumina()`

function and by `SWAN()`

.

```
par(mfrow=c(1,2))
plot(density(shifted_ms, 0.1),
main="GenomeStudio-like M-values, shiftBy = 1e-4", cex.main=0.7)
plot(density(getM(mSetSw), 0.1), main="SWAN M-values", cex.main=0.7)
```

**From here on, we will work with SWAN normalised M-values.**

For this example, `status`

is declared as the experimental variable. When comparing conditions with very large DNA methylation differences, such as cancer and non-neoplastic samples, the batch effect will not be so easy to observe amongst all that biological variation. This scenario changes with a more subtle phenotype of interest. More arrays in each batch will allow Harman to more easily find batch effects. Considering this experiment is so small, an aggressive setting (`limit=0.65`

) is needed to find a batch effect across a number of PCs (PCs 2, 5 and 1 in particular). A plot shows the correction made was relatively minor.

```
methHarman <- harman(getM(mSetSw), expt=pData(mSetSw)$status,
batch=pData(mSetSw)$Slide, limit=0.65)
summary(methHarman)
```

```
## Total factor levels:
##
## expt batch
## 2 2
##
## Experiment x Batch Design:
##
## batch
## expt 5723646052 5723646053
## cancer 1 2
## normal 2 1
##
## Simulation parameters:
## 100000 simulations (with seed of 0). ForceRand is FALSE.
##
## Harman results with confidence limit of 0.65:
## PC1 PC2 PC3 PC4 PC5
## 0.97 0.60 1.00 1.00 0.92
##
## Top batch-effected PCs:
## PC2 PC5 PC1
## 0.60 0.92 0.97
```

`plot(methHarman, 2, 5)`

It is difficult to tease out the batch effect with such a small experimental group. Rather than further reduce the `limit`

, let’s just convert the data back.

```
ms_hm <- reconstructData(methHarman)
betas_hm <- m2beta(ms_hm)
```

In downstream differential methylation analysis using *limma* or otherwise, care must be taken to interpret the results. As Harman was run with the `expt`

variable set to `status`

, any other variation unrelated to `status`

which is unbalanced across the two slides is considered as batch noise. For example, we have already shown that `sex`

is a highly influential factor and it is unbalanced across the slides; only person of `id3`

is male and both the normal and cancer samples are on slide `5723646052`

. Therefore, we expect Harman to attribute this gender effect to a slide effect.

In the specification of a linear model of `model.matrix(~id + group)`

, we should then expect to find no differential methylation in person `id3`

, the male. Let’s try this:

```
library(limma, quietly=TRUE)
group <- factor(pData(mSetSw)$status, levels=c("normal", "cancer"))
id <- factor(pData(mSetSw)$person)
design <- model.matrix(~id + group)
design
```

```
## (Intercept) idid2 idid3 groupcancer
## 1 1 0 1 0
## 2 1 1 0 0
## 3 1 0 1 1
## 4 1 0 0 1
## 5 1 0 0 0
## 6 1 1 0 1
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$id
## [1] "contr.treatment"
##
## attr(,"contrasts")$group
## [1] "contr.treatment"
```

Now time to fit this design to both the original M-values and the Harman corrected M-values.

```
fit_hm <- lmFit(ms_hm, design)
fit_hm <- eBayes(fit_hm)
fit <- lmFit(getM(mSetSw), design)
fit <- eBayes(fit)
```

Our intuition is correct. Harman has indeed squashed the variation in person `id3`

, there are now no differentially methylated probes.

```
summary_fit_hm <- summary(decideTests(fit_hm))
summary_fit <- summary(decideTests(fit))
summary_fit_hm
```

```
## (Intercept) idid2 idid3 groupcancer
## -1 165228 0 0 19585
## 0 75463 466668 466668 433987
## 1 225977 0 0 13096
```

`summary_fit`

```
## (Intercept) idid2 idid3 groupcancer
## -1 162926 0 2661 13249
## 0 79801 466668 463834 443139
## 1 223941 0 173 10280
```

We also note, our ability to detect cancer-related differential methylation has been enhanced. There are now 6336 more hypermethylated CpG probes and 2816 more hypomethylated CpG probes.

```
# Call dependencies
library(msmsEDA)
```

`## Loading required package: MSnbase`

`## Loading required package: mzR`

`## Loading required package: Rcpp`

`## Loading required package: BiocParallel`

`## Loading required package: ProtGenerics`

```
##
## This is MSnbase version 1.20.5
## Read '?MSnbase' and references therein for information
## about the package and how to get started.
```

```
##
## Attaching package: 'MSnbase'
```

```
## The following object is masked from 'package:lumi':
##
## plotDensity
```

```
## The following object is masked from 'package:stats':
##
## smooth
```

```
## The following object is masked from 'package:base':
##
## trimws
```

```
library(Harman)
data(msms.dataset)
msms.dataset
```

```
## MSnSet (storageMode: lockedEnvironment)
## assayData: 697 features, 14 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: U2.2502.1 U2.2502.2 ... U6.0302.3 (14 total)
## varLabels: treat batch
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## pubMedIds: http://www.ncbi.nlm.nih.gov/pubmed/22588121
## Annotation:
## - - - Processing information - - -
## MSnbase version: 1.8.0
```

The data matrix in an `MSnSet`

object from package *msmsEDA* may have all zero rows (if it’s a subset of a larger object) and some samples may have NA values, which correspond to proteins not identified in that particular sample. Principle components analysis cannot be undertaken on matrices with such features. So first we can use the wrapper function `pp.msms.data()`

, which removes all zero rows and replaces `NA`

with `0`

.

```
# Preprocess to remove rows which are all 0 and replace NA values with 0.
msms_pp <- pp.msms.data(msms.dataset)
```

`kable(pData(msms_pp))`

treat | batch | |
---|---|---|

U2.2502.1 | U200 | 2502 |

U2.2502.2 | U200 | 2502 |

U2.2502.3 | U200 | 2502 |

U2.2502.4 | U200 | 2502 |

U6.2502.1 | U600 | 2502 |

U6.2502.2 | U600 | 2502 |

U6.2502.3 | U600 | 2502 |

U6.2502.4 | U600 | 2502 |

U2.0302.1 | U200 | 0302 |

U2.0302.2 | U200 | 0302 |

U2.0302.3 | U200 | 0302 |

U6.0302.1 | U600 | 0302 |

U6.0302.2 | U600 | 0302 |

U6.0302.3 | U600 | 0302 |

```
# Create experimental and batch variables
expt <- pData(msms_pp)$treat
batch <- pData(msms_pp)$batch
table(expt, batch)
```

```
## batch
## expt 0302 2502
## U200 3 4
## U600 3 4
```

```
# Log2 transform data, add 1 to avoid infinite values
log_ms_exprs <- log(exprs(msms_pp) + 1, 2)
# Correct data with Harman
hm <- harman(log_ms_exprs, expt=expt, batch=batch)
summary(hm)
```

```
## Total factor levels:
##
## expt batch
## 2 2
##
## Experiment x Batch Design:
##
## batch
## expt 0302 2502
## U200 3 4
## U600 3 4
##
## Simulation parameters:
## 100000 simulations (with seed of 0). ForceRand is FALSE.
##
## Harman results with confidence limit of 0.95:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13
## 0.06 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##
## Top batch-effected PCs:
## PC1
## 0.06
```

The Harman result is interesting. This MS data seems to have the batch effect contained within the first PC only. A marked difference compared to transcriptome microarray data and the like.

A plot rather convincingly shows that Harman was able to remove the batch effect.

`plot(hm)`

Us usual we now reconstruct corrected data, but we add an extra transformation step on the end. As the data was Log2 transformed we convert it back to the original format (and subtract 1 as this was added before during the transformation into Log2 space). The corrected and transformed back data is placed into a new ‘MSnSet’ instance.

```
# Reconstruct data and convert from Log2, removing 1 as we added it before.
log_ms_exprs_hm <- reconstructData(hm)
ms_exprs_hm <- 2^log_ms_exprs_hm - 1
# Place corrected data into a new 'MSnSet' instance
msms_pp_hm <- msms_pp
exprs(msms_pp_hm) <- ms_exprs_hm
```

This example is an illustration of how Harman will have reduced power in teasing apart biological and technical variance when presented with a very small dataset (2 replicates and 2 conditions).

To illustrate, we use the toy dataset in *affydata*. The `Dilution`

data within relates to two sources of cRNA, A (human liver tissue) and B (Central Nervous System cell line), which have been hybridized to human array (HGU95A) in a range of proportions and dilutions.

This toy example is taken from arrays hybridized to source A at 10.0 and 20 μg. There are two replicate arrays for each generated cRNA, with each array replicate processed in a different scanner. *For more information see Gautier et al., affy - Analysis of Affymetrix GeneChip data at the probe level*.

`library(affydata, quietly = TRUE)`

```
##
## Attaching package: 'affy'
```

```
## The following objects are masked from 'package:MSnbase':
##
## intensity, plotDensity
```

```
## The following object is masked from 'package:ProtGenerics':
##
## intensity
```

```
## The following objects are masked from 'package:lumi':
##
## MAplot, plotDensity
```

```
## Package LibPath Item
## [1,] "affydata" "/home/biocbuild/bbs-3.3-bioc/R/library" "Dilution"
## Title
## [1,] "AffyBatch instance Dilution"
```

```
data(Dilution)
Dilution.log <- Dilution
intensity(Dilution.log) <- log(intensity(Dilution))
cols <- brewer.pal(nrow(pData(Dilution)), 'Paired')
```

This dataset contains technical replicates of liver RNA run on two different scanners. Technical replicates are denoted as A and B samples. There are two technical replicates across two array scanners. The 10 and 20 The second 2 are also replicates. The second arrays are hybridized to twice as much RNA so the intensities should be in general bigger.

liver | sn19 | scanner | |
---|---|---|---|

20A | 20 | 0 | 1 |

20B | 20 | 0 | 2 |

10A | 10 | 0 | 1 |

10B | 10 | 0 | 2 |

`boxplot(Dilution, col=cols)`

```
## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail'
## when loading 'hgu95av2cdf'
```

```
## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head'
## when loading 'hgu95av2cdf'
```

`## `

Notice that the scanner effect is stronger than the RNA concentration effect. This certainly hints at a batch effect.

Let’s normalize the data by two methods (Loess and Quantile) and see if it removes this technical noise.

`Dilution.loess <-normalize(Dilution.log, method="loess")`

```
## Done with 1 vs 2 in iteration 1
## Done with 1 vs 3 in iteration 1
## Done with 1 vs 4 in iteration 1
## Done with 2 vs 3 in iteration 1
## Done with 2 vs 4 in iteration 1
## Done with 3 vs 4 in iteration 1
## 1 0.01032139
```

```
Dilution.qnt <-normalize(Dilution.log, method="quantiles")
par(mfrow=c(2,2), mar=c(4, 4, 4, 2) + 0.1)
boxplot(Dilution.loess, col=cols, main='Loess')
boxplot(Dilution.qnt, col=cols, main='Quantile')
pca(exprs(Dilution.loess), cols=cols, cex=1.5, pch=16, main='PCA Loess')
pca(exprs(Dilution.qnt), cols=cols, cex=1.5, pch=16, main='PCA Quantile')
```

The boxplots give the impression the batch effect has been removed. However, principal component analysis (PCA) shows the batch effect to still be the largest source of variance in the data. PC1 is dominated by batch effect (seperation of bold and pastel colours), while on PC2 the effect of RNA quantity is observed, so green (10 micrograms) compared with blue (20 micrograms).

Let’s fire up harman and try to remove this batch effect.

```
library(Harman)
loess.hm <- harman(exprs(Dilution.loess),
expt=pData(Dilution.loess)$liver,
batch=pData(Dilution.loess)$scanner)
qnt.hm <- harman(exprs(Dilution.qnt),
expt=pData(Dilution.qnt)$liver,
batch=pData(Dilution.qnt)$scanner)
```

If we inspect the stats on the Harman runs,

Loess-normalised data

dimension | confidence | correction |
---|---|---|

PC1 | 0.7732401 | 1 |

PC2 | 0.4279982 | 1 |

PC3 | 0.1374316 | 1 |

Quantile-normalised data

dimension | confidence | correction |
---|---|---|

PC1 | 0.7784752 | 1 |

PC2 | 0.0199157 | 1 |

PC3 | 0.0628038 | 1 |

**The correction statistic is 1.0 for all dimensions, so Harman didn’t find any batch effect!**

While our intuition tells us there is a batch effect, with the default settings (limit=0.95), harman fails to find one. *This is due to the fact there are only two replicates in each batch!* Let’s step up the limit now, to a confidence interval of 0.75. So, we are 75% sure only technical (batch) variation and not biological variation is being removed.

```
loess.hm <- harman(exprs(Dilution.loess),
expt=pData(Dilution.loess)$liver,
batch=pData(Dilution.loess)$scanner,
limit=0.75)
qnt.hm <- harman(exprs(Dilution.qnt),
expt=pData(Dilution.qnt)$liver,
batch=pData(Dilution.qnt)$scanner,
limit=0.75)
```

This time a batch effect is found on PC1.

Loess-normalised data

dimension | confidence | correction |
---|---|---|

PC1 | 0.7498260 | 0.45 |

PC2 | 0.4279982 | 1.00 |

PC3 | 0.1374316 | 1.00 |

Quantile-normalised data

dimension | confidence | correction |
---|---|---|

PC1 | 0.7491598 | 0.17 |

PC2 | 0.0199157 | 1.00 |

PC3 | 0.0628038 | 1.00 |

So what corrections did Harman make with `limit=0.75`

? Let’s take a look…

```
par(mfrow=c(2,2), mar=c(4, 4, 4, 2) + 0.1)
plot(loess.hm, 1, 2, pch=17, col=cols)
```

`plot(qnt.hm, 1, 2, pch=17, col=cols)`

```
Dilution.loess.hm <- Dilution.loess
Dilution.qnt.hm <- Dilution.qnt
exprs(Dilution.loess.hm) <- reconstructData(loess.hm)
exprs(Dilution.qnt.hm) <- reconstructData(qnt.hm)
```