Harman is a PCA and constrained optimisation based technique that maximises the removal of batch effects from datasets, with the constraint that the probability of overcorrection (i.e. removing genuine biological signal along with batch noise) is kept to a fraction which is set by the end-user.
Harman 1.6.0
Harman {noun}
A threshing yard where wheat is separated from chaff
Origin: Turkish and Persian
The harman package is used to remove unwanted batch (technical) effects from datasets. It was designed for microarray and deep sequencing data, but it is applicable to any high-dimensional data where the experimental variable of interest is not completely confounded with the batch variable. Harman uses a PCA and constrained optimisation based technique. This vignette is a tutorial for using Harman in a number of common scenarios and also provides insight on how to interpret results.
For help with Harman, use the package help system. A good place to start is typing ?harman
in the console. Harman requires three inputs: 1. datamatrix
, a matrix of type numeric or integer, or a data.frame that can be coerced into such using as.matrix()
, 2. expt
, a vector or factor containing the experimental grouping variable, 3. batch
, a vector or factor containing the batch grouping variable.
The matrix is to be constructed with features (typically microarray probes or sequencing counts) in rows and samples in columns, much like the assayData slot in the canonical Bioconductor eSet
object, or any object which inherits from it. The data should have normalisation and any other global adjustment for noise reduction (such as background correction) applied prior to using Harman.
The fourth parameter of interest in Harman is limit, which sets the confidence limit for batch effect removal. The default is limit=0.95. We can consider the confidence limit an inverse of the alpha level, so there is 1 - limit chance (default of 0.05, or 5%) that some of the variance from the experimental variable is being removed along with the batch variance. This user specified limit parameter allows one to tune the aggressiveness of the batch effect removal. The more samples in each batch, the more opportunity Harman has to model the batch distributions, which allows a more precise identification (and subsequent removal) of batch effects.
The Harman package is available from Bioconductor (Harman) or Github (Harman).
library(Harman)
library(HarmanData)
First off, let’s load an example dataset from HarmanData where the batches and experimental variable are balanced.
data(OLF)
The olfactory stem cell study is an experiment to gauge the response of human olfactory neurosphere-derived (hONS) cells established from adult donors to ZnO nanoparticles. A total of 28 Affymetrix HuGene 1.0 ST arrays were normalised together using RMA.
The data comprises six treatment groups plus a control group, each consisting of four replicates:
table(olf.info)
## Batch
## Treatment 1 2 3 4
## 1 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 1
## 5 1 1 1 1
## 6 1 1 1 1
## 7 1 1 1 1
The olf.info
data.frame has the expt and batch variables across 2 columns, with each sample described in one row to give 28 rows.
olf.info[1:7,]
## Treatment Batch
## c1 1 1
## c2 2 1
## c3 3 1
## c4 4 1
## c5 5 1
## c6 6 1
## c7 7 1
The data is a data.frame of normalised log2 expression values with dimensions: 33297 rows (features) x 28 columns (samples). This data can be handed to Harman as is, or first coerced into a matrix.
head(olf.data)
## c1 c2 c3 c4 c5 c6 c7 c8 c9
## p1 5.05866 4.58076 5.58438 2.90481 5.39752 4.24041 2.46891 5.34241 2.86128
## p2 4.23886 4.08143 3.21386 3.53045 4.18741 3.70027 3.05552 4.62957 4.09687
## p3 3.66121 2.79664 4.13699 2.86271 3.17795 2.92988 3.05603 3.42135 2.70507
## p4 8.61399 9.09654 9.16841 9.10928 8.94949 8.70754 8.75558 9.31429 8.63934
## p5 2.84004 2.66609 3.03612 3.26561 3.22945 3.32247 3.05079 3.02775 3.18419
## p6 3.12234 3.05058 3.85761 3.07707 3.67759 3.72965 3.43910 3.15980 2.40544
## c10 c11 c12 c13 c14 c15 c16 c17 c18
## p1 3.03601 4.16908 2.58603 3.14912 2.80076 5.84228 4.51905 3.63710 5.40139
## p2 3.59235 3.61548 3.87856 3.28656 4.12426 4.75150 4.44983 4.59084 4.87332
## p3 3.02844 3.11758 2.78865 3.82057 2.98588 4.34385 3.04461 3.47405 2.96032
## p4 8.67534 8.68344 9.06311 8.57974 9.01710 8.80506 8.82719 9.17436 8.72230
## p5 3.39400 3.27891 2.20645 3.26020 3.10178 3.72065 3.11529 3.75199 3.43958
## p6 3.33047 2.81670 3.34086 2.61524 2.38151 3.06240 4.51134 3.69132 3.08967
## c19 c20 c21 c22 c23 c24 c25 c26 c27
## p1 4.61271 5.07018 5.11652 3.11507 3.63363 4.00769 4.57562 4.34872 4.81612
## p2 4.24876 4.43532 5.59789 2.64399 3.54669 3.26678 3.31107 4.03487 3.60095
## p3 3.31022 2.89191 3.84660 3.24867 2.56718 2.99377 3.18528 2.99099 3.14193
## p4 9.14509 9.27561 9.17592 8.79834 8.92382 9.42164 9.07371 8.91382 8.76901
## p5 3.41088 3.42294 3.77549 3.47989 2.79997 2.41906 2.64609 3.14506 3.39344
## p6 5.20479 4.35899 3.80101 2.95787 2.82707 2.76646 2.89902 3.28622 4.27340
## c28
## p1 2.64101
## p2 4.10095
## p3 2.74735
## p4 8.30663
## p5 3.82901
## p6 3.21685
Okay, so let’s run Harman.
olf.harman <- harman(olf.data, expt=olf.info$Treatment, batch=olf.info$Batch, limit=0.95)
This creates a harmanresults
S3 object which has a number of slots. These include the centre and rotation slots of a prcomp object which is returned after calling prcomp(t(x))
, where x is the datamatrix supplied. These two slots are required for reconstructing the data later. The other slots are the factors supplied for the analysis (specified as expt and batch), the runtime parameters and some summary stats for Harman. Finally, there are the original and corrected PC scores.
str(olf.harman)
## List of 7
## $ factors :'data.frame': 28 obs. of 4 variables:
## ..$ expt : Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7 1 2 3 ...
## ..$ batch : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 2 2 2 ...
## ..$ expt.numeric : int [1:28] 1 2 3 4 5 6 7 1 2 3 ...
## ..$ batch.numeric: int [1:28] 1 1 1 1 1 1 1 2 2 2 ...
## $ parameters:List of 4
## ..$ limit : num 0.95
## ..$ numrepeats: int 100000
## ..$ randseed : num 0
## ..$ forceRand : logi FALSE
## $ stats :'data.frame': 28 obs. of 3 variables:
## ..$ dimension : Factor w/ 28 levels "PC1","PC10","PC11",..: 1 12 22 23 24 25 26 27 28 2 ...
## ..$ confidence: num [1:28] 0.95 0.949 0.95 0.95 0.951 ...
## ..$ correction: num [1:28] 0.25 0.33 0.5 0.9 0.44 0.85 0.74 1 1 1 ...
## $ center : Named num [1:33297] 4.13 3.95 3.19 8.92 3.19 ...
## ..- attr(*, "names")= chr [1:33297] "p1" "p2" "p3" "p4" ...
## $ rotation : num [1:33297, 1:28] -0.020637 -0.011028 -0.006488 -0.000785 -0.00551 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:33297] "p1" "p2" "p3" "p4" ...
## .. ..$ : chr [1:28] "PC1" "PC2" "PC3" "PC4" ...
## $ original : num [1:28, 1:28] -26.72 3.79 -19.76 25.66 -3.14 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:28] "c1" "c2" "c3" "c4" ...
## .. ..$ : chr [1:28] "PC1" "PC2" "PC3" "PC4" ...
## $ corrected : num [1:28, 1:28] -27.17 3.34 -20.21 25.21 -3.59 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:28] "c1" "c2" "c3" "c4" ...
## .. ..$ : chr [1:28] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "harmanresults"
Harman objects can be inspected with methods such as pcaPlot
and arrowPlot
, as well as the generic functions plot
and summary
.
plot(olf.harman)
arrowPlot(olf.harman, main='Arrowplot of correction')
Using summary
or inspecting the stats slot we can see Harman corrected the first 7 PCs and left the rest uncorrected.
summary(olf.harman)
## Total factor levels:
##
## expt batch
## 7 4
##
## Experiment x Batch Design:
##
## batch
## expt 1 2 3 4
## 1 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 1
## 5 1 1 1 1
## 6 1 1 1 1
## 7 1 1 1 1
##
## Simulation parameters:
## 100000 simulations (with seed of 0). ForceRand is FALSE.
##
## Harman results with confidence limit of 0.95:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
## 0.25 0.33 0.50 0.90 0.44 0.85 0.74 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##
## Top batch-effected PCs:
## PC1 PC2 PC5 PC3 PC7
## 0.25 0.33 0.44 0.50 0.74
dimension | confidence | correction |
---|---|---|
PC1 | 0.9500996 | 0.25 |
PC2 | 0.9490076 | 0.33 |
PC3 | 0.9500139 | 0.50 |
PC4 | 0.9497873 | 0.90 |
PC5 | 0.9509498 | 0.44 |
PC6 | 0.9505247 | 0.85 |
PC7 | 0.9507247 | 0.74 |
PC8 | 0.8353300 | 1.00 |
PC9 | 0.8897871 | 1.00 |
PC10 | 0.7478036 | 1.00 |
PC11 | 0.7534338 | 1.00 |
PC12 | 0.6490559 | 1.00 |
PC13 | 0.8538693 | 1.00 |
PC14 | 0.5361958 | 1.00 |
PC15 | 0.7655424 | 1.00 |
PC16 | 0.4155103 | 1.00 |
PC17 | 0.6931610 | 1.00 |
PC18 | 0.7644623 | 1.00 |
PC19 | 0.2869585 | 1.00 |
PC20 | 0.1424034 | 1.00 |
PC21 | 0.8732809 | 1.00 |
PC22 | 0.8205077 | 1.00 |
PC23 | 0.8612620 | 1.00 |
PC24 | 0.9488360 | 1.00 |
PC25 | 0.5062507 | 1.00 |
PC26 | 0.5842571 | 1.00 |
PC27 | 0.9466016 | 1.00 |
PC28 | 0.9111912 | 1.00 |
As the confidence (limit) was 0.95, Harman will look for batch effects until this limit is met. It does this by reducing the batch means in an iterative way using a binary search algorithm until the chance that biological variance is being removed (the factor given in expt) is too high. Consider the value specified in the stats column confidence as the likelihood that separation of samples in this PC is due to batch alone and not due to the experimental variable. If this confidence value is less than limit, Harman will not make another iterative correction. For example, on PC8 the confidence is about 0.835 which is below the limit of 0.95, so Harman did not alter data on this PC. Let’s check this on a plot.
arrowPlot(olf.harman, 1, 8)
The horizontal arrows show us that, on PC1 the scores were shifted, but on PC8 they were not.
If both dimensions were not shifted, the arrowPlot
will default to printing x instead of arrows (can’t have 0 length arrows!)
arrowPlot(olf.harman, 8, 9)
We can also easily colour the PCA plot by the experimental variable to compare:
par(mfrow=c(1,2))
pcaPlot(olf.harman, colBy='batch', pchBy='expt', main='Col by Batch')
pcaPlot(olf.harman, colBy='expt', pchBy='batch', main='Col by Expt')
So, if corrections have been made and we’re happy with the value of limit, then we reconstruct corrected data from the Harman adjusted PC scores. To do this a harmanresults object is handed to the function reconstructData()
. This function produces a matrix of the same dimensions as the input matrix filled with corrected data.
olf.data.corrected <- reconstructData(olf.harman)
This new data can be over-written into something like an ExpressionSet object with a command like exprs(eset) <- data.corrected
. An example of this is below.
To convince you that Harman is working as it should in reconstructing the data back from the PCA domain, let’s test this principle on the original data.
olf.data.remade <- reconstructData(olf.harman, this='original')
all.equal(as.matrix(olf.data), olf.data.remade)
## [1] TRUE
So, the data is indeed the same (to within the machine epsilon limit for floating point error)!
Let’s try graphically plotting the differences using image()
. First, the rows (assays) are ordered by the variation in the difference between the original and corrected data.
olf.data.diff <- abs(as.matrix(olf.data) - olf.data.corrected)
diff_rowvar <- apply(olf.data.diff, 1, var)
by_rowvar <- order(diff_rowvar)
par(mfrow=c(1,1))
image(x=1:ncol(olf.data.diff),
y=1:nrow(olf.data.diff),
z=t(olf.data.diff[by_rowvar, ]),
xlab='Sample',
ylab='Probeset',
main='Harman probe adjustments (ordered by variance)',
cex.main=0.7,
col=brewer.pal(9, 'Reds'))