Batch effects are one of the major source of technical variations in high throughput studies such as omics profiling. It has been well established that batch effects can be caused by different experimental platforms, laboratory conditions, different sources of samples and personnel differences. These differences can confound the outcomes of interest and lead to spurious results. A critical input for batch correction algorithms are the knowledge of batch factors, which in many cases are unknown or inaccurate. Hence, the primary motivation of our paper is to detect hidden batch factors that can be used in standard techniques to accurately capture the relationship between expression and other modeled variables of interest. Here, we present DASC, a novel algorithm that is based on convex clustering and semi-NMF for the detection of unknown batch effects.
Package: DASC
Authors: Haidong Yi, Ayush T. Raman
Version: 0.99.11
Compiled date: 2017-05-07
License: MIT + file LICENSE
Prerequisites: NMF, cvxclustr, Biobase
DASC is an R package distributed as part of the Bioconductor project. To install the package, start R and enter:
source("http://bioconductor.org/biocLite.R")
biocLite("DASC")
DASC is used for identifying batches and classifying samples into different batches in a high dimensional gene expression dataset. The batch information can be further used as a covariate in conjunction with other variables of interest among standard bioinformatics analysis like differential expression analysis.
If you use DASC for your analysis, please cite it as here below. To cite package ‘DASC’ in publications use:
@Manual{,
title = {DASC: Detecting hidden batch factors through data adaptive
adjustment for biological effects.},
author = {Haidong Yi, Ayush T. Raman, Han Zhang, Genevera I. Allen and
Zhandong Liu},
year = {2017},
note = {R package version 0.1.0},
}
library(DASC)
data("esGolub")
samples <- c(20,21,28,30)
dat <- exprs(esGolub)[1:100,samples]
pdat <- pData(esGolub)[samples,]
## use nrun = 50 or more for better convergence of results
res <- DASC(edata = dat, pdata = pdat, factor = pdat$Cell,
method = 'ama', type = 3, lambda = 1,
rank = 2:3, nrun = 5, annotation='esGolub Dataset')
Compute NMF rank= 2 ... + measures ... OK
Compute NMF rank= 3 ... + measures ... OK
The first step in using DASC
package is to properly format the data. For example, in case of gene expression data, it should be a matrix with features (genes, transcripts) in the rows and samples in the columns. DASC
then requires the information for the variable of interest to model the gene expression data effectively.Variable of interest could be a genotype or treatment information.
Below is an example of Stanford gene expression dataset (Chen et. al. PNAS, 2015; Gilad et. al. F1000 Research, 2015). It is a filtered raw counts dataset which was published by Gilad et al. F1000 Research. 30% of genes with the lowest expression & mitochondrial genes were removed (Gilad et al.F1000 Research).
## libraries
set.seed(99999)
library(DESeq2)
library(ggplot2)
library(pcaExplorer)
## dataset
rawCounts <- stanfordData$rawCounts
metadata <- stanfordData$metadata
## Using a smaller dataset
idx <- which(metadata$tissue %in% c("adipose", "adrenal", "sigmoid"))
rawCounts <- rawCounts[,idx]
metadata <- metadata[idx,]
head(rawCounts)
adipose (h) adrenal (h) sigmoid (h) adipose (m) adrenal (m)
STAG2 1430 4707 4392 3223 8235
STAG1 835 2362 1687 2750 2732
GOSR2 142 891 97 1599 1430
C1orf43 1856 9591 2611 706 498
ART5 1 4 0 0 0
ART1 0 0 0 0 1
sigmoid (m)
STAG2 10435
STAG1 2833
GOSR2 887
C1orf43 753
ART5 0
ART1 0
head(metadata)
setname seqBatch species tissue
adipose (h) adipose (h) D87PMJN1:253:D2GUAACXX:8 human adipose
adrenal (h) adrenal (h) D87PMJN1:253:D2GUAACXX:8 human adrenal
sigmoid (h) sigmoid (h) D87PMJN1:253:D2GUAACXX:8 human sigmoid
adipose (m) adipose (m) D4LHBFN1:276:C2HKJACXX:4 mouse adipose
adrenal (m) adrenal (m) D4LHBFN1:276:C2HKJACXX:4 mouse adrenal
sigmoid (m) sigmoid (m) D4LHBFN1:276:C2HKJACXX:4 mouse sigmoid
## Normalizing the dataset using DESeq2
dds <- DESeqDataSetFromMatrix(rawCounts, metadata, design = ~ species+tissue)
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
lognormalizedCounts <- log2(dat + 1)
## PCA plot using
rld.dds <- rlog(dds)
pcaplot(rld.dds, intgroup=c("tissue","species"), ntop=1000, pcX=1, pcY=2)
In the PCA plot, PC1 shows the differences between the species. PC2 shows the differences between the species i.e. samples clustering based on tissues.
res <- DASC(edata = dat, pdata = metadata, factor = metadata$tissue,
method = 'ama', type = 3, lambda = 1, rank = 2:3, nrun = 10,
annotation = 'Stanford Dataset')
Compute NMF rank= 2 ... + measures ... OK
Compute NMF rank= 3 ... + measures ... OK
## Consensus plot
consensusmap(res)
## Residual plot
plot(res)
## Batches -- dataset has 6 batches
sample.clust <- data.frame(sample.name = colnames(lognormalizedCounts),
clust = as.vector(predict(res$fit$`2`)),
batch = metadata$seqBatch)
ggplot(data = sample.clust, aes(x=c(1:6), y=clust, color=factor(clust))) +
geom_point(size = 4) + xlab("Sample Number") + ylab("Cluster Number")
Based on the above plots, we observe that the dataset has 2 batches. This can further be compared with the sequencing platform or metadata$seqBatch
. The results suggest that differences in platform led to batch effects. Batch number can be used as another covariate, when differential expression analyses using DESeq2
,edgeR
or limma
are performed.
## not running this part of the code for building package
## Using entire dataset
rawCounts <- stanfordData$rawCounts
metadata <- stanfordData$metadata
dds <- DESeqDataSetFromMatrix(rawCounts, metadata, design = ~ species+tissue)
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
lognormalizedCounts <- log2(dat + 1)
## PCA Plot
rld.dds <- rlog(dds)
pcaplot(rld.dds, intgroup=c("tissue","species"), ntop = nrow(rld.dds),
pcX = 1, pcY = 2)
## Running DASC
res <- DASC(edata = dat, pdata = metadata, factor = metadata$tissue,
method = 'ama', type = 3, lambda = 1, rank = 2:10,
nrun = 100, annotation = 'Stanford Dataset')
## Consensus plot
consensusmap(res)
## Residual plot
plot(res)
## Clustering samples based on batches
sample.clust <- data.frame(sample.name = colnames(lognormalizedCounts),
clust = as.vector(predict(res$fit$`10`)),
batch = metadata$seqBatch)
ggplot(data = sample.clust, aes(x=c(1:26), y=clust, color=factor(clust))) +
geom_point(size = 4) + xlab("Sample Number") + ylab("Cluster Number")
Based on the PCA plot, we see the PC1 captures the differences based on species. Based on consensusmap
plot and Cophenetic & dispersion values, there are 10 batches in the dataset. Our results show that the batches are not only due to platform but due other reason like differences in the date and the time of library preparation and sequencing of the samples. Another important observation, is that at rank equal to 2, we observe entire dataset to cluster based on speicies.
sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] RColorBrewer_1.1-2 pcaExplorer_2.3.0
[3] ggplot2_2.2.1 DESeq2_1.17.1
[5] SummarizedExperiment_1.7.2 DelayedArray_0.3.4
[7] matrixStats_0.52.2 GenomicRanges_1.29.3
[9] GenomeInfoDb_1.13.1 IRanges_2.11.1
[11] S4Vectors_0.15.1 doParallel_1.0.10
[13] iterators_1.0.8 foreach_1.4.3
[15] DASC_0.99.11 cvxclustr_1.1.1
[17] igraph_1.0.1 Matrix_1.2-10
[19] NMF_0.20.6 bigmemory_4.5.19
[21] bigmemory.sri_0.1.3 cluster_2.0.6
[23] rngtools_1.2.4 pkgmaker_0.22
[25] registry_0.3 Biobase_2.37.2
[27] BiocGenerics_0.23.0 knitr_1.15.1
[29] BiocStyle_2.5.0
loaded via a namespace (and not attached):
[1] colorspace_1.3-2 rprojroot_1.2 htmlTable_1.9
[4] XVector_0.17.0 base64enc_0.1-3 d3heatmap_0.6.1.1
[7] topGO_2.29.0 ggrepel_0.6.5 DT_0.2
[10] AnnotationDbi_1.39.0 codetools_0.2-15 splines_3.4.0
[13] geneplotter_1.55.0 Formula_1.2-1 jsonlite_1.4
[16] gridBase_0.4-7 annotate_1.55.0 GO.db_3.4.1
[19] png_0.1-7 pheatmap_1.0.8 shinydashboard_0.5.3
[22] graph_1.55.0 shiny_1.0.3 compiler_3.4.0
[25] GOstats_2.43.0 backports_1.0.5 assertthat_0.2.0
[28] lazyeval_0.2.0 limma_3.33.2 acepack_1.4.1
[31] htmltools_0.3.6 prettyunits_1.0.2 tools_3.4.0
[34] gtable_0.2.0 GenomeInfoDbData_0.99.0 Category_2.43.0
[37] reshape2_1.4.2 Rcpp_0.12.10 stringr_1.2.0
[40] mime_0.5 XML_3.98-1.7 shinyAce_0.2.1
[43] zlibbioc_1.23.0 scales_0.4.1 shinyBS_0.61
[46] RBGL_1.53.0 SparseM_1.77 yaml_2.1.14
[49] memoise_1.1.0 gridExtra_2.2.1 biomaRt_2.33.1
[52] rpart_4.1-11 latticeExtra_0.6-28 stringi_1.1.5
[55] RSQLite_1.1-2 genefilter_1.59.0 checkmate_1.8.2
[58] BiocParallel_1.11.1 bitops_1.0-6 evaluate_0.10
[61] lattice_0.20-35 labeling_0.3 htmlwidgets_0.8
[64] GSEABase_1.39.0 AnnotationForge_1.19.0 plyr_1.8.4
[67] magrittr_1.5 bookdown_0.3 R6_2.2.0
[70] Hmisc_4.0-3 DBI_0.6-1 foreign_0.8-68
[73] survival_2.41-3 RCurl_1.95-4.8 nnet_7.3-12
[76] tibble_1.3.0 rmarkdown_1.5 progress_1.1.2
[79] locfit_1.5-9.1 grid_3.4.0 data.table_1.10.4
[82] threejs_0.2.2 digest_0.6.12 xtable_1.8-2
[85] tidyr_0.6.2 httpuv_1.3.3 munsell_0.4.3