1 Introduction

The EpiDISH package provides tools to infer the proportions of a priori known cell subtypes in a sample that is a mixture of such cell types. Inference proceeds via one of three methods, chosen by the user: Robust Partial Correlations (RPC) (Teschendorff et al. 2017), Cibersort (CBS) (Newman et al. 2015), or Constrained Projection (CP) (Houseman et al. 2012).

For now, the package includes only one whole-blood reference, consisting of 333 tsDHS-DMCs and 8 blood cell subtypes (B-cells, CD4+ T-cells, CD8+ T-cells, NK-cells, Monocytes, Neutrophils, Eosinophils, and Granulocytes; note that Granulocytes consist of Neutrophils and Eosinophils), as described in Teschendorff et al. (2017). This reference dataset was derived from 450k DNAm array data; however, it can be applied directly to both 450k and EPIC array data. The package is under development and will offer reference-based inference for other tissue types. We will also include more algorithms in the future.

2 How to use EpiDISH package

Using EpiDISH is quite simple. Here we use a small Illumina HumanMethylation450 BeadChip blood dataset (n=2) from GEO as an example.

You can download the dataset with the getGEO function from the GEOquery package and extract the full beta value matrix.

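library(GEOquery)  # provides getGEO()
library(Biobase)   # provides exprs()
# getGEO() returns a list of ExpressionSet objects for a GSE accession
GSE80559 <- getGEO("GSE80559")
# Beta value matrix: probes in rows, samples in columns
beta.m <- exprs(GSE80559[[1]])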

To reduce package size and running time, we randomly selected 1000 probes from the beta value matrix, 330 of which overlap with the blood reference we provide. The resulting DummyBeta.m is stored in the package.

We load the EpiDISH package, the beta value matrix, and the whole-blood reference dataset.
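The exact sampling procedure used to build DummyBeta.m is not documented here, so the following is only a minimal sketch of one way to draw such a subset. It uses toy stand-ins: `full.beta.m`, `ref.probes` and the synthetic probe IDs are hypothetical and are not objects from the package.

```r
# Toy stand-ins (hypothetical): a "full" beta matrix with 5000 probes and
# a set of 333 reference probe IDs drawn from the same probe universe.
set.seed(1)
all.probes <- sprintf("cg%08d", 1:5000)
full.beta.m <- matrix(runif(5000 * 2), nrow = 5000,
                      dimnames = list(all.probes, c("S1", "S2")))
ref.probes <- sample(all.probes, 333)

# Keep 330 probes that are in the reference ...
keep.ref <- sample(intersect(rownames(full.beta.m), ref.probes), 330)
# ... and fill up to 1000 with probes that are not in the reference.
keep.other <- sample(setdiff(rownames(full.beta.m), ref.probes), 1000 - 330)

dummy.m <- full.beta.m[c(keep.ref, keep.other), ]
dim(dummy.m)
sum(rownames(dummy.m) %in% ref.probes)
```

Sampling the two groups separately fixes the size of the overlap, which a single random draw of 1000 probes would not guarantee.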

library(EpiDISH)
data(centDHSbloodDMC.m)
data(DummyBeta.m)

Notice that centDHSbloodDMC.m has 8 columns. Because Granulocytes consist of Neutrophils and Eosinophils, we want to include either 7 columns (i.e. B-cells, CD4+ T-cells, CD8+ T-cells, NK-cells, Monocytes, Neutrophils and Eosinophils) or 6 columns (i.e. B-cells, CD4+ T-cells, CD8+ T-cells, NK-cells, Monocytes and Granulocytes), but not all 8. Here we take the first 6 columns, which include Granulocytes, and run the epidish function in RPC mode to infer the proportions.

ref.m <- centDHSbloodDMC.m[,1:6]
out.l <- epidish(DummyBeta.m, ref.m, method = "RPC") 
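Selecting reference columns by name rather than by position can make the choice between the 6-column and 7-column variants more explicit. The sketch below uses a toy matrix, not the package data. The first six column names match those printed for out.l$estF; the names "Neutro" and "Eosino" are assumptions made for illustration and should be checked against colnames(centDHSbloodDMC.m).

```r
# Toy 3 x 8 reference-like matrix (hypothetical values and probe IDs).
# "Neutro" and "Eosino" are assumed column names, used only for illustration.
set.seed(1)
toy.ref.m <- matrix(runif(3 * 8), nrow = 3,
                    dimnames = list(paste0("cg", 1:3),
                                    c("B", "NK", "CD4T", "CD8T",
                                      "Gran", "Mono", "Neutro", "Eosino")))

# 6-column variant: keep Granulocytes, drop their two components.
ref6.m <- toy.ref.m[, c("B", "NK", "CD4T", "CD8T", "Gran", "Mono")]

# 7-column variant: drop Granulocytes, keep everything else.
ref7.m <- toy.ref.m[, colnames(toy.ref.m) != "Gran"]

colnames(ref6.m)
colnames(ref7.m)
```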

Next, we check the output list. estF is the matrix of estimated cell fractions, ref is the reference centroid matrix that was used, and dataREF is the input data matrix restricted to the probes defined in the reference matrix.

out.l$estF
##                     B         NK      CD4T       CD8T      Gran       Mono
## GSM2130818 0.06284611 0.05383789 0.1234405 0.04984799 0.6269118 0.08311566
## GSM2130819 0.07248251 0.06788991 0.1170393 0.07111009 0.5996838 0.07179431
dim(out.l$ref)
## [1] 330   6
dim(out.l$dataREF)
## [1] 330   2
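A quick sanity check on the estimated fractions is to confirm that they are non-negative and that each sample's fractions sum to approximately 1. The sketch below re-enters the values printed above by hand, so it runs without the package; with the real object, out.l$estF could be used in place of the hand-built `estF`.

```r
# Values copied from the out.l$estF printout above.
estF <- rbind(
  GSM2130818 = c(B = 0.06284611, NK = 0.05383789, CD4T = 0.1234405,
                 CD8T = 0.04984799, Gran = 0.6269118, Mono = 0.08311566),
  GSM2130819 = c(B = 0.07248251, NK = 0.06788991, CD4T = 0.1170393,
                 CD8T = 0.07111009, Gran = 0.5996838, Mono = 0.07179431)
)

# Fractions should be non-negative and sum to about 1 per sample
# (the tolerance allows for rounding in the printed values).
all(estF >= 0)
all(abs(rowSums(estF) - 1) < 1e-6)

# Most abundant cell type per sample.
top.type <- colnames(estF)[apply(estF, 1, which.max)]
names(top.type) <- rownames(estF)
top.type
```

For both samples the largest estimated fraction is Gran.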

In this case, 330 of the 333 probes in the input reference matrix were found in the query matrix, so ref is a \(330 \times 6\) matrix and dataREF is a \(330 \times 2\) matrix. Bad probes may be removed during QC; consequently, not all reference probes will necessarily be present in the query data. By checking ref and dataREF, we can extract the probes that were actually used to infer the proportions. If most of the reference probes cannot be found, the estimated proportions may be compromised.
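The probe check described above can be sketched with row-name set operations. The example uses toy matrices so that it runs on its own; `probe_coverage` is a hypothetical helper, not an EpiDISH function. With real data, the row names of out.l$ref would be expected to list the probes that were used, assuming epidish keeps probe IDs as row names.

```r
# Toy data (hypothetical): 333 reference probes, of which 330 are in the query.
set.seed(42)
ref.ids <- sprintf("cg%08d", 1:333)
toy.ref.m <- matrix(runif(333 * 6), nrow = 333,
                    dimnames = list(ref.ids,
                                    c("B", "NK", "CD4T", "CD8T", "Gran", "Mono")))
query.ids <- c(ref.ids[1:330], sprintf("cg%08d", 1001:1670))
toy.beta.m <- matrix(runif(1000 * 2), nrow = 1000,
                     dimnames = list(query.ids, c("S1", "S2")))

# Hypothetical helper: which reference probes are present in the query data?
probe_coverage <- function(ref.m, beta.m) {
  used <- intersect(rownames(ref.m), rownames(beta.m))
  missing <- setdiff(rownames(ref.m), rownames(beta.m))
  list(used = used, missing = missing,
       fraction = length(used) / nrow(ref.m))
}

cov.l <- probe_coverage(toy.ref.m, toy.beta.m)
length(cov.l$used)
cov.l$missing
round(cov.l$fraction, 3)
```

A low `fraction` would be a warning sign that the estimates rest on too few reference probes.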

3 More info about different methods

We compared CP and RPC in Teschendorff et al. (2017). We have also published a review article (Teschendorff and Zheng 2017) that summarizes the methods for tackling cell-type heterogeneity in DNAm data. See the References section for more details.
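To see how the three methods behave on the same data, the call from section 2 can be repeated with each value of the method argument. This is a minimal sketch that assumes EpiDISH and the packages it needs for CBS and CP are installed. It relies only on the method values "RPC", "CBS" and "CP" and on the estF element of the returned list; that CBS and CP return an estF element just as RPC does is assumed here, not shown above.

```r
library(EpiDISH)
data(centDHSbloodDMC.m)
data(DummyBeta.m)

# Same 6-column reference as in section 2.
ref.m <- centDHSbloodDMC.m[, 1:6]

# One estF matrix per method, collected in a named list.
methods.v <- c("RPC", "CBS", "CP")
est.l <- lapply(setNames(methods.v, methods.v), function(m) {
  epidish(DummyBeta.m, ref.m, method = m)$estF
})

# Granulocyte fraction per sample (rows) and method (columns).
sapply(est.l, function(x) x[, "Gran"])
```

With only two samples and a subset of probes, any differences between methods here are illustrative only and say nothing about their relative accuracy.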

4 Session info

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] EpiDISH_1.2.0   BiocStyle_2.8.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16    bookdown_0.7    quadprog_1.5-5  class_7.3-14   
##  [5] digest_0.6.15   rprojroot_1.3-2 MASS_7.3-50     backports_1.1.2
##  [9] magrittr_1.5    e1071_1.6-8     evaluate_0.10.1 stringi_1.1.7  
## [13] rmarkdown_1.9   tools_3.5.0     stringr_1.3.0   xfun_0.1       
## [17] yaml_2.1.18     compiler_3.5.0  htmltools_0.3.6 knitr_1.20

References

Houseman, Eugene Andres, William P Accomando, Devin C Koestler, Brock C Christensen, Carmen J Marsit, Heather H Nelson, John K Wiencke, and Karl T Kelsey. 2012. “DNA methylation arrays as surrogate measures of cell mixture distribution.” BMC Bioinformatics 13 (1):86.

Newman, Aaron M, Chih Long Liu, Michael R Green, Andrew J Gentles, Weiguo Feng, Yue Xu, Chuong D Hoang, Maximilian Diehn, and Ash A Alizadeh. 2015. “Robust enumeration of cell subsets from tissue expression profiles.” Nature Methods 12 (5):453–57.

Teschendorff, Andrew E, Charles E Breeze, Shijie C Zheng, and Stephan Beck. 2017. “A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies.” BMC Bioinformatics 18 (1):105.

Teschendorff, Andrew E, and Shijie C Zheng. 2017. “Cell-type deconvolution in epigenome-wide association studies: a review and recommendations.” Epigenomics 9 (5):757–68.