
1 Introduction

The EpiDISH package provides tools to infer the fractions of a priori known cell subtypes present in a sample that is a mixture of such cell-types. Inference proceeds via one of three methods, chosen by the user: Robust Partial Correlations (RPC) (Teschendorff et al. 2017), Cibersort (CBS) (Newman et al. 2015), or Constrained Projection (CP) (Houseman et al. 2012). The package also provides the function CellDMC, which identifies differentially methylated cell-types in Epigenome-Wide Association Studies (EWAS) (Zheng, Breeze, et al. 2018).

The package currently contains four references: two whole-blood subtype references, one generic epithelial reference comprising epithelial cells, fibroblasts, and total immune cells, and one reference for breast tissue, as described in (Teschendorff et al. 2017) and (Zheng, Webster, et al. 2018).
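
To make the idea behind constrained projection concrete, the following is a minimal sketch on simulated data, not the package's own implementation. It assumes the standard formulation of the problem: least squares on a reference matrix, with fractions constrained to be non-negative and to sum to at most 1. All object names (ref.sim.m, mix.v, true.f, est.f) are hypothetical, and the quadratic program is solved with solve.QP from the quadprog package.

```r
library(quadprog)

set.seed(1)

# Simulated reference: 200 CpGs x 3 cell-types, beta values in [0, 1]
n.cpg <- 200
ref.sim.m <- matrix(runif(n.cpg * 3), nrow = n.cpg,
                    dimnames = list(NULL, c("CT1", "CT2", "CT3")))

# Simulated mixture with known fractions plus small Gaussian noise
true.f <- c(CT1 = 0.2, CT2 = 0.3, CT3 = 0.5)
mix.v <- as.vector(ref.sim.m %*% true.f) + rnorm(n.cpg, sd = 0.01)

# Least squares subject to f_k >= 0 and sum(f_k) <= 1.
# solve.QP minimises 1/2 f'Df - d'f subject to t(Amat) %*% f >= bvec.
K <- ncol(ref.sim.m)
D.m <- crossprod(ref.sim.m)
d.v <- as.vector(crossprod(ref.sim.m, mix.v))
A.m <- cbind(diag(K), rep(-1, K))   # K non-negativity constraints, then -sum(f) >= -1
b.v <- c(rep(0, K), -1)

est.f <- solve.QP(Dmat = D.m, dvec = d.v, Amat = A.m, bvec = b.v)$solution
names(est.f) <- colnames(ref.sim.m)
round(est.f, 3)
```

With low noise, the estimated fractions should lie close to the simulated ones. In practice the epidish function handles probe matching and method selection, so this sketch is only meant to illustrate the underlying optimisation.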

2 How to estimate cell-type fractions using DNAm data

To show how to use the package, we constructed a dummy beta value matrix, DummyBeta.m, containing 2000 CpGs and 10 samples, and stored it in the package.

We first load the EpiDISH package, DummyBeta.m, and the EpiFibIC reference.

library(EpiDISH)
data(centEpiFibIC.m)
data(DummyBeta.m)

Note that centEpiFibIC.m has 3 columns, named Epi, Fib and IC. We use the epidish function in RPC mode to infer the cell-type fractions.

out.l <- epidish(beta.m = DummyBeta.m, ref.m = centEpiFibIC.m, method = "RPC") 

Then, we check the output list. estF is the matrix of estimated cell-type fractions. ref is the reference centroid matrix used, and dataREF is the subset of the input data matrix over the probes defined in the reference matrix.

out.l$estF
##            Epi        Fib           IC
## S1  0.08836819 0.06109607 0.8505357378
## S2  0.07652115 0.57326994 0.3502089007
## S3  0.15417391 0.75663136 0.0891947251
## S4  0.77082647 0.04171941 0.1874541181
## S5  0.03960599 0.31921224 0.6411817742
## S6  0.12751711 0.79642919 0.0760537000
## S7  0.18144315 0.72889883 0.0896580171
## S8  0.20220823 0.40929344 0.3884983293
## S9  0.19398079 0.80540932 0.0006098973
## S10 0.27976647 0.23671333 0.4835201992
dim(out.l$ref)
## [1] 599   3
dim(out.l$dataREF)
## [1] 599  10
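
As a quick sanity check on the output above, the sketch below verifies that the estimated fractions are non-negative and that each row sums to approximately 1, as the printed RPC estimates do, and then reports the cell-type with the largest estimated fraction in each sample. The names est.m and dominant.v are introduced here for illustration only.

```r
library(EpiDISH)
data(centEpiFibIC.m)
data(DummyBeta.m)

out.l <- epidish(beta.m = DummyBeta.m, ref.m = centEpiFibIC.m, method = "RPC")
est.m <- out.l$estF

# Fractions should be non-negative, and each row should sum to about 1
range(est.m)
summary(rowSums(est.m))

# Cell-type with the largest estimated fraction in each sample
dominant.v <- colnames(est.m)[max.col(est.m, ties.method = "first")]
names(dominant.v) <- rownames(est.m)
dominant.v
```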

During the quality control step of DNAm data preprocessing, low-quality probes may be removed from the full set of probes on the 450k or 850k array; consequently, not all probes in the reference may be present in the given dataset. By checking ref and dataREF, we can extract the probes actually used to build the model and infer the cell-type fractions. If the majority of the probes in the reference cannot be found, the estimated fractions might be compromised.

We now show an example of using the package to estimate cell-type fractions in whole blood. We use a subset of the beta value matrix of GSE42861 (see the manual page of LiuDataSub.m for a detailed description).
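
One way to check how much of a reference is covered by a dataset before running the inference is sketched below. It assumes that both matrices use CpG identifiers as row names; the 50% cutoff is an arbitrary illustrative value rather than a recommendation from the package, and the names ref.probes, common.probes and coverage are hypothetical.

```r
library(EpiDISH)
data(centEpiFibIC.m)
data(DummyBeta.m)

# Reference probes that are also present in the data matrix
ref.probes <- rownames(centEpiFibIC.m)
common.probes <- intersect(ref.probes, rownames(DummyBeta.m))

# Proportion of the reference that can be used for inference
coverage <- length(common.probes) / length(ref.probes)
coverage

# Arbitrary illustrative cutoff: flag datasets missing most reference probes
if (coverage < 0.5) {
  warning("Fewer than half of the reference probes were found in the data.")
}
```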

data(LiuDataSub.m)
# Load the whole blood reference explicitly, as for the other references
data(centDHSbloodDMC.m)
BloodFrac.m <- epidish(beta.m = LiuDataSub.m, ref.m = centDHSbloodDMC.m, method = "RPC")$estF

The inferred fractions can be checked with boxplots. As expected, they show that neutrophils are the major cell-type in whole blood.

boxplot(BloodFrac.m)
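
The boxplot can be complemented with a numeric summary. The sketch below ranks the cell-types by their median estimated fraction across samples; med.v is a name introduced here for illustration.

```r
library(EpiDISH)
data(LiuDataSub.m)
data(centDHSbloodDMC.m)

BloodFrac.m <- epidish(beta.m = LiuDataSub.m, ref.m = centDHSbloodDMC.m, method = "RPC")$estF

# Median estimated fraction per cell-type, largest first
med.v <- sort(apply(BloodFrac.m, 2, median), decreasing = TRUE)
round(med.v, 3)

# Cell-type with the highest median fraction
names(med.v)[1]
```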

3 How to estimate cell-type fractions in the two-step framework

HEpiDISH is an iterative hierarchical procedure built on EpiDISH. It uses two distinct DNAm references: a primary reference for estimating the fractions of several cell-types, and a separate, non-overlapping secondary reference for estimating the underlying subtype fractions of one of the cell-types in the primary reference.

Fig1. HEpiDISH workflow

In this example, the third cell-type in the primary reference is total immune cells, and we would like to know the fractions of the immune cell subtypes. We therefore use a secondary reference, which contains 7 immune cell subtypes, and tell the hepidish function that the third column of the primary reference corresponds to the secondary reference. (We include only 3 cell-types of the centBloodSub.m reference because we mixed those three cell-types to generate the dummy beta value matrix.)

data(centBloodSub.m)
frac.m <- hepidish(beta.m = DummyBeta.m, ref1.m = centEpiFibIC.m, ref2.m = centBloodSub.m[,c(1, 2, 5)], h.CT.idx = 3, method = 'RPC')
frac.m
##            Epi        Fib            B           NK       Mono
## S1  0.08836819 0.06109607 0.6446835622 0.0945693668 0.11128281
## S2  0.07652115 0.57326994 0.0502766152 0.2999322854 0.00000000
## S3  0.15417391 0.75663136 0.0381194625 0.0134501813 0.03762508
## S4  0.77082647 0.04171941 0.1434958145 0.0211681974 0.02279011
## S5  0.03960599 0.31921224 0.0167748647 0.1912747358 0.43313217
## S6  0.12751711 0.79642919 0.0286647024 0.0252778983 0.02211110
## S7  0.18144315 0.72889883 0.0515861314 0.0228453164 0.01522657
## S8  0.20220823 0.40929344 0.1908434542 0.1772700742 0.02038480
## S9  0.19398079 0.80540932 0.0003521377 0.0002577596 0.00000000
## S10 0.27976647 0.23671333 0.2546961632 0.1008399798 0.12798406
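
In the output above, the immune subtype fractions of each sample add up to the total immune cell (IC) fraction estimated by epidish with the primary reference alone. The sketch below checks this relationship numerically; sub.idx, sub.names and ic.sum.v are names introduced here for illustration.

```r
library(EpiDISH)
data(centEpiFibIC.m)
data(centBloodSub.m)
data(DummyBeta.m)

# One-step estimate with the primary reference only
out.l <- epidish(beta.m = DummyBeta.m, ref.m = centEpiFibIC.m, method = "RPC")

# Two-step estimate, splitting the third primary cell-type (IC) into subtypes
sub.idx <- c(1, 2, 5)
frac.m <- hepidish(beta.m = DummyBeta.m, ref1.m = centEpiFibIC.m,
                   ref2.m = centBloodSub.m[, sub.idx], h.CT.idx = 3, method = "RPC")

# Sum of the subtype fractions per sample, compared with the IC fraction
sub.names <- colnames(centBloodSub.m)[sub.idx]
ic.sum.v <- rowSums(frac.m[, sub.names])
max(abs(ic.sum.v - out.l$estF[, "IC"]))
```

A maximum difference close to zero indicates that the second step only redistributes the IC fraction among the subtypes and leaves the Epi and Fib estimates unchanged.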

4 More information about methods for cell-type fraction estimation

We compared CP and RPC in (Teschendorff et al. 2017). We also published a review article (Teschendorff and Zheng 2017) that discusses most of the algorithms for tackling cell-type heterogeneity in Epigenome-Wide Association Studies (EWAS). Refer to the References section for more details.
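
For a quick, informal look at how the three methods behave on the dummy data, the sketch below runs epidish once per method and correlates the CBS and CP estimates with the RPC estimates for each cell-type. It is an illustration on simulated data only and does not reproduce the published comparison; est.l and cor.m are names introduced here.

```r
library(EpiDISH)
data(centEpiFibIC.m)
data(DummyBeta.m)

# Estimate fractions with each of the three methods
methods.v <- c("RPC", "CBS", "CP")
est.l <- lapply(methods.v, function(m) {
  epidish(beta.m = DummyBeta.m, ref.m = centEpiFibIC.m, method = m)$estF
})
names(est.l) <- methods.v

# Pearson correlation with the RPC estimates, per cell-type (rows) and method (columns)
cor.m <- sapply(est.l[c("CBS", "CP")], function(e.m) {
  sapply(seq_len(ncol(e.m)), function(j) cor(est.l$RPC[, j], e.m[, j]))
})
rownames(cor.m) <- colnames(est.l$RPC)
round(cor.m, 3)
```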

5 How to identify differentially methylated cell-types in EWAS

After estimating cell-type fractions, we can identify differentially methylated cell-types and their directions of change using the CellDMC function (Zheng, Breeze, et al. 2018). The workflow of CellDMC is shown below.

Fig2. CellDMC workflow

We use a binary phenotype vector here, with half of the samples representing controls and the other half representing cases.

pheno.v <- rep(c(0, 1), each = 5)
celldmc.o <- CellDMC(DummyBeta.m, pheno.v, frac.m)

The predicted differentially methylated cell-types (DMCTs) are given below. (Please note that this is dummy data, and the sample size is too small to find DMCTs.)

head(celldmc.o$dmct)
##            DMC Epi Fib B NK Mono
## cg17506061   0   0   0 0  0    0
## cg09300980   0   0   0 0  0    0
## cg18886245   0   0   0 0  0    0
## cg17470327   0   0   0 0  0    0
## cg26082174   0   0   0 0  0    0
## cg14737131   0   0   0 0  0    0

The estimated coefficients for each cell-type are given in celldmc.o$coe. Please refer to the help page of CellDMC for more information.
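
A short sketch of how the CellDMC output might be summarised follows. It counts the non-zero calls in each column of the dmct matrix and inspects the top-level structure of coe rather than assuming its layout; the help page remains the authoritative description of both elements.

```r
library(EpiDISH)
data(centEpiFibIC.m)
data(centBloodSub.m)
data(DummyBeta.m)

frac.m <- hepidish(beta.m = DummyBeta.m, ref1.m = centEpiFibIC.m,
                   ref2.m = centBloodSub.m[, c(1, 2, 5)], h.CT.idx = 3, method = "RPC")
pheno.v <- rep(c(0, 1), each = 5)
celldmc.o <- CellDMC(DummyBeta.m, pheno.v, frac.m)

# Number of CpGs with a non-zero call in each column of the dmct matrix
colSums(celldmc.o$dmct != 0, na.rm = TRUE)

# Top-level structure of the coefficient output
str(celldmc.o$coe, max.level = 1)
```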

6 Session info

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] EpiDISH_2.18.0   BiocStyle_2.30.0
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.4         cli_3.6.1           knitr_1.44         
##  [4] magick_2.8.1        rlang_1.1.1         xfun_0.40          
##  [7] stringi_1.7.12      jsonlite_1.8.7      glue_1.6.2         
## [10] e1071_1.7-13        htmltools_0.5.6.1   sass_0.4.7         
## [13] rmarkdown_2.25      quadprog_1.5-8      grid_4.3.1         
## [16] evaluate_0.22       jquerylib_0.1.4     MASS_7.3-60        
## [19] fastmap_1.1.1       lifecycle_1.0.3     yaml_2.3.7         
## [22] locfdr_1.1-8        bookdown_0.36       stringr_1.5.0      
## [25] BiocManager_1.30.22 compiler_4.3.1      Rcpp_1.0.11        
## [28] lattice_0.22-5      digest_0.6.33       R6_2.5.1           
## [31] class_7.3-22        parallel_4.3.1      splines_4.3.1      
## [34] magrittr_2.0.3      bslib_0.5.1         Matrix_1.6-1.1     
## [37] proxy_0.4-27        tools_4.3.1         matrixStats_1.0.0  
## [40] cachem_1.0.8

References

Houseman, Eugene Andres, William P Accomando, Devin C Koestler, Brock C Christensen, Carmen J Marsit, Heather H Nelson, John K Wiencke, and Karl T Kelsey. 2012. “DNA methylation arrays as surrogate measures of cell mixture distribution.” BMC Bioinformatics 13 (1): 86.

Newman, Aaron M, Chih Long Liu, Michael R Green, Andrew J Gentles, Weiguo Feng, Yue Xu, Chuong D Hoang, Maximilian Diehn, and Ash A Alizadeh. 2015. “Robust enumeration of cell subsets from tissue expression profiles.” Nature Methods 12 (5): 453–57.

Teschendorff, Andrew E, Charles E Breeze, Shijie C Zheng, and Stephan Beck. 2017. “A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies.” BMC Bioinformatics 18 (1): 105.

Teschendorff, Andrew E, and Shijie C Zheng. 2017. “Cell-type deconvolution in epigenome-wide association studies: a review and recommendations.” Epigenomics 9 (5): 757–68.

Zheng, Shijie C, Charles E Breeze, Stephan Beck, and Andrew E Teschendorff. 2018. “Identification of differentially methylated cell-types in Epigenome-Wide Association Studies.” Nature Methods 15 (12): 1059–66.

Zheng, Shijie C, Amy P Webster, Danyue Dong, Andy Feber, David G Graham, Roisin Sullivan, Sarah Jevons, et al. 2018. “A novel cell-type deconvolution algorithm reveals substantial contamination by immune cells in saliva, buccal and cervix.” Epigenomics 10 (7): 925–40.