1. Introduction

EpiMix is a comprehensive tool for the integrative analysis of DNA methylation data and gene expression data (Figure 1). EpiMix enables automated data downloading (from TCGA or GEO), pre-processing, methylation modeling, interactive visualization and functional annotation. To identify hypo- or hypermethylated genes, EpiMix uses a beta mixture model to identify the methylation states of each CpG site and compares the DNA methylation of an experimental group to a control group. The output from EpiMix is the functional changes in DNA methylation that is associated with gene expression.

EpiMix incorporates specialized algorithms to identify functional DNA methylation at various genetic elements, including proximal cis-regulatory elements within or surrounding protein-coding genes, distal enhancers, and genes encoding microRNAs (miRNAs) or lncRNAs. There are four alternative analytic modes for modeling DNA methylation at different genetic elements:

  • Regular: cis-regulatory elements within or surrounding protein-coding genes.
  • Enhancer: distal enhancers.
  • miRNA: miRNA-coding genes.
  • lncRNA: lncRNA-coding genes.

Figure 1. Overview of EpiMix’s workflow. EpiMix incorporates four functional modules: downloading, preprocessing, methylation modeling and functional analysis. The methylation modeling module enables four alternative analytic modes, including “Regular”, “Enhancer”, “miRNA” and “lncRNA”. These analytic modes target DNA methylation analysis on different genetic elements.

2. Installation

2.2 Github

To load the EpiMix package in your R session, type library(EpiMix)

Help files. Detailed information on each function of the EpiMix package can be obtained in the help files. For example, to view the help file for the function EpiMix, use ?EpiMix.


3. Data input

The EpiMix function requires three data matricies as input:

  • DNA methylation data: a matrix of the DNA methylation data (beta values) with CpG sites in rows and samples in columns.
TCGA-05-4384-01 TCGA-05-4390-01 TCGA-05-4396-01 TCGA-05-4405-01 TCGA-05-4410-01
cg14029001 0.5137112 0.7590496 0.7344551 0.3842839 0.7792050
cg00374492 0.0819639 0.1277514 0.1730193 0.1037328 0.1079144
cg21462428 0.1397001 0.5207123 0.0738277 0.4853540 0.3803233
cg03987115 0.3346637 0.4516052 0.5916153 0.1713237 0.2410597
cg19351701 0.2816598 0.4035316 0.4949594 0.2380439 0.2139926
  • (Optional) Gene expression data: a matrix of the matched gene expression data with genes in rows and samples in columns. If gene expression data are not provided, no comparision for gene expression will be made beteween the differentially methylated groups. Gene expression data can be normalized values from any methods: TPM, RPKM, FPKM, etc.
TCGA-05-4244-01 TCGA-05-4249-01 TCGA-05-4250-01 TCGA-05-4382-01 TCGA-05-4384-01
CCND2 8.453134 9.740521 8.732788 10.00435 8.772762
CCND3 10.891247 11.427828 10.853533 10.30242 11.863776
HOXA9 1.481609 2.757258 1.992442 1.93693 2.096296
MANBAL 10.548016 10.548264 10.558500 10.18544 10.080922
SRC 11.182247 11.940123 11.198035 11.54182 10.154679
  • Sample annotation: a dataframe that maps each sample in the DNA methylation data to a study group. Should contain two columns: the first column (named “primary”) indicating the sample names, and the second column (named “sample.type”) indicating which study group each sample belongs to (e.g.,“Cancer” vs. “Normal”, “Experiment” vs. “Control”).
primary sample.type
TCGA-05-4384-01 Cancer
TCGA-05-4390-01 Cancer
TCGA-05-4396-01 Cancer
TCGA-05-4405-01 Cancer
TCGA-50-6592-11 Normal
TCGA-50-6593-11 Normal
TCGA-50-6594-11 Normal
TCGA-73-4658-11 Normal
TCGA-73-4676-11 Normal


4. Methylation modeling


4.1 Regular mode

One of the major outputs of EpiMix is a $FunctionalPairs matrix, indicating the differentially methylated CpG sites that are signficantly associated with gene expression:

4.2 Enhancer mode

The Enhancer mode targets DNA methylation analysis on distal enhancers of protein-coding genes.

4.3 miRNA mode

The miRNA mode targets DNA methylation analysis on miRNA-coding genes.

4.4 lncRNA mode

The lncRNA mode targets DNA methylation analysis on lncRNA-coding genes.


6. Step-by-step functions for TCGA data

The above one-step functions in Section can be executed in a step-by-step manner in case users want to inspect the output from each individual step.

6.1 Download and preprocess DNA methylation data from TCGA

  • Download DNA methylation data
  • Preprocess DNA methylation data

Preprocessing includes eliminating samples and genes with too many missing values (default: 20%), imputing remaining missing values, removing single-nucleotide polymorphism (SNP) probes.

The pre-processed DNA methylation data is a matrix with CpG probes in rows and patient in columns. The values in the matrix represent beta values of DNA methylation:

TCGA-04-1331-01 TCGA-04-1332-01 TCGA-04-1335-01 TCGA-04-1336-01 TCGA-04-1337-01
cg00000292 0.9221302 0.4648186 0.9042943 0.7567368 0.8791313
cg00002426 0.1486593 0.0562675 0.2632028 0.2009483 0.0726626
cg00003994 0.0293567 0.0344792 0.0552332 0.0840847 0.0209836
cg00005847 0.8248150 0.4201426 0.6690116 0.7688814 0.7630715
cg00007981 0.0177981 0.0122320 0.0410097 0.0276765 0.0110274

Optional. Since TCGA data were collected in technical batches, systematic differences may exist between technical batches. Users can optionally correct the batch effect using thedoBatchCorrection parameter. EpiMix provides two alternative methods for batch effect correction: Seurat and Combat. The Seurat method (PMID: 31178118) is much more time efficient compared to the Combat (PMID: 16632515). If using the Combat method, users are encouraged to use multiple CPU cores by tuning the cores parameter.


6.2 Download and preprocess gene expression data

  • Preprocess gene expression data

The pre-processed gene expression data is a matrix with gene in rows and patients in columns.

Example of gene expression data:

TCGA-04-1331-01 TCGA-04-1332-01 TCGA-04-1335-01 TCGA-04-1336-01 TCGA-04-1337-01
ELMO2 -0.6204167 0.184750 -0.6716667 -1.105500 0.7858333
CREB3L1 -0.0032500 1.008500 1.2210000 -0.623000 1.1265000
RPS11 0.5672500 0.967625 -0.2337500 0.555375 0.6608750
PNMA1 1.2940000 1.159000 -0.8425000 0.476750 0.6160000
MMP2 -0.3280000 0.416500 -1.1091667 -1.716333 0.9626667
  • Generate the sample annotation

To identify the hypo- or hyper-methylated genes, EpiMix compares the DNA methylation in tumors to normal tissues. Therefore, EpiMix needs to know which samples in the DNA methylation and gene expression data are tumors (group.1), and which are normal tissues (group.2). The TCGA_GetSampleInfo function can be used to generate a dataframe of sample information.

The sample.info is a dataframe with two columns: the primary column indicates the sample identifiers and the sample.type column indicates which group each sample belongs to:

primary sample.type
TCGA-04-1331-01 Cancer
TCGA-04-1332-01 Cancer
TCGA-04-1335-01 Cancer
TCGA-04-1336-01 Cancer


7. Visualization

EpiMix enables diverse types of visualization.

7.2 Genome-browser style visualization

8. Pathway enrichment analysis

EpiMix integrates the clusterProfiler R package to perform the function enrichment analysis of the differentially methylated genes. Enrichment results can be visualized in both tabular format and in graphic format. See this paper for more details.

8.1 Gene ontology (GO) analysis

ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
GO:0009952 anterior/posterior pattern specification 3/6 218/18903 0.0000295 0.0036235 0.0011056 HOXA9/HOXB7/TDRD10 3
GO:0045737 positive regulation of cyclin-dependent protein serine/threonine kinase activity 2/6 50/18903 0.0001022 0.0036235 0.0011056 CCND2/CCND3 2
GO:1904031 positive regulation of cyclin-dependent protein kinase activity 2/6 54/18903 0.0001193 0.0036235 0.0011056 CCND2/CCND3 2
GO:1900087 positive regulation of G1/S transition of mitotic cell cycle 2/6 55/18903 0.0001238 0.0036235 0.0011056 CCND2/CCND3 2
GO:0003002 regionalization 3/6 360/18903 0.0001313 0.0036235 0.0011056 HOXA9/HOXB7/TDRD10 3
GO:1902808 positive regulation of cell cycle G1/S phase transition 2/6 68/18903 0.0001895 0.0043583 0.0013298 CCND2/CCND3 2

8.2 KEGG pathway enrichment analysis

ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
hsa04115 p53 signaling pathway 2/3 73/8225 0.0002318 0.0032847 0.0005186 894/896 2
hsa04110 Cell cycle 2/3 127/8225 0.0007025 0.0032847 0.0005186 894/896 2
hsa05162 Measles 2/3 139/8225 0.0008413 0.0032847 0.0005186 894/896 2
hsa04218 Cellular senescence 2/3 156/8225 0.0010590 0.0032847 0.0005186 894/896 2
hsa04390 Hippo signaling pathway 2/3 157/8225 0.0010726 0.0032847 0.0005186 894/896 2
hsa04630 JAK-STAT signaling pathway 2/3 166/8225 0.0011986 0.0032847 0.0005186 894/896 2


9. Biomarker identification

After the differentially methylated genes were identified, we look for the genes whose methylation states are associated with patient survival. For each differentially methylated CpG, we compare the survival of the abnormally methylated patients to the normally methylated patients. The GetSurvivalProbe function generates all the survival associated CpGs.

Probe Genes State HR lower.Cl higher.Cl p.value adjusted.p.value
cg00909706 hsa-mir-34a Hyper 1.8164322924391 1.17176293780833 2.8157796825242 0.0067882646557248 0.0067883

Plot the Kaplan-Meier survival curve for the normally and the abnormally methylated patients.

Survival curve for patients showing different methylation states at the CpG cg00909706

Survival curve for patients showing different methylation states at the CpG cg00909706

10. Session Information

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] survminer_0.4.9                        
#>  [2] ggpubr_0.6.0                           
#>  [3] ggplot2_3.4.1                          
#>  [4] survival_3.5-3                         
#>  [5] clusterProfiler_4.6.0                  
#>  [6] org.Hs.eg.db_3.16.0                    
#>  [7] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#>  [8] GenomicFeatures_1.50.4                 
#>  [9] AnnotationDbi_1.60.0                   
#> [10] Biobase_2.58.0                         
#> [11] karyoploteR_1.24.0                     
#> [12] regioneR_1.30.0                        
#> [13] GenomicRanges_1.50.2                   
#> [14] GenomeInfoDb_1.34.9                    
#> [15] IRanges_2.32.0                         
#> [16] S4Vectors_0.36.1                       
#> [17] sesameData_1.16.0                      
#> [18] DT_0.27                                
#> [19] EpiMix_1.0.1                           
#> [20] EpiMix.data_1.0.0                      
#> [21] ExperimentHub_2.6.0                    
#> [22] AnnotationHub_3.6.0                    
#> [23] BiocFileCache_2.6.0                    
#> [24] dbplyr_2.3.0                           
#> [25] BiocGenerics_0.44.0                    
#> 
#> loaded via a namespace (and not attached):
#>   [1] rappdirs_0.3.3                rtracklayer_1.58.0           
#>   [3] bamsignals_1.30.0             bezier_1.1.2                 
#>   [5] tidyr_1.3.0                   bit64_4.0.5                  
#>   [7] knitr_1.42                    DelayedArray_0.24.0          
#>   [9] data.table_1.14.6             rpart_4.1.19                 
#>  [11] KEGGREST_1.38.0               RCurl_1.98-1.10              
#>  [13] GEOquery_2.66.0               AnnotationFilter_1.22.0      
#>  [15] doParallel_1.0.17             generics_0.1.3               
#>  [17] snow_0.4-4                    cowplot_1.1.1                
#>  [19] RSQLite_2.2.20                shadowtext_0.1.2             
#>  [21] bit_4.0.5                     tzdb_0.3.0                   
#>  [23] enrichplot_1.18.3             xml2_1.3.3                   
#>  [25] httpuv_1.6.9                  SummarizedExperiment_1.28.0  
#>  [27] assertthat_0.2.1              viridis_0.6.2                
#>  [29] xfun_0.37                     hms_1.1.2                    
#>  [31] jquerylib_0.1.4               evaluate_0.20                
#>  [33] promises_1.2.0.1              fansi_1.0.4                  
#>  [35] restfulr_0.0.15               progress_1.2.2               
#>  [37] km.ci_0.5-6                   igraph_1.4.0                 
#>  [39] DBI_1.1.3                     htmlwidgets_1.6.1            
#>  [41] purrr_1.0.1                   ellipsis_0.3.2               
#>  [43] crosstalk_1.2.0               dplyr_1.1.0                  
#>  [45] backports_1.4.1               biomaRt_2.54.0               
#>  [47] deldir_1.0-6                  MatrixGenerics_1.10.0        
#>  [49] vctrs_0.5.2                   ensembldb_2.22.0             
#>  [51] abind_1.4-5                   cachem_1.0.6                 
#>  [53] withr_2.5.0                   ggforce_0.4.1                
#>  [55] HDO.db_0.99.1                 BSgenome_1.66.2              
#>  [57] checkmate_2.1.0               GenomicAlignments_1.34.0     
#>  [59] treeio_1.22.0                 prettyunits_1.1.1            
#>  [61] cluster_2.1.4                 DOSE_3.24.2                  
#>  [63] RPMM_1.25                     ape_5.6-2                    
#>  [65] lazyeval_0.2.2                crayon_1.5.2                 
#>  [67] pkgconfig_2.0.3               labeling_0.4.2               
#>  [69] tweenr_2.0.2                  nlme_3.1-162                 
#>  [71] ProtGenerics_1.30.0           nnet_7.3-18                  
#>  [73] rlang_1.0.6                   lifecycle_1.0.3              
#>  [75] downloader_0.4                filelock_1.0.2               
#>  [77] doSNOW_1.0.20                 dichromat_2.0-0.1            
#>  [79] polyclip_1.10-4               matrixStats_0.63.0           
#>  [81] Matrix_1.5-3                  aplot_0.1.9                  
#>  [83] KMsurv_0.1-5                  carData_3.0-5                
#>  [85] zoo_1.8-11                    base64enc_0.1-3              
#>  [87] png_0.1-8                     viridisLite_0.4.1            
#>  [89] rjson_0.2.21                  bitops_1.0-7                 
#>  [91] gson_0.0.9                    Biostrings_2.66.0            
#>  [93] blob_1.2.3                    stringr_1.5.0                
#>  [95] qvalue_2.30.0                 readr_2.1.4                  
#>  [97] jpeg_0.1-10                   rstatix_0.7.2                
#>  [99] gridGraphics_0.5-1            ggsignif_0.6.4               
#> [101] scales_1.2.1                  memoise_2.0.1                
#> [103] magrittr_2.0.3                plyr_1.8.8                   
#> [105] zlibbioc_1.44.0               compiler_4.2.2               
#> [107] scatterpie_0.1.8              BiocIO_1.8.0                 
#> [109] RColorBrewer_1.1-3            Rsamtools_2.14.0             
#> [111] cli_3.6.0                     XVector_0.38.0               
#> [113] patchwork_1.1.2               htmlTable_2.4.1              
#> [115] Formula_1.2-4                 MASS_7.3-58.2                
#> [117] mgcv_1.8-41                   tidyselect_1.2.0             
#> [119] stringi_1.7.12                highr_0.10                   
#> [121] yaml_2.3.7                    GOSemSim_2.24.0              
#> [123] latticeExtra_0.6-30           ggrepel_0.9.3                
#> [125] survMisc_0.5.6                grid_4.2.2                   
#> [127] sass_0.4.5                    VariantAnnotation_1.44.0     
#> [129] fastmatch_1.1-3               tools_4.2.2                  
#> [131] parallel_4.2.2                rstudioapi_0.14              
#> [133] foreach_1.5.2                 foreign_0.8-84               
#> [135] gridExtra_2.3                 farver_2.1.1                 
#> [137] ggraph_2.1.0                  digest_0.6.31                
#> [139] BiocManager_1.30.19           shiny_1.7.4                  
#> [141] Rcpp_1.0.10                   car_3.1-1                    
#> [143] broom_1.0.3                   BiocVersion_3.16.0           
#> [145] later_1.3.0                   httr_1.4.4                   
#> [147] biovizBase_1.46.0             colorspace_2.1-0             
#> [149] XML_3.99-0.13                 ELMER.data_2.22.0            
#> [151] splines_4.2.2                 yulab.utils_0.0.6            
#> [153] tidytree_0.4.2                graphlayouts_0.8.4           
#> [155] ggplotify_0.1.0               xtable_1.8-4                 
#> [157] jsonlite_1.8.4                ggtree_3.6.2                 
#> [159] tidygraph_1.2.3               ggfun_0.0.9                  
#> [161] R6_2.5.1                      Hmisc_4.8-0                  
#> [163] pillar_1.8.1                  htmltools_0.5.4              
#> [165] mime_0.12                     glue_1.6.2                   
#> [167] fastmap_1.1.0                 BiocParallel_1.32.5          
#> [169] interactiveDisplayBase_1.36.0 codetools_0.2-19             
#> [171] fgsea_1.24.0                  utf8_1.2.3                   
#> [173] lattice_0.20-45               bslib_0.4.2                  
#> [175] tibble_3.1.8                  curl_5.0.0                   
#> [177] GO.db_3.16.0                  interp_1.1-3                 
#> [179] limma_3.54.1                  rmarkdown_2.20               
#> [181] munsell_0.5.0                 GenomeInfoDbData_1.2.9       
#> [183] iterators_1.0.14              impute_1.72.3                
#> [185] reshape2_1.4.4                gtable_0.3.1