To install and load methylGSA
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("methylGSA")
Depending on the DNA methylation array type, other packages may be needed before running the analysis.
If analyzing 450K array, IlluminaHumanMethylation450kanno.ilmn12.hg19
needs to be loaded.
If analyzing EPIC array, IlluminaHumanMethylationEPICanno.ilm10b2.hg19
needs to be loaded.
If analyzing user-supplied mapping between CpG ID and gene name, neither IlluminaHumanMethylation450kanno.ilmn12.hg19
nor IlluminaHumanMethylationEPICanno.ilm10b2.hg19
needs to be loaded.
The methylGSA package contains functions to carry out gene set analysis adjusting for the number of CpGs of each gene. It has been shown by Geeleher et al [1] that gene set analysis is extremely biased for DNA methylation data. This package contains three main functions to adjust for the bias in gene set analysis.
methylglm: Incorporating number of CpGs as a covariate in logistic regression.
methylRRA: Adjusting for multiple p-values of each gene by Robust Rank Aggregation [2], and then apply either over-representation analysis (ORA) or Preranked version of Gene Set Enrichment Analysis (GSEAPreranked) [3] in gene set testing.
methylgometh: Adjusting the number of CpGs for each gene by weighted resampling and Wallenius non-central hypergeometric approximation. (via missMethyl [4])
methylglm is an extention of GOglm [9]. GOglm adjusts length bias for RNA-Seq data by incorporating gene length as a covariate in logistic regression model. methylglm adjusts length bias in DNA methylation by the number of CpGs instead of gene length. For each gene set, we fit a logistic regression model:
\[logit (\pi_{i}) = \beta_{0} + \beta_{1}x_{i} + \beta_{2}c_{i}\] For each gene \(i\), \(\pi_{i}\) = Pr(gene \(i\) is in gene set), \(x_{i}\) represents negative logarithmic transform of its minimum p-value in differential methylation analysis, and \(c_{i}\) is logarithmic transform of its number of CpGs.
methylglm requires only a simple named vector. This vector contains p-values of each CpG. Names should be their corresponding CpG IDs.
Here is what the input vector looks like:
cg00050873 cg00212031 cg00214611 cg01707559 cg02004872 cg02011394
0.2876 0.7883 0.4090 0.8830 0.9405 0.0456
cg02050847 cg02233190 cg02494853 cg02839557 cg02842889 cg03052502
0.5281 0.8924 0.5514 0.4566 0.9568 0.4533
cg03244189 cg03443143 cg03683899 cg03706273 cg03750315 cg04016144
0.6776 0.5726 0.1029 0.8998 0.2461 0.0421
cg04042030 cg04448376
0.3279 0.9545
Please note that the p-values here in cpg.pval
is just for illustration purposes. They are used to illustrate how to use the functions in methylGSA. The actual p-values in differential methylation analysis may be quite different from the p-values in cpg.pval
.
Then call methylglm:
res1 = methylglm(cpg.pval = cpg.pval, minsize = 200,
maxsize = 500, GS.type = "KEGG")
head(res1, 15)
ID Description Size pvalue padj
04080 04080 Neuroactive ligand-receptor interaction 272 0.4811308 1
04060 04060 Cytokine-cytokine receptor interaction 265 0.6961531 1
04010 04010 MAPK signaling pathway 268 1.0000000 1
04144 04144 Endocytosis 201 1.0000000 1
04510 04510 Focal adhesion 200 1.0000000 1
04740 04740 Olfactory transduction 388 1.0000000 1
04810 04810 Regulation of actin cytoskeleton 213 1.0000000 1
05200 05200 Pathways in cancer 326 1.0000000 1
Result is a data frame ranked by p-values of gene sets.
Robust rank aggregation [2] is a parameter free model that aggregates several ranked gene lists into a single gene list. The aggregation assumes random order of input lists and assign each gene a p-value based on order statistics. We apply this order statistics idea to adjust for number of CpGs.
For gene \(i\), let \(P_{1}, P_{2}, ... P_{n}\) be the p-values of individual CpGs in differential methylation analysis. Under the null hypothesis, \(P_{1}, P_{2}, ... P_{n} ~ \overset{i.i.d}{\sim} Unif[0, 1]\). Let \(P_{(1)}, P_{(2)}, ... P_{(n)}\) be the order statistics. Define: \[\rho = \text{min}\{\text{Pr}(P_{(1)}<P_{(1)\text{obs}}), \text{Pr}(P_{(2)}<P_{(2)\text{obs}})..., \text{Pr}(P_{(n)}<P_{(n)\text{obs}}) \} \]
methylRRA supports two approaches to adjust for number of CpGs, ORA and GSEAPreranked [3]. In ORA approach, for gene \(i\), conversion from \(\rho\) score into p-value is done by Bonferroni correction [2]. We get a p-value for each gene and these p-values are then corrected for multiple testing use Benjamini & Hochberg procedure [10]. By default, genes satisfy FDR<0.05 are considered DE genes. If there are no DE genes under FDR 0.05, users are able to use sig.cut
option to specify a higher FDR cut-off or topDE
option to declare top genes to be differentially expressed. We then apply ORA based on these DE genes.
In GSEAPreranked approach, for gene \(i\), we also convert \(\rho\) score into p-value by Bonferroni correction. p-values are converted into z-scores. We then apply Preranked version of Gene Set Enrichment Analysis (GSEAPreranked) on the gene list ranked by the z-scores.
To apply ORA approach, we use argument method = "ORA"
in methylRRA
To apply GSEAPreranked approach, we use argument method = "GSEA"
in methylRRA
methylgometh calls gometh
or gsameth
function in missMethyl package [4] to adjust number of CpGs in gene set testing. gometh
modifies goseq method [11] by fitting a probability weighting function and resampling from Wallenius non-central hypergeometric distribution.
methylgometh requires two inputs, cpg.pval
and sig.cut
. sig.cut
specifies the cut-off point to declare a CpG as differentially methylated. By default, sig.cut
is 0.001. Similar to methylRRA, if no CpG is significant, users are able to specify a higher cut-off or use topDE
option to declare top CpGs to be differentially methylated.
methylGSA provides many other options for users to customize the analysis.
array.type
is to specify which array type to use. It is either “450K” or “EPIC”. Default is “450K”. This argument will be ignored if FullAnnot
is provided.FullAnnot
is preprocessed mapping between CpG ID and gene name provided by prepareAnnot function. Default is NULL. Check example below for details.group
is the type of CpG the user which to consider. If group is “body” (“promoter”), CpGs on gene body (promoters) will be considered. Default is “all”, which means all CpGs are considered.GS.list
is user supplied gene sets to be tested. It should be a list with entry names gene sets IDs and elements correpond to genes that gene sets contain. If there is no input list, Gene Ontology is used.GS.idtype
is the type of gene ID in user supplied gene sets. If GS.list
is not empty, then the user is expected to provide gene ID type. Supported ID types are “SYMBOL”, “ENSEMBL”, “ENTREZID”, “REFSEQ”. Default is “SYMBOL”.GS.type
is the published gene sets/pathways to be tested if GS.list
is empty. Supported pathways are “GO”, “KEGG”, and “Reactome”. Default is “GO”.minsize
is an integer. If the number of genes in a gene set is less than this integer, this gene set is not tested. Default is 100.maxsize
is also an integer. If the number of genes in a gene set is greater than this integer, this gene set is not tested. Default is 500.method
is to specify gene set test method. It is either “ORA” or “GSEA” as described above.parallel
is either TRUE
or FALSE
indicating whether parallel should be used.Here an example of user supplied gene sets. The gene ID type is gene symbol
data(GSlisttoy)
## to make the display compact, only a proportion of each gene set is shown
head(lapply(GS.list, function(x) x[1:30]), 3)
$GS1
[1] "ABCA11P" "ACOT1" "ACSM2A" "ADAMTS4" "ADH4"
[6] "AGTR2" "AMAC1" "AMY1B" "AMY2A" "ANKRD13C"
[11] "ANKRD20A1" "ANKRD20A3" "ANKRD20A4" "ANXA2P3" "ANXA8"
[16] "ANXA8L1" "ARGFXP2" "ARL17B" "ATF5" "ATP5F1"
[21] "BAGE2" "BCL8" "BET3L" "C12orf24" "C15orf62"
[26] "C16orf93" "C17orf86" "C19orf70" "C1orf182" "C22orf29"
$GS2
[1] "KRTAP5-5" "KRTAP6-1" "KRTAP9-8" "KTI12"
[5] "LASS1" "LCMT2" "LGALS9B" "LOC100093631"
[9] "LOC100101115" "LOC100101266" "LOC100130264" "LOC100130932"
[13] "LOC100131193" "LOC100131726" "LOC100132247" "LOC100132832"
[17] "LOC100133920" "LOC100286938" "LOC144438" "LOC145820"
[21] "LOC148413" "LOC202781" "LOC286367" "LOC339047"
[25] "LOC388499" "LOC392196" "LOC399753" "LOC440895"
[29] "LOC441294" "LOC441455"
$GS3
[1] "SNORA17" "SNORA23" "SNORA25" "SNORA2B" "SNORA31"
[6] "SNORA36C" "SNORA64" "SNORD11" "SNORD113-5" "SNORD113-7"
[11] "SNORD114-1" "SNORD114-16" "SNORD114-17" "SNORD114-18" "SNORD114-21"
[16] "SNORD114-27" "SNORD114-28" "SNORD114-5" "SNORD114-6" "SNORD114-8"
[21] "SNORD114-9" "SNORD116-16" "SNORD116-26" "SNORD116-29" "SNORD12"
[26] "SNORD125" "SNORD126" "SNORD16" "SNORD38B" "SNORD50A"
This is an example of running methylglm with parallel
library(BiocParallel)
res_p = methylglm(cpg.pval = cpg.pval, minsize = 200,
maxsize = 500, GS.type = "KEGG", parallel = TRUE)
methylglm and methylRRA support user supplied CpG ID to gene mapping. The mapping is expected to be a matrix, or a data frame or a list. For a matrix or data frame, 1st column should be CpG ID and 2nd column should be gene name. For a list, entry names should be gene names and elements correpond to CpG IDs. This is an example of user supplied CpG to gene mapping:
CpG Gene
1 cg00050873 TSPY4
2 cg00212031 TTTY14
3 cg00214611 TMSB4Y
4 cg01707559 TBL1Y
5 cg02004872 TMSB4Y
6 cg02011394 TSPY4
To use user supplied mapping in methylglm or methylRRA, first preprocess the mapping by prepareAnnot function
Test the gene sets using “ORA” in methylRRA, use FullAnnot
argument to provide the preprocessed CpG ID to gene mapping
GS.list = GS.list[1:10]
res5 = methylRRA(cpg.pval = cpg.pval, FullAnnot = FullAnnot, method = "ORA",
GS.list = GS.list, GS.idtype = "SYMBOL",
minsize = 100, maxsize = 300)
head(res5, 10)
ID Count Size pvalue padj
GS1 GS1 0 157 1 1
GS2 GS2 0 257 1 1
GS3 GS3 0 181 1 1
GS4 GS4 0 274 1 1
GS5 GS5 0 285 1 1
GS6 GS6 0 108 1 1
GS7 GS7 0 202 1 1
GS8 GS8 0 273 1 1
GS9 GS9 0 206 1 1
GS10 GS10 0 187 1 1
Here is another example. Test Reactome pathways using methylglm
Following bar plot implemented in enrichplot [12], we also provide bar plot to visualize the gene set analysis results. The input of barplot
function can be any result returned by methylglm, methylRRA, or methylgometh. Various options are provided for users to customize the plot.
xaxis
is to specify the label in x-axis. It is either “Count” (number of significant genes in gene set) or “Size” (total number of genes in gene set). “Count” option is not available for methylglm and methylRRA(GSEA). Default is “Size”.num
is to specify the number of genes sets to display on the bar plot. Default is 5.colorby
is a string. Either “pvalue” or “padj”. Default is “padj”.title
is a string. The title to display on the bar plot. Default is NULL.Here is an example of using barplot to visualize the result of methylglm
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0
[2] minfi_1.30.0
[3] bumphunter_1.26.0
[4] locfit_1.5-9.1
[5] iterators_1.0.10
[6] foreach_1.4.4
[7] Biostrings_2.52.0
[8] XVector_0.24.0
[9] SummarizedExperiment_1.14.0
[10] DelayedArray_0.10.0
[11] BiocParallel_1.18.0
[12] matrixStats_0.54.0
[13] Biobase_2.44.0
[14] GenomicRanges_1.36.0
[15] GenomeInfoDb_1.20.0
[16] IRanges_2.18.0
[17] S4Vectors_0.22.0
[18] BiocGenerics_0.30.0
[19] methylGSA_1.2.3
loaded via a namespace (and not attached):
[1] fastmatch_1.1-0
[2] igraph_1.2.4.1
[3] plyr_1.8.4
[4] lazyeval_0.2.2
[5] splines_3.6.0
[6] ggplot2_3.1.1
[7] urltools_1.7.3
[8] digest_0.6.19
[9] htmltools_0.3.6
[10] GOSemSim_2.10.0
[11] viridis_0.5.1
[12] GO.db_3.8.2
[13] magrittr_1.5
[14] memoise_1.1.0
[15] limma_3.40.2
[16] readr_1.3.1
[17] annotate_1.62.0
[18] askpass_1.1
[19] siggenes_1.58.0
[20] enrichplot_1.4.0
[21] prettyunits_1.0.2
[22] colorspace_1.4-1
[23] blob_1.1.1
[24] ggrepel_0.8.1
[25] BiasedUrn_1.07
[26] xfun_0.7
[27] dplyr_0.8.1
[28] jsonlite_1.6
[29] crayon_1.3.4
[30] RCurl_1.95-4.12
[31] genefilter_1.66.0
[32] GEOquery_2.52.0
[33] IlluminaHumanMethylationEPICmanifest_0.3.0
[34] survival_2.44-1.1
[35] glue_1.3.1
[36] ruv_0.9.7
[37] polyclip_1.10-0
[38] registry_0.5-1
[39] gtable_0.3.0
[40] zlibbioc_1.30.0
[41] UpSetR_1.4.0
[42] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
[43] Rhdf5lib_1.6.0
[44] HDF5Array_1.12.1
[45] scales_1.0.0
[46] DOSE_3.10.1
[47] DBI_1.0.0
[48] rngtools_1.3.1.1
[49] bibtex_0.4.2
[50] Rcpp_1.0.1
[51] viridisLite_0.3.0
[52] xtable_1.8-4
[53] progress_1.2.2
[54] gridGraphics_0.4-1
[55] europepmc_0.3
[56] bit_1.1-14
[57] reactome.db_1.68.0
[58] mclust_5.4.3
[59] preprocessCore_1.46.0
[60] missMethyl_1.18.0
[61] httr_1.4.0
[62] fgsea_1.10.0
[63] RColorBrewer_1.1-2
[64] pkgconfig_2.0.2
[65] reshape_0.8.8
[66] XML_3.98-1.19
[67] farver_1.1.0
[68] labeling_0.3
[69] ggplotify_0.0.3
[70] tidyselect_0.2.5
[71] rlang_0.3.4
[72] reshape2_1.4.3
[73] AnnotationDbi_1.46.0
[74] munsell_0.5.0
[75] tools_3.6.0
[76] RSQLite_2.1.1
[77] ggridges_0.5.1
[78] evaluate_0.14
[79] stringr_1.4.0
[80] yaml_2.2.0
[81] org.Hs.eg.db_3.8.2
[82] knitr_1.23
[83] bit64_0.9-7
[84] beanplot_1.2
[85] methylumi_2.30.0
[86] scrime_1.3.5
[87] purrr_0.3.2
[88] ggraph_1.0.2
[89] nlme_3.1-140
[90] doRNG_1.7.1
[91] nor1mix_1.2-3
[92] DO.db_2.9
[93] xml2_1.2.0
[94] biomaRt_2.40.0
[95] compiler_3.6.0
[96] statmod_1.4.30
[97] tibble_2.1.1
[98] tweenr_1.0.1
[99] stringi_1.4.3
[100] GenomicFeatures_1.36.1
[101] lattice_0.20-38
[102] IlluminaHumanMethylationEPICanno.ilm10b2.hg19_0.6.0
[103] Matrix_1.2-17
[104] IlluminaHumanMethylation450kmanifest_0.4.0
[105] multtest_2.40.0
[106] pillar_1.4.1
[107] triebeard_0.3.0
[108] cowplot_0.9.4
[109] data.table_1.12.2
[110] bitops_1.0-6
[111] rtracklayer_1.44.0
[112] qvalue_2.16.0
[113] R6_2.4.0
[114] gridExtra_2.3
[115] codetools_0.2-16
[116] MASS_7.3-51.4
[117] assertthat_0.2.1
[118] rhdf5_2.28.0
[119] openssl_1.3
[120] pkgmaker_0.27
[121] withr_2.1.2
[122] GenomicAlignments_1.20.0
[123] Rsamtools_2.0.0
[124] GenomeInfoDbData_1.2.1
[125] hms_0.4.2
[126] clusterProfiler_3.12.0
[127] quadprog_1.5-7
[128] grid_3.6.0
[129] tidyr_0.8.3
[130] base64_2.0
[131] rvcheck_0.1.3
[132] rmarkdown_1.13
[133] DelayedMatrixStats_1.6.0
[134] RobustRankAggreg_1.1
[135] illuminaio_0.26.0
[136] ggforce_0.2.2
[1] Geeleher, Paul, Lori Hartnett, Laurance J Egan, Aaron Golden, Raja Affendi Raja Ali, and Cathal Seoighe. 2013. Gene-Set Analysis Is Severely Biased When Applied to Genome-Wide Methylation Data. Bioinformatics 29 (15). Oxford University Press: 1851–7.
[2] Kolde, Raivo, Sven Laur, Priit Adler, and Jaak Vilo. 2012. Robust Rank Aggregation for Gene List Integration and Meta-Analysis. Bioinformatics 28 (4). Oxford University Press: 573–80.
[3] Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles. Proceedings of the National Academy of Sciences 102 (43). National Acad Sciences: 15545–50.
[4] Phipson, Belinda, Jovana Maksimovic, and Alicia Oshlack. 2015. MissMethyl: An R Package for Analyzing Data from Illumina’s Humanmethylation450 Platform. Bioinformatics 32 (2). Oxford University Press: 286–88.
[5] Carlson M (2018). org.Hs.eg.db: Genome wide annotation for Human. R package version 3.6.0.
[6] Ligtenberg W (2018). reactome.db: A set of annotation maps for reactome. R package version 1.64.0.
[7] Hansen, KD. 2016a. IlluminaHumanMethylation450kanno.ilmn12.hg19: Annotation for Illumina’s 450k Methylation Arrays. R Package, Version 0.6.0 1.
[8] ———. 2016b. IlluminaHumanMethylationEPICanno.ilm10b2.hg19: Annotation for Illumina’s Epic Methylation Arrays. R Package, Version 0.6.0 1.
[9] Mi, Gu, Yanming Di, Sarah Emerson, Jason S Cumbie, and Jeff H Chang. 2012. Length Bias Correction in Gene Ontology Enrichment Analysis Using Logistic Regression. PloS One 7 (10). Public Library of Science: e46128.
[10] Benjamini, Yoav, and Yosef Hochberg. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 289–300.
[11] Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. 2012. Goseq: Gene Ontology Testing for Rna-Seq Datasets. R Bioconductor.
[12] Yu G (2018). enrichplot: Visualization of Functional Enrichment Result. R package version 1.0.2, https://github.com/GuangchuangYu/enrichplot.