An automated tool for discovering hidden biological and technical links

Quick Start

knowYourCG is a tool for evaluating CpG feature enrichment using Illumina probe IDs. This tool automates hypothesis testing by asking whether a set of CpGs (indexed by Illumina methylation chip IDs, hence a sparse representation of the methylome) is enriched in certain categories or features. These categories or features can be categorical (e.g., CpGs located at tissue-specific transcription factor binding sites) or continuous (e.g., the local CpG density). Additionally, the set of CpGs to which the test is applied can itself be categorical or continuous.

The set of CpGs that will be tested for enrichment is called the query set, and the set of CpGs that will be used to determine enrichment is called the database set. A query set might be, for example, the result of a differential methylation analysis or of an epigenome-wide association study. We have curated our own database sets from a variety of sources. They describe different categorical and continuous features, such as technical characterization of the probes, CpGs associated with certain chromatin states, gene association, transcription factor binding sites, and CpG density.

knowYourCG also supports feature selection and feature engineering; this functionality is currently in development.

The following commands prepare knowYourCG for use:

library(sesame)
sesameDataCache()

Our example uses a specific mouse design group as input (PGCMeth, methylated in primordial germ cells). First, get the CG list using the following code:

query <- KYCG_getDBs("MM285.designGroup")[["PGCMeth"]]
head(query)

## [1] "cg36615889_TC11" "cg36646136_BC21" "cg36647910_BC11" "cg36857173_TC21"
## [5] "cg36877289_BC21" "cg36899653_BC21"

Now test the enrichment over database groups. By default, KYCG will select all the categorical groups and overlapping genes (CpGs associated with a gene).

results_pgc <- testEnrichment(query)
head(results_pgc)

We can visualize the result of this test using the KYCG_plotEnrichAll function:

KYCG_plotEnrichAll(results_pgc)

## Loading required namespace: ggrepel

This plot groups different database sets along the x-axis and plots -log10(FDR) on the y-axis. As expected, the PGCMeth group itself appears at the top of the list. One can also find histone H3K9me3, chromHMM Het, and transcription factor Trim28 binding enriched in this CG group.
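The y-axis quantity can be reproduced by hand. The sketch below uses made-up p-values and borrows a few labels from the text purely for illustration. It assumes a Benjamini-Hochberg adjustment, which may not be the exact adjustment KYCG applies internally.

```r
# Made-up p-values for four illustrative database sets
pvals <- c(PGCMeth = 0.001, H3K9me3 = 0.01, Trim28 = 0.02, Other = 0.5)

# Benjamini-Hochberg adjustment (an assumption about the FDR method)
fdr <- p.adjust(pvals, method = "BH")

# Transform so that more significant sets receive larger values
neg_log10_fdr <- -log10(fdr)

# Rank database sets from most to least significant
ranked <- sort(neg_log10_fdr, decreasing = TRUE)
print(round(ranked, 2))
```

Sets with a small FDR end up at the top of the ranking, which is why the most enriched databases stand out in the plot.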

4 Testing Scenarios

There are four testing scenarios, depending on the type of the query set and of the database set. They are shown in the table below. testEnrichment and testEnrichmentGSEA perform Fisher’s exact test and GSEA, respectively.

Four KnowYourCG Testing Scenarios
	Continuous Database	Discrete Database
Continuous Query	Correlation-based	GSEA
Discrete Query	GSEA	Fisher’s Exact Test
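
The table can be read as a simple dispatch rule on the types of the two inputs. The sketch below encodes that rule in a hypothetical helper, choose_test(), which is not part of sesame. It assumes that a character vector denotes a discrete set and a numeric vector a continuous one.

```r
# Hypothetical helper (not a sesame function): pick a test from input types.
# A character vector is treated as discrete, a numeric vector as continuous.
choose_test <- function(query, database) {
  query_cont <- is.numeric(query)
  db_cont <- is.numeric(database)
  if (query_cont && db_cont) {
    "Correlation-based"
  } else if (!query_cont && !db_cont) {
    "Fisher's exact test"
  } else {
    "GSEA"
  }
}

discrete <- c("cg0001", "cg0002")               # made-up probe IDs
continuous <- c(cg0001 = 0.12, cg0003 = 0.87)   # made-up values

choose_test(discrete, discrete)
choose_test(continuous, discrete)
choose_test(continuous, continuous)
```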

Test Set Enrichment

The main workhorse for testing the enrichment of a categorical query against categorical databases is the testEnrichment function. This function calculates the extent of overlap and applies different statistics for enrichment testing. testEnrichment() performs Fisher’s exact test (one-tailed by default, optionally two-tailed) and reports metrics for each of the loaded database sets.

Choice of universal set: The universal set is the set of all probes for a given platform. It can be passed in directly through the universeSet argument, or the platform name can be passed through the platform argument. If neither is supplied, the universe set will be inferred from the probes.
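
To make the role of the universe concrete, the following sketch runs a one-tailed Fisher’s exact test in base R. The probe IDs and set sizes are made up, and the code is only an illustration of the underlying test, not the internals of testEnrichment.

```r
# Made-up probe IDs standing in for a platform's universe set
universe <- paste0("cg", sprintf("%04d", 1:1000))
query    <- universe[1:50]      # hypothetical query set
database <- universe[31:130]    # hypothetical database set

in_query <- universe %in% query
in_db    <- universe %in% database

# 2x2 contingency table: query membership by database membership
tab <- table(
  query    = factor(in_query, levels = c(TRUE, FALSE)),
  database = factor(in_db, levels = c(TRUE, FALSE))
)

# One-tailed test for over-representation of the database in the query
res <- fisher.test(tab, alternative = "greater")
res$p.value
```

The universe determines the counts outside both sets, so a different universe changes the bottom-right cell of the table and therefore the p-value.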

library(SummarizedExperiment)
library(dplyr)  # provides the %>% pipe used below

## prepare a query
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "fetal_brain" & df$type == "Hypo"]

results <- testEnrichment(query, "TFBS")
results %>% dplyr::filter(overlap>10) %>% head

## prepare another query
query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
results <- testEnrichment(query, "TFBS")
results %>% dplyr::filter(overlap>10) %>%
    dplyr::select(dbname, estimate, test, FDR) %>% head

The output of each test contains at least four variables: the estimate (fold enrichment, not the test statistic), the p-value, the type of test, and whether metadata is included in the tested database set (hasMeta). The name of the database set and the database group are reported as well. By default, the results are sorted by the estimate column.

Note that the estimate (or test statistic) is test dependent, and comparison between p-values should be limited to results from the same type of test. For instance, the test statistic for Fisher’s exact test and GSEA is the log fold change, while the test statistic for Spearman’s test is simply the rank-order correlation coefficient. For simplicity, we report all of the test types in one data frame.

The nQ and nD columns identify the length of the query set and the database set, respectively. Often, it’s important to examine the extent of overlap between the two sets, so that metric is reported as well in the overlap column.
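
The sketch below shows how such columns could be derived for a few toy database sets. The helper summarize_overlap() and all probe IDs are hypothetical. The fold enrichment is computed as observed overlap over the overlap expected by chance, which is an assumption and may differ from the exact formula used by testEnrichment.

```r
universe <- paste0("cg", 1:1000)    # made-up probe IDs
query <- universe[1:100]
dbs <- list(                        # hypothetical database sets
  setA = universe[51:250],
  setB = universe[500:599]
)

# Hypothetical helper: one row of overlap metrics per database set
summarize_overlap <- function(db, query, universe) {
  nQ <- length(query)
  nD <- length(db)
  overlap <- length(intersect(query, db))
  # Overlap expected by chance if query and database were independent
  expected <- nQ * nD / length(universe)
  data.frame(nQ = nQ, nD = nD, overlap = overlap,
             log2_fold = log2(overlap / expected))
}

metrics <- do.call(rbind, lapply(dbs, summarize_overlap,
                                 query = query, universe = universe))
metrics
```

A database with no overlap, such as setB here, yields a log2 fold enrichment of -Inf, which is one reason to inspect the overlap column alongside the estimate.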

See the Supplemental Vignette for other ways of visualizing enrichment results.

Database Sets

The success of enrichment testing depends critically on the availability of biologically-relevant databases. To reflect the biological meaning of databases and facilitate selective testing, we have organized our database sets into different groups. Each group contains one or multiple databases. Here is how to find the names of pre-built database groups:

KYCG_listDBGroups("MM285")

The KYCG_listDBGroups() function returns a data frame describing these databases. The Title column is the accession key needed by the testEnrichment function. With an accession, one can either pass it directly to testEnrichment or explicitly call KYCG_getDBs() to retrieve the databases themselves.

Caching these databases on the local machine is important for two reasons: it limits the number of requests sent to the Bioconductor server, and it spares the user the wait of re-downloading database sets. For this reason, one should run sesameDataCache() before loading any database sets. Downloading all of the database sets takes some time, but it only needs to be done once per installation.

During the analysis, database sets are identified by these accessions. Sesame also does some guessing when a unique substring is given. For example, “MM285.design” below retrieves the “KYCG.MM285.designGroup.20210210” database. Let’s look at the database group that we used as the query in our first example (query and database are reciprocal):

dbs <- KYCG_getDBs("MM285.design")

## Selected the following database groups:

## 1. KYCG.MM285.designGroup.20210210

In total, 32 datasets have been loaded for this group. We can get the “PGCMeth” as an element of the list:

str(dbs[["PGCMeth"]])

##  chr [1:474] "cg36615889_TC11" "cg36646136_BC21" "cg36647910_BC11" ...
##  - attr(*, "group")= chr "KYCG.MM285.designGroup.20210210"
##  - attr(*, "dbname")= chr "PGCMeth"

On subsequent runs of the KYCG_getDBs() function, the database loading can be faster thanks to the sesameData in-memory caching, if the corresponding database has been loaded.
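
The substring matching mentioned above can be pictured with a small base R sketch. The helper resolve_group() and two of the three accession names are hypothetical; this is not how sesame implements the lookup, only an illustration of resolving a key that must match exactly one accession.

```r
# Accession list for illustration; only the first name appears in this document
accessions <- c("KYCG.MM285.designGroup.20210210",
                "KYCG.MM285.exampleGroupA",
                "KYCG.MM285.exampleGroupB")

# Hypothetical helper: return the accession matching a unique substring
resolve_group <- function(key, accessions) {
  hits <- grep(key, accessions, fixed = TRUE, value = TRUE)
  if (length(hits) != 1) {
    stop("'", key, "' matches ", length(hits),
         " accessions; expected exactly 1")
  }
  hits
}

resolve_group("MM285.design", accessions)
```

An ambiguous key such as "MM285" matches all three names in this toy list and raises an error, which is why the substring needs to be unique.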

Query Set(s)

A query set represents probes of interest. It may be either a character vector whose values are probe IDs or a named numeric vector whose names are probe IDs. The distinction between query and database is rather arbitrary: one can regard a database as a query and turn a query into a database, as in our first example. In real-world scenarios, a query can come from differential methylation testing, unsupervised clustering, correlation with a phenotypic trait, and many other sources. For example, we could use CpGs that show tissue-specific methylation as the query. Here we retrieve the B-cell-specific hypomethylated probes:

df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "B_cell"]
head(query)

## [1] "cg32668003_TC11" "cg45118317_TC11" "cg37563895_TC11" "cg46105105_BC11"
## [5] "cg47206675_TC21" "cg38855216_TC21"

This query set represents hypomethylated probes in mouse B-cells from the MM285 platform. This specific query set has 168 probes.
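
The two accepted query formats can be illustrated with toy data. All probe IDs and values below are made up, and probe_ids() is a hypothetical helper rather than a sesame function.

```r
# Discrete query: a character vector of probe IDs (made-up IDs)
query_discrete <- c("cg0001", "cg0002", "cg0003")

# Continuous query: a named numeric vector whose names are probe IDs
query_continuous <- c(cg0001 = 0.91, cg0002 = 0.15, cg0003 = 0.48)

# Hypothetical helper: recover the probe IDs from either format
probe_ids <- function(query) {
  if (is.numeric(query)) names(query) else query
}

# A continuous query can be thresholded into a discrete one,
# e.g. keeping probes with a low value (the 0.3 cutoff is arbitrary)
hypo <- names(query_continuous)[query_continuous < 0.3]
hypo
```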

Gene Enrichment

A special case of set enrichment is to test whether CpGs are associated with specific genes. Automating the enrichment test process only works when the number of database sets is small. This matters when targeting all genes, as there are tens of thousands of genes on each platform. By testing only those genes that overlap with the query set, we can greatly reduce the number of tests. We can build these gene-level database sets with the KYCG_buildGeneDBs() function:

query <- names(sesameData_getProbesByGene("Dnmt3a", "MM285"))
results <- testEnrichment(query, KYCG_buildGeneDBs(query, max_distance=100000))
results[,c("dbname","estimate","gene_name","FDR", "nQ", "nD", "overlap")]

Using these sample results, we can draw a lollipop plot:

KYCG_plotLollipop(results, label="gene_name")

As expected, we recover our targeted gene (Dnmt3a).
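
The idea of restricting the tests to genes that overlap the query can be sketched in base R. The gene names, probe IDs, and gene-to-probe map below are all hypothetical, and the code does not reproduce KYCG_buildGeneDBs(); it only shows the filtering step.

```r
# Hypothetical gene-to-probe map (made-up gene names and probe IDs)
gene_probes <- list(
  GeneA = c("cg0001", "cg0002", "cg0003"),
  GeneB = c("cg0004", "cg0005"),
  GeneC = c("cg0006"),
  GeneD = c("cg0003", "cg0007")
)
query <- c("cg0002", "cg0003", "cg0008")

# Count probes shared between each gene and the query
n_shared <- vapply(gene_probes,
                   function(p) length(intersect(p, query)),
                   integer(1))

# Keep only genes sharing at least one probe with the query
candidate_dbs <- gene_probes[n_shared > 0]

names(candidate_dbs)
length(candidate_dbs) / length(gene_probes)   # fraction of genes still tested
```

With tens of thousands of genes and a query of a few hundred probes, this filter would typically leave only a small fraction of genes to test.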

GO/Pathway Enrichment

One can get all the genes associated with a probe set as follows:

df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
genes <- sesameData_getGenesByProbes(query)
genes

## GRanges object with 168 ranges and 2 metadata columns:
##                         seqnames              ranges strand |   gene_name
##                            <Rle>           <IRanges>  <Rle> | <character>
##   ENSMUSG00000026069.15     chr1   40429570-40465415      + |      Il1rl1
##    ENSMUSG00000101923.1     chr1   43409584-43409963      + |     Gm29041
##   ENSMUSG00000026782.15     chr1   60409619-60481158      + |        Abi2
##   ENSMUSG00000026288.14     chr1   87620312-87720507      + |      Inpp5d
##   ENSMUSG00000026425.15     chr1 131285251-131527352      - |      Srgap2
##                     ...      ...                 ...    ... .         ...
##    ENSMUSG00000031119.4     chrX   52053021-52165252      - |        Gpc4
##   ENSMUSG00000016382.15     chrX   75785654-75875182      - |        Pls3
##    ENSMUSG00000071680.6     chrX 142412755-142412967      + |      Gm7123
##   ENSMUSG00000045180.13     chrX 152609509-152769465      - |     Shroom2
##   ENSMUSG00000031298.15     chrX 160390690-160498070      + |      Adgrg2
##                                    gene_type
##                                  <character>
##   ENSMUSG00000026069.15       protein_coding
##    ENSMUSG00000101923.1 processed_pseudogene
##   ENSMUSG00000026782.15       protein_coding
##   ENSMUSG00000026288.14       protein_coding
##   ENSMUSG00000026425.15       protein_coding
##                     ...                  ...
##    ENSMUSG00000031119.4       protein_coding
##   ENSMUSG00000016382.15       protein_coding
##    ENSMUSG00000071680.6 processed_pseudogene
##   ENSMUSG00000045180.13       protein_coding
##   ENSMUSG00000031298.15       protein_coding
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths

Here we demonstrate the use of the gprofiler2 package to perform Gene Ontology enrichment analysis:

library(gprofiler2)

## use gene name
gostres <- gost(genes$gene_name, organism = "mmusculus")
gostres$result[order(gostres$result$p_value),]
gostplot(gostres)

## use Ensembl gene ID, note we need to remove the version suffix
gene_ids <- sapply(strsplit(names(genes),"\\."), function(x) x[1])
gostres <- gost(gene_ids, organism = "mmusculus")
gostres$result[order(gostres$result$p_value),]
gostplot(gostres)

GSEA-like

The query may be a named continuous vector. In that case, either an enrichment score will be calculated (if the database is discrete) or a Spearman correlation will be calculated (if the database is continuous as well). The scenarios that involve a continuous query or a continuous database are illustrated below using biologically relevant examples.

To display this functionality, let’s load two numeric database sets individually. One is a database set for CpG density and the other is a database set corresponding to the distance of the nearest transcriptional start site (TSS) to each probe.

query <- KYCG_getDBs("KYCG.MM285.designGroup")[["TSS"]]

res <- testEnrichmentGSEA(query, "MM285.seqContextN")
res[, c("dbname", "test", "estimate", "FDR", "nQ", "nD", "overlap")]

The estimate here is the enrichment score.

NOTE: A negative enrichment score suggests enrichment of the categorical database with the higher values (in the numerical database). A positive enrichment score represents enrichment with the smaller values. As expected, the designed TSS CpGs are significantly enriched in smaller TSS distance and higher CpG density.
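
To build intuition for a signed enrichment score, here is a generic running-sum sketch in base R. The helper running_es() is hypothetical and is not the algorithm used by testEnrichmentGSEA. Its sign convention (positive when set members concentrate at small values) is a choice made for this sketch, and all probe IDs and values are made up.

```r
# Hypothetical helper: signed running-sum enrichment score.
# values: named numeric vector; set: character vector of probe IDs.
running_es <- function(values, set) {
  values <- sort(values)                       # ascending order
  hit <- names(values) %in% set
  stopifnot(any(hit), any(!hit))               # need members and non-members
  # Step up on set members, down on non-members; each direction sums to 1
  step <- ifelse(hit, 1 / sum(hit), -1 / sum(!hit))
  running <- cumsum(step)
  running[which.max(abs(running))]             # signed extreme of the walk
}

# Made-up values for ten made-up probes
values <- setNames((1:10) / 10 - 0.05, paste0("cg", 1:10))

es_low  <- running_es(values, c("cg1", "cg2", "cg3"))    # members at small values
es_high <- running_es(values, c("cg8", "cg9", "cg10"))   # members at large values
c(low = unname(es_low), high = unname(es_high))
```

In this sketch, a set concentrated at the small end of the ranking reaches its extreme early with a positive sign, while a set concentrated at the large end reaches a negative extreme.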

Alternatively, one can test the enrichment of a continuous query with discrete databases. Here we will use the methylation level from a sample as the query and test it against the chromHMM chromatin states.

beta_values <- getBetas(sesameDataGet("MM285.1.SigDF"))
res <- testEnrichmentGSEA(beta_values, "MM285.chromHMM")
res[, c("dbname", "test", "estimate", "FDR", "nQ", "nD", "overlap")]

As expected, chromatin states Tss and Enh have negative enrichment scores, meaning these databases are associated with small values of the query (DNA methylation level). In contrast, Quies states are associated with high methylation levels.
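
The remaining scenario, a continuous query against a continuous database, relies on a rank correlation. The base R sketch below uses made-up probe IDs and values and only illustrates a Spearman test on the probes shared by both vectors; it does not call any sesame function.

```r
# Made-up values for a continuous query and a continuous database
query <- c(cg0001 = 0.10, cg0002 = 0.35, cg0003 = 0.60,
           cg0004 = 0.80, cg0005 = 0.95)
database <- c(cg0002 = 12, cg0003 = 7, cg0004 = 5,
              cg0005 = 1, cg0006 = 20)

# Only probes present in both vectors can be compared
shared <- intersect(names(query), names(database))

# Spearman rank correlation on the shared probes
res <- cor.test(query[shared], database[shared], method = "spearman")
res$estimate
res$p.value
```

Here the database value falls as the query value rises across the four shared probes, so the correlation coefficient is negative.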

Session Info

sessionInfo()

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.26.1 Biobase_2.56.0             
##  [3] GenomicRanges_1.48.0        GenomeInfoDb_1.32.2        
##  [5] IRanges_2.30.0              S4Vectors_0.34.0           
##  [7] MatrixGenerics_1.8.0        matrixStats_0.62.0         
##  [9] knitr_1.39                  sesame_1.14.2              
## [11] sesameData_1.14.0           ExperimentHub_2.4.0        
## [13] AnnotationHub_3.4.0         BiocFileCache_2.4.0        
## [15] dbplyr_2.1.1                BiocGenerics_0.42.0        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7                  bit64_4.0.5                  
##  [3] filelock_1.0.2                RColorBrewer_1.1-3           
##  [5] httr_1.4.3                    tools_4.2.0                  
##  [7] bslib_0.3.1                   utf8_1.2.2                   
##  [9] R6_2.5.1                      DBI_1.1.2                    
## [11] colorspace_2.0-3              withr_2.5.0                  
## [13] tidyselect_1.1.2              preprocessCore_1.58.0        
## [15] bit_4.0.4                     curl_4.3.2                   
## [17] compiler_4.2.0                cli_3.3.0                    
## [19] DelayedArray_0.22.0           labeling_0.4.2               
## [21] sass_0.4.1                    scales_1.2.0                 
## [23] readr_2.1.2                   rappdirs_0.3.3               
## [25] stringr_1.4.0                 digest_0.6.29                
## [27] rmarkdown_2.14                XVector_0.36.0               
## [29] pkgconfig_2.0.3               htmltools_0.5.2              
## [31] highr_0.9                     fastmap_1.1.0                
## [33] rlang_1.0.2                   RSQLite_2.2.14               
## [35] shiny_1.7.1                   farver_2.1.0                 
## [37] jquerylib_0.1.4               generics_0.1.2               
## [39] jsonlite_1.8.0                wheatmap_0.2.0               
## [41] BiocParallel_1.30.2           dplyr_1.0.9                  
## [43] RCurl_1.98-1.6                magrittr_2.0.3               
## [45] GenomeInfoDbData_1.2.8        Matrix_1.4-1                 
## [47] Rcpp_1.0.8.3                  munsell_0.5.0                
## [49] fansi_1.0.3                   lifecycle_1.0.1              
## [51] stringi_1.7.6                 yaml_2.3.5                   
## [53] zlibbioc_1.42.0               plyr_1.8.7                   
## [55] grid_4.2.0                    blob_1.2.3                   
## [57] ggrepel_0.9.1                 parallel_4.2.0               
## [59] promises_1.2.0.1              crayon_1.5.1                 
## [61] lattice_0.20-45               Biostrings_2.64.0            
## [63] hms_1.1.1                     KEGGREST_1.36.0              
## [65] pillar_1.7.0                  reshape2_1.4.4               
## [67] glue_1.6.2                    BiocVersion_3.15.2           
## [69] evaluate_0.15                 BiocManager_1.30.18          
## [71] png_0.1-7                     vctrs_0.4.1                  
## [73] tzdb_0.3.0                    httpuv_1.6.5                 
## [75] gtable_0.3.0                  purrr_0.3.4                  
## [77] assertthat_0.2.1              cachem_1.0.6                 
## [79] ggplot2_3.3.6                 xfun_0.31                    
## [81] mime_0.12                     xtable_1.8-4                 
## [83] later_1.3.0                   tibble_3.1.7                 
## [85] AnnotationDbi_1.58.0          memoise_2.0.1                
## [87] ellipsis_0.3.2                interactiveDisplayBase_1.34.0