Contents

1 Introduction and overview

The recountmethylation package provides access to databases of DNA methylation (DNAm) data from over 35,000 sample records with IDATs in the Gene Expression Omnibus (GEO, available through March 31, 2019) run using the Illumina HM450K BeadArray platform, including metadata, raw/unnormalized red and green signals, raw/unnormalized methylated and unmethylated signals, and normalized DNAm fractions or Beta-values (Maden et al. (2020)). Normalization was performed using the out-of-band signal correction (a.k.a. “noob”) method, a type of within-sample normalization (Triche et al. (2013)).

This User’s Guide shows how to use the recountmethylation package to obtain, load, and query the DNAm databases with 2 small example files. Background about DNAm arrays, DNAm measurement, SummarizedExperiment objects, database file types, and samples metadata is also provided. Further analysis examples are contained in the data_analyses vignette.

1.1 Disclaimer

1.2 Database files and access

Database access, including downloads and file loads, are managed by the get_db functions. These download and access the latest available database files (see ?get_db and below examples for details). Note you will need between 90-120 Gb of disk space to store a single database file. Files pair sample metadata with assays including red and green channel signals, methylated and unmethylated level signals, and DNAm fractions in 3 HDF5-SummarizedExperiment entities, and red and green signals in an HDF5 h5 database.

The database files are contained at the file server, located at the URL: https://recount.bio/data/. Details about the latest files are as follows.

filename date time size (bytes)
remethdb-h5se_gm_0-0-1_1590090412 30-May-2020 08:44 assays.h5 = 94164575624;se.rds = 4989437
remethdb-h5se_gr-test_0-0-1_1590090412 29-May-2020 07:28 assays.h5 = 132596;se.rds = 68522
remethdb-h5se_gr_0-0-1_1590090412 31-May-2020 10:07 assays.h5 = 132553650296;se.rds = 4989382
remethdb_h5se-rg_0-0-1_1590090412 27-May-2020 09:45 assays.h5 = 118500675976;se.rds = 2800938
remethdb-h5_rg-test_0-0-1_1590090412.h5 31-May-2020 07:26 252757
remethdb-h5_rg_0-0-1_1590090412.h5 06-Jun-2020 08:21 120167684020

1.3 ExperimentHub integration

The DNAm array database files are indexed on ExperimentHub, and are viewable as follows.

hub = ExperimentHub::ExperimentHub() # connect to the hubs
rmdat <- AnnotationHub::query(hub, "recountmethylation") # query the hubs
rmdat
## ExperimentHub with 6 records
## # snapshotDate(): 2020-10-02
## # $dataprovider: GEO/GDS
## # $species: Homo sapiens
## # $rdataclass: HDF5-SummarizedExperiment, HDF5Database
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH3773"]]' 
## 
##            title                                  
##   EH3773 | remethdb-h5se_gr_0-0-1_1590090412      
##   EH3774 | remethdb-h5se_gm_0-0-1_1590090412      
##   EH3775 | remethdb_h5se-rg_0-0-1_1590090412      
##   EH3776 | remethdb-h5se_gr-test_0-0-1_159009041  
##   EH3777 | remethdb-h5_rg_0-0-1_1590090412.h5     
##   EH3778 | remethdb-h5_rg-test_0-0-1_1590090412.h5

In addition to using the getdb functions, the HDF5 (.h5 extension) files may alternatively be downloaded from the hubs, as follows.

eid <- "EH3778" # h5 test file id
fpath <- rmdat[[eid]] # download with default caching
rhdf5::h5ls(fpath)
##   group                 name       otype dclass       dim
## 0     /          greensignal H5I_DATASET  FLOAT 2 x 11162
## 1     / greensignal.colnames H5I_DATASET STRING     11162
## 2     / greensignal.rownames H5I_DATASET STRING         2
## 3     /               mdpost H5I_DATASET STRING    2 x 19
## 4     /      mdpost.colnames H5I_DATASET STRING        19
## 5     /            redsignal H5I_DATASET  FLOAT 2 x 11162
## 6     /   redsignal.colnames H5I_DATASET STRING     11162
## 7     /   redsignal.rownames H5I_DATASET STRING         2

Note that whether downloads use the hubs or getdb functions, caching is implemented to check for previously downloaded database files.

2 Background

This section includes essential background about DNAm array platforms, assays and file types, and sample metadata.

2.1 DNAm arrays

Databases include human samples run on the Illumina Infinium HM450K BeadArray platform. HM450K is a popular 2-channel platform that probes over 480,000 CpG loci genome-wide, with enriched coverage at CG islands, genes, and enhancers (Sandoval et al. (2011)). Array processing generates 2 intensity files (IDATs) per sample, one each for the red and green color channels. These raw files also contain control signals useful for quality evaluations (“Illumina Genome Studio Methylation Module V1.8” 2010).

HM450K probes use either of 2 bead technologies, known as Type I and Type II, where the majority (72%) of probes use the latter. For Type II probes, a single bead assay informs a single probe, while Type I probes use 2 beads each. Practically, this means the bead-specific matrices found in RGChannelSet objects are larger than the probe-specific matrices found in derived object types (e.g. 622,399 assays for red/green signal matrices versus 485,512 assays for methylated/unmethylated signal, DNAm fractions matrices, see below).

2.2 SummarizedExperiment object classes

DNAm array sample IDATs can be read into an R session as an object of class RGChannelSet, a type of SummarizedExperiment. These objects support analyses of high-throughput genomics datasets, and they include slots for assay matrices, sample metadata, and experiment metadata. During a typical workflow, normalization and preprocessing convert RGChannelSet objects into new types like MethylSet and RatioSet. While not all IDAT information is accessible from every object type (e.g. only RGChannelSets can contain control assays), derived objects like MethylSets and RatioSets may be smaller and/or faster to access.

Three SummarizedExperiment databases are provided as HDF5-SummarizedExperiment files, including an unnormalized RGChannelSet (red/green signals), an unnormalized MethylSet (methylated/unmethylated signals) and a normalized GenomicRatioSet (DNAm fractions). For the latter, DNAm fractions (logit2 Beta-values, or M-values) were normalized using the out-of-band signal or “noob” method, an effective within-sample normalization that removes signal artifacts (Triche et al. (2013)).

2.3 Database file types

Database files are stored as either HDF5 or HDF5-SummarizedExperiment. For most R users, the latter files will be most convenient to work with. HDF5, or hierarchical data format 5, combines compression and chunking for convenient handling of large datasets. HDF5-SummarizedExperiment files combine the benefits of HDF5 and SummarizedExperiment entities using a DelayedArray-powered backend. Once an HDF5-SummarizedExperiment file is loaded, it can be treated similarly to a SummarizedExperiment object in active memory. That is, summary and subset operations execute rapidly, and realization of large data chunks in active memory is delayed until called for by the script (see examples).

2.4 Sample metadata

Sample metadata are included with DNAm assays in the database files. Currently, metadata variables include GEO record IDs for samples (GSM) and studies (GSE), sample record titles, learned labels for tissue and disease, sample type predictions from the MetaSRA-pipeline, and DNAm model-based predictions for age, sex, and blood cell types. Access sample metadata from SummarizedExperiment objects using the pData minfi function (see examples). Examples in the data_analyses vignette illustrate some ways to utilize the provided sample metadata.

Provided metadata derives from the GSE-specific SOFT files, which contain experiment, sample, and platform metadata. Considerable efforts were made to learn, harmonize, and predict metadata labels. Certain types of info lacking in the recountmethylation metadata may be available in the SOFT files, especially if it is sample non-specific (e.g. methods text, PubMed ID, etc.) or redundant with DNAm-derived metrics (e.g. DNAm summaries, predicted sex, etc.).

It is good practice to validate the harmonized metadata with original metadata records, especially where labels are ambiguous or there is insufficient information for a given query. GEO GSM and GSE records can be viewed from a browser, or SOFT files may be downloaded directly. Packages like GEOmetadb and GEOquery are also useful to query and summarize GEO metadata.

3 HDF5-SummarizedExperiment example

This example shows basic handling for HDF5-SummarizedExperiment (a.k.a. “h5se”) files. For these files, the getdb function returns the loaded file. Thanks to a DelayedArray backend, even full-sized h5se databases can be treated as if they were fully loaded into active memory.

3.1 Obtain the test database

The test h5se dataset includes sample metadata and noob-normalized DNAm fractions (Beta-values) for chromosome 22 probes for 2 samples. Datasets can be downloaded using the getdb series of functions (see ?getdb for details), where the dfp argument specifies the download destination. The test h5se file is included in the package “inst” directory, and can be loaded as follows.

dn <- "remethdb-h5se_gr-test_0-0-1_1590090412"
path <- system.file("extdata", dn, package = "recountmethylation")
h5se.test <- HDF5Array::loadHDF5SummarizedExperiment(path)

3.2 Inspect and summarize the database

Common characterization functions can be used on the dataset after it has been loaded. These include functions for SummarizedExperiment-like objects, such as the getBeta, pData, and getAnnotation minfi functions. First, inspect the dataset using standard functions like class, dim, and summary as follows.

class(h5se.test)
## [1] "GenomicRatioSet"
## attr(,"package")
## [1] "minfi"
dim(h5se.test)
## [1] 8552    2
summary(h5se.test)
## [1] "GenomicRatioSet object of length 8552 with 0 metadata columns"

Access the sample metadata for the 2 available samples using pData.

h5se.md <- minfi::pData(h5se.test)
dim(h5se.md)
## [1]  2 19
colnames(h5se.md)
##  [1] "gsm"            "gsm_title"      "gseid"          "disease"       
##  [5] "tissue"         "sampletype"     "arrayid_full"   "basename"      
##  [9] "age"            "predage"        "sex"            "predsex"       
## [13] "predcell.CD8T"  "predcell.CD4T"  "predcell.NK"    "predcell.Bcell"
## [17] "predcell.Mono"  "predcell.Gran"  "storage"

Next get CpG probe-specific DNAm fractions, or “Beta-values”, with getBeta (rows are probes, columns are samples).

h5se.bm <- minfi::getBeta(h5se.test)
dim(h5se.bm)
## [1] 8552    2
colnames(h5se.bm) <- h5se.test$gsm
knitr::kable(head(h5se.bm), align = "c")
GSM1038308 GSM1038309
cg00017461 0.9807283 0.9746836
cg00077299 0.3476970 0.3456837
cg00079563 0.8744652 0.9168005
cg00087182 0.9763206 0.9760947
cg00093544 0.0225112 0.0265087
cg00101350 0.9736359 0.9789818

Access manifest information for probes with getAnnotation. This includes the bead addresses, probe type, and genome coordinates and regions. For full details about the probe annotations, consult the minfi and Illumina platform documentation.

an <- minfi::getAnnotation(h5se.test)
dim(an)
## [1] 8552   33
colnames(an)
##  [1] "chr"                      "pos"                     
##  [3] "strand"                   "Name"                    
##  [5] "AddressA"                 "AddressB"                
##  [7] "ProbeSeqA"                "ProbeSeqB"               
##  [9] "Type"                     "NextBase"                
## [11] "Color"                    "Probe_rs"                
## [13] "Probe_maf"                "CpG_rs"                  
## [15] "CpG_maf"                  "SBE_rs"                  
## [17] "SBE_maf"                  "Islands_Name"            
## [19] "Relation_to_Island"       "Forward_Sequence"        
## [21] "SourceSeq"                "Random_Loci"             
## [23] "Methyl27_Loci"            "UCSC_RefGene_Name"       
## [25] "UCSC_RefGene_Accession"   "UCSC_RefGene_Group"      
## [27] "Phantom"                  "DMR"                     
## [29] "Enhancer"                 "HMM_Island"              
## [31] "Regulatory_Feature_Name"  "Regulatory_Feature_Group"
## [33] "DHS"
ant <- as.matrix(t(an[c(1:4), c(1:3, 5:6, 9, 19, 24, 26)]))
knitr::kable(ant, align = "c")
cg00017461 cg00077299 cg00079563 cg00087182
chr chr22 chr22 chr22 chr22
pos 30663316 18632618 43253521 24302043
strand - + + +
AddressA 31616369 13618325 65630302 37797387
AddressB 70798487 37626331 55610348 20767312
Type I I I I
Relation_to_Island OpenSea N_Shore Island N_Shore
UCSC_RefGene_Name OSM USP18 ARFGAP3;ARFGAP3 GSTT2B;GSTT2
UCSC_RefGene_Group TSS1500 TSS200 TSS200;TSS200 Body;Body

4 HDF5 database and example

To provide more workflow options, bead-specific red and green signal data have been provided with sample metadata in an HDF5/h5 file. This example shows how to handle objects of this type with recountmethylation.

4.1 Obtain the test database

The test h5 file includes metadata and bead-specific signals from chromosome 22 for the same 2 samples as in the h5se test file. Note getdb functions for h5 files simply return the database path. Since the test h5 file has also been included in the package “inst” folder, get the path to load the file as follows.

dn <- "remethdb-h5_rg-test_0-0-1_1590090412.h5"
h5.test <- system.file("extdata", "h5test", dn, 
                    package = "recountmethylation")

4.2 Inspect and summarize the database

Use the file path to read data into an RGChannelSet with the getrg function. Setting all.gsm = TRUE obtains data for all samples in the database files, while passing a vector of GSM IDs to gsmv argument will query a subset of available samples. Signals from all available probes are retrieved by default, and probe subsets can be obtained by passing a vector of valid bead addresses to the cgv argument.

h5.rg <- getrg(dbn = h5.test, all.gsm = TRUE)

To avoid exhausting active memory with the full-sized h5 dataset, provide either gsmv or cgv to getrg, and set either all.cg or all.gsm to FALSE (see ?getrg for details).

As in the previous example, use pData and getAnnotation to get sample metadata and array manifest information, respectively. Access the green and red signal matrices in the RGChannelSet with the getRed and getGreen minfi functions.

h5.red <- minfi::getRed(h5.rg)
h5.green <- minfi::getGreen(h5.rg)
dim(h5.red)
## [1] 11162     2
knitr::kable(head(h5.red), align = "c")
GSM1038308 GSM1038309
10601475 1234 1603
10603366 342 344
10603418 768 963
10605304 2368 2407
10605460 3003 3322
10608343 357 399
knitr::kable(head(h5.green), align = "c")
GSM1038308 GSM1038309
10601475 6732 8119
10603366 288 356
10603418 267 452
10605304 4136 4395
10605460 1395 1762
10608343 840 1269
identical(rownames(h5.red), rownames(h5.green))
## [1] TRUE

Rows in these signal matrices map to bead addresses rather than probe IDs. These matrices have more rows than the h5se test Beta-value matrix because any type I probes use data from 2 beads each.

5 Validate DNAm datasets

This section demonstrates validation using the test databases. As the disclaimer notes, it is good practice to validate data against the latest available GEO files. This step may be most useful for newer samples published close to the end compilation date (end of March 2019 for current version), which may be more prone to revisions at initial publication.

5.1 Download and read IDATs from the GEO database server

Use the gds_idat2rg function to download IDATs for the 2 test samples and load these into a new RGChannelSet object. Do this by passing a vector of GSM IDs to gsmv and the download destination to dfp.

dlpath <- tempdir()
gsmv <- c("GSM1038308", "GSM1038309")
geo.rg <- gds_idat2rg(gsmv, dfp = dlpath)
colnames(geo.rg) <- gsub("\\_.*", "", colnames(geo.rg))

5.2 Compare DNAm signals

Extract the red and green signal matrices from geo.rg.

geo.red <- minfi::getRed(geo.rg)
geo.green <- minfi::getGreen(geo.rg)

Match indices and labels between the GEO and h5 test signal matrices.

int.addr <- intersect(rownames(geo.red), rownames(h5.red))
geo.red <- geo.red[int.addr,]
geo.green <- geo.green[int.addr,]
geo.red <- geo.red[order(match(rownames(geo.red), rownames(h5.red))),]
geo.green <- geo.green[order(match(rownames(geo.green), rownames(h5.green))),]
identical(rownames(geo.red), rownames(h5.red))
## [1] TRUE
identical(rownames(geo.green), rownames(h5.green))
## [1] TRUE
class(h5.red) <- "integer"
class(h5.green) <- "integer"

Finally, compare the signal matrix data.

identical(geo.red, h5.red)
## [1] TRUE
identical(geo.green, h5.green)
## [1] TRUE

5.3 Compare DNAm Beta-values

Before comparing the GEO-downloaded data to data from the h5se.test database, normalize the data using the same out-of-band or “noob” normalization technique that was used to generate data in the h5se database.

geo.gr <- minfi::preprocessNoob(geo.rg)

Next, extract the Beta-values.

geo.bm <- as.matrix(minfi::getBeta(geo.gr))

Now match row and column labels and indices.

h5se.bm <- as.matrix(h5se.bm)
int.cg <- intersect(rownames(geo.bm), rownames(h5se.bm))
geo.bm <- geo.bm[int.cg,]
geo.bm <- geo.bm[order(match(rownames(geo.bm), rownames(h5se.bm))),]

Finally, compare the two datasets.

identical(summary(geo.bm), summary(h5se.bm))
## [1] TRUE
identical(rownames(geo.bm), rownames(h5se.bm))
## [1] TRUE

6 Get more help

Consult the Data Analyses vignette and main manuscript for analysis examples and details about data compilations.

7 Session info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] IlluminaHumanMethylation450kmanifest_0.4.0        
##  [2] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0
##  [3] ExperimentHub_1.16.0                              
##  [4] AnnotationHub_2.22.0                              
##  [5] BiocFileCache_1.14.0                              
##  [6] dbplyr_1.4.4                                      
##  [7] HDF5Array_1.18.0                                  
##  [8] DelayedArray_0.16.0                               
##  [9] Matrix_1.2-18                                     
## [10] limma_3.46.0                                      
## [11] gridExtra_2.3                                     
## [12] ggplot2_3.3.2                                     
## [13] knitr_1.30                                        
## [14] recountmethylation_1.0.0                          
## [15] minfi_1.36.0                                      
## [16] bumphunter_1.32.0                                 
## [17] locfit_1.5-9.4                                    
## [18] iterators_1.0.13                                  
## [19] foreach_1.5.1                                     
## [20] Biostrings_2.58.0                                 
## [21] XVector_0.30.0                                    
## [22] SummarizedExperiment_1.20.0                       
## [23] Biobase_2.50.0                                    
## [24] MatrixGenerics_1.2.0                              
## [25] matrixStats_0.57.0                                
## [26] GenomicRanges_1.42.0                              
## [27] GenomeInfoDb_1.26.0                               
## [28] IRanges_2.24.0                                    
## [29] S4Vectors_0.28.0                                  
## [30] BiocGenerics_0.36.0                               
## [31] rhdf5_2.34.0                                      
## [32] BiocStyle_2.18.0                                  
## 
## loaded via a namespace (and not attached):
##   [1] plyr_1.8.6                    splines_4.0.3                
##   [3] BiocParallel_1.24.0           digest_0.6.27                
##   [5] htmltools_0.5.0               magrittr_1.5                 
##   [7] memoise_1.1.0                 readr_1.4.0                  
##   [9] annotate_1.68.0               R.utils_2.10.1               
##  [11] askpass_1.1                   siggenes_1.64.0              
##  [13] prettyunits_1.1.1             colorspace_1.4-1             
##  [15] blob_1.2.1                    rappdirs_0.3.1               
##  [17] xfun_0.18                     dplyr_1.0.2                  
##  [19] crayon_1.3.4                  RCurl_1.98-1.2               
##  [21] genefilter_1.72.0             GEOquery_2.58.0              
##  [23] survival_3.2-7                glue_1.4.2                   
##  [25] gtable_0.3.0                  zlibbioc_1.36.0              
##  [27] Rhdf5lib_1.12.0               scales_1.1.1                 
##  [29] DBI_1.1.0                     rngtools_1.5                 
##  [31] Rcpp_1.0.5                    xtable_1.8-4                 
##  [33] progress_1.2.2                bit_4.0.4                    
##  [35] mclust_5.4.6                  preprocessCore_1.52.0        
##  [37] httr_1.4.2                    RColorBrewer_1.1-2           
##  [39] ellipsis_0.3.1                R.methodsS3_1.8.1            
##  [41] pkgconfig_2.0.3               reshape_0.8.8                
##  [43] XML_3.99-0.5                  farver_2.0.3                 
##  [45] tidyselect_1.1.0              labeling_0.4.2               
##  [47] rlang_0.4.8                   later_1.1.0.1                
##  [49] AnnotationDbi_1.52.0          munsell_0.5.0                
##  [51] BiocVersion_3.12.0            tools_4.0.3                  
##  [53] generics_0.0.2                RSQLite_2.2.1                
##  [55] evaluate_0.14                 stringr_1.4.0                
##  [57] fastmap_1.0.1                 yaml_2.2.1                   
##  [59] bit64_4.0.5                   beanplot_1.2                 
##  [61] scrime_1.3.5                  purrr_0.3.4                  
##  [63] nlme_3.1-150                  doRNG_1.8.2                  
##  [65] sparseMatrixStats_1.2.0       mime_0.9                     
##  [67] nor1mix_1.3-0                 R.oo_1.24.0                  
##  [69] xml2_1.3.2                    biomaRt_2.46.0               
##  [71] compiler_4.0.3                curl_4.3                     
##  [73] interactiveDisplayBase_1.28.0 tibble_3.0.4                 
##  [75] stringi_1.5.3                 highr_0.8                    
##  [77] GenomicFeatures_1.42.0        lattice_0.20-41              
##  [79] multtest_2.46.0               vctrs_0.3.4                  
##  [81] pillar_1.4.6                  lifecycle_0.2.0              
##  [83] rhdf5filters_1.2.0            BiocManager_1.30.10          
##  [85] data.table_1.13.2             bitops_1.0-6                 
##  [87] httpuv_1.5.4                  rtracklayer_1.50.0           
##  [89] R6_2.4.1                      bookdown_0.21                
##  [91] promises_1.1.1                codetools_0.2-16             
##  [93] MASS_7.3-53                   assertthat_0.2.1             
##  [95] openssl_1.4.3                 withr_2.3.0                  
##  [97] GenomicAlignments_1.26.0      Rsamtools_2.6.0              
##  [99] GenomeInfoDbData_1.2.4        mgcv_1.8-33                  
## [101] hms_0.5.3                     quadprog_1.5-8               
## [103] grid_4.0.3                    tidyr_1.1.2                  
## [105] base64_2.0                    rmarkdown_2.5                
## [107] DelayedMatrixStats_1.12.0     illuminaio_0.32.0            
## [109] shiny_1.5.0                   tinytex_0.26

Works Cited

Maden, Sean, Reid Thompson, Kasper Hansen, and Abhinav Nellore. 2020. “Human Methylome Variation Across Infinium 450K Data on the Gene Expression Omnibus.” Preprint in Preparation, June.

Sandoval, Juan, Holger A. Heyn, Sebastian Moran, Jordi Serra-Musach, Miguel A. Pujana, Marina Bibikova, and Manel Esteller. 2011. “Validation of a DNA Methylation Microarry for 450,000 CpG Sites in the Human Genome.” Epigenetics 6 (6):692–702. https://doi.org/10.4161/epi.6.6.16196.

Triche, Timothy J., Daniel J. Weisenberger, David Van Den Berg, Peter W. Laird, and Kimberly D. Siegmund. 2013. “Low-Level Processing of Illumina Infinium DNA Methylation BeadArrays.” Nucleic Acids Research 41 (7):e90. https://doi.org/10.1093/nar/gkt090.