This vignette provides instructions to construct and analyze a search index of DNAm array data. The index is made using the hnswlib Python library, and the basilisk and reticulate R/Bioconductor libraries are used to manage Python environments and functions. These methods should be widely useful for genomics and epigenomics analyses, especially for very large datasets.

Background: search indexes for biological data

The search index has a similar function to the index of a book. Rather than storing the full, uncompressed data, only the between-entity relations are stored, which enables rapid entity lookup and nearest neighbors analysis while keeping stored file sizes manageable. Many methods for search index construction are available. The Hierarchical Navigable Small World (HNSW) graph method used below is fairly new (Malkov and Yashunin (2018)) and was among the top-performing methods in the ANN-Benchmarks comparisons (Aumüller, Bernhardsson, and Faithfull (2018)). HNSW is implemented in the hnswlib Python library, which includes helpful docstrings and a ReadMe for applying the method in practice.

While prior work has shown the utility of indexing several types of biomedical data for research, to our knowledge this is the first support provided for R users to make and analyze search indexes of DNAm array data. This vignette walks through a small example using a handful of DNAm array samples from blood. Interested users can further access a comprehensive index of pre-compiled DNAm array data from blood samples on the recountmethylation server. These data were available in the Gene Expression Omnibus (GEO) by March 31, 2021, and they include 13,835 samples run on either the HM450K or EPIC platform.

Index samples on dimension-reduced data

This section shows how to make a new search index using sample DNAm array data after performing dimensionality reduction on the data using feature hashing (a.k.a. “the hashing trick”, Weinberger et al. (2010)).

Virtual environment setup

First, use the setup_sienv() function to set up a Python virtual environment named “dnam_si_vignette” containing the required dependencies. Since this function is not exported, use ::: to call it. For greater reproducibility, specify exact dependency versions, e.g. “numpy==1.20.1”. The environment installs the hnswlib (v0.5.1) library to manage search index construction and access, pandas (v1.2.2) and numpy (v1.20.1) for data manipulation, and mmh3 (v3.0.0) for feature hashing.

recountmethylation:::setup_sienv(env.name = "dnam_si_vignette")
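If you prefer to manage the Python environment yourself rather than through the unexported helper, a comparable virtual environment can be created directly with reticulate. The sketch below is only an alternative setup under that assumption; the environment name is hypothetical, and the pinned versions mirror those listed above. setup_sienv() remains the supported route.

# optional alternative: create a comparable virtual environment directly with
# reticulate; the environment name here is hypothetical, and the pinned
# versions mirror those listed above
venv <- "dnam_si_vignette_alt"
reticulate::virtualenv_create(venv, packages = c("numpy==1.20.1", "pandas==1.2.2",
                                                 "mmh3==3.0.0", "hnswlib==0.5.1"))
reticulate::use_virtualenv(venv, required = TRUE)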

Perform dimensionality reduction on DNAm data

Search index efficiency pairs well with dimensionality reduction to enable very rapid queries on large datasets. This section shows how to reduce a dataset of DNAm fractions prior to indexing.

First, save the DNAm fractions (Beta-values) from a SummarizedExperiment object, ensuring samples are in rows and probes are in columns. Access the DNAm data locally from the object gr-noob_h5se_hm450k-epic-merge_0-0-3, which is available for download from recount.bio/data. Next, identify samples of interest from the sample metadata, which is accessed using colData(). After subsetting the samples, store the DNAm fractions (a.k.a. “Beta-values”) for 100 CpG probes, which we access using getBeta().

set.seed(1)
gr.fname <- "gr-noob_h5se_hm450k-epic-merge_0-0-3"
gr <- HDF5Array::loadHDF5SummarizedExperiment(gr.fname)
md <- as.data.frame(colData(gr))
# identify samples from metadata
gseid <- "GSE67393"
gsmv <- md[md$gse == gseid,]$gsm # get study samples
gsmv <- gsmv[sample(length(gsmv), 10)]
# get random samples by group label
mdf <- md[md$blood.subgroup=="PBMC",]
gsmv <- c(gsmv, mdf[sample(nrow(mdf), 20),]$gsm)
mdf <- md[md$blood.subgroup=="whole_blood",]
gsmv <- c(gsmv, mdf[sample(nrow(mdf), 20),]$gsm)
# norm bvals for probe subset
num.cg <- 100
grf <- gr[,gr$gsm %in% gsmv]; dim(grf)
bval <- getBeta(grf[sample(nrow(grf), num.cg),])
bval <- t(bval) # get transpose
rownames(bval) <- gsub("\\..*", "", rownames(bval)) # format rownames
# save bval table
bval.fpath <- paste0("bval_",num.cg,".csv")
write.csv(bval, file = bval.fpath) 
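If the full h5se dataset has not been downloaded, a small simulated Beta-value table can stand in for this step so the remaining steps can still be run. This is only a sketch: the sample IDs below are placeholders, so the later metadata lookups against md will not apply to them.

# optional: simulate a small Beta-value table (samples x probes) if the h5se
# object is unavailable; these sample IDs are placeholders, not real GSM IDs
num.samples <- 50; num.cg <- 100
bval <- matrix(runif(num.samples * num.cg), nrow = num.samples, ncol = num.cg)
rownames(bval) <- paste0("sample", seq(num.samples)) # placeholder IDs
colnames(bval) <- paste0("cg", sprintf("%08d", seq(num.cg)))
bval.fpath <- paste0("bval_", num.cg, ".csv")
write.csv(bval, file = bval.fpath)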

Next, call get_fh() to perform feature hashing on the DNAm data. Feature hashing is a dimensionality reduction technique that reduces features/columns (or probes in this case) while approximately preserving between-sample variances. You can specify the target dimensions for this step using ndim; for this example, reduce the 100 probes to 10 hashed features using ndim = 10.

# perform feature hashing on the saved Beta-value table
num.dim <- 10 # target reduced dimensions
fhtable.fpath <- "bval_100_fh10.csv"
get_fh(csv_savepath = fhtable.fpath, csv_openpath = bval.fpath, ndim = num.dim)

If ndim is high, the data are less reduced/compressed but more closely resemble the original uncompressed data, while the opposite is true at lower ndim. The exact target dimensions to use are up to user discretion. In practice, 10,000 dimensions yield a good tradeoff between compression and similarity to the uncompressed data for HM450K arrays.
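To build intuition for what get_fh() does, the following sketch implements the hashing trick in plain R: each probe name is hashed to one of ndim buckets and the Beta-values in each bucket are summed per sample. The string hash used here is a simple stand-in for illustration only; the package’s implementation uses the mmh3 (MurmurHash3) library in Python, so the resulting values will differ.

# illustrative feature hashing (the hashing trick) in base R; the package's
# get_fh() uses the mmh3 hash in Python, so its values will differ
hash_features <- function(mat, ndim){
  # assign each column (probe) to one of ndim buckets via a simple string hash
  bucket <- vapply(colnames(mat), function(cgid){
    as.integer(sum(utf8ToInt(cgid)) %% ndim) + 1L
  }, integer(1))
  # sum the values falling into each bucket, per sample (row)
  hashed <- sapply(seq(ndim), function(b){
    cols <- which(bucket == b)
    if(length(cols) == 0){rep(0, nrow(mat))} else{rowSums(mat[, cols, drop = FALSE])}
  })
  rownames(hashed) <- rownames(mat); hashed
}
bval.fh <- hash_features(as.matrix(read.csv(bval.fpath, row.names = 1)), ndim = 10)
dim(bval.fh) # samples x 10 hashed features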

Make a new HNSW search index

Now use the make_si() function to make a new search index with hnswlib. This function takes the path to the hashed features table and writes a new search index together with a search index dictionary. Note the .pickle extension, which is the save format for the new index objects.

si.fname.str <- "new_search_index"
si.fname <- paste0(si.fname.str, ".pickle")
sidict.fname <- paste0(si.fname.str, "_dict.pickle")
make_si(fh_csv_fpath = fhtable.fpath, si_fname = si.fname, 
        si_dict_fname = sidict.fname)

The tuning parameters space_val, efc_val, m_val, and ef_val have default values selected to work well for DNAm array data; further details about these parameters can be found in the hnswlib package docstrings and ReadMe.
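For reference, these tuning parameters appear to correspond to hnswlib’s own index arguments (the distance space, ef_construction, M, and ef); that mapping is inferred from the names, not stated by the package. The sketch below builds a tiny index directly through reticulate to show where each argument enters. The values shown are illustrative defaults from the hnswlib ReadMe, not the tuned values used by make_si(), and the code must run inside a Python environment with hnswlib installed (e.g., the one created above).

# minimal hnswlib index built via reticulate (illustrative values only)
hnswlib <- reticulate::import("hnswlib")
dat <- matrix(runif(50 * 10), nrow = 50)       # 50 samples x 10 hashed features
idx <- hnswlib$Index(space = "l2", dim = 10L)  # distance metric ('space')
idx$init_index(max_elements = 50L, ef_construction = 200L, M = 16L)
idx$add_items(dat, 0:49)                       # integer IDs starting at 0
idx$set_ef(20L)                                # 'ef' must be >= k at query time
res <- idx$knn_query(dat[seq(2), , drop = FALSE], k = 3L) # labels and distances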

Query nearest neighbors in the search index

This section shows an example analysis of nearest neighbors returned from a series of queries varying k from 1 to 20 neighbors.

Get nearest neighbors from search index queries

Next we can query samples from study GSE67393 (Inoshita et al. (2015)). Query the search index using query_si(), which passes valid query data to the knn_query method in hnswlib. First, get the IDs of a subset of samples from study GSE67393 that were included in the search index.

query.gsmv <- c("GSM1506297", "GSM1646161", "GSM1646175", 
          "GSM1649753", "GSM1649758", "GSM1649762")

Now specify a vector of k values, lkval, where k is the number of nearest neighbors to return. Note that the query data for a sample must have the same number of dimensions as the search index (for example, a 100-dimension index requires 100-dimension query data), and k cannot exceed the total number of indexed samples.

lkval <- c(1,5,10,20) # vector of query k values 
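Before querying, it can help to confirm that the hashed feature table matches the index dimensions and that no k exceeds the number of indexed samples. The check below assumes the hashed feature table stores one sample per row, with the sample ID in the first column and no header row; adjust the read.csv() call if the file layout differs.

# sanity checks before querying; this assumes one sample per row with the
# sample ID in the first column and no header row (adjust read.csv() if the
# file layout differs)
fh <- read.csv(fhtable.fpath, header = FALSE)
ncol(fh) - 1           # hashed feature dimensions per sample
nrow(fh)               # samples available for querying
all(lkval <= nrow(fh)) # k should not exceed the number of indexed samples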

Finally, use query_si() to run the queries. Pass the path fhtable.fpath to the hashed feature table (“bval_100_fh10.csv”), from which the query function extracts the sample data for each query. Here, si_fpath points to the test index files included in the package’s extdata directory.

si_fpath <- system.file("extdata", "sitest", package = "recountmethylation")
dfnn <- query_si(sample_idv = query.gsmv, fh_csv_fpath = fhtable.fpath,
                 si_fname = "new_search_index", si_fpath = si_fpath, 
                 lkval = lkval)

Inspect query results

The query results were assigned to dfnn, which we can now inspect. First show its dimensions.

dim(dfnn)

dfnn has one row per queried sample ID and 5 columns. The first column shows the IDs for the queried samples, and columns 2-5 show the results of the individual queries, where column names of the form k=... designate the k value used.

colnames(dfnn)
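Each query result is stored as a single semicolon-delimited string of neighbor IDs. The inspections below unpack these one column at a time; as a convenience, all k columns for a given row can also be unpacked at once.

# unpack the semicolon-delimited neighbor IDs for the first queried sample
nn.list <- lapply(colnames(dfnn)[-1], function(kcol){
  unlist(strsplit(dfnn[1, kcol], ";"))
})
names(nn.list) <- colnames(dfnn)[-1]
lengths(nn.list) # number of neighbors returned for each k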

Now consider the query results for the sample in the first row, called “GSM1646161.1607013450.hlink.GSM1646161_9611518054_R01C01”. Check the results of the first 3 queries for this sample (k = 1, 5, and 10).

unlist(strsplit(dfnn[1,"k=1"], ";"))

When k = 1, the queried sample’s own ID is returned. This is because the query uses the same hashed features data that was used to make the search index, and the search is for a subset of the indexed samples.

unlist(strsplit(dfnn[1,"k=5"], ";"))

This shows that samples are returned in order of decreasing similarity to the queried data. For k = 5, the first sample returned is the same as for k = 1, followed by the next 4 nearest neighboring samples.

unlist(strsplit(dfnn[1,"k=10"], ";"))

For k = 10, the first 5 neighbors are the same as for k = 5, followed by the next 5 nearest neighbors.
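In other words, for a fixed query the neighbor lists are nested in this example: each smaller-k result is a prefix of the larger-k result. A quick check (which should return TRUE here):

# confirm the k = 5 neighbors are the first 5 of the k = 10 neighbors (row 1)
k5 <- unlist(strsplit(dfnn[1, "k=5"], ";"))
k10 <- unlist(strsplit(dfnn[1, "k=10"], ";"))
identical(k5, k10[seq(5)])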

Plot metadata labels among nearest neighbors

This section shows some ways to visualize the results of nearest neighbors queries using the ggplot2 package.

Metadata label frequency among neighbors from a single query

Now analyze the types of samples among the returned nearest neighbors. Use the md metadata object to map labels to the returned sample IDs for a single query, e.g. the first row where k = 5.

gsmvi <- unlist(strsplit(dfnn[1, "k=5"], ";"))
blood.typev <- md[gsmvi,]$blood.subgroup
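A quick tabulation of the mapped labels shows the composition of this neighbor set:

table(blood.typev)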

We see there are 4 whole blood samples and 1 sample labeled “other/NOS”, which is the label of the queried sample itself.

Distribution of neighbors labeled whole blood across queries

Now get the distribution of labels across samples for a single k value. We’ll show the distribution of samples with the label “whole_blood” from the variable “blood.subgroup”, using the nearest neighbors returned from each query with k = 20.

dist.wb <- unlist(lapply(seq(nrow(dfnn)), function(ii){
  gsmvi <- unlist(strsplit(dfnn[ii,"k=20"], ";"))
  length(which(md[gsmvi,]$blood.subgroup=="whole_blood"))
}))
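Before plotting, a quick numeric summary of the per-query counts (out of k = 20 neighbors) can be inspected:

summary(dist.wb) # whole blood neighbors per query, out of k = 20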

Now plot the results for whole blood. Use a composite of a violin plot and a boxplot, which shows the distribution, its median, and any outliers. Remove the x-axis title and scale the y-axis to show percentages.

dfp <- data.frame(num.samples = dist.wb)
dfp$perc.samples <- 100*dfp$num.samples/20
dfp$type <- "whole_blood"
ggplot(dfp, aes(x = type, y = perc.samples)) + 
  geom_violin() + geom_boxplot(width = 0.2) + theme_bw() +
  ylab("Percent of neighbors") + theme(axis.title.x = element_blank()) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

Distributions of multiple labels and queries

Repeat the above for all mapped metadata labels. Specify the three metadata labels of interest for the blood.subgroup variable (“whole_blood”, “PBMC”, and “other/NOS”), then calculate the number of samples for each label in each query, assigning the results to dfp.

# specify labels of interest
ugroupv <- c("whole_blood", "PBMC", "other/NOS")
# function to count neighbors with each label, given a vector of sample IDs
get_dfgrp <- function(gsmvi){
  do.call(rbind, lapply(ugroupv, function(ugroupi){
    num.grp <- length(which(md[gsmvi,]$blood.subgroup==ugroupi))
    data.frame(num.samples = num.grp, type = ugroupi)
  }))
}
# get samples by label across queries
dfp <- do.call(rbind, lapply(seq(nrow(dfnn)), function(ii){
  get_dfgrp(unlist(strsplit(dfnn[ii,"k=20"], ";")))
}))
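It can help to preview dfp and tally the neighbor counts per label across all queries before plotting:

head(dfp) # one row per query-label combination
aggregate(num.samples ~ type, data = dfp, FUN = sum) # neighbor totals per label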

Format the variables in dfp for plotting, then make composite violin and boxplots showing, for each of the three labels, the distribution across queries of the percent of nearest neighbors carrying that label.

# format dfp variables for plots
dfp$perc.samples <- 100*dfp$num.samples/20
# reorder labels on their median percentages
medianv <- unlist(lapply(ugroupv, function(groupi){
  median(dfp[dfp$type==groupi,]$perc.samples)}))
# define legend groups
dfp$`Sample\ntype` <- factor(dfp$type, levels = ugroupv[order(medianv)])

# make new composite plot
ggplot(dfp, aes(x = `Sample\ntype`, y = perc.samples, fill = `Sample\ntype`)) + 
  geom_violin() + geom_boxplot(width = 0.2) + theme_bw() +
  ylab("Percent of neighbors") + theme(axis.title.x = element_blank()) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))
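To keep a copy of the figure, the last plot can be written to file with ggsave(); the file name and dimensions below are just examples.

# save the composite plot to file (example file name)
ggsave("neighbor_label_distributions.pdf", width = 4, height = 4)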

Session Info

utils::sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] reticulate_1.24             basilisk_1.6.0             
##  [3] HDF5Array_1.22.1            DelayedArray_0.20.0        
##  [5] Matrix_1.4-0                limma_3.50.1               
##  [7] gridExtra_2.3               ggplot2_3.3.5              
##  [9] knitr_1.37                  recountmethylation_1.4.5   
## [11] minfi_1.40.0                bumphunter_1.36.0          
## [13] locfit_1.5-9.5              iterators_1.0.14           
## [15] foreach_1.5.2               Biostrings_2.62.0          
## [17] XVector_0.34.0              SummarizedExperiment_1.24.0
## [19] Biobase_2.54.0              MatrixGenerics_1.6.0       
## [21] matrixStats_0.61.0          GenomicRanges_1.46.1       
## [23] GenomeInfoDb_1.30.1         IRanges_2.28.0             
## [25] S4Vectors_0.32.3            BiocGenerics_0.40.0        
## [27] rhdf5_2.38.1                BiocStyle_2.22.0           
## 
## loaded via a namespace (and not attached):
##   [1] BiocFileCache_2.2.1       plyr_1.8.6               
##   [3] splines_4.1.2             BiocParallel_1.28.3      
##   [5] digest_0.6.29             htmltools_0.5.2          
##   [7] magick_2.7.3              fansi_1.0.2              
##   [9] magrittr_2.0.2            memoise_2.0.1            
##  [11] tzdb_0.2.0                readr_2.1.2              
##  [13] annotate_1.72.0           askpass_1.1              
##  [15] siggenes_1.68.0           prettyunits_1.1.1        
##  [17] colorspace_2.0-3          blob_1.2.2               
##  [19] rappdirs_0.3.3            xfun_0.30                
##  [21] dplyr_1.0.8               crayon_1.5.0             
##  [23] RCurl_1.98-1.6            jsonlite_1.8.0           
##  [25] genefilter_1.76.0         GEOquery_2.62.2          
##  [27] survival_3.3-1            glue_1.6.2               
##  [29] gtable_0.3.0              zlibbioc_1.40.0          
##  [31] Rhdf5lib_1.16.0           scales_1.1.1             
##  [33] DBI_1.1.2                 rngtools_1.5.2           
##  [35] Rcpp_1.0.8.2              xtable_1.8-4             
##  [37] progress_1.2.2            bit_4.0.4                
##  [39] mclust_5.4.9              preprocessCore_1.56.0    
##  [41] httr_1.4.2                dir.expiry_1.2.0         
##  [43] RColorBrewer_1.1-2        ellipsis_0.3.2           
##  [45] farver_2.1.0              pkgconfig_2.0.3          
##  [47] reshape_0.8.8             XML_3.99-0.9             
##  [49] sass_0.4.0                dbplyr_2.1.1             
##  [51] utf8_1.2.2                labeling_0.4.2           
##  [53] tidyselect_1.1.2          rlang_1.0.2              
##  [55] AnnotationDbi_1.56.2      munsell_0.5.0            
##  [57] tools_4.1.2               cachem_1.0.6             
##  [59] cli_3.2.0                 generics_0.1.2           
##  [61] RSQLite_2.2.10            evaluate_0.15            
##  [63] stringr_1.4.0             fastmap_1.1.0            
##  [65] yaml_2.3.5                bit64_4.0.5              
##  [67] beanplot_1.2              scrime_1.3.5             
##  [69] purrr_0.3.4               KEGGREST_1.34.0          
##  [71] nlme_3.1-155              doRNG_1.8.2              
##  [73] sparseMatrixStats_1.6.0   nor1mix_1.3-0            
##  [75] xml2_1.3.3                biomaRt_2.50.3           
##  [77] compiler_4.1.2            filelock_1.0.2           
##  [79] curl_4.3.2                png_0.1-7                
##  [81] tibble_3.1.6              bslib_0.3.1              
##  [83] stringi_1.7.6             basilisk.utils_1.6.0     
##  [85] highr_0.9                 GenomicFeatures_1.46.5   
##  [87] lattice_0.20-45           multtest_2.50.0          
##  [89] vctrs_0.3.8               pillar_1.7.0             
##  [91] lifecycle_1.0.1           rhdf5filters_1.6.0       
##  [93] BiocManager_1.30.16       jquerylib_0.1.4          
##  [95] data.table_1.14.2         bitops_1.0-7             
##  [97] rtracklayer_1.54.0        R6_2.5.1                 
##  [99] BiocIO_1.4.0              bookdown_0.24            
## [101] codetools_0.2-18          MASS_7.3-55              
## [103] assertthat_0.2.1          openssl_2.0.0            
## [105] rjson_0.2.21              withr_2.5.0              
## [107] GenomicAlignments_1.30.0  Rsamtools_2.10.0         
## [109] GenomeInfoDbData_1.2.7    mgcv_1.8-39              
## [111] hms_1.1.1                 quadprog_1.5-8           
## [113] grid_4.1.2                tidyr_1.2.0              
## [115] base64_2.0                rmarkdown_2.13           
## [117] DelayedMatrixStats_1.16.0 illuminaio_0.36.0        
## [119] restfulr_0.0.13

Works Cited

Aumüller, Martin, Erik Bernhardsson, and Alexander Faithfull. 2018. “ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms.” arXiv:1807.05614 [Cs], July. http://arxiv.org/abs/1807.05614.

Inoshita, Masatoshi, Shusuke Numata, Atsushi Tajima, Makoto Kinoshita, Hidehiro Umehara, Hidenaga Yamamori, Ryota Hashimoto, Issei Imoto, and Tetsuro Ohmori. 2015. “Sex Differences of Leukocytes DNA Methylation Adjusted for Estimated Cellular Proportions.” Biology of Sex Differences 6 (1): 11. https://doi.org/10.1186/s13293-015-0029-7.

Malkov, Yu A., and D. A. Yashunin. 2018. “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” arXiv:1603.09320 [Cs], August. http://arxiv.org/abs/1603.09320.

Weinberger, Kilian, Anirban Dasgupta, Josh Attenberg, John Langford, and Alex Smola. 2010. “Feature Hashing for Large Scale Multitask Learning.” arXiv:0902.2206 [Cs], February. http://arxiv.org/abs/0902.2206.