ENCODExplorer: A compilation of metadata from ENCODE

Audrey Lemacon, Louis Gendron, Charles Joly Beauparlant and Arnaud Droit.

This package and the underlying ENCODExplorer code are distributed under the Artistic license 2.0. You are free to use and redistribute this software.

“The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active” (source: ENCODE Project Portal).

However, retrieving and downloading data through the current web portal can be very time-consuming.

This package has been designed to facilitate data access by compiling the metadata associated with files, experiments and datasets.

We first extract the ENCODE schema from its public GitHub repository. Then we identify the main entities and their relationships with each other to rebuild the ENCODE database into a data.table. We also developed a function that extracts the essential metadata into an R object to aid data exploration. Finally, we implemented time-saving features to select ENCODE files by querying their metadata and to download them.

The data.table can be regenerated at will, so that you can query the ENCODE database locally and keep it up to date.

This vignette explains how to update the ENCODE data.

Loading ENCODExplorer package

suppressMessages(library(ENCODExplorer))

Data update

If you want to regenerate the ENCODExplorer data, you can use the update function. With overwrite = TRUE, it overwrites the default package data; otherwise it returns the data.table encode_df.

# the path (relative or absolute) to the future database
database_filename <- "new.encode.rda"
new_data <- export_ENCODEdb_matrix(database_filename, overwrite = FALSE)

If you want to update the data manually or partially, follow these steps:

generate a list of data tables from ENCODE:

# the path (relative or absolute) to the future database
database_filename <- "new.encode.rda"
tables <- prepare_ENCODEdb(database_filename)

generate the encode_df metadata table from the list of data.tables:

new_encode_df <- export_ENCODEdb_matrix(tables)

The whole process is slow: expect 30 to 60 minutes, depending on your work environment.
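Because regeneration is slow, it can be worth caching the result on disk and reusing it across sessions. The following is a minimal sketch in base R, not part of ENCODExplorer: the helper `load_or_build()` and the stand-in `demo_builder()` are hypothetical names introduced for illustration, and the builder is passed in as a function so the example runs without network access.

```r
# Hypothetical helper (not part of ENCODExplorer): return the cached object
# if `cache_path` exists, otherwise call `build_fn()` and cache its result.
load_or_build <- function(cache_path, build_fn, force = FALSE) {
  if (!force && file.exists(cache_path)) {
    return(readRDS(cache_path))
  }
  result <- build_fn()
  saveRDS(result, cache_path)
  result
}

# Stand-in builder so the sketch runs offline. In practice it would wrap the
# slow regeneration step, e.g. function() export_ENCODEdb_matrix(tables).
build_calls <- 0
demo_builder <- function() {
  build_calls <<- build_calls + 1
  data.frame(file_accession = c("demo1", "demo2"),
             file_format = c("bed", "bam"),
             stringsAsFactors = FALSE)
}

cache_path <- tempfile(fileext = ".rds")
first  <- load_or_build(cache_path, demo_builder)  # builds and writes the cache
second <- load_or_build(cache_path, demo_builder)  # reads the cache, no rebuild
```

Passing `force = TRUE` triggers a rebuild, which is the natural choice when the local copy should be refreshed against ENCODE.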

Updated data usage

If you have chosen not to overwrite the default data, you can still use your newly created data. Once new_encode_df is generated, pass it to the queryEncode and downloadEncode functions through their df option to use it in place of the default one.

query_results <- queryEncode(df = new_encode_df, assay = "switchgear", target = "elavl1", file_format = "bed", fixed = FALSE)
downloadEncode(df = new_encode_df, file_acc = query_results$file_accession)

Be sure to use the same reference df for both queryEncode and downloadEncode.
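One way to avoid mixing reference tables is to route both calls through a single wrapper that receives df once. The sketch below is an assumption-laden illustration, not part of ENCODExplorer: `query_then_download()` is a hypothetical helper, and `demo_query()` and `demo_download()` are stand-ins that mimic only the df and file_acc arguments shown above so the example runs offline.

```r
# Hypothetical wrapper (not part of ENCODExplorer): run a query function and
# a download function against one and the same metadata table `df`.
query_then_download <- function(df, query_fn, download_fn, ...) {
  hits <- query_fn(df = df, ...)
  if (is.null(hits) || nrow(hits) == 0) {
    message("No matching files; nothing to download.")
    return(invisible(hits))
  }
  download_fn(df = df, file_acc = hits$file_accession)
  invisible(hits)
}

# Stand-ins for illustration only; they do not contact ENCODE.
demo_df <- data.frame(file_accession = c("demo1", "demo2", "demo3"),
                      file_format = c("bed", "bam", "bed"),
                      stringsAsFactors = FALSE)
demo_query <- function(df, file_format) {
  df[df$file_format == file_format, , drop = FALSE]
}
downloaded <- character(0)
demo_download <- function(df, file_acc) {
  # Every requested accession must exist in the table that was queried.
  stopifnot(all(file_acc %in% df$file_accession))
  downloaded <<- c(downloaded, file_acc)
}

hits <- query_then_download(demo_df, demo_query, demo_download,
                            file_format = "bed")
```

With the package loaded, the same wrapper could presumably be called as query_then_download(new_encode_df, queryEncode, downloadEncode, assay = "switchgear", target = "elavl1", file_format = "bed", fixed = FALSE), assuming queryEncode returns a table with a file_accession column as in the example above.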

You can also use the list of data.tables that makes up the database for your own purposes. The imputed database model is described in the dedicated vignette.