ENCODExplorer: A compilation of metadata from ENCODE

Audrey Lemacon, Louis Gendron, Charles Joly Beauparlant and Arnaud Droit.

This package and the underlying ENCODExplorer code are distributed under the Artistic License 2.0. You are free to use and redistribute this software.

“The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active”^[source: ENCODE Project Portal].

However, retrieving and downloading data through the current web portal can be time-consuming.

This package has been designed to facilitate data access by compiling the metadata associated with files, experiments and datasets.

We first extract the ENCODE schema from its public GitHub repository. Then we identify the main entities and their relationships to rebuild the ENCODE database as an SQLite database. We also developed a function that extracts the essential metadata into an R object to aid data exploration. Finally, we implemented time-saving features to select ENCODE files by querying their metadata and to download them.

The SQLite database can be regenerated at will, both to query the ENCODE database locally and to keep it up to date.

This vignette will introduce the imputed ENCODE database model.

Loading ENCODExplorer package

suppressMessages(library(ENCODExplorer))

Data preparation

To generate the list of data.table objects from ENCODE:

# path (relative or absolute) where the database will be written
database_filename <- "encode.dt"
tables <- prepare_ENCODEdb(database_filename)

The process can take a few minutes.
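Once the call returns, the result can be inspected like any named list. The following is a minimal sketch, assuming the list holds one table per entity, named after that entity. A small stand-in list (`tables_demo`, with made-up ids) is built here so the snippet runs without contacting ENCODE; with real data, `tables` would take its place.

```r
# Stand-in for the list returned by prepare_ENCODEdb(); the real tables
# are far larger. Ids and the choice of columns are illustrative only.
tables_demo <- list(
  file = data.frame(
    id = c("f1", "f2", "f3"),
    replicate = c("r1", "r1", "r2"),
    stringsAsFactors = FALSE
  ),
  replicate = data.frame(
    id = c("r1", "r2"),
    experiment = c("e1", "e1"),
    stringsAsFactors = FALSE
  )
)

# Names of the available tables
table_names <- names(tables_demo)

# Number of rows in each table
row_counts <- vapply(tables_demo, nrow, integer(1))

print(table_names)
print(row_counts)
```

Because data.table objects are also data frames, `nrow()` and `names()` should behave the same way on the real list.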

Schema

Figure: imputed model of the ENCODE database

Each edge indicates a relationship between two tables: an origin and a destination. The relationship links a column of the origin table to the id of the destination table.

These relationships are described in the following table:

Relations table

origin	destination	origin column
access_key	user	user
analysis_step	analysis_step	parents
analysis_step	document	documents
analysis_step_run	analysis_step	analysis_step
analysis_step_run	workflow_run	workflow_run
analysis_step	software_version	software_versions
antibody_approval	antibody_characterization	characterizations
antibody_approval	antibody_lot	antibody
antibody_approval	target	target
antibody_characterization	antibody_lot	characterizes
antibody_characterization	target	target
antibody_characterization	user	reviewed_by
antibody_lot	organism	host_organism
antibody_lot	target	targets
award	user	pi
biosample	biosample	derived_from
biosample	biosample	part_of
biosample	biosample	pooled_from
biosample_characterization	biosample	characterizes
biosample	construct	constructs
biosample	document	protocol_documents
biosample	donor	donor
biosample	organism	organism
biosample	rnai	rnais
biosample	talen	talens
biosample	treatment	treatments
characterization	document	documents
construct_characterization	construct	characterizes
construct	document	documents
construct	target	promoter_used
construct	target	target
dataset	document	documents
dataset	file	related_files
donor_characterization	donor	characterizes
donor	organism	organism
experiment	experiment	possible_controls
experiment	target	target
file	analysis_step_run	step_run
file	dataset	dataset
file	experiment	dataset
file	file	controlled_by
file	file	derived_from
file	file	paired_with
file	file	supercedes
file	platform	platform
file	replicate	replicate
fly_donor	document	documents
human_donor	human_donor	children
human_donor	human_donor	fraternal_twin
human_donor	human_donor	identical_twin
human_donor	human_donor	parents
human_donor	human_donor	siblings
lab	award	awards
lab	user	pi
library	biosample	biosample
library	dataset	spikeins_used
library	document	documents
library	treatment	treatments
mouse_donor	mouse_donor	littermates
page	page	parent
pipeline	analysis_step	analysis_steps
pipeline	analysis_step	end_points
pipeline	document	documents
publication	dataset	datasets
quality_metric	analysis_step_run	step_run
quality_metric	file	files
replicate	antibody_lot	antibody
replicate	experiment	experiment
replicate	library	library
replicate	platform	platform
rnai_characterization	rnai	characterizes
rnai	document	documents
rnai	target	target
software	publication	references
software_version	software	software
talen	document	documents
target	organism	organism
treatment	document	protocols
treatment	lab	lab
user	lab	lab
user	lab	submits_for
workflow_run	file	input_files
workflow_run	pipeline	pipeline
workflow_run	software_version	software_version
worm_donor	document	documents
worm_donor	worm_donor	outcrossed_strain

For example, the row with origin file, destination replicate and origin column replicate corresponds to the following relation: file.replicate = replicate.id.
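The same rule can be applied to any row of the relations table by joining the origin table to the destination table on `origin column = id`. The sketch below uses base R `merge()` on two toy tables. The helper `join_relation()`, the ids and the `biological_replicate_number` values are hypothetical and only serve to illustrate the rule; they are not part of the package.

```r
# Toy stand-ins for two ENCODE tables (hypothetical ids and values)
file_tbl <- data.frame(
  id = c("f1", "f2", "f3"),
  replicate = c("r1", "r1", "r2"),
  stringsAsFactors = FALSE
)
replicate_tbl <- data.frame(
  id = c("r1", "r2"),
  biological_replicate_number = c(1L, 2L),
  stringsAsFactors = FALSE
)
tables_demo <- list(file = file_tbl, replicate = replicate_tbl)

# Hypothetical helper: follow one row of the relations table, i.e. join
# on origin.<origin_column> = destination.id
join_relation <- function(tables, origin, destination, origin_column) {
  merge(
    tables[[origin]], tables[[destination]],
    by.x = origin_column, by.y = "id"
  )
}

# file.replicate = replicate.id
file_with_replicate <- join_relation(tables_demo, "file", "replicate", "replicate")
print(file_with_replicate)
```

Each row of the result pairs a file with the attributes of the replicate it points to. In the result, the column `replicate` holds the join key and `id` holds the file identifier.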