ENCODExplorer: A compilation of metadata from ENCODE

Audrey Lemacon, Louis Gendron, Charles Joly Beauparlant and Arnaud Droit.

This package and the underlying ENCODExplorer code are distributed under the Artistic license 2.0. You are free to use and redistribute this software.

“The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active” [source: ENCODE Project Portal].

However, retrieving and downloading data through the current web portal can be time-consuming, especially when multiple files from different experiments are involved.

This package is designed to facilitate data access by compiling the metadata associated with files, experiments, datasets, biosamples, and treatments.

We first extract the ENCODE schema from its public GitHub repository to rebuild the ENCODE database as a data.table database. With this package, the user is able to generate, store and query the ENCODE database locally. We also developed a function that extracts the essential metadata into an R object to aid data exploration.

We implemented time-saving features to select ENCODE files by querying their metadata, download them, and validate that each file was downloaded correctly.

The data.table database can be regenerated at will to keep it up-to-date.

This vignette will introduce all the main features of the ENCODExplorer package.

Loading ENCODExplorer package

library(ENCODExplorer)

Introduction

To date, there are 7 types of dataset in ENCODE: annotation, experiment, matched-set, project, reference, reference-epigenome and ucsc-browser-composite. This package comes with an up-to-date data.table containing the essential ENCODE file metadata: encode_df. This database contains all the files from all dataset types. The accession column corresponds to the accession of the dataset and the file_accession column corresponds to the accession of the file itself. The encode_df object is required by the functions provided in this package. Most of these functions load encode_df as the default database. For faster processing, we recommend loading encode_df once and passing it as an argument.

To load encode_df:

    data(encode_df, package = "ENCODExplorer")

In the current release, encode_df contains 138578 entries, of which 133103 come from the experiment dataset type.

Main functions

Query

The queryEncode function allows the user to find the subset of files matching a precise query, defined according to the following criteria:

Parameter	Description
set_accession	The experiment or dataset accession
assay	The assay type
biosample_name	The biosample name
dataset_accession	The dataset accession. The parameters set_accession and dataset_accession differ subtly: a file can be part of an experiment, a dataset, or both. set_accession returns all the files directly linked to the given accession (experiment and/or dataset), whereas dataset_accession returns the files directly linked to the requested dataset AND those which are part of an experiment and indirectly linked to the dataset (reported as related files in the dataset and as related_dataset in the experiment).
file_accession	The file accession
file_format	The file format. The current version of encode_df contains the following file formats: bam, bed, fastq, bigBed, bigWig, CEL, csfasta, csqual, fasta, gff, gtf, idat, rcc, sam, tagAlign, tar, tsv, vcf, wig.
lab	The laboratory
organism	The donor organism
target	The experimental target
treatment	The treatment related to the biosample
project	The project name/id
biosample_type	The biosample type

By default, queryEncode uses exact string matching to select the relevant entries. This behavior can be changed by setting the fixed option to FALSE.

The structure of the result set is similar to the encode_df table.

For example, to select all the fastq files produced from the human cell line MCF-7:

query_results <- queryEncode(df=encode_df, organism = "Homo sapiens",
                      biosample_name = "MCF-7", file_format = "fastq", fixed = TRUE)
## Results : 337 files, 121 datasets

The same request with an approximate spelling of the biosample name and the fixed option set to TRUE gives no results:

query_results <- queryEncode(df=encode_df, organism = "Homo sapiens",
                       biosample_name = "mcf7", file_format = "fastq", fixed = TRUE)

If you follow the warning guidance and set the fixed option to FALSE:

query_results <- queryEncode(df=encode_df, organism = "Homo sapiens",
                    biosample_name = "mcf7", file_format = "fastq", fixed = FALSE)
## Results : 337 files, 121 datasets
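The difference between the two matching modes can be sketched on a toy table. The snippet below is a minimal base R illustration, not the package's implementation: toy_df, match_exact, match_relaxed and normalise are hypothetical names, and the relaxed rule (ignore case and non-alphanumeric characters) is only an assumption about the kind of matching that fixed = FALSE permits.

```r
# Toy metadata table with made-up accessions; NOT the real encode_df.
toy_df <- data.frame(
  file_accession = c("FILE001", "FILE002", "FILE003"),
  biosample_name = c("MCF-7", "MCF-7", "A549"),
  file_format    = c("fastq", "bam", "fastq"),
  stringsAsFactors = FALSE
)

# Exact matching: the query must be identical to the stored value.
match_exact <- function(values, query) {
  values == query
}

# Relaxed matching (assumed rule): compare after lower-casing and
# dropping every character that is not a letter or a digit.
normalise <- function(x) tolower(gsub("[^[:alnum:]]", "", x))
match_relaxed <- function(values, query) {
  grepl(normalise(query), normalise(values), fixed = TRUE)
}

exact_hits   <- toy_df[match_exact(toy_df$biosample_name, "mcf7"), ]
relaxed_hits <- toy_df[match_relaxed(toy_df$biosample_name, "mcf7"), ]

nrow(exact_hits)    # 0 rows: "mcf7" is not identical to "MCF-7"
nrow(relaxed_hits)  # 2 rows: both "MCF-7" entries match
```

In this sketch the exact comparison misses "MCF-7" because of the case and the hyphen, which mirrors the empty result obtained above with fixed = TRUE.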

These criteria correspond to the filters that you can find on the ENCODE portal:

[Figure: results of a filtered search on the ENCODE portal]

fuzzySearch

This function is a more user-friendly version of the queryEncode function that also performs a search on the encode_df object. The character vector or list of character strings specified by the user is searched for in every column of the database. The user can also constrain the query to specific columns with the filterVector parameter.

The following request produces a data.table with every file that contains the term brca.

fuzzy_results <- fuzzySearch(searchTerm = c("brca"), database = encode_df)
## Results: 40 files, 5 datasets

Multiple terms can be searched at the same time. This example extracts all the files that contain brca or ZNF24 within the target column.

fuzzy_results <- fuzzySearch(searchTerm = c("brca", "ZNF24"), database = encode_df, filterVector = c("target"), multipleTerm = TRUE)
## Results: 106 files, 8 datasets

When searching for multiple terms, three types of input can be passed to the searchTerm parameter:

Single character string with commas between the terms
Character vector
List of characters
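To show why these three input forms are interchangeable, here is a minimal base R sketch that reduces each of them to the same character vector. The helper normalise_terms is hypothetical and is not part of ENCODExplorer; how fuzzySearch handles its input internally is not described here.

```r
# Hypothetical helper: turn any of the three accepted input forms
# into a plain character vector of trimmed, non-empty terms.
normalise_terms <- function(searchTerm) {
  terms <- unlist(searchTerm, use.names = FALSE)        # flatten a list
  terms <- unlist(strsplit(as.character(terms), ",", fixed = TRUE),
                  use.names = FALSE)                    # split on commas
  terms <- trimws(terms)                                # drop padding
  terms[nzchar(terms)]                                  # drop empty terms
}

from_string <- normalise_terms("brca, ZNF24")
from_vector <- normalise_terms(c("brca", "ZNF24"))
from_list   <- normalise_terms(list("brca", "ZNF24"))

# All three forms yield the same two terms.
identical(from_string, from_vector) && identical(from_vector, from_list)
```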

Search

This function simulates a keyword search that the user could perform through the ENCODE web portal.

The searchEncode function returns a data frame which corresponds to the result page provided by the ENCODE portal. If a specific file or dataset isn't available with fuzzySearch or queryEncode (i.e. within encode_df), the user can access the latest data of the ENCODE database with the searchEncode function.

Use searchToquery to convert the result of a search to a data.table with the same design as encode_df. This format contains more metadata and allows the user to extract all the files within the dataset. It also gives the user access to control files through the createDesign function.

Here is an example with the following search: “a549 chip-seq homo sapiens”.

On the ENCODE portal:

[Figure: results of a keyword search on the ENCODE portal]

With our function:

  search_results <- searchEncode(searchTerm = "a549 chip-seq homo sapiens",
                                 limit = "all")
## results : 380

createDesign

This function organizes the data.table created by fuzzySearch, queryEncode or searchToquery. It extracts the replicate and control files within a dataset.

When the format parameter is set to long, it creates a data.table with the file accessions, the dataset accessions and a numeric value associated with the nature of the file (1: replicate / 2: control).

When the format parameter is set to wide, each dataset has its own column, as illustrated below.

[Figure: wide design example]
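The relation between the two layouts can be sketched with base R on a toy table. This is only an illustration under stated assumptions, not the output of createDesign: the accessions are made up, the column names File, dataset and value are hypothetical, and filling the cells of files that do not belong to a dataset with 0 is an assumption.

```r
# Toy long design: one row per file, with 1 = replicate and 2 = control.
long_design <- data.frame(
  file    = c("FILE001", "FILE002", "FILE003", "FILE004", "FILE005"),
  dataset = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B"),
  value   = c(1, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

files    <- unique(long_design$file)
datasets <- unique(long_design$dataset)

# Wide layout: one row per file, one column per dataset.
# Cells default to 0 (assumed placeholder for "file not in this dataset").
wide <- matrix(0, nrow = length(files), ncol = length(datasets),
               dimnames = list(files, datasets))

# Fill each (file, dataset) cell by indexing on the dimnames.
wide[cbind(long_design$file, long_design$dataset)] <- long_design$value

wide_design <- data.frame(File = files, wide,
                          stringsAsFactors = FALSE, row.names = NULL)
print(wide_design)
```

Each dataset column can then be read as a small design vector: replicates and controls are marked, and every other file is left at the placeholder value.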

downloadEncode

This function allows the user to download a file or an entire dataset. Files can be downloaded by providing a vector of file accessions or dataset accessions (accession column in encode_df) to the file_acc parameter. This parameter can also be the data.table created by queryEncode, fuzzySearch, searchToquery or createDesign.

If the accession doesn't exist within the current encode_df database, the function will search for the accession directly in the ENCODE database. The user can specify the path to the download directory (default: /tmp).

To ensure file integrity, we perform an MD5 checksum comparison for each file.
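An MD5 comparison of this kind can be sketched with tools::md5sum, which ships with R. The snippet below is a minimal illustration, not the code used by downloadEncode: verify_md5 is a hypothetical helper, and the reference checksum is computed locally here, whereas in practice it is assumed to come from the file metadata.

```r
# Hypothetical helper: TRUE when the file's MD5 checksum equals the
# expected one, FALSE otherwise (including when the file is unreadable).
verify_md5 <- function(path, expected_md5) {
  observed <- unname(tools::md5sum(path))  # NA if the file cannot be read
  !is.na(observed) && identical(tolower(observed), tolower(expected_md5))
}

# Stand-in for a downloaded file.
tmp <- tempfile(fileext = ".txt")
writeLines("ENCODE", tmp)

# Stand-in for the checksum published with the file.
reference <- unname(tools::md5sum(tmp))

ok  <- verify_md5(tmp, reference)                           # checksums agree
bad <- verify_md5(tmp, "00000000000000000000000000000000")  # checksums differ
```

A mismatch usually means a truncated or corrupted transfer, in which case downloading the file again is the simplest remedy.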

If the accession is a dataset accession, the function downloads every file in this dataset. The format option, which is set to all by default, restricts the download to a specific file format.

Here is a small query:

query_results <- queryEncode(df=encode_df, assay = "switchgear", target ="elavl1", fixed = FALSE)
## Results : 2 files, 1 datasets

And its equivalent search:

search_results <- searchEncode(searchTerm = "switchgear elavl1", limit = "all")
## results : 1

To select a particular file format you can:

1) add filters to your query and then run the downloadEncode function.

query_results <- queryEncode(df=encode_df, assay = "switchgear", target ="elavl1", file_format = "bed" , fixed = FALSE)
downloadEncode(query_results, df = encode_df)

2) specify the format to the downloadEncode function.

downloadEncode(search_results, df=encode_df, format = "bed")

Conversion

The searchToquery function converts the result of searchEncode to a queryEncode output based on the accession numbers. The user can thus benefit from all the collected metadata and from the createDesign function.

The structure of the result set is similar to the encode_df structure.

Let's try it with the previous example:

1) search

search_results <- searchEncode(searchTerm = "switchgear elavl1", limit = "all")
## results : 1

2) convert

convert_results <- searchToquery(searchResults = search_results)

shinyEncode

This function launches the Shiny app of ENCODExplorer, which implements the fuzzySearch and queryEncode search functions. It also makes it possible to create a design to organize the data and to download specific files with the downloadEncode function. The Search tab of shinyEncode applies the fuzzySearch function for a low-specificity request and the Advanced Search tab applies the queryEncode function.

Simple request using Search