Audrey Lemacon, Louis Gendron, Charles Joly Beauparlant and Arnaud Droit.
This package and the underlying ENCODExplorer code are distributed under the Artistic License 2.0. You are free to use and redistribute this software.
“The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active” [source: ENCODE Project Portal].
However, retrieving and downloading data through the current web portal can be time-consuming, especially when multiple files from different experiments are needed.
This package has been designed to facilitate data access by compiling the metadata associated with files, experiments, datasets, biosamples, and treatments.
We first extract the ENCODE schema from its public GitHub repository to rebuild the ENCODE database as a data.table database. With this package, the user can generate, store and query the ENCODE database locally. We also developed a function that extracts the essential metadata into an R object to aid data exploration.
We implemented time-saving features to select ENCODE files by querying their metadata, download them, and validate that each file was downloaded correctly.
The data.table database can be regenerated at will to keep it up-to-date.
This vignette will introduce all the main features of the ENCODExplorer package.
library(ENCODExplorer)
To date, there are 7 types of dataset in ENCODE: annotation, experiment, matched-set, project, reference, reference-epigenome and ucsc-browser-composite.
This package comes with an up-to-date data.table containing the essential ENCODE file metadata: encode_df. This database contains all the files within all dataset types.
The accession column corresponds to the accession of the dataset and the file_accession
column corresponds to the actual accession of the file.
The encode_df object is mandatory for the functions provided in this package. Most of the provided functions will load encode_df as the default database. For faster processing, we recommend that the user load encode_df once and pass it as an argument.
To load encode_df:
data(encode_df, package = "ENCODExplorer")
In the current release, encode_df contains 138578 entries, of which 133103 come from the experiment dataset.
The queryEncode function allows the user to find the subset of files corresponding to a precise query defined according to the following criteria:
Parameter | Description |
---|---|
set_accession | The experiment or dataset accession |
assay | The assay type |
biosample_name | The biosample name |
dataset_accession | The dataset accession. There is a subtle difference between the parameters set_accession and dataset_accession: some files can be part of an experiment, a dataset, or both. With set_accession, you get all the files directly linked to this accession (experiment and/or dataset). With dataset_accession, you get the files directly linked to the requested dataset AND those that are part of an experiment and indirectly linked to the dataset (reported as related files in the dataset and related_dataset in the experiment). |
file_accession | The file accession |
file_format | The file format. The current version of encode_df contains the following file formats: bam, bed, fastq, bigBed, bigWig, CEL, csfasta, csqual, fasta, gff, gtf, idat, rcc, sam, tagAlign, tar, tsv, vcf, wig. |
lab | The laboratory |
organism | The donor organism |
target | The experimental target |
treatment | The treatment related to the biosample |
project | The project name/id |
biosample_type | The biosample type |
By default, the query function uses exact string matching to select the relevant entries. This behavior can be changed by setting the fixed option to FALSE.
The structure of the result set is similar to the encode_df table.
For example, to select all the fastq files produced from the human cell line MCF-7:
query_results <- queryEncode(df=encode_df, organism = "Homo sapiens",
biosample_name = "MCF-7", file_format = "fastq", fixed = TRUE)
## Results : 337 files, 121 datasets
The same request with an approximate spelling of the biosample name and the fixed option set to TRUE will give no results:
query_results <- queryEncode(df=encode_df, organism = "Homo sapiens",
biosample_name = "mcf7", file_format = "fastq", fixed = TRUE)
If you follow the warning guidance and set the fixed option to FALSE:
query_results <- queryEncode(df=encode_df, organism = "Homo sapiens",
biosample_name = "mcf7", file_format = "fastq", fixed = FALSE)
## Results : 337 files, 121 datasets
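The difference between exact and relaxed matching can be illustrated on a toy table. This is only a sketch in base R, not the package's implementation: toy_df, match_exact, match_relaxed and normalize are hypothetical names, and the normalization rule (lower-casing and dropping non-alphanumeric characters) is an assumption chosen so that "mcf7" finds "MCF-7".

```r
# Toy metadata table standing in for encode_df (all values are made up).
toy_df <- data.frame(
  file_accession = c("F1", "F2", "F3"),
  biosample_name = c("MCF-7", "MCF-7", "A549"),
  file_format    = c("fastq", "bam", "fastq"),
  stringsAsFactors = FALSE
)

# Exact matching: the query must equal the stored value character for character.
match_exact <- function(values, term) {
  values == term
}

# Relaxed matching (assumed rule, for illustration only): lower-case both
# sides and drop non-alphanumeric characters before looking for the term.
normalize <- function(x) tolower(gsub("[^[:alnum:]]", "", x))
match_relaxed <- function(values, term) {
  grepl(normalize(term), normalize(values), fixed = TRUE)
}

exact_hits   <- toy_df[match_exact(toy_df$biosample_name, "mcf7"), ]
relaxed_hits <- toy_df[match_relaxed(toy_df$biosample_name, "mcf7"), ]

nrow(exact_hits)    # no row matches the approximate spelling exactly
nrow(relaxed_hits)  # both MCF-7 rows match after normalization
```

Exact matching is the safer default because it never returns unintended entries; relaxed matching is more forgiving when the exact spelling used in the metadata is unknown.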
These criteria correspond to the filters available on the ENCODE portal.
The fuzzySearch function is a more user-friendly version of the queryEncode function that also searches the encode_df object. The character vector or list of characters specified by the user is searched in every column of the database. The user can also constrain the query with the filterVector parameter, which selects the specific columns in which to search for the query term.
The following request will produce a data.table with every file that contains the term brca.
fuzzy_results <- fuzzySearch(searchTerm = c("brca"), database = encode_df)
## Results: 40 files, 5 datasets
Multiple terms can be searched at the same time. This example extracts all the files that contain brca or ZNF24 within the target column.
fuzzy_results <- fuzzySearch(searchTerm = c("brca", "ZNF24"), database = encode_df, filterVector = c("target"), multipleTerm = TRUE)
## Results: 106 files, 8 datasets
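The idea of searching every column, or only selected columns, can be sketched in base R on a toy table. This is an illustration under stated assumptions, not the fuzzySearch implementation: toy_df and search_columns are hypothetical names, and the terms are treated as case-insensitive regular expressions.

```r
# Toy metadata table standing in for encode_df (all values are made up).
toy_df <- data.frame(
  file_accession = c("F1", "F2", "F3", "F4"),
  target         = c("BRCA1", "ZNF24", "CTCF", "CTCF"),
  lab            = c("lab-a", "lab-b", "brca-lab", "lab-c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows where at least one term matches
# (case-insensitively) in at least one of the chosen columns.
search_columns <- function(df, terms, columns = names(df)) {
  pattern <- paste(terms, collapse = "|")
  hits <- vapply(
    df[columns],
    function(col) grepl(pattern, col, ignore.case = TRUE),
    logical(nrow(df))
  )
  df[rowSums(hits) > 0, , drop = FALSE]
}

# Search one term in every column: matches the target of F1 and the lab of F3.
any_column <- search_columns(toy_df, "brca")

# Search two terms, restricted to the target column: matches F1 and F2.
target_only <- search_columns(toy_df, c("brca", "ZNF24"), columns = "target")
```

Restricting the search to a column avoids incidental hits, such as the lab name matched in the first call.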
When searching for multiple terms, three types of input can be passed to the searchTerm parameter:
The searchEncode function simulates a keyword search that the user could perform through the ENCODE web portal.
It returns a data frame that corresponds to the result page provided by the ENCODE portal. If a specific file or dataset isn't available with fuzzySearch or queryEncode (i.e. within encode_df), the user can access the latest data of the ENCODE database with the searchEncode function.
Use searchToquery to convert the result of a search to a data.table with the same design as encode_df. This format contains more metadata and allows the user to extract all the files within the dataset. It also allows the user to access control files with the createDesign function.
Here is an example with the following search: “a549 chip-seq homo sapiens”.
On the ENCODE portal:
With our function:
search_results <- searchEncode(searchTerm = "a549 chip-seq homo sapiens",
limit = "all")
## results : 408
The createDesign function organizes the data.table created by fuzzySearch, queryEncode or searchToquery. It extracts the replicate and control files within a dataset.
When the format parameter is set to long, it creates a data.table with the file accessions, the dataset accessions and a numeric value associated with the nature of the file (1: replicate / 2: control).
When the format parameter is set to wide, each dataset has its own column, as illustrated below.
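The relationship between the two layouts can be sketched with base R on a toy design. The column names (File, Experiment, Value) and the accessions are assumptions made for this example and may differ from the actual createDesign output.

```r
# Toy long-format design (made-up accessions): one row per file, with
# 1 marking a replicate and 2 marking a control.
long_design <- data.frame(
  File       = c("rep1.bam", "rep2.bam", "ctrl1.bam", "rep3.bam", "ctrl2.bam"),
  Experiment = c("EXP_A", "EXP_A", "EXP_A", "EXP_B", "EXP_B"),
  Value      = c(1, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Wide layout: one row per file and one column per dataset. A file that
# does not belong to a dataset gets NA in that dataset's column.
wide_design <- reshape(
  long_design,
  idvar     = "File",
  timevar   = "Experiment",
  direction = "wide"
)

names(wide_design)  # "File" plus one "Value.<dataset>" column per dataset
```

The long layout is convenient for filtering and grouping, while the wide layout reads like a design matrix with one column per dataset.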
The downloadEncode function allows the user to download a file or an entire dataset. Files can be downloaded by providing a vector of file accessions or dataset accessions (accession column in encode_df) to the file_acc parameter.
This parameter can also be the data.table created by queryEncode, fuzzySearch, searchToquery or createDesign.
If the accession doesn't exist within the current encode_df database, the function will search for the accession directly in the ENCODE database.
The user can specify the path to the download directory (default: /tmp).
To ensure file integrity, we compare md5 checksums for each file.
Moreover, if the accession is a dataset accession, the function will download each file in this dataset. The format option, which is set to all by default, allows the user to download a specific format.
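The md5 check mentioned above can be sketched with the base tools package. This is a minimal illustration, not the code used by downloadEncode: verify_md5 is a hypothetical helper, and the expected checksum, which in practice would come from the file metadata, is computed locally here for the sake of the example.

```r
# Write a small temporary file standing in for a downloaded ENCODE file.
tmp_file <- tempfile(fileext = ".txt")
writeLines("example content", tmp_file)

# Assumption for this sketch: the expected checksum is computed up front
# instead of being read from the ENCODE metadata.
expected_md5 <- unname(tools::md5sum(tmp_file))

# Hypothetical helper: TRUE only if the file is readable and its md5
# checksum equals the expected one.
verify_md5 <- function(path, expected) {
  observed <- unname(tools::md5sum(path))
  !is.na(observed) && identical(tolower(observed), tolower(expected))
}

intact <- verify_md5(tmp_file, expected_md5)

# Simulate a corrupted download by appending bytes, then check again.
cat("extra bytes\n", file = tmp_file, append = TRUE)
corrupted_ok <- verify_md5(tmp_file, expected_md5)

unlink(tmp_file)
```

A failed comparison signals a truncated or altered file, which should be downloaded again.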
Here is a small query:
query_results <- queryEncode(df=encode_df, assay = "switchgear", target ="elavl1", fixed = FALSE)
## Results : 2 files, 1 datasets
And its equivalent search:
search_results <- searchEncode(searchTerm = "switchgear elavl1", limit = "all")
## results : 1
To select a particular file format you can:
1) add filters to your query and then run the downloadEncode function.
query_results <- queryEncode(df=encode_df, assay = "switchgear", target ="elavl1", file_format = "bed" , fixed = FALSE)
downloadEncode(query_results, df = encode_df)
2) pass the format to the downloadEncode function.
downloadEncode(search_results, df=encode_df, format = "bed")
The searchToquery function converts the result of searchEncode to a queryEncode output based on the accession numbers. The user can thus benefit from all the collected metadata and from the createDesign function.
The structure of the result set is similar to the encode_df structure.
Let's try it with the previous example:
1) search
search_results <- searchEncode(searchTerm = "switchgear elavl1", limit = "all")
## results : 1
2) convert
convert_results <- searchToquery(searchResults = search_results)
The shinyEncode function launches the Shiny app of ENCODExplorer, which implements the fuzzySearch and queryEncode search functions. It also allows the user to create a design to organize data and to download specific files with the downloadEncode function.
The Search tab of shinyEncode applies the fuzzySearch function for a low-specificity request, and the Advanced Search tab applies the queryEncode function.