This package and the underlying code are distributed under the Artistic License 2.0. You are free to use and redistribute this software.

1 Rationale

“The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.” Source: ENCODE Project Portal.

However, retrieving and downloading data through the current web portal can be time-consuming, especially when multiple files from different experiments are involved.

This package has been designed to facilitate access to ENCODE data by compiling the metadata associated with files, experiments, datasets, biosamples, and treatments.

We implemented time-saving features to select ENCODE files by querying their metadata, download them, and validate that each file was downloaded correctly.
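One common way to validate a download is to compare the file's checksum with an expected value. The sketch below assumes MD5-based validation and uses a temporary file as a stand-in for a downloaded file. The helper name is_valid_download is hypothetical, and this is not necessarily how the package performs its own check.

```r
# Hypothetical helper: TRUE when the file exists and its MD5 checksum
# matches the expected value.
is_valid_download <- function(path, expected_md5) {
  file.exists(path) && identical(unname(tools::md5sum(path)), expected_md5)
}

# Stand-in for a downloaded file; no real ENCODE data is involved.
path <- tempfile(fileext = ".txt")
writeLines("example payload", path)

# Stand-in for a checksum that would normally come from the file metadata.
expected_md5 <- unname(tools::md5sum(path))

ok_before <- is_valid_download(path, expected_md5)

# Simulate a corrupted download by appending one character to the file.
cat("x", file = path, append = TRUE)
ok_after <- is_valid_download(path, expected_md5)
```

Here ok_before is TRUE and ok_after is FALSE, because the appended character changes the checksum.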

This vignette will introduce the main features of the ENCODExplorer package.

2 Loading the ENCODExplorer package

library(ENCODExplorer)

3 Introduction

To use the functionalities of the ENCODExplorer package, you must first download the data.table containing all of the ENCODE metadata.

This data.table is available through the AnnotationHub package. For convenience, the latest available version at each release will be downloaded and used by default.

We also provide the get_encode_df function to quickly obtain these metadata:

encode_df <- get_encode_df()
## using temporary cache /tmp/Rtmp56pXqJ/BiocFileCache
## snapshotDate(): 2020-10-26
## downloading 1 resources
## retrieving 1 resource
## loading from cache

4 Main functions

4.1 Query

The queryEncode function allows the user to find the subset of files corresponding to a precise query defined according to the following criteria:

| Parameter | Description |
|---|---|
| set_accession | The accession of the containing experiment or dataset. |
| dataset_accession | Differs subtly from set_accession. Some files can be part of an experiment, a dataset, or both. With set_accession, you get all the files directly associated with the given accession (experiment and/or dataset). With dataset_accession, you get the files directly associated with the requested dataset and those that are part of an experiment indirectly linked to the dataset (reported as related files in the dataset and related_dataset in the experiment). |
| file_accession | The accession of one specific file. |
| biosample_name | The biosample name (“GM12878”, “kidney”). |
| biosample_type | The biosample type (“tissue”, “cell line”). |
| assay | The assay type (“ChIP-seq”, “polyA plus RNA-seq”). |
| file_format | The file format. Some currently available formats include bam, bed, fastq, bigBed, bigWig, CEL, csfasta, csqual, fasta, gff, gtf, idat, rcc, sam, tagAlign, tar, tsv, vcf, wig. |
| lab | The laboratory. |
| organism | The donor organism (“Homo sapiens”, “Mus musculus”). |
| target | The gene, protein, or histone mark targeted by the assay (immunoprecipitated protein in ChIP-seq, knocked-down gene in CRISPR RNA-seq assays, etc.). |
| treatment | The treatment related to the biosample. |
| project | The project name/id. |
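The difference between set_accession and dataset_accession can be illustrated on a toy table. The sketch below is only an analogy in base R: the data frame, its column names, and the accession values are all hypothetical and do not reflect the actual layout of the ENCODE metadata.

```r
# Toy file table (hypothetical columns and values, not real ENCODE metadata).
files <- data.frame(
  file_accession  = c("FILE1", "FILE2", "FILE3"),
  accession       = c("DATASET1", "EXPERIMENT1", "EXPERIMENT2"),
  related_dataset = c(NA, "DATASET1", NA),
  stringsAsFactors = FALSE
)

# Direct association only (in the spirit of set_accession).
direct <- files[files$accession == "DATASET1", ]

# Direct association plus files from experiments linked to the dataset
# (in the spirit of dataset_accession).
linked <- files[files$accession == "DATASET1" |
                  (!is.na(files$related_dataset) &
                     files$related_dataset == "DATASET1"), ]
```

In this toy case, direct contains only FILE1, while linked contains FILE1 and FILE2.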

By default, the query function uses exact string matching to perform the selection of the relevant entries. This behavior can be changed by modifying the fixed or fuzzy parameters. Setting fixed to FALSE will perform case-insensitive regular expression matching. Setting fuzzy to TRUE will retrieve search results where the query string is a partial match.
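The three matching modes described above can be mimicked on a toy vector with base R. This is a minimal sketch for intuition only: the helper names are hypothetical, and the loose matching shown here (dropping punctuation and case before a substring search) is an assumption, not the package's actual fuzzy implementation.

```r
# Toy stand-in for a metadata column; not real ENCODE data.
biosample_name <- c("MCF-7", "HeLa-S3", "GM12878", "MCF 10A")

# Exact string matching (in the spirit of fixed = TRUE, fuzzy = FALSE).
exact_match <- function(values, query) values == query

# Case-insensitive regular expression matching (in the spirit of fixed = FALSE).
regex_match <- function(values, query) {
  grepl(query, values, ignore.case = TRUE)
}

# Hypothetical loose matching: strip non-alphanumeric characters, lowercase,
# then look for the query as a substring.
normalize <- function(x) tolower(gsub("[^[:alnum:]]", "", x))
loose_match <- function(values, query) {
  grepl(normalize(query), normalize(values), fixed = TRUE)
}

exact_hits <- biosample_name[exact_match(biosample_name, "mcf7")]
regex_hits <- biosample_name[regex_match(biosample_name, "^hela")]
loose_hits <- biosample_name[loose_match(biosample_name, "mcf7")]
```

With this toy data, exact_hits is empty, regex_hits contains "HeLa-S3", and loose_hits contains "MCF-7".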

The result set is a subset of the encode_df_lite table.

For example, to select all fastq files originating from assays on the MCF-7 (human breast cancer) cell line:

query_results <- queryEncode(organism = "Homo sapiens", 
                      biosample_name = "MCF-7", file_format = "fastq",
                      fixed = TRUE)
## Results : 811 files, 233 datasets

The same request with approximate spelling of the biosample name will return no results:

query_results <- queryEncode(organism = "Homo sapiens", biosample_name = "mcf7",
                        file_format = "fastq", fixed = TRUE,
                        fuzzy = FALSE)
## No result found in encode_df. You can try the <searchEncode> function or set the fuzzy option to TRUE.

However, if you follow the guidance in the message and set the fuzzy parameter to TRUE, the query returns the expected files:

query_results <- queryEncode(organism = "Homo sapiens",
                    biosample_name = "mcf7", file_format = "fastq",
                    fixed = TRUE, fuzzy = TRUE)
## Results : 811 files, 233 datasets

You can also perform matching through regular expressions by setting fixed to FALSE.

query_results <- queryEncode(assay = ".*RNA-seq",
                    biosample_name = "HeLa-S3", fixed = FALSE)
## Results : 318 files, 11 datasets
table(query_results$assay)
## 
## polyA minus RNA-seq  polyA plus RNA-seq       small RNA-seq 
##                  90                 150                  78

Finally, the queryEncodeGeneric function can be used to perform searches on columns which are not part of the queryEncode interface but are present within the encode_df_lite data.table:

query_results <- queryEncodeGeneric(biosample_name="HeLa-S3",
                    assay="RNA-seq", submitted_by="Diane Trout",
                    fuzzy=TRUE)
## Results : 54 files, 2 datasets
table(query_results$submitted_by)
## 
## Diane Trout 
##          54
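The idea of filtering on arbitrary columns can be sketched in base R as follows. The function filter_by_columns and the toy table are hypothetical and only illustrate the general technique of combining one pattern per named column; they are not the package's implementation.

```r
# Hypothetical helper: keep rows where every named column matches its pattern.
filter_by_columns <- function(df, ..., ignore_case = TRUE) {
  criteria <- list(...)
  unknown <- setdiff(names(criteria), names(df))
  if (length(unknown) > 0) {
    stop("Unknown column(s): ", paste(unknown, collapse = ", "))
  }
  keep <- rep(TRUE, nrow(df))
  for (column in names(criteria)) {
    # Each criterion is treated as a regular expression.
    keep <- keep & grepl(criteria[[column]], df[[column]],
                         ignore.case = ignore_case)
  }
  df[keep, , drop = FALSE]
}

# Toy metadata table (invented values, not real ENCODE data).
toy <- data.frame(
  biosample_name = c("HeLa-S3", "HeLa-S3", "MCF-7"),
  assay          = c("polyA plus RNA-seq", "ChIP-seq", "small RNA-seq"),
  submitted_by   = c("Submitter A", "Submitter A", "Submitter B"),
  stringsAsFactors = FALSE
)

hits <- filter_by_columns(toy, biosample_name = "HeLa-S3", assay = "RNA-seq")
```

Only the first row of toy satisfies both criteria, so hits has a single row.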

These criteria correspond to the filters that you can find on the ENCODE portal:

Figure: results of a filtered search on the ENCODE portal.

4.2 fuzzySearch

This function is a more user-friendly version of queryEncode that also performs searches on the encode_df_lite object. The character vector or list of character strings specified by the user is searched for in every column of the database. The user can also constrain the query to specific columns by using the filterVector parameter.

The following request will produce a data.table with every file containing the term brca:

fuzzy_results <- fuzzySearch(searchTerm = c("brca"))
## Results: 236 files, 7 datasets

Multiple terms can be searched simultaneously. This example extracts all files containing brca or ZNF24 within the target column.

fuzzy_results <- fuzzySearch(searchTerm = c("brca", "ZNF24"),
                             filterVector = c("target"),
                             multipleTerm = TRUE)
## Results: 710 files, 17 datasets
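Searching one or more terms across all columns, or across a restricted set of columns, can be sketched in base R. The function search_columns and the toy table below are hypothetical. The sketch assumes case-insensitive matching and joins the terms into a single regular expression, which may differ from what fuzzySearch does internally.

```r
# Hypothetical helper: keep rows where any selected column matches any term.
search_columns <- function(df, terms, columns = names(df)) {
  # Terms are combined into one alternation pattern, e.g. "brca|ZNF24".
  pattern <- paste(terms, collapse = "|")
  per_column <- lapply(columns, function(column) {
    grepl(pattern, as.character(df[[column]]), ignore.case = TRUE)
  })
  keep <- Reduce(`|`, per_column, rep(FALSE, nrow(df)))
  df[keep, , drop = FALSE]
}

# Toy metadata table (invented values, not real ENCODE data).
toy <- data.frame(
  target      = c("BRCA1", "ZNF24", "CTCF"),
  description = c("ChIP-seq on MCF-7", "ChIP-seq on HepG2",
                  "brca-related control"),
  stringsAsFactors = FALSE
)

# Search every column for a single term.
any_column <- search_columns(toy, "brca")

# Search only the target column for either of two terms.
target_only <- search_columns(toy, c("brca", "ZNF24"), columns = "target")
```

In this toy case, any_column keeps the BRCA1 and CTCF rows (the latter through its description), while target_only keeps the BRCA1 and ZNF24 rows.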

When searching for multiple terms, three types of input can be passed to the searchTerm parameter:

- A single character string where the various terms are separated by commas
- A character vector
- A list of character strings
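To show how these three input forms can be treated uniformly, the sketch below reduces each of them to the same character vector of terms. The function normalize_terms is hypothetical and is not part of the package; how fuzzySearch handles its input internally may differ.

```r
# Hypothetical helper: turn any of the three input forms into a character
# vector of individual search terms.
normalize_terms <- function(search_term) {
  # Flatten a list (or leave a vector unchanged).
  terms <- unlist(search_term, use.names = FALSE)
  # Split comma-separated strings into separate terms.
  terms <- unlist(strsplit(as.character(terms), ",", fixed = TRUE))
  # Remove surrounding whitespace and drop empty entries.
  terms <- trimws(terms)
  terms[nzchar(terms)]
}

from_string <- normalize_terms("brca, ZNF24")
from_vector <- normalize_terms(c("brca", "ZNF24"))
from_list   <- normalize_terms(list("brca", "ZNF24"))
```

All three calls yield the same two terms, brca and ZNF24.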