4 Main functions
4.1 Query
The queryEncode function allows the user to find the subset of files corresponding to
a precise query defined according to the following criteria:
Parameter | Description |
---|---|
set_accession | The accession for the containing experiment or dataset |
dataset_accession | The accession of a dataset. This differs subtly from set_accession, because a file can be part of an experiment, a dataset or both. set_accession returns all the files directly associated with the given accession (experiment and/or dataset), whereas dataset_accession returns the files directly associated with the requested dataset and also those that are part of an experiment indirectly linked to that dataset (reported as related files in the dataset and as related_dataset in the experiment). |
file_accession | The accession for one specific file |
biosample_name | The biosample name (“GM12878”, “kidney”) |
biosample_type | The biosample type (“tissue”, “cell line”) |
assay | The assay type (“ChIP-seq”, “polyA plus RNA-seq”) |
file_format | The file format. Some currently available formats include bam, bed, fastq, bigBed, bigWig, CEL, csfasta, csqual, fasta, gff, gtf, idat, rcc, sam, tagAlign, tar, tsv, vcf, wig. |
lab | The laboratory |
organism | The donor organism (“Homo sapiens”, “Mus musculus”) |
target | The gene, protein or histone mark which was targeted by the assay (immunoprecipitated protein in ChIP-seq, knocked-down gene in CRISPR RNA-seq assays, etc.) |
treatment | The treatment related to the biosample |
project | The project name/id |
By default, the query function uses exact string matching to select the
relevant entries. This behavior can be changed by modifying the fixed or fuzzy
parameters. Setting fixed to FALSE will perform case-insensitive regular
expression matching. Setting fuzzy to TRUE will retrieve search results where
the query string is a partial match.
The result set is a subset of the encode_df_lite table.
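The three matching modes can be illustrated with base R on a small made-up vector. This is only a sketch of the general idea and not the ENCODExplorer implementation: the biosamples vector and the normalize helper are hypothetical, and stripping separators for the approximate case is an assumption about how a query such as mcf7 could be matched against MCF-7.

```r
# Toy biosample names (made up for illustration; not taken from encode_df_lite).
biosamples <- c("MCF-7", "HeLa-S3", "GM12878", "MCF 10A")

# Exact string matching: the query must equal the entry.
exact_hits <- biosamples[biosamples == "MCF-7"]
exact_miss <- biosamples[biosamples == "mcf7"]

# Case-insensitive regular expression matching.
regex_hits <- biosamples[grepl("^mcf", biosamples, ignore.case = TRUE)]

# Hypothetical helper: lower-case the text and drop non-alphanumeric characters.
normalize <- function(x) tolower(gsub("[^[:alnum:]]", "", x))

# Approximate matching (assumed behaviour): the normalized query must be a
# substring of the normalized entry.
fuzzy_hits <- biosamples[grepl(normalize("mcf7"), normalize(biosamples),
                               fixed = TRUE)]

print(exact_hits)
print(exact_miss)
print(regex_hits)
print(fuzzy_hits)
```

In this toy setting the exact query for mcf7 finds nothing, while the approximate variant recovers MCF-7, which mirrors the behaviour of the queryEncode examples in this section.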
For example, to select all fastq files originating from assays on the MCF-7 (human breast cancer) cell line:
query_results <- queryEncode(organism = "Homo sapiens",
biosample_name = "MCF-7", file_format = "fastq",
fixed = TRUE)
## Results : 811 files, 233 datasets
The same request with approximate spelling of the biosample name will return no results:
query_results <- queryEncode(organism = "Homo sapiens", biosample_name = "mcf7",
file_format = "fastq", fixed = TRUE,
fuzzy = FALSE)
## No result found in encode_df. You can try the <searchEncode> function or set the fuzzy option to TRUE.
However, following the guidance in the message and setting the fuzzy parameter to TRUE returns the expected files:
query_results <- queryEncode(organism = "Homo sapiens",
biosample_name = "mcf7", file_format = "fastq",
fixed = TRUE, fuzzy = TRUE)
## Results : 811 files, 233 datasets
You can also perform matching through regular expressions by setting fixed to FALSE.
query_results <- queryEncode(assay = ".*RNA-seq",
biosample_name = "HeLa-S3", fixed = FALSE)
## Results : 318 files, 11 datasets
table(query_results$assay)
##
## polyA minus RNA-seq  polyA plus RNA-seq       small RNA-seq
##                  90                 150                  78
Finally, the queryEncodeGeneric function can be used to perform searches on
columns which are not part of the queryEncode interface but are present within
the encode_df_lite data.table:
query_results <- queryEncodeGeneric(biosample_name="HeLa-S3",
assay="RNA-seq", submitted_by="Diane Trout",
fuzzy=TRUE)
## Results : 54 files, 2 datasets
table(query_results$submitted_by)
##
## Diane Trout
## 54
These criteria correspond to the filters that you can find on the ENCODE portal.
[Figure: results of a filtered search on the ENCODE portal]
4.2 fuzzySearch
This function is a more user-friendly version of queryEncode that also
performs searches on the encode_df_lite object. The character vector or the
list of characters specified by the user will be searched for in every column of
the database. The user can also restrict the search to specific columns by
listing them in the filterVector parameter.
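The idea of searching a term in every column, or only in selected columns, can be sketched in base R on a toy table. The toy_df table and the search_rows helper below are hypothetical and only illustrate the principle; they are not how fuzzySearch is implemented.

```r
# Toy metadata table (made up; not taken from encode_df_lite).
toy_df <- data.frame(
  accession = c("FILE1", "FILE2", "FILE3"),
  target = c("BRCA1", "ZNF24", "CTCF"),
  biosample_name = c("HeLa-S3", "K562", "brca-like line"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows where `term` occurs (case-insensitively)
# in at least one of the selected columns. All columns are searched by default.
search_rows <- function(df, term, columns = names(df)) {
  per_column <- lapply(df[columns],
                       function(col) grepl(term, col, ignore.case = TRUE))
  df[Reduce(`|`, per_column), , drop = FALSE]
}

# Search every column.
all_columns <- search_rows(toy_df, "brca")

# Restrict the search to the target column.
target_only <- search_rows(toy_df, "brca", columns = "target")

print(all_columns$accession)
print(target_only$accession)
```

Restricting the columns narrows the result: the unrestricted search also picks up the row whose biosample name contains the term, while the restricted one keeps only the row whose target matches.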
The following request will produce a data.table with every file containing the term brca.
fuzzy_results <- fuzzySearch(searchTerm = c("brca"))
## Results: 236 files, 7 datasets
Multiple terms can be searched simultaneously. This example extracts all files containing brca or ZNF24 within the target column.
fuzzy_results <- fuzzySearch(searchTerm = c("brca", "ZNF24"),
filterVector = c("target"),
multipleTerm = TRUE)
## Results: 710 files, 17 datasets
When searching for multiple terms, three types of input can be passed to the
searchTerm parameter:
- A single character string in which the terms are separated by commas
- A character vector
- A list of characters
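As a rough illustration of why these three forms are interchangeable, the base R sketch below reduces each of them to the same character vector of terms. The normalize_terms helper is hypothetical and is not part of ENCODExplorer; how fuzzySearch handles its input internally may differ.

```r
# Hypothetical helper: turn any of the three input forms into a character
# vector holding one search term per element.
normalize_terms <- function(search_term) {
  # Flatten a list (or leave a character vector unchanged).
  terms <- unlist(search_term, use.names = FALSE)
  # Split comma-separated strings into individual terms.
  terms <- unlist(strsplit(terms, ",", fixed = TRUE), use.names = FALSE)
  # Remove whitespace left around the commas.
  trimws(terms)
}

from_string <- normalize_terms("brca, ZNF24")
from_vector <- normalize_terms(c("brca", "ZNF24"))
from_list   <- normalize_terms(list("brca", "ZNF24"))

print(from_string)
print(from_vector)
print(from_list)
```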
4.3 Search
This function simulates a keyword search performed through the ENCODE web portal.
The searchEncode function returns a data frame corresponding to the result
page provided by the ENCODE portal. If a specific file or dataset isn’t
available with fuzzySearch or queryEncode (i.e. within get_encode_df()),
the user can access the latest data from the ENCODE database through the
searchEncode function.
The searchToquery function converts the result of a search to a data.table
with the same design as get_encode_df(). This format contains more metadata
and allows the user to extract all files within the dataset. It also
allows the user to create a design using the createDesign function.
Here is an example using the following search: “a549 chip-seq homo sapiens”.
On the ENCODE portal: