The DeepBlue Epigenomic Data Server is an online application that allows researchers to access data from various epigenomic mapping consortia such as DEEP, BLUEPRINT, ENCODE, or ROADMAP. DeepBlue can be accessed through a web interface or programmatically via its API. The usage of the API is documented with examples, use cases, and a user manual. While the description of the API is language agnostic, the examples and use cases shown online are focused on the python language. However, the R package presented here also enables access to the DeepBlue API directly within the R statistical environment and provides convenient functionality for triggering operations on the DeepBlue server as well as for data retrievel using R functions. In the following, we give a brief introduction to the package and subsequently show how python examples from the online documentation can be reproduced with it.
A wealth of epigenomic data has been collected over the past decade by large epigenomic mapping consortia. Event though most of these data are publicly available, the task of identifiying, downloading and processing data from various experiments is challenging. Recognizing that these tedious steps need to be tackled programmatically, we developed the DeepBlue epigenomic data server. Epigenome data from the different epigenome mapping consortia are accessible with standardized metadata. An experiment is the most important entity in DeepBlue and typically encompasses a single file (usually a bed or wig file) with a set of mandatory metadata: name, genome assembly, epigenetic mark, biosource, sample, technique, and project. For the sake of organization, all metadata fields are part of controlled vocabularies, some of which are imported from ontologies (CL, EFO, and UBERON, to name a few). DeepBlue also contains annotations, i.e. auxiliary data that is helpful in epigenomic analysis, such as, for example, CpG Islands, promoter regions, and genes. DeepBlue provides different types of commands, such as listing and searching commands as well as commands for data retrieval. A typical work-flow for the latter is to select, filter, transform, and finally download the selected data. For a more thorough description of DeepBlue we refer to the DeepBlue publication in the 2016 NAR webserver issue. If you find DeepBlue useful and use it in your project consider citing this paper.
Important note: With the exception of data aggregation tasks, DeepBlue does not alter the imported data, i.e. it remains exactly as provided by the epigenome mapping consortia.
Installation of DeepBlueR and its companion packages can be performed using the Bioconductor installer:
source("https://bioconductor.org/biocLite.R")
biocLite("DeepBlueR")
The package name is DeepBlueR
and it can be loaded via:
library(DeepBlueR)
You can test your installation and connectivity by saying hello to the DeepBlue server:
deepblue_info("me")
DeepBlue provides a comprehensive programmatic interface for finding, selecting, filtering, summarizing and downloading annotated genomic region sets. Downloaded region sets are stored using the GenomicRanges R package, which allows for downloaded region sets to be further processed, visualized and analyzed with existing R packages such as LOLA or GViz.
A list of all commands available by DeepBlue is provided in its API page. The vast majority of these commands is also available through this R package and can be listed as follows:
help(package="DeepBlueR")
In the following we listed the most frequently used DeepBlue commands. The full list of commands is available here. Note that each command in the following two tables has the prefix ’deepblue_*’, e.g. deepblue_select_genes.
Category | Command | Description |
---|---|---|
Information | info | Information about an entity |
List and search | list_genomes | List registered genomes |
list_biosources | List registered biosources | |
list_samples | List registered samples | |
list_epigenetic marks | List registered epigenetic marks | |
list_experiments | List available experiments | |
list_annotations | List available annotations | |
search | Perform a full-text search | |
Selection | select_regions | Select regions from experiments |
select_experiments | Select regions from experiments | |
select_annotations | Select regions from annotations | |
select_genes | Select genes as regions | |
select_expressions | Select expression data | |
tiling_regions | Generate tiling regions | |
input_regions | Upload and use a small region-set | |
Operation | aggregate | Aggregate and summarize regions |
filter_regions | Filter regions by theirs attributes | |
flank | Generate flanking regions | |
intersection | Filter for intersecting regions | |
overlap | Filter for regions overlapping by at least a specific size | |
merge_queries | Merge two regions set | |
Result | count_regions | Count selected regions |
score_matrix | Request a score matrix | |
get_regions | Request the selected regions | |
binning | Bin results according to counts | |
Request | get_request data | Obtain the requested data |
In addition, this package provides a set of convenience functions not part of the DeepBlue API, such as:
Category | Command | Description |
---|---|---|
Request | batch_export_results | Download the results for a list of requests |
download_request_data | Download and convert the requested data (blocking) | |
export_meta_data | Export metadata to a tab delimited file | |
export_tab | Export any result as tab delimited file | |
export_bed | Export GenomicRanges results as BED file |
In the following we give a number of increasingly complex examples illustrating what DeepBlue can achieve in your epigenomic data analysis work-flow. We go beyond the online description of these examples by showing how the retrieved information can be further used in R.
One of the first tasks in DeepBlue is finding the data of interest. This can be achieved in three ways:
deepblue_search
commanddeepblue_list_{experiments, annotations, ...}
commandsIn this example, we use the command deepblue_search
to find experiments that contain the keywords ‘H3k27AC’, ‘blood’, and ‘peaks’ in their metadata. We put the names in single quotes to show that these names must be in the metadata.
# We are selecting the experiments with terms 'H3k27AC', 'blood', and
# 'peak' in the metadata.
experiments_found = deepblue_search(
keyword="'H3k27AC' 'blood' 'peak'", type="experiments")
custom_table = do.call("rbind", apply(experiments_found, 1, function(experiment){
experiment_id = experiment[1]
# Obtain the information about the experiment_id
info = deepblue_info(experiment_id)
# Print the experiment name, project, biosource, and epigenetic mark.
with(info, { data.frame(name = name, project = project,
biosource = sample_info$biosource_name, epigenetic_mark = epigenetic_mark)
})
}))
head(custom_table)
## name project biosource
## 1 E038-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD
## 2 E047-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD
## 3 E048-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD
## 4 E037-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD
## 5 E045-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD
## 6 E040-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD
## epigenetic_mark
## 1 H3K27ac
## 2 H3K27ac
## 3 H3K27ac
## 4 H3K27ac
## 5 H3K27ac
## 6 H3K27ac
We use the deepblue_list_experiments
command to list all experiments with the corresponding values in their metadata.
experiments = deepblue_list_experiments(type="peaks", epigenetic_mark="H3K4me3",
biosource=c("inflammatory macrophage", "macrophage"),
project="BLUEPRINT Epigenome")
The extra-metadata is important because it contains information that is not stored in the mandatory metadata fields. We use the deepblue_info
command to access an experiment’s metadata- and extra-metadata fields. The following example prints the file_url
attribute that is contained in the data imported from the ENCODE project.
info = deepblue_info("e30000")
print(info$extra_metadata$file_url)
## [1] "https://www.encodeproject.org/files/ENCFF001YBB/"
We use the deepblue_select_experiments
command to select all genomic regions from the two informed experiments. We use the deepblue_count_regions
command with the query_id
value returned by the deepblue_select_experiments
command.
The deepblue_count_regions
command is executed asynchronously. This means that the user receives a request_id
and should check the status of this request. In contrast to the command deepblue_get_request_data
, the DeepBlueR package-specific command deepblue_download_request_data
will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.
query_id = deepblue_select_experiments(
experiment_name=c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
"S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"))
# Count how many regions where selected
request_id = deepblue_count_regions(query_id=query_id)
# Download the request data as soon as processing is finished
requested_data = deepblue_download_request_data(request_id=request_id)
print(paste("The selected experiments have", requested_data, "regions."))
## [1] "The selected experiments have 115347 regions."
We use the deepblue_select_experiments
command to select genomic regions from the experiments that are in chromosome 1, position 0 to 50,000,000.
We then use the deepblue_get_regions
command with the query_id
value returned by the deepblue_select_experiments
command to request the regions with the selected columns. Selecting the columns @NAME
and @BIOSOURCE
represent the experiment name and the experiment biosource.
The deepblue_get_regions
command is executed asynchronously. This means that the user receives a request_id
to be able to check for the status of this request. In contrast to the command deepblue_get_request_data
, the DeepBlueR package-specific command deepblue_download_request_data
will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.
query_id = deepblue_select_experiments (
experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
"S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
chromosome="chr1", start=0, end=50000000)
# Retrieve the experiments data (The @NAME meta-column is used to include the
# experiment name and @BIOSOURCE for experiment's biosource
request_id = deepblue_get_regions(query_id=query_id,
output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 3783 ranges and 4 metadata columns:
## seqnames ranges strand | SIGNAL_VALUE PEAK
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 [270668, 270987] * | 6.5758 39
## [2] chr1 [271277, 271468] * | 6.2148 136
## [3] chr1 [273768, 274209] * | 14.1567 164
## [4] chr1 [778377, 778676] * | 8.0198 154
## [5] chr1 [778409, 778678] * | 4.5767 123
## ... ... ... ... . ... ...
## [3779] chr1 [47437420, 47437621] * | 3.7686 147
## [3780] chr1 [47437751, 47438038] * | 9.6553 149
## [3781] chr1 [48245368, 48245867] * | 4.7708 346
## [3782] chr1 [48542755, 48543280] * | 7.3002 152
## [3783] chr1 [48793649, 48793986] * | 5.1974 108
## @NAME @BIOSOURCE
## <character> <character>
## [1] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [2] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [3] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [4] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [5] BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed BL-2
## ... ... ...
## [3779] BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed BL-2
## [3780] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [3781] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [3782] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [3783] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We use the deepblue_list_samples
command to obtain all samples with the biosource ‘myeloid cell’ from the BLUEPRINT project. The deepblue_list_samples
returns a list of samples with their IDs and content. We extract the sample IDs from this list and use it in the deepblue_select_regions
command to selects genomic regions that are in chromosome 1, position 0 to 50,000.
Then, we use the deepblue_get_regions
command with the parameter query_id
returned by the deepblue_select_regions
command and the columns @NAME
, SAMPLE_ID
, and @BIOSOURCE
representing the experiment name, the sample ID, and the experiment biosource.
The deepblue_get_regions
command is executed asynchronously. This means that the user receives a request_id
to be able to check for the status of this request. In contrast to the command deepblue_get_request_data
, the DeepBlueR package-specific command deepblue_download_request_data
will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.
samples = deepblue_list_samples(
biosource="myeloid cell",
extra_metadata = list("source" = "BLUEPRINT Epigenome"))
samples_ids = deepblue_extract_ids(samples)
query_id = deepblue_select_regions(genome="GRCh38", sample=samples_ids,
chromosome="chr1", start=0, end=50000)
request_id = deepblue_get_regions(query_id=query_id,
output_format="CHROMOSOME,START,END,@NAME,@SAMPLE_ID,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions,1)
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 [0, 10000] * |
## @NAME @SAMPLE_ID
## <character> <character>
## [1] S00Q7NH1_12_12_Blueprint_release_201608_segments.bed s8797
## @BIOSOURCE
## <character>
## [1] myeloid cell
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We use the deepblue_select_experiments
command for selecting genomic regions from two specific experiments that are in chromosome 1, position 0 to 50,000,000. Then, we filter these for regions with SIGNAL_VALUE
> 10 and PEAK
> 1000.
Then, we use the deepblue_get_regions
command with the parameter query_id
returned by the deepblue_select_regions
command and the columns @NAME
and @BIOSOURCE
representing the experiment name and the experiment biosource.
The deepblue_get_regions
command is executed asynchronously. This means that the user receives a request_id
to be able to check for the status of this request. In contrast to the command deepblue_get_request_data
, the DeepBlueR package-specific command deepblue_download_request_data
will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.
query_id = deepblue_select_experiments(
experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
"S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
chromosome="chr1", start=0, end=50000000)
query_id_filter_signal = deepblue_filter_regions(
query_id=query_id, field="SIGNAL_VALUE", operation=">",
value="10", type="number")
query_id_filters = deepblue_filter_regions(
query_id=query_id_filter_signal, field="PEAK", operation=">",
value="1000", type="number")
request_id = deepblue_get_regions(query_id=query_id_filters,
output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 161 ranges and 4 metadata columns:
## seqnames ranges strand | SIGNAL_VALUE PEAK
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 [1142428, 1144001] * | 10.9313 1275
## [2] chr1 [1573400, 1575582] * | 17.8805 1094
## [3] chr1 [1612814, 1616174] * | 32.2064 2802
## [4] chr1 [1668761, 1670450] * | 20.2936 1017
## [5] chr1 [1778583, 1783797] * | 35.4277 1293
## ... ... ... ... . ... ...
## [157] chr1 [44774644, 44776655] * | 16.3227 1160
## [158] chr1 [44806139, 44811000] * | 22.8156 1381
## [159] chr1 [46301112, 46304262] * | 19.8041 2397
## [160] chr1 [46579227, 46582046] * | 15.9613 1824
## [161] chr1 [46593677, 46595181] * | 11.8798 1304
## @NAME @BIOSOURCE
## <character> <character>
## [1] BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed BL-2
## [2] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [3] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [4] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [5] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## ... ... ...
## [157] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [158] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [159] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [160] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [161] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We use the deepblue_select_experiments
command for selecting genomic regions from two specific experiments that are in chromosome 1, position 0 to 50,000,000. Then, we filter these for regions with SIGNAL_VALUE
> 10 and PEAK
> 1000.
The command deepblue_intersection
filters for all regions of the query_id
that intersect with at least one region in promoters_id
.
Then, we use the deepblue_get_regions
command with the parameter query_id
returned by the deepblue_select_regions
command and the columns @NAME
and @BIOSOURCE
representing the experiment name and the experiment biosource.
The deepblue_get_regions
command is executed asynchronously. This means that the user receives a request_id
to be able to check for the status of this request. In contrast to the command deepblue_get_request_data
, the DeepBlueR package-specific command deepblue_download_request_data
will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.
query_id = deepblue_select_experiments(
experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
"S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
chromosome="chr1", start=0, end=50000000)
promoters_id = deepblue_select_annotations(annotation_name="promoters",
genome="GRCh38", chromosome="chr1")
intersect_id = deepblue_intersection(
query_data_id=query_id, query_filter_id=promoters_id)
request_id = deepblue_get_regions(
query_id=intersect_id,
output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 608 ranges and 4 metadata columns:
## seqnames ranges strand | SIGNAL_VALUE PEAK
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 [903997, 904177] * | 4.7708 89
## [2] chr1 [904302, 905111] * | 5.4928 560
## [3] chr1 [910269, 910975] * | 4.7201 136
## [4] chr1 [911973, 913915] * | 17.0446 624
## [5] chr1 [923976, 924329] * | 4.7201 109
## ... ... ... ... . ... ...
## [604] chr1 [46718435, 46719027] * | 6.2499 230
## [605] chr1 [47313340, 47313980] * | 6.8711 174
## [606] chr1 [47313412, 47313588] * | 4.8558 120
## [607] chr1 [47313632, 47314141] * | 10.2801 371
## [608] chr1 [47333183, 47335172] * | 18.9772 857
## @NAME @BIOSOURCE
## <character> <character>
## [1] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [2] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [3] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [4] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [5] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## ... ... ...
## [604] BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed BL-2
## [605] BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed BL-2
## [606] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [607] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## [608] S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed myeloid cell
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
library(Gviz)
atrack <- AnnotationTrack(regions, name = "Intersecting regions",
group = regions$`@BIOSOURCE`, genome="hg38")
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = &qu